Data Collection and Preprocessing

The goal of this project was to build a model that uses player statistics to predict the likelihood of a player being drafted into the professional league. The first task was finding data, and lots of it. We acquired it by web-scraping the Sports Reference site (Sports Reference.com). Historical and current NCAA player statistics were pulled into a dataframe containing both drafted and undrafted players. Each player was assigned a binary status: 0 for those who did not play professionally and 1 for those who played in a professional league. This gave us a large dataset, which reduces the chance of an overfit model. The data was then cleaned and preprocessed to prepare for model training. The following statistics were chosen as the features used to train and test our model.

Chosen Features:

fgapg = Field goal attempts per game

fgpct = Field goal percent

fgpg = Field goals per game

ftapg = Free throw attempts per game

ftpct = Free throw percent

ftpg = Free throws per game

height = Player height

pfpg = Personal fouls per game

ptspg = Points per game

sospg = Strength of schedule

trbpg = Total rebounds per game
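As a rough sketch of how the features listed above could be turned into model inputs, the snippet below builds a fixed-order feature row and a binary label from a single player record. The feature abbreviations come from the list above; the record layout, the `played_pro` key, the helper names, and the example values are assumptions made for illustration, not the project's actual code or data.

```python
# Feature abbreviations as listed above, in a fixed column order.
FEATURES = [
    "fgapg", "fgpct", "fgpg", "ftapg", "ftpct", "ftpg",
    "height", "pfpg", "ptspg", "sospg", "trbpg",
]


def to_feature_row(player):
    """Return the feature values for one player as a list of floats.

    Raises KeyError if a feature is missing, so incomplete records are
    caught during cleaning instead of silently reaching the model.
    """
    return [float(player[name]) for name in FEATURES]


def to_label(player):
    """Binary status: 1 if the player played professionally, otherwise 0.

    The "played_pro" key is a hypothetical field name.
    """
    return 1 if player["played_pro"] else 0


# Made-up record for illustration only (not real scraped data).
example_player = {
    "name": "Example Player",
    "fgapg": 12.1, "fgpct": 0.47, "fgpg": 5.7,
    "ftapg": 4.2, "ftpct": 0.78, "ftpg": 3.3,
    "height": 78, "pfpg": 2.1, "ptspg": 15.9,
    "sospg": 6.4, "trbpg": 5.0,
    "played_pro": True,
}

row = to_feature_row(example_player)
label = to_label(example_player)
```

Keeping the column order in one list means the same order is used for both training and prediction.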

Model Training

To get a random distribution, we randomly split our data in half, making one half the training dataset and the other half the test dataset. The training dataset was then split again into random train and test subsets. We then trained models on this data to identify feature strengths. The correlation matrix and heat map showed a correlation between strength of schedule and field goal percent. Strength of schedule proved to be a key feature in all models. The SHAP representation also showed points per game to be of strong importance. After assessing the models, we chose a Random Forest Classifier (RFC) for training on our data and testing our predictions. With the RFC we were able to generate the relative feature importance among the player statistics. The flexibility of the RFC also allowed us to utilize several features while maintaining accuracy.
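The two-stage split described above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the function names, the seed, the 25% fraction for the second split, and the stand-in player ids are all assumptions.

```python
import random


def split_in_half(rows, seed=0):
    """Shuffle a copy of rows and return (first_half, second_half)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def split_train_subsets(rows, test_fraction=0.25, seed=0):
    """Randomly split rows into (train_subset, test_subset).

    test_fraction is an assumed value; the write-up does not state
    the proportions used for this second split.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in player ids; real rows would be the scraped player records.
players = list(range(100))

# First split: one half for model development, one half set aside.
train_half, held_out_half = split_in_half(players, seed=42)

# Second split: random train/test subsets within the training half.
train_subset, test_subset = split_train_subsets(train_half, seed=42)
```

Fixing the seed makes the split reproducible, so the set-aside half stays the same between runs and never leaks into training.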

Correlation Matrix

Heat Map

Random Forest Classifier

SHAP

Time to test the predictions!

To keep the model from being tested on data it had already seen, we used the previously mentioned but unused test dataset created from our scraped data. The known outcomes were removed and the statistical data was run through our trained model to generate predicted outcomes. The predictions were then placed in a dataframe alongside the expected values, and the two were compared for accuracy, as seen below.

Generated Table of Expected vs Predicted Results.

Our results showed a 96.7% prediction accuracy.
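The comparison of expected and predicted outcomes comes down to counting how often the two agree. The sketch below shows that calculation on made-up labels; the function names and the label values are illustrative assumptions and do not reproduce the project's data or its reported result.

```python
def accuracy(expected, predicted):
    """Fraction of predictions that match the expected labels."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on empty input")
    matches = sum(1 for e, p in zip(expected, predicted) if e == p)
    return matches / len(expected)


def confusion_counts(expected, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for e, p in zip(expected, predicted):
        if e == 1 and p == 1:
            counts["tp"] += 1
        elif e == 0 and p == 0:
            counts["tn"] += 1
        elif e == 0 and p == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


# Made-up labels for illustration: 1 = played professionally, 0 = did not.
expected = [1, 0, 0, 1, 0, 0, 0, 1]
predicted = [1, 0, 0, 0, 0, 0, 1, 1]

score = accuracy(expected, predicted)
counts = confusion_counts(expected, predicted)
```

Looking at the four counts alongside the overall accuracy can be useful when one class is much rarer than the other, since a high accuracy can then come mostly from the common class.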

College Basketball Draft Project

Machine Learning

To view our code visit Draft Prediction Project GitHub.