Machine Learning Model for Draft Predictions
The goal of this project was to build a model that uses player statistics to predict the likelihood of a player being drafted into a professional league. The first task was finding data, and lots of it. We acquired it by web-scraping the site Sports Reference at Sports Reference.com. Historical and current NCAA player statistics were pulled into a dataframe comprising both drafted and undrafted players. Each player was assigned a binary status: 0 for those who did not play professionally and 1 for those who played in a professional league. The size of the resulting dataset reduced our chances of overfitting the model. The data was then cleaned and pre-processed to prepare for model training. The following statistics were chosen as the features to train and test our model.
To obtain a random distribution, we randomly split our dataset in half, making one half a training dataset and the other half a test dataset. The training dataset was then split again into random train and test subsets. We then trained models on this data to identify feature strengths. The correlation heatmap showed a correlation between strength of schedule and field goal percentage. Strength of schedule proved to be a key feature in all models. The SHAP representation also showed points per game to have strong importance. After assessing several models, we chose a Random Forest Classifier (RFC) for training on our data and testing our predictions. With the RFC we were able to generate the relative feature importance among the player statistics. The flexibility of the RFC also allowed us to utilize several features while maintaining accuracy.
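The two-stage split and the feature-importance readout can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn and NumPy, uses randomly generated stand-in data instead of the scraped statistics, and the labeling rule, split ratios, and hyperparameters are all assumptions made for the example. Only the three feature names come from the text above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real project used scraped NCAA statistics.
# The feature names come from the write-up, the values are random.
feature_names = ["strength_of_schedule", "field_goal_pct", "points_per_game"]
rng = np.random.default_rng(0)
X = rng.normal(size=(400, len(feature_names)))
# Hypothetical labeling rule so the toy data has some signal to learn.
y = (X[:, 0] + X[:, 2] + rng.normal(scale=0.5, size=400) > 1.0).astype(int)

# Step 1: set aside half of the data as an untouched hold-out test set.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

# Step 2: split the training half again into fit and validation subsets
# (the 75/25 ratio is an assumption, not stated in the write-up).
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0, stratify=y_train
)

# Step 3: fit the forest and read off the relative feature importances.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_fit, y_fit)
val_accuracy = model.score(X_val, y_val)

importances = dict(zip(feature_names, model.feature_importances_))
for name, weight in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {weight:.3f}")
print(f"validation accuracy: {val_accuracy:.3f}")
```

The importances reported by `feature_importances_` are normalized to sum to 1, so they are best read as relative rankings among the chosen statistics rather than as absolute effect sizes.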
To make sure the model was evaluated on data it had not seen during training, we used the previously mentioned but unused test dataset created from our scraped data. The known outcomes were removed and the remaining statistics were fed through our trained model. We then generated predicted outcomes to test whether the predictions would match the known outcomes. The outputs were placed in a dataframe alongside the expected values, and the two were compared for accuracy as seen below.
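As a rough illustration of the comparison step, the sketch below scores a list of predicted labels against the known outcomes. The function name `compare_predictions` and the sample values are hypothetical and exist only to show the mechanics; they are not the project's results.

```python
def compare_predictions(known, predicted):
    """Return accuracy and confusion counts for binary 0/1 labels."""
    if len(known) != len(predicted):
        raise ValueError("known and predicted must have the same length")
    if not known:
        raise ValueError("at least one outcome is required")

    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for actual, guess in zip(known, predicted):
        if actual == 1 and guess == 1:
            counts["tp"] += 1  # played professionally, predicted as such
        elif actual == 0 and guess == 0:
            counts["tn"] += 1  # did not play professionally, predicted as such
        elif actual == 0 and guess == 1:
            counts["fp"] += 1  # predicted professional, but was not
        else:
            counts["fn"] += 1  # missed a player who did play professionally

    accuracy = (counts["tp"] + counts["tn"]) / len(known)
    return accuracy, counts


# Made-up example values, not taken from the project data.
known_outcomes = [1, 0, 0, 1, 0, 1, 0, 0]
model_outputs = [1, 0, 1, 1, 0, 0, 0, 0]

accuracy, counts = compare_predictions(known_outcomes, model_outputs)
print(f"accuracy: {accuracy:.2f}")
print(counts)
```

Because most college players never play professionally, the false-negative and false-positive counts are worth reading alongside accuracy: a model that predicts 0 for every player can still score a high accuracy on such imbalanced data.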