Every year, millions of people try to guess who will win every game in the NCAA Basketball Tournament (affectionately called “March Madness”). So why not let AI take a shot at it? You are cordially invited to participate in the first annual March Madness Machine Learning (MMML) competition. The rules are simple – develop a machine learning model to predict the winners of college basketball games. These predictions will be used to fill in your bracket (see here to learn how brackets work). As the tournament progresses over three weeks, we will update the results and compute your overall score.
We will have two categories of competition for both the men’s and women’s tournaments (four competitions overall – you can participate in any, or all, of them):
For all four competitions, prizes (CMU swag!) will be awarded for 1st, 2nd, and 3rd place.
Both the men's and women's tournaments include 68 teams. Eight teams play each other in the play-in round (also called the "First Four"). The winners of those games fill 4 slots of a 64-team bracket, divided into 4 regions. In the first round of the main bracket, in each region the 1st seeded team plays the 16th seeded team, the 2nd seeded team plays the 15th seeded team, etc. The seedings, and which teams are in the play-in round, will be decided on Sunday, March 16. Game play begins on March 18 for the men and March 19 for the women.
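The first-round pairing rule can be written down in a few lines. This is only an illustrative sketch: the function name `first_round_pairings` is made up here and is not part of the competition software, and the pairs are listed in seed order, which is not necessarily the order in which games appear in a printed bracket.

```python
def first_round_pairings(num_seeds=16):
    """Return (better_seed, worse_seed) pairs for one region's first round.

    Seed k plays seed (num_seeds + 1 - k), so the two seeds always sum to 17
    in a 16-team region.
    """
    return [(seed, num_seeds + 1 - seed) for seed in range(1, num_seeds // 2 + 1)]


pairings = first_round_pairings()
for better, worse in pairings:
    print(f"seed {better} vs seed {worse}")
```

With 4 regions of 8 games each, this gives the 32 first-round games of the 64-team bracket.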
The competition is adapted from an existing Kaggle competition. That site has a lot of relevant data for training ML models to predict the winner of men's and women's Division 1 basketball games. In particular, the site has data, in some cases going back decades, for games played both during the regular season and in tournaments. Statistics, for both the winning and losing teams, include points scored, field goals attempted and made, 3-pointers attempted and made, free throws attempted and made, offensive and defensive rebounds, assists, turnovers, steals, blocks, and fouls. It also includes data on the rankings of each team by various organizations, such as ESPN and USA Today.
You can use the data in any way you choose to train any model you choose. You are free to use whatever other data you can find to help train the model.
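As a starting point, here is a deliberately simple baseline, not a trained model and not the organizers' method: rate each team by its average point margin and pick the team with the higher rating. The game records below are made-up toy data in an assumed tuple layout (winner id, loser id, winner points, loser points); the real Kaggle files have their own formats, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy records (assumed layout): (winner id, loser id, winner points, loser points).
games = [
    (1101, 1102, 70, 60),
    (1102, 1103, 65, 64),
    (1101, 1103, 80, 75),
]


def average_margins(games):
    """Map each team id to its average point margin over all games it played."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for winner, loser, winner_pts, loser_pts in games:
        margin = winner_pts - loser_pts
        totals[winner] += margin
        totals[loser] -= margin
        counts[winner] += 1
        counts[loser] += 1
    return {team: totals[team] / counts[team] for team in totals}


def predict_winner(team_a, team_b, margins):
    """Pick the team with the higher average margin.

    Teams with no recorded games get a margin of 0.0; ties go to the lower id.
    """
    a = margins.get(team_a, 0.0)
    b = margins.get(team_b, 0.0)
    if a == b:
        return min(team_a, team_b)
    return team_a if a > b else team_b


margins = average_margins(games)
print(predict_winner(1101, 1102, margins))
```

A real entry would replace `predict_winner` with a model trained on the box-score statistics and rankings described above; the baseline is mainly useful as something to beat.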
Contact me if you need help accessing the data and definitions of the data formats.
Points are awarded as follows:
Run python madness.py to see an example (in the download, the predictions are chosen at random, so not such a great bracket). Once you have generated your own predictions (see above), you can use the software to test how your predictions would do in a previous season's tournament. Look at the very end of the file to see how to create, seed, fill, and test a bracket (either regular or progressive).
A snippet of the bracket produced by madness.py is shown below.
Connecticu
|Connecticu
_Stetson__| |
|Connecticu
FL Atlanti Northweste| |
|FL Atlanti| |
Northweste| |Connecticu
|__Auburn__
San Diego |
|San Diego |
___UAB____| |San Diego |
|__Auburn__|
__Auburn__ Yale |
|__Auburn__|
___Yale___|
It is interpreted as follows: In the first round, 1st seed Connecticut played 16th seed Stetson; Connecticut was predicted to win, and did. Then 8th seed FL Atlantic played 9th seed Northwestern; FL Atlantic was predicted to win, but Northwestern was the actual winner - the incorrect prediction is crossed out and the correct winner appears above. Similarly, San Diego was correctly predicted as the winner, but Auburn was incorrectly predicted. In the second round, Connecticut was predicted correctly, but Auburn was not. Note that this is a regular bracket - before the tournament began, it was predicted that Auburn would win all its games; once it lost in the first round, no points could be earned for the subsequent rounds. If this were a progressive bracket, the second round would be a prediction between San Diego and Yale, and similarly in the third round it would be between Connecticut and San Diego.
You can download the software from here.
Note: Strikethroughs for incorrect predictions do not work on all terminals. If you are not seeing the strikethroughs, invoke bracket.show() with the use_unicode=True option.
A submission to the competition consists of a CSV file that contains two columns labeled 'WTeamID' and 'LTeamID' (the winning and losing team identifiers, respectively). We follow the convention of the Kaggle competition and use numeric identifiers to uniquely identify each team. Men's team ids run from 1000 to 1999 and women's ids from 3000 to 3999.
The file should contain all possible matchups during the competition. For simplicity, just provide predictions for all possible pairs of ids -- that way, you'll know you have all the bases covered. Name the file 'MTourneyPredictions.csv' or 'WTourneyPredictions.csv' for the men's and women's tournaments, respectively.
There should be 72,010 predictions for the men's tournament (the number of combinations for the 380 Division 1 men's teams) and 71,253 predictions for the women's tournament (the number of combinations for the 378 Division 1 women's teams).
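The counts above are "n choose 2", and the file itself is easy to produce once you have a function that picks a winner for any pair. The sketch below is one possible way to do it, under stated assumptions: the team ids are consecutive placeholders (take the real ones from the Kaggle data), `write_predictions` is a hypothetical helper, and the "lower id wins" rule stands in for your model.

```python
import csv
import io
from itertools import combinations
from math import comb


def write_predictions(team_ids, pick_winner, out):
    """Write one 'WTeamID,LTeamID' row per unordered pair of teams.

    pick_winner(a, b) must return either a or b. Returns the number of rows.
    """
    writer = csv.writer(out)
    writer.writerow(["WTeamID", "LTeamID"])
    rows = 0
    for a, b in combinations(sorted(team_ids), 2):
        winner = pick_winner(a, b)
        loser = b if winner == a else a
        writer.writerow([winner, loser])
        rows += 1
    return rows


# Placeholder ids only: 380 consecutive numbers in the men's id range.
team_ids = range(1101, 1101 + 380)

# Written to an in-memory buffer here; for a real submission use
# open("MTourneyPredictions.csv", "w", newline="") instead.
buffer = io.StringIO()
n_rows = write_predictions(team_ids, lambda a, b: min(a, b), buffer)

print(n_rows, comb(380, 2), comb(378, 2))
```

Checking the row count against `comb(n, 2)` before submitting is a cheap way to catch a missing or duplicated matchup.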
Use this Google form to submit your predictions. You may submit as many times as you want, up until March 18 at 5pm EDT, but only the last submission will be used.