I’ve finally finished the interactive match predictor I’ve been working on for a couple of weeks now, just in time for the UEFA Champions League Final. You can test it out at:
I wanted to post some of the details behind the program here for anyone interested in the mechanics. There are two steps to the process. The first is calculating the “skill” level of the two teams, which is based on the results of all games this season. The short version is that I apply a Generalized Partial Credit Model (GPCM), an item response model that generalizes the Rasch-family Partial Credit Model, to all league results during the season to calculate each team’s “skill” rating.[1] Full details can be found at http://soccer.chadmurphy.org/predicting-late-season-outcomes-the-method/
The next step was to collect game stats, which I did from a variety of sources around the internet. I merged various offensive, defensive, and discipline statistics with each result, along with whether the team was at home, the skill levels, and goals scored and allowed in the game. The full dataset has 31 variables for each game. I entered most of the EPL games, and each game was broken up into two separate entries (one with the stats for the home team and one for the away team). I ended up with 696 observations × 31 variables in the final dataset.
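To make the “two entries per game” layout concrete, here is a minimal sketch in Python (the predictor itself is built in R, so this is not the actual code). The field names, team names, and skill values are made up for illustration and are only a small subset of what the real 31 variables might look like.

```python
from dataclasses import dataclass


@dataclass
class TeamGameRow:
    """One team's view of one game (hypothetical subset of the variables)."""
    team: str
    opponent: str
    is_home: int        # 1 if this row's team played at home, else 0
    skill: float        # this team's season skill rating
    opp_skill: float    # the opponent's season skill rating
    goals_for: int
    goals_against: int
    result: str         # "W", "D", or "L" from this team's perspective


def result_label(goals_for, goals_against):
    """Return the outcome from the perspective of the team that scored goals_for."""
    if goals_for > goals_against:
        return "W"
    if goals_for < goals_against:
        return "L"
    return "D"


def split_match(home, away, home_goals, away_goals, skills):
    """Turn one match into two rows: one for the home team, one for the away team."""
    home_row = TeamGameRow(
        team=home, opponent=away, is_home=1,
        skill=skills[home], opp_skill=skills[away],
        goals_for=home_goals, goals_against=away_goals,
        result=result_label(home_goals, away_goals),
    )
    away_row = TeamGameRow(
        team=away, opponent=home, is_home=0,
        skill=skills[away], opp_skill=skills[home],
        goals_for=away_goals, goals_against=home_goals,
        result=result_label(away_goals, home_goals),
    )
    return [home_row, away_row]


# Made-up skill ratings and a made-up 2-1 home win.
skills = {"Team A": 0.8, "Team B": -0.3}
rows = split_match("Team A", "Team B", 2, 1, skills)
for row in rows:
    print(row)
```

At two rows per game, 696 observations would correspond to 348 games.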
The next step was to apply some machine learning algorithms to the data. I started with something simple: k-means clustering. This method separates the observations into groups, or “clusters”, by assigning each one to the nearest cluster center, and I then attempted to classify outcomes based on membership in these clusters. For plotting, the data are scaled down to two dimensions. The classic example here seems to be the classification of different species of iris. Here’s an example of one of the plots you’ll see from a nicely differentiated k-means clustering application.
Here’s what happened when I ran k-means clustering on my data:
As you can see, this wasn’t quite as clean as the canonical “iris” dataset. It also doesn’t predict nearly as well, classifying only 27% of the observations correctly. For a point of reference, if I had said “every team lost every game” I would have predicted 38% correctly.
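For readers who want to see the moving parts, here is a rough sketch of k-means and of the “always guess the most common outcome” baseline, written in plain Python rather than the R used for the predictor. Everything in it is an assumption for illustration: the toy points, the labels, starting the centroids at the first k points (real implementations usually use random or smarter starts), and labelling each cluster by its most common outcome. The post does not say how clusters were matched to outcomes, so the scoring here may differ from what produced the 27% figure.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=50):
    """Basic Lloyd's algorithm; centroids start at the first k points (a simplification)."""
    centroids = list(points[:k])
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments


def cluster_accuracy(assignments, labels):
    """Label each cluster by its most common true outcome, then score the result."""
    correct = 0
    for c in set(assignments):
        member_labels = [lab for a, lab in zip(assignments, labels) if a == c]
        correct += Counter(member_labels).most_common(1)[0][1]
    return correct / len(labels)


def majority_baseline(labels):
    """Accuracy of always predicting the single most common outcome."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


# Toy data: two made-up features per observation, with made-up outcomes.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels = ["L", "L", "L", "W", "W", "D"]

assignments = kmeans(points, k=2)
print("cluster assignments:", assignments)
print("cluster accuracy:", cluster_accuracy(assignments, labels))
print("majority baseline:", majority_baseline(labels))
```

The baseline is the useful part: any classifier has to beat it before its accuracy means much.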
I did a couple of other steps that I may edit in here later, but in the interest of finishing this promptly, I’ll skip to where I ended up: a Support Vector Machine (SVM) model. Like my use of k-means, it sorts observations into groups, but it is trained on the known outcomes and works in the full multi-dimensional space of the variables. Instead of using a 2-dimensional method, it cuts the data using multi-dimensional hyperplanes chosen to separate the outcomes. This method ended up predicting 78% of all outcomes correctly, 65% of all “goals scored” correctly, and 67% of all “goals allowed” correctly. So that’s the method I use in the predictor for goals scored/allowed.
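The “cutting with a hyperplane” idea can be sketched in a few lines. This is a deliberately stripped-down illustration in Python, not the model behind the predictor: it is a linear SVM for two classes only (win vs. not a win), trained by plain subgradient descent on the hinge loss. The two features (skill difference and a home flag), the data, and the `learning_rate`, `lam`, and `epochs` values are all assumptions made up for the example.

```python
def train_linear_svm(X, y, learning_rate=0.1, lam=0.01, epochs=100):
    """Fit weights w and bias b by subgradient descent on the regularized hinge loss.

    Labels in y must be +1 or -1. The separating hyperplane is w . x + b = 0.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regularization term shrinks the weights a little on every step.
            w = [(1 - learning_rate * lam) * wj for wj in w]
            if margin < 1:
                # The point is misclassified or inside the margin: push the plane away from it.
                w = [wj + learning_rate * yi * xj for wj, xj in zip(w, xi)]
                b += learning_rate * yi
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Made-up rows: (skill difference vs. opponent, played at home?)
X = [(1.5, 1), (1.0, 0), (0.8, 1), (-1.2, 0), (-0.9, 1), (-1.5, 0)]
y = [1, 1, 1, -1, -1, -1]  # +1 = win, -1 = not a win

w, b = train_linear_svm(X, y)
print("training predictions:", [predict(w, b, x) for x in X])
print("strong away side:", predict(w, b, (2.0, 0)))
print("weak home side:", predict(w, b, (-2.0, 1)))
```

A three-way outcome (win, draw, loss) or a goal count needs more than one such hyperplane, which SVM libraries handle by combining several two-class models.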
Plots are created using the “waffle” library in R, and the interactive data visualization is done in Shiny.
[1] I’m working on a method to use in-game stats to predict the skill rating, but that’s not a priority until the fall unless someone knows where I can find detailed women’s soccer stats in time for the World Cup.