I thought it was important to lay out the methodology behind my predictions, so if you’re interested, here are the technical details.
One of the things that gave me the idea was listening to announcers talk about how “difficult it is to travel to Old Trafford” or discussing how a team needs to perform as well on the road as they do at home. But how can we measure this?
There’s a fairly robust academic literature on this topic. Some researchers measure home advantage as a deviation from expected values (0.5 in sports with two outcomes, roughly 0.33 in sports like soccer where draws are an option), while others use teams that play each other both home and away as something of a natural experiment 1. Because my goal was both to quantify and to predict, I approached this problem in a different way.
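As a rough illustration of the "deviation from expected values" idea, here is a minimal Python sketch (the analysis in this post was done in R). The result list and the function name `home_advantage` are hypothetical, and the baseline assumes all three outcomes would be equally likely with no home advantage.

```python
from collections import Counter

# Hypothetical results from the home team's point of view (not real data).
home_results = ["W", "D", "L", "W", "W", "D", "L", "W", "L", "W"]

def home_advantage(results, n_outcomes=3):
    """Observed home-win share minus the share expected with no advantage."""
    expected = 1 / n_outcomes  # 1/3 when win, draw, and loss are all possible
    observed = Counter(results)["W"] / len(results)
    return observed - expected

advantage = home_advantage(home_results)
print(f"Home-win share exceeds the no-advantage baseline by {advantage:.3f}")
```

A positive value means home teams won more often than the equal-odds baseline would suggest.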
Drawing from literature on Item Response Theory (IRT) in psychometric assessment, I thought about soccer as a series of questions on a survey instrument and each team as a participant. These are the techniques used in standardized tests like the GRE – students are asked a series of questions and, rather than simply giving you a percentage score, the test attempts to find your skill level by examining which questions you got right and wrong. I liked this idea for a couple of reasons. First, soccer results don’t sort especially well. Every week has some odd results, so you can’t necessarily draw any conclusions on who will win the league based on one week’s results. Second, a version of this model called the Generalized Partial Credit Model (GPCM) allows for “partial credit” on a question. Sports where a team can only win or lose would be fine with a simple dichotomous “correct/incorrect” choice, but soccer’s frequent draws offer a special challenge that can be solved by this model.
To get my predicted probabilities, I put together a matrix with all the results from the first 28 weeks of the English Premier League, scoring a loss as 0, a tie as 1, and a win as 2. Then I ran the GPCM analysis in R 2, with each team as a row (respondent) and each game (each team home and away) as the columns (questions). From there I can calculate predicted probabilities that each team answers each question correctly (win), earns partial credit (draws), or loses (zero credit). Through some inelegant R scripting, I was able to calculate these probabilities for all the remaining games in the league, plot each team’s likelihood function for all three outcomes, and plot the “skill level” for each of the teams in the league. 3
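To make the win/draw/loss probabilities concrete, here is a minimal Python sketch of the standard GPCM category formula for a single team-game pair. It is not the R code used for the analysis; the function name and the parameter values (`theta` for the team's skill, `discrimination`, and the two `thresholds`) are hypothetical, whereas a real fit would estimate them from the results matrix.

```python
import math

def gpcm_probabilities(theta, discrimination, thresholds):
    """Return P(score = 0..m) for one respondent on one item under the GPCM."""
    # Running sums of a * (theta - b_v); score 0 corresponds to an empty sum of 0.
    cumulative = [0.0]
    for b in thresholds:
        cumulative.append(cumulative[-1] + discrimination * (theta - b))
    # Subtract the largest value before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values: skill 0.8, two step thresholds (loss->draw, draw->win).
p_loss, p_draw, p_win = gpcm_probabilities(
    theta=0.8, discrimination=1.2, thresholds=[-0.5, 0.6]
)
print(f"loss={p_loss:.2f} draw={p_draw:.2f} win={p_win:.2f}")
```

With scores coded 0, 1, and 2 as described above, the three returned values line up with loss, draw, and win, and a higher `theta` shifts probability toward the win category.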
This lets me do a number of things: predict each game, run simulated “seasons” with the remaining games, rank the teams based on the predicted results, and predict who will win the league, who will place in the top 4 Champions League spots, and who will get relegated. All of this controls for team performance, remaining games, and home field advantage, so I’m relatively confident in my predictive abilities for the rest of the season. The model has performed well so far, correctly predicting 8 of 10 games in Week 30.
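The simulated "seasons" can be sketched as a simple Monte Carlo loop. The Python below is only an illustration under stated assumptions: the team names, current points, fixtures, and per-game probabilities are all made up, and ties on points are broken by dictionary order rather than goal difference.

```python
import random
from collections import Counter

# Hypothetical current standings and remaining fixtures.
current_points = {"Team A": 70, "Team B": 66, "Team C": 64}

# (home, away, P(home win), P(draw), P(away win)), e.g. taken from a fitted model.
remaining = [
    ("Team A", "Team B", 0.45, 0.30, 0.25),
    ("Team B", "Team C", 0.40, 0.30, 0.30),
    ("Team C", "Team A", 0.35, 0.30, 0.35),
]

def simulate_season(points, fixtures, rng):
    """Play out every remaining fixture once and return the final points table."""
    table = dict(points)
    for home, away, p_home, p_draw, _p_away in fixtures:
        roll = rng.random()
        if roll < p_home:
            table[home] += 3
        elif roll < p_home + p_draw:
            table[home] += 1
            table[away] += 1
        else:
            table[away] += 3
    return table

def title_odds(points, fixtures, n_sims=10_000, seed=42):
    """Share of simulated seasons in which each team finishes top."""
    rng = random.Random(seed)
    titles = Counter()
    for _ in range(n_sims):
        table = simulate_season(points, fixtures, rng)
        # Ties on points go to the team listed first; a real table would use goal difference.
        titles[max(table, key=table.get)] += 1
    return {team: titles[team] / n_sims for team in points}

odds = title_odds(current_points, remaining)
print(odds)
```

The same loop extends to top-four or relegation odds by sorting each simulated table and counting finishing positions instead of only the champion.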
I’ve completed the simulations for the top of the table (not really spoilers in the footnote 4), and the top four, and am going to complete the simulations for the relegation fight sometime this week. It’s been interesting so far, and my summer project is to learn more about the GPCM model to see if I can improve my predictions and to take the next step and figure out a player-based model rather than a results-based model.
The newest iteration of the model includes in-game statistics to predict results. I’m trying to refine the model’s predictions and improve upon what I did last season, and the off-season was the perfect time to do this.
I merged various offensive, defensive, and discipline statistics with each result, whether a team was at home or not, the skill levels, and goals scored/against in the game. The full dataset has something like 31 variables for each game. I entered most of the EPL games into the dataset, with each game broken up into two separate entries (stats for the home and away teams), and ended up with 696 observations of 31 variables in the final dataset. I’m also working on Serie A right now, mostly because I’m a big fan of AC Milan and want to track their results a little more closely this year.
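Here is a small Python sketch of the "two entries per game" layout, one row from each team's perspective. The field names and the two example matches are hypothetical, and only a couple of statistics are shown instead of the roughly 31 variables in the real dataset.

```python
# Hypothetical match records; the real dataset has many more columns.
matches = [
    {"home": "Team A", "away": "Team B", "home_goals": 2, "away_goals": 1,
     "home_shots": 14, "away_shots": 9},
    {"home": "Team C", "away": "Team A", "home_goals": 0, "away_goals": 0,
     "home_shots": 7, "away_shots": 11},
]

def to_team_rows(match):
    """Split one match into two rows: one for the home team, one for the away team."""
    rows = []
    for side, other in (("home", "away"), ("away", "home")):
        goals_for = match[f"{side}_goals"]
        goals_against = match[f"{other}_goals"]
        if goals_for > goals_against:
            result = "win"
        elif goals_for == goals_against:
            result = "draw"
        else:
            result = "loss"
        rows.append({
            "team": match[side],
            "opponent": match[other],
            "at_home": side == "home",
            "goals_for": goals_for,
            "goals_against": goals_against,
            "shots": match[f"{side}_shots"],
            "result": result,
        })
    return rows

dataset = [row for match in matches for row in to_team_rows(match)]
print(len(dataset), "rows from", len(matches), "matches")
```

Under this layout, 696 rows would correspond to 348 games.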
The next step was to apply some machine learning algorithms to the data. So I started with something simple: k-means clustering. This method separates the observations into groups, or “clusters”, based on how close they sit to each cluster’s center, and you then attempt to classify observations based on their membership in these clusters. The clusters are usually plotted in two dimensions. The classic example here seems to be the classification of different species of iris. Here’s an example of one of the plots you’ll see from a nicely differentiated k-means clustering application.
Here’s what happened when I ran k-means clustering on my data:
As you can see, this wasn’t nearly as clean as the canonical “iris” dataset (it’s quite ugly actually, with a huge amount of overlap between the circles and no real clusters at all). It also doesn’t predict nearly as well, classifying only 27% of the observations correctly. For a point of reference, if I had said “every team lost every game” I would have predicted 38% correctly. K-means clustering was a disaster, so I moved on to some better models.
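For readers who want to see the mechanics, here is a minimal pure-Python sketch of k-means and of the comparison against the "always predict the most common outcome" baseline. Everything in it is an assumption for illustration: the nine two-feature points and their labels are made up, and mapping each cluster to exactly one outcome (picking the best one-to-one assignment) is just one plausible way to score a clustering, not necessarily how the numbers above were computed.

```python
import random
from collections import Counter
from itertools import permutations

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: assign points to the nearest center, then recompute centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(n_iter):
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments

def best_mapping_accuracy(assignments, labels, outcomes=("win", "draw", "loss")):
    """Try every one-to-one cluster-to-outcome mapping and keep the best accuracy."""
    best = 0.0
    for mapping in permutations(outcomes):
        hits = sum(mapping[a] == lab for a, lab in zip(assignments, labels))
        best = max(best, hits / len(labels))
    return best

def baseline_accuracy(labels):
    """Accuracy of always predicting the single most common outcome."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# Hypothetical two-feature observations and their actual outcomes.
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.2), (5.1, 4.8),
          (4.9, 5.0), (9.0, 1.0), (9.2, 1.1), (8.8, 0.9)]
labels = ["loss", "loss", "draw", "win", "win", "loss", "draw", "loss", "win"]

assignments = kmeans(points, k=3)
accuracy = best_mapping_accuracy(assignments, labels)
baseline = baseline_accuracy(labels)
print(f"k-means accuracy: {accuracy:.2f}, baseline: {baseline:.2f}")
```

Because k-means never looks at the outcomes while forming clusters, nothing guarantees that the clusters line up with wins, draws, and losses, which is why it can lose to such a crude baseline.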
The first was a Random Forest model, which has become quite popular in political analytics communities both for its accuracy and its relative ease of interpretation (and because it can process large data files very quickly, which isn’t a concern here). The Random Forest model is an improvement over the decision tree model, which is known to overfit data (basically, it responds too closely to the training data, making it less useful for out-of-sample predictions). The Random Forest model takes a random sample of the data and trains a short decision tree on it, effectively asking the data a series of questions until it gets a strong probability of the correct choice. It then stops, pulls a new sample, and grows a new tree. Here’s a graphic version of a Random Forest on the iris data:
When you get to the bottom of the trees, each tree will predict win, loss, or draw for a game given the statistics of the game. Then R’s randomForest library gives you the proportion of trees that predicted each outcome, and I use that for my predictions.
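The voting step can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not R's randomForest: each "tree" here is a one-split stump trained on a bootstrap sample with one randomly chosen feature, whereas a real forest grows deeper trees and samples features at every split. The feature columns, data, and function names are hypothetical.

```python
import random
from collections import Counter

def train_stump(rows, labels, rng):
    """Fit a one-split tree on a bootstrap sample using one randomly chosen feature."""
    n = len(rows)
    sample = [rng.randrange(n) for _ in range(n)]  # sample with replacement
    feature = rng.randrange(len(rows[0]))
    values = sorted(rows[i][feature] for i in sample)
    threshold = values[len(values) // 2]  # split near the sample median
    low = Counter(labels[i] for i in sample if rows[i][feature] < threshold)
    high = Counter(labels[i] for i in sample if rows[i][feature] >= threshold)
    fallback = Counter(labels[i] for i in sample).most_common(1)[0][0]
    low_label = low.most_common(1)[0][0] if low else fallback
    high_label = high.most_common(1)[0][0] if high else fallback
    return feature, threshold, low_label, high_label

def vote_proportions(forest, row, outcomes=("win", "draw", "loss")):
    """Share of trees voting for each outcome for a single game."""
    votes = Counter()
    for feature, threshold, low_label, high_label in forest:
        votes[low_label if row[feature] < threshold else high_label] += 1
    return {outcome: votes[outcome] / len(forest) for outcome in outcomes}

# Hypothetical rows: (shots on target, possession %) with the observed result.
rows = [(6, 58), (2, 41), (4, 50), (7, 62), (1, 38), (3, 49), (5, 55), (2, 45)]
labels = ["win", "loss", "draw", "win", "loss", "draw", "win", "loss"]

rng = random.Random(1)
forest = [train_stump(rows, labels, rng) for _ in range(200)]
proportions = vote_proportions(forest, (6, 60))
print(proportions)
```

The three proportions sum to 1 and can be read as rough win/draw/loss probabilities for that game.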
With ~30 variables, this model did fairly well. I forget the correct classification percentage, but it was somewhere around 70% or so in the in-sample data, and maybe around 50% in the few weeks of out-of-sample data I ran 5.
I finished with a Support Vector Machine (SVM) model, which does something similar to k-means clustering, but adds multiple dimensions. Most predictors are linear models: picture drawing a line through the data, where everything above the line is a “win” and everything below the line is a “loss.” This implies that variables have a linear effect on the outcome – more passes/more shots/fewer fouls/more possession all correlate with winning a soccer game. While they probably do correlate, with apologies to Pep Guardiola and Andres Iniesta, the correlation isn’t as simple as “the team that passes the most wins.” This is where the SVM shines, because instead of using a 2-dimensional method, it cuts the data using multi-dimensional hyperplanes. I didn’t find any good visual interpretation of this, probably because our brains can’t really process more than three dimensions, but this I think is fairly close.
Now instead of a linear relationship between passing and outcome, we have a multi-dimensional relationship with an unclear functional form. Picture two of these cones (?) stacked on each other, and then all game outcomes plotted in this three-dimensional space. Now the SVM classifies each result according to where its point falls relative to the hyperplanes, with different probabilities for each outcome.
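The intuition behind adding dimensions can be shown with a toy example in Python. This is not an SVM fit, only a sketch of why a non-linear kernel helps: the two groups of made-up points below cannot be split by any straight line in two dimensions, but after lifting each point into a third dimension a flat plane separates them. The point sets, the `lift` function, and the plane height are all hypothetical.

```python
# Hypothetical 2-D points: one group sits near the origin, the other forms a ring around it.
inner = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.4), (-0.2, -0.2)]
outer = [(1.5, 0.2), (-1.4, 0.6), (0.3, -1.6), (-0.9, -1.3)]

def lift(point):
    """Map (x, y) to (x, y, x^2 + y^2), adding a third dimension."""
    x, y = point
    return (x, y, x * x + y * y)

def classify(point, height=1.0):
    """In the lifted space, the flat plane z = height separates the two groups."""
    return "inner" if lift(point)[2] < height else "outer"

for p in inner + outer:
    print(p, "->", classify(p))
```

A kernel SVM searches for this kind of separating surface on its own, without the analyst having to guess the right extra dimension in advance.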
The SVM and Random Forest were similar in outcomes, but I prefer the SVM because I think it fits reality better. Soccer is a multi-dimensional game, where there are countless different ways you can win. Passing works, counter-attacking works, pressing works, parking the bus can work, etc. There’s no optimal strategy, so the logic of the SVM that there’s an n-dimensional world out there and we have no way of knowing the functional form of that world makes a lot of sense to me.
Also, on the training data, this method ended up predicting 78% of all outcomes correctly, 65% of all “goals scored” correctly, and 67% of all “goals allowed” correctly. So that’s the method I used in my Champions League predictor for goals scored/allowed (and will potentially bring back this year, time permitting).
This was all simplified, but I tried to explain the concepts as clearly as possible. Please contact me on Twitter with any questions/suggestions/thoughts, partially because I want this post to be as strong as possible but also because I’m teaching poli sci majors an intro to this topic in the fall and would like feedback on where I’m unclear.
- For a great meta-review of the literature, read http://www.wjh.harvard.edu/~jamieson/JJ_JASP.pdf ↩
- I’ve been using the “tam” library because it does some of the things I’m looking for, but the “ltm” library seems to be highly recommended as well ↩
- Interesting notes on the skill level, and included on the featured image for this post, is that Crystal Palace is actually tougher to beat when they’re on the road than they are at home, and QPR is a relatively difficult team to beat at home but are the worst travelers in the league ↩
- Chelsea’s a 91% favorite to win the league ↩
- I initially discarded this model, but I’ve been running it in parallel with the SVM and it performs almost identically in terms of total accuracy, though it is a little “flatter” than the SVM. Its predictions tend to be closer to that 33-33-33 mark, while the SVM is a little more confident in its predictions (maybe falsely so) ↩