One thing I’ve noticed in the soccer analytics community is that R^2 seems to be the dominant measure of model fit. This is where I should probably go back and embed a few tweets of people reporting R^2, but I think it’s prevalent enough that people will believe me. R^2 is almost never the first thing we should be looking at, and it only becomes meaningful after we’ve set up the rest of the model correctly. Let me explain what I mean.

For the non-stats folks, or just anyone who needs a refresher, R^2 is a measure of goodness of model fit. Basically, it takes a linear model and calculates how much of the variation in y is explained by a linear regression on x. Higher R^2 means that more of the variation is explained by the regression model (better fit), lower numbers mean that less of the variation is explained (worse fit). Here’s a graphical example of the concept:

Both plots (and the remainder of the plots in this post) show a simple model: predicted points compared to actual points in a model that tries to predict end-of-season points for each EPL team. The model details themselves aren’t important; the data are all simulated in R for illustration (except for my model at the very bottom). In the top half of this first figure, you can see that most points fall almost exactly on the regression line. This yields an R^2 of 0.95, meaning that 95% of the variance in the data is explained by the regression model.

The bottom half of the plot shows a model with a much worse fit: the points are scattered further from the regression line, and the R^2 is only 0.25. So far, so good. But this diagnostic relies on some very specific assumptions that seem to be almost entirely ignored in what I’ve seen in soccer analytics.
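To make the definition concrete, here is a minimal sketch of how R^2 comes out of a one-variable regression. The plots in this post were simulated in R; this sketch uses plain Python instead, and the seed, noise levels, and the helper names `fit_line` and `r_squared` are my own assumptions for illustration, so the exact values will not match the figures.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = m*x + b; returns (m, b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = sxy / sxx
    return m, y_bar - m * x_bar

def r_squared(x, y):
    """Share of the variance in y that the fitted line explains."""
    m, b = fit_line(x, y)
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

random.seed(1)  # arbitrary seed so the sketch is repeatable
predicted = [random.uniform(30, 90) for _ in range(20)]  # 20 made-up teams
tight = [p + random.gauss(0, 3) for p in predicted]      # small scatter
loose = [p + random.gauss(0, 30) for p in predicted]     # large scatter

print(f"tight scatter R^2: {r_squared(predicted, tight):.2f}")
print(f"loose scatter R^2: {r_squared(predicted, loose):.2f}")
```

The only thing that differs between the two simulated seasons is how far the points sit from the line, and that is all R^2 responds to.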

R^2 is part of the final output of a linear regression model, specified in the bivariate context as y = mx + b. If we’re calculating an R^2 value, it implies that this is what we’re doing, so it’s important to talk about exactly what this means and what we’re saying when we do this. You may remember this as the slope-intercept form of a line from middle school algebra, where m is the slope of the line and b is the intercept. I want to talk about each of these components separately, why they’re important, and what they mean.

The above figure is a very successful model, specified as y = (1)x + 0, meaning it has a slope of 1 and an intercept of 0. In practical terms, this means that for every 1-point increase in the points our model predicts a team will earn, the team earns 1 more actual point. For example, a team that is predicted to earn 37 points will earn 37 actual points, and a team that is predicted to earn 87 points will earn 87 actual points. This is clearly a successful model where predicted and actual match very closely.

Note that the R^2 is 0.95, which is very close to the maximum of 1, so for the R^2 fans this would be a successful model as well. Let’s take a look at another example.

For this model I shifted the intercept, and the solid line represents the ideal (y = x) while the dotted line represents the actual regression line (y = x – 20). This model is specified as y = (1)x – 20. What this means in practical terms is that there is a 1:1 relationship between x and y, which is good. For every 1-point increase in predicted points, there is a 1-point increase in actual points. However, an intercept of -20 makes a big difference to the predictions. In this case, if we predict a team will earn 57 points, they will only earn 37 points. And if we predict a team will earn 107 points, they will earn 87 points. Clearly this model is less successful at predicting outcomes than the previous model: its predictions are 20 points too high for every team in the EPL.

Once again, note that the R^2 is 0.95. If we only look at this, we would assume this model was identical in its predictive power to the previous model, but logically we can see it is not. Let’s look at one more example, this time with the slope altered:
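A quick way to convince yourself of this is to take any set of outcomes, knock 20 points off every team, and refit. The sketch below does that in Python with made-up numbers (the seed, the noise level, and the helper `mean_error` are assumptions of mine, not the data behind the plots). R^2 is unchanged by the shift, while the average miss moves by exactly 20 points.

```python
import random

def r_squared(x, y):
    """R^2 of an ordinary least squares fit of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = y_bar - m * x_bar
    ss_res = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def mean_error(predicted, actual):
    """Average of (actual - predicted); 0 means no systematic bias."""
    return sum(a - p for p, a in zip(predicted, actual)) / len(predicted)

random.seed(2)  # arbitrary seed, made-up data
predicted = [random.uniform(30, 90) for _ in range(20)]
actual_good = [p + random.gauss(0, 4) for p in predicted]  # roughly y = x
actual_shifted = [a - 20 for a in actual_good]             # roughly y = x - 20

for label, actual in (("y = x", actual_good), ("y = x - 20", actual_shifted)):
    print(f"{label:>10}: R^2 = {r_squared(predicted, actual):.3f}, "
          f"mean error = {mean_error(predicted, actual):+.1f} points")
```

The mean error is not a replacement for a proper test, but it shows the kind of failure that R^2 cannot see.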

In this final model I have a model of y = 0.3x – 7.46, meaning that the slope is now 0.3, and the intercept is -7.46. So to calculate actual points, we have to multiply the predicted points by 0.3, and subtract 7.46. So a team who we predicted would earn 37 points would earn 37*0.3 – 7.46 points, or 3.64 points. A team we predicted to earn 87 points would earn 87*0.3-7.46, or 18.64 points. If we predicted a team would earn 250 points, they’d earn 250*0.3-7.46, or 67.54 points.
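The arithmetic above is easy to check. This small Python sketch just plugs the three predictions into y = 0.3x – 7.46; the function name `implied_actual` is made up for illustration.

```python
def implied_actual(predicted, slope=0.3, intercept=-7.46):
    """Actual points implied by the regression line y = slope * x + intercept."""
    return slope * predicted + intercept

for predicted in (37, 87, 250):
    print(predicted, "->", round(implied_actual(predicted), 2))
# 37 -> 3.64
# 87 -> 18.64
# 250 -> 67.54
```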

Think about that for a minute – in this model, if we predicted a team would earn 250 points (roughly 6.5 points per game) over the course of the season, they would earn enough points to probably challenge for a Europa League position. This model has to be considered wildly unsuccessful in plain terms.

However, let’s once again look at the R^2. The model fits incredibly well, leading to an R^2 of 0.95.
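The reason this happens is that, in a one-variable regression with an intercept, R^2 is just the squared correlation between x and y, and correlation does not change when y is rescaled or shifted. Here is a rough Python sketch of that point with made-up data (the seed, noise level, and the helper `fit_and_r2` are my assumptions): squashing the outcomes by 0.3 and subtracting 7.46 changes the fitted line completely but leaves R^2 alone.

```python
import random

def fit_and_r2(x, y):
    """Return (slope, intercept, R^2), with R^2 as the squared Pearson correlation."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept, sxy ** 2 / (sxx * syy)

random.seed(3)  # arbitrary seed, made-up data
predicted = [random.uniform(30, 90) for _ in range(20)]
actual_good = [p + random.gauss(0, 4) for p in predicted]
actual_rescaled = [0.3 * a - 7.46 for a in actual_good]  # squash and shift outcomes

for label, actual in (("original", actual_good), ("rescaled", actual_rescaled)):
    slope, intercept, r2 = fit_and_r2(predicted, actual)
    print(f"{label}: y = {slope:.2f}x + {intercept:.2f}, R^2 = {r2:.3f}")
```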

Clearly we need to look at other measures before we look at R^2. If we’re going to perform linear regression in a null hypothesis significance testing (NHST) framework, we need to set up a clear null hypothesis for both the slope and the intercept. For anything where there should be a 1:1 relationship (point prediction models and expected goals come to mind), the first step needs to be to test the coefficients against an intercept of 0 and a slope of 1. If we get coefficients that are statistically distinguishable from those values, then maybe the model isn’t predicting particularly well, regardless of the R^2.
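As a rough sketch of what that test could look like, the Python below fits the line, computes the usual standard errors, and compares each coefficient to its ideal value with a t-statistic. Several things here are my assumptions rather than anything from the original analysis: the function name `check_against_ideal`, the made-up data, the use of two separate t-tests (a joint test of both coefficients would be stricter), and the hard-coded critical value 2.101, which is the two-sided 5% cutoff for 18 degrees of freedom and so only applies to 20 teams.

```python
import math

T_CRIT = 2.101  # two-sided 5% critical value of Student's t, 18 df (20 teams - 2)

def check_against_ideal(predicted, actual, t_crit=T_CRIT):
    """t-tests of H0: slope = 1 and H0: intercept = 0 for actual ~ predicted."""
    n = len(predicted)
    x_bar = sum(predicted) / n
    y_bar = sum(actual) / n
    sxx = sum((x - x_bar) ** 2 for x in predicted)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(predicted, actual))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(predicted, actual))
    s2 = ss_res / (n - 2)                      # residual variance
    se_slope = math.sqrt(s2 / sxx)
    se_intercept = math.sqrt(s2 * (1 / n + x_bar ** 2 / sxx))
    t_slope = (slope - 1) / se_slope           # distance from the ideal slope of 1
    t_intercept = (intercept - 0) / se_intercept
    return {
        "slope": slope,
        "intercept": intercept,
        "reject_slope_1": abs(t_slope) > t_crit,
        "reject_intercept_0": abs(t_intercept) > t_crit,
    }

# Made-up, deterministic data: 20 predictions with a +/- 2 point wobble.
predicted = list(range(30, 90, 3))
wobble = [2 if i % 2 == 0 else -2 for i in range(len(predicted))]
models = {
    "y = x":           [p + w for p, w in zip(predicted, wobble)],
    "y = x - 20":      [p - 20 + w for p, w in zip(predicted, wobble)],
    "y = 0.3x - 7.46": [0.3 * p - 7.46 + w for p, w in zip(predicted, wobble)],
}
for label, actual in models.items():
    print(label, check_against_ideal(predicted, actual))
```

With these numbers, only the first model passes both checks, which is the ordering we wanted and the one R^2 could not give us.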

For comparison, I present the current state of my prediction model.

The R^2 isn’t as high as I’d like (City is a significant outlier at this point, and as soon as they lose one they’ll drop back onto the line; Chelsea is also really underperforming and hurting the model), but when I set up the correct NHST (m = 1, b = 0) the fitted coefficients are not statistically distinguishable from those values (y = 0.95x – 0.19)^{1}. The fit isn’t ideal, but it meets the major requirements for now, with the hope that the model will converge more as the season progresses. The R^2 isn’t 0.95, but it’s clearly an improvement over some of the other models above.^{2}

- And when I set up the traditional null hypothesis where the slope = 0, I do get significant results, telling me the slope is unlikely to be 0 but is consistent with 1.
- Some more technical details and replication scripts for the plots are available at http://soccer.chadmurphy.org/?p=277