I’ve been focusing on diagnostics with MOTSON lately, but one thing I haven’t looked at is comparing the model’s predictions to some of the underlying statistical measures. The model has done well, but how much of the prediction error comes from, for lack of a better word, “error”, and how much comes from random variation? So I wanted to test this against the gold standard in soccer analytics: xG.

As a refresher, my most recent model’s expected points correlates with actual points at 0.61, which is pretty solid but not spectacular. Chelsea and Leicester City are pretty big swings and misses for the model right now, dragging the correlation down from about 0.8, which would be solid and spectacular. Here’s the scatterplot of model fit with a regression line of b = 1.0 thrown in for good measure.
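The scatterplot above can be sketched with a few lines of matplotlib. The data frame and column names here are placeholders of my own, not MOTSON’s actual output; the point is just the y = x reference line, which is what a regression line with b = 1.0 amounts to:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render to file, no display needed
import matplotlib.pyplot as plt

# Hypothetical stand-in for MOTSON's table; values are illustrative only
df = pd.DataFrame({
    "team": ["Chelsea", "Leicester", "Arsenal", "Watford"],
    "expected_points": [38.0, 22.0, 35.0, 20.0],
    "actual_points": [22.0, 35.0, 36.0, 22.0],
})

fig, ax = plt.subplots()
ax.scatter(df["expected_points"], df["actual_points"])

# A slope-1 reference line: perfect predictions would sit exactly on y = x
lims = [df[["expected_points", "actual_points"]].min().min(),
        df[["expected_points", "actual_points"]].max().max()]
ax.plot(lims, lims)

ax.set_xlabel("Expected points (MOTSON)")
ax.set_ylabel("Actual points")
fig.savefig("model_fit.png")
```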

I would imagine all my readers know this, but for those who don’t, the quick version is that xG is a measure of shot quality: how many goals would you expect a team to have scored given the quality and number of their shots? But we know there isn’t a perfect correlation between the two measures, and variance causes teams to score more or fewer goals than the xG measure would predict.

I downloaded xG data from Michael Caley’s fancy stats site (@MC_of_A on Twitter), which included both expected goals for and against for each team in the EPL. I merged in my “expected points” data and the actual points each EPL team has earned, and created a new variable with the difference between each team’s xG and the actual number of non-penalty goals (NPG). This is what I used for my analysis.
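The merge and the new difference variable look roughly like this in pandas. The frames and column names below are my own labels standing in for the downloaded data, and the numbers are illustrative, not the real season figures:

```python
import pandas as pd

# Stand-in for Caley's xG download: expected goals for/against plus
# actual non-penalty goals (NPG) per team
xg = pd.DataFrame({
    "team": ["Chelsea", "Leicester"],
    "xG_for": [30.2, 28.1],
    "npg_for": [24, 33],
})

# Stand-in for MOTSON's expected points and the actual table
points = pd.DataFrame({
    "team": ["Chelsea", "Leicester"],
    "expected_points": [38.0, 22.0],
    "actual_points": [22.0, 35.0],
})

df = xg.merge(points, on="team")

# Difference between each team's actual non-penalty goals and their xG
df["npg_minus_xg"] = df["npg_for"] - df["xG_for"]
```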

Question #1: Is MOTSON successful in predicting the underlying stats? Specifically, how well do MOTSON’s predictions match up with xG?

I correlated the xGD^{1} measure with my expected points. We know Goal Difference is a great predictor of table position (correlating at 0.92 as of this post), so I want to see if MOTSON’s expected points correlate with Expected Goals. I ran a quick bivariate correlation, and got a Pearson’s r value of 0.83, very high by any account.
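A bivariate Pearson correlation is a one-liner with numpy. The vectors here are toy numbers (one entry per team; the real analysis used all 20 EPL teams), so the resulting r is illustrative, not the 0.83 from the actual data:

```python
import numpy as np

# Toy per-team vectors standing in for the real season data
expected_points = np.array([38.0, 22.0, 35.0, 30.0])
xgd = np.array([4.8, 3.5, 6.0, 1.0])

# Pearson's r is the off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(expected_points, xgd)[0, 1]
```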

*MOTSON passes this test, showing a high correlation between expected points and expected goal difference.*

Question #2: How much of MOTSON’s error is explained by variation from underlying statistics? Specifically, do xG and MOTSON share the same misses?

For this, I correlated the xG residual measure (calculated as NPGD – xGD) with MOTSON’s residuals (calculated as Expected Points – Actual Points).^{2} If xG and MOTSON’s residuals are highly correlated, it could be evidence that the teams MOTSON over/underpredicts are over/underperforming compared to expectations and the underlying statistics the soccer analytics community looks at. The overall correlation here was almost as strong, with a Pearson’s r of 0.74.
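The residual comparison follows the definitions in the text: xG residual = NPGD − xGD, and MOTSON residual = expected points − actual points. A sketch with placeholder numbers (not the real season data):

```python
import numpy as np

# Placeholder per-team values, for illustration only
npgd = np.array([-9.0, 11.0, 2.0, -3.0])             # non-penalty goal difference
xgd = np.array([4.8, 3.5, 1.0, -1.0])                # expected goal difference
expected_points = np.array([38.0, 22.0, 30.0, 25.0])  # MOTSON
actual_points = np.array([22.0, 35.0, 29.0, 24.0])

xg_resid = npgd - xgd                       # over/underperformance vs xG
motson_resid = expected_points - actual_points  # over/underprediction by MOTSON

# Do the two sets of residuals move together?
r = np.corrcoef(xg_resid, motson_resid)[0, 1]
```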

MOTSON’s biggest outliers this season are Chelsea, Leicester City, and Aston Villa. The first two would have been almost impossible to pick pre-season, but do models using current-season data do better on them? It turns out those three are fairly problematic for xG models as well. MOTSON overestimates Chelsea’s points by 16, but their xGD is actually positive at 4.8. Compared to their NPGD of -9, this shows that even in-season statistics have trouble with Chelsea’s performance this year. Similarly, Aston Villa underperforms MOTSON by 10.71 points and underperforms xG by 7.4 goals. Leicester fares better under in-season stats, overperforming MOTSON by 13.28 points but off by only 3.5 goals against xG.

MOTSON has done well for me this season, but it’s interesting that it does a better job predicting xG than it does actual points. This makes me more confident that the model is onto something, and that it figured a lot of this out in the pre-season without any current-season data.
