This is the second post in a series of methodsy blogs explaining my Expected Goals (xG) model. The first goes through some introductory concepts for xG and I highly recommend you read it if you’re unsure of any of the content here.
Learning Outcomes for Today’s Post:
- Linear Regression improves upon correlation because it shows the size of the effect in addition to the correlation between two variables
- Regression output shows us both the size of the effect and the uncertainty of our statistics
- A relationship between two
Expected Goals 201: xG For Soccer Analytics Majors
Yep, this raw regression output is still really ugly and you should judge anyone who posts things like this. But with the last post hopefully you can read what it says and understand fundamentally what it means while you (rightfully) judge me for posting it with no explanation or formatting.
Last time we went over the first steps to understanding regression output, but in this post I wanted to go over some slightly more advanced techniques. I’m going to explain how I got these numbers (with little or no math), and more detail on what some of the numbers mean.
Where did all these numbers come from?
They came from a technique called regression analysis. In my Intro to Analytics YouTube videos I explain the idea of correlation, and regression is the next step beyond correlation. As you may remember, and as I wrote about in the previous post, correlation tells us how strong the relationship between two variables is and in which direction that relationship goes. So a high correlation means a strong relationship, and a positive correlation means that the two variables vary in the same direction (while a negative correlation means that they go in opposite directions). Correlation tells us much of what we need to know, and for a lot of things you don’t need to do anything fancier than that.
You didn’t answer my question, you went on a side-drain about correlation!
I’m getting there, be patient. So while correlation tells us direction and strength, it doesn’t tell us the size of the effect. As one variable goes up the other goes up, but by how much? That’s what regression adds to our lives – it tells us how much individual variables matter. If we’re trying to predict things, that becomes important.
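To make that difference concrete, here is a minimal sketch in Python. The numbers are made up for illustration (they are not the real 2014-2015 data), and the helper names `pearson` and `slope` are hypothetical. Both invented datasets have exactly the same correlation, but one extra shot is worth a different number of points in each.

```python
from math import sqrt

def pearson(xs, ys):
    """Correlation: direction and strength of the relationship."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def slope(xs, ys):
    """Regression slope: how much y changes per one-unit change in x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / sum((x - mean_x) ** 2 for x in xs)

# Invented data: shots per game and season points for five imaginary teams.
shots = [4, 5, 6, 7, 8]
points_a = [43, 49, 55, 61, 67]   # rises 6 points per extra shot
points_b = [20, 30, 40, 50, 60]   # rises 10 points per extra shot

for label, points in (("A", points_a), ("B", points_b)):
    print(label, round(pearson(shots, points), 3), slope(shots, points))
```

Correlation alone cannot tell the two datasets apart. The slope can, and that is the piece regression adds.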
I’m not sure I understand: can you give me an example?
Sure! The math behind this is incredibly confusing and complicated, but the concept isn’t. The easiest way to think about this is to look at it graphically, so I’m going to take a couple of graphs from my Intro to Analytics videos and walk through them. The first is the correlation between shots outside the area and points.
The above graph shows the scatterplot of shots outside the area (average # per 90 minutes per team) and the number of points a team earned in the 2014-2015 season. The scatterplot tells us that there is a correlation between the two variables, and that it’s a positive correlation. Life is good – teams should shoot more from outside the box, right?
Well yes and no. Now that we’ve learned whether there is a correlation and that the correlation is positive [1], we need to look at the size of the effect. If someone is kind enough to graph their results (and all good analysis comes with something like this graph), then you can see exactly how much the variables affect each other.
In this graph, it’s as simple as lining up 6 shots outside the area per game on the horizontal (“X”) axis with the dotted line and seeing its value on the vertical (“Y”) axis. Let’s say it’s about 55 points. Not bad. Now to see how much an additional shot is worth, we look at the dotted line’s position at 7 shots outside the area. This is about 61 points or so, meaning adding one more shot outside the area per game gets you about 6 points per season, or 2 extra wins. Not bad, but let’s take a look at another metric.
The graph above here shows the effect of shots inside the area (per team per 90 minutes) on points. Let’s do the same thing for shots inside the area that we did for shots outside the area: if we find where the dotted line falls for 6 shots per game, we see that this earns you about 40 points. If we do the same thing for 7 shots per game, we find that it earns you about 50 points. This is a 10 point difference, or 3 wins and a draw.
Adding an extra shot inside the area per game gives you 10 extra points, while adding an extra shot outside the area only gives you 6 points. I’m resisting making a “size matters” joke, but when you’re trying to measure the importance of different variables it matters in a very big way.
This is what regression tells us – the “estimate” column shows us the size of the effect (how much one variable changes when the other increases by one), while the stars show us whether that effect is statistically significant.
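The graph-reading exercise above is just a slope calculation: the change in points divided by the change in shots. Here is a small Python sketch using the approximate values read off the dotted lines (55 and 61 points for shots outside the area, 40 and 50 for shots inside). The function name `points_per_extra_shot` is made up for this example.

```python
def points_per_extra_shot(points_low, points_high, shots_low=6, shots_high=7):
    """Slope of the dotted line between two readings: rise over run."""
    return (points_high - points_low) / (shots_high - shots_low)

# Approximate readings from the two graphs at 6 and 7 shots per game.
outside = points_per_extra_shot(55, 61)
inside = points_per_extra_shot(40, 50)

print(f"Extra shot outside the area: about {outside:.0f} points per season")
print(f"Extra shot inside the area:  about {inside:.0f} points per season")
```

Both shot types are positively related to points, but the slopes show that an extra shot inside the area is worth noticeably more.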
Wait what? Statistically significant? You can’t just drop that term on me without explaining it.
I’m going to – don’t worry!
So what is it????
Statistical significance is a fancy way of saying “are we sure that there’s an effect of one variable on another?” We think that these two variables are related, but how sure are we really?
Everything in life is subject to some sort of uncertainty: whether a coin comes up heads or tails, whether a team wins a game, whether a shot goes into the back of the net or flies over the crossbar, or some non-soccer things that I hear people experience like jobs, friends, and social lives. Statistics are the same way: you can predict something will happen, but you can only be so certain that it will occur. There’s some math behind this, but the easiest way to think of it is like the margin of error in polls: we run a poll and find that President Obama’s approval is (let’s say) 52%, plus or minus 3 points. That means his approval could reasonably be anywhere between 49% and 55%, and is far less likely to be outside that range.
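The poll arithmetic can be written out in a couple of lines of Python. This is only a sketch of the "estimate plus or minus margin" idea, using the hypothetical 52% and 3 points from the example. The function name `plausible_range` is invented here.

```python
def plausible_range(estimate, margin):
    """Return the (low, high) range implied by an estimate and its margin of error."""
    return estimate - margin, estimate + margin

low, high = plausible_range(52, 3)
print(f"Approval is plausibly between {low}% and {high}%")
```

The same way of thinking carries over to a regression table: the estimate is the headline number, and the uncertainty tells you how wide the plausible range around it is.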
How does this look in our stats table? Good question!
I didn’t ask.
Too bad, I’m going to tell you anyway. Back to my ugly regression output:
I’ve highlighted the “Estimate” column in red. This number is the size of the effect each variable has on the likelihood of scoring a goal (xG).
Do not interpret this number in any way other than “positive or negative” – it’s well beyond what I can teach you here (but maybe in xG 491), and the different “Estimates” cannot be compared to each other.
But “y” is way smaller than “angle” – that means that…
No, it doesn’t. It doesn’t mean anything.
OK fine…*mumbles under breath*
Back to our table – the “estimate” is the size of the effect. This can also be called the “coefficient” or the “slope” (think back to the graphs I showed earlier – we looked at the slope of the line, or the rate of change, to see how big the effect was). Now let’s look at the uncertainty we talked about earlier.
I’ve highlighted the uncertainty measure in blue, known as the “Standard Error” (The “z” in the highlight belongs to the next column – “z value”). This is a measure of how certain we are that the estimate is what we say it is. To say something is statistically significant, we are looking for an estimate that is much bigger than the standard error. That means we have an effect that is much larger than our uncertainty. Low uncertainty means we can be certain that the variable has an effect on the other variable, and we label this “statistically significant” and give it some stars. I’ll get into this more in a later post, but generally you’re looking for an estimate size (absolute value, or ignoring the negative sign if there is one) that is roughly twice as big as the standard error size. Let’s look at an example or two.
Let’s return to freekick, a simple variable that records whether the shot came from a direct free kick or not. The estimate of the size of this effect is 0.019 (not very big), and the standard error is 0.610 (very big). Because we have a small estimated effect and a large standard error, this is not statistically significant. This is confirmed by a lack of stars in our regression table (stars mean statistically significant). Because of this we say that free kicks are no more or less likely to turn into goals than regular shots (no change in xG).
(If you remember a few lines ago I said the estimate needed to be twice the size of the standard error, and in this case 0.019/0.610 = 0.031, which isn’t even close to 2. Small effect + high uncertainty = no statistical significance)
Let’s look at another example to hammer this concept home. This time we’ll look at “y”, which represents distance from the goal line, or how far out a shot was taken.
This time the estimate is -0.069, which is still pretty small, but the standard error is even smaller at 0.019. A small estimate with a really small amount of uncertainty means we probably have something statistically significant, and can confidently say that distance from goal affects the likelihood of scoring a goal (xG).
(Simple math confirms this: 0.069/0.019 ≈ 3.6, which is greater than 2, which means we have a statistically significant relationship. This is why there are stars next to that row.)
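The "roughly twice the standard error" rule of thumb is easy to check in code. The Python sketch below uses the rounded estimates and standard errors quoted in this post for freekick and y. The function name `rule_of_thumb` and the cutoff of 2 as a default parameter are choices made for this illustration, not part of any statistics package.

```python
def rule_of_thumb(estimate, std_error, cutoff=2.0):
    """Compare the size of the estimate (ignoring its sign) to its standard error."""
    ratio = abs(estimate) / std_error
    return ratio, ratio >= cutoff

# (estimate, standard error) pairs as quoted in the text.
rows = {
    "freekick": (0.019, 0.610),
    "y": (-0.069, 0.019),
}

for name, (estimate, std_error) in rows.items():
    ratio, significant = rule_of_thumb(estimate, std_error)
    verdict = "statistically significant" if significant else "not significant"
    print(f"{name}: |estimate| / standard error = {ratio:.2f} ({verdict})")
```

Because the inputs are rounded values, the ratios can differ slightly from ones computed on the unrounded regression output, but the verdicts are the same: freekick falls far short of 2, and y clears it comfortably.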
That’s it: that’s how to read a linear regression table, the main focus of “Expected Goals 201.” The 491 senior seminar will get into the techniques more and how to do a regression of your own, and I’ll definitely be recording a video sometime in the relatively near future about it as well, showing you all how to create your own xG model so be sure to look for that!
[1] A reminder from last time: that’s done by looking at the stars, and then at whether the “estimate” column is positive or negative.