This is the second post in a series of methodsy blogs explaining my Expected Goals (xG) model. The first goes through some introductory concepts for xG and I highly recommend you read it if you’re unsure of any of the content here.
Learning Outcomes for Today’s Post:
- Linear Regression improves upon correlation because it shows the size of the effect in addition to the correlation between two variables
- Regression output shows us both the size of the effect and the uncertainty of our statistics
- A relationship between two
Expected Goals 201: xG For Soccer Analytics Majors
Yep, this raw regression output is still really ugly and you should judge anyone who posts things like this. But with the last post hopefully you can read what it says and understand fundamentally what it means while you (rightfully) judge me for posting it with no explanation or formatting.
Last time we went over the first steps to understanding regression output, but in this post I wanted to go over some slightly more advanced techniques. I’m going to explain how I got these numbers (with little or no math), and more detail on what some of the numbers mean.
Where did all these numbers come from?
They came from a technique called regression analysis. In my Intro to Analytics YouTube videos I explain the idea of correlation, and regression is the next step beyond correlation. As you may remember, and as I wrote about in the previous post, correlation tells us how strong the relationship between two variables is and in which direction that relationship goes. So a high correlation means a strong relationship, and a positive correlation means that the two variables vary in the same direction (while a negative correlation means that they go in opposite directions). Correlation tells us much of what we need to know, and for a lot of things you don’t need to do anything fancier than that.
You didn’t answer my question, you went off on a tangent about correlation!
I’m getting there, be patient. So while correlation tells us direction and strength, it doesn’t tell us the size of the effect. As one variable goes up the other goes up, but by how much? That’s what regression adds to our lives – it tells us how much individual variables matter. If we’re trying to predict things, that becomes important.
I’m not sure I understand: can you give me an example?
Sure! The math behind this is incredibly confusing and complicated, but the concept isn’t. The easiest way to think about this is to look at it graphically, so I’m going to take a couple of graphs from my Intro to Analytics videos and walk through them. The first is the correlation between shots outside the area and points.
The above graph shows the scatterplot of shots outside the area (average # per 90 minutes per team) and the number of points a team earned in the 2014-2015 season. Correlation tells us that there is a correlation between the two variables, and that it’s a positive correlation. Life is good – teams should shoot more outside the box, right?
Well yes and no. Now that we’ve learned whether there is a correlation and that the correlation is positive [1], we need to look at the size of the effect. If someone is kind enough to graph their results (and all good analysis comes with something like this graph), then you can see exactly how much the variables affect each other.
In this graph, it’s as simple as lining up 6 shots outside the area per game on the horizontal (“X”) axis with the dotted line and seeing its value on the vertical (“Y”) axis. Let’s say it’s about 55 points. Not bad. Now to see how much an additional shot is worth, we look at the dotted line’s position at 7 shots outside the area. This is about 61 points or so, meaning adding one more shot outside the area per game gets you about 6 points per season, or 2 extra wins. Not bad, but let’s take a look at another metric.
The graph above shows the effect of shots inside the area (per team per 90 minutes) on points. Let’s do the same thing for shots inside the area that we did for shots outside the area: if we find where the dotted line falls for 6 shots per game, we see that this earns you about 40 points. If we do the same thing for 7 shots per game, we find that it earns you about 50 points. This is a 10 point difference, or 3 wins and a draw.
Adding an extra shot inside the area per game gives you 10 extra points, while adding an extra shot outside the area only gives you 6 points. I’m resisting making a “size matters” joke, but when you’re trying to measure the importance of different variables it matters in a very big way.
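The comparison above is just the slope of each dotted line between 6 and 7 shots per game. Here is a minimal Python sketch of that arithmetic, using the approximate values read off the two graphs (the function name `points_per_extra_shot` is made up for illustration):

```python
def points_per_extra_shot(points_at_6, points_at_7):
    """Slope of the fitted line between 6 and 7 shots per game."""
    return (points_at_7 - points_at_6) / (7 - 6)

# Approximate values read off the dotted lines in the two graphs
outside = points_per_extra_shot(55, 61)  # shots outside the area
inside = points_per_extra_shot(40, 50)   # shots inside the area

print(f"Outside the area: about {outside:.0f} points per extra shot")
print(f"Inside the area:  about {inside:.0f} points per extra shot")
```

Both relationships are positive, but the second line is steeper, and that steepness is the "size of the effect."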
This is what regression tells us – the “estimate” column shows us the size of the effect (how much one variable increases while the other increases), while the stars show us if that difference is statistically significant.
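To see why the size of the effect is separate information from the correlation, here is a rough Python sketch. The numbers are invented for illustration (they are not real team data), and `correlation_and_slope` is a hypothetical helper that computes an ordinary Pearson correlation and a least-squares slope by hand:

```python
from statistics import mean

def correlation_and_slope(x, y):
    """Return (Pearson correlation, least-squares slope) for paired data."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5, sxy / sxx

# Made-up numbers: shots per game and season points for five teams
shots = [4, 5, 6, 7, 8]
points_a = [42, 49, 54, 61, 66]  # gentle line
points_b = [28, 39, 48, 59, 68]  # steep line

r_a, slope_a = correlation_and_slope(shots, points_a)
r_b, slope_b = correlation_and_slope(shots, points_b)

print(f"A: correlation {r_a:.3f}, slope {slope_a:.1f} points per shot")
print(f"B: correlation {r_b:.3f}, slope {slope_b:.1f} points per shot")
```

Both made-up datasets have a strong positive correlation, yet an extra shot is worth far more points in the second one. The correlation alone would not tell you that; the slope (the "estimate") does.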
Wait what? Statistically significant? You can’t just drop that term on me without explaining it.
I’m going to – don’t worry!
So what is it????
Statistical significance is a fancy way of saying “are we sure that there’s an effect of one variable on another?” We think that these two variables are related, but how sure are we really?
Everything in life is subject to some sort of uncertainty: whether a coin comes up heads or tails, whether a team wins a game, whether a shot goes into the back of the net or flies over the crossbar, or some non-soccer things that I hear people experience like jobs, friends, and social lives. Statistics are the same way: you can predict something will happen, but you can only be so certain that it will occur. There’s some math behind this, but the easiest way to think of it is like the margin of error in polls: we run a poll and find that President Obama’s approval is (let’s say) 52%, plus or minus 3 points. That means his approval could reasonably be anywhere between 49 and 55, and is far less likely to be outside that range.
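For the curious, here is a small Python sketch of where a "plus or minus 3 points" can come from. It uses the standard 95% margin-of-error formula for a simple random sample; the sample size of 1,000 is an assumption picked for illustration, not a detail of any real poll:

```python
import math

def margin_of_error(share, sample_size, z=1.96):
    """Approximate 95% margin of error for a polled proportion."""
    return z * math.sqrt(share * (1 - share) / sample_size)

approval = 0.52
n = 1000  # hypothetical sample size, chosen for illustration

moe = margin_of_error(approval, n)
low, high = approval - moe, approval + moe

print(f"Margin of error: about {moe * 100:.1f} points")
print(f"Plausible range: {low * 100:.0f}% to {high * 100:.0f}%")
```

With those assumed inputs the margin comes out at roughly 3 points, which gives the 49-to-55 range described above.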
How does this look in our stats table? Good question!
I didn’t ask.
Too bad, I’m going to tell you anyway. Back to my ugly regression output:
I’ve highlighted the “Estimate” column in red. This number is the size of the effect each variable has on the likelihood of scoring a goal (xG).
Do not interpret this number in any way other than “positive or negative” – it’s well beyond what I can teach you here (but maybe in xG 491), and the different “Estimates” cannot be compared to each other.
But “y” is way smaller than “angle” – that means that…
No, it doesn’t. It doesn’t mean anything.
OK fine…*mumbles under breath*
Back to our table – the “estimate” is the size of the effect. This can also be called the “coefficient” or the “slope” (think back to the graphs I showed earlier – we looked at the slope of the line, or the rate of change, to see how big the effect was). Now let’s look at the uncertainty we talked about earlier.
I’ve highlighted the uncertainty measure in blue, known as the “Standard Error” (The “z” in the highlight belongs to the next column – “z value”). This is a measure of how certain we are that the estimate is what we say it is. To say something is statistically significant, we are looking for an estimate that is much bigger than the standard error. That means we have an effect that is much larger than our uncertainty. Low uncertainty means we can be certain that the variable has an effect on the other variable, and we label this “statistically significant” and give it some stars. I’ll get into this more in a later post, but generally you’re looking for an estimate size (absolute value, or ignoring the negative sign if there is one) that is roughly twice as big as the standard error size. Let’s look at an example or two.
Returning to freekick, which is a simple variable looking at whether the shot came from a direct free kick or not. The estimate of the size of this effect is 0.019 (not very big), and the standard error is 0.610 (very big). Because we have a small estimated effect and a large standard error, this is not statistically significant. This is confirmed by a lack of stars in our regression table (stars mean statistically significant). Because of this we say that free kicks are no more or less likely to turn into goals than regular shots (no change in xG).
(If you remember a few lines ago I said the estimate needed to be twice the size of the standard error, and in this case 0.019/0.610 = 0.031, which isn’t even close to 2. Small effect + high uncertainty = no statistical significance)
Let’s look at another example to hammer this concept home. This time we’ll look at “y”, which represents distance from the goal line, or how far out a shot was taken.
This time the estimate is -0.069, which is still pretty small, but the standard error is even smaller at 0.019. A small estimate with a really small amount of uncertainty means we probably have something statistically significant, and can confidently say that distance from goal affects the likelihood of scoring a goal (xG).
(Simple math confirms this: 0.069/0.019 ≈ 3.63, which is greater than 2, which means we have a statistically significant relationship. This is why there are stars next to that row.)
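The "estimate roughly twice the standard error" rule of thumb can be sketched in a few lines of Python. The estimates and standard errors are the rounded values quoted in this post; the helper names and the cutoff of exactly 2 are simplifications for illustration, not what the statistics software does internally:

```python
def estimate_to_error_ratio(estimate, std_error):
    """Size of the estimate (ignoring its sign) relative to its standard error."""
    return abs(estimate) / std_error

def looks_significant(estimate, std_error, threshold=2.0):
    """Rule of thumb: the estimate should be about twice the standard error."""
    return estimate_to_error_ratio(estimate, std_error) >= threshold

# (estimate, standard error) as quoted in the text
rows = {
    "freekick": (0.019, 0.610),
    "y": (-0.069, 0.019),
}

for name, (estimate, std_error) in rows.items():
    ratio = estimate_to_error_ratio(estimate, std_error)
    verdict = "stars" if looks_significant(estimate, std_error) else "no stars"
    print(f"{name}: ratio {ratio:.2f} -> {verdict}")
```

Because the inputs are rounded, the ratios here can differ slightly from the "z value" column in the raw output.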
That’s it: that’s how to read a linear regression table, the main focus of “Expected Goals 201.” The 491 senior seminar will get into the techniques more and how to do a regression of your own, and I’ll definitely be recording a video sometime in the relatively near future about it as well, showing you all how to create your own xG model so be sure to look for that!
[1] A reminder from last time: that’s done by looking at the stars, and then at whether the “estimate” column is positive or negative.
Quick note before starting: if you’re interested in this type of explanation you should watch my “Intro to Analytics” playlist on YouTube and subscribe to my channel to see future updates. Also, follow me on Twitter @Soccermetric.
I did one of my pet peeves yesterday: I posted raw R output of a preliminary cut at some xG data for the NWSL. I’ve spent a bunch of time collecting the data, and was curious whether I had anything interesting yet so I ran the model on limited data (~300 shots). Here’s the raw output:
Ugly, right? Yeah…I should have at least formatted it nicely or named variables in meaningful ways. But there’s some interesting stuff here that I wanted to share with everyone, and since some people showed interest in understanding the model I wanted to write a blog post.
My ultimate goal is to provide three levels of explanation: xG 101 (Intro to Expected Goals), xG 201 (xG for Soccer Analytics majors), and xG 491 (Senior Seminar in xG). The first, which I’m including in this blog, should give you an adequate understanding of what I’m working on, the second will go a little further into the methods, and the third will get into more statistical detail and talk about some of the strengths and weaknesses of what I’ve done so far and where I need to go as this project progresses.
xG 101: Intro to Expected Goals
What is xG?
xG is short for Expected Goals. Basically what it measures is the probability that a shot turns into a goal. When a player takes a shot, how often will she score?
Why do we care?
xG has become one of the most popular statistics in the soccer analytics community, so it’s worth understanding. It’s important because it’s used to answer some questions:
- Which team had the better quality shots during a game?
- Which players should score the most goals?
- Which teams should score the most goals during a season?
The other side of the coin is Expected Goals Allowed (xGA), which answers how many goals a team would be expected to allow. Over the course of a season, xG correlates with how well a team does and how many games a team wins/loses/draws. The idea is that in a single game, teams can get lucky and defy probability but over a season these sorts of things even out.
Do you have an analogy for how this works?
Yes, yes I do. Think about it this way: if you flip a coin it has a 50% chance of coming up heads. If you flip this coin ten times you’d expect it to come up heads 5/10 times (50% of the time). But you wouldn’t be surprised if it came up heads 6 times or 4 times, and a little more surprised but not shocked if it came up 7 times or 3 times. Certainly you wouldn’t be surprised if any single flip came up heads or tails, but over the long run (hundreds or thousands of flips) you’d expect the number of heads to be close to 50%.
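If you want to put numbers on "not surprised" versus "a little more surprised," here is a short Python sketch using the standard binomial formula for a fair coin (the function name `prob_heads` is made up for illustration):

```python
from math import comb

def prob_heads(k, flips=10, p=0.5):
    """Probability of exactly k heads in a given number of flips."""
    return comb(flips, k) * p**k * (1 - p) ** (flips - k)

for k in (5, 6, 7):
    print(f"Exactly {k} heads in 10 flips: {prob_heads(k):.1%}")
```

Exactly 5 heads happens only about a quarter of the time, 6 heads (or 4) about a fifth of the time each, and 7 heads (or 3) a bit more than a tenth of the time each. So even a perfectly fair coin usually misses the "expected" 5 out of 10.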
The same goes for shots: if your star forward takes a shot inside the penalty area, you’d expect her to score (hypothetically) 40% of the time. If she scored on a single shot, you wouldn’t be surprised, but if the goalkeeper saved it you wouldn’t be too surprised either. In any single game, you might see a lot of goals scored (the coin comes up heads several times in a row) or not a lot of goals scored (tails several times in a row), but over the course of a season this should all balance out.
How is xG measured?
It’s fairly simple: it’s a number between 0 and 1 where higher numbers mean a greater likelihood of scoring (a higher quality shot) and lower numbers mean a lower likelihood of scoring (a lower quality shot). A header from 40 yards out would be really unlikely to score, and therefore would have a very low xG score. A shot kicked from 3 feet away on an open net would be really likely to score and would have a very high xG score.
The actual number itself is the probability of scoring: a shot with a 0.4 xG value has a 40% chance of being a goal, while a shot with a 0.15 xG value has a 15% chance of being a goal.
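Because each xG value is a probability, adding up the values for a set of shots gives the number of goals you would expect from them on average. Here is a minimal Python sketch; the shot values are invented for illustration and do not come from any real game or from my model:

```python
# Hypothetical xG values for each shot a team took in one game
home_shots = [0.40, 0.15, 0.05, 0.10]
away_shots = [0.08, 0.03, 0.12, 0.05, 0.07, 0.04]

home_xg = sum(home_shots)
away_xg = sum(away_shots)

print(f"Home: {len(home_shots)} shots, {home_xg:.2f} xG")
print(f"Away: {len(away_shots)} shots, {away_xg:.2f} xG")
```

In this made-up game the away team took more shots, but the home team had the better quality shots and the higher total xG.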
OK, I get what xG means, but what does the ugly regression output you posted mean?
Here’s the explanation I give all of my undergraduates in their Intro to American Government class. Each row is an individual factor (variable) that predicts whether a goal is scored. So you have things like the score at the time of the shot (“diff”), distance from the goal line the shot is taken (“y”), the angle between the shooter and the center of the goal (“angle”), etc. [2]
In the table you’re looking at two things:
- Are there stars on the same line as the variable?
  - If yes, proceed to the next step
- Is the number under the “estimate” column positive or negative?
If there are stars, then the variable has what we call a “statistically significant effect” on the likelihood of a shot scoring (more on this in 201). This basically means that it matters – it correlates with a change in the likelihood of scoring (the xG value). If it doesn’t have stars, the two variables are unrelated and it effectively doesn’t matter. So let’s return to the table for a minute, and look at the “freekick” row.
This variable says “Is the shot from a direct free kick?” So we look to see if there are stars next to it, and there aren’t. So it turns out a shot made from a free kick is no more or less likely to score compared to a regular shot. Good times.
Now let’s look at “counter.” This variable represents whether the shot came as the result of a counter attack. There are a lot of these in the NWSL – it’s a fast-paced, athletic league, so we see a lot of fast-moving counterattacks. But are shots from a counterattack more likely to score?
There is a star next to “counter”, which means that whether the shot came from a counter attack is related to the probability of scoring. Life is good – we found something interesting here. Let’s move on to step #2!
“Is the number in the estimate column positive or negative?”
The estimate number for “counter” is positive, which means counter attacks have a positive relationship with the probability of scoring (xG value). If you’ve watched my Intro To Analytics YouTube videos, you know what this means (and you should watch them, they’re really good!). But if you haven’t, a positive correlation means that when one variable increases the other increases. In this case, what it means is that shots after a counter attack are more likely to score/have a higher xG value. So teams that counter attack more frequently should score more goals.
Let’s look at one more example: distance from goal (labeled “y” in my picture).
So step 1: are there stars? Yes there are, so that means distance from goal is related to the probability of scoring (xG).
Step 2: is the “estimate” number positive or negative? It’s negative, so what does that mean? A negative number means a negative correlation, or a negative relationship between two variables. This means that as one variable goes up, the other goes down.
Specifically here, as the distance from goal goes up, the likelihood of a goal being scored goes down (xG value goes down). Shots taken from distance are less likely to score, which makes sense from a common sense perspective. The other side is that shots taken close in are more likely to score, which again makes sense.
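The two-step reading used in the examples above can be written out as a tiny Python sketch. The function `read_row` is hypothetical, and the estimates for “freekick” and “y” are the rounded values from the table as discussed in xG 201; the wording of the output is mine, not something R prints:

```python
def read_row(name, estimate, has_stars):
    """Apply the two-step reading: check for stars, then check the sign."""
    # Step 1: no stars means we stop here
    if not has_stars:
        return f"{name}: no stars, so no clear effect on scoring"
    # Step 2: the sign of the estimate gives the direction
    direction = "more" if estimate > 0 else "less"
    return f"{name}: stars, so higher values make a goal {direction} likely"

# (variable, estimate, stars?) for two rows of the table
table = [
    ("freekick", 0.019, False),
    ("y", -0.069, True),
]

for name, estimate, has_stars in table:
    print(read_row(name, estimate, has_stars))
```

Note that the sketch only reports direction, never size, which matches how far the 101 reading is meant to go.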
That’s xG/regression analysis 101. I’ll probably turn this into a video and write up xG 201 when I get bored during the EPL games tomorrow, but hopefully this helped people understand what’s going on. 201 will go into a little more detail of how this worked, and then 491 will be a sophisticated treatment of regression analysis and how things work.
[2] The variables in the table:
- diff – the “game state” or difference in score between the two teams
- y – the distance from the goal line
- angle – the angle (in radians) between the shooter and the center of the goal
- time – the time the shot was taken
- def.distance – the distance between the shooter and the nearest defender
- head – was the shot a header?
- foot – was the shot kicked?
- counter – was the shot the result of the counter attack?
- home.team – was the shot taken by someone on the home team?
- gk.error – was the shot after a goalkeeper’s error?
- freekick – was the shot a direct free kick?
- corner – was the shot assisted off a corner kick?
For anyone who is interested, here is the beginning of my Intro to Soccer Analytics “class.” We cover some introductory topics here: picking a question, the importance of theory, difference of means tests, and answer the question of whether Liverpool has improved under Jurgen Klopp.
If you enjoy it, please like the videos, subscribe to my channel, and tell a friend about it. I’m doing this in my free time and all I ask is that you help get the word out about this course and help me get as many people interested as we can!