Quick note before starting: if you’re interested in this type of explanation you should watch my “Intro to Analytics” playlist on YouTube and subscribe to my channel to see future updates. Also, follow me on Twitter @Soccermetric.
I did one of my pet peeves yesterday: I posted raw R output1 of a preliminary cut at some xG data for the NWSL. I’ve spent a bunch of time collecting the data, and was curious whether I had anything interesting yet so I ran the model on limited data (~300 shots). Here’s the raw output
Ugly, right? Yeah…I should have at least formatted it nicely or named variables in meaningful ways. But there’s some interesting stuff here that I wanted to share with everyone, and since some people showed interest in understanding the model I wanted to write a blog post.
My ultimate goal is to provide three levels of explanation: xG 101 (Intro to Expected Goals), xg 201 (xG for Soccer Analytics majors), and xG 491 (Senior Seminar in xG). The first, which I’m including in this blog, should give you an adequate understanding of what I’m working on, the second will go a little further into the methods, and the third will get into more statistical detail and talk about some of the strengths and weaknesses of what I’ve done so far and where I need to go as this project progresses.
xG 101: Intro to Expected Goals
What is xG?
xG is short for Expected Goals. Basically what it measures is the probability that a shot turns into a goal. When a player takes a shot, how often will she score?
Why do we care?
xG has become one of the most popular statistics in the soccer analytics community, so it’s worth understanding. It’s important because it’s used to answer some questions:
- Which team had the better quality shots during a game?
- Which players should score the most goals?
- Which teams should score the most goals during a season?
The other side of the coin is Expected Goals Allowed (xGA), which answers how many goals a team would be expected to allow. Over the course of a season, xG correlates with how well a team does and how many games a team wins/loses/draws. The idea is that in a single game, teams can get lucky and defy probability but over a season these sorts of things even out.
Do you have an analogy for how this works?
Yes, yes I do. Think about it this way: if you flip a coin it has a 50% chance of coming up heads. If you flip this coin ten times you’d expect it to come up heads 5/10 times (50% of the time). But you wouldn’t be surprised if it came up heads 6 times or 4 times, and a little more surprised but not shocked if it came up 7 times or 3 times. Certainly you wouldn’t be surprised if any single flip came up heads or tails, but over the long run (hundreds or thousands of shots ) you’d expect the number of heads to be close to 50%.
The same goes for shots: if your star forward takes a shot inside the penalty area, you’d expect her to score (hypothetically) 40% of the time. If she scored on a single shot, you wouldn’t be surprised, but if the goalkeeper saved it you wouldn’t be too surprised either. In any single game, you might see a lot of goals scored (the coin comes up heads several times in a row) or not a lot of goals scored (tails several times in a row), but over the course of a season this should all balance out.
How is xG measured?
It’s fairly simple: it’s a number between 0-1 where higher numbers mean a greater likelihood of scoring (a higher quality shot) and lower numbers mean a lower likelihood of scoring (a lower quality shot). A header from 40 yards out would be really unlikely to score, and therefore would have a very low xG score. A shot kicked from 3 feet away on an open net would be really likely to score and would have a very high xG score.
The actual number itself is the probability of scoring: a shot with a 0.4 xG value has a 40% chance of being a goal, while a shot with a 0.15 xG value has a 15% chance of being a goal.
OK, I get what xG means, but what does the ugly regression output you posted mean?
Here’s the explanation I give all of my undergraduates in their Intro to American Government class. Each row is an individual factor (variable) that predicts whether a goal is scored. So you have things like the score at the time of the shot (“diff”), distance from the goal line the shot is taken (“y”), the angle between the shooter and the center of the goal (“angle”), etc.2
In the table you’re looking at two things:
- Are there stars on the same line as the variable?
- If yes, proceed to the next step
- Is the number under the “estimate” column positive or negative?
If there are stars, then the variable has what we call a “statistically significant effect” on the likelihood of a shot scoring (more on this in 201). This basically means that it matters – it correlates with a change in the likelihood of scoring (the xG value). If it doesn’t have stars, the two variables are unrelated and it effectively doesn’t matter. So let’s return to the table for a minute, and look at the “freekick” column.
This variable says “Is the shot from a direct free kick?” So we look to see if there are stars next to it, and there aren’t. So it turns out a shot made from a free kick is no more or less likely to score compared to a regular shot. Good times.
Now let’s look at “counter.” This variable represents whether the shot came as the result of a counter attack. There are a lot of these in the NWSL – it’s a fast-paced, athletic league, so we see a lot of fast-moving counterattacks. But are shots from a counterattack more likely to score?
There is a star next to “counter”, which means that whether the shot came from a counter attack is related to the probability of scoring. Life is good – we found something interesting here. Let’s move on to step #2~!
“Is the number in the estimate column positive or negative?”
The estimate number for “counter” is positive, which means counter attacks have a positive relationship with the probability of scoring (xG value). If you’ve watched my Intro To Analytics YouTube videos, you know what this means (and you should watch them, they’re really good!). But if you haven’t, a positive correlation means that when one variable increases the other increases. In this case, what it means is that shots after a counter attack are more likely to score/have a higher xG value. So teams that counter attack more frequently should score more goals.
Let’s look at one more example: distance from goal (labeled “y” in my picture).
So step 1: are there stars? Yes there are, so that means distance from goal is related to the probability of scoring (xG).
Step 2: is the “estimate” number positive or negative? It’s negative, so what does that mean? A negative number means a negative correlation, or a negative relationship between two variables. This means that as one variable goes up, the other goes down.
Specifically here, as the distance from goal goes up, the likelihood of a goal being scored goes down (xG value goes down). Shots taken from distance are less likely to score, which makes sense from a common sense perspective. The other side is that shots taken close in are more likely to score, which again makes sense.
That’s xG/regression analysis 101. I’ll probably turn this into a video and write up xG 201 when I get bored during the EPL games tomorrow, but hopefully this helped people understand what’s going on. 201 will go into a little more detail of how this worked, and then 491 will be a sophisticated treatment of regression analysis and how things work.
- diff – the “game state” or difference in score between the two teams
- y – the distance from the goal line
- angle – the angle (in radians) between the shooter and the center of the goal
- time – the time the shot was taken
- def.distance – the distance between the shooter and the nearest defender
- head – was the shot a header?
- foot – was the shot kicked?
- counter – was the shot the result of the counter attack?
- home.team – was the shot taken by someone on the home team?
- gk.error – was the shot after a goalkeeper’s error?
- freekick – was the shot a direct free kick?
- corner – was the shot assisted off a corner kick?