I thought it was important to go into some methodological details on my predictions, so if you’re interested here are the technical details on my method.

Continue reading Predicting EPL Outcomes: The Method (Updated 8/19/2015)

One of the big open questions in expected goals research is accounting for individual finishing skill: a shot taken by Lionel Messi is worth more than a shot taken by Jesus Navas, but how do we account for that? There are any number of open issues here, mostly methodological, but I’ve recently started thinking about a more theoretical one that should be addressed before worrying about the statistics underlying the concept: What exactly do we mean by “finishing skill?”

As far as I’ve read, finishing skill is typically thought of as the ratio of goals above (or below) the number of expected goals. Dismissing the ideas of variance and imprecision in measurement for a moment, a player who scores more goals than expected is a good finisher while a player who scores fewer goals than expected is a bad one. Lionel Messi outperforms his expected goals, so we can say that he’s clinical in front of goal, while Jesus Navas underperforms so we can say that he’s…well as a Manchester City supporter I don’t want to talk about it. But is this all there is to be said?

One of the great contributions of expected goals is the idea that all shots are not created equal, that is to say a shot taken from out wide and from outside the penalty area is less likely to score than one taken from the center of the goal six yards away. But why shouldn’t we apply this to the idea of finishing as well? I’m coming around to the idea that there are as many types of finishing skill as there are types of shots, so the next step is to identify the important ones and identify which players are what types of finishers. To do this, I draw on some of the fundamental contributions of expected goals research along with the eyeball test from watching and playing however many thousands of hours of soccer.^{1}

**The Clinical Finisher**

This archetype comes from the idea of players who are clinical in front of the net. They’re calm, collected, and don’t miss easy opportunities. In xG terms, they score on high probability shots even more frequently than one would expect. A 0.5 xG shot kicked by Lionel Messi one-on-one vs. the goalkeeper is more likely to go in than one kicked by Fernando Torres at Chelsea. So theoretically that shot is 0.7 on Messi’s boot and 0.3 on Torres’s. This likely correlates with things like confidence, composure, and close ball control.

**The Long Range Sniper**

Shots taken outside the box and from an angle have a low xG value, yet some players continue to take them. Presumably some are better at these shots than others, and being able to shoot from distance is certainly a skill that can be developed. We could also expect some players to not realize that they don’t have this skill and still take a number of shots from distance, so we’d see some significant variation here. I’m thinking about Zlatan’s famous bicycle kick for Sweden: for a mortal man that shot would be 0.00001 xG, but for Zlatan maybe he makes that as many as 1/10 times (0.1 xG).

**The Free Kick Specialist**

Direct free kicks are a skill some players have while others don’t. Players like Cristiano Ronaldo, Yaya Toure, or Andrea Pirlo probably deserve a decent bump in xG from direct free kicks, while other players are probably below average. There’s some difficulty here in that only good free kick takers would really ever take any, but it’s a skill we could measure.

**The Head of the Class**

We know headed shots are lower value than those that are kicked, but obviously this isn’t equal across the board. Andrea Pirlo has never been known as a great header of the ball so maybe a header taken by him that would normally have an xG value of 0.4 would have a true value of 0.3, while someone like Zlatan or even Gerard Pique would have more talent in this area and would be worth a 0.5.

There are likely more types – players who are better on counter attacks, players who are better on corners, etc., but I wanted to present a few basic archetypes because it’s worth discussing and worth thinking about not just finishing skill, but types of finishers. If you’re trying to build a team, you wouldn’t just want the best finishers using the pure xG/actual goals metric, you’d want complementary players. Maybe you’d want to build a team filled with speedy, clinical players who could finish goals on counter attacks. Or maybe you want to play along the flanks and cross the ball into the box 30+ times a game, so you’d want some forwards who are strong headers of the ball. Maybe every team needs at least one free kick specialist, or maybe you’d want a balance. But regardless of the strategy, using xG to define different types of finishers would be a useful addition to the toolkit.
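
To make the archetype idea concrete, here is a minimal sketch of how you might split finishing by shot type. Everything in it is an assumption for illustration: the player names, shot categories, and xG values are invented, and I use goals minus xG (a difference rather than a ratio) simply because it behaves better when xG totals are small.

```python
from collections import defaultdict

# Hypothetical shot log: (player, shot_type, xG, scored). All values are invented.
shots = [
    ("Player A", "close_range", 0.50, True),
    ("Player A", "close_range", 0.40, True),
    ("Player A", "long_range", 0.03, False),
    ("Player B", "close_range", 0.50, False),
    ("Player B", "header", 0.30, True),
    ("Player B", "long_range", 0.05, True),
]

def finishing_by_type(shots):
    """Return {(player, shot_type): goals minus summed xG}."""
    totals = defaultdict(lambda: [0, 0.0])  # [goals, summed xG]
    for player, shot_type, xg, scored in shots:
        totals[(player, shot_type)][0] += int(scored)
        totals[(player, shot_type)][1] += xg
    return {key: goals - xg for key, (goals, xg) in totals.items()}

profile = finishing_by_type(shots)
for (player, shot_type), diff in sorted(profile.items()):
    print(f"{player:9s} {shot_type:12s} goals - xG = {diff:+.2f}")
```

A positive number in the close range row would suggest a clinical finisher, while a positive number in the long range row would point toward a sniper. With real data you would want far more shots per cell before reading anything into it.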

- All players and numbers I use here are hypothetical. The point isn’t to identify specific players or specific values, just to present illustrations of what I’m thinking ↩

One of the strategic questions that has always interested me is: what is the best way to catch up after going behind in a soccer match? To my mind, there are two options:

- Take a lot of low percentage shots, hoping that volume makes up for a lack of quality.
- Be patient and wait for the high quality chances, hoping that quality makes up for a lack of volume.

There are merits to both, and you could probably solve this mathematically based on expected number of shots and expected quality per shot given any number of variables. My head is spinning thinking about how you’d actually solve this equation, but given enough familiarity with teams and the right data the math would be easy enough. Solving this equation isn’t my goal with this post, instead I want to see what teams have done and use observable data to see what their strategies are/potentially how they’ve solved the problem for themselves.
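
For a rough feel of the trade-off between the two options, here is a small sketch. It assumes every shot is independent and that all shots in a strategy share the same xG, which is a big simplification, and the shot counts and xG values are hypothetical numbers I picked for illustration. Under those assumptions the chance of scoring at least once from n shots of value p is 1 - (1 - p)^n.

```python
def p_at_least_one_goal(n_shots, xg_per_shot):
    """Probability of scoring at least once, assuming independent, identical shots."""
    return 1.0 - (1.0 - xg_per_shot) ** n_shots

# Hypothetical strategies for the rest of a match.
volume = p_at_least_one_goal(n_shots=10, xg_per_shot=0.05)   # many poor shots
quality = p_at_least_one_goal(n_shots=3, xg_per_shot=0.20)   # few good shots

print(f"Volume strategy:  {volume:.3f}")
print(f"Quality strategy: {quality:.3f}")
```

With these particular made-up numbers the patient approach comes out ahead, but the answer flips easily if you change how many shots each strategy generates, which is why the observable data is worth looking at.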

To do this, I’ve undertaken two separate analyses. The first is simple enough: what is the likelihood that a shot goes in given the game state at the time of the shot? More simply put: does shot quality correlate with score?

To answer the question, I ran an analysis (full details in the appendix) looking at each shot in the NWSL this season and part of last season^{1}. I calculated the probability that each shot becomes a goal, and compared those probabilities when the score is even, the shooter’s team is one goal ahead/behind, two goals ahead/behind, three goals ahead/behind.

If teams look to catch up by taking lower probability shots when they are behind, we’d expect to see the average shot have a lower expected goal (xG) value the further behind they are, while when they are ahead the average shot would have a higher xG value.

Conversely, if teams look to catch up by taking higher probability shots when they are behind, we’d expect to see the average shot have a higher expected goal (xG) value the further behind they are, while when they are ahead the average shot would have a lower xG value. I present the results of my analysis in the figure below.

The points represent each shot taken, while the y-axis represents the Expected Goal value and the x-axis represents the goal difference at the time the shot was taken. The red boxes represent the average xG value for the shots taken at a given goal difference and the standard error around that average. If you compare the center lines in each box, you can see an upward trajectory from -3 to +3, meaning that teams take lesser quality shots when they are behind and focus on higher quality shots when they are ahead.

My analysis of shot data shows that teams focus on taking whatever shot is available when they are behind, hoping that taking enough lower quality shots will help them get back in the game. There are a number of potential explanations for this, but it seems like teams prefer to take any available shot when they are behind but can be more selective when they are ahead.

**Appendix**

Here are the results of my probit regression: my dependent variable was “did the shot result in a goal scored?” and my independent variables are in the left column of the below table. The explanatory variable here is “goal difference” and it is positive and statistically significant (p < 0.05). That indicates goal difference is a significant predictor of likelihood of a goal scoring, and when teams are leading they take higher quality shots.

| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 0.1865 | 0.2929 | 0.64 | 0.5243 |
| Goal Difference | 0.1076 | 0.0512 | 2.10 | 0.0357 |
| Distance from Goal | -0.0810 | 0.0120 | -6.74 | 0.0000 |
| Angle to Center of Goal | -0.7620 | 0.1889 | -4.03 | 0.0001 |
| Time | -0.0007 | 0.0023 | -0.32 | 0.7498 |
| Was the Shot Pressured | -0.2115 | 0.1306 | -1.62 | 0.1054 |
| Kicked | 0.1807 | 0.1891 | 0.96 | 0.3394 |
| Counter Attack | 0.4137 | 0.1329 | 3.11 | 0.0019 |
| Home Team | -0.0751 | 0.1197 | -0.63 | 0.5305 |
| Goalkeeper Error | 1.9129 | 0.4599 | 4.16 | 0.0000 |
| Direct Free Kick | 0.5552 | 0.3278 | 1.69 | 0.0903 |
| Assisted from a Corner | -0.0103 | 0.2471 | -0.04 | 0.9668 |
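
To show how these coefficients turn into a shot probability, here is a minimal sketch that plugs them into the probit link (the standard normal CDF). The coefficients are copied from the table above, but the example shot is hypothetical and the units are my assumptions: distance in yards, angle in radians, time in minutes, and 0/1 indicators for everything else. The short variable names are also mine.

```python
import math

# Coefficients copied from the probit table above; key names are shorthand.
COEFS = {
    "intercept": 0.1865,
    "goal_difference": 0.1076,
    "distance": -0.0810,        # assumed to be in yards
    "angle": -0.7620,           # assumed to be in radians
    "time": -0.0007,            # assumed to be in minutes
    "pressured": -0.2115,
    "kicked": 0.1807,
    "counter_attack": 0.4137,
    "home_team": -0.0751,
    "goalkeeper_error": 1.9129,
    "direct_free_kick": 0.5552,
    "from_corner": -0.0103,
}

def normal_cdf(z):
    """Standard normal CDF, which is the probit link's inverse."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def predicted_xg(shot):
    """Probability of a goal for a shot given as {variable: value}."""
    z = COEFS["intercept"] + sum(COEFS[name] * value for name, value in shot.items())
    return normal_cdf(z)

# Hypothetical kicked shot from 12 yards, 0.3 radians off center, 60th minute.
base = {"distance": 12, "angle": 0.3, "time": 60, "kicked": 1}
behind = predicted_xg({**base, "goal_difference": -2})
ahead = predicted_xg({**base, "goal_difference": 2})

print(f"xG when two goals behind: {behind:.3f}")
print(f"xG when two goals ahead:  {ahead:.3f}")
```

Holding the rest of the shot fixed, the positive goal difference coefficient pushes the predicted probability up when the shooting team is ahead, which is the pattern described above.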

@deepXG mentioned that the causal arrow might be going in the wrong direction: teams taking lower xG shots might be more likely to fall behind so I also wanted to do an analysis within games to show a change within games. I subdivided the data by the final score: winning/losing by 3, 2, and 1 goal, and ties (winning/losing by 0). Most of these final scores didn’t have enough shots across a variety of game states (games that finish in a tie tend to spend most of the game tied, meaning there’s not much variation on the dependent variable to analyze), but I was able to find a pattern among the most extreme results (+/- 3 goals).

For both outcomes, we see the same pattern as in the main analysis (although with more uncertainty because of a relative paucity of data). Expected goal values decrease as teams fall behind/increase as they take the lead. This provides a second level of evidence and a robustness check on the original findings. Figures are presented below.

- I’m collecting these xG values by hand, coding each shot individually. As of now I have weeks 16-20 of the 2015 NWSL season as well as the first 3 weeks of the 2016 season. ↩

I normally don’t participate in fantasy sports because they involve me rooting for weird things like Liverpool keeping a clean sheet while Daley Blind scores a goal and Yaya Toure gets a couple of assists. I can’t keep it straight, and it takes most of the enjoyment out of the game for me. However, NWSLFL has been fun for me and it’s forced me to immerse myself a little more in the league and learn more about all the players which is a good thing for someone trying to do analytics.^{1}

I wanted to share my thought processes for my third week’s success. Weeks 1 and 2 were pretty disastrous, but Week 3 I scored fairly well. I acknowledge there’s a lot of luck in this, but I do think I’ve improved my process and I figured I’d share it with people and maybe they can use it to do well, or at least join me in failure if this turns out to be a bad strategy long-term.

**Step 1: Find The Most Likely Winners**

I use my prediction model (OHAI) to see which teams are most likely to win, although I’m not 100% confident in the model so I also apply a logic test to it. This week, I’m looking at FC Kansas City to beat the Houston Dash or the Spirit over the Thorns. This is where I build my defense from, and usually where I pick my goalkeeper. I do like Nicole Barnhart, so I’ll probably pick her as my starting goalkeeper instead of Hope Solo (last week’s GK). I’ll pick a couple defenders from Kansas City, and a couple from the Spirit.

**Step 2: Find Teams Who Are Under/Overachieving Expected Goals**

My Expected Goals (xG) model predicts how many goals a team should score given the types of shots they have taken, and then I compare that to how many goals they’ve scored. If they’ve scored far more goals than I anticipated, they’re possibly due to have an off day. If they’ve scored less, they’re possibly due to have a good day. Last week the Houston Dash were *way* above the line, meaning that they’d scored far more goals than you’d expect given their shots. So I might pick against them and avoid their strikers – I could have even picked some midfielders from their opponents. I also might look at a team who has been expected to score a lot of goals but has come short and pick some of their attacking players. The WNY Flash look like they might be due for some goals, Seattle might be in for a little dry spell here.

Then I look at expected goals allowed to see who’s underachieving/overachieving there. The Spirit have been allowing fewer goals than expected, as have Sky Blue and FC Kansas City. That would mean a couple of things: they’re either due to allow some goals or their goalkeepers are extra good and are preventing goals from going in. I’ve watched Nicole Barnhart and she’s been fairly heroic in goal, so she might continue the pattern. Meanwhile, Orlando has let in more goals than expected so they might be due for some opponents to hit the post.
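
Here is a rough sketch of the comparison in this step. The team names and totals are hypothetical placeholders rather than my actual NWSL numbers, and the idea that teams drift back toward their xG is an assumption, not a guarantee.

```python
# Hypothetical totals, invented for illustration:
# team -> (goals scored, xG for, goals allowed, xG against)
totals = {
    "Team A": (7, 3.5, 2, 2.1),
    "Team B": (1, 4.0, 3, 1.5),
    "Team C": (3, 3.1, 1, 3.0),
}

def luck_report(totals):
    """Return goals minus xG on both sides of the ball for each team."""
    report = {}
    for team, (gf, xgf, ga, xga) in totals.items():
        report[team] = {"attack": gf - xgf, "defense": ga - xga}
    return report

report = luck_report(totals)

# Most negative attack gap: scoring fewer than expected, possibly due for goals.
due_to_score = min(report, key=lambda t: report[t]["attack"])
# Most negative defense gap: conceding fewer than expected, possibly due to concede.
due_to_concede = min(report, key=lambda t: report[t]["defense"])

print("Possibly due for goals:", due_to_score)
print("Possibly due to concede:", due_to_concede)
```

The defense gap is where the logic test matters: a big negative number can mean luck, or it can mean a goalkeeper who really is that good.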

**Step 3: Picking the Rest**

I generally pick my USWNT designated players for the midfield – Tobin Heath and Christen Press always seem like safe bets to do good things. I like Kim Little right now because she’ll likely step up given all the injuries in Seattle. I also captain my goalkeeper because the top scorers usually seem to be goalkeepers (saves + clean sheet + winning bonus are a good combination if you can get it right). And I pick Kealia Ohai because she’s the namesake for my model so why not?

I haven’t picked my team this week, but this is the process. I like the new procedure, and I got super lucky last weekend with just about all my players scoring significant points. I missed Diana Matheson’s hat trick, but I think every player on my team had a goal or an assist last week, and all my defenders won or kept a clean sheet. Hope Solo didn’t face a ton of shots which hurt, but she won with a clean sheet so that was as much as I could have hoped for. Hopefully people can build on this, and I’d love to hear your refinements on the strategy!

- Subject matter is dramatically underrated in a lot of analytics exercises, but that’s a story for another day ↩

My newest Game Theory post about the value of rotation was inspired by a Gab Marcotti tweet:

God forbid you're in a CL semifinal + you prioritize winning the CL over a top 4 finish…

— Gabriele Marcotti (@Marcotti) May 1, 2016

He was speaking of Manchester City playing a “B Team” in the weekend’s Premier League fixture, prioritizing their mid-week CL semi-final return fixture against Real Madrid instead. The tweet was fairly controversial, especially among City’s fan base, and gave me a lot to think about. So as I like to do, I think about it from a utilitarian perspective and try to game the expected value for each choice.

The idea behind rotating before a big game is that you can increase your chances of winning the big game while diminishing your likelihood of winning the rotation game and diminishing your chances of obtaining a given league position. For Manchester City, they are currently in a fight for fourth place with Manchester United (and to a lesser extent after this weekend’s fixtures, Arsenal).

The first step is to think about which is more important to Manchester City fans: winning the Champions League semi-final (and possibly the entire tournament), or getting Arsene Wenger’s famous “fourth trophy” and ensuring Champions League football next year. I can see arguments for both, and despite the mocking of Wenger’s qualifying record, as a Milan fan I know the pain of missing out on Champions League football after you’ve become accustomed to it.

However, the expected happiness from advancing to the finals vs. securing 4th place is mitigated by a pretty significant factor: the probability of winning the semi-final match with a full strength squad, which leads us to the following equation.

$$\text{Expected Utility}_{\text{Rotation}} = \Pr(\text{Advance to CL Finals}_{\text{Rotation}}) \times (\text{Value of Advancing to CL Finals}) - \Pr(\text{Miss CL Next Year}_{\text{Rotation}}) \times (\text{Pain of Missing CL Next Year})$$

Manchester City’s expected utility (“good”) from rotating the squad is basically calculated by how much value they get from advancing to the Champions League finals^{1} multiplied by their probability of advancing to the finals. Then you subtract the probability that the rotated squad causes them to miss the CL next year multiplied by the pain of missing out. **In short: the biggest driver of value here is whether Manchester City fans think they can beat Madrid given a 0-0 draw in the home leg. If you don’t think this outcome is pretty likely,** then the first half of the equation approaches zero, meaning that the pain of missing next year hurts more than any potential pleasure gained from rotation. **In this case, it doesn’t make sense to rotate the squad.**

**However, if you assign a high probability to winning the semi-final at the Bernabeu** then the first half of the equation becomes higher, meaning that the potential pain of missing next year is less significant. **In this case, it makes perfect sense to rotate.**

But this isn’t the only factor. There’s a second equation at play here, which I present now:

$$\text{Expected Utility}_{\text{Full Strength}} = \Pr(\text{Advance to CL Finals}_{\text{Full Strength}}) \times (\text{Value of Advancing to CL Finals}) - \Pr(\text{Miss CL Next Year}_{\text{Full Strength}}) \times (\text{Pain of Missing CL Next Year})$$

This represents the expected utility gained from playing a full strength squad. The equation is largely the same, but the values change because Manchester City played a full strength squad on the weekend. Presumably their likelihood of winning mid-week decreases because of fatigue (and potential injuries), while their likelihood of securing Champions League football next year increases because they have a greater likelihood of getting what would have been a crucial three points against Southampton.

**If you’re a Manchester City supporter and believe that the odds of beating Madrid are low**, then your values likely don’t change for the first half of the equation while your values for the second half of the equation increase. **In this case, you want a full strength squad during the weekend**.

**If you’re a Manchester City supporter and you believe that a fresh squad will beat Madrid while a fatigued squad will lose**, then your values for the first half of the equation are lower than they were previously. This lowers your expected value in a significant way, **meaning you want a rotated squad over the weekend.**

The final decision is calculated by which equation gives you a higher expected utility: which version makes you happier? **Ultimately the question depends on two major factors: how likely you think Manchester City is to beat Madrid on the road, and how much pain you’ll feel if they fail to qualify next year.** If you don’t have faith that they can pull off an upset mid-week, then you’ll oppose rotation and prioritize the league. If you believe there’s a chance, then you’ll support rotation and going all-in for this year’s Champions League.
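
The two equations above can be sketched in a few lines of code. Every number here is a hypothetical belief I made up to illustrate a pessimistic fan and an optimistic fan. None of them are estimates, and the 0-10 utility scale is arbitrary.

```python
def expected_utility(p_advance, value_advance, p_miss_cl, pain_miss_cl):
    """Pr(advance) * value of advancing minus Pr(miss CL) * pain of missing."""
    return p_advance * value_advance - p_miss_cl * pain_miss_cl

def preferred_lineup(beliefs):
    """Return the weekend lineup choice with the higher expected utility."""
    rotation = expected_utility(beliefs["p_advance_rotation"], beliefs["value"],
                                beliefs["p_miss_rotation"], beliefs["pain"])
    full = expected_utility(beliefs["p_advance_full"], beliefs["value"],
                            beliefs["p_miss_full"], beliefs["pain"])
    return "rotate" if rotation > full else "full strength"

# Hypothetical fan who thinks rotation only helps a little against Madrid.
pessimist = {"p_advance_rotation": 0.30, "p_advance_full": 0.20,
             "p_miss_rotation": 0.40, "p_miss_full": 0.25,
             "value": 10, "pain": 8}

# Hypothetical fan who thinks a fresh squad makes a big difference.
optimist = {**pessimist, "p_advance_rotation": 0.45, "p_advance_full": 0.25}

print("Pessimist prefers:", preferred_lineup(pessimist))
print("Optimist prefers: ", preferred_lineup(optimist))
```

Both fans attach the same value and the same pain to the two outcomes. The only thing that changes between them is how much they believe fresh legs improve the odds at the Bernabeu, and that alone is enough to flip the decision.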

**Part 2: Pellegrini’s Lame Duck Status**

Normally we can roughly argue a manager’s incentives are aligned with his team’s and the fans’. However, Manchester City have done something strange this year, announcing Pep Guardiola will be the new manager of Manchester City regardless of what Manuel Pellegrini does this year. This introduces a new wrinkle, one that I think fully explains why he did what he did. I want to return to the expected utility equation from earlier, because the logic is the same while the values are different given Pellegrini’s unusual incentives here.

$$\text{Expected Utility}_{\text{Rotation}} = \Pr(\text{Advance to CL Finals}_{\text{Rotation}}) \times (\text{Value of Advancing to CL Finals}) - \Pr(\text{Miss CL Next Year}_{\text{Rotation}}) \times (\text{Pain of Missing CL Next Year})$$

Because Pellegrini is a lame duck manager with zero interest in what happens to Manchester City next year, he experiences literally zero pain from Manchester City missing out on the Champions League next year. Pep Guardiola gets all the benefits if he qualifies, and Pep gets all the pain from missing out if he doesn’t. The second half of this equation is literally zero, so it becomes completely irrelevant to our calculations. So when we combine the two equations from earlier, we get the following:

$$\text{Expected Utility}_{\text{Pellegrini}} = \Pr(\text{Advance to CL Finals}_{\text{Rotation}}) \times (\text{Value of Advancing to CL Finals}) - \Pr(\text{Advance to CL Finals}_{\text{Full Strength}}) \times (\text{Value of Advancing to CL Finals})$$

Because the value of advancing to the Champions League Finals is the same for Pellegrini in both cases, we can cancel that term out and we’re left with the following:

$$\text{Expected Utility}_{\text{Pellegrini}} = \Pr(\text{Advance to CL Finals}_{\text{Rotation}}) - \Pr(\text{Advance to CL Finals}_{\text{Full Strength}})$$

Even if the probability is virtually zero in both circumstances, and even if the value of rotation is virtually zero, Pellegrini strongly prefers^{2} rotating the squad over the weekend to maximize his probability of winning the Champions League, something that could presumably bolster his CV and improve the contract at his next job. **Manuel Pellegrini has literally no reason to not rotate the squad, even if he sees virtually no value in it.**

The Manchester City case described here is a relatively unusual one, which is why it’s interesting to me. The conflict between a manager’s incentives and the fans’ incentives, along with reasonable differences among the fans themselves, makes this a difficult case to think about, one worth exploring further, and one that should make for a lively discussion.

Given the paucity of data and analysis for women’s soccer, I thought it would be a worthwhile summer project to build an Expected Goals (xG) model for the NWSL. If you’re unfamiliar with Expected Goals, I’ve written a few posts about the math behind the model that are probably worth reading: A Very Preliminary NWSL Expected Goals Model: xG 101 and Expected Goals 201: xG For Soccer Analytics Majors. The basic idea is taking characteristics of shots like distance from goal, angle to the center of goal, whether it was kicked or headed, whether it came from a counter attack, etc., and calculating the probability that a given shot turns into a goal. Shots are rated on a scale from 0-1, with the number being the probability of a shot scoring.

I’ve been tweeting some of the things I’ve found, but they’ve been scattered across a number of tweets and days, so I wanted to combine them all into one post and talk about some of the plots in a little more detail than is allowed in 140 characters (116 after the image).

This plot shows the relative xG scores for each game over the weekend. Most of the games were fairly close in terms of shot quality, except for Houston v. Orlando which was fairly one-sided (both in actual score and shot quality). I don’t have xG maps, but I think this is a clean, clear presentation of what happened in each of the games from the weekend.

The next plot shows cumulative player xG/Shot Quality scores over the first two weeks. Two things stand out here to me. The first is Jessica McDonald’s massive lead on everyone on her team (and the league, which we’ll see in a minute). The second is the balance among the Portland Thorns: few players taking seemingly high quality shots. Compare this to FC Kansas City, with a larger number of players taking relatively low quality shots.

The third graph shows the top 20 players in the cumulative shot quality rankings. I tried color-coding by team, but with so many teams relying on red or blue it didn’t come out as well as I’d like. The good news is that Orlando (purple) and Houston (orange) stand out. USWNT players are doing well here: Alex Morgan is in second place (a distant second), with Jessica McDonald, Lindsey Horan, Carli Lloyd, and Christen Press all in the top few spots. Jessica McDonald is leading the pack by a long way though, with zero goals so far, unfortunately for the WNY Flash.

This graph is similar to the previous one but instead of the top 20 it includes everyone who has taken a shot this season so you can see where your favorite player ranks.

The last two plots serve as diagnostics for my measure: how well does my xG score predict actual goals? In this first one, the dotted line represents a 1:1 relationship between expected goals and actual (non-penalty, non-own) goals scored, which is a “perfect” correspondence between my measure and the “real world.” I’ve got five teams (Chicago, Orlando, Washington, and Kansas City) basically on the line, which I’m really happy with, two other teams (Portland and Seattle) close, and three outliers (WNY, Boston, and Houston). Despite the small sample size this season so far, I think the model is performing well.

Beyond the diagnostics, if we assume that teams will eventually converge toward the 1:1 ratio, we’d expect WNY (mostly Jessica McDonald) and Boston to start scoring more goals soon, while the overachieving Houston Dash might be in for a dry spell soon.

The final plot is the other side of the last one – expected goals allowed for each team vs. actual goals allowed. There are fewer teams that fit perfectly (Sky Blue and Portland), but there aren’t any extreme outliers this time (with Kansas City being the furthest off the line).

Kansas City is probably due to start conceding more goals soon. Nicole Barnhart has been strong in goal for them so far, so maybe she’s the reason for Kansas City over-achieving on this measure. Similarly, despite a strong start by the expansion Orlando Pride, they’ve actually conceded a goal more than my model expects they should have so they may be due for some luck going forward.

This is all I could think of as far as presenting xG/Shot quality data for the NWSL. There’s a lot of data here, and I tried cutting it as many ways as I could to present as much info as I could from a single dataset.

I may be a bit premature here, but it seems to me the major parts of the Premier League season are pretty much decided. Leicester City seems uncatchable at the top. Arsenal and Spurs will finish 2nd and 3rd (or 3rd and 2nd) while Man City looks pretty solid for 4th. Two of the three relegation spots are basically sealed, but there’s still the matter of whether Norwich City or Sunderland stay up. Because my interest in the season has waned significantly, I thought I’d do an early “year-in-review” where I assess MOTSON’s biggest hits and biggest misses. I’ll start with the obvious.

**Chelsea**

Yeah, I don’t know what to say about this. I had Chelsea in second place at the beginning of the season and they look to be stuck in literally the middle of the table. Everyone else was roughly in the same boat MOTSON was, and I honestly don’t know if this could have been foreseen. *Maybe* if you added a “Mourinho third year implosion” variable to the model, but even then would you have guessed 10th place? Nevertheless, it’s a pretty big miss and was the source of the majority of the error in my model.

**Leicester City**

I’m going to call this a hit and a miss, but more of a miss than a hit to be honest. I’ve been particularly proud of MOTSON predicting Leicester City higher than anyone else – 8th place on 60 points. Not bad, and if they finished 3rd or lower I was willing to call this a huge success for the model. On the other hand, if they win the title this year then it’s hard to say “I had the Champs in 8th place – I win!” I’m proud that my model recognized them as good long before anyone else did, and if you look at a lot of analytics prognostications for next year they’re saying “Leicester’s probably 7th or 8th place” so MOTSON is 9 months ahead of the curve there. But it’s a small victory assuming they win the league with 15-20 points more than I predicted.

That being said, MOTSON was ahead of the curve predicting them as Champions League qualifiers, picking them to qualify as of December 5. I know this because on December 4th I wrote that they should obviously sell Jamie Vardy because they had no expectation of the Champions League and December 6th changed my mind.

Game Theory: Top 4 Contenders Leicester City Should Absolutely Keep Jamie Vardy

**Nicolas Otamendi**

MOTSON *hated* this signing by Man City back in August, and it turned out to be right. He was a disaster in City’s backline, and is one of the reasons City’s fighting for 4th instead of comfortably coasting into the Champions League.

**West Ham**

So MOTSON didn’t get West Ham’s success right pre-season, but it did pick up on their top 6 challenge *very* early in the season (October 24). Mike Goodman and I had a conversation about this, and I argued that West Ham banking those 8 points over expectations would be enough to get them a top 6 spot. As of today they’re 10 points over expectation, so they’ve basically broken even since then and look to be in the top 6 at the end of the season.

In all the talk of Chelsea underachieving, we haven't talked about West Ham overachieving. They look to be for real pic.twitter.com/Uv81BWuSFy

— Chad Murphy (@Soccermetric) October 24, 2015

**Leicester City Redux**

MOTSON really liked Jamie Vardy to have a big year this year, something I didn’t notice until it had already happened because he wasn’t on my radar.

We Should Have Seen It Coming: Evaluating Jamie Vardy Against the EPL’s Elite Strikers

It also really liked Riyad Mahrez, pegging his replacement as something like a 10 point downgrade.

On the other hand, I never posted anything along these lines, but it didn’t really like the N’golo Kante signing which I’d classify as a pretty big miss. He’s been phenomenal for them and MOTSON would have told them to pass.

**Barcelona Will Be Fine Without Messi**

Lionel Messi got injured in the early part of the season, out for a month, and “real football men” wrote all sorts of thinkpieces about how Barcelona would be in trouble losing the world’s best player despite having two other world class strikers on the pitch even in Messi’s absence (and decent young backups filling in). MOTSON got it right: Barcelona would be fine without him, and they were. This is one of my favorite analytics pieces I’ve written, so I wanted to bump it.

Those are the big ones I can remember – plenty of successes for its first year with out of sample data but plenty of room for improvement as well. I may revisit this at some point, but for now I think this is a good recap of the model. Thanks for reading, and this summer I’ll be focusing on bringing statistics to the NWSL so keep an eye out for that.

If you follow me on Twitter, you’ve seen that I’ve been posting some preliminary Expected Goal (xG) data from the 2015 NWSL season. These data aren’t publicly available, so I’ve been collecting them by hand. To do this, I’ve been going on YouTube, watching every shot from every game, and coding a number of variables for each shot. As of today I’m up to about 500 shots, and have built an xG model based on these shots which I will detail in another post when it’s done.

One thing I’ve added to typical xG models is a variable “whether a shooter was under pressure” – right now this is defined as whether the nearest defender was within a half yard of the shooter at the time of the shot. I was surprised to find that in my model, it didn’t reach statistical significance, both because it’s been considered an issue with typical xG models for a while now and because theoretically it makes sense that defensive pressure would lower the probability of scoring. So I did what any responsible analyst would do and plotted my data.

I plotted xG values (derived from my model) as a function of distance from the goal. The red line is when the shooter isn’t pressured (the nearest relevant defender isn’t within 0.5 yards), and the red shaded area is the 95% confidence interval. The blue line is when the shooter is under pressure (the nearest relevant defender is within 0.5 yards), and the blue shaded area is the 95% confidence interval for that estimate. As you can see, there’s a significant overlap between the two lines – from about 15 yards out not only do the 95% confidence intervals overlap, but the lines are almost identical. However, for closer shots there is a difference, highlighted in the graph below.

I’ve added a shaded area where the two lines are significantly different from each other – basically you’re looking at the area where each line falls outside the other line’s shaded area to see where they are distinguishable from each other, which goes from about 3 yards to 13 yards away from goal. That is the zone where defensive pressure matters – basically any shot between 3 yards from goal and roughly the penalty spot is less likely to score if a defender is close, while for anything further out than that defensive pressure is irrelevant. If I had to come up with a post hoc explanation, presumably shots from further out are difficult regardless of whether you have a defender in the way, so distance is the limiting factor.
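For anyone who wants to try the same check on their own model output, here is a minimal Python sketch of the overlap rule described above: two lines count as distinguishable at a given distance when each line sits outside the other line's 95% band. The `Band` type, the function name, and every number in it are made up for illustration; none of them come from the NWSL data.

```python
from typing import NamedTuple


class Band(NamedTuple):
    """Predicted xG at one distance, with its 95% confidence interval."""
    estimate: float
    lower: float
    upper: float


def distinguishable(a: Band, b: Band) -> bool:
    """True when each line falls outside the other line's shaded area."""
    a_outside_b = not (b.lower <= a.estimate <= b.upper)
    b_outside_a = not (a.lower <= b.estimate <= a.upper)
    return a_outside_b and b_outside_a


# Hypothetical predictions keyed by distance from goal (yards).
# Placeholder values only, chosen to mimic the shape of the graph.
unpressured = {
    5: Band(0.45, 0.38, 0.52),
    10: Band(0.25, 0.20, 0.30),
    20: Band(0.06, 0.03, 0.09),
}
pressured = {
    5: Band(0.30, 0.24, 0.36),
    10: Band(0.15, 0.11, 0.19),
    20: Band(0.05, 0.02, 0.08),
}

# Distances at which pressure makes a distinguishable difference.
zone = [d for d in sorted(unpressured)
        if distinguishable(unpressured[d], pressured[d])]
print(zone)
```

With these placeholder values the close-range distances end up in the zone and the long-range one does not, which is the same pattern the shaded area shows.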

This is obviously limited to NWSL, and we may see differences for men vs. women here, but it’s a potentially interesting development given the paucity of defensive position data, and it’s an important methodological lesson to go beyond “star-gazing” and to look at the relationship between your variables. Defensive positioning matters most when the shooter is close to goal, even when controlling for a number of other factors (head v. foot, angle, etc.).

*This is the second post in a series of methodsy blogs explaining my Expected Goals (xG) model. The first goes through some introductory concepts for xG and I highly recommend you read it if you’re unsure of any of the content here.*

*Learning Outcomes for Today’s Post:*

- *Linear Regression improves upon correlation because it shows the size of the effect in addition to the correlation between two variables*
- *Regression output shows us both the size of the effect and the uncertainty of our statistics*
- *A relationship between two*

**Expected Goals 201: xG For Soccer Analytics Majors**

Yep, this raw regression output is still really ugly and you should judge anyone who posts things like this. But with the last post hopefully you can read what it says and understand fundamentally what it means while you (rightfully) judge me for posting it with no explanation or formatting.

Last time we went over the first steps to understanding regression output, but in this post I wanted to go over some slightly more advanced techniques. I’m going to explain how I got these numbers (with little or no math), and more detail on what some of the numbers mean.

*Where did all these numbers come from?*

They came from a technique called regression analysis. In my Intro to Analytics YouTube videos I explain the idea of correlation, and regression is the next step beyond correlation. As you may remember, and as I wrote about in the previous post, correlation tells us how strong the relationship between two variables is and in which direction that relationship goes. So a high correlation means a strong relationship, and a positive correlation means that the two variables vary in the same direction (while a negative correlation means that they go in opposite directions). Correlation tells us much of what we need to know, and for a lot of things you don’t need to do anything fancier than that.

*You didn’t answer my question, you went on a side-drain about correlation!*

I’m getting there, be patient. So while correlation tells us direction and strength, it doesn’t tell us the size of the effect. As one variable goes up the other goes up, but by how much? That’s what regression adds to our lives – it tells us how much individual variables matter. If we’re trying to predict things, that becomes important.

*I’m not sure I understand: can you give me an example?*

Sure! The math behind this is incredibly confusing and complicated, but the concept isn’t. The easiest way to think about this is to look at it graphically, so I’m going to take a couple of graphs from my Intro to Analytics videos and walk through them. The first is the correlation between shots outside the area and points.

The above graph shows the scatterplot of shots outside the area (average # per 90 minutes per team) and the number of points a team earned in the 2014-2015 season. Correlation tells us that there is a correlation between the two variables, and that it’s a positive correlation. Life is good – teams should shoot more outside the box, right?

Well yes and no. Now that we’ve learned whether there is a correlation and that the correlation is positive^{1}, we need to look at the size of the effect. If someone is kind enough to graph their results (and all good analysis comes with something like this graph), then you can see exactly how much the variables affect each other.

In this graph, it’s as simple as lining up 6 shots outside the area per game on the horizontal (“X”) axis with the dotted line and seeing its value on the vertical (“Y”) axis. Let’s say it’s about 55 points. Not bad. Now to see how much an additional shot is worth, we look at the dotted line’s position at 7 shots outside the area. This is about 61 points or so, meaning adding one more shot outside the area per game gets you about 6 points per season, or 2 extra wins. Not bad, but let’s take a look at another metric.

The graph above shows the effect of shots inside the area (per team per 90 minutes) on points. Let’s do the same thing for shots inside the area that we did for shots outside the area: if we find where the dotted line falls for 6 shots per game, we see that this earns you about 40 points. If we do the same thing for 7 shots per game, we find that it earns you about 50 points. This is a 10 point difference, or 3 wins and a draw.

Adding an extra shot inside the area per game gives you 10 extra points, while adding an extra shot outside the area only gives you 6 points. I’m resisting making a “size matters” joke, but when you’re trying to measure the importance of different variables it matters in a very big way.
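The "size of the effect" here is just the slope of the dotted line. As a small sketch in Python, using the approximate readings quoted above (the function name is my own invention for this example):

```python
def points_per_extra_shot(shots_a: float, points_a: float,
                          shots_b: float, points_b: float) -> float:
    """Slope of the fitted line between two readings taken off the graph."""
    return (points_b - points_a) / (shots_b - shots_a)


# Approximate readings off the dotted lines, as quoted in the text.
outside_slope = points_per_extra_shot(6, 55, 7, 61)
inside_slope = points_per_extra_shot(6, 40, 7, 50)

print(outside_slope)  # 6.0 points per extra shot outside the area
print(inside_slope)   # 10.0 points per extra shot inside the area
```

Both correlations are positive, but the slopes are what tell you an extra shot inside the area is worth more.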

This is what regression tells us – the “estimate” column shows us the size of the effect (how much one variable increases while the other increases), while the stars show us if that difference is statistically significant.

*Wait what? Statistically significant? You can’t just drop that term on me without explaining it.*

I’m going to – don’t worry!

*So what is it????*

Statistical significance is a fancy way of saying “are we sure that there’s an effect of one variable on another?” We think that these two variables are related, but how sure are we really?

Everything in life is subject to some sort of uncertainty: whether a coin comes up heads or tails, whether a team wins a game, whether a shot goes into the back of the net or flies over the crossbar, or some non-soccer things that I hear people experience like jobs, friends, and social lives. Statistics are the same way: you can predict something will happen, but you can only be so certain that it will occur. There’s some math behind this, but the easiest way to think of it is like the margin of error in polls: we run a poll and find that President Obama’s approval is (let’s say) 52%, plus or minus 3 points. That means his approval could reasonably be anywhere between 49 and 55, and is far less likely to be outside that range.

How does this look in our stats table? Good question!

*I didn’t ask.*

Too bad, I’m going to tell you anyway. Back to my ugly regression output:

I’ve highlighted the “Estimate” column in red. This number is the size of the effect each variable has on the likelihood of scoring a goal (xG).

**Do not interpret this number in any way other than “positive or negative” – it’s well beyond what I can teach you here (but maybe in xG 491), and the different “Estimates” cannot be compared to each other.**

*But “y” is way smaller than “angle” – that means that…*

**No, it doesn’t. It doesn’t mean anything.**

*What abou…*

**No.**

*OK fine… (mumbles under breath)*

Back to our table – the “estimate” is the size of the effect. This can also be called the “coefficient” or the “slope” (think back to the graphs I showed earlier – we looked at the slope of the line, or the rate of change, to see how big the effect was). Now let’s look at the uncertainty we talked about earlier.

I’ve highlighted the uncertainty measure in blue, known as the “Standard Error” (The “z” in the highlight belongs to the next column – “z value”). This is a measure of how certain we are that the estimate is what we say it is. To say something is statistically significant, we are looking for an estimate that is much bigger than the standard error. That means we have an effect that is much larger than our uncertainty. Low uncertainty means we can be certain that the variable has an effect on the other variable, and we label this “statistically significant” and give it some stars. I’ll get into this more in a later post, but generally you’re looking for an estimate size (absolute value, or ignoring the negative sign if there is one) that is roughly twice as big as the standard error size. Let’s look at an example or two.

Returning to freekick, which is a simple variable looking at whether the shot came from a direct free kick or not. The estimate of the size of this effect is 0.019 (not very big), and the standard error is 0.610 (very big). Because we have a small estimated effect and a large standard error, this is not statistically significant. This is confirmed by a lack of stars in our regression table (stars mean statistically significant). Because of this we say that free kicks are no more or less likely to turn into goals than regular shots (no change in xG).

(If you remember a few lines ago I said the estimate needed to be twice the size of the standard error, and in this case 0.019/0.610 = 0.031, which isn’t even close to 2. Small effect + high uncertainty = no statistical significance)

Let’s look at another example to hammer this concept home. This time we’ll look at “y”, which represents distance from the goal line, or how far out a shot was taken.

This time the estimate is -0.069, which is still pretty small, but the standard error is even smaller at 0.019. A small estimate with a really small amount of uncertainty means we probably have something statistically significant, and can confidently say that distance from goal affects the likelihood of scoring a goal (xG).

(Simple math confirms this: 0.069/0.019 ≈ 3.63, which is greater than 2, which means we have a statistically significant relationship. This is why there are stars next to that row.)
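The "roughly twice the standard error" rule of thumb is easy to check yourself. Here is a minimal Python sketch using the rounded estimates and standard errors quoted above; the function names and the cutoff of 2.0 are my own shorthand for the rule of thumb, not something R reports. Ratios computed from rounded figures can differ a little from the "z value" column, since R works from the unrounded numbers.

```python
def z_ratio(estimate: float, std_error: float) -> float:
    """Estimate divided by its standard error."""
    return estimate / std_error


def roughly_significant(estimate: float, std_error: float,
                        threshold: float = 2.0) -> bool:
    """Rule of thumb: the estimate's absolute size is about twice the standard error."""
    return abs(z_ratio(estimate, std_error)) >= threshold


# (estimate, standard error) as quoted in the text, rounded to 3 decimals.
rows = {
    "freekick": (0.019, 0.610),
    "y": (-0.069, 0.019),
}

for name, (est, se) in rows.items():
    print(name, round(z_ratio(est, se), 2), roughly_significant(est, se))
```

The free kick row comes out nowhere near 2, while "y" clears it comfortably once you ignore the negative sign.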

That’s it: that’s how to read a linear regression table, the main focus of “Expected Goals 201.” The 491 senior seminar will get into the techniques more and how to do a regression of your own, and I’ll definitely be recording a video sometime in the relatively near future about it as well, showing you all how to create your own xG model so be sure to look for that!

- A reminder from last time: that’s done by looking at the stars, and then at whether the “estimate” column is positive or negative.

*Quick note before starting: if you’re interested in this type of explanation you should watch my “Intro to Analytics” playlist on YouTube and subscribe to my channel to see future updates. Also, follow me on Twitter @Soccermetric.*

I did one of my pet peeves yesterday: I posted raw R output^{1} of a preliminary cut at some xG data for the NWSL. I’ve spent a bunch of time collecting the data, and was curious whether I had anything interesting yet so I ran the model on limited data (~300 shots). Here’s the raw output:

Ugly, right? Yeah…I should have at least formatted it nicely or named variables in meaningful ways. But there’s some interesting stuff here that I wanted to share with everyone, and since some people showed interest in understanding the model I wanted to write a blog post.

My ultimate goal is to provide three levels of explanation: xG 101 (Intro to Expected Goals), xG 201 (xG for Soccer Analytics majors), and xG 491 (Senior Seminar in xG). The first, which I’m including in this blog, should give you an adequate understanding of what I’m working on, the second will go a little further into the methods, and the third will get into more statistical detail and talk about some of the strengths and weaknesses of what I’ve done so far and where I need to go as this project progresses.

**xG 101: Intro to Expected Goals**

*What is xG?*

xG is short for Expected Goals. Basically what it measures is the probability that a shot turns into a goal. When a player takes a shot, how often will she score?

*Why do we care?*

xG has become one of the most popular statistics in the soccer analytics community, so it’s worth understanding. It’s important because it’s used to answer some questions:

- Which team had the better quality shots during a game?
- Which players should score the most goals?
- Which teams should score the most goals during a season?

The other side of the coin is Expected Goals Allowed (xGA), which answers how many goals a team would be expected to allow. Over the course of a season, xG correlates with how well a team does and how many games a team wins/loses/draws. The idea is that in a single game, teams can get lucky and defy probability but over a season these sorts of things even out.

*Do you have an analogy for how this works?*

Yes, yes I do. Think about it this way: if you flip a coin it has a 50% chance of coming up heads. If you flip this coin ten times you’d expect it to come up heads 5/10 times (50% of the time). But you wouldn’t be surprised if it came up heads 6 times or 4 times, and a little more surprised but not shocked if it came up 7 times or 3 times. Certainly you wouldn’t be surprised if any single flip came up heads or tails, but over the long run (hundreds or thousands of flips) you’d expect the share of heads to be close to 50%.

The same goes for shots: if your star forward takes a shot inside the penalty area, you’d expect her to score (hypothetically) 40% of the time. If she scored on a single shot, you wouldn’t be surprised, but if the goalkeeper saved it you wouldn’t be too surprised either. In any single game, you might see a lot of goals scored (the coin comes up heads several times in a row) or not a lot of goals scored (tails several times in a row), but over the course of a season this should all balance out.
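If you want to put numbers on "not surprised" versus "a little more surprised," the coin example can be worked out exactly. This is a short Python sketch using the standard binomial formula; the function names are my own, and the 40% shooter is the same hypothetical as in the paragraph above.

```python
from math import comb


def prob_exactly(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent tries."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_between(low: int, high: int, n: int, p: float) -> float:
    """Probability of between low and high successes, inclusive."""
    return sum(prob_exactly(k, n, p) for k in range(low, high + 1))


# Ten flips of a fair coin.
print(round(prob_exactly(5, 10, 0.5), 3))     # exactly 5 heads
print(round(prob_between(4, 6, 10, 0.5), 3))  # 4, 5, or 6 heads
print(round(prob_between(3, 7, 10, 0.5), 3))  # anywhere from 3 to 7 heads

# The hypothetical 40% shooter: chance she scores on none of 5 such shots.
print(round(prob_exactly(0, 5, 0.4), 3))
```

Exactly 5 heads happens only about a quarter of the time, while 3 to 7 heads covers roughly nine outcomes in ten, which is why a single game (or a single run of shots) can look so different from the long-run average.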

*How is xG measured?*

It’s fairly simple: it’s a number between 0 and 1 where higher numbers mean a greater likelihood of scoring (a higher quality shot) and lower numbers mean a lower likelihood of scoring (a lower quality shot). A header from 40 yards out would be really unlikely to score, and therefore would have a very low xG score. A shot kicked from 3 feet away on an open net would be really likely to score and would have a very high xG score.

The actual number itself is the probability of scoring: a shot with a 0.4 xG value has a 40% chance of being a goal, while a shot with a 0.15 xG value has a 15% chance of being a goal.
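Because each xG value is a probability, the usual way to get from individual shots to a team's expected goals for a game is to add the values up. A minimal Python sketch, with two made-up shot lists (hypothetical numbers, not real match data):

```python
def expected_goals(shot_xg: list[float]) -> float:
    """Total expected goals: the sum of each shot's scoring probability."""
    return sum(shot_xg)


# Hypothetical xG values for each shot a team took in one game.
home_shots = [0.40, 0.15, 0.15, 0.05]
away_shots = [0.10, 0.10, 0.05]

print(round(expected_goals(home_shots), 2))  # 0.75
print(round(expected_goals(away_shots), 2))  # 0.25
```

In this made-up game the home team created the better chances, whatever the final score happened to be.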

*OK, I get what xG means, but what does the ugly regression output you posted mean?*

Here’s the explanation I give all of my undergraduates in their Intro to American Government class. Each row is an individual factor (variable) that predicts whether a goal is scored. So you have things like the score at the time of the shot (“diff”), distance from the goal line the shot is taken (“y”), the angle between the shooter and the center of the goal (“angle”), etc.^{2}

In the table you’re looking at two things:

1. Are there stars on the same line as the variable?
   - If yes, proceed to the next step
2. Is the number under the “estimate” column positive or negative?

If there are stars, then the variable has what we call a “statistically significant effect” on the likelihood of a shot scoring (more on this in 201). This basically means that it matters – it correlates with a change in the likelihood of scoring (the xG value). If it doesn’t have stars, the two variables are unrelated and it effectively doesn’t matter. So let’s return to the table for a minute, and look at the “freekick” row.

This variable says “Is the shot from a direct free kick?” So we look to see if there are stars next to it, and there aren’t. So it turns out a shot made from a free kick is no more or less likely to score compared to a regular shot. Good times.

Now let’s look at “counter.” This variable represents whether the shot came as the result of a counter attack. There are a lot of these in the NWSL – it’s a fast-paced, athletic league, so we see a lot of fast-moving counterattacks. But are shots from a counterattack more likely to score?

There is a star next to “counter”, which means that whether the shot came from a counter attack is related to the probability of scoring. Life is good – we found something interesting here. Let’s move on to step #2!

*“Is the number in the estimate column positive or negative?”*

The estimate number for “counter” is positive, which means counter attacks have a positive relationship with the probability of scoring (xG value). If you’ve watched my Intro To Analytics YouTube videos, you know what this means (and you should watch them, they’re really good!). But if you haven’t, a positive correlation means that when one variable increases the other increases. In this case, what it means is that shots after a counter attack are more likely to score/have a higher xG value. So teams that counter attack more frequently should score more goals.

Let’s look at one more example: distance from goal (labeled “y” in my picture).

So step 1: are there stars? Yes there are, so that means distance from goal is related to the probability of scoring (xG).

Step 2: is the “estimate” number positive or negative? It’s negative, so what does that mean? A negative number means a negative correlation, or a negative relationship between two variables. This means that as one variable goes up, the other goes down.

Specifically here, as the distance from goal goes up, the likelihood of a goal being scored goes down (xG value goes down). Shots taken from distance are less likely to score, which makes sense from a common sense perspective. The other side is that shots taken close in are more likely to score, which again makes sense.
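The two-step reading of a row can be written down as a tiny Python function. This is only a sketch of the 101-level rule described above: the function name is invented, and the rows record just whether there are stars and the sign of the estimate, since that is all the two steps use.

```python
def read_row(name: str, has_stars: bool, sign: int) -> str:
    """Apply the two steps: check for stars, then check the estimate's sign."""
    # Step 1: no stars means we stop here.
    if not has_stars:
        return f"{name}: no stars, so it doesn't change xG"
    # Step 2: positive or negative estimate?
    direction = "raises" if sign > 0 else "lowers"
    return f"{name}: stars, and a higher value {direction} xG"


# (has_stars, sign of estimate) for the three rows discussed in the post.
rows = {
    "freekick": (False, +1),  # no stars, so the sign is never looked at
    "counter": (True, +1),
    "y": (True, -1),
}

for name, (has_stars, sign) in rows.items():
    print(read_row(name, has_stars, sign))
```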

That’s xG/regression analysis 101. I’ll probably turn this into a video and write up xG 201 when I get bored during the EPL games tomorrow, but hopefully this helped people understand what’s going on. 201 will go into a little more detail of how this worked, and then 491 will be a sophisticated treatment of regression analysis and how things work.

####

Variable codes:

- diff – the “game state” or difference in score between the two teams
- y – the distance from the goal line
- angle – the angle (in radians) between the shooter and the center of the goal
- time – the time the shot was taken
- def.distance – the distance between the shooter and the nearest defender
- head – was the shot a header?
- foot – was the shot kicked?
- counter – was the shot the result of a counter attack?
- home.team – was the shot taken by someone on the home team?
- gk.error – was the shot after a goalkeeper’s error?
- freekick – was the shot a direct free kick?
- corner – was the shot assisted off a corner kick?