Preliminary Evidence on How Defensive Pressure Affects xG: Data from NWSL

If you follow me on Twitter, you’ve seen that I’ve been posting some preliminary Expected Goals (xG) data from the 2015 NWSL season. These data aren’t publicly available, so I’ve been collecting them by hand. To do this, I’ve been going on YouTube, watching every shot from every game, and coding a number of variables for each shot. As of today I’m up to around 500 shots, and have built an xG model based on these shots which I will detail in another post when it’s done.

One thing I’ve added to typical xG models is a variable for “whether a shooter was under pressure” – right now this is defined as whether the nearest defender was within a half yard of the shooter at the time of the shot. I was surprised to find that in my model, it didn’t reach statistical significance, both because the lack of a pressure measure has been considered an issue with typical xG models for a while now and because theoretically it makes sense that defensive pressure would lower the probability of scoring. So I did what any responsible analyst would do and plotted my data.

NWSL xG Distance v Pressure no rectangle

I plotted xG values (derived from my model) as a function of distance from the goal. The red line is when the shooter isn’t pressured (the nearest relevant defender isn’t within 0.5 yards), and the red shaded area is the 95% confidence interval. The blue line is when the shooter is under pressure (the nearest relevant defender is within 0.5 yards), and the blue shaded area is the 95% confidence interval for that estimate. As you can see, there’s a significant overlap between the two lines – from about 15 yards out not only do the 95% confidence intervals overlap, but the lines are almost identical. However, for closer shots there is a difference, highlighted in the graph below.

NWSL xG Distance vs Pressure no rectangle

I’ve added a shaded area where the two lines are significantly different from each other – basically you’re looking for the area where each line falls outside the other line’s shaded area, which runs from about 3 yards to 13 yards away from goal. That is the zone where defensive pressure matters – basically any shot between 3 yards from goal and roughly the penalty spot is less likely to score if a defender is close, while for anything further out than that defensive pressure is irrelevant. If I had to come up with a post hoc explanation, presumably shots from further out are difficult regardless of whether you have a defender in the way, so distance is the limiting factor.
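The overlap check described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the graph: the `Band` class, the function name, and every number below are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Band:
    """A fitted xG value with its 95% confidence interval."""
    estimate: float
    lower: float
    upper: float


def distinguishable(a: Band, b: Band) -> bool:
    """True when each line sits outside the other line's shaded area."""
    a_outside_b = a.estimate < b.lower or a.estimate > b.upper
    b_outside_a = b.estimate < a.lower or b.estimate > a.upper
    return a_outside_b and b_outside_a


# Hypothetical values, NOT taken from the NWSL model.
close_unpressured = Band(estimate=0.45, lower=0.35, upper=0.55)
close_pressured = Band(estimate=0.25, lower=0.18, upper=0.33)
far_unpressured = Band(estimate=0.08, lower=0.04, upper=0.12)
far_pressured = Band(estimate=0.07, lower=0.03, upper=0.11)

print(distinguishable(close_unpressured, close_pressured))  # True
print(distinguishable(far_unpressured, far_pressured))      # False
```

Like the shaded rectangle, this is an eyeball-style rule. A formal test of the difference between the two lines would be a separate step.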

This is obviously limited to NWSL, and we may see differences for men vs. women here, but it’s a potentially interesting development given the paucity of defensive position data, and it’s an important methodological lesson to go beyond “star-gazing” and to look at the relationship between your variables. Defensive positioning matters most when the shooter is close to goal, even when controlling for a number of other factors (head v. foot, angle, etc.).

 





Expected Goals 201: xG For Soccer Analytics Majors

This is the second post in a series of methodsy blogs explaining my Expected Goals (xG) model. The first goes through some introductory concepts for xG and I highly recommend you read it if you’re unsure of any of the content here. 

Learning Outcomes for Today’s Post:

  1. Linear Regression improves upon correlation because it shows the size of the effect in addition to the correlation between two variables
  2. Regression output shows us both the size of the effect and the uncertainty of our statistics
  3. A relationship between two 

 


NWSL xg

Yep, this raw regression output is still really ugly and you should judge anyone who posts things like this. But with the last post hopefully you can read what it says and understand fundamentally what it means while you (rightfully) judge me for posting it with no explanation or formatting.

Last time we went over the first steps to understanding regression output, but in this post I wanted to go over some slightly more advanced techniques. I’m going to explain how I got these numbers (with little or no math), and more detail on what some of the numbers mean.

Where did all these numbers come from?

They came from a technique called regression analysis. In my Intro to Analytics YouTube videos I explain the idea of correlation, and regression is the next step beyond correlation. As you may remember, and as I wrote about in the previous post, correlation tells us how strong the relationship between two variables is and in which direction that relationship goes. So a high correlation means a strong relationship, and a positive correlation means that the two variables vary in the same direction (while a negative correlation means that they go in opposite directions). Correlation tells us much of what we need to know, and for a lot of things you don’t need to do anything fancier than that.

You didn’t answer my question, you went off on a tangent about correlation!

I’m getting there, be patient. So while correlation tells us direction and strength, it doesn’t tell us the size of the effect. As one variable goes up the other goes up, but by how much? That’s what regression adds to our lives – it tells us how much individual variables matter. If we’re trying to predict things, that becomes important.
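To make that concrete, here is a small Python sketch with made-up numbers (not real team data). Both fake datasets have a perfect positive correlation, but the slope of one is five times the slope of the other. The function names exist only for this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: direction and strength of a linear relationship."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ols_slope(xs, ys):
    """Least-squares slope: how much y changes per one-unit change in x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / sum((x - mean_x) ** 2 for x in xs)


# Invented data: same correlation, very different effect sizes.
xs = [1, 2, 3, 4, 5]
small_effect = [2 * x + 1 for x in xs]
big_effect = [10 * x + 1 for x in xs]

for label, ys in (("small", small_effect), ("big", big_effect)):
    print(label, round(pearson_r(xs, ys), 3), ols_slope(xs, ys))
```

Correlation alone cannot tell these two cases apart. The slope can.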

I’m not sure I understand: can you give me an example?

Sure! The math behind this is incredibly confusing and complicated, but the concept isn’t. The easiest way to think about this is to look at it graphically, so I’m going to take a couple of graphs from my Intro to Analytics videos and walk through them. The first is the correlation between shots outside the area and points.

Lesson 3 - Shots Outside v Points

The above graph shows the scatterplot of shots outside the area (average # per 90 minutes per team) and the number of points a team earned in the 2014-2015 season. The graph tells us that there is a correlation between the two variables, and that it’s a positive correlation. Life is good – teams should shoot more from outside the box, right?

Well, yes and no. Now that we’ve learned whether there is a correlation and that the correlation is positive[1], we need to look at the size of the effect. If someone is kind enough to graph their results (and all good analysis comes with something like this graph), then you can see exactly how much the variables affect each other.

In this graph, it’s as simple as lining up 6 shots outside the area per game on the horizontal (“X”) axis with the dotted line and seeing its value on the vertical (“Y”) axis. Let’s say it’s about 55 points. Not bad. Now to see how much an additional shot is worth, we look at the dotted line’s position at 7 shots outside the area. This is about 61 points or so, meaning adding one more shot outside the area per game gets you about 6 points per season, or 2 extra wins. Not bad, but let’s take a look at another metric.

Lesson 3 - Shots Inside v Points

The graph above here shows the effect of shots inside the area (per team per 90 minutes) on points. Let’s do the same thing for shots inside the area that we did for shots outside the area: if we find where the dotted line falls for 6 shots per game, we see that this earns you about 40 points. If we do the same thing for 7 shots per game, we find that it earns you about 50 points. This is a 10 point difference, or 3 wins and a draw.

Adding an extra shot inside the area per game gives you 10 extra points, while adding an extra shot outside the area only gives you 6 points. I’m resisting making a “size matters” joke, but when you’re trying to measure the importance of different variables it matters in a very big way.
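The arithmetic behind those two readings is just the slope of each dotted line. A quick Python sketch using the approximate values read off the graphs above (the function name is made up for illustration):

```python
def points_per_extra_shot(points_at_6: float, points_at_7: float) -> float:
    """Slope of the dotted line between 6 and 7 shots per game."""
    return (points_at_7 - points_at_6) / (7 - 6)


# Approximate values eyeballed from the two graphs.
outside = points_per_extra_shot(55, 61)  # shots outside the area
inside = points_per_extra_shot(40, 50)   # shots inside the area

print(outside)  # 6.0
print(inside)   # 10.0
```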

This is what regression tells us – the “estimate” column shows us the size of the effect (how much one variable changes when the other increases), while the stars show us if that effect is statistically significant.

Wait what? Statistically significant? You can’t just drop that term on me without explaining it.

I’m going to – don’t worry!

So what is it????

Statistical significance is a fancy way of saying “are we sure that there’s an effect of one variable on another?” We think that these two variables are related, but how sure are we really?

Everything in life is subject to some sort of uncertainty. Whether a coin comes up heads or tails, whether a team wins a game, or whether a shot goes into the back of the net or flies over the crossbar, or some non-soccer things that I hear people experience like jobs, friends, and social lives. Statistics are the same way: you can predict something will happen, but you can only be so certain that it will occur. There’s some math behind this, but the easiest way to think of it is like the margin of error in polls: we run a poll and find that President Obama’s approval is (let’s say) 52%, plus or minus 3 points. That means his approval could reasonably be anywhere between 49 and 55, and is far less likely to be outside that range.
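The poll example is just an estimate plus and minus its margin of error. As a tiny Python sketch (the function name is made up, and the 52 and 3 are the let's-say numbers from the paragraph above):

```python
def poll_range(estimate: float, margin: float) -> tuple:
    """Return the (low, high) range implied by an estimate and its margin of error."""
    return estimate - margin, estimate + margin


low, high = poll_range(52, 3)
print(low, high)  # 49 55
```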

How does this look in our stats table? Good question!

I didn’t ask.

Too bad, I’m going to tell you anyway. Back to my ugly regression output:

blog 4

I’ve highlighted the “Estimate” column in red. This number is the size of the effect each variable has on the likelihood of scoring a goal (xG).

Do not interpret this number in any way other than “positive or negative” – it’s well beyond what I can teach you here (but maybe in xG 491), and the different “Estimates” cannot be compared to each other. 

But “y” is way smaller than “angle” – that means that…

No, it doesn’t. It doesn’t mean anything.

What abou…

No.

OK fine…*mumbles under breath*

Back to our table – the “estimate” is the size of the effect. This can also be called the “coefficient” or the “slope” (think back to the graphs I showed earlier – we looked at the slope of the line, or the rate of change, to see how big the effect was).  Now let’s look at the uncertainty we talked about earlier.

blog 5

I’ve highlighted the uncertainty measure in blue, known as the “Standard Error” (The “z” in the highlight belongs to the next column – “z value”). This is a measure of how certain we are that the estimate is what we say it is. To say something is statistically significant, we are looking for an estimate that is much bigger than the standard error. That means we have an effect that is much larger than our uncertainty. Low uncertainty means we can be certain that the variable has an effect on the other variable, and we label this “statistically significant” and give it some stars. I’ll get into this more in a later post, but generally you’re looking for an estimate size (absolute value, or ignoring the negative sign if there is one) that is roughly twice as big as the standard error size. Let’s look at an example or two.

blog 6

Returning to freekick, which is a simple variable looking at whether the shot came from a direct free kick or not. The estimate of the size of this effect is 0.019 (not very big), and the standard error is 0.610 (very big). Because we have a small estimated effect and a large standard error, this is not statistically significant. This is confirmed by a lack of stars in our regression table (stars mean statistically significant). Because of this we say that free kicks are no more or less likely to turn into goals than regular shots (no change in xG).

(If you remember a few lines ago I said the estimate needed to be twice the size of the standard error, and in this case 0.019/0.610 = 0.031, which isn’t even close to 2. Small effect + high uncertainty = no statistical significance)

Let’s look at another example to hammer this concept home. This time we’ll look at “y”, which represents distance from the goal line, or how far out a shot was taken.

blog 7

This time the estimate is -0.069, which is still pretty small, but the standard error is even smaller at 0.019. A small estimate with a really small amount of uncertainty means we probably have something statistically significant, and can confidently say that distance from goal affects the likelihood of scoring a goal (xG).

(Simple math confirms this: 0.069/0.019 is about 3.63, which is greater than 2, which means we have a statistically significant relationship. This is why there are stars next to that row.)
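The “roughly twice the standard error” rule of thumb can be checked with a few lines of Python, using the two rows quoted above. The function names and the cutoff of 2.0 as a hard threshold are assumptions for this sketch; the rule in the text is only approximate.

```python
def z_ratio(estimate: float, std_error: float) -> float:
    """Absolute size of the estimate relative to its standard error."""
    return abs(estimate) / std_error


def looks_significant(estimate: float, std_error: float,
                      threshold: float = 2.0) -> bool:
    """Rule of thumb: the estimate is at least about twice its standard error."""
    return z_ratio(estimate, std_error) >= threshold


# (estimate, standard error) as rounded in the regression table.
rows = {"freekick": (0.019, 0.610), "y": (-0.069, 0.019)}

for name, (estimate, std_error) in rows.items():
    print(name, round(z_ratio(estimate, std_error), 3),
          looks_significant(estimate, std_error))
# freekick 0.031 False
# y 3.632 True
```

Because the table values are rounded, the ratio computed here may differ slightly from the “z value” column in the actual output.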

That’s it: that’s how to read a linear regression table, the main focus of “Expected Goals 201.” The 491 senior seminar will get into the techniques more and how to do a regression of your own, and I’ll definitely be recording a video sometime in the relatively near future about it as well, showing you all how to create your own xG model so be sure to look for that!

 

 





  1. A reminder from last time: that’s done by looking at the stars, and then at whether the “estimate” column is positive or negative.

A Very Preliminary NWSL Expected Goals Model: xG 101

Quick note before starting: if you’re interested in this type of explanation you should watch my “Intro to Analytics” playlist on YouTube and subscribe to my channel to see future updates. Also, follow me on Twitter @Soccermetric.

I did one of my pet peeves yesterday: I posted raw R output[1] of a preliminary cut at some xG data for the NWSL. I’ve spent a bunch of time collecting the data, and was curious whether I had anything interesting yet so I ran the model on limited data (~300 shots). Here’s the raw output:

NWSL xg

Ugly, right? Yeah…I should have at least formatted it nicely or named variables in meaningful ways. But there’s some interesting stuff here that I wanted to share with everyone, and since some people showed interest in understanding the model I wanted to write a blog post.

My ultimate goal is to provide three levels of explanation: xG 101 (Intro to Expected Goals), xG 201 (xG for Soccer Analytics majors), and xG 491 (Senior Seminar in xG). The first, which I’m including in this blog, should give you an adequate understanding of what I’m working on, the second will go a little further into the methods, and the third will get into more statistical detail and talk about some of the strengths and weaknesses of what I’ve done so far and where I need to go as this project progresses.

xG 101: Intro to Expected Goals

What is xG? 

xG is short for Expected Goals. Basically what it measures is the probability that a shot turns into a goal. When a player takes a shot, how often will she score?

Why do we care?

xG has become one of the most popular statistics in the soccer analytics community, so it’s worth understanding. It’s important because it’s used to answer some questions:

  • Which team had the better quality shots during a game?
  • Which players should score the most goals?
  • Which teams should score the most goals during a season?

The other side of the coin is Expected Goals Allowed (xGA), which answers how many goals a team would be expected to allow. Over the course of a season, xG correlates with how well a team does and how many games a team wins/loses/draws. The idea is that in a single game, teams can get lucky and defy probability but over a season these sorts of things even out.

Do you have an analogy for how this works?

Yes, yes I do. Think about it this way: if you flip a coin it has a 50% chance of coming up heads. If you flip this coin ten times you’d expect it to come up heads 5/10 times (50% of the time). But you wouldn’t be surprised if it came up heads 6 times or 4 times, and a little more surprised but not shocked if it came up 7 times or 3 times. Certainly you wouldn’t be surprised if any single flip came up heads or tails, but over the long run (hundreds or thousands of flips) you’d expect the share of heads to be close to 50%.

The same goes for shots: if your star forward takes a shot inside the penalty area, you’d expect her to score (hypothetically) 40% of the time. If she scored on a single shot, you wouldn’t be surprised, but if the goalkeeper saved it you wouldn’t be too surprised either.  In any single game, you might see a lot of goals scored (the coin comes up heads several times in a row) or not a lot of goals scored (tails several times in a row), but over the course of a season this should all balance out.
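If you want to see the numbers behind the coin analogy, here is a small Python sketch. It assumes every flip (or shot) is independent and has the same probability, which is a simplification for real shots; the function name is made up for the example.

```python
from math import comb


def prob_exactly(k: int, n: int, p: float) -> float:
    """Binomial probability of exactly k successes in n independent tries."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Ten fair coin flips: chance of exactly 3, 4, 5, 6, or 7 heads.
for heads in range(3, 8):
    print(heads, round(prob_exactly(heads, 10, 0.5), 3))
# 3 0.117
# 4 0.205
# 5 0.246
# 6 0.205
# 7 0.117

# The hypothetical 40% forward: chance she scores on none of 5 such shots.
print(round(prob_exactly(0, 5, 0.4), 3))  # 0.078
```

Exactly 5 heads is the single most likely result, but it still happens only about a quarter of the time, which is why 4 or 6 heads shouldn’t surprise anyone.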

How is xG measured?

It’s fairly simple: it’s a number between 0 and 1 where higher numbers mean a greater likelihood of scoring (a higher quality shot) and lower numbers mean a lower likelihood of scoring (a lower quality shot). A header from 40 yards out would be really unlikely to score, and therefore would have a very low xG score. A shot kicked from 3 feet away on an open net would be really likely to score and would have a very high xG score.

The actual number itself is the probability of scoring: a shot with a 0.4 xG value has a 40% chance of being a goal, while a shot with a 0.15 xG value has a 15% chance of being a goal.
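One common use of these per-shot probabilities is to add them up: the sum of a team’s shot xG values is the number of goals you would expect from those shots. A small Python sketch with invented shot values (none of these come from the NWSL data, and the function name is made up):

```python
def expected_goals(shot_xg_values) -> float:
    """Sum per-shot scoring probabilities into a team's expected goals."""
    total = 0.0
    for xg in shot_xg_values:
        if not 0.0 <= xg <= 1.0:
            raise ValueError(f"xG must be a probability between 0 and 1, got {xg}")
        total += xg
    return total


# Invented shots for illustration only.
home_shots = [0.40, 0.15, 0.05]
away_shots = [0.10, 0.10, 0.08, 0.02]

print(round(expected_goals(home_shots), 2))  # 0.6
print(round(expected_goals(away_shots), 2))  # 0.3
```

Here the away team took more shots, but the home team had the better quality chances.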

OK, I get what xG means, but what does the ugly regression output you posted mean?

Here’s the explanation I give all of my undergraduates in their Intro to American Government class. Each row is an individual factor (variable) that predicts whether a goal is scored. So you have things like the score difference at the time of the shot (“diff”), the distance from the goal line at which the shot is taken (“y”), the angle between the shooter and the center of the goal (“angle”), etc.[2]

In the table you’re looking at two things:

  • Are there stars on the same line as the variable?
    • If yes, proceed to the next step
  • Is the number under the “estimate” column positive or negative?

If there are stars, then the variable has what we call a “statistically significant effect” on the likelihood of a shot scoring (more on this in 201). This basically means that it matters – it correlates with a change in the likelihood of scoring (the xG value). If it doesn’t have stars, the two variables are unrelated and it effectively doesn’t matter. So let’s return to the table for a minute, and look at the “freekick” row.

blog 1

This variable says “Is the shot from a direct free kick?” So we look to see if there are stars next to it, and there aren’t. So it turns out a shot made from a free kick is no more or less likely to score compared to a regular shot. Good times.

Now let’s look at “counter.” This variable represents whether the shot came as the result of a counter attack. There are a lot of these in the NWSL – it’s a fast-paced, athletic league, so we see a lot of fast-moving counterattacks. But are shots from a counterattack more likely to score?

blog 2

There is a star next to “counter”, which means that whether the shot came from a counter attack is related to the probability of scoring. Life is good – we found something interesting here. Let’s move on to step #2!

“Is the number in the estimate column positive or negative?” 

The estimate number for “counter” is positive, which means counter attacks have a positive relationship with the probability of scoring (xG value). If you’ve watched my Intro To Analytics YouTube videos, you know what this means (and you should watch them, they’re really good!). But if you haven’t, a positive correlation means that when one variable increases the other increases. In this case, what it means is that shots after a counter attack are more likely to score/have a higher xG value. So teams that counter attack more frequently should score more goals.

Let’s look at one more example: distance from goal (labeled “y” in my picture).

blog 3

So step 1: are there stars? Yes there are, so that means distance from goal is related to the probability of scoring (xG).

Step 2: is the “estimate” number positive or negative? It’s negative, so what does that mean? A negative number means a negative correlation, or a negative relationship between two variables. This means that as one variable goes up, the other goes down.

Specifically here, as the distance from goal goes up, the likelihood of a goal being scored goes down (xG value goes down). Shots taken from distance are less likely to score, which makes sense from a common sense perspective. The other side is that shots taken close in are more likely to score, which again makes sense.
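The two-step reading used in the three examples above can be written down as a tiny Python function. This is just a sketch of the reading rule, not part of the model. The function name is made up, and the “counter” estimate below is a placeholder positive number because its actual value isn’t quoted in this post.

```python
def read_row(has_stars: bool, estimate: float) -> str:
    """Step 1: check for stars. Step 2: check the sign of the estimate."""
    if not has_stars:
        return "no stars: treat as no effect on xG"
    if estimate > 0:
        return "stars and positive: raises xG"
    return "stars and negative: lowers xG"


# (has_stars, estimate) for the three rows discussed above.
rows = {
    "freekick": (False, 0.019),
    "counter": (True, 1.0),   # placeholder value; only the sign matters here
    "y": (True, -0.069),
}

for name, (has_stars, estimate) in rows.items():
    print(name, "->", read_row(has_stars, estimate))
```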

That’s xG/regression analysis 101. I’ll probably turn this into a video and write up xG 201 when I get bored during the EPL games tomorrow, but hopefully this helped people understand what’s going on. 201 will go into a little more detail of how this worked, and then 491 will be a sophisticated treatment of regression analysis and how things work.

####

Variable codes:

  • diff – the “game state” or difference in score between the two teams
  • y – the distance from the goal line
  • angle – the angle (in radians) between the shooter and the center of the goal
  • time – the time the shot was taken
  • def.distance – the distance between the shooter and the nearest defender
  • head – was the shot a header?
  • foot  – was the shot kicked?
  • counter – was the shot the result of the counter attack?
  • home.team – was the shot taken by someone on the home team?
  • gk.error – was the shot after a goalkeeper’s error?
  • freekick – was the shot a direct free kick?
  • corner – was the shot assisted off a corner kick?






  1. In my defense, at least it wasn’t raw Stata output…lol stats program jokes.
  2. I’ll go into all the variables in 201, so if you’re interested, keep reading!

Some Personal Reflections on Getting Started in Soccer Analytics

Yesterday Ravi (@Scribblr_42) wrote a great microblog titled “Is Fanalytics Intimidating?” and you can find it here: https://twitter.com/Scribblr_42/status/709803712856899584

It’s really interesting, and you all should read it and then come back to my post. I’ll wait.

This isn’t a methods post about how to get started, but I wanted to share some of my experiences on breaking into the analytics community and maybe offer some advice to people who want to get involved.

I’ve been doing this about a year now, and converted my personal Twitter account into a “soccer only” account around 7 months ago. In that time I’ve grown my following from about 110 friends and former students to over 2500 followers, and feel incredibly fortunate to have picked up a good-sized following in such a short time.

I’ve posted it before, but this started as an excuse to improve my data science skills that I didn’t think would go anywhere. I likely wouldn’t have kept up with it if there wasn’t an audience for what I’ve been doing – I’ve briefly tried political blogs in the past but they never really caught on and I don’t feel like I have anything unique to add to the political blogosphere.[1] I don’t see the purpose in saying something that’s already been said for a couple dozen people to read, so I’ve never really stayed with it. But with soccer I’ve built up an audience for my work so I’ve continued to put out content that hopefully people continue to enjoy.

The biggest thing that has helped me has been bigger accounts sharing my work, and I couldn’t be more appreciative to the people who have done so. Mike Goodman (@The_M_L_G) has been particularly supportive from the beginning, as have @GoalImpact and @7amkickoff. I’m incredibly grateful to the people who regularly share my work, particularly Jake Kilov (@Kilonater3000) and Naveen Maliakkal (@njm1211) who have consistently retweeted me for a while, and anyone else who does.

Since I moved into Women’s Soccer, @DasGherkin has been incredibly generous in promoting my account, and I’ve picked up over 200 followers just from her recommendations. She’s kind of a big deal in the WoSo Twitter world, and for her to share my work has given me a real credibility boost in that world and has been invaluable in helping me get the word out for my upcoming NWSL analytics work. I know I’m leaving a bunch of people out, but I’m grateful to everyone who has helped me out.

But it’s not just about growing my audience – simple support from people in the community has meant a lot and kept me going. Tom Worville (@Worville)  helped me with coding and data issues more than a few times, especially in the beginning, and James Curley (@jalapic) and @UTVilla have both given me great help and have helped my R programming skills grow exponentially. And even something as simple as seeing prominent accounts favoriting my tweets has given me the motivation to go forward and keep doing work. Favoriting is especially important because it’s a costless act beyond pressing a button, but it’s a nice validation that I’m doing something interesting/good and a clear sign that someone’s reading it. The community embraced my work early, so it has been less intimidating for me than it has been for others, but I can see that it would be intimidating if someone doesn’t get this support. I don’t know if I would still be posting public analytics work without it, and I’m someone who was already confident in my analytics skills.

To people with a lot of followers: take the time to engage with people who don’t have as many eyes seeing their work. You don’t have to have long conversations (although it’s nice), but something as simple as a favorite, or even a retweet, can mean a lot to someone trying to find a place in the community. I do this every so often with my Retweet Days where I share work from people who have fewer followers than I do. And the best part is that the more followers I get, the more voices I can amplify. I’m happy to do it.

I’ve also been trying to give back in terms of my “Intro to Analytics” YouTube Course. People seem to be enjoying it, and I need to add some more chapters as soon as I can find the time and energy to do so. Maybe people who watch these will become involved in analytics, or maybe they’ll just be more able to participate in the conversation, but hopefully this will let more people get involved and make people a little less hesitant to participate. At some point we all started out with virtually no followers, so why not pay it forward and try to help people who might be part of the next generation of soccer analysts?

Final thoughts: for those of you looking to get involved, don’t worry about the math. Learn some simple concepts from my YouTube channel (if you don’t know them already), find some public data, and go from there. I know my work focuses on predictions, but as someone said on Twitter, I don’t think there’s much more room in that space unless you can out-predict my model and the other ones that are out there (which will be tough). Same goes for Expected Goals: I don’t think there’s a lot of room for new xG models in the analytics world unless you can put something together that significantly beats the prominent ones already out there. But there’s lots of work to be done, so look for those gaps and fill them.

Most importantly, be a good citizen, regardless of how many followers you have. If you do interesting work, engage with other people’s work in a positive way, and write often, you’ll have a good chance at building a following. Even if you don’t, you’ll have some positive experiences and share your work with like-minded people. Hopefully people wanting to get started can take my advice, and hopefully people who have a strong following can help encourage new people to participate in the community. It really helped me get a foothold and feel like my work was being appreciated, and the teacher in me wants to help others along the same lines.

 






  1. I do original research as part of my job, but that’s academic, not blogging.