Category Archives: Methods

Thoughts on Machine Learning, Black Boxes, and my SVM

There was quite a bit of discussion about machine learning (ML) techniques on Soccer Analytics Twitter™ today, so I expedited this post I’ve been planning for a few days.

I think it’s important to define what ML is for people. Normally I don’t like this, but Wikipedia has a good definition that I think works for what I wanted to communicate, so here we go:

Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data.

The big thing here is that ML techniques focus on prediction – if you’re trying to predict something, then you should look at ML first. What it doesn’t do well is explain things, as captured by these tweets from Michael Caley.

This was in response to Ola Lidmark Eriksson’s blog post talking about a machine learning approach to calculating xG. Michael’s approach is aimed at a mass audience, explaining what makes a shot more likely to go in (e.g. closer to goal, more centrally located, not headed), while Ola focuses on greater accuracy and letting the model make the decisions. Both versions have their merit, and so much of it really depends on what trade-offs you’re willing to make. How much accuracy do you gain compared to the lack of explanatory power from these types of models? How much are you willing to sacrifice on either dimension?

All that being said, I wanted to talk about my method for predictions because I get a number of questions. I frequently get people asking “Why does your model  like Arsenal so much?” or “What stats are driving your results?” or “Why does your model think Theo Walcott is so good?” The answer to this is always “I don’t know”, and that’s a feature of the Support Vector Machine model. I explained the Support Vector Machine (SVM) in another post, so I’m not going to revisit the whole thing, but it’s worth a read for anyone interested in what’s under the hood. But I wanted to highlight the “black box” nature of these models, SVMs in particular.

The reason I like the SVM model for soccer is that it doesn’t assume any functional form – it doesn’t think that more passes or more possession is necessarily good. It looks at stats for results, and learns how many passes and how much possession is optimal given the other game stats, and predicts results that way. It doesn’t tell you what the inflection points are, and doesn’t tell you what the cutoffs are, and the interactions in such a big model are too complicated to present visually. But it does predict well, which is what matters to me for my purposes.

Another thing it does is recognize the value of defense and balance in a team. One of the common refrains of Soccer Analytics folks is that it’s impossible to quantify defense. The SVM proves that this isn’t true, as it recognizes the value of having players who make some tackles/make clearances/win headers. I’m particularly proud of my most recent exploratory analysis, looking at the value of Man City replacing each of their midfielders with Lionel Messi.


The model shows Messi as an improvement over all of Man City’s front three midfielders, but a small downgrade from Yaya Toure and a significant downgrade from Fernandinho. While it doesn’t give me specific reasons for this (the SVM is a black box, remember?), it’s pretty clear that Messi isn’t as good playing in the deeper role that Toure plays and certainly wouldn’t make a good holding midfielder like Fernandinho. Any increase in offense brought on by playing Messi instead of Fernandinho would be more than offset by the loss in defensive strength. This passes the common sense check.

Most importantly, I think this highlights one of the advantages my model has over some of the dominant models out there – specifically ones based on xG or some other offensive contribution. It knows if you’re playing too many attacking players, and will punish you for that. It can find places where your team is imbalanced (Mesut Ozil at Arsenal is a great example of that – my model much prefers Daniele de Rossi in his place, which is a very different role), and point out ways to fix that and can even recognize potential tactical improvements. It does all of this without knowing anything about players other than their average statistical contribution to a game.

Machine Learning techniques have their place, and if prediction is your goal then you really should learn something about them. But if you’re looking to explain things, then there are more appropriate methods and you should learn those. My SVM has predicted results well so far, and it quantifies individual player contribution to a team as well as anything out there (I would argue better, but I have no statistical proof of this). But it doesn’t explain outcomes particularly well, and it doesn’t explain why it prefers certain players over other players. That’s a job for other methods and people who are more interested in explanation. As usual, it’s about the right tool for the right job, and Machine Learning techniques are the right tool for predicting outcomes and quantifying individual contribution.

Game Theory: A Long-Term Look at Selling Young Players

My last post looked at the calculations involved in whether a team battling relegation should sell strong young players, and showed that in a one-shot game it basically comes down to your perceptions of probability of remaining in the league if you sell the young star vs. keeping the young star. For those who didn’t read (and don’t want to click the link above) here’s a quick refresher on the calculation:

EV(Keep Grealish) = Pr(Staying in the EPL)* (Total Revenue from being in the EPL)  + Pr(Relegated)* (Total Revenue from the Championship)  – (Money spent improving the squad)

EV(Sell Grealish) = Pr(Staying in the EPL) * (Total Revenue from being in the EPL) + Pr(Relegated)* (Total Revenue from the Championship) + (Money gained from Selling Grealish) – (Money spent replacing him)

If EV(Keep) > EV(Sell), then Villa should keep Grealish. If EV(Sell) > EV(Keep), then Villa should sell him.
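To make the comparison concrete, here is a minimal Python sketch of the two expected values. Every number in it is a made-up placeholder rather than an estimate for Aston Villa, the function name `expected_value` is only for illustration, and the assumption that selling raises the survival odds is just one possible scenario.

```python
def expected_value(p_stay, epl_revenue, championship_revenue, net_transfer):
    """One-season expected value: revenue weighted by survival odds, plus net transfers.

    net_transfer is money in minus money out (negative means net spending).
    """
    p_relegated = 1.0 - p_stay
    return p_stay * epl_revenue + p_relegated * championship_revenue + net_transfer


# Hypothetical inputs in millions of pounds (illustrative only).
EPL_REVENUE = 100.0
CHAMPIONSHIP_REVENUE = 20.0

# Keep: assumed 30% survival odds and 5m spent improving the squad.
ev_keep = expected_value(0.30, EPL_REVENUE, CHAMPIONSHIP_REVENUE, net_transfer=-5.0)

# Sell: assumed 40% survival odds, 10m fee received, 10m spent on a replacement.
ev_sell = expected_value(0.40, EPL_REVENUE, CHAMPIONSHIP_REVENUE, net_transfer=10.0 - 10.0)

decision = "sell" if ev_sell > ev_keep else "keep"
print(f"EV(Keep) = {ev_keep:.1f}, EV(Sell) = {ev_sell:.1f} -> {decision}")
```

With these placeholder inputs the survival probability does nearly all the work, because the revenue gap between the two leagues is much larger than the transfer amounts.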

I left the last post as is in the interest of simplicity and parsimony. However, the full math is a little more complicated than this and I wanted to walk through it a little more fully for interested readers here.

First, there is a psychic cost/benefit to selling young players. Fans may become upset at selling a club’s young star, there may be a decrease in locker room morale, or some other effects I can’t anticipate along those lines.  So add that to the “sell” side of the equation.1

Second, this isn’t really a one-shot game. Aston Villa (or any team) retaining their Premier League status is a yearly task. 2 So they’d have to factor in the probability of retaining Premier League status this year (“t”), next year (“t+1”), the year after (“t+2”), etc. Presumably, if Grealish improves like he is expected to over the next 5-10 years, he will grow into a valuable member of Aston Villa’s squad, decreasing their probability of relegation in t+1, t+2, etc.  So with this in mind, here’s the new equation (some of the names are abbreviated so it doesn’t get too unruly).

t = Pr(Stay) * (EPL Revenue) + Pr(Relegated) * (Championship Revenue) – (Net Transfer spend)

EV(Keep) = t + Pr(Stay(t+1)) * (EPL Revenue(t+1)) + Pr(Relegated(t+1)) * (Championship Revenue(t+1))

EV(Sell) = t + Pr(Stay(t+1)) * (EPL Revenue(t+1)) + Pr(Relegated(t+1)) * (Championship Revenue(t+1)) + (Money gained from Selling Grealish(t+1)) – (Money spent replacing him(t+1))

In this formula, Pr(Stay(t+1)) for EV(Keep) presumably is higher based on Grealish’s improvement over a year. He will be a better player, and will be able to make a higher contribution to the team. How much higher, and how much will that improve their likelihood of staying in the Premier League?

Similarly, his value will increase over the year, so Money gained from Selling Grealish(t+1) will be higher, increasing EV(Sell). So this changes the calculation, but both sides probably increase proportionally (his effect on probability of staying in the league will go up as his value goes up).

I’m not going to go through year two, but the process is the same, just adding the EV of t and t+1 to the formula for t+2, although Grealish’s value will continue to go up.

The final wrinkle to the process is that the game presumably ends if Aston Villa gets relegated. A prospect like Jack Grealish wouldn’t want to play for a team in the Championship, and would be expected to leave in the summer instead of sticking around another year. So our new formula would be3:

EV(Keep(t+1)|EPL(t)) = t + Pr(Stay(t+1)) * (EPL Revenue(t+2)) + Pr(Relegated(t)) * (Championship Revenue(t+2))

EV(Keep(t)|Championship(t+1)) = Championship Revenue(t+1) + Sale price for Grealish(t+1 Championship)

EV(Sell(t+1)) = t + Pr(Stay(t+1)) * (EPL Revenue(t+2)) + Pr(Relegated(t+1)) * (Championship Revenue(t+2)) + (Money gained from Selling Grealish(t+1 EPL)) – (Money spent replacing him(t+1))

This version reflects Grealish’s presumably dramatically changing value based on Aston Villa’s success in year t. If Aston Villa keeps him and stays up, his value increases every year. But if Aston Villa keeps him and goes down, his value presumably decreases because teams know he won’t want to stay at a Championship-level team, which changes the expected value calculations yet again. There’s a risk/reward involved in keeping young stars around, and so much uncertainty that what a team chooses depends on assigning the correct probabilities to all the events and determining your risk aversion.
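One way to read the branching logic is as a small two-year decision tree: stay up and play another season with the player, or go down and sell him at a reduced price. The Python sketch below is a simplified reading under that assumption, not a line-by-line translation of the formulas, and every input is a hypothetical placeholder.

```python
def ev_keep_two_years(p_stay_t, p_stay_t1, epl_rev, champ_rev, sale_if_relegated):
    """Expected value of keeping the player at year t, looking one year ahead."""
    # Stay up in year t: bank EPL revenue, then face the survival lottery again in t+1.
    stay_branch = epl_rev + p_stay_t1 * epl_rev + (1.0 - p_stay_t1) * champ_rev
    # Go down in year t: Championship revenue plus a cut-price sale, and the game ends.
    relegated_branch = champ_rev + sale_if_relegated
    return p_stay_t * stay_branch + (1.0 - p_stay_t) * relegated_branch


# Hypothetical inputs in millions of pounds (illustrative only).
ev_low_fee = ev_keep_two_years(0.3, 0.5, 100.0, 20.0, sale_if_relegated=5.0)
ev_high_fee = ev_keep_two_years(0.3, 0.5, 100.0, 20.0, sale_if_relegated=10.0)
print(f"Reduced fee: {ev_low_fee:.1f}, full fee: {ev_high_fee:.1f}")
```

The gap between the two results is simply the relegation probability times the fee lost, which is the risk a club takes on by holding a young star through a relegation fight.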

Finally, we add the psychic benefit/cost in, weighting for relegation and staying in the league to get our final, overly complicated looking equation.

EV(Keep(t+1)|EPL(t)) = t + Pr(Stay(t+1)) * (EPL Revenue(t+2)) + Pr(Stay(t+1)) * (Psychic Benefit(Stay)(t+2)) + Pr(Relegated(t)) * (Championship Revenue(t+2)) – Pr(Relegated(t+1)) * (Psychic Cost(Relegated)(t+2)) – Psychic Cost(Sell)

EV(Keep(t)|Championship(t+1)) = Championship Revenue(t+1) + Sale price for Grealish(t+1 Championship) – Pr(Relegated(t)) * (Psychic Cost(Relegated)(t+1))

EV(Sell(t+1)) = t + Pr(Stay(t+1)) * (EPL Revenue(t+2)) + Pr(Stay(t+1)) * (Psychic Benefit(Stay)(t+2)) + Pr(Relegated(t+1)) * (Championship Revenue(t+2)) – Pr(Relegated(t+1)) * (Psychic Cost(Relegated)(t+2)) + (Money gained from Selling Grealish(t+1 EPL)) – (Money spent replacing him(t+1)) – Psychic Cost(Sell)

All of this being said, the money earned for keeping/selling a young star is small compared to the money earned from staying in the EPL vs. the Championship. A couple million pounds’ difference in selling/keeping a young star would only matter if the odds of relegation changed only marginally based on selling him. The driving force here is the first part of the equation:

Pr(Staying in the EPL) * (Total Revenue from being in the EPL) + Pr(Relegated)* (Total Revenue from the Championship)

How much better would Aston Villa be by selling Jack Grealish and buying more experienced players with the money they made? Would it be enough to offset the psychic cost, increased likelihood of remaining in the EPL the subsequent year, and profit they would make by keeping him an extra year? Given that Aston Villa, like all EPL teams, are single-minded forsakers of relegation, how risk-tolerant are they? If it were up to me, looking at the numbers, I’d sell now, but I likely estimate their probability of finding replacements who can improve their chances of staying up more highly than most do. Regardless of the conclusion, the goal here was to formalize the thought process of any team in the decision to sell a young player. It’s a complicated process with a lot of uncertainty in many different places, which is why you’ll see people argue both sides so passionately.

  1. In political science, William Riker asserted there was a psychic benefit to voting, e.g. wearing your “I voted today” sticker makes you feel good about yourself, and this is really the only reason why people vote.
  2. One could even apply this to the 7 or 8 teams who are perennially “safe” – change “relegated” to “Champions League” or “Winning the Title” or whatever your goal is.
  3. EV(Keep(t+1)|EPL(t)) is a conditional expectation that should read “The Expected Value of keeping Grealish in year t+1 given that they remained in the EPL after year t is…”

Game Theory: The Case for Villa Selling off Youth Players

My previous game theory posts (“The Logic of Not Caring About the Champions League” and “EPL’s Treatment of Europe is a Tragedy (of the Commons)”) were well-received, so I thought I’d put together another one related to a topic on my mind that may get a fair amount of attention when transfer rumor season starts. Aston Villa is currently involved in a serious relegation fight that has seen their odds of relegation increase to almost 70%.  The heat map below shows how their odds of getting relegated have increased over the season, especially in the last couple of weeks.

[Figure: heat map of Aston Villa’s finishes]

I’ve run some analyses to see exactly where Villa’s best opportunities for growth are, and they all point to Jack Grealish being the biggest opportunity for them. The average gain in his position is a very solid 5 points, and the maximum gain is an astounding 20 points. Grealish may be a future Aston Villa and England star, but my model thinks he’s holding them back significantly today. But how do we know if it’s time to sell?

The immediate calculation is fairly easy and straightforward:

EV(Keep Grealish) = Pr(Staying in the EPL)* (Total Revenue from being in the EPL)  + Pr(Relegated)* (Total Revenue from the Championship)  – (Money spent improving the squad)

EV(Sell Grealish) = Pr(Staying in the EPL) * (Total Revenue from being in the EPL) + Pr(Relegated)* (Total Revenue from the Championship) + (Money gained from Selling Grealish) – (Money spent replacing him)

Then all you’d have to do is compare the two numbers: If EV(Keep) > EV(Sell), then you keep him. If EV(Sell) > EV(Keep), you sell him. Assuming net transfer spend in both situations is the same, the equation comes entirely down to what Aston Villa believes their chances of staying in the EPL are with/without him. If you believe my model, the chances improve significantly with the right replacement (many of whom are in Aston Villa’s buying range at first glance), therefore EV(Sell) is much greater than EV(Keep), meaning that it’s time to sell him.

This all assumes a one-shot game, which may be a reasonable assumption if you believe next year will be a similar fight to stay in the Premier League. I’ll post the longer version in my next post, but this illustrates the expected value calculations that teams go through on these decisions. Single-minded forsakers of relegation may have to go against club ethos and sell a young star so they can achieve their primary goal of maintaining Premier League status.

Stats for Coaches and Journalists: Thoughts and a Draft Syllabus

I’ve mentioned this before, but my day job is professor of political science, and specifically I teach courses about statistical methods and research design (among other classes). With the latest round of “Analytics, LOL” foolishness from the media on Twitter, I thought I would do something productive and create a syllabus for a hypothetical course on soccer analytics for coaches and journalists.

I wanted to share a few thoughts about this idea, and have an open question for anyone who reads this (which I’ll tweet as well). The target audience is people who have some interest in learning about analytics, their uses, and some basics with a goal of being able to speak to analytics types/read blog posts written with analytics. I had a Twitter exchange with @unfitforpurpose about this, and it may be worth re-thinking without the assumption that people are interested in learning the material.

I’m assuming zero knowledge on the part of the audience members. Gab Marcotti tweeted something about many people not knowing what a standard deviation is, which I think is potentially even overstating both the lack of math knowledge and math awareness of the audience. I also think focusing on the math is problematic: at their core, analytics aren’t about math, they’re about using tools to answer a question. I’m a big believer that measurement for the sake of measurement (or math for the sake of math) is a waste of time. To really appreciate and understand analytics, you need to start with simple concepts like hypotheses, measurement, and operationalizing variables. Even in the analytics community, we often forget that measures of uncertainty only come after proper model building.

I broke the course down into four sections:

  1. “What is science?”
  2. “Case Study and Small Sample Research”
  3. “Stats and Large Sample Research”
  4. “Data visualization techniques”

I think people were picturing an hour-long seminar on how to do analytics, and I don’t think that’s the best way to do it. This isn’t a semester’s worth of learning, but I think it’s at least four 2-hour sessions to get a basic understanding of what we’re doing, although one could cut data viz out if the goal was just to understand rather than to produce, leaving us with three 2-hour sessions. Longer might be better, but if the goal is basic understanding then I think this would be enough.

However, I’m curious if other people think this could be broken down into a 1 hour session? What would you include? I don’t see it, but I’m thinking about ways it could be done and am curious for suggestions.  Here’s the syllabus I wrote, and  I may flesh it out even further with readings or videos if people are interested. Let me know what you think.



Game Theory: EPL’s Treatment of Europe is a Tragedy (of the Commons)

My last post asserted that it’s individually rational for each team in the EPL to not care about the Champions League.1 However, this leads to the very real possibility of a “tragedy of the commons” effect, and England losing a coveted Champions League spot.

The tragedy of the commons comes from Garrett Hardin, and describes a village where farmers all graze their sheep in a common area. This area belongs to everyone, and people are free to have as many sheep graze there as they can afford. It is in each farmer’s rational self-interest to purchase as many sheep as they can so they can sell milk, wool, and whatever else sheep are good for. As they buy more and more sheep, the commons become over-grazed, and all the grass dies. No one gets to graze their sheep, costing everyone money. Individuals acting in their own self-interest can hurt the collective in the long run.

We’re seeing that right now in Europe for England. It may be rational for teams to focus on the league and ignore the Champions League so they can focus on finishing in the top 4 and qualifying for next year’s Champions League. This is likely even more true for Europa League teams who need every edge they can get in the league to try and make the top 4 next year, so they’re more likely to tank the European fixtures. However, with UEFA coefficients (and Champions League spots being allocated to leagues based on their coefficient) being largely based on performance in continental competition, we’re seeing a tragedy of the commons.

It’s in each team’s interest to not worry about European fixtures and to focus on the league instead, but if every team does this then England could easily lose their 4th CL spot. When the individual good conflicts with the collective good, the collective good can easily disappear. Right now the EPL has too many sheep, not enough farmers maintaining the common area. Ignoring Europe, as rational as it may be for the individuals, is bad for the collective.
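A toy payoff model makes the conflict between individual and collective interest explicit. The Python sketch below assumes, purely for illustration, that a club gains a fixed private benefit from rotating in Europe and that every rotating club imposes a small shared cost (a weaker coefficient) on all English clubs. The numbers and the `payoff` function are hypothetical.

```python
def payoff(i_rotate, others_rotating, private_gain=3.0, shared_cost=1.0):
    """Payoff to one club. Each rotating club imposes shared_cost on every club."""
    total_rotating = others_rotating + (1 if i_rotate else 0)
    return (private_gain if i_rotate else 0.0) - shared_cost * total_rotating


N_CLUBS = 7  # hypothetical number of English clubs in European competition

# Whatever the other clubs do, rotating pays more for the individual club...
rotating_is_dominant = all(
    payoff(True, k) > payoff(False, k) for k in range(N_CLUBS)
)

# ...yet every club ends up worse off when all of them rotate.
everyone_rotates = payoff(True, N_CLUBS - 1)
nobody_rotates = payoff(False, 0)

print(rotating_is_dominant, everyone_rotates, nobody_rotates)
```

Under these assumptions rotating is always the better individual reply, but the all-rotate outcome leaves each club below where it would be if everyone took Europe seriously.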



  1. I use the word “rational” in the economic sense of the word – acting in one’s self-interest.

Game Theory: The Logic of Not Caring About the Champions League

John Burn-Murdoch’s thought-provoking piece about English soccer teams prioritizing qualifying for the Champions League over winning the Champions League has inspired a fair amount of discussion on Twitter, so I wanted to walk through the logic of such a strategy and the implications for English soccer.

You all should read the piece, but the general point he makes is that you can tell English soccer teams don’t prioritize the Champions League because they are more likely to play their full-strength squads in the league than they are in the European games. This, rather than a lack of quality in English teams, can explain at least some of why England’s results have been so abysmal this year. But is this the right choice?

My disclaimer here is that I only took one game theory class in graduate school, but this seems to be a fairly simple probability exercise. Are you better off playing a full-strength squad in the Champions League and the Premier League, or playing a rotated squad in the Champions League and a full-strength squad in the Premier League?1

[Figure: Champions League game tree]

The figure represents the reduced form game tree here. “Nature” begins with a Champions League fixture, and the manager is given two choices: play a full-strength squad or rotate players out of the Starting XI? Playing a full-strength squad increases the probability of winning (Pr(win))/decreases the probability of a loss (Pr(Loss)). The payoff for winning remains the same, but the “pain” of losing decreases because the media and fans don’t second guess you the next day.

So what are the benefits of winning each fixture? The benefits of winning a Champions League fixture include the obvious: points for the fixture, representing an improved probability of qualifying for the knockout stages. That has monetary benefits, including presumably at least one more home match where you can charge fairly high ticket prices, and some added money from the total pool (although not nearly as much as you get just for qualifying). It also has non-tangible benefits such as prestige for the club (and maybe even “big club” status), which could help you sign higher quality players later, and improving morale. These are not insignificant benefits.

However, because we’re looking at an issue of probability here, it’s not as simple as “If we play a full-strength squad, we win, but if we rotate, we lose.” English clubs should still be considered favorites over many of their competitors even with a rotated squad, although the win probability decreases by a certain percentage.2 So rather than realizing the whole benefit, we have to multiply the expected benefits/pain by that percentage, and then have to multiply that number by how much this increased/decreased the likelihood of qualifying for the next round. It’s impossible to assign actual numbers to the benefits/pain, but we can see that it conceivably becomes a very small number when we look at the changes in expected value. You can see how this number would become virtually 0 for the Europa League, both in terms of money and the non-tangible benefits, and the pain would decrease because English fans, until recently, haven’t cared about the Europa League.

On the other side, if a team plays a full-strength side, that decreases their likelihood of winning their next EPL fixture. Losing that fixture comes with potentially tangible pain: particularly losing points that could cause a team to miss out on next year’s Champions League. This comes with a very significant financial loss, as well as a much bigger loss of face (and “big sidedness”) which will hurt recruitment of players next season. There are also pain issues of angering the fans and media, which are exaggerated if the next match is some sort of rivalry or important match against another top side, and maybe pain issues associated with losing against a relegation-level side because you played the wrong players. From a managerial standpoint, one can also see the potential for pain in terms of players not getting enough playing time who begin to complain, and the potential of over-working starters who are playing two games a week instead of having time to let little injuries heal. Once again, these are multiplied by the change in probability of winning the individual game, and then multiplied by the change in probability of the result affecting either Champions League qualification next year or winning the title. 3 These probabilities are also very small, but the benefits of qualifying for the Champions League seem to exponentially outweigh the benefits of qualifying for the knockout stages, so you can see how this number would still be higher.

It’s a simple expected value calculation: how much do you trust your rotated squad and what value do you get from the league vs. Europe? There are next level concerns as well: if you’re going to lose in the knockout stage to Barcelona or PSG, do you get any benefits from playing a full-strength squad and then playing two extra games? Do any of the English teams consider themselves likely to make it past the quarterfinals ahead of Barcelona, Madrid, Bayern, Juventus, or even PSG? The economics and “big club” nature of soccer lead me to believe expected value calculations favor not playing a full-strength squad in the Champions League unless you think you have a chance to make the semi-finals or beyond, and these teams likely have a strong enough second XI that they could rotate with little consequence regardless. Insert your own probabilities and values for winning/pain for losing, but I think you’d have to make some pretty strong assumptions about the Champions League to make it worth playing a full-strength squad.
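For readers who want to insert their own numbers, here is a minimal Python sketch of the multiplication described above: change in win probability, times change in the probability of the outcome you care about, times the value of that outcome. The 10% drop echoes the Manchester City example in footnote 2. Every other figure, and the name `marginal_value`, is a hypothetical placeholder.

```python
def marginal_value(delta_win_prob, delta_outcome_prob, payoff):
    """Expected-value change from one selection decision in one match."""
    return delta_win_prob * delta_outcome_prob * payoff


# Playing the first XI in Europe: +10% win probability, which is assumed to move
# the chance of reaching the knockout stage by 5%, worth an assumed 5m.
europe_gain = marginal_value(0.10, 0.05, 5.0)

# Cost in the next league match: an assumed -5% win probability, which is assumed
# to move the chance of a top-four finish by 2%, worth an assumed 50m.
league_cost = marginal_value(0.05, 0.02, 50.0)

choice = "rotate in Europe" if league_cost > europe_gain else "full strength in Europe"
print(f"Europe gain: {europe_gain:.3f}m, league cost: {league_cost:.3f}m -> {choice}")
```

Both products are tiny, but the much larger payoff attached to next year’s Champions League place is what tips the comparison under these assumptions.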

  1. Two disclaimers here: I don’t present the third option, a rotated squad in England and full-strength in Europe, for lack of space. The results would be the same as full-strength in both, although presumably more exaggerated because Pr(win) in the next EPL fixture would be even lower. Second, I present a reduced form of the game here – I don’t look at the added effects of how an extra game could affect players 10+ games down the road. Again, this would exaggerate the effect on Pr(win) in subsequent EPL games and would argue for rotation.
  2. As an example, Manchester City without Aguero, Silva, and Kompany a few weeks back lost about 10% winning probability against a lower side.
  3. I don’t figure there’s a big difference to supporters between 2nd and 3rd, but there is a massive difference between 1st and 2nd.

How Bad is Chelsea’s Start Exactly?

With the international break, there’s plenty of time to overthink any number of topics, so I wanted to start with Chelsea’s disappointing start to the season. With 8 points through 8 games, they’re closer to relegation than they are to the title chase. But how bad is this really?

To answer this question, I took the probabilities for Chelsea’s first 8 matches generated by my SVM model and ran 100,000 simulated seasons. I did this by drawing a random sample of “win, draw, loss” for game 1 based on the predicted probabilities for that game, then drawing another sample for game 2, game 3, etc. through game 8.   Then I added up the total points earned for those eight games, and counted it as 1 season (so far). I repeated this process 100,000 times, and tabulated the total number of points earned for each season. Here’s what I found.
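The simulation loop can be sketched in a few lines of Python. This is a minimal illustration of the procedure, not the code behind the results: the match probabilities are placeholders (the SVM outputs are not listed in this post), and `simulate_points` is a hypothetical name.

```python
import random


def simulate_points(match_probs, n_sims=100_000, seed=0):
    """Simulate total points over a run of matches.

    match_probs: list of (p_win, p_draw, p_loss) tuples, one per match.
    Returns a list with one simulated points total per season.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        points = 0
        for p_win, p_draw, p_loss in match_probs:
            # One draw per match: 3 points for a win, 1 for a draw, 0 for a loss.
            points += rng.choices((3, 1, 0), weights=(p_win, p_draw, p_loss))[0]
        totals.append(points)
    return totals


# Placeholder probabilities for eight matches (NOT the model's actual outputs).
probs = [(0.60, 0.25, 0.15)] * 8
totals = simulate_points(probs, n_sims=20_000)

actual_points = 8
share_at_or_below = sum(t <= actual_points for t in totals) / len(totals)
print(f"Share of simulated starts with {actual_points} points or fewer: {share_at_or_below:.4f}")
```

The share of simulated seasons at or below the observed total is the percentile of the real start, so a very small value means the start is far out in the left tail.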

[Figure: Chelsea eight-game simulation results]

The blue bars represent the proportion of times Chelsea earned the points listed on the x-axis in my 100,000 simulated seasons. The red bar toward the left shows how many points Chelsea actually earned through 8 games. As you can see, this start is well below almost any reasonable expectation and is 9 points below the most likely result from the simulations. The really bad news – Chelsea’s start ranks below the second percentile, meaning that over 98% of all simulations had Chelsea earning more points. It’s a bad start, but I had no idea how bad it was.

The good news is that I ran some preliminary full-season simulations, and they’re still about 70% likely to qualify for the Champions League, but the bad news is that depends on them getting it together soon. The international break likely couldn’t have come fast enough for Mourinho.


Striker Similarity Data Vis: Feedback Requested

I’m in the very early stages of a new project that could potentially be interesting – visualizing similarity of players through multidimensional scaling (MDS). The quick version of the method is to take all the player stats I have, scale them down into two dimensions, and calculate the Euclidean distance between all of these points. Then I can plot those points in a typical Cartesian plane. Theoretically, players close to each other in the plot should have similar stats.
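As a rough illustration, here is a pure-Python sketch of classical (Torgerson) MDS, which starts from the pairwise Euclidean distances and then finds a two-dimensional layout that preserves them as well as it can. I am assuming the classical variant here; the post does not say which MDS algorithm or software was used. The player names and stat values are invented.

```python
import math


def pairwise_distances(rows):
    """Euclidean distance between every pair of stat vectors."""
    return [[math.dist(a, b) for b in rows] for a in rows]


def classical_mds(dist, dims=2, iters=500):
    """Classical MDS: double-center squared distances, keep the top eigenvectors."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Double-centered Gram matrix built from the squared distances.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand) for j in range(n)]
         for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration from a fixed, non-constant unit start vector.
        start = [float(i + 1) for i in range(n)]
        length = math.sqrt(sum(x * x for x in start))
        v = [x / length for x in start]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][k] = v[i] * scale
        # Deflate so the next pass finds the next-largest eigenvector.
        b = [[b[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


# Invented per-game stats (goals, shots, passes). Real inputs would be standardized first.
stats = {
    "Striker A": (0.60, 3.1, 20.0),
    "Striker B": (0.55, 2.9, 22.0),
    "Winger C": (0.30, 2.0, 35.0),
    "Defender D": (0.05, 0.4, 50.0),
}
names = list(stats)
coords = classical_mds(pairwise_distances(list(stats.values())))
for name, (x, y) in zip(names, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

The axes that come out of this have no built-in meaning, which is why interpreting the dimensions takes some domain knowledge.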

The proof-of-concept was doing it with all of the players in my dataset, and highlighting them by position. I see really good clustering with goalkeepers, good clustering with strikers and defenders, and middling clustering with midfielders.1 Here’s the plot:

[Figure: MDS plot, first cut]

The next plot is where I need help. The dimensions in this type of plot aren’t necessarily meaningful, but you can see in the full plot that lower right tends to be defenders, left center tends to be attackers, and midfielders are kinda sorta upper right.  Here’s the zoomed in striker plot with selected players highlighted:

[Figure: MDS plot of strikers with selected players labeled]

I labeled some major names, some names I find interesting, and then tried to highlight some names on the margins of the plot. When you do this type of plot, you don’t define the dimensions, but they potentially mean something.  So I’d appreciate any thoughts anyone has on what we might be seeing here given the players I’ve labeled. Tweet them to me @Soccermetric and life will be good.

  1. This makes sense because wingers are basically attackers, and holding midfielders are similar to defenders. We’d expect to see a holding mid have more in common with a defender than a winger.

Experimenting with Points Above Replacement (PAR)

I’ve been working on a Points Above Replacement (PAR) measure for soccer, and there are plenty of challenges. Baseball is an easier game to do stats with – it’s a relatively closed system, one batter, one pitcher, one fielder per play.1 Soccer creates a new challenge, so I’ve been experimenting with my measure.

Similar to the “Each Team’s Best Striker” post from a couple of weeks ago, I trained my Support Vector Machine (SVM) on 2014 league data across the 5 major European Leagues, then read in player stats from those leagues and the English Championship to a separate database. I started with the first player of each team, and substituted each player who plays the same position in the database for the original player, calculating the new predicted points in the SVM. After finishing all players in the dataset for the first player, I move on to the second, third, fourth, etc. until I’ve finished the team.

For this analysis, I put all the players in the database in order from highest expected point total to lowest point total. Then I found the 50th percentile (the player where 50% of players in the database are expected to win more points and 50% are expected to win fewer), 25th percentile (75% are expected to win more, 25% are expected to win fewer), 10th  (90% more, 10% less), 5th, and 1st percentiles. I subtracted the number of points these players were supposed to win from the number of points each player in the team’s starting XI was expected to win, and calculated a “Points Above Replacement” score.

As an example, Djamel Mesbah is the 25th percentile player in defense. If he played Left Back for Arsenal, they would be expected to earn 73.3 points. Arsenal’s left back in my model is Nacho Monreal, and with him they are expected to earn 78.8 points. Subtracting 73.3 from 78.8 gives me 5.5 points, giving Monreal a +5.5 PAR.
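The percentile lookup and the subtraction can be sketched as follows. This is a minimal Python illustration: the two point totals for Mesbah and Monreal come from the example above, while the other candidates, their values, and the function names are hypothetical.

```python
def percentile_player(expected_points, pct):
    """Return the (name, points) pair at roughly the given percentile.

    expected_points maps each candidate to the team's expected points with that
    player in the lineup. pct=25 means about 25% of candidates score lower.
    """
    ranked = sorted(expected_points.items(), key=lambda kv: kv[1])
    index = min(len(ranked) - 1, int(len(ranked) * pct / 100))
    return ranked[index]


def points_above_replacement(points_with_player, points_with_replacement):
    """PAR: expected team points with the player minus points with the replacement."""
    return points_with_player - points_with_replacement


# Arsenal's expected points with each candidate at left back.
# Only Mesbah's 73.3 is from the post; the other entries are invented.
candidates = {
    "Defender A": 70.1,
    "Djamel Mesbah": 73.3,
    "Defender C": 75.0,
    "Defender D": 77.2,
}
replacement_name, replacement_points = percentile_player(candidates, 25)

# Monreal's 78.8 expected points is the figure quoted in the post.
par_monreal = round(points_above_replacement(78.8, replacement_points), 1)
print(f"Replacement: {replacement_name}, Monreal PAR: {par_monreal:+.1f}")
```

Changing `pct` to 10, 5, or 1 reproduces the other replacement levels discussed in this post, given a full table of expected points.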

I repeated this process for all players, then tried several other replacement levels to see what works, and came up with some interesting results. Arsenal’s plots are below:

PAR Arsenal 4

I have each player represented in the bars and then the sum of all other players in the bottom bar. If the goal is to find the improvement of Arsenal’s squad over a team of generic replacement players, then I think the answer is somewhere between the 5th and 10th percentiles. Arsenal is expected to win around 80 points, and if a typical relegation team is worth around 35 points, then we’d expect Arsenal’s squad to be worth ~45 points more than a replacement squad that would be expected to get relegated from the EPL. The 1st percentile is too much (~65 points above replacement seems a bit high for any reasonable replacement), and the 25th percentile might be a little low (~25 points above replacement might not be enough).

I repeated the analysis for Chelsea, and found something similar:

PAR Chelsea 4 plots

Chelsea’s squad rates a little more highly here – somewhere around 53 points above replacement at the 5th percentile, and 45 PAR at the 10th percentile, but the results are pretty consistent here.

There’s a lot more to be done, but this is a good first cut at the data I think.

  1. There is obviously a little more to baseball strategy than this, but the point is that it’s a much more clear-cut case than soccer.

Technical Details on “Stats Notes: R-Squared Isn’t The Right Measure”

I tried to write “Stats Notes: R-Squared Isn’t The Right Measure” for a general audience, but I wanted to give a few more technical details, the R code, and a pointer to an expanded version of this argument, which some co-authors and I made in a political context in a Political Science methods paper here.

To be clear, there is no meaningful reason to calculate the R^2 unless you’re calculating a linear model of some sort. It measures goodness of fit off of a linear model, and the linear model matters here. If you’re merely looking at how well one measure correlates with another, then you should just use Pearson’s r and this rough guide to whether a correlation coefficient is good or not.  If you want to calculate R^2 on a bivariate model, just calculate Pearson’s r and then square it.
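As a quick check of that last point, here is a small sketch using the same kind of simulated data as the figure code further down. The variable names `r` and `r2` are just for this example.

```r
set.seed(20)
y <- runif(20, min = 35, max = 95)    # simulated final point totals
x <- y + rnorm(20, mean = 0, sd = 5)  # simulated predicted point totals

r  <- cor(x, y)                       # Pearson's r
r2 <- summary(lm(y ~ x))$r.squared    # R^2 from the bivariate linear model

# In a one-predictor model with an intercept, these agree
# up to floating point error.
print(all.equal(r^2, r2))
```

This identity only holds for the bivariate case with an intercept; once more predictors are added, R^2 is no longer the square of any single pairwise correlation.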

The typical NHST framework is about rejecting the null hypothesis, almost exclusively the null that x has no effect on y (trying to reject the claim that the slope of “x” is 0). However, in this context we often have more specific expectations – we don’t just want an effect of x on y, we want a specific effect: that x and y have a 1:1 relationship. 1 As such, we need to test the model against a slope of 1, and make sure that the intercept isn’t statistically distinguishable from 0 (because the 1:1 relationship implies a starting point of 0,0). If either of these conditions isn’t true, we’re not predicting well regardless of the R^2.
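A minimal sketch of what those two conditions look like in R, on simulated data. Putting x in as an offset means the coefficient R reports for x is the estimated slope minus 1, so its usual t-test becomes a test against a slope of 1. The joint F-test at the end is an extra step I am adding for illustration (it is not part of the figure code), and the names `fit_free`, `fit_offset`, `f_stat`, and `p_joint` are made up for this example.

```r
set.seed(20)
y <- runif(20, min = 35, max = 95)    # simulated final point totals
x <- y + rnorm(20, mean = 0, sd = 5)  # simulated predicted point totals
n <- length(y)

fit_free   <- lm(y ~ x)               # null hypotheses: intercept = 0, slope = 0
fit_offset <- lm(y ~ x + offset(x))   # null hypotheses: intercept = 0, slope = 1

# The offset model reports (slope - 1) for x; the intercept is unchanged.
print(coef(fit_free))
print(coef(fit_offset))
print(summary(fit_offset)$coefficients)

# Joint test of intercept = 0 AND slope = 1: compare the fitted line
# against the fixed line y = x, which has no free parameters.
rss_restricted <- sum((y - x)^2)
rss_free       <- sum(resid(fit_free)^2)
f_stat  <- ((rss_restricted - rss_free) / 2) / (rss_free / (n - 2))
p_joint <- pf(f_stat, df1 = 2, df2 = n - 2, lower.tail = FALSE)
print(c(F = f_stat, p = p_joint))
```

A small p-value on either individual coefficient, or on the joint test, is evidence against the 1:1 relationship even when the R^2 is high.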

I haven’t looked at this paper for a while, but some co-authors and I did a full technical explanation of this in a political context and you can read it here if you’re interested in an even longer, more technical discussion of these points.

Also, here is the R code to generate all the figures with simulated data (except for my model because I don’t have the data in a shareable format right now). Notice the “offset=1*x” part of the lm command: it tells R to test the slope against 1.0 rather than 0.0, so the coefficient it reports for x is the estimated slope minus 1.


# Top Graph

set.seed(20) # setting the random seed for replication purposes
y=runif(20, min = 35, max = 95) #Simulated EPL final point totals

x=y+rnorm(20, 0, sd=5) # Simulated EPL Predicted Point Totals

df<-summary(lm(y~x)) # Regression output with b=0 as the null hypothesis
df.r<-round(df$r.squared, 2)
txt<-paste("R-Squared is", df.r)

summary(lm(y~x, offset=1*x)) # Regression output with b=1 as the null hypothesis

plot(x,y, ylim=c(0, 100), xlim=c(0,100), ylab="Actual Points", xlab="Predicted Points") #Scatterplot

text(x=80, y=10, txt, cex=0.7, font=2)

abline(a=0, b=1) # line with intercept of 0, slope of 1

y=runif(20, min = 35, max = 95) #Simulated EPL final point totals

x=y+rnorm(20, 0, sd=45) # Simulated EPL Predicted Point Totals

df<-summary(lm(y~x)) # Regression output with b=0 as the null hypothesis
df.r<-round(df$r.squared, 2)
txt<-paste("R-Squared is", df.r)

summary(lm(y~x, offset=1*x)) # Regression output with b=1 as the null hypothesis

plot(x,y, ylim=c(0, 100), xlim=c(0,100), ylab="Actual Points", xlab="Predicted Points") #Scatterplot

text(x=80, y=10, txt, cex=0.7, font=2)

abline(a=0, b=1) # line with intercept of 0, slope of 1
# y = x + 0

par(mar=c(4,4,2,2)) # increase y-axis margin.

set.seed(20) # setting the random seed for replication purposes
y=runif(20, min = 35, max = 95) #Simulated EPL final point totals

x=y+rnorm(20, 0, sd=5) # Simulated EPL Predicted Point Totals

df<-summary(lm(y~x)) # Regression output with b=0 as the null hypothesis
df.r<-round(df$r.squared, 2)
txt<-paste("R-Squared is", df.r)

summary(lm(y~x, offset=1*x)) # Regression output with b=1 as the null hypothesis

plot(x,y, ylim=c(0, 100), xlim=c(0,100), ylab="Actual Points", xlab="Predicted Points") #Scatterplot

text(x=80, y=10, txt, cex=0.7, font=2)

abline(a=0, b=1) # line with intercept of 0, slope of 1

# y = x - 30

y=runif(20, min = 35, max = 95) #Simulated EPL final point totals

x=y+rnorm(20, 30, sd=5) # Simulated EPL Predicted Point Totals

df<-summary(lm(y~x)) # Regression output with b=0 as the null hypothesis

df.r<-round(df$r.squared, 2)

txt<-paste("R-Squared is", df.r)

summary(lm(y~x, offset=1*x)) # Regression output with b=1 as the null hypothesis

plot(x,y, ylim=c(0, 130), xlim=c(0,130), ylab="Actual Points", xlab="Predicted Points") #Scatterplot

text(x=100, y=10, txt, cex=0.7, font=2)

abline(a=0, b=1) # line with intercept of 0, slope of 1

abline(a=-30, b=1, lty=2)

# y = 0.3x-7.46

y=runif(20, min = 35, max = 95) #Simulated EPL final point totals

x=y/0.3+rnorm(20, 0, sd=10) # Simulated EPL Predicted Point Totals
df<-summary(lm(y~x)) # Regression output with b=0 as the null hypothesis

df.r<-round(df$r.squared, 2)

txt<-paste("R-Squared is", df.r)

summary(lm(y~x, offset=1*x)) # Regression output with b=1 as the null hypothesis

plot(x,y, ylim=c(0, 300), xlim=c(0,300), ylab="Actual Points", xlab="Predicted Points") #Scatterplot

text(x=250, y=10, txt, cex=0.7, font=2)

abline(a=0, b=1) # line with intercept of 0, slope of 1

abline(coef=df$coefficients, lty=2)


  1. Often we don’t have that, such as @226Blog’s excellent work on SPFL team ratings. There’s no direct 1:1 expected relationship between team rating and points per game, nor would we expect there to be one because of how the team rating is calculated.