Arsenal Takes Back the Lead~! Week 7 Expected Table and Over-achievers

Week 7 results are in, and we have a new leader at the top of the table. With their tricky away win over Leicester, combined with Manchester City’s loss to Spurs, for the first time since week 3, Arsenal claims the top spot in my expected final table.

EPL Table Week 7-2

My model continues to miss the boat on Everton, as they are now the highest ranked team in the over-achiever table (performance compared to expectation). I’m a big believer that West Ham is just statistical noise and they’ll be back to mid-table in no time, but I sincerely think the model is off on Everton. I’ll do some exploratory data analysis later, but that’s my instinct.

Deviation Table Week 7-2

Chelsea’s draw against fellow underachievers Newcastle drops them back another point or so, and I’m sincerely worried about Sunderland. They weren’t predicted to do that well in the pre-season model, and they’ve managed to underachieve even those low expectations.

Model fit is still pretty good: the fitted line is indistinguishable from a slope of 1 and an intercept of 0, and r = 0.55, which isn’t bad this early in the season. Teams above the line are over-achieving, teams under it are under-achieving. Most teams fit pretty well, which makes me confident in my model this early in the season.
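For anyone who wants to run this kind of check on their own numbers, here is a minimal sketch, assuming the fit is an ordinary least-squares line of actual points on expected points. The point totals below are made up for illustration and are not the model's output.

```python
from math import sqrt

def fit_line(expected, actual):
    """Least-squares fit of actual = intercept + slope * expected, plus Pearson r."""
    n = len(expected)
    mean_x = sum(expected) / n
    mean_y = sum(actual) / n
    sxx = sum((x - mean_x) ** 2 for x in expected)
    syy = sum((y - mean_y) ** 2 for y in actual)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(expected, actual))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical expected vs. actual points after 7 games (illustrative only).
expected_pts = [14.0, 13.5, 12.0, 10.5, 9.0, 8.0, 6.5, 5.0]
actual_pts = [16, 15, 10, 13, 7, 9, 5, 6]

slope, intercept, r = fit_line(expected_pts, actual_pts)
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.2f}")

# A team sits above the line when its actual points exceed the fitted value.
residuals = [a - (intercept + slope * e) for e, a in zip(expected_pts, actual_pts)]
```

A positive residual is an over-achiever and a negative one is an under-achiever, which is the same reading as the plot.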

Deviation Plot Week 7-2






 

 

Will Barcelona Be Fine Without Messi?

The big news this weekend was obviously Lionel Messi’s injury, and the question is what Barcelona will do without him. Michael Caley issued his take at the Washington Post saying that Barcelona will largely be fine without him, and ultimately I agree. I ran Barcelona through my transfer simulator, and found that Messi gives Barcelona on average an approximately 5% greater chance of winning each game he plays in over his replacement (El Haddadi). Over 8 games that translates to a little less than a point.

Messi Barplot

That’s not a big difference, so Barcelona shouldn’t really worry too much. As great as he is, the rest of the team is strong enough (and the rest of La Liga is weak enough) that it shouldn’t make a huge difference. However, the average doesn’t tell the full story and if we’re going to call what we do “fancy stats” then we need to do better than just calculating the average and moving on.

To see the potential loss of points, I ran 100,000 simulations of an 8 game window for Barcelona with Messi, and then ran 100,000 simulations of an 8 game window for Barcelona without Messi. I added up the total number of points for each of the 200,000 windows, and compared each window with Messi to the corresponding one without Messi. Here’s what I found.
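The setup above can be sketched roughly as follows. This is only an illustration: the per-game win and draw probabilities are placeholders I picked so that Messi adds 5 percentage points of win probability, and they do not come from the transfer simulator.

```python
import random

def simulate_window(p_win, p_draw, n_games=8, rng=random):
    """Total points from one n_games window with fixed per-game probabilities."""
    points = 0
    for _ in range(n_games):
        u = rng.random()
        if u < p_win:
            points += 3
        elif u < p_win + p_draw:
            points += 1
    return points

def simulate_many(p_win, p_draw, n_sims, seed):
    """Run n_sims independent 8-game windows with a seeded generator."""
    rng = random.Random(seed)
    return [simulate_window(p_win, p_draw, rng=rng) for _ in range(n_sims)]

# Hypothetical probabilities: the extra 5 percentage points of win
# probability with Messi are taken from the draw probability here.
N_SIMS = 100_000
with_messi = simulate_many(p_win=0.75, p_draw=0.15, n_sims=N_SIMS, seed=1)
without_messi = simulate_many(p_win=0.70, p_draw=0.20, n_sims=N_SIMS, seed=2)

# Pair each window with Messi to the corresponding window without him.
diffs = [w - wo for w, wo in zip(with_messi, without_messi)]
mean_diff = sum(diffs) / N_SIMS
print(f"mean gain with Messi: {mean_diff:.2f} points over 8 games")
```

The list `diffs` is what the difference plots are built from: one points gain or loss per pair of simulated windows.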

The first plot isn’t the easiest to read, so let me explain: the height of the line (measured by the y-axis) represents the proportion of simulations where Barcelona earns a specific number of points (represented on the x-axis). The dark blue, solid line represents Barcelona with Messi, the light blue, dashed line, represents Barcelona without Messi. Notice the higher light blue line at lower values on the x-axis: this shows Barcelona is significantly more likely to earn lower numbers of points without Messi. Then notice the higher dark blue line at higher values on the x-axis, showing Barcelona is more likely to earn higher numbers of points with Messi.

Messi Density Two Lines

The next plot is a density plot showing the difference between the two scenarios: how often will Barcelona gain/lose certain point values when Messi plays compared to when Messi doesn’t play?  The y-axis represents the frequency of each result in my simulations, and the x-axis represents the number of points Barcelona gained in each of the simulations.

Messi Density Plot Diff

The distribution looks roughly normal (a rough “bell curve”) with a mean around 1, but the standard deviation is somewhere around 5. Barcelona is more likely to do well when Messi plays compared to when he doesn’t, but there is significant variation here. They could easily improve by 5 points, and there’s a 5% chance they improve by more than 11 points (out of 24) with Messi in the lineup compared to El Haddadi.

The final plot I’m going to call the “luck” index. It shows the percentile of each possible point gain/loss based on Messi playing compared to El Haddadi. Basically, depending on Barcelona’s luck, this is how many points they can expect to gain/lose.

Messi Out

Once again, this represents the difference in Messi vs. El Haddadi at every 10th percentile. So you can see if Barcelona gets really lucky (best 10% of all simulations) they will pick up 7 points with Messi. Even slightly above average luck (70th percentile, or 30% of the time) they gain 3 points with Messi playing over 8 games.
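A "luck" index like this boils down to reading off every 10th percentile of the simulated differences. Here is a small sketch using a nearest-rank percentile; the win and draw probabilities are placeholders rather than the model's real outputs, and the sample is kept small so it runs quickly.

```python
import random

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (pct from 0 to 100)."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = max(1, (len(sorted_values) * pct + 99) // 100)  # ceiling of n*pct/100
    return sorted_values[rank - 1]

def window_points(rng, p_win, p_draw, n_games=8):
    """Points earned over one window of n_games."""
    pts = 0
    for _ in range(n_games):
        u = rng.random()
        pts += 3 if u < p_win else (1 if u < p_win + p_draw else 0)
    return pts

rng = random.Random(42)
# Hypothetical probabilities: with Messi (0.75 win, 0.15 draw) versus
# with his replacement (0.70 win, 0.20 draw).
diffs = sorted(
    window_points(rng, 0.75, 0.15) - window_points(rng, 0.70, 0.20)
    for _ in range(20_000)
)

luck_index = {pct: percentile(diffs, pct) for pct in range(10, 100, 10)}
for pct, gain in luck_index.items():
    print(f"{pct}th percentile: {gain:+d} points with Messi")
```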

I largely agree with Michael Caley and Goalimpact’s analyses1 that Barcelona will be fine. But 8 games isn’t nearly enough for the law of large numbers to kick in, and in the short-term the probability distributions could hurt Barcelona substantially, especially with the razor-thin margins in La Liga.

 






  1. @Goalimpact on Twitter is a great resource, and his website is really interesting and creative for football stats fans

Fabregas Doesn’t Work: Replace him with a Defensive Mid

Quick blog post here looking at Chelsea’s form today, and one of the big narratives has been Cesc Fabregas’s underperformance. While this may be true, I’d argue that the bigger problem is that he’s a poor fit for Chelsea in the role he’s playing.

I ran the numbers, looking at Points Above Replacement for Fabregas’s position in Chelsea’s lineup, and unsurprisingly most of the top 10 were pure holding midfielders, or at least more defensive minded midfielders.  Here’s the graph:

PAR Cesc Fabregas

Out of the top 10 options, Marco Verratti and Robbie Brady are the only two that aren’t known for playing deeper. Even if Fabregas plays well, he’s out of place in Chelsea’s lineup and they should strongly consider replacing him/selling him in January.






Game Day 7 Predictions: Arsenal, City, and Chelsea Have Tough Away Fixtures

Potentially a big week this week at the top of the table. Chelsea, Arsenal, and Man City all have difficult away fixtures so any win could make a significant difference in the end of season table.

Meanwhile Manchester United is a big favorite, and Liverpool is an equally big favorite in a must-win game for Brendan Rodgers. Southampton fans should be happy with the model’s predictions of a win over Swansea, and West Ham’s a big favorite over Norwich City which almost certainly means we can expect them to lose.







Liverpool: Don’t Fire Brendan Rodgers

In the same vein as my post: “Everton: Don’t Fire Roberto Martinez”, I wanted to encourage Liverpool to not fire Brendan Rodgers. The story is that he’s under-performing and Liverpool would be much better with someone else (Klopp?) in charge. So I wanted to look at the numbers to see if they are actually under-performing.  They’re currently in 13th place with 8 points through 6 matches, which is obviously lower than Liverpool would want, but the current table is irrelevant. Does anyone think West Ham and Leicester City will be in the Champions League at year’s end? Or that Chelsea and Arsenal will be excluded? I’m an advocate for ignoring the current league table until we at least get to January, maybe longer because there’s just too much noise right now both in terms of a small N problem with results and wide variation in strength of schedule.

First, to see what expectations for Liverpool in the pre-season were, let’s look at my initial predictions for them:

Final Table August 10

As of August 10, I had them in 5th place, earning ~62 points or so.  Looking at this table, that seems about right. They’re probably a step outside of the top 4, certainly a couple steps out of the top 3, but they’re likely better than Spurs or Southampton. Safe Europa qualification, with Champions League being a bit of a stretch. Now let’s look at the current predicted table:

Predicted Table Week 6

As of September 20, they’re in 5th place, earning ~60 points or so. The big disappointment is that they’re seeing their Champions League hopes fade fast, but that’s due more to Manchester United out-performing expectations than anything else. They’re in exactly the same spot they were in pre-season, so it’s hard to fault Brendan Rodgers for that. Now a look at the under-achievers table:

Deviation Week 6

Liverpool is mid-table here, but they’re only under-performing by a point or two. Also, the reality here is that West Ham, Leicester, and Man City will all be regressing to the mean soon enough, and when that happens Liverpool can expect to get back to the European places.

The problem at Liverpool isn’t Brendan Rodgers, it’s that Liverpool just isn’t as good as the anti-Rodgers crowd wants them to be. They haven’t really found a replacement for Gerrard, and anyone who could actually replace Luis Suarez is unlikely to want to sign for Liverpool. Those are far bigger issues, and have little to do with Brendan Rodgers.






Striker Similarity Data Vis: Feedback Requested

I’m in the very early stages of a new project that could potentially be interesting: visualizing similarity of players through multi-dimensional scaling. The quick version of the method is to take all the player stats I have, calculate the Euclidean distance between every pair of players, and scale those distances down into two dimensions. Then I can plot those points in a typical Cartesian plane. Theoretically, players close to each other in the plot should have similar stats.
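To make the idea concrete, here is a small sketch of classical multi-dimensional scaling in plain Python. It is one standard way to do it and an assumption on my part, not necessarily the exact routine behind the plots: standardize the stats, compute pairwise Euclidean distances, double-centre the squared distances, and keep the top two eigenvectors (found here with simple power iteration). The players and their stats are hypothetical.

```python
from math import sqrt

def standardize(rows):
    """Z-score each stat column so no single stat dominates the distances."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_eigenpair(matrix, iterations=500):
    """Largest eigenvalue/eigenvector of a symmetric PSD matrix by power iteration."""
    n = len(matrix)
    vec = [float(i + 1) for i in range(n)]
    value = 0.0
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return 0.0, vec
        vec = [x / norm for x in nxt]
        value = norm
    return value, vec

def classical_mds(rows, dims=2):
    """Embed rows into `dims` dimensions, preserving pairwise Euclidean distances."""
    pts = standardize(rows)
    n = len(pts)
    d2 = [[euclidean(p, q) ** 2 for q in pts] for p in pts]
    row_mean = [sum(r) / n for r in d2]
    grand = sum(row_mean) / n
    # Double-centre the squared distances to get the Gram matrix B.
    b = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    coords = [[] for _ in range(n)]
    for _ in range(dims):
        value, vec = top_eigenpair(b)
        scale = sqrt(max(value, 0.0))
        for i in range(n):
            coords[i].append(vec[i] * scale)
        # Deflate so the next pass finds the next-largest eigenpair.
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)]
             for i in range(n)]
    return coords

# Hypothetical per-90 stats: shots, tackles, saves (illustrative only).
players = {
    "Striker A": [4.1, 0.5, 0.0],
    "Striker B": [3.8, 0.7, 0.0],
    "Defender A": [0.4, 3.9, 0.0],
    "Keeper A": [0.0, 0.1, 3.2],
}
coords = dict(zip(players, classical_mds(list(players.values()))))
for name, (x, y) in coords.items():
    print(f"{name:11s} x={x:+.2f} y={y:+.2f}")
```

The axes that come out have no built-in meaning, and their signs and orientation are arbitrary. Only the distances between players carry information.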

The proof-of-concept was doing it with all of the players in my dataset, and highlighting them by position. I see really good clustering with goalkeepers, good clustering with strikers and defenders, and middling clustering with midfielders.1 Here’s the plot:

MDS Plot First Cut

The next plot is where I need help. The dimensions in this type of plot aren’t necessarily meaningful, but you can see in the full plot that lower right tends to be defenders, left center tends to be attackers, and midfielders are kinda sorta upper right.  Here’s the zoomed in striker plot with selected players highlighted:

MDS Strikers Text

I labeled some major names, some names I find interesting, and then tried to highlight some names on the margins of the plot. When you do this type of plot, you don’t define the dimensions, but they potentially mean something.  So I’d appreciate any thoughts anyone has on what we might be seeing here given the players I’ve labeled. Tweet them to me @Soccermetric and life will be good.






  1. This makes sense because wingers are basically attackers, and holding midfielders are similar to defenders. We’d expect to see a holding mid have more in common with a defender than a winger

EPL Game Day 6: Updated Predicted Final Table

Week 6 was a weird one – West Ham has officially taken the lead on the “over-achievers table” and Chelsea is out of the bottom. Watford’s big upset moved them fairly safely out of the relegation zone for now, and Sunderland’s under-performance has put them in serious danger of relegation. Here’s the new predicted EPL Final Table:

Predicted Table Week 6

And here is the over-achievers table:

Deviation Week 6

Model fit took a little bit of a hit this week with all the weirdness (and the fact that virtually all the games were toss-ups), but we’re seeing a slope of 1.0 and an intercept of 0.0 so life is good there.

Deviation Line Week 6






Europe’s Most Valuable Strikers

Today’s post is meant as a follow-up to my post from a week ago detailing each team’s best possible striking option.  In that post, I found that most teams’ best option was to buy Theo Walcott, which was a bit of a surprise but matches up with what Goalimpact says fairly nicely (and apparently received the same “huh…that’s odd” reaction on Twitter). As a reminder, here’s the main chart from that post.

Best Striker

Today I ran a new analysis, looking at each striker’s added value for each of the 20 EPL teams and calculated their points above replacement value over the 25th percentile. What I’ve found is that the PAR measure as I currently calculate it is very team dependent, which I think is useful in some ways: for example, if you were trying to find a replacement for your current striker, you’d want to know who fits into your team best. However, it isn’t as useful for comparisons between players across teams more generally. So what I did was calculate the number of points each team would earn with each striker in the database, just like in my previous post. Then I found what the striker in the 25th percentile would earn, and subtracted that value from each striker’s expected points in the database.

As an example, I calculated the average score for Theo Walcott across all 20 EPL teams, which was ~63.4 points. I then subtracted the striker at the 25th percentile (Deportivo’s Lucas Perez) who earned ~55.5 points, arriving at an overall PAR value of ~7.9 points. I repeated this for all the strikers across the league and here are the top 20 most valuable strikers in Europe.
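The arithmetic in that example can be sketched in a few lines. Only the two averages (~63.4 for Walcott and ~55.5 for Lucas Perez) come from the numbers above. The per-team splits, the two team names, and the other three strikers are hypothetical filler so the example has something to rank, and the nearest-rank percentile is an assumption about how the 25th percentile gets picked.

```python
def average_points(points_by_team):
    """Average a striker's expected points across all teams he was tried in."""
    return sum(points_by_team.values()) / len(points_by_team)

def percentile_baseline(avg_by_striker, pct=25):
    """Return (name, points) of the striker at the given percentile (nearest rank)."""
    ranked = sorted(avg_by_striker.items(), key=lambda kv: kv[1])
    rank = max(1, (len(ranked) * pct + 99) // 100)  # ceiling of n*pct/100
    return ranked[rank - 1]

def par_over_baseline(avg_by_striker, pct=25):
    """Points above the percentile-baseline striker for every striker."""
    _, baseline = percentile_baseline(avg_by_striker, pct)
    return {name: pts - baseline for name, pts in avg_by_striker.items()}

# Expected team points with each striker. Team splits are hypothetical.
expected = {
    "Theo Walcott": {"Team X": 65.4, "Team Y": 61.4},
    "Striker B": {"Team X": 62.0, "Team Y": 60.0},
    "Striker C": {"Team X": 59.0, "Team Y": 57.4},
    "Lucas Perez": {"Team X": 57.5, "Team Y": 53.5},
    "Striker D": {"Team X": 53.0, "Team Y": 51.0},
}

averages = {name: average_points(by_team) for name, by_team in expected.items()}
baseline_name, baseline_pts = percentile_baseline(averages)
par = par_over_baseline(averages)

for name, value in sorted(par.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:13s} PAR {value:+.1f}")
```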

PAR Strikers

Theo Walcott is at the very top of the list, Leo Messi comes in at #4, Zlatan at #5, and Cristiano Ronaldo at #11. There are a few surprises on the list, but a lot of it looks like you’d expect. One could take issue with individual ratings, but keep in mind these are based on expected statistics (shots on goal are included but not goals scored1) and on a model trained on EPL data, so you’d expect some changes if this was done in La Liga or Serie A.

Also, it’s interesting to me that after Walcott, the next 7 positions are virtually tied, and then the last 12 are virtually tied with each other. Someone like Ronaldo who finishes his chances significantly above expectation could easily move up 5 spots or more, so I understand why he’s lower than one might think.

The other interesting thing to me is that Glenn Murray finished as the #1 option for a few teams, but didn’t make the top 20 overall. That tells me he’s a really good option for a few teams, and a fairly bad one for many of the others.

This is just the latest iteration of the PAR project. I’m really in an exploratory data analysis mode right now, so I’m open to any feedback people have on other ways to do this. Follow me on Twitter @Soccermetric.

 






  1. Including goals scored causes the broader model to overfit and skews things in any number of ways for non-strikers.

Experimenting with Points Above Replacement (PAR)

I’ve been working on a Points Above Replacement (PAR) measure for soccer, and there are plenty of challenges. Baseball is an easier game to do stats with – it’s a relatively closed system, one batter, one pitcher, one fielder per play.1 Soccer creates a new challenge, so I’ve been experimenting with my measure.

Similar to the “Each Team’s Best Striker” post from a couple of weeks ago, I trained my Support Vector Machine (SVM) on 2014 league data across the 5 major European Leagues, then read in player stats from those leagues and the English Championship to a separate database. I started with the first player of each team, and substituted each player who plays the same position in the database for the original player, calculating the new predicted points in the SVM. After finishing all players in the dataset for the first player, I move on to the second, third, fourth, etc. until I’ve finished the team.

For this analysis, I put all the players in the database in order from highest expected point total to lowest point total. Then I found the 50th percentile (the player where 50% of players in the database are expected to win more points and 50% are expected to win fewer), 25th percentile (75% are expected to win more, 25% are expected to win fewer), 10th (90% more, 10% fewer), 5th, and 1st percentiles. I subtracted the number of points these players were supposed to win from the number of points each player in the team’s starting XI was expected to win, and calculated a “Points Above Replacement” score.

As an example, Djamel Mesbah is the 25th percentile player in defense. If he played Left Back for Arsenal, they would be expected to earn 73.3 points. Arsenal’s left back in my model is Nacho Monreal, and with him they are expected to earn 78.8 points. Subtracting 73.3 from 78.8 gives me 5.5 points, giving Monreal a +5.5 PAR.
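Here is a rough sketch of that substitute-and-subtract step. The `predict_points` function is a toy additive stand-in, not the trained SVM, and every rating, the three-man lineup, and the "Defender" pool names are hypothetical. The numbers were chosen only so the example reproduces the 78.8, 73.3, and +5.5 figures above.

```python
def predict_points(lineup):
    """Hypothetical stand-in for the trained model: lineup -> expected points."""
    return 40.0 + sum(player["rating"] for player in lineup.values())

def points_with_substitute(lineup, position, substitute):
    """Expected points after swapping one player into the given position."""
    trial = dict(lineup)
    trial[position] = substitute
    return predict_points(trial)

def replacement_at_percentile(lineup, position, pool, pct):
    """Rank pool players by the points the team earns with them; pick the pct-th percentile."""
    ranked = sorted(pool, key=lambda p: points_with_substitute(lineup, position, p))
    rank = max(1, (len(ranked) * pct + 99) // 100)  # nearest rank, ceiling
    return ranked[rank - 1]

def par(lineup, position, pool, pct=25):
    """Points Above Replacement for the current starter at `position`."""
    replacement = replacement_at_percentile(lineup, position, pool, pct)
    return predict_points(lineup) - points_with_substitute(lineup, position, replacement)

# Trimmed-down lineup with made-up ratings (illustrative only).
arsenal = {
    "LB": {"name": "Nacho Monreal", "rating": 8.8},
    "CB": {"name": "Centre Back", "rating": 15.0},
    "ST": {"name": "Striker", "rating": 15.0},
}
left_backs = [
    {"name": "Defender W", "rating": 1.0},
    {"name": "Djamel Mesbah", "rating": 3.3},
    {"name": "Defender X", "rating": 5.0},
    {"name": "Defender Y", "rating": 6.5},
    {"name": "Defender Z", "rating": 9.5},
]

baseline = replacement_at_percentile(arsenal, "LB", left_backs, 25)
monreal_par = par(arsenal, "LB", left_backs, 25)
print(f"baseline: {baseline['name']}, Monreal PAR: {monreal_par:+.1f}")
```

Swapping `pct` between 50, 25, 10, 5, and 1 gives the different replacement levels compared in the plots.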

I repeated this process for all players, and then some other level players to see what works and came up with some interesting results. Arsenal’s plots are below:

PAR Arsenal 4

I have each player represented in the bars and then the sum of all other players in the bottom bar. If the goal is to find the improvement of Arsenal’s squad over a team of generic replacement players, then I think the answer is somewhere between the 5th and 10th percentiles. Arsenal is expected to win somewhere around 80 points, and if a typical relegation team is worth somewhere around 35 points or so, then we’d expect Arsenal to be ~45 points above a replacement squad that is expected to get relegated from the EPL. The 1st percentile is too much (~65 points above replacement seems a bit high for any reasonable replacement), and the 25th percentile might be a little low (~25 points above replacement might not be enough).

I repeated the analysis for Chelsea, and found something similar:

PAR Chelsea 4 plots

Chelsea’s squad rates a little more highly here – somewhere around 53 points above replacement at the 5th percentile, and 45 PAR at the 10th percentile, but the results are pretty consistent here.

There’s a lot more to be done, but this is a good first cut at the data I think.

  1. There is obviously a little more to baseball strategy than this, but the point is that it’s a much clearer case than soccer

Each EPL Team’s Best Possible Striking Option in One Chart~!

I’ve been working on spreadsheets of the value of every player for every EPL team, and it’s FINALLY done.1 If you’re interested in the method you can read more at http://soccer.chadmurphy.org/methods/arsenals-room-to-improve/, but the quick version is that I calculate the number of points each team is expected to earn using a Random Forest model. Then I replace their striker with each striker in my database, re-running the Random Forest and calculating the new expected points. I sort them by number of expected points, list the top striker in this plot, and plot the number of points the team would be expected to gain with that striker. The plot is available below:
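In outline, the loop looks something like the sketch below. The `predict_points` function is a made-up formula standing in for the Random Forest, with a "style fit" penalty added purely so that different teams can end up with different best options. All teams, strikers, and numbers are hypothetical.

```python
def predict_points(team, striker):
    """Hypothetical stand-in for the model: expected points for team + striker."""
    fit_penalty = 4.0 * abs(team["style"] - striker["style"])
    return team["base"] + striker["quality"] - fit_penalty

def best_striker(team, strikers):
    """Re-predict with every candidate; return (name, expected gain) for the best one."""
    current = predict_points(team, team["striker"])
    ranked = sorted(strikers, key=lambda s: predict_points(team, s), reverse=True)
    top = ranked[0]
    return top["name"], predict_points(team, top) - current

# Illustrative database of candidate strikers.
strikers = [
    {"name": "Striker A", "quality": 12.0, "style": 0.8},
    {"name": "Striker B", "quality": 10.0, "style": 0.2},
    {"name": "Striker C", "quality": 8.0, "style": 0.5},
]
# Illustrative teams, each with its current striker.
teams = {
    "Team X": {"base": 50.0, "style": 0.8,
               "striker": {"name": "Current X", "quality": 7.0, "style": 0.6}},
    "Team Y": {"base": 40.0, "style": 0.1,
               "striker": {"name": "Current Y", "quality": 6.0, "style": 0.3}},
}

results = {name: best_striker(team, strikers) for name, team in teams.items()}
for name, (striker, gain) in results.items():
    print(f"{name}: best option {striker}, expected gain {gain:+.1f} points")
```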

Best Striker

There are a couple of interesting things to note: first is the relatively low number of points each new striker would add to each of these teams. Theo Walcott would be a huge upgrade to Sunderland and Spurs, but the average gain is only 7 points. This could mean two things: most teams already have a fairly good striker, or strikers are generally overvalued in terms of their contribution to the team.  Or maybe I’m undervaluing how much 7 points is for one team to improve based on signing one player.

The second thing is how popular Theo Walcott is in this model. He’s the best striking option for 7 of the 20 EPL teams, which is remarkable considering he’s not considered a top tier elite striker. Lionel Messi was the next most common best option, and was almost always in the top 10 choices for the rest of the teams, giving me a nice validity check on my model. Bournemouth’s Glenn Murray was also really popular, which was surprising, although Messi was considered an upgrade over him for Bournemouth.

This was an interesting exercise, and is the first check on the way to my “Points Above Replacement” measure (PAR). Thinking about doing defenders next.

Follow me on Twitter @Soccermetric

  1. Each team takes ~300 minutes to run. Multiply that times 20 teams plus a couple extras for coding errors along the way/accidentally deleting files and just generally being sloppy…