Intro To Soccer Analytics YouTube “Class”

For anyone who is interested, here is the beginning of my Intro to Soccer Analytics “class.” We cover some introductory topics here: picking a question, the importance of theory, and difference of means tests, and we answer the question of whether Liverpool has improved under Jurgen Klopp.

If you enjoy it, please like the videos, subscribe to my channel, and tell a friend about it. I’m doing this in my free time and all I ask is that you help get the word out about this course and help me get as many people interested as we can!







Please Everyone: Slow Down, Explain Your Methods

I love reading all the interesting and impressive analytics work being done out there. There are so many people doing so many interesting things in Soccer Analytics Twitter™, and it’s truly amazing to me how much great work is being done. I wanted to make one suggestion to the community though: slow down, explain what you’re doing step-by-step, and be as clear as possible.

I have a Ph.D. in Political Science, with a minor field of quantitative methodology. I teach two different research methods courses at a university, have won awards for articles in methods journals, and have worked as a statistical consultant for multiple political organizations. I don’t say this because I think it’s particularly impressive. I say it to establish that I’m at least above the average blog reader in my understanding of math and statistics, yet in a significant number of the articles I read, I can’t figure out what people are doing. If I can’t understand, I assume I’m not the only one. So I wanted to offer some advice for public discussion of statistical methods, taken from my own experience and 7 years of teaching.

Slow down

The math in most cases isn’t particularly complicated – most people are calculating some average or comparing differences in two averages, and I rarely read anything that uses math beyond high school algebra. To be clear, I don’t mean this as a criticism. I’m not at all a believer in fancy stats for the sake of fancy stats, but the actual math isn’t complicated. However, when you rush through several steps of the process, or present a formula and then move on to your next point without explaining it, you’re going too fast.

Take a minute and explain the formula fully in words, step by step and point by point. Devote a full paragraph to it, making sure that the reader could re-create exactly what you’re doing without having to infer any of the steps.

Slow down some more

When you think you’ve slowed down enough and have explained it thoroughly enough, you still probably haven’t. There’s a company out there that asks potential employees to describe how they’d cook an egg. Some people say “You toss the egg in the pan, wait a couple minutes, and then put it on a plate.” Those people don’t get hired. Others say “You get an egg out of the refrigerator. Then you put a small pan on top of a burner on the stove. Then you turn the burner on to high, waiting 2 minutes to let the pan get hot. Then you crack the egg on the side of the pan, separate the shell, and drop the inside into the pan,” and so on. These are the people who get hired, and this is the level of detail you should aspire to.

Create a separate paragraph with the formula itself. Explain each term in the formula. Walk the reader through a sample calculation. Explain the results.

Never Use Jargon when Regular Words Work

Jargon is created for specialists who need to communicate specific concepts to each other in a very clear, precise way. That’s probably not what you’re trying to do when you blog. Worse, when you start using technical terms without explaining them, you lose a percentage of that audience with every term. Maybe you think xG is a ubiquitous abbreviation that everyone has heard, but a percentage of your readers haven’t. I read articles that reference PDO all the time and for some reason I can never remember what it means. Maybe I’m the only one, but I doubt it. Maybe you think everyone knows what “regression to the mean” is, but I doubt it. Give a half-sentence explanation every time you use a word or phrase that you wouldn’t use outside of Twitter.

There’s likely more advice to be given here, but starting with these three pieces of advice will help the clarity of a lot of the things I see. You have spent a significant amount of time on your project, so you likely know it better than anyone. That’s both a good thing and a bad thing: you’ve hopefully spent a lot of time thinking it through and have created the best possible measure, but you also likely have a hard time seeing which blanks a new reader will need filled in. If you step back and think about this advice, it will expand both your own audience and the audience for analytics in general, which are both worthwhile goals in my opinion.

Initial Thoughts on my “Intro to Analytics” Class

I posted a “syllabus” for an Intro to Soccer Analytics “class” a while ago, and I’ve been meaning to go forward with it but haven’t pushed ahead because I wasn’t sure about the interest. I posted a call today on Twitter and got a great response.

I have a quick four part syllabus planned where we talk about picking a topic (thanks to @Sam_Jackson for pointing out I already slipped into academic world talking about “research questions”) and the general principles of what analytics should be, the logic of inference, basic quantitative methodology, and then data visualization techniques. If it’s successful I’ll post more videos about different topics, but my goal is to create something where people who want to get involved in creating things can get a foothold in the community, for media and people at clubs to understand the value of analytics and maybe become more educated consumers, and to maybe help build a bigger community.

I have two concerns: the first is that I’m not particularly good on YouTube. My day job is Political Science professor, and I teach a Research Methods course and an Advanced Research Methods course to undergraduates, but I also teach an American Government class partially online. I’m better in person, so I’ve moved the class partially in person so students can get to know me. So I’m going to try to be entertaining and informative at the same time. Hopefully I’ll succeed, and people will enjoy what I’m working on and learn a lot from the whole process.

My second concern is that this will be a fairly big time commitment from me – writing the scripts, putting together examples, editing videos, etc., and this is in addition to my day job (and a new puppy who has more energy in one day than I’ve had in my entire life). So all I’m asking from you all is to share any tweets/videos/posts I make on this topic as much as you can and help get as many people to watch them as possible. Please tell a friend, share your projects from the course, and help me get as many eyes on this as possible. I’m not trying to make any money off of this, but I do get a great deal of satisfaction from knowing people are enjoying what I’m doing so hopefully you all can help me share this and get as big of an audience as possible for these!

I’m overwhelmed by the interest so far. Thanks everyone for your enthusiasm, and I’ll hopefully be posting the first set of videos soon so we can get started.


Statistical Modeling, Knowing Your Limitations, and Some Reflections

I tweeted today about the difference between good and bad models, and the importance of recognizing that fancy stats aren’t necessarily good stats. I wanted to write a little bit about that, post some thoughts about understanding the limitations of models, and reflect on my own model’s successes and misses.

With the semi-regular attacks on statistical models in the media, I see the wagons circle on Analytics Twitter™ and understandably so. People work hard on their models, and as far as I can tell the vast majority of people do it for little or no reward beyond Twitter likes and an increased following. When they feel that work (or the similar work of others) is diminished as done by weaklings in air-conditioned rooms, it’s easy to feel attacked. That becomes a vicious cycle though, where we end up trusting “stats” in general over anything else, regardless of the method. So when we have a dozen or more prediction models out there, all coming to different conclusions, it’s important to evaluate those models to see which one is closest to the truth.

Everyone is entitled to use their own criteria, and really learning how to do it involves more statistical training than I can do in a simple blog post. For me, I don’t trust any model that doesn’t post their methods and predictions publicly. There can be some proprietary elements, but if I don’t have enough info to evaluate how someone came to their conclusions I discount that model’s conclusions heavily.1 I don’t necessarily like models that include salary data because it over-values teams like Manchester City who overpay for players, and makes some general assumptions that higher paid players are better. More than anything, I want to see public predictions and some sort of validation of those predictions. Let me see how well your model does, let me see where it succeeds and where it misses, and hopefully learn from that. Again, people can post what they choose, but I try to be as transparent as possible and I think people have responded to that.

I’ve been fortunate to gain a lot of followers very quickly – this was just supposed to be a fun project for me, a way to learn some machine learning techniques, and maybe people would enjoy it. To have picked up 2100 followers in ~6 months is beyond anything I thought would happen, and I couldn’t be more appreciative of all the people who read and share my work. I’m glad people find what I do interesting, and hope to continue over this year. I have needed to adjust my thinking accordingly, and wanted to post those thoughts/concerns publicly so people could evaluate them and my project accordingly.

My goal isn’t to call out any model or stat specifically, so I want to talk about my model for a minute. I don’t do so out of narcissism, and will focus on the “growth opportunities” as much as I do the successes. You can learn both from being right and being wrong, but in many cases the opportunities to improve are in fixing bad predictions rather than congratulating yourself for your correct ones.

I’ve posted this before, but the original goal of my model was to quantify the contribution of individual players. I experimented with some “Points Above Replacement” metrics, but got some pushback, so I put that on the backburner until I could validate the model and confidently assert its value. So I decided to let my pre-season predictions run the entire season and see how well they do. As of last week I’m leading the 90+ entrants in Scoreboard Journalism’s prediction contest, and was the closest to identifying the black swan that is Leicester City by tapping them for 8th place with 60 points. I’m pleased with the model’s results overall, but I want to mention a few caveats and cautionary notes about my predictions, and to be as transparent as I can so people who follow me better understand what I’m doing. As a reference, here are this week’s predicted probabilities of each team’s final table position.

Week 26-2 Heat Map

Arsenal leads the pack with a ~75% probability of finishing first. Leicester City is in second with a ~15% chance, and Spurs and City both have a ~5% chance of winning the league. Arsenal fans should be happy with this, but there are some caveats here. Here is my diagnostic plot, showing MOTSON’s predicted points vs. the actual points earned.

Week 26 Expected v Actual

Arsenal, United, Southampton, and Man City are all basically on the regression line, which means MOTSON has predicted their points almost exactly through 26 weeks. They’ve all hovered around that line for the first 26 weeks. United was around +5 or so for a while, but quickly slipped back to the mean as their form declined into what it’s been recently. Southampton was around -5, but has improved in recent weeks. Otherwise they’ve all been fairly close to expectations all season. Some temporary deviations are to be expected, so what this means is that my model has a really good handle on exactly how good Arsenal, United, Southampton, and City are. When the title race seemed like a two team race between Arsenal and City, I was very confident in how highly my model rated Arsenal (even when the rest of the world was picking City – a pick that seems to have been validated recently).

My model has done better with Leicester than anyone else’s, but it still underestimates their ability by a significant amount. How much, I’m still not entirely sure. It did like Arsenal to beat them at home, which happened, but it also liked City to beat them, which didn’t happen. To be fair, the simple in-season results model liked City in that match as well so it may have just been an upset, but it’s hard to tell. Regardless, Leicester’s overperformance means the model likely underestimates their “true” ability, which means their predicted likelihood of winning the title is understated. How much? I’m not sure, but I am personally confident the number is higher than 15%.

The same thing goes for Spurs: they’re in the middle of a special season where they’re out-performing expectations. They’re not doing it as much as Leicester obviously, but MOTSON really seems to have underrated them. So their number is probably higher than the 5% chance they’re being given right now, but again, I’m not sure by how much.

I’m torn on whether I want to keep presenting the model’s predictions as is, knowing that the percentages are skewed in Arsenal’s favor. For me, it’s an academic exercise, but it’s taken on more of a following than I anticipated so I wanted to be transparent with what I think is going on with the model. I’m not altering it from the pre-season, so it comes with certain assumptions. Primarily, that Leicester City will play like the 8th best team in the EPL instead of a top 3 team, and that Spurs are a top 6 team but not a top 3 team. Those affect the model’s predictions, and it’s become particularly relevant the last couple of weeks so it’s something people should be aware of when they evaluate my model (and everyone else’s).

I’m confident Arsenal will end the season right around 75 points. I’m not as confident that Leicester will end at 71 (where I’m currently predicting them) or that Spurs will only have 68. The model is presented as such because for this type of work you don’t update just because you want to. That’s not necessarily a bad thing, because it eliminates recency bias. If I had updated whenever I felt like it, I’d have put West Ham in the top 4 early in the year, dropped Southampton into the bottom half of the table six weeks ago, and would have handed City the title at least a half dozen times (like many other modelers did, by the way). Those would have been big mistakes, and would have happened because I trusted my own (flawed) judgment instead of the model’s. Trust in the numbers, but be aware of their limitations. This applies to any statistics you read, including, but not limited to, mine. I’m just more transparent about it than others.


  1. Personally I think people overvalue the proprietary nature of their models – if you’re truly good at statistical modeling you should be able to just outperform other people regardless of how much you share. I would also never pay anyone who isn’t transparent with how they do things, but who am I to tell people how to earn or spend their money?

Information Processing, Statistical Modeling, and “Expertise”

I recently did a guest post over at Scoreboard Journalism showing that statistical models are beating media experts pretty soundly in predicting the EPL table this year, and I saw a pattern in some of the feedback. A number of people suggested that the reason modelers are beating the experts is that the media folks didn’t burn a lot of calories on their predictions, while the modelers spent far more time and energy on the competition. There are a few issues with this that I won’t address here, but I did want to point out the fundamental flaw in the argument: “time spent” isn’t about how long it takes to fill out the form. It’s about how much time you spend taking in information about soccer.

For people who didn’t read the original post, I wanted to post the dataviz showing how the modelers are consistently beating the media. The vertical line represents the simple model of “everything is the same as last year”, a line which virtually no media experts or fans beat, while over 20 models are ahead of that point. If the experts can’t even beat the simple model, that brings their expertise into question.
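To make the “same as last year” baseline concrete, here is a minimal sketch of how a predicted table could be scored against it. The scoring rule (sum of absolute differences between predicted and actual positions) is an assumption chosen for illustration, not necessarily the rule Scoreboard Journalism uses, and the four-team league and all picks are hypothetical.

```python
def table_error(predicted, actual):
    """Sum of absolute differences between predicted and actual positions.

    Both arguments are lists of team names ordered from 1st place down.
    Lower is better; 0 means a perfect prediction.
    """
    actual_pos = {team: pos for pos, team in enumerate(actual, start=1)}
    return sum(
        abs(pos - actual_pos[team])
        for pos, team in enumerate(predicted, start=1)
    )


# Hypothetical four-team league (illustrative data only).
last_season = ["A", "B", "C", "D"]   # naive baseline: repeat last year's table
model_pick = ["B", "A", "C", "D"]    # a statistical model's prediction
pundit_pick = ["A", "C", "D", "B"]   # a media expert's prediction
actual = ["B", "A", "D", "C"]        # how the season actually finished

baseline_error = table_error(last_season, actual)
model_error = table_error(model_pick, actual)
pundit_error = table_error(pundit_pick, actual)

print("baseline:", baseline_error)
print("model:   ", model_error)
print("pundit:  ", pundit_error)
```

In this made-up example the model finishes ahead of the baseline and the pundit finishes behind it, which is the pattern the vertical line in the plot is meant to highlight.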

Week 26 - Prediction Data

Even in the face of this evidence, a number of commenters weren’t satisfied. My best explanation for this is that statistical modelers are facing some serious motivated reasoning: the idea that we reject arguments that disagree with our preconceived notions regardless of the quality of the evidence, while only accepting ones that fit what we already believe. But even if it’s not, I wholeheartedly reject the idea that modelers spend more time on their predictions. Even if they spent more conscious time building the model than the media experts did, that’s not really how information processing works.1

Borrowing from psychology (and more importantly for my background, political psychology), we don’t actively study most topics as if we’re going to take an exam on it for a college course. We also take in far more information than our active memory can actually process and convert to long-term memory. As a way of coping with the overwhelming amount of information we encounter, we resort to something called “online information processing.” Basically what this means is that we keep a running tally in our head of whether we think something is good or bad. We don’t know exactly why we feel the way we do, but we update our preferences with new information as it comes in.

This process works in soccer pretty simply: every time you see a team win you update their information and think they are better, and every time you see a team lose you update their information and think they are worse. Shocking results stand out more for people – Leicester City beating Manchester City made a lot of people update their belief on whether Leicester City could win the title. Injuries, transfers, coaching changes, etc., all add to the running tally in our head and help us update our beliefs. We’re not actively doing anything other than watching soccer or reading articles, but our subconscious mind is updating that running tally with every piece of information that we come across.

I read a lot of articles, tweets, watch a lot of games, and do all the things that media people do, but my model hasn’t changed based on any of this information. I found some data, scraped it, built the model, ran the script to calculate the final table positions, and haven’t touched the model since. If you don’t count the headaches of finding useful public soccer data, I spent less than 10 hours actually building it. That’s more time than the media people spent filling out their predictions and e-mailing them to Simon, but it’s far less time than is involved in online processing and building the running tally that contributed to media experts’ predictions.

Media experts do this for a living and are likely paid quite handsomely to follow soccer. My impression is that most of the modelers follow soccer as a hobby, albeit a fairly obsessive one for most of us. And even with that, we haven’t updated our models since the pre-season based on all of that information. My model was completed sometime in July, updated August 1 with new rosters, and I’ve been done with it. Every minute people spend reading/watching/learning counts toward their subjective predictions: the running tally in your head is always being updated. The real issue is that statistical models are better at processing all of this information than we are. Statistical models sort through information better than we do: they update the running tally in a scientific way, weighting the data properly and ignoring irrelevant information better than our brains do.2 It’s not a matter of time, it’s a matter of the limits on our brain’s ability to process information correctly. MOTSON runs through my laptop in a few seconds, and can do the 10,000 simulated seasons in under a minute. My brain can’t do anything nearly that quickly or nearly that accurately, and that’s why the modelers are winning.
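For readers who haven’t seen one, here is a minimal sketch of what “simulating 10,000 seasons” can look like. This is not MOTSON’s actual code: the function name, the team names, the current points, and the match probabilities are all hypothetical, and the approach (draw a random result for each remaining match from fixed win/draw/loss probabilities, then count who finishes top) is just one common way to do it.

```python
import random
from collections import Counter


def simulate_title_odds(current_points, fixtures, n_sims=10_000, seed=0):
    """Estimate each team's chance of finishing top by simulation.

    current_points: dict of team -> points so far.
    fixtures: list of (home, away, p_home_win, p_draw) for remaining matches.
    Ties on points are broken at random (a simplification; real leagues
    use goal difference).
    """
    rng = random.Random(seed)
    titles = Counter()
    for _ in range(n_sims):
        points = dict(current_points)
        for home, away, p_home, p_draw in fixtures:
            r = rng.random()
            if r < p_home:                 # home win
                points[home] += 3
            elif r < p_home + p_draw:      # draw
                points[home] += 1
                points[away] += 1
            else:                          # away win
                points[away] += 3
        best = max(points.values())
        leaders = [team for team, pts in points.items() if pts == best]
        titles[rng.choice(leaders)] += 1
    return {team: titles[team] / n_sims for team in current_points}


# Hypothetical inputs for illustration only.
current_points = {"Team A": 51, "Team B": 50, "Team C": 47}
fixtures = [
    ("Team A", "Team B", 0.45, 0.27),
    ("Team B", "Team C", 0.40, 0.30),
    ("Team C", "Team A", 0.30, 0.28),
]

title_odds = simulate_title_odds(current_points, fixtures)
for team, prob in title_odds.items():
    print(f"{team}: {prob:.1%}")
```

The point of the sketch is the speed argument: a loop like this finishes almost instantly on a laptop, and it weights every match by the same rule every time.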

  1. Another issue is that media people knew their predictions would be made public, so I would think that knowledge would encourage them to try harder than simply throwing some numbers together without a lot of forethought.
  2. All of this assumes the model is built correctly, which is a big assumption.

Week 25 EPL Model Comparisons

So this is the fourth week of my in-season results based model (TAM), and I wanted to continue comparing its performance to MOTSON’s. My goal is, with enough data, to learn as much as I can about the advantages and disadvantages of the two approaches, how they might complement each other, and how I can improve my predictions for next season. You can see last week’s post here, and I’ll be updating every week. Below is a table with each model’s modal category and the actual result.

Game | MOTSON (Pre-Season) | TAM (In-Season) | Actual Result
Stoke City v. Everton | Stoke | Draw | Everton
Manchester City v. Leicester City | Man City | Man City | Leicester City
Southampton v. West Ham United | Southampton | Draw | Southampton
Bournemouth v. Arsenal | Arsenal | Arsenal | Arsenal
Chelsea v. Manchester United | Chelsea | Manchester United | Draw
Newcastle v. West Brom | Draw | Draw | Newcastle
Tottenham Hotspur v. Watford | Tottenham | Tottenham | Tottenham
Aston Villa v. Norwich City | Aston Villa | Draw | Aston Villa
Liverpool v. Sunderland | Liverpool | Liverpool | Draw
Swansea City v. Crystal Palace | Swansea | Draw | Draw

Not a great week for either model, and MOTSON wins again with a meager 4-3 (22-18 overall). That’s not surprising after such a hot week last week, but what can we learn? First is that both MOTSON and the TAM picked Man City over Leicester, which is a testament to how strongly Leicester is over-performing. Even using this season’s results, the models were wrong as Leicester still pulled out what turned out to be a fairly comfortable victory over Manchester City. We’re witnessing something special here, and no model seems to be able to capture exactly how special.
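For anyone who wants to check the 4-3 score, here is a short sketch that tallies it from this week’s table. The picks and results are copied from the table above; the variable names are my own, and a pick only counts as correct when it exactly matches the actual result.

```python
# (game, MOTSON pick, TAM pick, actual result), copied from the Week 25 table.
week_25 = [
    ("Stoke City v. Everton", "Stoke", "Draw", "Everton"),
    ("Manchester City v. Leicester City", "Man City", "Man City", "Leicester City"),
    ("Southampton v. West Ham United", "Southampton", "Draw", "Southampton"),
    ("Bournemouth v. Arsenal", "Arsenal", "Arsenal", "Arsenal"),
    ("Chelsea v. Manchester United", "Chelsea", "Manchester United", "Draw"),
    ("Newcastle v. West Brom", "Draw", "Draw", "Newcastle"),
    ("Tottenham Hotspur v. Watford", "Tottenham", "Tottenham", "Tottenham"),
    ("Aston Villa v. Norwich City", "Aston Villa", "Draw", "Aston Villa"),
    ("Liverpool v. Sunderland", "Liverpool", "Liverpool", "Draw"),
    ("Swansea City v. Crystal Palace", "Swansea", "Draw", "Draw"),
]

# Count the games where each model's modal pick matched the actual result.
motson_correct = sum(motson == actual for _, motson, _, actual in week_25)
tam_correct = sum(tam == actual for _, _, tam, actual in week_25)

# Games both models got wrong are the ones neither approach could explain.
both_wrong = [
    game
    for game, motson, tam, actual in week_25
    if motson != actual and tam != actual
]

print(f"MOTSON {motson_correct} - {tam_correct} TAM")
print("Both wrong:", both_wrong)
```

Listing the games both models missed is a quick way to look for patterns, since those are the results neither the pre-season nor the in-season information anticipated.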

As part of Southampton’s regression to the mean, MOTSON picked that game correctly while the TAM struggled with it, and MOTSON correctly liked Aston Villa to win at home against Norwich City. The TAM on the other hand picked the Swansea v. Crystal Palace draw correctly.

I’m not seeing any patterns emerging (other than over-valuing Chelsea), but if anyone has any hypotheses on ways the TAM outperforms MOTSON I’d appreciate you sharing them with me on Twitter (@Soccermetric). Check back soon for my Serie A and Bundesliga prediction diagnostics – not as good results for the TAM in those leagues as last week either.

The Math Behind Why Leicester Will Win The Title

After Week 25, MOTSON has Arsenal as a 67% favorite to win the EPL, with Leicester City and Manchester City behind at around 14-15% each. However, while I’m confident that City’s overall number is correct (relative to Arsenal), MOTSON has also underestimated Leicester’s performance significantly this season, so the question is whether they’re in a better position than Arsenal is for the run-in. I think they are, and I want to look at the numbers.

First things first: next week’s game is almost literally a title decider. The winner becomes MOTSON’s prohibitive favorite to win the league, while the loser becomes the favorite for 2nd (and possibly settles into 4th or even 5th with some bad luck). Here’s the heat map of the predictions:

Week 26 Arsenal v Leicester

The plots look similar, with Arsenal faring a little better on account of MOTSON’s belief that they’re a stronger team overall. But what is the likelihood that MOTSON gets it wrong and Leicester outperforms Arsenal in the run-in?

Leicester’s real opportunity here is that MOTSON, as optimistic as it was for Leicester in pre-season, still predicted they’d finish around 8th place. So all of their predictions are based on an expectation of “How would the 8th best team in the Premier League do?” Compare this to Arsenal, which MOTSON has listed as the title favorite for the entire season, and predicts each match accordingly. Arsenal has very little room for error, including the Week 26 game against Leicester City, while Leicester can drop some points and still improve even further on their expected points. Arsenal can only hold serve, while Leicester still has some tough, but possible, opportunities to improve.

After Week 25, Arsenal is a five point favorite in the Expected Final table. Almost the entire difference is based on Week 26, where Arsenal is expected to take 2.63 points against 0.22 points for Leicester. A loss for Arsenal is a virtual 6 point swing, so it’s basically a must-win if they want to maintain their lead. The title chase for both depends on the run-in, so let’s look at the remainder of Leicester’s games.
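To unpack what those two expected-points numbers imply, here is a small sketch. It assumes expected points are computed the standard way, as 3 times the win probability plus 1 times the draw probability. That is my illustration of the arithmetic, not a description of MOTSON’s internals, and the function names are made up for this example.

```python
def expected_points(p_win, p_draw):
    """Expected points from one match: 3 for a win, 1 for a draw."""
    return 3 * p_win + 1 * p_draw


def implied_probabilities(xp_a, xp_b):
    """Back out (P(A wins), P(draw), P(B wins)) from two expected-points values.

    Uses the fact that the two teams' expected points sum to 3 - P(draw),
    because a draw hands out 2 points in total instead of 3.
    """
    p_draw = 3 - (xp_a + xp_b)
    p_a = (xp_a - p_draw) / 3
    p_b = (xp_b - p_draw) / 3
    return p_a, p_draw, p_b


# Week 26 expected points quoted above: Arsenal 2.63, Leicester 0.22.
p_arsenal, p_draw, p_leicester = implied_probabilities(2.63, 0.22)

print(f"Arsenal win:   {p_arsenal:.1%}")
print(f"Draw:          {p_draw:.1%}")
print(f"Leicester win: {p_leicester:.1%}")
```

Under that assumption, the quoted numbers work out to roughly an 83% chance of an Arsenal win, a 15% chance of a draw, and a 2% chance of a Leicester win, which shows just how heavily the pre-season model leans toward Arsenal in this one match.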

Week | Opponent | Expected Points | Chance to Gain or Hold Serve
Week 27 | Norwich | 2.13 | Hold
Week 28 | West Brom | 1.98 | Hold
Week 29 | @ Watford | 1.46 | Gain
Week 30 | Newcastle | 2.02 | Hold
Week 31 | @ Crystal Palace | 0.72 | Gain
Week 32 | Southampton | 1.73 | Hold
Week 33 | @ Sunderland | 1.09 | Gain
Week 34 | West Ham | 1.97 | Hold
Week 35 | Swansea City | 1.97 | Hold
Week 36 | @ Manchester United | 0.5 | Gain
Week 37 | Everton | 1.98 | Hold
Week 38 | @ Chelsea | 0.27 | Gain

I’ve listed MOTSON’s Expected Points for each of Leicester’s remaining games (after Arsenal), and subjectively described each game as either a chance to gain some expected points or a game where they need to hold serve. Games where Leicester seems like an appropriate favorite need a win to hold serve, while games where they are less favored or are an underdog are listed as a chance to gain. In 5 of the 12 remaining games MOTSON underestimates them and they have a good opportunity to pick up some points. In particular, the trip to Sunderland in Week 33 and the trip to Stamford Bridge in Week 38 are both big chances for them to pick up significant points over what MOTSON expects. I’m not revising the model mid-season, but a qualitative look would back Leicester to pick up at least a few extra points. They’ll likely drop a couple in at least one of the “hold serve” games, but the Chelsea game is a perfect storm of over-estimating Chelsea and under-estimating Leicester, so that could be problematic for Arsenal.
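Here is a short sketch that adds up the run-in table above, so the size of the opportunity is easier to see. The expected points and Gain/Hold labels are copied from the table; the “headroom” figure (maximum possible points in the Gain games minus what MOTSON expects from them) is my own summary and not something the model outputs.

```python
# (week, opponent, MOTSON expected points, label), copied from the table above.
run_in = [
    (27, "Norwich", 2.13, "Hold"),
    (28, "West Brom", 1.98, "Hold"),
    (29, "@ Watford", 1.46, "Gain"),
    (30, "Newcastle", 2.02, "Hold"),
    (31, "@ Crystal Palace", 0.72, "Gain"),
    (32, "Southampton", 1.73, "Hold"),
    (33, "@ Sunderland", 1.09, "Gain"),
    (34, "West Ham", 1.97, "Hold"),
    (35, "Swansea City", 1.97, "Hold"),
    (36, "@ Manchester United", 0.5, "Gain"),
    (37, "Everton", 1.98, "Hold"),
    (38, "@ Chelsea", 0.27, "Gain"),
]

gain_games = [row for row in run_in if row[3] == "Gain"]
hold_games = [row for row in run_in if row[3] == "Hold"]

total_xp = round(sum(xp for _, _, xp, _ in run_in), 2)
gain_xp = round(sum(xp for _, _, xp, _ in gain_games), 2)
hold_xp = round(sum(xp for _, _, xp, _ in hold_games), 2)

# Points available in the Gain games beyond what MOTSON already expects.
gain_headroom = round(3 * len(gain_games) - gain_xp, 2)

print(f"Expected points, Weeks 27-38: {total_xp}")
print(f"Gain games: {len(gain_games)}, expected points: {gain_xp}")
print(f"Hold games: {len(hold_games)}, expected points: {hold_xp}")
print(f"Headroom in Gain games: {gain_headroom}")
```

MOTSON expects only about 4 points from the five Gain games combined, so even a couple of wins there would put Leicester well ahead of the model’s expectations.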

A win against Arsenal puts Leicester in the driver’s seat for MOTSON’s expected final table, but a loss isn’t as tragic for their title chances as the model thinks. I’ve learned enough about Arsenal fans to know they won’t get too comfortable, and they shouldn’t. They’ll still be the favorites, but there are plenty of chances for Leicester to pick up even more points vs. expectations in the last third of the season.

Meaningful Equality and Why It’s Important to do Women’s Soccer Analytics

I hadn’t planned on writing anything today, and I’m not sure what started it, but there’s a discussion going on on Twitter about whether countries should invest in Women’s Soccer. I’ve read some articles lately about gender equality in Australia, and there’s the USWNT/USSF lawsuit going on, but I didn’t see if there was a particular spark that set this off. Either way, it’s an important topic to discuss so I wanted to post some longer form thoughts expanding on my short tweetstorm earlier. This will likely be a series of synaptic misfires loosely related to women’s soccer and gender equality, but hopefully I can convey my thoughts and make some people think about the systemic issues with gender and sports in a different way.

The counterargument to paying women’s national teams equally to men/funding them the same/letting them play on the federation’s preferred surface is that “Women’s soccer isn’t popular/doesn’t make money. Once they bring in the amount of money that the men do, then they can have the same money.” The problem is that the two aren’t playing on the same playing field (literally in the case of the USMNT/WNT). For context I stole a line of thought from my favorite political science professor back in undergrad:

With zero data in my pocket to support this, I feel fairly comfortable saying women’s soccer isn’t nearly as popular as men’s soccer worldwide. Shockingly, if you treat something as unimportant and as a lower quality product, people will see it that way. Even if we stop doing that and treat the women’s game as an equal product to men’s soccer today, we still haven’t reached anything near equality. You can’t undo generations of conditioning with a single World Cup – it’s a start, but it’s not the end product. Not by a long shot.

A few weeks ago I posted a call to some of the bigger accounts in Soccer Analytics Twitter™ to have a “let’s retweet articles written by women” day, and had no takers. One person suggested “We should retweet women every day”, which is a good suggestion except virtually no one in my timeline does it. A couple of people suggested that they retweet quality content regardless of gender, which is a great sentiment except for the fact that 95% (likely higher) of the things I see retweeted are from men. Either men are just naturally better at writing about soccer, or there’s some gender bias in what we see as quality or what topics we’re interested in.

Admitting my own bias, I don’t follow women’s soccer particularly closely.1 I do watch just about every match of the Women’s World Cup, and watch as many USWNT matches as I can, but there are a few issues that end up in a nasty feedback loop preventing me, and I assume others, from following women’s soccer.

First, men’s soccer is so much easier to find on TV. NBCSN televises 5-6 EPL games every weekend, so I can put it on without needing to worry about setting up my Roku or streaming from my laptop to the computer.2 I actually prefer Serie A, but even beIN Sports makes it difficult to find Milan many weeks, so I’ve started following the EPL more.

This makes it easier to read articles about the EPL, or to a lesser extent Serie A.3 So I follow writers who write interesting content about the EPL because I have some sort of knowledge base there. There are a couple of really interesting German people I follow, but don’t read much of their content because I know nothing about the Bundesliga and the articles don’t mean much to me. This problem is exaggerated with Women’s Soccer: I don’t know that much about it, so when I read tactical analysis of the NWSL finals I struggle to keep up. It’s a nasty feedback loop: I don’t follow the league so I don’t appreciate the analyses as much as I should, which means I have less interest in the league because I don’t have a community to talk about it and learn about it with.

Similarly, the big EPL writers don’t write much about women’s soccer. I get why: I want to write more about the relegation round-up, but when I do I get far fewer retweets/likes/shares/clicks so I stop writing about Aston Villa/Sunderland/Newcastle. I’m just a hobbyist here, but I don’t want to spend time writing things people don’t read, so I end up drawn back to “Who’s going to win the EPL?” and “Is Leicester City for real?” and “What’s the deal with Chelsea?” It reinforces the big team bias, even in the biggest league in the world. Think what this does to something trying to get a foothold in the marketplace like women’s soccer.

This is a big reason I’ve decided to start doing predictive models for the NWSL this season.4 First, I want a reason to become more interested in women’s soccer, so investing myself into a project like this gives me a reason to follow it more closely. Second, maybe if I build it, people will come. I’m not necessarily a big dog5 in the soccer analytics community, but I have a decent following on Twitter and maybe if I start writing then we can start a women’s soccer analytics community. Someone has to be the first mover, and I’ll get enjoyment out of it so why not? Third, I think it’s an important thing to do. We have plenty of people writing about all the men’s leagues in the world, but not nearly enough writing about women’s soccer. I would never want to step out of my lane and speak over any of the amazingly talented women writing about women’s soccer, but if I can find a niche and bring in some people who wouldn’t otherwise be interested then maybe that’s a good thing.

I don’t know if this made for a particularly coherent blog post, but I really think it’s important to address the systemic inequalities in gender and sports (soccer in particular) and to think about how hard it is to break the cycle: we don’t invest in women’s soccer because it has a small fanbase, and it has a small fanbase because we don’t invest in it. Breaking that cycle is an important step toward equality in sports, and hopefully we can move the ball forward.

  1. To be fair, I don’t follow the Bundesliga, La Liga, or Ligue 1 closely either, but even for those leagues I know the biggest players and teams.
  2. I refuse to watch TV on my laptop. My students would cringe at me being an old man and watching TV on an actual TV, kind of like how I use my phone to be a phone.
  3. #forzaMilan
  4. There’s another problem here with availability of stats: it’s harder to do a decent analytic model for women’s soccer because there aren’t as many publicly available stats, so no one writes about it. But then no one collects stats for women’s soccer because no one’s interested, but no one’s interested because there aren’t any stats…feedback loop.
  5. Digby pun intended for Mike Goodman should he read this

Week 24 EPL Model Comparisons

This is the third week of my in-season results based model (TAM), and I wanted to continue comparing its performance to MOTSON’s. My goal is, with enough data, to learn as much as I can about the advantages and disadvantages of the two approaches, how they might complement each other, and how I can improve my predictions for next season. You can see last week’s post here, and I’ll be updating every week. Below is a table with each model’s modal category and the actual result.

| Game | MOTSON | TAM (In-Season) | Actual Result |
| --- | --- | --- | --- |
| Norwich v. Tottenham | Tottenham | Draw | Tottenham |
| West Ham v. Aston Villa | West Ham | West Ham | West Ham |
| Leicester City v. Liverpool | Leicester City/Draw (equal) | Leicester City | Leicester City |
| Crystal Palace v. Bournemouth | Crystal Palace | Bournemouth | Bournemouth |
| Arsenal v. Southampton | Arsenal | Arsenal | Draw |
| Sunderland v. Man City | Man City | Man City | Man City |
| Man United v. Stoke City | Man United | Draw | Man United |
| West Brom v. Swansea City | Draw | Draw | Draw |
| Watford v. Chelsea | Chelsea | Draw | Draw |
| Everton v. Newcastle | Everton | Everton | Everton |

MOTSON did well this week, getting 7/10 games correct, and the TAM stepped up, also getting 7/10 correct. MOTSON missed on Crystal Palace, Arsenal, and Chelsea, while the TAM missed on Tottenham, Man United, and Arsenal. So what can we learn from these games?
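The tally above can be reproduced directly from the table. Here is a minimal Python sketch of that count. The names `WEEK_24`, `is_hit`, and `count_hits` are illustrative and not part of either model. It also assumes that MOTSON's tied modal category for Leicester City v. Liverpool is credited as correct when either tied outcome happens, which is the reading under which the 7/10 figure works out.

```python
# Modal picks and results transcribed from the Week 24 table.
# Each row: (game, MOTSON pick, TAM pick, actual result).
WEEK_24 = [
    ("Norwich v. Tottenham", "Tottenham", "Draw", "Tottenham"),
    ("West Ham v. Aston Villa", "West Ham", "West Ham", "West Ham"),
    # MOTSON rated Leicester City and Draw as equally likely here.
    ("Leicester City v. Liverpool", "Leicester City/Draw", "Leicester City", "Leicester City"),
    ("Crystal Palace v. Bournemouth", "Crystal Palace", "Bournemouth", "Bournemouth"),
    ("Arsenal v. Southampton", "Arsenal", "Arsenal", "Draw"),
    ("Sunderland v. Man City", "Man City", "Man City", "Man City"),
    ("Man United v. Stoke City", "Man United", "Draw", "Man United"),
    ("West Brom v. Swansea City", "Draw", "Draw", "Draw"),
    ("Watford v. Chelsea", "Chelsea", "Draw", "Draw"),
    ("Everton v. Newcastle", "Everton", "Everton", "Everton"),
]


def is_hit(pick, actual):
    # Assumption: a tied pick written as "A/B" counts as correct
    # if the actual result matches either tied category.
    return actual in pick.split("/")


def count_hits(rows, column):
    # column 1 = MOTSON, column 2 = TAM; column 3 holds the actual result.
    return sum(is_hit(row[column], row[3]) for row in rows)


def misses(rows, column):
    return [row[0] for row in rows if not is_hit(row[column], row[3])]


motson_hits = count_hits(WEEK_24, 1)
tam_hits = count_hits(WEEK_24, 2)
print(f"MOTSON: {motson_hits}/{len(WEEK_24)}, missed {misses(WEEK_24, 1)}")
print(f"TAM:    {tam_hits}/{len(WEEK_24)}, missed {misses(WEEK_24, 2)}")
```

Counting the modal category this way throws away the probabilities behind each pick, so it is only a rough scoreboard.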

Once again, MOTSON overestimates Chelsea. There’s not much to learn here because we’ve already learned this a bunch of times – I still don’t have any good ideas how I could have modeled this pre-season, and maybe this is just a statistical anomaly. Either way, it’s a known issue with the model that seems to be corrected fairly well by the in-season results model.

Both models missing on Arsenal is a bit surprising, and this may just be a legit low-probability event. When two disparate models predict the same outcome and both get it wrong, an unlikely result is a more plausible explanation than a flaw shared by both models.

TAM’s misses on Tottenham and Man United are surprising. I would have picked both of those teams to be favorites, especially with Man United at home. Maybe Spurs have struggled to convert road fixtures to wins, which would explain that prediction, but United drawing at home against Stoke is a weird one to me. I’m not sure why that is, but it’s worth thinking about more.

MOTSON’s miss on Crystal Palace v. Bournemouth is a tough one – I would have picked Palace to win, but TAM got this one right so apparently the data are better than my intuition and the pre-season model here. I don’t have any real insight here, but it’s important to take notice of this in case a pattern arises that would give me an opportunity to improve the model next year.

Overall MOTSON still leads with 18 correct picks to TAM’s 15. Both models had good weeks, and I’m curious to see if TAM improves as it gets more data while MOTSON keeps the same inputs from last year. TAM did much better this week than in the past two, so I also want to see whether that is just because the results were more predictable this week or because the model is improving.




Is The EPL More Unpredictable Than Other Leagues? An (Early) Comparison of Models

Parental Discretion Advised: Over-generalizations from incredibly small sample sizes to follow.

Those of you who follow me on Twitter (which I assume is almost all of my readers, but if not you should follow me @Soccermetric) probably know I debuted prediction models for Serie A and the Bundesliga this weekend. They’re based on the TAM model (which needs a better nickname – I want to backronym NAGBE if possible), which is described in full here. The short version is it only looks at basically two variables: in-season results and goal differential. The math is a lot more complicated than that, but this is the basic idea. The TAM model hasn’t done especially well through its first two weeks in the EPL, predicting 8/20 matches correctly, a meager 40%, or barely better than the 33% we’d expect from just flipping a three-sided coin.
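To put a rough number on "barely better than a three-sided coin," here is a small Python sketch that asks how often pure guessing would get at least 8 of 20 matches right. It assumes that guessing means picking win, draw, or loss with equal probability 1/3, which is the coin-flip baseline used above and not a claim about real outcome frequencies. The helper name `prob_at_least` is illustrative.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_MATCHES = 20   # TAM's first two EPL weeks
N_CORRECT = 8    # matches TAM called correctly
P_GUESS = 1 / 3  # assumption: three equally likely outcomes under guessing

accuracy = N_CORRECT / N_MATCHES
p_chance = prob_at_least(N_CORRECT, N_MATCHES, P_GUESS)

print(f"TAM accuracy: {accuracy:.0%}")
print(f"P(at least {N_CORRECT}/{N_MATCHES} by guessing): {p_chance:.2f}")
```

Under that assumption, a random guesser matches or beats 8/20 roughly one time in three, so two weeks of results say very little either way.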

However, it did much better in Serie A and the Bundesliga this weekend. It predicted 6/9 correct this week in the Bundesliga, or 67%. Here are the first week’s predictions:

Week 24 (19) Bundesliga Predictions

The errors were Wolfsburg, Hamburg, and Hertha Berlin. I’m incredibly pleased with 6/9, and am equally pleased with the 6/10 in Serie A (especially because it failed to predict Milan’s win in the Derby).

Week 24 (22) Serie A TAM

In one week, the two new models each got 6 outcomes correct, which is almost as many as the same model got correct in the EPL in two weeks (MOTSON outperforms the simple model). This is obviously a small sample, and counting “correct” outcomes isn’t the right way to measure this, but it’s evidence that maybe the EPL is just really difficult to predict (especially this season). I’m not planning on putting together a model for La Liga, but that one might be even easier: you’d be hard-pressed to put together a bad model as long as you started with Barcelona > Real Madrid = Atletico.
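To see how small these samples are, here is a short Python sketch that puts an approximate 95% interval around each league's hit rate. The Wilson score interval is my choice for illustration and is not something the TAM model itself uses. The counts are the ones reported above (8/20, 6/9, 6/10), and the names `RESULTS` and `wilson_interval` are illustrative.

```python
from math import sqrt

# Correct picks and matches played for the TAM model, as reported above.
RESULTS = {
    "EPL (2 weeks)": (8, 20),
    "Bundesliga (1 week)": (6, 9),
    "Serie A (1 week)": (6, 10),
}


def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion hits/n."""
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


intervals = {league: wilson_interval(h, n) for league, (h, n) in RESULTS.items()}

for league, (low, high) in intervals.items():
    hits, n = RESULTS[league]
    print(f"{league}: {hits}/{n} = {hits / n:.0%}, plausible range {low:.0%} to {high:.0%}")
```

The three ranges overlap heavily, which fits the warning at the top of this post: the gap between leagues is suggestive, but a week or two of matches cannot settle it.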

The success of a model depends on the difficulty of the task. If the same model performs much better in other leagues, then we may have evidence that EPL results are harder to predict than results in the other major European leagues, and we should adjust our expectations of our models accordingly.