Thoughts on Parsimony v. Accuracy

My model has made what a lot of people consider an odd prediction: it heavily favors Arsenal over Manchester City at the Emirates this week. A few people have asked “Why?” and I’ve pointed them to my blog post about the model being a “black box”. I can’t point to any single reason why it likes Arsenal so much, although it does like teams with home-field advantage and has a math-crush on Theo Walcott (calling him “Europe’s most valuable striker”). James Yorke of StatsBomb pushed back on this a little bit.

James is a smart guy and a great writer whom you should all follow if you don’t already, so I wanted to write a longer-form post about why what I’m doing is important for soccer analytics, and why I chose a “black box” model over a typical regression format with clear coefficients and tests of statistical significance.

I wrote about it in more detail in another post, but I’m a firm believer in using the “right” model rather than the convenient one in most cases.[1] The SVM assumes no functional form for the individual variables, and we have no idea what the correct functional form is for things like possession or number of passes, so we shouldn’t be running any sort of linear regression on these variables. If the trade-off is not being able to say “Arsenal makes 37 more passes a game than Manchester City, therefore they’re more likely to win”, then I’m OK with that.[2]
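A minimal sketch of the functional-form concern, using made-up data rather than anything from the model itself. It swaps in a simple RBF kernel smoother (Nadaraya-Watson) as a stand-in for a flexible method; it is not an SVM. The helper names `fit_line` and `rbf_smooth`, the `gamma` value, and the inverted-U relationship between possession and outcome are all assumptions chosen for illustration.

```python
import math

# Synthetic, made-up data: "possession share" x and an outcome y that
# peaks at mid-range possession. Illustrative only, not real match data.
xs = [i / 20 for i in range(21)]            # 0.00, 0.05, ..., 1.00
ys = [1 - 4 * (x - 0.5) ** 2 for x in xs]   # inverted U, peak at x = 0.5

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (assumes a straight line)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def rbf_smooth(x0, xs, ys, gamma=200.0):
    """Nadaraya-Watson estimate at x0: an RBF-weighted average of ys."""
    ws = [math.exp(-gamma * (x - x0) ** 2) for x in xs]
    return sum(w * y for w, y in zip(ws, ys)) / sum(ws)

def mse(preds, ys):
    """Mean squared error between predictions and observed values."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

a, b = fit_line(xs, ys)
linear_mse = mse([a + b * x for x in xs], ys)
kernel_mse = mse([rbf_smooth(x, xs, ys) for x in xs], ys)
print(f"slope={b:.3f} linear_mse={linear_mse:.4f} kernel_mse={kernel_mse:.4f}")
```

Because the made-up relationship is symmetric around 50% possession, the fitted line comes out essentially flat and would suggest possession doesn’t matter, while the kernel-weighted fit tracks the curve. Real match data may look nothing like this; the point is only that a straight line can hide a relationship that a method without a fixed functional form can pick up.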

But more importantly, I think my model gets at things that much of the rest of the community isn’t getting at. Expected Goals and Expected Assists are interesting and useful, but if we force everyone into that sort of analysis and language, we’re really limiting ourselves. They’re good measures because they’re built on easily observable events: goals are the…well, “goal” of every team, and assists are one step removed (the event immediately before the goal). However, they’re such a small part of what actually happens on the pitch, and there’s a little bit of the “Drunkard’s Search” going on if we limit ourselves to them.

My model looks at the entire game and the entire universe of (publicly available) statistics, and makes no judgment about what is important. It likes tackles, headers, and other defensive actions quite a bit, which most of the analytics community generally considers unusable. While xG and xA models, and all the other models I’m aware of, would prefer Lionel Messi to Fernandinho in a holding midfielder role for Manchester City, mine recognizes that he’d be a downgrade there.


Models that only look at offense and observable outcomes don’t get this right, but intuitively I think we can agree that it is true. We know limiting xG is important, but we don’t know how that happens. My model seems to understand that, and even though it doesn’t have a great answer for “why”, it does lead to potential explanations and, hopefully, some testable hypotheses. Limiting ourselves to one way of thinking is ultimately going to leave the soccer analytics world stagnant, and there is value in multiple methods and multiple approaches. Parsimony is good, but we shouldn’t exclude complexity that generates insights into the game just because it doesn’t fit well into 140 characters.


  1. If the convenient model gives roughly the same result, then go with the convenient one, but I’m a big believer that accuracy should never be second to parsimony.
  2. Reasonable people can disagree on this point, as I’ve written before.
