Thoughts on Machine Learning, Black Boxes, and my SVM

There was quite a bit of discussion about machine learning (ML) techniques on Soccer Analytics Twitter™ today, so I expedited this post I’ve been planning for a few days.

I think it’s important to define what ML is. I don’t normally like quoting Wikipedia, but it has a good definition that works for what I want to communicate, so here we go:

Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data.

The big thing here is that ML techniques focus on prediction – if you’re trying to predict something, then you should look at ML first. What it doesn’t do well is explain things, as captured by these tweets from Michael Caley.

This was in response to Ola Lidmark Eriksson’s blog post talking about a machine learning approach to calculating xG. Michael’s approach is aimed at a mass audience, explaining what makes a shot more likely to go in (e.g. closer to goal, more centrally located, not headed), while Ola focuses on greater accuracy and letting the model make the decisions. Both versions have their merits, and so much of it really depends on what trade-offs you’re willing to make. How much accuracy do you gain relative to the explanatory power you give up with these types of models? How much are you willing to sacrifice on either dimension?

All that being said, I wanted to talk about my method for predictions because I get a number of questions about it. People frequently ask “Why does your model like Arsenal so much?” or “What stats are driving your results?” or “Why does your model think Theo Walcott is so good?” The answer is always “I don’t know”, and that’s a feature of the Support Vector Machine (SVM) model. I explained the SVM in another post, so I’m not going to revisit the whole thing, but it’s worth a read for anyone interested in what’s under the hood. Here I want to highlight the “black box” nature of these models, SVMs in particular.

The reason I like the SVM model for soccer is that it doesn’t assume any functional form – it doesn’t think that more passes or more possession is necessarily good. It looks at stats for results, and learns how many passes and how much possession is optimal given the other game stats, and predicts results that way. It doesn’t tell you what the inflection points are, and doesn’t tell you what the cutoffs are, and the interactions in such a big model are too complicated to present visually. But it does predict well, which is what matters to me for my purposes.
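To make the “no functional form” point concrete, here is a minimal sketch on synthetic data. It assumes scikit-learn and an RBF kernel, neither of which is specified in this post, and the rule that a good result only happens in a middle band of possession is invented purely for illustration. It is not the model discussed here, just a toy showing that a kernel SVM can learn “some is good, too much is bad” while a linear one cannot.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data (NOT real match data): one feature, possession share in [0, 1].
possession = np.linspace(0.0, 1.0, 101).reshape(-1, 1)

# Made-up rule for illustration: a "good result" (1) only happens when
# possession sits in a middle band; too little or too much gives 0.
good_result = ((possession[:, 0] >= 0.4) & (possession[:, 0] <= 0.6)).astype(int)

# Assumption: an RBF kernel, which lets the decision boundary bend.
rbf_svm = SVC(kernel="rbf", C=100.0, gamma=50.0).fit(possession, good_result)

# For contrast: a linear kernel can only put a single threshold on possession.
linear_svm = SVC(kernel="linear", C=100.0).fit(possession, good_result)

# Probe low, middle, and high possession.
probe = np.array([[0.1], [0.5], [0.9]])
rbf_pred = rbf_svm.predict(probe).tolist()
linear_pred = linear_svm.predict(probe).tolist()

print("RBF kernel:   ", rbf_pred)
print("linear kernel:", linear_pred)
```

The RBF model can label the middle of the range differently from both extremes, which a single linear threshold cannot do. Note that neither model hands back the cutoffs themselves; you only see them by probing predictions, which is the black-box trade-off in miniature.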

Another thing it does is recognize the value of defense and balance in a team. One of the common refrains of Soccer Analytics folks is that it’s impossible to quantify defense. The SVM proves that this isn’t true, as it recognizes the value of having players who make tackles, make clearances, and win headers. I’m particularly proud of my most recent exploratory analysis, looking at the value of Man City replacing each of their midfielders with Lionel Messi.


The model shows Messi as an improvement on all of Man City’s front three midfielders, but as a small downgrade from Yaya Toure and a significant downgrade from Fernandinho. While it doesn’t give me specific reasons for this (the SVM is a black box, remember?), it’s pretty clear that Messi isn’t as good playing in the deeper role that Toure plays and certainly wouldn’t make a good holding midfielder like Fernandinho. Any increase in offense brought on by playing Messi instead of Fernandinho would be more than offset by the loss in defensive strength. This passes the common sense check.
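A what-if like this can be sketched as “swap one player’s averages for another’s, rebuild the team’s feature row, and compare the model’s score before and after.” The sketch below is only an illustration of that idea: the player labels, the stat values, the choice to sum per-game averages, and the `toy_score` function are all hypothetical. In particular, `toy_score` is a hand-written stand-in for a trained model’s output, not an SVM.

```python
from typing import Callable, Dict, List

Stats = Dict[str, float]


def team_features(lineup: List[str], averages: Dict[str, Stats]) -> Stats:
    """Sum each player's per-game averages into one team-level feature row."""
    totals: Stats = {}
    for name in lineup:
        for stat, value in averages[name].items():
            totals[stat] = totals.get(stat, 0.0) + value
    return totals


def substitution_delta(
    lineup: List[str],
    out_name: str,
    in_name: str,
    averages: Dict[str, Stats],
    score: Callable[[Stats], float],
) -> float:
    """Change in model score when in_name replaces out_name in the lineup."""
    swapped = [in_name if player == out_name else player for player in lineup]
    return score(team_features(swapped, averages)) - score(team_features(lineup, averages))


def toy_score(features: Stats) -> float:
    # Hypothetical stand-in for a trained model's decision value.
    # Attack stops paying off past a cap, so piling on attackers is not free.
    return min(features["shots"], 15.0) + min(features["tackles"], 12.0)


# Made-up per-game averages for illustration only (not real player data).
averages = {
    "forward_a": {"shots": 4.0, "tackles": 1.0},
    "forward_b": {"shots": 4.0, "tackles": 1.0},
    "playmaker": {"shots": 3.0, "tackles": 1.0},
    "holder": {"shots": 1.0, "tackles": 6.0},
    "star": {"shots": 6.0, "tackles": 0.5},
}
lineup = ["forward_a", "forward_b", "playmaker", "holder"]

delta_for_playmaker = substitution_delta(lineup, "playmaker", "star", averages, toy_score)
delta_for_holder = substitution_delta(lineup, "holder", "star", averages, toy_score)
print("star for playmaker:", delta_for_playmaker)
print("star for holder:   ", delta_for_holder)
```

With these invented numbers, the same incoming player helps in one slot and hurts in the other, because what the team loses differs by position. With a real model you would pass its scoring function in place of `toy_score`.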

Most importantly, I think this highlights one of the advantages my model has over some of the dominant models out there – specifically ones based on xG or some other offensive contribution. It knows if you’re playing too many attacking players, and will punish you for that. It can find places where your team is imbalanced (Mesut Ozil at Arsenal is a great example of that – my model much prefers Daniele de Rossi in his place, which is a very different role), point out ways to fix the imbalance, and even recognize potential tactical improvements. It does all of this without knowing anything about players other than their average statistical contribution to a game.

Machine Learning techniques have their place, and if prediction is your goal then you really should learn something about them. But if you’re looking to explain things, then there are more appropriate methods and you should learn those. My SVM has predicted results well so far, and it quantifies individual player contribution to a team as well as anything out there (I would argue better, but I have no statistical proof of this). But it doesn’t explain outcomes particularly well, and it doesn’t explain why it prefers certain players over other players. That’s a job for other methods and people who are more interested in explanation. As usual, it’s about the right tool for the right job, and Machine Learning techniques are the right tool for predicting outcomes and quantifying individual contribution.
