The Adolescence of Soccer Stats
There is a school of thought in the baseball analytics community that the world would have been better off without the box score. The argument goes that the box score legitimized a certain set of numbers, and that those numbers weren’t actually the important ones; they were just the easiest ones to count. The theory, regardless of whether you agree with it, speaks to one of the quirks of the baseball analytics debate. It’s not about whether or not to trust numbers; it’s about old numbers and new numbers. Everybody who even casually follows baseball is comfortable with the old numbers, in part because they’ve been around so long. Being wary of analytics amounts, in effect, to preferring the statistics you know over the ones you have to learn.
Soccer doesn’t have old numbers. It’s actually really, really hard to count things in a sport as fluid as soccer. It’s only been in the last five to 10 years that people have started tracking statistics in anything resembling a systematic way. We live in an age where there is interest in, and profit to be made from, the gathering and dissemination of statistics. Companies like Opta (which FiveThirtyEight profiled this summer) and Prozone have sprung up to track data, with the purpose of selling that data to teams or the media (ESPN Stats & Info’s soccer data is powered by Opta).
It’s tempting to think of all this new soccer data in the same way we think of advanced metrics in baseball or other more number-oriented sports — as if the “newness” of the numbers means they must be the product of analytics. That’s wrong. Soccer’s data-collection companies have actually just brought soccer to where baseball was a hundred years ago.
Finally, the sport has data. Now the question is what to do with it. That puts soccer in a unique position, because unlike any other mainstream sport, the distribution of statistics and the analysis of those statistics are both developing at the same time.
The Statistics Side
Back to that question of whether baseball would have been better served without the box score: It’s actually important to the development of soccer statistics.
A number of websites, like WhoScored, Squawka, and FourFourTwo’s Stats Zone, have emerged in recent years (all powered by Opta) with the express purpose of delivering game data to fans. Among other things, these sites provide individual and team stats for matches. That sounds suspiciously like a box score, right? Any one of those sites can give you an idea of who did what, and how, over the course of a game or a season — complete with heat maps and passing charts to help visualize the numbers.
What they don’t do is provide any of the broader context inherent to the hazy phrase “analytics.” Does it matter if Johnny Defender averages a lot of tackles per game? Does Carlos Shootsalot’s low scoring percentage mean he’s a bad shooter who should stop taking all of those attempts on goal? Or is he just unlucky? Does a defender’s 95 percent pass completion rate mean he’s a better distributor than a winger who’s at 78 percent? We have these numbers. The question is, which ones should we care about?
You can see why this might make a weary numbers warrior have flashbacks to days of baseball yore. Why cite stats if you don’t know whether they’re meaningful?! That’s how I got stuck seeing those useless RBI numbers on every baseball broadcast for the past 50 years!
Sure, using numbers to make assumptions about which players or teams excel without having the proof they matter is a dangerous road to walk down. But that doesn’t mean those numbers shouldn’t be used.
Arguing against the box score (and counting numbers) amounts to arguing against using statistics for descriptive purposes. While traditional baseball stats are not particularly predictive of what will happen, they are very descriptive of what has happened. Try describing how the Baltimore Orioles played over the last week without using statistics, or try explaining how good Clayton Kershaw is. Stats that are increasingly becoming discredited in baseball don’t fail to describe how good or bad performance has been; they fail to explain the whys of that performance and, consequently, whether or not it will continue.
A decade ago, nobody knew how many passes Xavi played, so they didn’t know how many he completed. Without that information, how can anybody even begin trying to break down exactly how great he was, or whether he was in decline? It’s the equivalent of watching Derek Jeter now and only being able to say, “Three years ago, I watched him get a lot of hits, and this year he gets fewer hits.”
Statistics, even raw counting statistics, help the world agree on what happened. It’s hard to say why it happened, or what will happen next, without that first step. That first step has always existed in American sports, but it’s a new phenomenon in soccer.
The Analytics Side
The hope is that buried in all of those stats are bits of information that are more than simply descriptive of what happened. That work is being done as well, and while some major progress has been made, it remains in its infancy.
Total Shots Ratio is the granddaddy of advanced soccer stats. It was originally based on hockey’s Corsi and brought over to soccer by James Grayson. The key insight is astonishingly simple: Good teams shoot a lot, and they keep their opponents from shooting. The power of the stat comes from the fact that it is predictive both of itself and of success. That is, a team’s past TSR does a good job of predicting its future TSR, and a team’s TSR does a better job of predicting future goals and results than past goals and results do.
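As a rough illustration of how simple the stat is, here is a minimal sketch that computes TSR as a team's share of all shots taken in its matches (shots for, divided by shots for plus shots against). The team names and shot totals are made up for the example, and the function name `total_shots_ratio` is just a label chosen here.

```python
def total_shots_ratio(shots_for: int, shots_against: int) -> float:
    """Share of all shots in a team's matches that the team itself took."""
    total = shots_for + shots_against
    if total == 0:
        raise ValueError("no shots recorded")
    return shots_for / total


# Hypothetical season totals: (shots for, shots against). Not real data.
teams = {
    "Team A": (600, 400),
    "Team B": (450, 450),
    "Team C": (380, 570),
}

# TSR for each team, rounded to three decimal places.
tsr_table = {
    name: round(total_shots_ratio(shots_for, shots_against), 3)
    for name, (shots_for, shots_against) in teams.items()
}

# List teams from highest TSR to lowest.
for name, tsr in sorted(tsr_table.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {tsr:.3f}")
```

A TSR above 0.5 means a team outshoots its opponents; below 0.5 means it gets outshot.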
TSR is incredibly helpful when it comes to figuring out which teams might do well in any given season (though, of course, some dumb writers will ignore it and pick Manchester United to finish third), but it isn’t all that helpful when it comes to actually managing a team.
In hockey, there are any number of combinations of players on the ice, and a high volume of shots, which makes it possible to break down which players are on the ice when their team is most effective. In soccer, with only three substitutions and relatively few shots, TSR doesn’t (or at least hasn’t yet been shown to) offer anything on the individual player level.
More recently, the concept of expected goals has gained prominence. From a predictive standpoint, ExG performs about as well as TSR does (Grayson has compared TSR with an ExG model from Michael Caley and provided lots of methodological background), but it has the added benefit of being applicable to players as well as teams. In other words, it’s possible to look at players’ goal totals and compare them to how many goals you’d expect them to score, based on the kinds of shots they took.
Splitting shooting apart this way is somewhat similar to examining the batting average on balls in play component of batting average. It turns out that, similar to having a sustainably high or low BABIP, it’s very difficult for players to reliably overperform or underperform their ExG totals. It again bears repeating, though, that all of this work is in its very early stages.
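The comparison between a player's goals and his expected goals can be sketched in a few lines. This is only a toy version: the shot categories and the scoring probabilities in `SHOT_XG` are invented for illustration and do not come from Caley's model or any other real one, which would weigh far more detail about each shot.

```python
# Hypothetical chance of scoring for each kind of shot (illustrative values only).
SHOT_XG = {
    "penalty": 0.76,
    "close_range": 0.30,
    "header": 0.10,
    "long_range": 0.03,
}


def expected_goals(shot_counts: dict) -> float:
    """Sum the scoring probability of every shot a player took."""
    return sum(SHOT_XG[kind] * count for kind, count in shot_counts.items())


def goals_minus_expected(goals: int, shot_counts: dict) -> float:
    """Positive means the player scored more than his shots would suggest."""
    return goals - expected_goals(shot_counts)


# A made-up player: 35 shots of various kinds, 6 goals.
shots_taken = {"close_range": 10, "long_range": 20, "header": 5}
player_xg = expected_goals(shots_taken)
player_diff = goals_minus_expected(6, shots_taken)

print(f"Expected goals: {player_xg:.2f}")
print(f"Goals minus expected: {player_diff:+.2f}")
```

If overperformance like this rarely persists from season to season, a large positive gap is more likely a hot streak than a skill.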
Reconciling Stats and Analytics
One thing you may have noticed is that the bulk of analytics work seems to deal with shots. This is largely but not entirely true. Some ExG models, like Caley’s, take into account the type of pass that leads up to a shot. More broadly, Ted Knutson of StatsBomb has created player radars that focus on painting more accurate pictures of players’ statistical output by adjusting for minutes played and, on the defensive side of the ball, amount of possession. (Again, it’s worth noting that we’re flying close to completely blind as to how all of this adds up to team results.) But for the most part, the heaviest analytical work doesn’t yet involve the vast array of statistics that we now have at our disposal.
There are a lot of reasons for that. Some of it is simply a function of the relatively small amount of time people in the public sphere have been working on the information. Also, it turns out that when you want to do heavy analytics, having a hundred years of data, like they do in baseball, makes it easier than having just five or so, like soccer does. It’s possible there are all sorts of important things in that soccer data that people just haven’t gotten to yet. At the same time, much of it is never going to be more than descriptive. The trick will be figuring out the difference.
Ultimately, analytics is about figuring out answers to questions. You need statistical data to help do that. But if baseball has taught us anything, it’s that it’s just as easy to use stats to jump to ultimately erroneous conclusions as it is to discover anything truly enlightening. The newness of statistics in soccer makes it feel like everything about them is advanced. But just because soccer is a century behind baseball doesn’t mean the sport can’t have its own share of RBIs. Soccer does have one advantage, though. In baseball, the assumptions got a 100-year head start on the analysis. In soccer, the statistics and the analysis are leaving the starting line at the same time.