Is It Time to Give Up on the Red Sox Pitching Staff? What Statistical Thinking Can Tell Us

Here are some troubling numbers: Clay Buchholz’s ERA is 5.73; Wade Miley’s is 5.60. Collectively, the Boston Red Sox’s staff ERA is 4.90, second worst in the majors. By the time Justin Masterson walked off the mound on Tuesday night, his ERA had risen to 6.37. The staff WHIP is 1.43, third worst in baseball. Sox pitchers have allowed 38 home runs, fourth most in the league.

Maintaining calm in the face of numbers like these is difficult, especially on the heels of a losing season and an offseason in which the coveted pitchers (James Shields, Cole Hamels, and most especially the beloved Jon Lester) got away. As Sox starters continue to get pounded, the fans are left to wonder and argue about what to believe. Baseball’s statistical revolution arrived long ago, but a lot of people still don’t know what kind of confidence to place in numbers. Even while sophisticated baseball statistics like OPS, BABIP, and UZR have grown in popularity, statistical thinking remains relatively obscure to most fans. To put this more plainly: It’s early May, and Buchholz’s ERA (or his FIP!) can still come down. It’s a small sample size, as someone on a radio show is possibly saying right this minute. But what exactly does that mean?

Size Matters, But Small Samples Aren’t Meaningless

A sample is just a set of individual figures from a population. As Aubrey Clayton, an insurance risk analyst with a PhD in mathematics (and a lifelong attachment to the Rangers), explains it, the popular phrase “small sample size” does not have a specific meaning. The size matters, he says, not because smaller samples are worthless, but because they change what kinds of questions we can ask and answer about an event. “The amount of information you can extract from [a sample] depends on how big it is,” Clayton says.

In other words, it would be a mistake to go to a ballgame, watch Wade Miley get shelled, and chalk it up as meaningless. But the information you gather is cumulative — you get a little bit more each time Miley takes the hill.

More than that, Clayton explained, you can’t properly draw conclusions from statistical data unless you start with a hypothesis. Rather than looking at Miley’s bad outing and trying broadly to decide how good he is, a statistician would come to the park with an expectation, then use what he sees to evaluate its accuracy.

This may sound like a pedantic point, but it’s actually clarifying, maybe even comforting, because we do have an expectation for how good Wade Miley is, based on what statisticians call prior information, i.e., Miley’s career so far. And in fact, our expectation is based on far better data than what we have for 2015; Miley has pitched fewer than 40 innings in a Red Sox uniform, but more than 600 innings with the Diamondbacks, during which time he had an ERA of 3.79. This gives us a strong hypothesis for him, namely that he is a pitcher who gives up fewer than four runs every nine innings. A career average like that still leaves room for stretches when it seems like he gets battered every fifth night.

The same goes for Buchholz, Masterson, and Joe Kelly, all of whom are currently sporting ERAs well above their career averages, and none of whom has pitched even 40 innings in this humbling stretch.
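
To see how little 40 rough innings should move an expectation built on 600, here is a minimal sketch that pools the two Miley samples by innings. The round figures of 600 and 40 innings are assumptions taken from the "more than 600" and "fewer than 40" in the text, and innings-weighted pooling is only a crude stand-in for the kind of formal updating a statistician would do; the function name is made up for illustration.

```python
def pooled_era(prior_era, prior_innings, recent_era, recent_innings):
    """Innings-weighted ERA across two samples (a crude stand-in for a formal update)."""
    # Convert each ERA back into earned runs: ERA = 9 * runs / innings.
    prior_runs = prior_era * prior_innings / 9
    recent_runs = recent_era * recent_innings / 9
    total_innings = prior_innings + recent_innings
    return 9 * (prior_runs + recent_runs) / total_innings


# Assumed round numbers: ~600 innings at 3.79 in Arizona, ~40 innings at 5.60 in Boston.
miley = pooled_era(3.79, 600, 5.60, 40)
print(f"{miley:.2f}")
```

Under those assumptions the pooled figure lands at about 3.90, barely a tenth of a run above his career mark. The bad stretch counts, but it is outweighed roughly 15 to 1.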

Data Samples Have Properties Other Than Size

Is it a coincidence that three of the five Sox starters are also new to the team and the AL East? There are many senses in which the data we’re collecting right now — the outcomes of the first 34 games of the 2015 season — may not be representative of the starting rotation’s collective performance over the long term. Generalizing from an unrepresentative sample is another classic statistics mistake. As Clayton explained, if you wanted to test a hypothesis that 2 percent of Americans had red hair, it would be a mistake to sample only people with freckles.

Accounting for such problems gets arcane pretty quickly. You’ve got to determine mathematically whether your sample is representative of the overall population and, if it isn’t, correct for the difference before you can make a useful generalization from your data. Changes like pitching in a new park, with a new set of teammates, or in a new league need to be taken into consideration. They can affect performance significantly enough to change future expectations.

Remember, for instance, that John Lackey allowed the most earned runs in baseball in 2011 before turning right back into his old self after elbow surgery. Health is surely the most significant factor that affects a sample’s representativeness, but it is not the only one.

All this is why good statistics sites like FanGraphs look for statistics that are subject to fewer variables, such as FIP (fielding independent pitching), which counts only strikeouts, walks, hit batters, and home runs and so leaves the defense behind the pitcher out of the picture. Buchholz’s FIP is notably better than his ERA, which is a good reason for optimism in his case. His most recent turn, a 6.1-inning, three-run effort in Toronto, may well be more predictive than some of the early-inning flameouts he’s had in other games. In fact, it’s more or less in line with his career averages. Which brings us to one last point:

Regression to the Mean Is Not Regression Below the Mean

Confusion over the concept of regression to the mean is widespread, and not just in baseball circles. It is common in casinos, where it has acquired the name the gambler’s fallacy. The general (mistaken) idea here is that when an abnormal pattern occurs, it must be offset or corrected by some opposite abnormal pattern in the future. For instance, if Joe Kelly normally gives up four home runs in a month, and for two months he gives up six each, the fallacy says he’ll be likely to give up just two in each of the two months following, so that the books balance. In plain English, this idea sounds ridiculous, because it is ridiculous. Or at least, it asks a lot from our notions of cause and effect. How could two bad months of pitching create two good months of pitching?

Regression to the mean exists, of course, and is a valuable statistical concept. But it looks different from what some eager gamblers may hope for. Imagine a roulette wheel, Clayton said. “If you have a sample where you’ve got an abnormally high percentage of red spins on a roulette wheel, it is now likely that over the next period of time the number of reds you get will be less than it was in that sample … but not necessarily less than it always has been on average.”

If you have 100 roulette spins that turn up 75 percent red, the next 100 spins are likely to regress toward roughly 50 percent red. The converse is just as true, too: If your first 100 spins are 25 percent red, the next 100 spins are likely to regress toward roughly 50 percent. That’s still regression to the mean, even though we tend not to talk about it that way. Underperforming players are just as likely to improve as overperforming players are to revert. If they weren’t, the mean wouldn’t be the mean.
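
The roulette picture is easy to check with a quick simulation. What follows is a sketch under stated assumptions, not a model of any real wheel: red is treated as a clean 50 percent shot, as in the example above (the green pockets on an actual wheel push it slightly lower), and because 75 reds in 100 fair spins almost never happens, a "hot" sample is defined here as 60 or more reds. The function names, the threshold, and the seed are all illustrative choices.

```python
import random


def count_reds(rng, spins, p_red):
    """Number of reds in a batch of independent spins."""
    return sum(rng.random() < p_red for _ in range(spins))


def simulate(trials=10_000, spins=100, p_red=0.5, hot_threshold=60, seed=2015):
    """Spin two back-to-back batches per trial; keep trials whose first batch ran hot."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hot_first, hot_next = [], []
    for _ in range(trials):
        first = count_reds(rng, spins, p_red)
        following = count_reds(rng, spins, p_red)
        if first >= hot_threshold:
            hot_first.append(first)
            hot_next.append(following)
    n = len(hot_first)
    return n, sum(hot_first) / n, sum(hot_next) / n


n_hot, first_avg, next_avg = simulate()
print(f"hot samples: {n_hot}")
print(f"average reds in the hot batch:  {first_avg:.1f}")
print(f"average reds in the next batch: {next_avg:.1f}")
```

The batch that follows a hot streak averages close to 50 reds: fewer than the streak itself, but nowhere near the 40 or so that a "correction" would require. The wheel regresses to its mean, not below it.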

So while we can’t assume that an underperforming pitcher like Kelly is “due” for a few months of his best baseball, we can reasonably expect that he’ll play like his old self. Whether that’s enough to win the AL East is another question, especially given that the Sox have put themselves in a hole. But it’s still the most useful predictor of what we’ll get from Kelly from here on out. And the same goes for his fellow aces, Buchholz, Miley, Masterson, and Rick Porcello.

Statistically, then, it would seem that patience is the right approach, which is perhaps why Sox GM Ben Cherington suggested recently that the team isn’t likely to shake up its rotation right now. Then again, the team seems to think change will help, since pitching coach Juan Nieves was recently given his walking papers. Boston’s logic seems obvious enough: as noted above, four of the five starters have ERAs much higher than their career numbers would predict. But the team has reasons to maintain confidence in its hypotheses about those players’ capabilities.

To really test this thinking, consider Masterson, whose career ERA, established over more than 1,100 lifetime innings, is tied with Porcello’s for the highest of the group at 4.31. He’s got the second-highest ERA on the team, up to 6.37 after a two-inning performance Tuesday night against the A’s. He’s over 30 (just barely), and his pitch velocity is down (as per an appropriately headlined article at SB Nation). It’s tempting to say that Masterson is finished, and the early word Wednesday is that he is headed for the disabled list. But that move likely has more to do with those velocity numbers than with Masterson’s ERA, because pitch velocity isn’t affected by as many variables. Thus it’s much easier to establish that a small sample of this number is representative, at least in terms of thinking about who Masterson is now. A likely scenario is that he’s got some kind of injury.

If Masterson gets his arm strength back, don’t be distracted by the current outcomes; the prior information on Masterson’s career is far more substantial than any six-week stretch of games. From April 9 to May 13 (i.e., his season so far), Masterson has surrendered 25 earned runs in 35.1 innings; remarkably, from April 17 to May 13, 2012, he also gave up 25 earned runs in 35.1 innings. This didn’t stop him from having a (marginally) acceptable 4.78 ERA the rest of that year, or from having one of the best seasons of his career, including his only All-Star appearance, in 2013. In fact, that 2012 season illustrates that even an entire year of below-average pitching may only incrementally change your expectations for a pitcher with a substantial track record.
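
For anyone who wants to check the arithmetic, ERA is nine times earned runs divided by innings pitched, with the wrinkle that box scores write thirds of an inning as ".1" and ".2" rather than as decimals. The short sketch below, with helper names invented for illustration, converts that notation and recomputes Masterson's line of 25 earned runs in 35.1 innings.

```python
def innings_to_float(innings):
    """Convert box-score innings ('35.1' means 35 and one-third) to a number."""
    whole, _, outs_text = innings.partition(".")
    outs = int(outs_text) if outs_text else 0
    if outs not in (0, 1, 2):
        raise ValueError(f"invalid innings notation: {innings!r}")
    return int(whole) + outs / 3


def era(earned_runs, innings):
    """Earned run average: earned runs allowed per nine innings."""
    return 9 * earned_runs / innings_to_float(innings)


# Masterson's line from the text: 25 earned runs in 35.1 innings.
print(f"{era(25, '35.1'):.2f}")  # 6.37
```

That reproduces the 6.37 quoted above, and it shows how quickly the number can move: at this innings total, each additional scoreless inning would trim it by roughly 0.17 of a run.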

If you want to establish a timeline for giving up on a player’s season, you have to take into account other factors: the quality of available replacements, say, or balancing the team’s other needs. What the numbers can tell you is not what you might expect: be patient. Regression can be a good thing.

Ben Adams (@bendadams) is a book editor at PublicAffairs. His writing has appeared at Vice Sports, SB Nation, and Sports on Earth, among other places.