Ready, Set, Statcast: What the New Data Stream Can Teach Us About MLB

Sometime this spring, Major League Baseball Advanced Media completed the installation of Statcast, its long-anticipated everything-tracking technology. The system is now active in all 30 parks, and the petabytes of records it generates by capturing the physical position of every player, pitch, and batted ball many times per second are accumulating somewhere on MLBAM’s massive servers.

Thus far, the sexier components of Statcast have kept a low public profile. Statcast’s official account hasn’t tweeted a new stat-enhanced video since Alex Gordon stopped at third, and we haven’t seen any sign of the baserunning and defensive data that should shed some light on two of the most mysterious areas of a fairly well-quantified sport.

But despite the lack of fanfare and the features that remain under wraps, Statcast’s soft launch has been better news than most statheads had allowed themselves to hope for, given the commissioner’s caginess about the time frame for releasing raw data online. If you’ve used MLB Gameday to keep track of your team during this season’s first few games, you may have noticed a new tab called “Feed,” where embedded tweets and videos compete for screen real estate with stats the public hadn’t seen before they surfaced in snippets of video last season: batted-ball velocity and distance.

The events displayed in Gameday and the mobile MLB At Bat app live in files that look like gibberish but are easy to import into a more friendly format. The files exist primarily to serve MLBAM’s products, but because MLBAM doesn’t restrict access, they’re going to get some hop-ons — fortunately for us. Those hop-ons have built the wealth of free statistical resources that draw on MLBAM’s data, and they noticed almost immediately when unfamiliar fields appeared in the files on Opening Day: hit_speed, hit_angle, hit_direction, and hit_distance. Without any warning, MLBAM was giving away Statcast-tracked velocity, direction, and distance of batted balls, potentially eliminating one of the biggest gaps in knowledge between fans and front offices.

For front offices, this information isn’t anything new: Teams have HITf/x data going back to 2008. Although HITf/x relies on a different data source than Statcast’s TrackMan-radar-based ball tracking, its output and potential applications are similar. Teams have had this cool new batted-ball data we’re all hot and bothered about since before the series premiere of Parks and Recreation.

People getting excited about batted-ball baseball data is adorable.

— Christopher D. Long (@octonion) April 6, 2015

@dj_mosfett @dj_mosfett It was released so long ago I'd forgotten the public didn't have access to it.

— Christopher D. Long (@octonion) April 6, 2015

For public-sphere sabermetricians, though, this is the most promising new data set to appear since PITCHf/x. Thus far, the technical side has been understandably spotty: Not every park is reporting hit speeds, and the values disappear on one day only to return the next. As of Wednesday night, however, Baseball Savant had data on 259 brand-new batted balls, which is 259 more than we had a week ago.

That total is still too small for any ambitious analysis — although it’s never too early to look at a leaderboard — but we can separate hits and outs into two groups and compare the velocity of each:

[Chart: batted-ball velocity for outs vs. hits]
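For anyone who wants to reproduce that split at home, here is a minimal sketch in Python. It assumes each batted ball has already been reduced to a speed and an outcome: `hit_speed` is the field name from the Gameday files, while the `is_hit` flag, the function name, and every value in the list are made up for illustration.

```python
from statistics import mean, median

# Hypothetical records: hit_speed in mph, is_hit True for hits and False for outs.
# These values are invented for illustration, not real Statcast readings.
batted_balls = [
    {"hit_speed": 102.3, "is_hit": True},
    {"hit_speed": 95.1, "is_hit": True},
    {"hit_speed": 78.4, "is_hit": True},
    {"hit_speed": 98.7, "is_hit": False},
    {"hit_speed": 84.2, "is_hit": False},
    {"hit_speed": 72.5, "is_hit": False},
    {"hit_speed": None, "is_hit": False},  # not every park reports a speed
]

def summarize_by_outcome(rows):
    """Return count, mean, and median hit speed for hits and for outs."""
    groups = {"hits": [], "outs": []}
    for row in rows:
        if row["hit_speed"] is None:  # skip balls with no tracked speed
            continue
        groups["hits" if row["is_hit"] else "outs"].append(row["hit_speed"])
    return {
        name: {"n": len(speeds), "mean": mean(speeds), "median": median(speeds)}
        for name, speeds in groups.items()
        if speeds
    }

summary = summarize_by_outcome(batted_balls)
```

With only a few hundred real batted balls, the medians are probably the safer numbers to quote, since a handful of dribblers or screamers can drag a mean around.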

As one would expect, the more successful batted balls tend to be hit harder. OK, so “sign guys who hit the ball hard” is not the new inefficiency. But before long, we’ll be able to do much more with this data than make obvious infographics. Here’s a quick look at several ways in which even this limited stream of Statcast info — with more likely to follow in the near future — can teach us more about baseball, provided MLBAM doesn’t turn off the tap.

Outcome-Independent Player Evaluations

This is the big one: the low-hanging fruit that also happens to taste the juiciest. Statcast has the potential to make players’ lives a little less unfair. “Outs” and “hits” are broad categories, and many batted balls that would usually belong in one group end up in the other just because a fielder happens to do something extraordinary or stands in an unusual spot. Take the third-hardest-hit ball of the sample so far, a 111.6 mph screamer centered by Jorge Soler on Opening Night:

[Video: Jorge Soler’s 111.6 mph out on Opening Night]

Soler’s outcome-based stat lines consider that near hit an out like any other. The outfielder’s batting average, slugging percentage, and WAR don’t distinguish between a squared-up at-’em ball and an easy popup, even though the former bodes better for the future.

Compare Soler’s loud out to the weakest batted ball in the sample, a 40.6 mph single by Sam Fuld that would have earned his owners points in any overpopulated fantasy leagues deep enough for teams to employ him.

[Video: Sam Fuld’s 40.6 mph single]

Beating out infield dribblers and hitting baseballs hard enough to make Matt Holliday’s life flash before his eyes are both skills, but the latter is more valuable in the long run. The larger the sample, the more closely slash lines reflect reality: It’s hard to look good by luck alone over a two- or three-season span. Over a single season, though (or a fraction of one), it’s possible for a player to pass as a better hitter than he is when lucky bounces help his cause.

Without access to the tracking information that teams have had for several seasons, we use crude proxies to try to separate regression candidates from deserved sluggers: BABIP, “expected BABIP” (based on bucketed batted-ball distributions and an estimate of speed), or, if we’re lucky, “hard-hit average.” With a large enough Statcast sample, we could do better by assigning an expected run value to every batted ball, based only on the observed run values of previous batted balls hit on similar trajectories and at similar speeds. That way, Soler would get credit for crushing a ball in a direction where it would normally fall for a hit, even if Holliday happened to catch it. For pitchers, the system would work the same way: Adam Wainwright, who grooved the pitch to Soler, would get dinged, while Colby Lewis, who got Fuld to hit a tapper, wouldn’t be penalized significantly just because one weak dribbler happened to be perfectly placed.
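One rough way to sketch that idea in Python is to sort past batted balls into coarse buckets of speed and angle, average the observed run value in each bucket, and then credit a new ball with its bucket's average no matter how the play turned out. Nearly everything here is an assumption: the 10 mph and 10-degree bucket widths, the function names, the run values in the toy history, and the launch angles attached to the two example balls. Only the `hit_speed` and `hit_angle` field names and the 111.6 and 40.6 mph readings come from the data discussed above. A fuller version would also use `hit_direction`.

```python
from collections import defaultdict
from statistics import mean

SPEED_BIN_MPH = 10  # assumed bucket width for hit_speed
ANGLE_BIN_DEG = 10  # assumed bucket width for hit_angle

def bin_key(hit_speed, hit_angle):
    """Map a batted ball to a coarse (speed, angle) bucket."""
    return (int(hit_speed // SPEED_BIN_MPH), int(hit_angle // ANGLE_BIN_DEG))

def build_expected_values(history):
    """Average the observed run value of past batted balls in each bucket."""
    buckets = defaultdict(list)
    for hit_speed, hit_angle, run_value in history:
        buckets[bin_key(hit_speed, hit_angle)].append(run_value)
    return {key: mean(values) for key, values in buckets.items()}

def expected_run_value(table, hit_speed, hit_angle):
    """Look up a ball's expected value; None if no similar ball has been seen."""
    return table.get(bin_key(hit_speed, hit_angle))

# Invented history: (hit_speed in mph, hit_angle in degrees, observed run value).
history = [
    (112.0, 12.0, 0.9), (110.5, 15.0, -0.3), (115.0, 18.0, 1.2),
    (41.0, -20.0, -0.3), (45.5, -15.0, -0.3), (48.0, -12.0, 0.5),
]
table = build_expected_values(history)

# Speeds are the two readings cited above; the angles are hypothetical.
loud_out = expected_run_value(table, 111.6, 14.0)     # hard-hit ball that was caught
soft_single = expected_run_value(table, 40.6, -18.0)  # dribbler that was beaten out
```

In this toy table the loud out is worth more than the soft single, which is the point of the exercise: The hitter is credited for the contact rather than for where the fielder stood. Fixed buckets are the crudest possible notion of "similar," and a real version would likely smooth across neighboring speeds and angles.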

Obviously, outcomes still matter — expected run values notwithstanding, a game-winning bleeder beats a game-losing line drive — but expected run value per plate appearance would be more telling than any rate stat we can consult now. One front-office veteran who’s seen what this data can do when applied in that way described it to me as “incredibly fruitful,” and I didn’t even tell him that fruit was the motif for this section.

Settling DIPS Debates

By pairing pitch tracking with hit tracking, we could get a clearer idea of the types of pitches that produce weaker contact and identify pitchers who can throw those pitches consistently, helping us identify arms whose below-average BABIPs are “real.” We’d also end up with more accurate pitch-type linear weights.

Injury/Fatigue Detection

Tracking technology often leads not to different conclusions but to the same conclusions sooner. In that spirit, batted-ball speed could serve as an indicator of health and as a way to distinguish a harmless, passing slump from one with an underlying physical cause.

Killing the Batted-Ball Bucket List

Fly balls, line drives, and grounders will always belong to the baseball lexicon, but they’re on their way out as analytical tools. Batted-ball classification is a subjective exercise: Different people (or the same person, sitting closer to or farther from the field) can easily come to different conclusions about where “line drive” ends and “fly ball” begins. Soon, we’ll be able to refer to a given hitter’s average angle of elevation (and hit speed) instead of his line-drive rate. While we’re not likely to hear broadcasters incorporate batted-ball angle into home run calls — “That’s a deeeeep fly ball!” still sounds better than “That’s a 25-degreeeee batted ball with a 107 mph hit speeeeed!” — objective angles, rather than arbitrary, imprecise categories, are about to become the sabermetric lingua franca.
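To see why the buckets are shaky, consider a small Python sketch in which the same set of launch angles is labeled under two different definitions of "line drive." The cutoffs, the function names, and the angle readings are all hypothetical; only the `hit_angle` field name comes from the Gameday files.

```python
from statistics import mean

def classify(hit_angle, liner_min=10.0, liner_max=20.0):
    """Label a batted ball by launch angle; the cutoffs are arbitrary assumptions."""
    if hit_angle < liner_min:
        return "ground ball"
    if hit_angle < liner_max:
        return "line drive"
    return "fly ball"

def line_drive_rate(angles, **cutoffs):
    """Share of batted balls labeled line drives under the given cutoffs."""
    labels = [classify(angle, **cutoffs) for angle in angles]
    return labels.count("line drive") / len(labels)

# Invented hit_angle readings (degrees) for one hypothetical hitter.
angles = [-8.0, 4.0, 9.0, 11.0, 18.0, 24.0, 26.0, 33.0]

rate_strict = line_drive_rate(angles)                                # 10 to 20 degrees
rate_loose = line_drive_rate(angles, liner_min=8.0, liner_max=25.0)  # 8 to 25 degrees
average_angle = mean(angles)
```

The hitter's line-drive rate doubles when the cutoffs move by a few degrees, while his average angle stays the same however anyone chooses to draw the lines.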

Understanding Defensive Positioning

Player-tracking technology has the potential to improve defensive positioning by offering insight into fielders: how quickly they react and reach top speed, how direct their routes are, and how much ground they can cover in different directions. Of course, the fielders’ abilities are only half the equation when aligning defenders. The other, equally important variable is where the ball is likely to go, and how quickly. Not only should TrackMan-measured batted-ball directions be more reliable than human-entered trajectories, but standard spray charts don’t factor in batted-ball speed, an important consideration when deciding where to station a fielder in order to give him the best chance to intercept balls. Teams have used their HITf/x head start to better anticipate batted-ball locations, and Statcast will give them the last puzzle piece. And while those of us watching at home won’t be positioning fielders, better batted-ball data would help us figure out why the shifts we see make sense.

Adjusting Defensive Stats

More precise measurements of batted-ball speed should permit more accurate estimates of the difficulty of defensive plays — and thus, the skill of fielders. Some pitching staffs probably allow consistently harder contact than others, and it’s possible the defenders playing behind them should be given more credit for handling that extra zip than they currently receive. And just as broad batted-ball classifications can give way to more granular angles for hitters, fielding “zones” superimposed on the field can give way (with a sufficient sample) to more granular play probabilities at every point on the field for defenders.

Studying Hot Streaks

New technology might enable us to take a fresh approach to a tired topic. If we define a successful plate appearance as one with a high expected run value rather than one that necessarily results in a hit, we’ll remove some randomness from the process and (maybe) discover a tendency for players in hot streaks to stay in them. I’m not holding my breath.

As pleasant a surprise as it was to get raw data on Day 1 in addition to the stripped-down stats in Gameday, At Bat, and (soon) TV broadcasts, the most encouraging aspect of Statcast’s first week as a leaguewide system isn’t the potential for useful discoveries as soon as this season. It’s the precedent it sets in the ongoing battle between proprietary and public. Teams will always have access to data sources we don’t: scouting reports, medical records, biomechanics, and more. But based on the form this first trickle of tracking info has taken, it’s starting to seem more realistic that we won’t end up with a watered-down version of the uncut Statcast that teams can expect. And in that case, the new numbers would be more likely to narrow the gap between insiders and outsiders than they would be to push them further apart.

@Bbl_Astrophyscs @statcast @TrackManBB @mlbam all things in time! ALL things

— Cory Schwartz (@schwartzstops) April 8, 2015

Thanks to Rob Arthur and Daren Willman for research assistance.