DataBall

On February 13, 2013, the San Antonio Spurs found themselves in a surprisingly close game in Cleveland. Late in the fourth quarter, Cavs rookie shooting guard Dion Waiters made the biggest basket of his young NBA career, knocking down a tough jumper to give his team a two-point lead with 9.5 seconds left. Smelling an upset, Cavs fans at Quicken Loans Arena started going bananas.

The Spurs called a timeout, advanced the ball to half court, and decided to run one of their favorite plays. Matt Bonner inbounded the ball to Tony Parker, who stood 30 feet from the rim. Parker quickly attacked the left wing, where a massive Tim Duncan screen forced Tyler Zeller to switch onto Parker. With 6.7 seconds on the clock, Parker raced toward the basket, poised to attempt a high-percentage game-tying layup. A split second later, he saw something and changed his mind.

Kawhi Leonard was standing unattended in the weakside corner. Parker’s aggressive slash to the rim had drawn Leonard’s defender, Waiters, all the way into the lane. He was standing flat-footed in no-man’s-land. As soon as Parker noticed this, he hurled a perfect bullet pass along the baseline to Leonard. The rest was almost a formality; at that point, the play was made and all Leonard had to do was sink one of his favorite shots. He did, and the Spurs won the game by one point.

The box score reduces that sequence to a few basic integers; Kawhi Leonard is credited with one field goal attempt, one field goal made, and three points scored. Tim Duncan’s screen goes undocumented, and the totality of Parker’s catalytic undertakings gets recorded as one measly assist.

Afterward, Parker reflected on the game’s final sequence: “I thought I could have made the layup, but I saw Kawhi open. I wasn’t only playing for the win, but I’d been setting up my teammates all night, so I wanted to make the right play at the end.”

♦♦♦

Shortly after the 2012 MIT Sloan Sports Analytics Conference, I received a call from Brian Kopp, the John the Baptist of player-tracking data in the NBA and the person in charge of the SportVU project at STATS LLC in Chicago. I was working at Harvard at the time, and Kopp was offering to share his incredible data set with basketball-minded academics; he asked me if I wanted to “play with some optical tracking data.” I leapt at the chance, having no clear idea what I was getting into.

A few weeks after the call, I got my first glimpse of the raw data that, according to many, was going to change basketball analytics forever; it was a true “Holy shit!” moment. At the time, I was working on a giant, 27-inch iMac display, and when I double-clicked that first SportVU file, data immediately filled the screen. All I could see was an ocean of decimal points, trailing digits, and hundreds of XML tags sporadically interleaved among them. Right away, it was obvious this was the “biggest” data I had ever seen. I’ll always remember my surprise when it occurred to me that everything on my screen amounted to only a few seconds of player action from one quarter of one game. I had thousands of these files; I knew I needed help.

I approached Luke Bornn, a young professor of spatial statistics, and told him about my predicament. Luke suggested we form a research group on campus and enroll graduate students to create projects using the data. The group quickly attracted four PhD students in statistics and computer science. By early 2013, each student was tackling different projects. We called it “XY Hoops.”

Dan Cervone and Alex D’Amour were two of our first members. Both guys are 27 and fourth-year PhD students in the statistics department at Harvard. Both love sports, but not as much as they love coding and numbers. Shortly after seeing the data and doing some brainstorming, they pitched the group a project idea that sounded equally innovative and impossible.

♦♦♦

On the quest for the perfect analytical device, the first discovery should always be the inescapable fact that there is no perfect analytical device. There is no singular metric that explains basketball any more than there is a singular metric that explains life. It’s hard not to improperly elevate the role of “big data” in contemporary sports analyses, but romanticizing them is dangerous. Data are necessarily simplified intermediaries that unite performances and analyses, and the world of sports analytics is built upon one gigantic codec that itself is built upon the defective assumption that digits can represent athletics.

Still, the reality in 2014 is that Adam Silver’s NBA has cameras in the arenas measuring every player’s every move. These stationary drones in the rafters are beaming gigabytes of potentially vital intelligence back to video rooms and practice facilities across the league. Whereas just a few years back acquiring good data was the hard part, the burden now largely falls upon an analytical community that may not be equipped to translate robust surveillance into reliable intelligence. The new bottleneck is less about data and more about human resources, as overworked analysts often lack the hardware, the software, the training, and most of all the time to perform these emerging tasks.

Despite all that, in the hands of talented, well-equipped statisticians, SportVU data is indeed awesome in terms of its potentially massive contribution to understanding the league we all cherish. In Kopp’s words, “We are just scratching the surface, and it takes a lot of work just to get to the point to begin advanced analyses.” The NBA’s big-data possession is just getting started, and everyone is rooting for a slam dunk that benefits teams, athletes, media, and most of all, fans. But that’s not guaranteed, and in the words of Parker, we just have to make sure we “make the right play at the end.”

♦♦♦

Tony Parker is one of the best playmakers in the world. For more than a decade now, he’s been the straw that stirs the Spurs’ stiff offensive drink. But despite winning three rings and an NBA Finals MVP,¹ Parker has never quite been considered a true superstar. Once again this year, he’ll begin the All-Star Game on the bench, playing behind guards who have somehow turned slighter successes into superior Q scores. Maybe this is because Parker is a foreign player, or maybe it’s because he plays in a smaller market deep in the heart of Texas.

But maybe it’s also because our box scores undervalue the importance of the “little things” that players like Parker do and overvalue the most easily quantifiable events like made baskets and rebounds.

On one hand, the notion that we award Leonard three points for his buzzer-beating shot in Cleveland makes sense. After all, he was the one who made the freaking shot. On the other hand, giving Leonard credit for the basket is like awarding George Clooney the credit for Gravity.

“We practiced that play 1,000 times, so I knew we’d be able to execute it,” San Antonio coach Gregg Popovich said after the game.

If we applied this conventional basketball accounting to the game of chess, we’d assign far too much importance to the singular checkmate move, while entirely overlooking that move’s hugely relevant tactical precedents. Chess matches are rarely won or lost in one final action, and the same goes for basketball possessions. They are rarely decided by their terminal actions, and players like Parker or Chris Paul commonly put their teams in advantageous situations one way or another.

In the era of “big data,” the current statistical system — the one that produces the box score — is a typewriter, albeit a reliable one. It was born out of pencil-and-paper convenience rather than a desire to truly measure the contributions of the 10 athletes on the floor. Still, it has worked well, and as a result it’s persisted from the time of Bill Russell, through the Michael Jordan years, and well into the LeBron James era; its derivative dogmas have morphed into things we have termed “advanced stats” and “basketball analytics.”

In the last few decades, pioneers like Ken Pomeroy, Dean Oliver, and John Hollinger effectively took advantage of spreadsheets and other newfangled accoutrements of the personal computing era to launch us headlong into basketball’s computational era. We continue to learn from their contributions, but things are still rapidly evolving.

♦♦♦

Early in the spring semester of 2013, Cervone and D’Amour proposed a new project to measure performance value in the NBA. The nature of their idea was relatively simple, but the computation required to pull it off was not. Their core premise was this:

Every “state” of a basketball possession has a value. This value is based on the probability of a made basket occurring, and is equal to the total number of expected points that will result from that possession. While the average NBA possession is worth close to one point, that exact value of expected points fluctuates moment to moment, and these fluctuations depend on what’s happening on the floor.

Furthermore, it was their belief that, using the troves of SportVU data, we could — for the first time — estimate these values for every split second of an entire NBA season. They proposed that if we could build a model that accounts for a few key factors — like the locations of the players, their individual scoring abilities, who possesses the ball, his on-ball tendencies, and his position on the court — we could start to quantify performance value in the NBA in a new way.

In other words, imagine if you paused any NBA game at any random moment. Cervone and D’Amour’s central thesis is that no matter where you pause the game, you could scientifically estimate the “expected possession value,” or EPV, of that possession at that time.

For example, imagine LeBron James holding the basketball completely unguarded underneath the basket. We would expect him to score two points. The EPV at that moment would be very close to two. Conversely, imagine Dwight Howard holding the ball 40 feet from the hoop with one second remaining on the shot clock and three defenders in his face. It’s highly unlikely that Howard would score. That moment would be ascribed an EPV very close to zero. Of course, most on-court predicaments in the NBA are not this extreme, but they still can be evaluated through this EPV framework.
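To make the arithmetic behind those two extremes concrete, here is a minimal sketch of EPV as plain expected points. The function name and every probability below are hypothetical, chosen only to mirror the examples; none of them come from the model itself.

```python
def expected_possession_value(outcomes):
    """Return the expected points for one frozen moment of a possession.

    `outcomes` maps points scored on the possession to the probability of
    that result. The probabilities must sum to 1.
    """
    total = sum(outcomes.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(points * prob for points, prob in outcomes.items())


# Hypothetical probabilities, not model output.
unguarded_at_rim = {0: 0.02, 2: 0.98}   # almost certain two points
desperation_heave = {0: 0.98, 3: 0.02}  # 40 feet out, shot clock expiring

epv_rim = expected_possession_value(unguarded_at_rim)
epv_heave = expected_possession_value(desperation_heave)
```

With these made-up inputs the first moment is worth just under two points and the second just above zero, which is the whole idea: the same possession can be worth very different amounts from one instant to the next.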

While that is novel, the really big ideas trickle down from here.

If we can estimate the EPV of any moment of any given game, we can start to quantify performance in a more sophisticated way. We can derive the “value” of things like entry passes, dribble drives, and double-teams. We can more accurately quantify which pick-and-roll defenses work best against certain teams and players. By extracting and analyzing the game’s elementary acts, we can isolate which little pieces of basketball strategy are more or less effective, and which players are best at executing them.

But the clearest application of EPV is quantifying a player’s overall offensive value, taking into account every single action he has performed with the ball over the course of a game, a road trip, or even a season. We can use EPV to collapse thousands of actions into a single value and estimate a player’s true value by asking how many points he adds compared with a hypothetical replacement player, artificially inserted into the exact same basketball situations.² This value might be called “EPV-added” or “points added.”
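One rough way to picture that comparison is below. It is a sketch under simplifying assumptions, not the formula from the paper: each touch is reduced to two hypothetical numbers, the possession's value with the real player making the decisions and its value with a replacement player dropped into the same situation.

```python
def points_added(touches):
    """Sum the per-touch gap between a player and a replacement.

    `touches` is a list of (player_value, replacement_value) pairs, each
    measured in expected points for the same on-court situation.
    """
    return sum(player - replacement for player, replacement in touches)


def points_added_per_game(touches, games):
    """Average the season total over the number of games played."""
    if games <= 0:
        raise ValueError("games must be positive")
    return points_added(touches) / games


# Hypothetical touches; the values are invented for illustration.
touches = [(1.10, 1.00), (0.95, 1.00), (1.30, 1.05)]
total = points_added(touches)
per_game = points_added_per_game(touches, games=1)
```

A positive total means the player's decisions were worth more than the replacement's across those touches; a negative total means they were worth less.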

Let’s examine that Parker-Leonard buzzer beater again, this time through the lens of EPV. The play begins with the Cavs leading by two points and just under nine seconds remaining in the game. As Parker initiates the offensive sequence, the model estimates that the possession is worth 0.97 points.

After Duncan’s screen frees up Parker to attack Zeller, EPV actually decreases as Parker penetrates through the midrange closely marked by Zeller. But as he gets close to the basket, the EPV surges to 1.36. Parker’s dribble drive has already elevated that value of the possession by 0.39 points — but he’s not done. He increases the value of the play further when he fires that crazy baseline pass to Leonard, standing open in the corner. EPV accounts for both Leonard’s great corner shooting prowess and that he is wide open. As a result, the EPV peaks at 1.75 as Parker throws the game-winning assist. There is a slight decrease in value, to 1.58, as Dion Waiters frantically attempts to close out Leonard, but Waiters is too late.

There are several ways to divvy up credit for the changes in EPV during this play, but the simplest method is to attribute them to the person in possession of the basketball at the time of those changes. Using this approach, Parker would earn plus-0.78 points for this particular sequence, corresponding to the amount of EPV gain from the time he began his attack (EPV = 0.97) until the time Leonard received the pass in the corner (1.75).
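That bookkeeping can be sketched in a few lines. The EPV values are the ones quoted for this play; reducing the play to four sampled instants, and crediting each change to whoever held the ball when the change began, are simplifying assumptions made here for illustration.

```python
from collections import defaultdict


def credit_ballhandlers(samples):
    """Credit each EPV change to the player holding the ball when it began.

    `samples` is a time-ordered list of (ballhandler, epv) pairs.
    """
    credit = defaultdict(float)
    for (handler, epv_before), (_, epv_after) in zip(samples, samples[1:]):
        credit[handler] += epv_after - epv_before
    return dict(credit)


play = [
    ("Parker", 0.97),   # Parker initiates the attack
    ("Parker", 1.36),   # Parker nears the basket
    ("Leonard", 1.75),  # Leonard receives the pass in the corner
    ("Leonard", 1.58),  # Waiters closes out
]
credit = credit_ballhandlers(play)
parker_credit = round(credit["Parker"], 2)
leonard_credit = round(credit["Leonard"], 2)
```

Under this convention Parker comes out at plus-0.78, matching the figure above. Leonard's minus-0.17 is just the arithmetic of the closeout (1.75 down to 1.58) and does not account for the shot actually going in.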

In a parallel universe, without our conventional box scores, and with only EPV, Parker would get more of the numerical glory for this play. But within our points-rebounds-assists framework, Leonard does. The YouTube clip of the play is entitled “Kawhi Leonard’s Game-Winning 3-pointer!”

♦♦♦

Cervone and D’Amour began building a model to estimate EPV last year. They derived their approach from “competing risk models,” which are mostly used in survival analysis to evaluate multiple risks of death and changes to those risks over time. Cervone saw an opportunity to apply them to basketball. Instead of the duration of a human life, he would examine NBA possessions; in place of various causes of death would be various possession outcomes in basketball games.
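For readers who have never met a competing risk model, a toy version may help. The sketch below assumes discrete time steps and constant per-step hazards, both of which are simplifications made here; the outcome names and hazard values are hypothetical and are not taken from the paper.

```python
def cumulative_incidence(hazards, steps):
    """Probability that each outcome has ended the possession within `steps`.

    `hazards` maps an outcome name to its per-step probability of occurring,
    given that the possession is still alive. Holding the hazards constant is
    a simplification; in a real model they change as the play develops.
    """
    total_hazard = sum(hazards.values())
    if not 0.0 <= total_hazard <= 1.0:
        raise ValueError("per-step hazards must sum to a value in [0, 1]")
    incidence = {name: 0.0 for name in hazards}
    alive = 1.0  # probability that no outcome has happened yet
    for _ in range(steps):
        for name, hazard in hazards.items():
            incidence[name] += alive * hazard
        alive *= 1.0 - total_hazard
    return incidence, alive


# Hypothetical per-step hazards for three ways a possession can end.
hazards = {"shot": 0.06, "turnover": 0.01, "foul": 0.01}
incidence, alive = cumulative_incidence(hazards, steps=24)
shot_share = incidence["shot"] / sum(incidence.values())
```

Each outcome "competes" to end the possession first, and with constant hazards an outcome's share of all endings equals its share of the total hazard (here, three-quarters for shots).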

This unique approach to basketball analysis is the subject of a research paper to be presented later this month at the 2014 MIT Sloan Sports Analytics Conference in Boston.

“Instead of death,” Cervone explains, “we’re applying the ‘risk’ to look at the probabilities of different on-court actions or outcomes over time.” As described in the full paper, the model estimates two key values at every moment of every game:

By definition, the current EPV of a possession is the weighted average of the outcomes of all future paths that the possession could take. Calculating this requires a model that defines a probability distribution over what the ballhandler is likely to do next, given the spatial configuration of the players on the court, as we need to understand what future paths the possession can take and how likely they are given the present state. We call this model the possession model. Using a Markovian assumption, the possession model allows us to estimate both (a) the probability that a player will make a particular decision in a given situation and (b) the resulting EPV of the possession after the player makes that decision. Taken together, we learn both how valuable any moment in a possession is, as well as the features of the offense’s configuration that produce this value.

Here’s an example of the key values estimated by the model taken from a Spurs-Thunder game from last season. Kawhi Leonard holds the ball near the top of the arc; the model estimates what is most likely to happen next, and what changes in EPV will result from that particular action.
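The "weighted average of all future paths" idea can be illustrated with a deliberately tiny Markov chain. Everything in this sketch is an assumption made for illustration: the three coarse court states, the transition probabilities, and the solver itself. The real possession model works on the full spatial configuration of the players, not on three labels.

```python
def solve_epv(transitions, terminal_points, sweeps=500):
    """Iterate V(s) = sum over s' of P(s -> s') * V(s') until values settle.

    `transitions` maps each live state to {next_state: probability}.
    `terminal_points` maps each possession-ending state to its point value.
    """
    values = {state: 0.0 for state in transitions}
    values.update(terminal_points)
    for _ in range(sweeps):
        for state, row in transitions.items():
            values[state] = sum(p * values[nxt] for nxt, p in row.items())
    return values


# Hypothetical states and probabilities; each row sums to 1.
terminal_points = {"made_3": 3.0, "made_2": 2.0, "no_score": 0.0}
transitions = {
    "top_of_arc": {"wing": 0.5, "paint": 0.2, "made_3": 0.1, "no_score": 0.2},
    "wing": {"top_of_arc": 0.3, "paint": 0.3, "made_3": 0.15, "no_score": 0.25},
    "paint": {"wing": 0.2, "made_2": 0.5, "no_score": 0.3},
}
values = solve_epv(transitions, terminal_points)
```

Because of the Markov assumption, the value of a state depends only on where the ball can go next and what those next states are worth. In this toy chain the paint ends up as the most valuable place to have the ball and the top of the arc as the least.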

If you ask Cervone about the hardest part of this project, he’ll quickly point to computation. His answer is instructive as sports analytics races headlong into its own version of the “big data” era. In total, the 2012-13 SportVU data set used for the paper included 800 million locations of NBA players. Keep in mind, the data was collected in only 14 NBA arenas last season; it’s collected in every arena this season. The database the group built for the project ended up being 93 gigabytes.

As a means to squeeze this unwieldy database through the demanding model, Cervone and D’Amour turned to Harvard’s cluster computing service, known as Odyssey. It took the enhanced computational horsepower of 500 parallel processors and two terabytes of memory to complete the analysis.

Relative to the basic math and smallish data that we’ve been dealing with over the past few decades, analyzing this new kind of data is extremely challenging, from both a personnel and a computational standpoint. How many NBA teams have employees who even know what a competing risk model is, let alone people who know how to design and implement one? How many NBA teams have computational clusters or employees who know how to use them? Those answers may not be zero, but they are surely much closer to zero than they are to 30.

♦♦♦

Of all the players in the NBA, Chris Paul had the highest “points added” at 3.48 per game over the 2012-13 season. This makes sense. Most people around the league would describe him as the best point guard in the game. Parker also ranks very well. By adding 1.5 points per game, he is 20th out of 327 players with a qualifying number of possessions. Ricky Rubio had the lowest, at minus-3.33 points added per game. “When we say Paul had a points added score of 3.48,” D’Amour explains, “we mean that his team was expected to score 3.48 more points per game because it was Paul, and not the average player, who made the decisions every time he touched the ball.”

“In general, players who are well adapted to the tools they have to work with, like their own shooting ability and the abilities of their teammates, add lots of points,” D’Amour said. “If a player has a shot that he can hit at a higher rate than anybody else (say, Dirk Nowitzki’s midrange shot), or takes advantage of the unusual talents of a teammate (say, passing often to a 3-point shooting guard like Ray Allen), the player comes out as a positive. If a player takes hard shots when most other players would pass to better shooters, or fails to adapt to a teammate who is having an unusually poor shooting season (Ricky Rubio when Kevin Love had a bad wrist), the player comes out as a negative.”

Rubio owes most of his “points lost” to his unfortunate shooting skills. Relative to league averages, he struggles to make shots in every part of the scoring area. In terms of EPV “over replacement,” almost every jumper Rubio decides to take is a proposition less valuable than if almost any other similar player took that exact same shot. For this reason, although Rubio regularly adds value with his non-shooting plays, the model docks him overall.³

♦♦♦

The overall contribution of the EPV project remains unknown; it is in its infancy and is by no means going to “revolutionize” basketball analysis. But perhaps it will eventually fuel a new supplementary approach to assessing performance in the NBA. Maybe it won’t, and its contribution will be less about figuring out true player value and more about simply raising awareness about the kinds of thinking, computation, and analysis that are required to really generate new basketball insights going forward. For years, we have talked about “advanced stats” when what we were really talking about was slightly savvier arithmetic. That’s going to change, whether we want it to or not. Don’t get me wrong — metrics like points per possession and PER have significantly improved the analytical discourse surrounding basketball. Still, there’s a tremendous amount left to do. And given these vast haystacks of newfangled player tracking data, we’re in desperate need of similarly newfangled needle-extraction techniques.

Sadly, as the best data sets become harder to acquire and the computational requirements become more intense, the days of bedroom analytics might be numbered.

Illustration by Aaron Dana.