Behind DataBall: A Discussion on the Methodology of Expected Possession Value

On Thursday, Grantland published “DataBall,” a piece by Kirk Goldsberry that looks inside the brave new world of SportVU data and how it could change how we quantify what happens during an NBA game. It also introduced a new statistic, Expected Possession Value (EPV), which posits that you can estimate the predicted point value of any moment during any possession of an NBA game. Heady stuff! We wanted to hear more about the methodology behind EPV from the people who are most familiar with it, Harvard PhD students Alex D’Amour and Dan Cervone. Below is a Q&A between them and Goldsberry, discussing the process behind EPV and the possible future uses for the stat.

What would you guys say inspired the key ideas behind “DataBall” and Expected Possession Value?

Alex D’Amour: When I was an undergrad, I read an interview where Carl Morris, a professor in our department, described an expected runs model for baseball that allowed you to take the number of outs and the configuration of runners on base, and compute the expected number of runs the batting team would drive in before the end of the inning. It was beautiful — a way to answer the question in every fan’s gut: “How well is my team doing right now?” — in terms you could see on the scoreboard.

Since then, I had wondered whether such a model could also be developed for more free-flowing sports like basketball or soccer. Without optical tracking data, it seemed pretty far-fetched. That was in 2008. By the time Kirk and Luke [Bornn, a professor of spatial statistics at Harvard] came to us with the data in 2013, I’d been thinking about a model like this for almost five years.
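The expected runs idea D’Amour describes can be sketched in a few lines: group past situations by outs and base configuration, then average the runs scored from that point to the end of the inning. This is only an illustration of the general technique, not Morris's model; the function name, the state encoding, and the toy observations below are all assumptions made up for the example.

```python
from collections import defaultdict

def expected_runs(observations):
    """Average runs scored from each (outs, bases) state to the end of the inning.

    observations: iterable of (state, runs_after) pairs, where state is a
    hypothetical encoding such as (1, "1--") for one out, runner on first.
    """
    totals = defaultdict(lambda: [0.0, 0])  # state -> [sum of runs, count]
    for state, runs_after in observations:
        totals[state][0] += runs_after
        totals[state][1] += 1
    return {state: total / count for state, (total, count) in totals.items()}

# Made-up toy data, purely illustrative (not real baseball results).
toy_observations = [
    ((0, "---"), 1),
    ((0, "---"), 0),
    ((1, "1--"), 2),
    ((1, "1--"), 0),
    ((1, "1--"), 1),
]

run_table = expected_runs(toy_observations)
for state, value in sorted(run_table.items()):
    print(state, round(value, 2))
```

Baseball has only a small number of such states, which is why a lookup table works there; basketball positions vary continuously, which is the gap the tracking data fills.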

Dan Cervone: When we were initially talking about modeling the real-time expected value of an NBA possession, we were imagining one of those mood ticker graphics we saw running at the bottom of the presidential debates on CNN and other networks. Those would spike up every time someone said the words “create jobs” and nosedive every time someone said the words “binders full of women.” Could we do the same thing for basketball, using statistical and machine learning methods in place of fans with mood knobs, and recognize when the value of a possession skyrockets or plummets?

This is obviously a computationally intense project. What would you say was the hardest part?

D’Amour: Fitting the sub-model that tells you how likely a player is to pass to his teammate given the spatial positions of both. To get realistic estimates of EPV, you have to fit this model for each player to capture his tendencies, and for each player, you have to know how the pass probability changes as the positions of the player and his four teammates change. This boils down to estimating four four-dimensional surfaces for each player … try wrapping your head around that. And that’s just for passing. There’s a similar problem that we had to solve for shooting as well. Dan did a great job of designing this model and putting together the software pipeline that did the estimation very efficiently.
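To see why the pass and shot sub-models matter, it can help to think of the value of a moment as a probability-weighted average over what the ballhandler might do next. The sketch below is a heavily simplified illustration under that assumption, not the authors' actual model; the action names, probabilities, and point values are hypothetical.

```python
def possession_value(action_probs, action_values):
    """Probability-weighted average of the value that follows each action.

    action_probs:  action -> probability the ballhandler takes it now
    action_values: action -> expected points for the possession after it
    """
    if abs(sum(action_probs.values()) - 1.0) > 1e-9:
        raise ValueError("action probabilities must sum to 1")
    return sum(prob * action_values[action] for action, prob in action_probs.items())

# Hypothetical numbers for a single moment, purely illustrative.
probs = {"shoot": 0.20, "pass_to_corner": 0.30, "keep_dribbling": 0.45, "turnover": 0.05}
values = {"shoot": 0.90, "pass_to_corner": 1.20, "keep_dribbling": 1.00, "turnover": 0.00}

epv_now = possession_value(probs, values)
print(round(epv_now, 3))
```

In this toy version the probabilities are fixed numbers; the difficulty D’Amour describes is that in practice they depend on who has the ball and where all five offensive players are standing.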

Cervone: That efficiency is relative. Even though we used some approximations and state-of-the-art algorithms for estimating the shot and pass surfaces, those parts of the model took several hours to run — on 500 processors with two terabytes of memory. Thankfully we had Harvard’s Research Computing clusters rather than our own laptops, which probably would have melted after a few days under this computational load.

We also needed to design hierarchical models to share information between players in our data set. This helps us estimate the probabilities of events we rarely see; for instance, Dwight Howard is 2-for-2 on his corner 3 attempts this year … but do you really think he’ll sink that shot every time? (Our model thinks it’s more like 22 percent.)
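The information sharing Cervone describes can be illustrated with the simplest form of shrinkage: treat the league-wide tendency as a prior worth a certain number of "pseudo-attempts" and blend it with the player's own record. This is a minimal sketch using a Beta-Binomial posterior mean, not the hierarchical model the authors fit; the prior strength below is a hypothetical choice and is not tuned to reproduce their 22 percent estimate.

```python
def shrunk_rate(makes, attempts, prior_makes, prior_misses):
    """Posterior mean make rate under a Beta(prior_makes, prior_misses) prior."""
    return (makes + prior_makes) / (attempts + prior_makes + prior_misses)

# Raw rate from the article's example: 2-for-2 on corner 3s.
raw_rate = 2 / 2

# Hypothetical prior: acts like 15 earlier attempts at a 20 percent make rate.
pooled_rate = shrunk_rate(makes=2, attempts=2, prior_makes=3.0, prior_misses=12.0)

print(raw_rate, round(pooled_rate, 3))
```

With only two attempts, the prior dominates and the estimate lands near 29 percent rather than 100 percent; as a player piles up attempts, his own record gradually takes over.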

What do you think the future of EPV is?

D’Amour: Like gravity, EPV is a fundamental quantity that’s been a part of basketball since the beginning, and based on what we’ve heard from the basketball community, something like EPV has been on players’ and coaches’ minds for a while. Our contribution was to formalize the idea and show that you can feasibly estimate it. We’re hoping that the idea of EPV will let people start asking the questions that they’ve been meaning to ask forever, but just didn’t know how to do it in a quantitative way. So many tactical questions can be translated into estimating a difference in EPV. Should Steph Curry take more 3s? Who is the best on-ball defender against LeBron James? Further down the line, the analyses could extend into strategy as well — you could imagine a game-theoretic analysis of optimal offensive and defensive responses that are built around maximizing or minimizing EPV.

Cervone: Using EPV to systematically study players’ defensive abilities is the next big step. Studying on-ball defense follows easily from what we already have. But oftentimes, the mark of a great defender is that the player he guards never even touches the ball. Our friends and colleagues, Alex Franks and Andy Miller, have been working on a model that measures how defensive matchups impact the shot selection and efficiency of the offense. In the future we can combine our work to measure defensive value similarly to how we measure offensive value now.

What skills do you think are critical for people looking to get into this type of analysis?

D’Amour: You need a mix of skills in computer science and statistics or machine learning. Operationally, you need to be able to store and process data on a scale that hasn’t been necessary in sports analytics until now. Analytically, these data have a structure you’d never see in a statistics textbook, so a statistics or machine learning background that extends beyond regression to building and fitting custom probability models is essential.

Cervone: On the flip side, we shouldn’t forget the value of simple models like linear regression: They’re easy to understand and interpret, we know how and when to use them, and we can empirically validate their performance. In order to draw useful and valuable insights from increasingly large data sets, we need to make sure the same is true for more sophisticated models.

What are the big limitations in the current model?

D’Amour: There are some obvious things that need to be added for completeness, like the shot clock, defensive mismatches, fouls, and giving up fast breaks. We also have to do a better job of characterizing how players move with the ball when they don’t pass or shoot for long stretches.

Cervone: We also need to work on translating EPV curves into actionable player insight. Off-ball events like screens and players breaking toward the basket do impact the EPV curve, but it’s difficult to attribute the EPV change to any one player’s actions — so far, we’ve just been crediting the ballhandler. These types of limitations apply less to our particular model, and more to the goal of inferring completely individualized summaries of value in a team sport.
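The attribution rule Cervone mentions, crediting every change in the EPV curve to whoever has the ball, can be sketched as follows. The function name, the data layout, and the toy curve are assumptions made for illustration; the example also shows the limitation he raises, since a jump caused by an off-ball screen would still be booked to the ballhandler.

```python
from collections import defaultdict

def credit_ballhandler(epv_curve, ballhandlers):
    """Sum each step-to-step EPV change and credit it to the player with the ball.

    epv_curve:    EPV at consecutive time steps of one possession
    ballhandlers: ballhandlers[i] has the ball between step i and step i + 1
    """
    if len(ballhandlers) != len(epv_curve) - 1:
        raise ValueError("need exactly one ballhandler per interval")
    credit = defaultdict(float)
    for i, player in enumerate(ballhandlers):
        credit[player] += epv_curve[i + 1] - epv_curve[i]
    return dict(credit)

# Made-up possession: player "A" holds the ball for two intervals, then "B".
toy_curve = [1.00, 1.05, 1.20, 1.10]
toy_handlers = ["A", "A", "B"]

credits = credit_ballhandler(toy_curve, toy_handlers)
for player, value in sorted(credits.items()):
    print(player, round(value, 2))
```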

Do you see yourselves working in sports in the future?

Cervone: I wouldn’t rule it out, but sports isn’t the only industry in which we’ve witnessed an explosion of data and emphasis on data-driven insight over the last decade.

D’Amour: It’s tough to say. I’m pretty focused on an academic career in statistics at the moment, but I’d love to do more research projects in sports. I’ve also heard of at least one academic who’s gotten a gig as a sportswriter …