(This is the sort of post that makes me wish Blosxom allowed multiple categories per story: this one could just as easily go under sports or books. For no good reason I’ve assigned it to books.)
I’m in the middle of reading a couple of books on baseball, including the 1983 edition of The Hidden Game Of Baseball. The subject of that book is the statistics that baseball fans use to quantify players’ skill, including RBI (runs batted in) and ERA. The traditional statistics suffer from not isolating a player enough: a pitcher may have a high ERA (Earned Run Average: how many earned runs he allows per nine innings) if his defense is very bad; contrariwise, a good defense may mask his poor performance. A batter may have a low RBI total because he comes early in the batting order and thereby doesn’t get a chance to bat with many people on base.
The book does a decent job presenting the failings in these statistics, and does a subpar job explaining how one should interpret the new statistics that the book’s authors, and others like Bill James, have proposed. One such statistic is Linear Weights, at which professional statisticians will utter a loud “Duh!” upon discovering it: Linear Weights models the total number of runs that a batter is responsible for as a linear combination of his stolen bases, the times he was caught stealing, home runs, singles, doubles, and triples. The question is: on average, how many runs does a home run cause? How many runs does a single cause? How reliable a base-stealer do you have to be before it becomes worth it to steal bases?
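Written out, the linear combination described above looks something like the following. This is only my sketch of the form of the model, using just the events listed in the previous paragraph; the symbols (w for each weight, and the event counts 1B, 2B, and so on) are my own notation, not necessarily the book’s.

```latex
% R: estimated runs a batter is responsible for
% w_x: average run value of one event of type x (the coefficients to be estimated)
R \approx w_{1B}\,\mathrm{1B} + w_{2B}\,\mathrm{2B} + w_{3B}\,\mathrm{3B}
        + w_{HR}\,\mathrm{HR} + w_{SB}\,\mathrm{SB} + w_{CS}\,\mathrm{CS},
\qquad w_{CS} < 0
```

Each weight answers one of the questions above: w_HR is the average number of runs a home run causes, w_1B the same for a single, and so on.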
That last question is particularly intriguing, and Linear Weights seems to be a decent way to address it. If the leadoff runner steals a base successfully, then he’s just made it a little bit easier for the next batter to get another run; if the leadoff runner gets caught stealing, then his team gets an out and his team has lost an opportunity to score a run. The question is: when does the expected benefit of the stolen base exceed the expected loss when the runner is caught stealing? The book doesn’t go into detail about the derivation, but it seems to be using computer simulations to estimate the coefficients on the summed variables; by this estimate, a stolen base contributes 3/10 of a run on average, and getting caught stealing loses 6/10 of a run. So unless you’re a very consistent base-stealer, it’s not worth your while to run. Presumably baseball players know this already, but maybe that’s an unreasonable presumption.
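To make “very consistent” concrete, here is a back-of-the-envelope sketch in Python. It assumes every steal attempt is worth the average values quoted above (+0.3 runs for a success, -0.6 runs for a failure) regardless of the game situation; the function names are mine, not the book’s.

```python
def steal_attempt_value(success_rate, gain=0.3, loss=0.6):
    """Expected runs added by one steal attempt.

    gain and loss are the average run values quoted above:
    +0.3 runs for a stolen base, -0.6 runs for being caught.
    """
    return success_rate * gain - (1.0 - success_rate) * loss


def break_even_rate(gain=0.3, loss=0.6):
    """Success rate at which the expected value of an attempt is zero.

    Solves p * gain - (1 - p) * loss = 0 for p.
    """
    return loss / (gain + loss)


if __name__ == "__main__":
    print(f"break-even success rate: {break_even_rate():.3f}")
    for rate in (0.5, 0.7, 0.8):
        value = steal_attempt_value(rate)
        print(f"success rate {rate:.0%}: {value:+.2f} runs per attempt")
```

Under those assumptions a runner has to succeed two times out of three just to break even, which is presumably why the numbers argue against running unless you’re very good at it.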
The book’s main point, which it makes well, is that the statistics should estimate how many runs you produce for your team, or, if you’re a pitcher, how many runs you prevent. Everything else is academic, since the whole point of baseball is that the team with the most runs wins.
The trouble is that the statistician who coauthored the book with a sportswriter apparently didn’t chime in as much as he should have; the parts where the book talks about rigorous statistics are absolutely execrable. Readers who know some statistics will, I hope, be as aghast as I was when they see the book’s treatment of the data. Yes, most people will only be interested in the results (.3 runs for a steal, -.6 runs when caught stealing, etc.), but the derivation and the justification are what’s important. I’d like to see some graphs comparing the model to the data, for one thing. Maybe some r²’s here and there. The book tries to provide rigorous statistics in the footnotes at the end of each section, but those footnotes are often worse than having no explanation at all: without explaining why, the book says that two statistics are intertwined and computed iteratively, and gives the iterative formulas. It’s not clear why they ought to be iterated; part of my brain tells me that these statistics are using Iteratively Reweighted Least Squares, but the book never bothers to tell us that. This is a pretty common failing, I’ve found: pretend mathematical rigor that ends up saying more about the author’s failings than about the data.
Perhaps this is all beside the point, given that I think the book’s intended audience does not include the statistically minded. Perhaps I should dig into the technical papers (some of them from operations-research journals) that the book refers to. Other than its failings of rigor and technical exposition, the book is actually quite good; it has intrigued me enough that I want to play with the data myself.
I’m not quite done with the book, but my other complaint about it is that it focuses too much on the past: using the data to figure out who the best hitter of all time was, or whatnot. That’s not especially important to me; my interest is in finding the most undervalued players now, and assembling them on a team that will make the best use of them. Apparently this is the point of Michael Lewis’s book Moneyball: how the general manager of the Oakland A’s assembled a great team on the cheap by looking more closely at the numbers. I look forward to reading Lewis’s book.
All of these books, by the way, come to me from my local library, which I’ve switched to (from buying every book) out of sheer financial desperation. It’s a good habit to get into, though, so maybe eventually I’ll be like my friend Seth in yet another way: not buying another book for years, yet still reading more than anyone around me.