A bit of data fiddling for your Sunday — November 8, 2015

I saw the headline “The unemployment rate doubled under Bush. It’s fallen by more than one-third under Obama” when I was reading Vox this morning, and I got ready to bust out the stat that “the labor-force participation rate is still way down” — as indeed it is:

That is, the fraction of Americans working or looking for work hit its peak under Clinton, fell under Bush, really fell when the housing bubble popped, and hasn’t really recovered.

Some of that drop can come from young people deciding to stay in school and get graduate degrees when the economy is doing poorly, or from older people deciding to retire early. So what if you focus on ages 25 to 54, i.e., the “prime-age labor-force participation rate”? The story there is somewhat better:

The rate still took a noticeable hit in 2008, but we’ve regained some ground. Let’s zoom in on the period starting in 2005:

Moving slowly in the right direction. Now, much of the gain since 1948 can be attributed, one assumes, to women entering the workforce. Do the data bear that out? Seemingly yes:

That’s interesting: women’s labor-force participation seems to have flat-lined starting in 1990. Why? And what can be done to get it moving again?

On the flip side, how about the male labor-force participation rate? That’s quite striking:

It has decreased more or less continuously since 1960.
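
If you want to poke at these graphs yourself, here’s a rough sketch in Python using pandas_datareader. The FRED series IDs are my best guesses at the right ones (CIVPART for the overall rate, LNS11300060 for prime-age, and LNS11300001/LNS11300002 for men and women), so verify them against FRED before trusting anything.

```python
# Sketch: pull and plot the participation-rate series discussed above.
# Series IDs are assumptions; check them against FRED before relying on this.
import matplotlib.pyplot as plt
from pandas_datareader import data as web

series = {
    "CIVPART": "Labor-force participation rate, overall",
    "LNS11300060": "Prime-age (25-54) participation rate",
    "LNS11300002": "Participation rate, women",
    "LNS11300001": "Participation rate, men",
}

for sid, label in series.items():
    df = web.DataReader(sid, "fred", start="1948-01-01")
    df[sid].plot(title=label)
    plt.ylabel("Percent")
    plt.show()
```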

There’s no real moral here. I just find it interesting that, as you dig into the data, there’s something more going on than a story about the 2009 recession. Seems like, recession or not, men are leaving the workforce. And women aren’t entering it fast enough to offset that drop.

P.S.: A friend asks whether labor-force participation is really an end in itself. The short answer is probably “No, though it’s a good proxy for what we actually care about.”

Perhaps, for instance, people choose to stop working because they want to be full-time parents. Let’s call that a “happy” labor-force detachment. On the other hand, perhaps they drop out of the labor force because they know they’ll never get a job. Or maybe (I’ve seen this happen a lot) they’re mothers who want to spend time with their kids, but the only jobs that they could get would hardly cover the cost of child care; they want to work, but for economic reasons they choose not to. Call that a “sad” labor-force detachment: they’d like to work, but can’t.

It’s going to be hard to measure this in full detail, of course, and there are going to be almost as many boundary cases as there are people who aren’t working. But if you want to measure “how is the economy doing?” you have to set your boundaries somewhere. That’s why the Bureau of Labor Statistics has a number of different measures of unemployment:

U-1, persons unemployed 15 weeks or longer, as a percent of the civilian labor force;
U-2, job losers and persons who completed temporary jobs, as a percent of the civilian labor force;
U-3, total unemployed, as a percent of the civilian labor force (this is the definition used for the official unemployment rate);
U-4, total unemployed plus discouraged workers, as a percent of the civilian labor force plus discouraged workers;
U-5, total unemployed, plus discouraged workers, plus all other marginally attached workers, as a percent of the civilian labor force plus all marginally attached workers; and
U-6, total unemployed, plus all marginally attached workers, plus total employed part time for economic reasons, as a percent of the civilian labor force plus all marginally attached workers.

U-6, for instance, has improved noticeably over the last few years.
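
To make the shifting numerators and denominators concrete, here’s a toy calculation. Every count below is invented (in thousands); only the formulas follow the BLS definitions above.

```python
# Invented counts, in thousands; only the formulas matter here.
unemployed     = 9_000    # the U-3 numerator
labor_force    = 156_000  # employed plus unemployed
discouraged    = 700      # gave up looking, so not in the labor force
other_marginal = 1_400    # want work, searched recently, not currently looking
part_time_econ = 7_000    # working part time but want full-time work

marginal = discouraged + other_marginal

u3 = unemployed / labor_force
u4 = (unemployed + discouraged) / (labor_force + discouraged)
u5 = (unemployed + marginal) / (labor_force + marginal)
u6 = (unemployed + marginal + part_time_econ) / (labor_force + marginal)

for name, val in [("U-3", u3), ("U-4", u4), ("U-5", u5), ("U-6", u6)]:
    print(f"{name}: {val:.1%}")
```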

There are a lot of terms in here with precise definitions, and the definitions matter, and you need to think carefully about what you’re counting and aren’t. For instance, what does “civilian labor force” mean? Who’s in it and who’s not? This isn’t secret or mysterious at all; the BLS explains it in clear language. Here you go:

Civilian noninstitutional population: Persons 16 years of age and older residing in the 50 states and the District of Columbia, who are not inmates of institutions (e.g., penal and mental facilities, homes for the aged), and who are not on active duty in the Armed Forces.

Civilian labor force: All persons in the civilian noninstitutional population classified as either employed or unemployed.

Note well: this means that if you’re in prison, you’re not part of the labor force. This is where unemployment definitions intersect with Becky Pettit’s Invisible Men: Mass Incarceration and the Myth of Black Progress. To put it briefly: if every single black man but one were in prison, and that remaining black man had a job, then by the official statistics the unemployment rate among black males would be zero. Obviously we would consider this situation horrifying. So “a low rate of unemployment” is not necessarily synonymous with “a happy economy”.

Maybe we want to add the institutionalized population to the current definition of the labor force. Or maybe not: those in prison surely cannot work and are not looking for work. And if we’re going to add back those who can’t work because they’re imprisoned, why wouldn’t we also add back lots of other people who cannot work and aren’t looking for work because, e.g., they’re permanently disabled? It’s certainly useful to measure all such populations; different data series have different uses. Probably the best you can say is that different questions require different sorts of data, that no one series can answer every question, and that you need to look carefully at multiple sources and at the assumptions embedded in each.

What if you count the total civilian labor force (which, again, draws only from the noninstitutionalized population) and divide it by the overall population? You get this:

Earlier, we were tallying the “labor force participation rate”, which is defined as “the labor force as a percent of the civilian noninstitutional population.” As more people are imprisoned (“institutionalized”) or enter the military (i.e., they’re no longer “civilian”), the denominator goes down while the numerator is untouched, since by definition neither group is in the civilian labor force; the measured participation rate therefore goes up. Whereas if you divide by the total population, an increasing prison population would cause the participation rate to decrease — arguably closer to what we actually want.
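
Here’s a sketch of that comparison. I believe CLF16OV is the civilian labor force, CNP16OV the civilian noninstitutional population, and POPTHM the total U.S. population, all in thousands on FRED, but treat those IDs as assumptions to check.

```python
# Compare the official participation rate (labor force over civilian
# noninstitutional population) with labor force over *everyone*.
# Series IDs are assumptions; verify against FRED.
from pandas_datareader import data as web

lf  = web.DataReader("CLF16OV", "fred", start="1959-01-01")["CLF16OV"]
cnp = web.DataReader("CNP16OV", "fred", start="1959-01-01")["CNP16OV"]
pop = web.DataReader("POPTHM", "fred", start="1959-01-01")["POPTHM"]

official = 100 * lf / cnp  # the usual participation rate
broader  = 100 * lf / pop  # labor force as a share of the whole population

print(official.tail(1))
print(broader.tail(1))  # lower: prisoners and soldiers stay in the denominator
```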

Again, this graph is likely dominated by women’s entry into the workforce. FRED seems to track the right thing here, namely the employment-to-population ratio over time for males. The denominator of that ratio includes men who choose to stay in school longer and men who choose to retire early, so one really wants the employment-to-population ratio among prime-age males. The denominator should also cover the full U.S. population rather than just the civilian noninstitutional population, so that the ratio decreases as more black men are imprisoned. FRED has the correct series, seemingly, but it’s via a different (OECD) data source that I’ve not dug into yet. It has the parallel data source for females.

The moral is just that there are many ways to measure unemployment, and which measure you pick will depend on which question you want answered. If you want to measure whether people are opting out of the labor force for happy reasons or sad reasons, the government tracks that. If you hear someone say that government statistics are bunk and that they don’t address objection x, your first assumption should be that the speaker is wrong.

Reminding myself how beautiful statistics is — October 19, 2014

As I think I’ve mentioned here before, my partner is taking a biostatistics course and thereby reminding me of how much I loved this stuff. And I’m reminded of the Galton quote about the Central Limit Theorem:

> I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the “Law of Frequency of Error.” The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.

It’s not only beautiful; it’s obviously extremely useful. Yet, given how often I’ve failed to explain how a random sample of a couple thousand people can adequately capture the political views of a nation of 318 million, clearly there’s something mysterious and objectionable about it. For that matter, given how many people took umbrage at Nate Silver’s election forecasts, even though basically all he did was average poll data, this antipathy to statistics seems to be pretty widespread. Statistical laws explain exactly where, and under what conditions, you’d expect individual chaos to yield collective order; yet people really seem to recoil from the thought that their collective actions might be rule-governed.
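
The arithmetic here is short enough to show. A poll’s sampling error depends on the sample size, not on the population size; a quick simulation with a made-up true proportion bears that out:

```python
# Why ~2,000 respondents suffice for a nation of 318 million: the standard
# error of a sample proportion is sqrt(p * (1 - p) / n), independent of
# the population size. The true proportion below is made up.
import numpy as np

rng = np.random.default_rng(0)
p_true, n, trials = 0.52, 2_000, 10_000

# Simulate many polls of n respondents each and look at the spread.
estimates = rng.binomial(n, p_true, size=trials) / n

print(f"theoretical SE: {np.sqrt(p_true * (1 - p_true) / n):.4f}")
print(f"simulated SE:   {estimates.std():.4f}")
# Both come out near 0.011: about a point of error either way, whether
# the population is three hundred thousand or three hundred million.
```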

It really does often feel like I’m in possession of a kind of occult knowledge that everyone could learn but few choose to. And I’m nowhere near the level of statistical knowledge that I want to attain. Even just the bit of probability and statistics that I know is enough to resolve a lot of mental muddle.

Landed on an old interview with Charles Stein for some reason — April 20, 2014

I remember reading this interview between Morrie DeGroot and Charles Stein back in the day, probably when I was an undergrad at the department DeGroot founded. I was struck in particular by this bit:

> This doesn’t answer the question, “When I say the probability is 1/6 that this die will come up 6 on the next toss, what does that statement mean?” But then in no serious work in any science do we answer the question, “What does this statement mean?” It is an erroneous philosophical point of view that leads to this sort of question.

Reminds me of the bit by Gellner describing reductionism:

> Reductionism, roughly speaking, is the view that everything in this world is really something else, and that the something else is always in the end unedifying. So lucidly formulated, one can see that this is a luminously true and certain idea. The hope that it could ever be denied or refuted is absurd. One day, the Second Law of Thermodynamics may seem obsolete; but reductionism will stand for ever.

Oh, and then there’s this line of Stein’s; think “big data” when you read it:

> There are so many more possibilities for computation, and some of them are clearly useful. People can find things by using somewhat arbitrary computational methods that could not be found by using traditional statistical methods. On the other hand, they can also find things that probably aren’t really there.

The interview was from 1986.

Not to cavil with Krugman, but … — March 7, 2014

Today he says that “private-sector wages…continue to run well below pre-crisis levels”, and uses this graph to support that claim:

[Graph: average hourly earnings of all employees]

He’s not being quite accurate. As you can see from the y-axis, that’s year-over-year *growth* in hourly wages. Since the y-axis is everywhere above zero, we conclude that wages have always been growing. They’ve just been growing less than they were before the crisis.

…Which is Krugman’s point, I think. The main argument for increasing interest rates is to keep inflation in check. Inflation might be running amok if labor costs are skyrocketing. Labor costs are not skyrocketing; they’re under control. If interest rates need to rise now because labor costs are out of control, then they needed to rise back in 2007-2009 as well.

My buddy FRED will show you average earnings, as opposed to year-over-year change in earnings.
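
If you want the distinction in code, here’s a sketch. I’m assuming the series behind the graph is FRED’s CES0500000003, average hourly earnings of all private employees; treat that ID as an assumption. The level keeps rising even while its year-over-year growth sits below the pre-crisis pace.

```python
# Level versus year-over-year growth of average hourly earnings.
# The series ID is an assumption; verify against FRED.
from pandas_datareader import data as web

wages = web.DataReader("CES0500000003", "fred", start="2006-01-01")["CES0500000003"]
yoy = 100 * wages.pct_change(12)  # percent change versus twelve months earlier

print(wages.tail(3))  # dollars per hour: keeps rising
print(yoy.tail(3))    # growth rate: positive, but below its pre-crisis pace
```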

Honestly, this was probably just a typo on Krugman’s part. In context it’s obvious what he meant. But I would be shocked if the typo didn’t start propagating.

Anyone know how to get current dollars out of FRED? — March 5, 2014

The Disposable Personal Income (DPI) graph counts total DPI across the whole U.S. in nominal dollars. I can transform it to per-capita nominal DPI easily enough. And there’s a CPI graph, so that’s cool. But now I want to combine the two to get DPI in current dollars. I could divide the nominal DPI by (CPI/100). The CPI equals 100 in some base year (1982-84, as it happens), so CPI/100 tells us how much dollars have deflated since the base year; if CPI = 200 today, dollars are worth half what they were in 1982-84. So dividing DPI by CPI/100 would give us everything in 1982-84 dollars.

I’d like to get everything in 2014 dollars. I can’t think of any obvious way to get that out of FRED. Am I just not thinking straight? If the CPI index in year x was 150, and it’s 200 now, then I want to multiply disposable income from year x by 200/150 to get everything in 2014 dollars.
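
Here’s the rebasing I have in mind, sketched against what I believe are the right FRED IDs (DPI for aggregate disposable personal income, CPIAUCSL for the CPI); because it scales by the most recent CPI observation, the result would rebase itself as new numbers come in:

```python
# Rebase nominal DPI to "today's" dollars: divide by CPI to strip inflation,
# then multiply by the latest CPI so the base period is now.
# Series IDs are assumptions; verify against FRED.
from pandas_datareader import data as web

dpi = web.DataReader("DPI", "fred", start="1980-01-01")["DPI"]
cpi = web.DataReader("CPIAUCSL", "fred", start="1980-01-01")["CPIAUCSL"]

cpi = cpi.reindex(dpi.index).ffill()       # align CPI to DPI's dates
dpi_current = dpi * (cpi.iloc[-1] / cpi)   # year x gets scaled by cpi_now / cpi_x

print(dpi_current.tail())  # nominal DPI restated in current dollars
```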

I’d really like median DPI, but that doesn’t seem to be available. In fact I’d also like post-tax, post-transfer income (which should be basically DPI with Social Security, food stamps, etc. added in), but I don’t think that’s in FRED. I’ve not looked around much, but the CBO at least measures this stuff; I could probably mine their sources for the raw data.

The Census Bureau has real DPI in 2005 dollars, measured from 1980 to 2000, apparently derived from “U.S. Bureau of Economic Analysis, Survey of Current Business, April 2011, earlier reports and unpublished data”. So I’ll look there, too. And if worse comes to worst, I’ll find various people at the Census Bureau and the BEA.

This concludes your daily data-mongering.

__P.S.__: I mean, I could just grab the most recent CPI number from the CPI raw-data series, and divide by that rather than by 100. But I’d like whatever graph I form here to auto-update as new current-day CPI numbers come in.

__P.P.S.__: Ah. Disposable Personal Income: Per capita: Current dollars (A229RC0). That was easy.

Medicare releasing data — January 23, 2014

So this is really interesting: the more data Medicare releases on provider payments, the better. But there are real concerns about patient privacy here. I remember when I was a wee undergraduate at CMU, Professor Fienberg was working on how to release raw Census Bureau data without revealing personally identifiable information. You can imagine the problem like this: in a small Vermont town like the one I grew up in, revealing that “the average black person” earns a certain sum of money could well mean that you’ve just revealed John Smith’s income; there just aren’t that many black people in Vermont.

As I understood it at the time — note here that my understanding is many years out of date — the Census Bureau had a couple of ways of protecting its multidimensional contingency tables. First, it would only publish data in a given cell if the number of observations in that cell was above some threshold (that is, if the cell didn’t uniquely identify John Smith). I believe it also applied some scaling factor to every cell, deliberately perturbing the raw values so that summary statistics computed from the table would come out right even though no individual cell was exactly correct.
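
A toy version of the suppression rule, just to make it concrete. The numbers are invented, and the Census Bureau’s actual disclosure-avoidance machinery is far more elaborate than this.

```python
# Suppress any cell backed by fewer than k observations, so that no
# published statistic pins down a single identifiable person.
import numpy as np
import pandas as pd

k = 5
table = pd.DataFrame(
    {"count": [812.0, 3.0, 47.0, 1.0],
     "mean_income": [41_200.0, 95_000.0, 38_700.0, 12_000.0]},
    index=["white", "black", "hispanic", "other"],
)

published = table.copy()
published.loc[published["count"] < k, :] = np.nan  # blank out small cells
print(published)  # the 3-person and 1-person cells come out empty
```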

These problems get harder if you’re able to combine, say, Census Data with data that you get from credit-card companies or data from (as above) hospitals. The more data you can agglomerate, the less anonymous any one source is, no matter how hard you try. I’m sure there are lots of people, all around the country, working very hard to de-anonymize various databases for marketing and law-enforcement purposes.

Point being just that, while releasing raw Medicare data would be terrific (the AMA’s comment in that link that people wouldn’t know what to do with all that raw data, and would take it out of context, is thoroughly disingenuous), there are difficult problems to surmount first. I wish them luck. I should check to see where Professor Fienberg’s work has taken him; the last update I got was more than a decade ago.

I hate to be a stickler about Krugman’s data analysis, but — December 7, 2013

When he starts a column with the phrase “underneath the apparent stability of the Great Moderation lurked a rapid rise in debt that is now being unwound”, and uses this graph as evidence

[Graph: debt as % of GDP, showing a rise from before 1990, a kink upward at around 2000Q4, then a decline when the recession ended.]

, then someone who is as much of a fan of FRED as I am is going to want to reproduce Krugman’s data, whence we end up with

[Graph: same as above, but on a longer time scale; debt as a % of GDP increased continuously from the early 1950s until the recession.]

I’ll grant that something especially crazy started happening around the year 2000, but I don’t think you’d really single out that particular era, if this graph were your only bit of evidence, and say, “Ah ha! Debt really became unsustainably large right there!” Debt has been increasing continuously since the 1950s. The point of Krugman’s article lies in other directions (namely, how much of a hit on future GDP we’ll take because the country is now deleveraging), but my question would be: how far do we have to fall? Back to where it was in 1999? Or back to where it was in 1960?
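
If you want to reproduce the longer graph, here’s roughly the computation. I believe the numerator FRED offered for this was TCMDO, total credit market debt outstanding, but take both series IDs as assumptions to verify.

```python
# Debt as a percent of GDP, quarterly. Series IDs are assumptions.
from pandas_datareader import data as web

debt = web.DataReader("TCMDO", "fred", start="1952-01-01")["TCMDO"]  # billions
gdp  = web.DataReader("GDP", "fred", start="1952-01-01")["GDP"]      # billions

ratio = 100 * debt / gdp
print(ratio.dropna().tail())  # debt as a percent of GDP, back to the 1950s
```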