A bit of data fiddling for your Sunday — November 8, 2015

I saw the headline “The unemployment rate doubled under Bush. It’s fallen by more than one-third under Obama.” when I was reading Vox this morning, and I got ready to bust out the stat that “the labor-force participation rate is still way down”, as indeed it is:

That is, the fraction of Americans working or looking for work hit its peak under Clinton, fell under Bush, really fell when the housing bubble popped, and hasn’t really recovered.

Some of that drop can come from young people deciding to stay in school and get graduate degrees when the economy is doing poorly, or from older people deciding to retire early. So what if you focus on ages 25 to 54, i.e., the “prime-age labor-force participation rate”? The story there is somewhat better:

The rate still took a noticeable hit in 2008, but we’ve regained some ground. Let’s zoom in on the period starting in 2005:

Moving slowly in the right direction. Now, much of the gain since 1948 can be attributed, one assumes, to women entering the workforce. Do the data bear that out? Seemingly yes:

That’s interesting: women’s labor-force participation seems to have flat-lined starting in 1990. Why? And what can be done to get it moving again?

On the flip side, how about the male labor-force participation rate? That’s quite striking:

It has decreased more or less continuously since 1960.

There’s no real moral here. I just find it interesting that, as you dig into the data, there’s something more going on than a story about the 2009 recession. Seems like, recession or not, men are leaving the workforce. And women aren’t entering it fast enough to offset that drop.

P.S.: A friend asks whether labor-force participation is really an end in itself. The short answer is probably “No, though it’s a good proxy for what we actually care about.”

Perhaps, for instance, people choose to stop working because they want to be full-time parents. Let’s call that a “happy” labor-force detachment. On the other hand, perhaps they drop out of the labor force because they know they’ll never get a job. Or maybe (I’ve seen this happen a lot) they’re mothers who want to spend time with their kids, but the only jobs that they could get would hardly cover the cost of child care; they want to work, but for economic reasons they choose not to. Call that a “sad” labor-force detachment: they’d like to work, but can’t.

It’s going to be hard to measure this in full detail, of course, and there are going to be almost as many boundary cases as there are people who aren’t working. But if you want to measure “how is the economy doing?” you have to set your boundaries somewhere. That’s why the Bureau of Labor Statistics has a number of different measures of unemployment:

U-1, persons unemployed 15 weeks or longer, as a percent of the civilian labor force;
U-2, job losers and persons who completed temporary jobs, as a percent of the civilian labor force;
U-3, total unemployed, as a percent of the civilian labor force (this is the definition used for the official unemployment rate);
U-4, total unemployed plus discouraged workers, as a percent of the civilian labor force plus discouraged workers;
U-5, total unemployed, plus discouraged workers, plus all other marginally attached workers, as a percent of the civilian labor force plus all marginally attached workers; and
U-6, total unemployed, plus all marginally attached workers, plus total employed part time for economic reasons, as a percent of the civilian labor force plus all marginally attached workers.
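
To see how the definitions nest, here is a minimal sketch that computes U-3 through U-6 from raw head counts, following the wording of the list above. The class name, field names, and every number in the usage note are invented for illustration; they are not BLS figures. I'm assuming that discouraged workers are a subset of the marginally attached, which is how the U-5 wording reads.

```python
from dataclasses import dataclass


@dataclass
class LaborCounts:
    """Head counts in thousands. Hypothetical container, not a BLS data format."""
    employed: float
    unemployed: float            # jobless, available, and actively looking
    discouraged: float           # assumed to be a subset of the marginally attached
    marginally_attached: float   # includes the discouraged workers
    part_time_economic: float    # employed part time for economic reasons


def alternative_rates(c: LaborCounts) -> dict:
    """Return U-3 through U-6 as percentages, per the definitions listed above."""
    labor_force = c.employed + c.unemployed
    rates = {
        "U-3": c.unemployed / labor_force,
        "U-4": (c.unemployed + c.discouraged) / (labor_force + c.discouraged),
        "U-5": (c.unemployed + c.marginally_attached)
               / (labor_force + c.marginally_attached),
        "U-6": (c.unemployed + c.marginally_attached + c.part_time_economic)
               / (labor_force + c.marginally_attached),
    }
    return {name: round(100 * rate, 2) for name, rate in rates.items()}
```

Calling `alternative_rates(LaborCounts(employed=149_000, unemployed=7_900, discouraged=650, marginally_attached=1_900, part_time_economic=5_800))` with those made-up counts gives rates that rise from U-3 to U-6, since each measure widens the group it counts as underused.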

U-6, for instance, has improved noticeably over the last few years.

There are a lot of terms in here with precise definitions, and the definitions matter, and you need to think carefully about what you’re counting and what you aren’t. For instance, what does “civilian labor force” mean? Who’s in it and who’s not? This isn’t secret or mysterious at all; the BLS explains it in clear language. Here you go:

Civilian noninstitutional population: Persons 16 years of age and older residing in the 50 states and the District of Columbia, who are not inmates of institutions (e.g., penal and mental facilities, homes for the aged), and who are not on active duty in the Armed Forces.

Civilian labor force: All persons in the civilian noninstitutional population classified as either employed or unemployed.

Note well: this means that if you’re in prison, you’re not part of the labor force. This is where unemployment definitions intersect with Becky Pettit’s Invisible Men: Mass Incarceration and the Myth of Black Progress. To put it briefly: if every single black man but one were in prison, and that remaining black man had a job, then by the official statistics the unemployment rate among black males would be zero. Obviously we would consider this situation horrifying. So “a low rate of unemployment” is not necessarily synonymous with “a happy economy”. Maybe we want to add the institutionalized population to the current definition of the labor force. Or maybe not: those in prison surely cannot work and are not looking for work. And if we’re going to add those who can’t work for reasons of imprisonment, why then wouldn’t we add back lots of other people who cannot work and aren’t looking for work because, e.g., they’re permanently disabled? It’s certainly useful to measure all such populations. Different data series have different uses. Probably the best you can say is that different questions require different sorts of data, that no one data series can answer all questions, that you really need to look carefully at multiple sources of data, and that you should carefully look at the assumptions embedded in each.

What if you count the total civilian labor force (which, again, covers only the noninstitutionalized population) and divide it by the overall population? You get this:

Earlier, we were tallying the “labor force participation rate”, which is defined as “The labor force as a percent of the civilian noninstitutional population.” As more people are imprisoned (“institutionalized”) or enter the military (i.e., they’re no longer “civilian”), the denominator goes down, which means the participation rate goes up. Whereas if you divide by the total population, an increasing prison population would cause the participation rate to decrease — arguably closer to what we actually want.
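
A toy calculation makes the denominator effect concrete. This is only a sketch with invented head counts: the function names are mine, and I've left active-duty military out to keep it small. Which way the official rate moves depends on who is imprisoned; in the usage note I assume prisoners are drawn proportionally from inside and outside the labor force.

```python
def participation_rate(labor_force, civilian_noninstitutional_pop):
    """Official-style rate: labor force over the civilian noninstitutional population."""
    return 100 * labor_force / civilian_noninstitutional_pop


def labor_force_share_of_total(labor_force, civilian_noninstitutional_pop,
                               institutionalized):
    """Alternative: labor force over everyone, including the institutionalized."""
    total = civilian_noninstitutional_pop + institutionalized
    return 100 * labor_force / total


# Invented starting point: 100 adults, 60 in the labor force, nobody in prison.
before_official = participation_rate(60, 100)
before_total = labor_force_share_of_total(60, 100, 0)

# Imprison 10 people: 6 from the labor force and 4 from outside it.
after_official = participation_rate(54, 90)
after_total = labor_force_share_of_total(54, 90, 10)
```

Under those assumptions the official rate does not register the change at all (54 of 90 is the same share as 60 of 100), while the total-population version drops. If the ten had all come from outside the labor force, the official rate would actually rise (60 of 90) and the total-population version would stay flat.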

Again, this graph is likely dominated by women’s entry into the workforce. FRED seems to track the right thing here, namely the employment-to-population ratio over time for males. The numerator leaves out men who choose to stay in school longer and men who choose to retire early, so one wants the employment-to-population ratio among prime-age males. The denominator includes the full U.S. population rather than just the labor force, so the ratio will decrease as more black men are imprisoned. FRED has the correct series, seemingly, but it’s via a different (OECD) data source that I’ve not dug into yet. It has the parallel data source for females.

The moral is just that there are many ways to measure unemployment, and which measure you pick will depend on which question you want answered. If you want to measure whether people are opting out of the labor force for happy reasons or sad reasons, the government tracks that. If you hear someone say that government statistics are bunk and that they don’t address Objection x, your first assumption should be that the speaker is wrong.

Reminding myself how beautiful statistics is — October 19, 2014

As I think I’ve mentioned here before, my partner is taking a biostatistics course and thereby reminding me of how much I loved this stuff. And I’m reminded of the Galton quote about the Central Limit Theorem:

> I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the “Law of Frequency of Error.” The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.

It’s not only beautiful, but it’s obviously extremely useful. Yet, given how often I’ve failed to explain how a random sample of a couple thousand people can adequately capture the political views of a nation of 318 million, clearly there’s something mysterious and objectionable about it. For that matter, given how many people took umbrage at Nate Silver’s election forecasts, even though basically all he did was average poll data, it seems like this antipathy to statistics is pretty widespread; statistical laws explain exactly where, and under what conditions, you’d expect individual chaos to yield collective order, yet people really seem to recoil from the thought that their collective actions might be rule-governed.
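
The arithmetic behind the couple-thousand-person poll is short enough to write down. This is a minimal sketch that assumes a simple random sample and the usual normal approximation for a proportion, which real polls only approximate; the function name and defaults are mine.

```python
import math


def margin_of_error(sample_size, population_size=None, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a sampled proportion.

    Assumes simple random sampling. proportion=0.5 is the worst case.
    If population_size is given, the finite-population correction is applied.
    """
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    if population_size is not None:
        standard_error *= math.sqrt(
            (population_size - sample_size) / (population_size - 1)
        )
    return z * standard_error


small_town = margin_of_error(2_000, population_size=50_000)
whole_country = margin_of_error(2_000, population_size=318_000_000)
no_correction = margin_of_error(2_000)
```

With 2,000 respondents the margin comes out a bit over two percentage points, and the size of the population barely enters: the correction factor for 318 million people is indistinguishable from 1. The sample size, not the population size, is what drives the precision.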

It really does often feel like I’m in possession of a kind of occult knowledge that everyone could learn but few choose to. And I’m nowhere near the level of statistical knowledge that I want to attain. Even just the bit of probability and statistics that I know is enough to resolve a lot of mental muddle.

Landed on an old interview with Charles Stein for some reason — April 20, 2014

I remember reading this interview between Morrie DeGroot and Charles Stein back in the day, probably when I was an undergrad at the department DeGroot founded. I was struck in particular by this bit:

> This doesn’t answer the question, “When I say the probability is 1/6 that this die will come up 6 on the next toss, what does that statement mean?” But then in no serious work in any science do we answer the question, “What does this statement mean?” It is an erroneous philosophical point of view that leads to this sort of question.

Reminds me of the bit by Gellner describing reductionism:

> Reductionism, roughly speaking, is the view that everything in this world is really something else, and that the something else is always in the end unedifying. So lucidly formulated, one can see that this is a luminously true and certain idea. The hope that it could ever be denied or refuted is absurd. One day, the Second Law of Thermodynamics may seem obsolete; but reductionism will stand for ever.

Oh, and then there’s this line of Stein’s; think “big data” when you read it:

> There are so many more possibilities for computation, and some of them are clearly useful. People can find things by using somewhat arbitrary computational methods that could not be found by using traditional statistical methods. On the other hand, they can also find things that probably aren’t really there.

The interview was from 1986.

Not to cavil with Krugman, but … — March 7, 2014

Today he says that “private-sector wages…continue to run well below pre-crisis levels”, and uses this graph to support that claim:

Average hourly earnings of all employees

He’s not being quite accurate. As you can see from the y-axis, that’s year-over-year *growth* in hourly wages. Since the curve is everywhere above zero, we conclude that wages have always been growing. They’ve just been growing less than they were before the crisis.
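
The difference between a level and its growth rate is easy to show with a toy series. The wage figures below are invented, and the helper name is mine; the point is only that levels can rise every year while the growth rate falls.

```python
def year_over_year_growth(levels, periods_per_year=1):
    """Percent change of each observation versus the one a year earlier."""
    return [
        100 * (levels[i] / levels[i - periods_per_year] - 1)
        for i in range(periods_per_year, len(levels))
    ]


# Invented annual average hourly wages, in dollars.
hourly_wage = [20.00, 20.70, 21.32, 21.75]
growth = year_over_year_growth(hourly_wage)
```

Here every wage is higher than the one before it, yet each year's growth rate is lower than the previous year's. A graph of `growth` slopes down while a graph of `hourly_wage` slopes up.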

…Which is Krugman’s point, I think. The main argument for increasing interest rates is to keep inflation in check. Inflation might be running amok if labor costs are skyrocketing. Labor costs are not skyrocketing; they’re under control. If interest rates need to rise now because labor costs are out of control, then they needed to rise back in 2007-2009 as well.

My buddy FRED will show you average earnings, as opposed to year-over-year change in earnings.

Honestly, this was probably just a typo on Krugman’s part. In context it’s obvious what he meant. But I would be shocked if the typo didn’t start propagating.

Anyone know how to get current dollars out of FRED? — March 5, 2014

The Disposable Personal Income (DPI) graph counts total DPI across the whole U.S. in nominal dollars. I can transform it to per-capita nominal DPI easily enough. And there’s a CPI graph, so that’s cool. But now I want to combine the two to get DPI in current dollars. I could divide the nominal DPI by (CPI/100). The CPI equals 100 in some base year (1982-84, as it happens), so CPI/100 there would tell us how much dollars have deflated from the base year to now; if CPI = 200 today, dollars are worth half what they were in 1982-84. So then dividing DPI by CPI/100 would give us everything in 1982-84 dollars.

I’d like to get everything in 2014 dollars. I can’t think of any obvious way to get that out of FRED. Am I just not thinking straight? If the CPI index in year x was 150, and it’s 200 now, then I want to multiply disposable income from year x by 200/150 to get everything in 2014 dollars.
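
Outside of FRED, the rebasing itself is one line of arithmetic. This is a sketch with made-up numbers that match the example in the text (an index of 150 in year x, 200 now); the function name and the dictionary layout are my own, not anything FRED provides.

```python
def to_latest_dollars(nominal_by_year, cpi_by_year):
    """Restate nominal amounts in the dollars of the most recent CPI year."""
    latest_year = max(cpi_by_year)
    latest_cpi = cpi_by_year[latest_year]
    return {
        year: amount * latest_cpi / cpi_by_year[year]
        for year, amount in nominal_by_year.items()
    }


# Invented figures: the CPI was 150 in 2000 and is 200 in 2014.
cpi = {2000: 150.0, 2014: 200.0}
nominal_dpi = {2000: 30_000.0, 2014: 45_000.0}
real_dpi = to_latest_dollars(nominal_dpi, cpi)
```

The 2000 figure gets scaled up by 200/150 and the 2014 figure is left alone. Because the function always rebases to the latest year present in the CPI data, rerunning it after a new CPI observation arrives moves the base year forward automatically.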

I’d really like median DPI, but that doesn’t seem to be available. In fact I’d also like post-tax, post-transfer income (which should be basically DPI with Social Security, food stamps, etc. added in), but I don’t think that’s in FRED; I’ve not looked around much, but at least the CBO measures this stuff; I could probably mine their sources for the raw data.

The Census Bureau has real DPI in 2005 dollars, measured from 1980 to 2000, apparently derived from “U.S. Bureau of Economic Analysis, Survey of Current Business, April 2011, earlier reports and unpublished data”. So I’ll look there, too. And if worse comes to worst, I’ll find various people at the Census Bureau and the BEA.

This concludes your daily data-mongering.

__P.S.__: I mean, I could just grab the most recent CPI number from the CPI raw-data series, and divide by that rather than by 100. But I’d like whatever graph I form here to auto-update as new current-day CPI numbers come in.

__P.P.S.__: Ah. Disposable Personal Income: Per capita: Current dollars (A229RC0). That was easy.

Medicare releasing data — January 23, 2014

So this is really interesting: the more data Medicare releases on provider payments, the better. But there are real concerns about patient privacy here. I remember when I was a wee undergraduate at CMU, Professor Fienberg was working on how to release raw data from the Census Bureau without revealing personally identifiable information. You can imagine the problem like this: in towns like the one I grew up in in Vermont, revealing that “the average black person” earns a certain sum of money could well mean that you’ve just revealed John Smith’s income; there just aren’t that many black people in Vermont.

As I understood it at the time — note here that my understanding is many years out of date — the Census Bureau had a couple ways of releasing its multidimensional contingency tables. First, it would only publish data in a given cell if the number of observations in that cell was above some threshold (that is, if the cell didn’t uniquely identify John Smith). I believe they also applied some scaling factor to every cell, deliberately obfuscating it so that any summary statistics from the table would come out right, but raw data were all incorrect.
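
The first of those two techniques, threshold suppression, can be sketched in a few lines. To be clear, this is my illustration of the general idea as I remember it, not the Census Bureau's actual procedure; the threshold of 5, the function name, and the counts are all invented.

```python
def suppress_small_cells(table, threshold=5):
    """Blank out any cell whose count falls below the threshold.

    `table` maps a cell key (here a (place, group) tuple) to a count.
    Suppressed cells come back as None instead of a number.
    """
    return {
        cell: (count if count >= threshold else None)
        for cell, count in table.items()
    }


# Invented counts for a hypothetical small town.
counts = {
    ("Smalltown", "group A"): 4_812,
    ("Smalltown", "group B"): 2,
    ("Smalltown", "group C"): 37,
}
published = suppress_small_cells(counts)
```

The cell with two people in it is withheld, since any statistic published for it would come close to describing specific individuals. The larger cells pass through unchanged.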

These problems get harder if you’re able to combine, say, Census data with data that you get from credit-card companies or data from (as above) hospitals. The more data you can agglomerate, the less anonymous any one source is, no matter how hard you try. I’m sure there are lots of people, all around the country, working very hard to de-anonymize various databases for marketing and law-enforcement purposes.

Point being just that, while releasing raw Medicare data would be terrific (the AMA’s comment in that link that people wouldn’t know what to do with all that raw data, and would take it out of context, is thoroughly disingenuous), there are difficult problems to surmount first. I wish them luck. I should check to see where Professor Fienberg’s work has taken him; the last update I got was more than a decade ago.

I hate to be a stickler about Krugman’s data analysis, but — December 7, 2013

When he starts a column with the phrase “underneath the apparent stability of the Great Moderation lurked a rapid rise in debt that is now being unwound”, and uses this graph as evidence

debt as % of GDP. Shows a rise from before 1990, with a kink upward at around 2000Q4, then a decline when the recession ended.

, then someone who is as much of a fan of FRED as I am is going to want to reproduce Krugman’s data, whence we end up with

Same graph as above, but on a longer time scale. Basically debt as a % of GDP increased continuously from the early 1950s until the recession.

I’ll grant that something especially crazy started happening around the year 2000, but I don’t think you’d really single out that particular era, if this graph were your only bit of evidence, and say, “Ah ha! Debt really became unsustainably large right there!” Debt has been increasing continuously since the 1950s. The point of Krugman’s article lies in other directions (namely, how much of a hit on future GDP we’ll take because the country is now deleveraging), but my question would be: how far do we have to fall? Back to where it was in 1999? Or back to where it was in 1960?

The 1940 census is awesome — December 3, 2013

The Census Bureau, 72 years after the 1940 census, put the raw data from the 1940 census up on the web last year. It is completely fascinating.

It’s also tricky, for me anyway, to find my ancestors’ information. My partner has an easier task for her grandparents: they lived in New York City, and the New York Public Library helpfully posted 1940 phone books expressly to help people navigate the 1940 census (thanks, New York Public Library!). No such luck for Burlington, Vermont. But by asking my parents, I was able to find my dad’s parents, 7 years before my dad was born, when it was just my grandparents and my aunt. Among the interesting tidbits:

* My grandfather was listed as unemployed (and seeking work) at the time of the census; in 1939 he had only been employed 30 weeks. He had been unemployed for the four weeks preceding March 30.
* In 1935 they had lived in Alburg, Vermont on a farm.
* My grandfather’s profession was listed as ‘weaver’ at ‘woolen mill’. I knew him as a watchmaker, though I imagine he was just an all-around handyman.
* As of 1939, his salary was $630. Looking around a bit, I found a Social Security Administration document from 1947, which says that the median family income in 1939 for a family with 3 people, with a male head of household under age 35 (my grandfather was 30) was $1,373. So as of 1940, it looks like my grandparents weren’t doing so well. I’ll be curious how that changes when the 1950 census data become available in 2022.
* Both my grandmother and grandfather had fourth-grade educations.

There are a couple things to note about this. First, my method for chasing down the census records was basically ad hoc; I asked my parents, who asked my aunt, who guessed what their street address had been when she was seven years old and was basically right on the money. Even with that information, the census data aren’t terribly easy to navigate. With luck, you can use an address to get an Enumeration District, which is basically the terrain that a single census-taker covers. But even within an ED, there are a lot of scanned census forms to peruse. This seems like a case that would derive a lot of value from some crowdsourcing: people using the 1940-census site would be able to tag individual records or pages with whatever information they want to contribute: street addresses, names, etc. Over time, it ought to be possible to write SQL queries against raw census data (“SELECT * FROM census_1940 WHERE state = 'Vermont' AND last_name = 'Laniel'”).
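
For what it's worth, here is roughly what that query would look like if the crowdsourced tags ever landed in a database. Everything here is hypothetical: no such table exists, and the table name `census_1940`, its columns, and the sample rows are invented purely so the sketch runs.

```python
import sqlite3

# Hypothetical schema for crowdsourced tags on scanned census pages.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE census_1940 (
           state TEXT,
           enumeration_district TEXT,
           last_name TEXT,
           occupation TEXT
       )"""
)

# Invented rows, only to have something to query.
conn.executemany(
    "INSERT INTO census_1940 VALUES (?, ?, ?, ?)",
    [
        ("Vermont", "ED-0", "Laniel", "weaver"),
        ("Vermont", "ED-0", "Example", "farmer"),
        ("New York", "ED-1", "Laniel", "clerk"),
    ],
)


def find_people(conn, state, last_name):
    """Look up tagged records by state and surname, using bound parameters."""
    return conn.execute(
        "SELECT state, enumeration_district, last_name, occupation "
        "FROM census_1940 WHERE state = ? AND last_name = ?",
        (state, last_name),
    ).fetchall()
```

`find_people(conn, "Vermont", "Laniel")` returns the one matching row. The table name avoids a leading digit because most SQL dialects reject unquoted identifiers that start with one.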

Even in my partner’s case, which is less ad-hoc, not everyone had a phone in 1940. What would we do if we wanted to look up the census information of someone alive in 1940? I’m sure there’s a well-known way to bootstrap oneself to a family tree, but I’m not familiar with it. And I’d vastly prefer a SQL query to a complicated bootstrapping process.

What’s the definition of “disposable personal income”? — November 8, 2010

Matt Yglesias today looked at the change in disposable personal income over the last few years, and I wanted to check that the definition of the term didn’t count “disposable income” as income less, say, mortgage and credit-card payments. If it did, then you’d expect to see disposable income go down as people pay down debt.

Turns out the definition doesn’t deduct debt payments, but it confuses me in other ways. Here it is:

Personal income is the income received by persons from participation in production, from government and business transfer payments, and from government interest. Personal income includes income received by non-profit institutions serving households, by private non-insured welfare funds, and by private trust funds. Income from production is generated both by the labor of individuals and by the capital that they own. Private income not earned in production, such as from capital gains or the sale of assets, is excluded. Personal income is calculated as the sum of wage and salary disbursements, employer contributions for employee pension and insurance funds, proprietors income, property income (personal interest, dividend and rental income), and transfer payments to individuals, less personal contributions for social insurance.

Disposable personal income is personal income less personal tax payments. While personal income does not include capital gains realized through the sale of assets, personal income taxes do include the taxes paid for these capital gains.

(Internal footnote omitted.)
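
Written out as arithmetic, the quoted definition looks like the sketch below. The parameter names paraphrase the definition's own terms, and the dollar amounts in the usage note are invented; this is my reading of the quoted text, not an official formula.

```python
def personal_income(wages_and_salaries,
                    employer_pension_and_insurance,
                    proprietors_income,
                    property_income,
                    transfer_payments,
                    personal_social_insurance_contributions):
    """Sum the components named in the definition, less social-insurance contributions.

    Capital gains have no parameter here because the definition excludes them.
    """
    return (wages_and_salaries
            + employer_pension_and_insurance
            + proprietors_income
            + property_income
            + transfer_payments
            - personal_social_insurance_contributions)


def disposable_personal_income(income, personal_tax_payments):
    """Personal income less personal tax payments."""
    return income - personal_tax_payments


# Invented household: note that the employer's 8,000 counts toward income.
pi = personal_income(50_000, 8_000, 0, 1_000, 2_000, 4_000)
dpi = disposable_personal_income(pi, 9_000)
```

With these made-up numbers the household's wages are 50,000 but its disposable personal income is 48,000, and 8,000 of that is employer contributions that never show up in a paycheck.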

I’m puzzled by a couple aspects of this definition:

  1. “[E]mployer contributions for employee pension and insurance funds” makes it in? So when my employer contributes to my 401(k), that counts as disposable income? Okay, I can half-see that: if need be, I could raid the 401(k). But I’d pay a penalty if I did, so I hope that something less than 100% of my contribution counts toward my disposable income. But then what about the “insurance funds” part? My employer’s contribution to unemployment insurance counts toward my disposable income? The employer contribution to long-term disability? To health insurance? This would seriously inflate this measure of disposable income: as has been well-documented, health-care costs have been rising, so a lot of money that might otherwise have gone toward rising salaries has instead gone into health-insurance payments.

  2. Disposable income doesn’t include capital gains? But why? That’s income I can spend, just as much as is income earned through honest toil. And if they’re not going to include capital-gains income, then why do they deduct capital-gains taxes?

I’ll look around for a more in-depth discussion of this definition. If anyone can clarify, please do.

A quick note on life expectancies — August 24, 2010

The next time you hear someone say that the Social Security retirement age needs to go up because “back when Social Security was started, people weren’t expected to live much past retirement age,” first point out to them that the terminology is confusing: “life expectancy” means “life expectancy at birth.” Life expectancy at birth goes down if you die in the crib. What’s actually important, when setting the retirement age, is your life expectancy at age 65. Since we’ve made big strides on reducing child mortality, life expectancy at birth has gone way up; life expectancy at age 65 has only gone up by a little less than six years across all races and sexes, and has only gone up by a bit less than three years for black men. See the tables (with sources linked) below.

A couple other things to note:

* Suppose we’re in a recession when you’re in your late 60s. You get laid off. How likely do you think it will be that you’ll get re-hired? (Though as a friend mentioned the other night: employers may refuse to hire older folks because they know that their new employees will only be working until they hit 65; an increase in the retirement age might make employers think they can get a few more productive years out of them, thereby making age discrimination less of a problem.)

* There’s a gap in life expectancy by income, which the figures by race and sex don’t necessarily capture. (Though since race and sex affect income, with women and black people being paid less, the life-expectancy numbers based only on race and sex may already capture an income effect. What we want are models that predict life expectancy as a function of race, sex, and income, holding each constant while the others vary.)

I’ve had a book in my queue for a while, namely Working Longer: The Solution to the Retirement Income Challenge, which seems to address these issues. I’ve had a visceral resistance to reading it: if someone suggests I work later in life, I might suggest in response that they perform an anatomical sexual impossibility. But I’ll overcome that resistance and read it for you, out of affection.

Life expectancies, 1939-1941:

| | All | White men | White women | Black men | Black women |
|---|---|---|---|---|---|
| At birth | 63.62 | 62.81 | 67.29 | 52.26 | 55.56 |
| At age 65 | 77.80 | 77.07 | 78.56 | 77.21 | 78.93 |

Life expectancies, 2006:

| | All | White men | White women | Black men | Black women |
|---|---|---|---|---|---|
| At birth | 77.7 | 75.7 | 80.6 | 69.7 | 76.5 |
| At age 65 | 83.5 | 83.1 | 84.8 | 80.1 | 83.6 |
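
The gains quoted in the text can be checked directly against the two tables above. This sketch just copies the table values into a dictionary and subtracts; the variable and function names are mine.

```python
GROUPS = ["All", "White men", "White women", "Black men", "Black women"]

# Values copied from the two tables above.
LIFE_EXPECTANCY = {
    "1939-1941": {
        "At birth": [63.62, 62.81, 67.29, 52.26, 55.56],
        "At age 65": [77.80, 77.07, 78.56, 77.21, 78.93],
    },
    "2006": {
        "At birth": [77.7, 75.7, 80.6, 69.7, 76.5],
        "At age 65": [83.5, 83.1, 84.8, 80.1, 83.6],
    },
}


def gain(measure, group):
    """Years gained between 1939-1941 and 2006 for one measure and group."""
    i = GROUPS.index(group)
    later = LIFE_EXPECTANCY["2006"][measure][i]
    earlier = LIFE_EXPECTANCY["1939-1941"][measure][i]
    return round(later - earlier, 2)


gains_at_65 = {group: gain("At age 65", group) for group in GROUPS}
gains_at_birth = {group: gain("At birth", group) for group in GROUPS}
```

The gain at age 65 is a little under six years for the population as a whole and a little under three years for black men, while the gain at birth for the whole population is about fourteen years. That gap between the two measures is the whole argument.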