A new approach to error in public surveys

Alex Singleton and I just had a paper accepted into the Annals of the Association of American Geographers.  The paper develops a novel strategy for dealing with the high margins of error in the census tract level estimates from the American Community Survey (ACS).  For example, the ACS tells you really useful things like, “the number of children under 5 in poverty in Census Tract 203 in Autauga County, Alabama is 139 children +/- 178”, implying that the number of poor children in the tract is somewhere between 0 and 317.  This isn’t a unique case: in the 2007-2011 ACS release, 72% of all census tracts in the US have margins of error greater than the estimate of children under 5 in poverty.  In most places, the best you can learn from the ACS is that somewhere between 0 and X kids live in poverty.
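
To make the arithmetic behind that example concrete, here is a minimal Python sketch.  The function names are my own illustrations, not part of any ACS tool, and the sketch assumes the interval is simply the estimate plus or minus the margin of error, with the lower end truncated at zero because a count of children cannot be negative.

```python
def moe_interval(estimate, moe, lower_bound=0):
    """Range implied by `estimate +/- moe`, truncated at lower_bound.

    Hypothetical helper: counts cannot be negative, so the lower end
    of the interval is clipped at zero by default.
    """
    low = max(lower_bound, estimate - moe)
    high = estimate + moe
    return low, high


def moe_exceeds_estimate(estimate, moe):
    """True when the margin of error is larger than the estimate itself."""
    return moe > estimate


# The Autauga County example quoted above: 139 children +/- 178.
low, high = moe_interval(139, 178)
print(low, high)                       # 0 317
print(moe_exceeds_estimate(139, 178))  # True
```

Any tract where the second check comes back true is one where the lower end of the range gets clipped to zero, which is the "somewhere between 0 and X" situation.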

We think that one way to deal with this problem is to ignore the individual data points.  More technically, we argue that:

The value of a large and comprehensive survey like the ACS is that it provides a richly detailed, multivariate, composite picture of small areas.

That is, users of the ACS shouldn’t focus on single variables if they are working at the tract scale; they should examine a multivariate composite picture.  These multivariate composites will be less affected by error because the error in a single ACS estimate, like household income, is a symmetrically distributed random variable.  This means that positive and negative errors are equally likely.  Since the variable-specific estimates are largely independent of each other, when looking at a large collection of variables these random errors tend to average toward zero.  The key take-away is that while single variables can be methodologically problematic at the census tract scale, a large collection of such variables provides utility as a contextual descriptor of the place(s) under investigation.
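
A quick simulation can illustrate the averaging argument.  This is only a sketch under simplifying assumptions that are mine, not the paper's: every variable's error is drawn from the same zero-mean normal distribution (so all variables are on a common, standardized scale), and the errors are fully independent, which is stronger than the "largely independent" situation described above.  The function name and parameters are hypothetical.

```python
import random
import statistics


def mean_error_spread(n_variables, n_trials=2000, sigma=1.0, seed=0):
    """Spread (standard deviation) of the average error across variables.

    Each trial draws one symmetric, zero-mean error per variable and
    averages them; the spread of those averages over many trials shows
    how much error is left in the composite.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_trials):
        errors = [rng.gauss(0.0, sigma) for _ in range(n_variables)]
        means.append(statistics.fmean(errors))
    return statistics.pstdev(means)


single = mean_error_spread(1)       # one variable on its own
composite = mean_error_spread(50)   # average over 50 variables

print(f"spread of error, 1 variable:        {single:.3f}")
print(f"spread of mean error, 50 variables: {composite:.3f}")
```

Under these assumptions the spread shrinks roughly with the square root of the number of variables.  Correlated errors would shrink more slowly, so treat this as the optimistic case.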

We developed a workflow involving cluster analysis that allows one to develop multivariate composites of census tracts.  The data and code are available here (the repo is down because GitHub deleted all of our data!!).
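
For readers who want a feel for what clustering tracts on several variables at once looks like, here is a rough, self-contained Python sketch.  It is not the workflow from the paper: I am using plain k-means with a farthest-first initialization as a stand-in for "cluster analysis", and the six tracts and their three standardized variables are made-up toy values, not ACS data.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=20):
    """Plain k-means; a stand-in technique, not the paper's method."""
    # Farthest-first initialization: start from the first point, then
    # repeatedly add the point farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centers


# Hypothetical tracts: three invented variables, each scaled to 0..1.
tracts = [
    (0.10, 0.20, 0.15),
    (0.15, 0.25, 0.10),
    (0.12, 0.18, 0.20),
    (0.80, 0.75, 0.90),
    (0.85, 0.70, 0.80),
    (0.78, 0.82, 0.88),
]

labels, centers = kmeans(tracts, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each tract's label comes from its whole profile rather than from any single noisy estimate, which is the point of working with composites.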