Lots and Lots of Census Maps (part 2)

My last post on the Census Maps got as far as running a simple comparison of every map against every other map at LSOA level to obtain a similarity metric. There are 2,558 possible variables that can be mapped, so my dataset contains 2,558 × 2,558 = 6,543,364 lines. I’ve used the graph from the last post to set a cut-off of 20 (in RGB units) to select only the closest matches. As the metric I’m using is distance in RGB space, it’s actually a dissimilarity metric, so keeping values from 0 to 20 gives me roughly the closest 4.5% of matches, or 295,882 lines. Using an additional piece of code I can link this data back to the plain text description of the table and field so I can start to analyse it.
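As a rough sketch of the arithmetic and the cut-off step (not the code actually used for this post), the following assumes each result row is a (distance, code_a, code_b) tuple and treats the cut-off as inclusive. The function names and the second sample row are hypothetical.

```python
# Sketch only: rows are assumed to be (distance, code_a, code_b) tuples.
N_VARIABLES = 2558   # number of mappable variables, from the post
CUTOFF = 20.0        # RGB distance cut-off, from the post


def total_comparisons(n_variables):
    """Every variable compared against every variable, including itself."""
    return n_variables * n_variables


def select_close_matches(rows, cutoff=CUTOFF):
    """Keep rows whose distance is at or below the cut-off.

    Treating the cut-off as inclusive is an assumption.
    """
    return [row for row in rows if row[0] <= cutoff]


if __name__ == "__main__":
    total = total_comparisons(N_VARIABLES)
    print(total)                           # 6543364
    print(round(100 * 295882 / total, 1))  # 4.5
    sample = [
        (0.1336189, "QS211EW0094", "QS211EW0148"),
        (25.0, "HYPOTHETICAL_A", "HYPOTHETICAL_B"),  # made-up row, above the cut-off
    ]
    print(select_close_matches(sample))
```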

The first thing I noticed in the data is that all my rows are in pairs. A matches with B in the same way that B matches with A, and I forgot that the results matrix only needs to be triangular, so I’ve got twice as much data as I actually needed. The second thing I noticed was that most of the data relates to Ethnic Group, Language and Country of Birth or Nationality. The top of my data looks like the following:

0.1336189 QS211EW0094 QS211EW0148 (Ethnic Group (detailed)) Mixed/multiple ethnic group: Israeli AND (Ethnic Group (detailed)) Asian/Asian British: Italian
0.1546012 QS211EW0178 QS211EW0204 (Ethnic Group (detailed)) Black/African/Caribbean/Black British: Black European AND (Ethnic Group (detailed)) Other ethnic group: Australian/New Zealander
0.1546012 QS211EW0204 QS211EW0178 (Ethnic Group (detailed)) Other ethnic group: Australian/New Zealander AND (Ethnic Group (detailed)) Black/African/Caribbean/Black British: Black European
0.1710527 QS211EW0050 QS211EW0030 (Ethnic Group (detailed)) White: Somalilander AND (Ethnic Group (detailed)) White: Kashmiri
0.1883012 QS203EW0073 QS211EW0113 (Country of Birth (detailed)) Antarctica and Oceania: Antarctica AND (Ethnic Group (detailed)) Mixed/multiple ethnic group: Peruvian
0.1883012 QS211EW0113 QS203EW0073 (Ethnic Group (detailed)) Mixed/multiple ethnic group: Peruvian AND (Country of Birth (detailed)) Antarctica and Oceania: Antarctica
0.1889113 QS211EW0170 QS211EW0242 (Ethnic Group (detailed)) Asian/Asian British: Turkish Cypriot AND (Ethnic Group (detailed)) Other ethnic group: Punjabi
0.1925942 QS211EW0133 KS201EW0011 (Ethnic Group (detailed)) Asian/Asian British: Pakistani or British Pakistani AND (Ethnic Group) Asian/Asian British: Pakistani

The data has had the leading diagonal removed, so no dataset is matched against itself. The columns show the match value (0.1336189), the first column code (QS211EW0094), the second column code (QS211EW0148) and finally the plain text description. This takes the form of the Census table in brackets (Ethnic Group (detailed)), the column description (Mixed/multiple ethnic group: Israeli), then “AND” followed by the same format for the second table and field being matched against.
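The row layout described above can be pulled apart, and the mirrored A/B and B/A rows collapsed, with something like the sketch below. It assumes the fields are separated by whitespace, that the two descriptions are joined by a literal " AND ", and that the table name is the first balanced bracket group. The names Match, parse_line and drop_mirrored_pairs are hypothetical.

```python
from collections import namedtuple

Match = namedtuple(
    "Match", "distance code_a code_b table_a field_a table_b field_b"
)


def split_table_and_field(text):
    """Split '(Table (detailed)) Field' into ('Table (detailed)', 'Field').

    The table name is taken to be the first balanced bracket group.
    """
    text = text.strip()
    if not text.startswith("("):
        raise ValueError("description should start with '(': %r" % text)
    depth = 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return text[1:i], text[i + 1:].strip()
    raise ValueError("unbalanced brackets in %r" % text)


def parse_line(line):
    """Parse 'distance code_a code_b (Table) Field AND (Table) Field'."""
    distance, code_a, code_b, description = line.strip().split(None, 3)
    # Assumption: ' AND ' in capitals only appears as the separator.
    left, right = description.split(" AND ", 1)
    table_a, field_a = split_table_and_field(left)
    table_b, field_b = split_table_and_field(right)
    return Match(float(distance), code_a, code_b,
                 table_a, field_a, table_b, field_b)


def drop_mirrored_pairs(matches):
    """Keep the first row seen for each unordered pair of column codes."""
    seen = set()
    kept = []
    for match in matches:
        key = frozenset((match.code_a, match.code_b))
        if key not in seen:
            seen.add(key)
            kept.append(match)
    return kept
```

Computing only the triangle in the first place would need 2,558 × 2,557 / 2 = 3,270,403 comparisons rather than 6,543,364.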

It probably makes sense that the closest matches are ethnicity, country of birth, religion and language, as there is a definite causal relationship between all these things. The data also picks out groupings between pairs of ethnic groups and nationalities who tend to reside in the same areas. Some of these are surprising, so there must be a case for extracting all the nationality links and producing a graph visualisation of the relationships.
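One possible starting point for that graph, sketched here with only the standard library, is an adjacency map keyed by the field description, with the RGB distance as the edge weight. The input format, a list of (distance, label_a, label_b) tuples, and the function name are assumptions, and drawing the graph is left to whichever visualisation tool is preferred.

```python
from collections import defaultdict


def build_link_graph(matches):
    """Build an undirected weighted graph from (distance, label_a, label_b).

    If a pair appears more than once (for example as mirrored rows),
    the smallest distance is kept as the edge weight.
    """
    graph = defaultdict(dict)
    for distance, label_a, label_b in matches:
        if label_a == label_b:
            continue  # ignore self-links
        best = min(distance, graph[label_a].get(label_b, float("inf")))
        graph[label_a][label_b] = best
        graph[label_b][label_a] = best
    return dict(graph)


if __name__ == "__main__":
    sample = [
        (0.1889113, "Asian/Asian British: Turkish Cypriot",
         "Other ethnic group: Punjabi"),
    ]
    for node, neighbours in build_link_graph(sample).items():
        print(node, "->", neighbours)
```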

There are also some obvious problems with the data, which you can see by looking at the last line of the table above: “Pakistani or British Pakistani” from the detailed Ethnic Group table matches with “Pakistani” from the Ethnic Group table. No surprise there, but it does highlight the fact that there are a lot of overlaps between columns in different data tables containing identical, or very similar, data. At the moment I’m not sure how to remove this, but it needs some kind of equivalence lookup. A similar thing happens at least once for every table, as there is always a total count column, which is effectively a map of population:

0.2201077 QS101EW0001 KS202EW0021 (Residence Type) All categories: Residence type AND (National Identity) All categories: National identity British

These two columns are just the total counts for the QS101 and KS202 tables, so they’re both maps of population. Heuristic number one is: remove anything containing “All categories” in both descriptions.
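Heuristic number one is simple enough to sketch directly. The version below assumes each row carries its two plain text descriptions and that the marker text is matched case-sensitively. The function names are hypothetical.

```python
MARKER = "All categories"


def is_total_pair(description_a, description_b, marker=MARKER):
    """True when both descriptions are total-count columns."""
    return marker in description_a and marker in description_b


def remove_total_pairs(rows, marker=MARKER):
    """Drop rows of (distance, description_a, description_b) where both
    descriptions contain the marker text."""
    return [row for row in rows if not is_total_pair(row[1], row[2], marker)]


if __name__ == "__main__":
    rows = [
        (0.2201077,
         "(Residence Type) All categories: Residence type",
         "(National Identity) All categories: National identity British"),
        (0.1925942,
         "(Ethnic Group (detailed)) Asian/Asian British: Pakistani or British Pakistani",
         "(Ethnic Group) Asian/Asian British: Pakistani"),
    ]
    for row in remove_total_pairs(rows):
        print(row)
```

Only the second row survives, because a row is dropped only when the marker appears in both descriptions. A total column matched against an ordinary column is kept.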

On the basis of this, it’s probably worth looking at the mid-range data rather than the closest matches, as this is where it starts to get interesting:

10.82747 KS605EW0020 KS401EW0008 (Industry) A Agriculture, forestry and fishing AND (Dwellings, Household Spaces and Accomodation Type) Whole house or bungalow: Detached
10.8299 QS203EW0078 QS402EW0012 (Country of Birth (detailed)) Other AND (Accomodation Type – Households) Shared dwelling

To sum up, there is a lot more of this data than I was expecting, and my method of matching is rather naive. The next iteration of the data processing is going to have to work a lot harder to remove more of the trivial matches between two sets of data that are the same thing. I also want to see some maps so I can explore the data.