Indicators of Deprivation Part 1

Those of you who read my previous post, Deprivation In England, will notice that this one is in a slightly different format. In this analysis I will mainly be explaining the methods I used to load the data and turn it into a usable format for a multiple regression. I couldn't find any Python libraries that help with the selection of regression variables for a multiple regression, so I wrote a few functions to achieve this. If you notice any ways I could improve my methods, please do say in the comments.

This analysis mainly focuses on the use of Python as a data analysis tool, with heavy use of the pandas and NumPy libraries to wrangle the data and the scikit-learn, statsmodels and SciPy libraries to actually run the regression.

Because this analysis is looooong and WordPress won't let me upload it in one go, I have split it into two parts. The first part deals with how I loaded and cleaned the data, whilst the second part details how I actually did the analysis.

This analysis is done in an IPython Notebook, which comes packaged with the fantastic Anaconda scientific computing distribution. I wasn't paid to say that.

The cell below contains the import statements for the libraries I'll be using. matplotlib.pyplot and seaborn are visualisation libraries (which will see a little use); numpy and pandas allow me to modify, control and present the raw data; sklearn, scipy and statsmodels are all statistical software libraries which will do a lot of the heavy lifting for me.

In [6]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
from sklearn import linear_model, cross_validation  #NOTE: cross_validation WAS LATER RENAMED model_selection IN NEWER SKLEARN RELEASES
from scipy import stats
import statsmodels.api as sm
import itertools
import copy
%matplotlib inline

Importing and cleaning the data

The data I use in this analysis is in two formats, .csv and .xls. The pandas library allows me to read the data in and convert it to a 'DataFrame'. A DataFrame displays the data in a tabular way, but does not function like a spreadsheet; the cells cannot be edited directly.

The cell below loads a .csv file, converts it into a DataFrame (geo_df), drops the column 'Unnamed: 0', and finally takes a copy of the data (base) using .loc[:, :], which is then previewed by calling base.head().

This DataFrame contains the Office for National Statistics geography codes which are used to delineate geographical areas of varying sizes. Most of the codes begin with 'E' followed by two digits; the smaller this number, the larger the geographical area it represents. The code and data I used to create this table can be found at my GitHub account.

In [9]:
#LOADS AN ONS GEOGRAPHY CODES LOOKUP AND TAKES A COPY OF IT
geo = pd.read_csv("C:\\Users\\Rytch\\Google Drive\\Data Analysis\\Data\\ONS Geo Codes\\ONS Geo Lookup.csv", encoding = "ISO-8859-1", low_memory=False)
geo_df = DataFrame(geo)
geo_df.drop(['Unnamed: 0'], axis=1, inplace=True)
base = geo_df.loc[:,:]
base.head()
Out[9]:
OA LSOA MSOA LAD WARD PARISH GOR GOR NAME COUNTY COUNTY CODE COUNTY NAME LAD CODE AUTH/DIST/BORO NAME WARD CODE
0 E00000001 E01000001 E02000001 E09000001 E05000001 NaN H London NaN 0 Greater London 00AA City of London 00AAFA
1 E00000003 E01000001 E02000001 E09000001 E05000001 NaN H London NaN 0 Greater London 00AA City of London 00AAFA
2 E00000005 E01000001 E02000001 E09000001 E05000001 NaN H London NaN 0 Greater London 00AA City of London 00AAFA
3 E00166756 E01000001 E02000001 E09000001 E05000001 NaN H London NaN 0 Greater London 00AA City of London 00AAFA
4 E00166758 E01000001 E02000001 E09000001 E05000001 NaN H London NaN 0 Greater London 00AA City of London 00AAFA

After investigating the other datasets, I found that the only Area Code they all had in common was 'LAD' (Local Authority District). For clarity, I dropped all of the other, superfluous columns from this DataFrame.

In [10]:
#DROPS THE COLUMNS WHICH I CAN'T MERGE ON IN THE OTHER DATASETS
base_dep = base.drop(['OA','LSOA','MSOA','WARD','PARISH','GOR','GOR NAME','COUNTY','COUNTY CODE','COUNTY NAME','AUTH/DIST/BORO NAME'], axis=1)
base_dep.drop_duplicates(subset=['LAD'], inplace=True)
base_dep.head()
Out[10]:
LAD LAD CODE WARD CODE
0 E09000001 00AA 00AAFA
36 E09000002 00AB 00ABFX
589 E09000003 00AC 00ACFX
1625 E09000004 00AD 00ADGA
2353 E09000005 00AE 00AEGJ

In a similar manner to the previous file, pandas loads the Deprivation.xls file and converts it into a DataFrame. When loading a .xls file you must parse an individual Excel worksheet by name (here 'IMD 2010'). I also rename a column for clarity and drop the Deprivation Rank column as it serves no purpose in this analysis.

After previewing the table, I noticed that although the LAD Code is present (recorded as LA CODE in the data), there are several observations for each unique LA Code. The observations correspond to each unique LSOA Code, a smaller geographical unit. From the preview it's clear that the Deprivation Score varies a great deal within a single LA Code; for example, the LA Code '00AA' has a minimum Deprivation Score of 5.6 and a maximum of 21.4.

I decided to take the mean of the Deprivation Scores for each LA Code, supplementing this information with the standard deviation of the values should I need to take into account the variation present within the LA Codes (this turned out not to be necessary).

In [11]:
#LOADS A DEPRIVATION SCORE XLS FILE WHICH IS AVAILABLE ON THE ONS WEBSITE, RENAMES/DROPS A COUPLE OF COLUMNS
deprivation_xl = pd.ExcelFile('Deprivation.xls')
dep_df = deprivation_xl.parse('IMD 2010')
dep_df.rename(columns={'IMD SCORE':'DEPRIVATION SCORE'},inplace=True)
dep_df=DataFrame(dep_df)
dep_df.drop('RANK OF IMD SCORE (where 1 is most deprived)', 
            axis=1, inplace=True)
dep_df.head(10)
Out[11]:
LSOA CODE LA CODE LA NAME GOR CODE GOR NAME DEPRIVATION SCORE
0 E01000001 00AA City of London H London 6.161637
1 E01000002 00AA City of London H London 5.585054
2 E01000003 00AA City of London H London 13.292920
3 E01000004 00AA City of London H London 11.171350
4 E01000005 00AA City of London H London 21.364660
5 E01000006 00AB Barking and Dagenham H London 17.077650
6 E01000007 00AB Barking and Dagenham H London 37.235280
7 E01000008 00AB Barking and Dagenham H London 45.216770
8 E01000009 00AB Barking and Dagenham H London 29.413560
9 E01000010 00AB Barking and Dagenham H London 38.447980

Using various functions from the pandas library, I created a DataFrame with only the unique values of LA Code (done below with DataFrame.groupby, but also possible with pd.unique). I then created a DataFrame (mean_dep) with the mean Deprivation Scores for each LA Code and another with the standard deviation, before merging the two tables together using 'LA Code' as the key.

The DataFrame deprivation_df contains the variable 'MEAN DEP SCORE', which I will use as the response variable in my analysis.

In [12]:
#CREATES A DATAFRAME CONTAINING ONLY UNIQUE VALUES OF 'LA CODE'
dep = dep_df.groupby('LA CODE')
#CALCULATES THE MEAN DEPRIVATION SCORE FOR EACH LA CODE BY TAKING THE MEAN OF THE DEPRIVATION SCORES OF THE LSOA CODES 
#WITHIN THAT LA
mean_dep = DataFrame(dep['DEPRIVATION SCORE'].mean())
#THE LA CODE WAS SET AS THE INDEX, THIS CHANGES IT TO BE A COLUMN
mean_dep['LA CODE'] = mean_dep.index
#RENAMES THE DEPRIVATION SCORE APPROPRIATELY, RESETS THE INDEX
mean_dep.rename(columns={'DEPRIVATION SCORE':'MEAN DEP SCORE'}, inplace=True)
mean_dep.reset_index(drop=True,inplace=True)

#REPEATS THE ABOVE PROCESS, BUT FOR THE STANDARD DEVIATION OF THE DEPRIVATION SCORE
std_dep = DataFrame(dep['DEPRIVATION SCORE'].std())
std_dep['LA CODE'] = std_dep.index
std_dep.rename(columns={'DEPRIVATION SCORE':'STD DEP SCORE'}, inplace=True)
std_dep.reset_index(drop=True, inplace=True)

#MERGES THE MEAN AND STD DEPRIVATION SCORES TOGETHER
deprivation_df = mean_dep.merge(std_dep, on='LA CODE')
deprivation_df.head()
Out[12]:
MEAN DEP SCORE LA CODE STD DEP SCORE
0 11.515124 00AA 6.405421
1 34.094765 00AB 8.255361
2 16.488612 00AC 8.427992
3 16.520170 00AD 10.268777
4 30.166109 00AE 11.454649
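As an aside, the mean and standard deviation can also be computed in a single pass with groupby's .agg method. This is a sketch rather than the code I ran, assuming the same column names as above:

#A SKETCH OF A MORE COMPACT ALTERNATIVE (NOT THE CODE I RAN): .agg COMPUTES
#BOTH STATISTICS IN ONE PASS, AND reset_index RESTORES 'LA CODE' AS A COLUMN
deprivation_df = (dep_df.groupby('LA CODE')['DEPRIVATION SCORE']
                        .agg(['mean', 'std'])
                        .rename(columns={'mean': 'MEAN DEP SCORE',
                                         'std': 'STD DEP SCORE'})
                        .reset_index())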

I then merged deprivation_df with the DataFrame containing the geography codes; the keys are named differently in each DataFrame and I accounted for this using 'left_on=' and 'right_on=' in the merge argument list. If you're using an IPython Notebook, after opening the brackets of a function call, press Shift+Tab to see the arguments and keyword arguments for the function.

In [13]:
#MERGES THE ONS BASE GEOGRAPHY DATAFRAME WITH THE DEPRIVATION; THIS FORMS THE INTIAL DATASET ONTO WHICH ALL THE 
#PREDICTORS OF DEPRIVATION WILL BE MERGED, EACH ON LA CODE, LAD OR WARD CODE
base_dep_df = deprivation_df.merge(base_dep, how='left', left_on='LA CODE', right_on='LAD CODE')
base_dep_df.head()
Out[13]:
MEAN DEP SCORE LA CODE STD DEP SCORE LAD LAD CODE WARD CODE
0 11.515124 00AA 6.405421 E09000001 00AA 00AAFA
1 34.094765 00AB 8.255361 E09000002 00AB 00ABFX
2 16.488612 00AC 8.427992 E09000003 00AC 00ACFX
3 16.520170 00AD 10.268777 E09000004 00AD 00ADGA
4 30.166109 00AE 11.454649 E09000005 00AE 00AEGJ

The data loaded in this cell contains information about the number of people of each age in each geographical area. This data was loaded in the same way as the data for deprivation_df, however I also had to make an additional .dropna() call to remove rows populated entirely with missing values, as they made it look messy!

The data in this DataFrame are not in a nice format. Firstly, there are over 100 columns of data, one for the number of people of each age in each area. This is far too specific for me to draw conclusions from, and I will have to aggregate the data somehow. Furthermore, as you can see, the Area Codes all represent geographical areas of different sizes; rows cannot be easily compared with those above and below. For example, row 3 (Area code K04000001) corresponds to the whole of England and Wales, whilst row 7 (E12000001) references the North East. This was a pain to fix, as you'll see below . . .

In [14]:
#LOADS AN XLS FILE WHICH CONTAINS INFORMATION ABOUT THE AGE OF PEOPLE IN DIFFERENT GEOGRAPHICAL AREAS. AVAILABLE FROM ONS WEBSITE
#RENAMES AND DROPS COLUMNS, DROPS ALL NA TO TIDY UP
age_xl = pd.ExcelFile('Age.xls')
age_df = age_xl.parse('QS103EW_Numbers',header=10)
age_df = DataFrame(age_df)
age_df.drop(['Unnamed: 2','Unnamed: 3', 'Area name'],axis=1, inplace=True)
age_df.dropna(axis=0, inplace=True)
age_df.head(10)
Out[14]:
Area code All categories: Age Age under 1 Age 1 Age 2 Age 3 Age 4 Age 5 Age 6 Age 7 ... Age 91 Age 92 Age 93 Age 94 Age 95 Age 96 Age 97 Age 98 Age 99 Age 100 and over
3 K04000001 56075912 711529 704155 698777 699399 682890 666320 633615 627104 ... 87515 52939 40013 36269 27972 22324 15959 10976 7145 11186
5 E92000001 53012456 675065 668662 663119 663574 648029 632090 600847 594673 ... 82336 49584 37630 34145 26370 21040 15044 10327 6738 10576
7 E12000001 2596886 30606 30202 29686 29934 29415 29041 27545 27103 ... 3717 2373 1732 1410 1120 841 601 410 323 456
9 E06000047 513242 5884 5645 5602 5826 5617 5366 5325 5127 ... 709 461 331 262 215 187 144 87 76 80
10 E06000005 105564 1344 1371 1302 1297 1265 1289 1271 1230 ... 148 104 91 69 50 34 26 32 22 28
11 E06000001 92028 1097 1176 1155 1159 1111 1150 1051 1001 ... 127 63 51 34 29 15 16 15 6 14
12 E06000002 138412 1979 1965 1797 1863 1827 1835 1703 1640 ... 183 103 90 56 40 46 30 19 12 15
13 E06000048 316028 3148 3150 3205 3325 3175 3336 3207 3320 ... 517 377 231 242 183 140 72 62 53 81
14 E06000003 135177 1575 1478 1483 1481 1536 1560 1462 1340 ... 193 140 97 73 68 54 30 22 17 20
15 E06000004 191610 2477 2485 2482 2435 2443 2388 2204 2232 ... 211 136 125 73 62 48 30 27 19 14

10 rows × 103 columns

This cell creates a list of the different columns in the DataFrame (ages). Using this list as a master list, the columns can be split into children (Age Under 1 - Age 17), adults (Age 18 - Age 65) and pensioners (Age 66 - Age 100 and over).

In [15]:
#CREATES A LIST OF COLUMN NAMES WHICH CORRESPOND TO THE AGES OF CHILDREN (UNDER 18), ADULTS (18-65) AND PENSIONERS (66+)
ages = list(age_df)
adult = ages[20:68]
child = ages[2:20]
pensioner = ages[68:103]

I then sum the columns present in each individual list to create new columns with the aggregated data, before dropping the original columns. As a quick check, I summed the numbers for Children, Adults and Pensioners to see if the total matched the total for All categories: Age. It did (a sketch of this check follows the output below).

In [16]:
#SUMS THE NUMBER OF PEOPLE IN EACH AGE BRACKET FOR CHILDREN, ADULTS AND PENSIONERS
age_df['Children']=age_df[child].sum(axis=1)
age_df['Adults']=age_df[adult].sum(axis=1)
age_df['Pensioners']=age_df[pensioner].sum(axis=1)

#DROPS EACH INDIVIDUAL AGE COLUMN, LEAVING ONLY THE AGGREGATED COLUMNS
for i in range(len(child)):
    age_df.drop([child[i]], axis=1, inplace=True)
for j in range(len(adult)):
    age_df.drop([adult[j]], inplace=True, axis=1)
for k in range(len(pensioner)):
    age_df.drop([pensioner[k]], inplace=True, axis=1)
    
age_df.head()
Out[16]:
Area code All categories: Age Children Adults Pensioners
3 K04000001 56075912 11970367 35440176 8665369
5 E92000001 53012456 11336960 33538775 8136721
7 E12000001 2596886 526042 1648921 421923
9 E06000047 513242 100091 326759 86392
10 E06000005 105564 22928 65370 17266
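Two quick asides, both sketches rather than the code I actually ran: the three drop loops above can be collapsed into a single call (since .drop accepts a list of labels), and the consistency check I mentioned can be written as a one-line assertion.

#SKETCHES, NOT THE ORIGINAL CODE: DROP ALL THE INDIVIDUAL AGE COLUMNS IN ONE CALL
age_df.drop(child + adult + pensioner, axis=1, inplace=True)
#CHECK THAT THE THREE AGGREGATED COLUMNS SUM TO THE PUBLISHED TOTAL
assert (age_df[['Children', 'Adults', 'Pensioners']].sum(axis=1)
        == age_df['All categories: Age']).all()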

Aggregating the ages was the easy bit. The next stage took quite a while and I may have given up a few times . . .

First of all I created a 'Short Code' column using a Python lambda (anonymous functions are hella cool), took the unique values from this column and used a list comprehension to create a list of the codes which begin with 'E', calling this unique_E.

I then used a dictionary comprehension to take the observations from age_df which matched each individual short code in unique_E. The dictionary therefore contained several smaller DataFrames, each corresponding to a different Short Code.

Next, I found out which of these short codes were present in the base DataFrame (remember that?) and saved them to another list (unique_geo_short2). I will use this list to determine which DataFrames I'll take from the dictionary of DataFrames corresponding to the short codes . . .

In [17]:
#TAKES THE FIRST THREE LETTERS OF THE AREA CODE AS THE 'SHORT CODE', SAVES EACH UNIQUE VALUE TO A LIST
age_df['Short Code'] = age_df['Area code'].apply(lambda x: x[:3])
unique = list(pd.unique(age_df['Short Code'].ravel()))

#LIST OF SHORT CODES WHERE THE FIRST LETTER IS 'E'
unique_E = [x for x in unique if x[0] == 'E']

#TAKES THE OBSERVATIONS FROM AGE_DF WHERE THE SHORT CODE IS EQUAL TO EACH UNIQUE SHORT CODE, SAVING THEM TO SEPARATE DICTS
# THIS SPLITS THE DATAFRAME INTO SEVERAL SMALLER ONES
unique_dict = {x: age_df.loc[age_df['Short Code']==x] for x in unique_E}
#CREATES A LIST OF THE UNIQUE VALUES OF 'LAD' FROM THE BASE DATAFRAME
unique_geo = list(pd.unique(base['LAD'].ravel()))
#CREATES SHORT CODES (FIRST 3 CHARS) OF EACH WHERE THEY ARE PRESENT IN THE AGE_DF (PRESENT IN UNIQUE_DICT)
unique_geo_short = [x[:3] for x in unique_geo if x[:3] in unique_dict]
#TAKES UNIQUE VALUES OF SHORT CODES
unique_geo_short2 = list(pd.unique(unique_geo_short))
unique_geo_short2
Out[17]:
['E09', 'E08', 'E06', 'E07']

. . . next I created a blank DataFrame (age_df_refined) and concatenated onto it the DataFrames from the dictionary whose Short Code key was present in unique_geo_short2, before merging the result onto the main DataFrame in the following cell.

This was a pretty roundabout way of getting the result I needed but I couldn't find another, more sensible way. I'm sure there's a shorter, more pythonic way of doing it; if you know what that may be, please post in the comments!
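For comparison, here is one shorter candidate (a sketch, untested against this exact data) which uses Series.isin to do the filtering in a single step:

#A SKETCH OF A SHORTER ROUTE: Series.isin KEEPS ONLY THE ROWS WHOSE SHORT
#CODE APPEARS IN THE LIST, WITH NO NEED FOR A DICT OF DATAFRAMES
age_df_refined = age_df[age_df['Short Code'].isin(unique_geo_short2)]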

In [18]:
#CREATES BLANK DATAFRAME TO WHICH THE DATAFRAMES FROM UNIQUE_DICT WILL BE APPENDED IF THE SHORT CODES MATCH
age_df_refined = DataFrame()
for key in unique_dict.keys():
    if key in unique_geo_short2:
        age_df_refined = pd.concat([age_df_refined, unique_dict[key]])
age_df_refined.head()
Out[18]:
Area code All categories: Age Children Adults Pensioners Short Code
18 E08000020 200214 40276 126742 33196 E08
19 E08000021 280177 54130 189534 36513 E08
20 E08000022 200801 40344 127323 33134 E08
21 E08000023 148127 29662 93148 25317 E08
22 E08000024 275506 54945 176573 43988 E08
In [19]:
#MERGES WITH BASE DATAFRAME
age_dep_df = base_dep_df.merge(age_df_refined, how='left',
                              left_on='LAD',right_on='Area code')
age_dep_df.head()
Out[19]:
MEAN DEP SCORE LA CODE STD DEP SCORE LAD LAD CODE WARD CODE Area code All categories: Age Children Adults Pensioners Short Code
0 11.515124 00AA 6.405421 E09000001 00AA 00AAFA E09000001 7375 692 5721 962 E09
1 34.094765 00AB 8.255361 E09000002 00AB 00ABFX E09000002 185911 53544 114160 18207 E09
2 16.488612 00AC 8.427992 E09000003 00AC 00ACFX E09000003 356386 83073 228823 44490 E09
3 16.520170 00AD 10.268777 E09000004 00AD 00ADGA E09000004 231997 54140 142786 35071 E09
4 30.166109 00AE 11.454649 E09000005 00AE 00AEGJ E09000005 311215 70364 210212 30639 E09

I loaded the data on Communal Living in a very similar way to that for Age, although I did not have to aggregate columns together. The cell immediately below is the standard loading code I use for Excel files, and the cell below that follows the same procedure that I used for the Age data to take the observations which relate to the LA Codes.

In [20]:
#LOADS A DATAFRAME ON COMMUNAL LIVING FROM ONS WEBSITE
communal_xl = pd.ExcelFile('Communal Living.xls')
communal_df = communal_xl.parse('QS101EW_Numbers', header=10)
communal_df = DataFrame(communal_df)
communal_df.drop(['Unnamed: 2', 'Unnamed: 3', 'Area name'], axis=1, inplace=True)
communal_df.dropna(axis=0, inplace=True)
communal_df.head()
Out[20]:
Area code All categories: Residence type Lives in a household Lives in a communal establishment Communal establishments with persons sleeping rough identified
3 K04000001 56075912 55071113 1004799 208
5 E92000001 53012456 52059931 952525 198
7 E12000001 2596886 2550818 46068 17
9 E06000047 513242 500288 12954 8
10 E06000005 105564 104111 1453 0
In [21]:
#REPEATS THE SAME PROCEDURE AS FOR AGE_DF
communal_df['Short Code'] = communal_df['Area code'].apply(lambda x: x[:3])
unique_com = list(pd.unique(communal_df['Short Code'].ravel()))
unique_C = [x for x in unique_com if x[0]=='E']
unique_cdict = {x: communal_df.loc[communal_df['Short Code']==x] for x in unique_C}
communal_keys = list(communal_df.loc[:,:])
communal_df_refined = DataFrame()
for key in unique_cdict.keys():
    if key in unique_geo_short2:
        communal_df_refined = pd.concat([communal_df_refined, unique_cdict[key]])
#MERGES        
age_com_dep_df = age_dep_df.merge(communal_df_refined, how='left',
                                 left_on='LAD', right_on='Area code')
age_com_dep_df.head()
Out[21]:
MEAN DEP SCORE LA CODE STD DEP SCORE LAD LAD CODE WARD CODE Area code_x All categories: Age Children Adults Pensioners Short Code_x Area code_y All categories: Residence type Lives in a household Lives in a communal establishment Communal establishments with persons sleeping rough identified Short Code_y
0 11.515124 00AA 6.405421 E09000001 00AA 00AAFA E09000001 7375 692 5721 962 E09 E09000001 7375 7187 188 0 E09
1 34.094765 00AB 8.255361 E09000002 00AB 00ABFX E09000002 185911 53544 114160 18207 E09 E09000002 185911 184901 1010 0 E09
2 16.488612 00AC 8.427992 E09000003 00AC 00ACFX E09000003 356386 83073 228823 44490 E09 E09000003 356386 352597 3789 1 E09
3 16.520170 00AD 10.268777 E09000004 00AD 00ADGA E09000004 231997 54140 142786 35071 E09 E09000004 231997 230902 1095 1 E09
4 30.166109 00AE 11.454649 E09000005 00AE 00AEGJ E09000005 311215 70364 210212 30639 E09 E09000005 311215 308895 2320 1 E09

When loading these .csv files, I had to change the encoding (thank you, Stack Overflow!) to allow them to be read in properly. The data in this .csv concerns the number of people of different ethnicities in each LA Code. As I found out, the data in this .csv were stored as text, which created a bit of a headache! I also had to rename one of the columns as the title was truncated to a potentially offensive length . . .

In [22]:
#LOADS ETHNICITY CSV FROM ONS WEBSITE, ENCODING NEEDED TO CHANGE
eth_xl = pd.read_csv('EthnicityLAD.csv', header=10,encoding = "ISO-8859-1", low_memory=False)
eth_df = DataFrame(eth_xl)
#CHANGED COLUMN NAMES FOR EASE OF UNDERSTANDING
eth_df.rename(columns={'Unnamed: 2':'Area code',"Asian/Asian British: Pakistani or British Pakistani":"Asian/Asian British: Pakistani/British Pakistani"}, inplace=True)
eth_df.drop(['Unnamed: 0','Unnamed: 1','Unnamed: 3','Unnamed: 4'], axis=1, inplace=True)
eth_df.dropna(axis=0, inplace=True)
eth_df.head()
Out[22]:
Area code All categories: Ethnic group (detailed) White: English/Welsh/Scottish/Northern Irish/British White: Irish White: Gypsy or Irish Traveller White: Afghan White: Albanian White: Anglo Indian White: Argentinian White: Australian/New Zealander ... Other ethnic group: Punjabi Other ethnic group: Somali Other ethnic group: Somalilander Other ethnic group: Sri Lankan Other ethnic group: Tamil Other ethnic group: Thai Other ethnic group: Turkish Other ethnic group: Turkish Cypriot Other ethnic group: Vietnamese Other ethnic group: Any other ethnic group
2 E06000047 513,242 495,664 1,229 467 0 4 2 0 157 ... 0 1 0 0 0 1 20 1 2 129
3 E06000005 105,564 98,898 328 350 0 0 2 0 37 ... 0 0 0 0 0 4 8 0 0 56
4 E06000001 92,028 88,924 193 40 0 0 0 0 23 ... 0 0 0 3 0 0 4 0 0 14
5 E06000002 138,412 119,106 574 85 11 14 0 0 15 ... 0 0 0 0 0 0 24 2 0 108
6 E06000048 316,028 307,097 833 156 0 3 2 3 124 ... 0 0 0 0 0 0 35 0 0 71

5 rows × 252 columns

Only the numbers which were greater than 999 were stored as text (because of the comma separator!) and this was a huge headache to figure out. The orig_col_list is a list of all the columns in the DataFrame, num_col_list is a list of all the numeric columns (I will use this list to convert the data to numeric) and sum_col_list is a list of the columns I may need to sum in order to reduce the number of variables.

In [23]:
#MAKES A LIST OF THE COLUMNS IN ETH_DF
orig_col_list = list(eth_df)
#MAKES A LIST OF THE NUMERIC COLUMNS IN THE ETH_DF - NUMERIC STORED AS TEXT IN CSV, NEED TO CONVERT
# ONLY STORED AS TEXT IF THE NUMBER IS >999 AND THEREFORE HAS A ',' IN
num_col_list = orig_col_list[1:]
#MAKES A LIST OF THE COLUMNS I MAY NEED TO AGGREGATE 
#(LEAVE ALL CATEGORIES (COL1) FOR CHECKING, AND OTHER ETHNIC GROUP(COL-1) FOR AGGREGATION)
sum_col_list = orig_col_list[2:-1]

I used the .dtype attribute to check whether each column's data was stored as text. If one data item in a column contained a comma, the entire column would be stored as text. This step was important, as the next line of code uses the pandas str.replace method, which only works on strings and throws an error if the column is stored as integer or float. The str.replace method simply replaces every occurrence of a ',' in each affected cell with a zero-length string, effectively removing the comma.

The final line converts all the numeric data in the DataFrame to the same type (float32). You can see the difference between the preview below and the one displayed two cells up: all the commas have gone! The underlying structure of the data has also changed, which is actually the important bit.

In [24]:
#REPLACES THE COMMAS IN NUMBERS GREATER THAN 999
for col in num_col_list:
    if eth_df[col].dtype == object:
        eth_df[col] = eth_df[col].str.replace(',','')
#CONVERTS TO NUMERIC
eth_df[num_col_list] = eth_df[num_col_list].convert_objects(convert_numeric=True).astype('float32')
eth_df.head()
Out[24]:
Area code All categories: Ethnic group (detailed) White: English/Welsh/Scottish/Northern Irish/British White: Irish White: Gypsy or Irish Traveller White: Afghan White: Albanian White: Anglo Indian White: Argentinian White: Australian/New Zealander ... Other ethnic group: Punjabi Other ethnic group: Somali Other ethnic group: Somalilander Other ethnic group: Sri Lankan Other ethnic group: Tamil Other ethnic group: Thai Other ethnic group: Turkish Other ethnic group: Turkish Cypriot Other ethnic group: Vietnamese Other ethnic group: Any other ethnic group
2 E06000047 513242 495664 1229 467 0 4 2 0 157 ... 0 1 0 0 0 1 20 1 2 129
3 E06000005 105564 98898 328 350 0 0 2 0 37 ... 0 0 0 0 0 4 8 0 0 56
4 E06000001 92028 88924 193 40 0 0 0 0 23 ... 0 0 0 3 0 0 4 0 0 14
5 E06000002 138412 119106 574 85 11 14 0 0 15 ... 0 0 0 0 0 0 24 2 0 108
6 E06000048 316028 307097 833 156 0 3 2 3 124 ... 0 0 0 0 0 0 35 0 0 71

5 rows × 252 columns
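A note if you're running this on a newer version of pandas: convert_objects has since been deprecated and removed, and pd.to_numeric is its replacement. A sketch of an equivalent call:

#SKETCH FOR NEWER PANDAS VERSIONS, WHERE convert_objects NO LONGER EXISTS
eth_df[num_col_list] = (eth_df[num_col_list]
                        .apply(pd.to_numeric, errors='coerce')
                        .astype('float32'))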

This cell aggregates the numbers in a similar way to how I aggregated the numbers in age_df. The main difference is that a column is only aggregated if there are fewer than 500,000 people in that category. I chose this number through trial and error, trying to pick one large enough to reduce the number of variables to a manageable level, but also small enough to maintain enough granularity in the data to make it interesting!

In [25]:
#AGGREGATES THE SMALLER ETHNIC GROUPS INTO 'OTHER' TO KEEP NUMBER OF VARIABLES MANAGEABLE
for col in sum_col_list:
    if eth_df[col].sum() < 500000:
        eth_df['Other ethnic group: Any other ethnic group'] += eth_df[col]
        eth_df.drop([col], axis=1, inplace=True)
eth_df.head()       
Out[25]:
Area code All categories: Ethnic group (detailed) White: English/Welsh/Scottish/Northern Irish/British White: Irish White: Polish Asian/Asian British: Indian or British Indian Asian/Asian British: Pakistani/British Pakistani Black/African/Caribbean/Black British: African Black/African/Caribbean/Black British: Caribbean Other ethnic group: Any other ethnic group
2 E06000047 513242 495664 1229 1694 1368 463 447 158 12219
3 E06000005 105564 98898 328 857 726 145 206 105 4299
4 E06000001 92028 88924 193 287 266 291 129 36 1902
5 E06000002 138412 119106 574 592 1477 6811 1470 92 8290
6 E06000048 316028 307097 833 542 939 351 193 114 5959

This cell checks that the columns were summed correctly, and the one below merges the ethnicity data with the DataFrame that we have slowly been building.

In [26]:
#MAKES A LIST OF THE NEW COLUMNS OF WHICH TO SUM THE NUMERIC VARIABLES TO CHECK THEY MATCH THE 'ALL CATEGORIES'
new_col_list = eth_df.columns.values.tolist()
new_col_list = new_col_list[2:]
eth_df['sum_test'] = eth_df[new_col_list].sum(axis=1)
eth_df.head()
#COMMENT OUT THE DROP BELOW TO INSPECT sum_test, THEN RESTORE IT ONCE THE SUM MATCHES 'ALL CATEGORIES'
eth_df.drop(['sum_test'], axis=1, inplace=True)
In [27]:
eth_age_com_dep_df = age_com_dep_df.merge(eth_df, how='left', left_on='LAD',right_on='Area code')
eth_age_com_dep_df.head()
Out[27]:
MEAN DEP SCORE LA CODE STD DEP SCORE LAD LAD CODE WARD CODE Area code_x All categories: Age Children Adults ... Area code All categories: Ethnic group (detailed) White: English/Welsh/Scottish/Northern Irish/British White: Irish White: Polish Asian/Asian British: Indian or British Indian Asian/Asian British: Pakistani/British Pakistani Black/African/Caribbean/Black British: African Black/African/Caribbean/Black British: Caribbean Other ethnic group: Any other ethnic group
0 11.515124 00AA 6.405421 E09000001 00AA 00AAFA E09000001 7375 692 5721 ... E09000001 7375 4243 180 40 216 16 98 46 2536
1 34.094765 00AB 8.255361 E09000002 00AB 00ABFX E09000002 185911 53544 114160 ... E09000002 185911 91949 1730 1624 7436 8007 28685 5227 41253
2 16.488612 00AC 8.427992 E09000003 00AC 00ACFX E09000003 356386 83073 228823 ... E09000003 356386 162117 8685 7289 27920 5344 19392 4468 121171
3 16.520170 00AD 10.268777 E09000004 00AD 00ADGA E09000004 231997 54140 142786 ... E09000004 231997 179250 2596 953 7047 730 15952 2381 23088
4 30.166109 00AE 11.454649 E09000005 00AE 00AEGJ E09000005 311215 70364 210212 ... E09000005 311215 55887 12320 9170 58017 14381 24391 23723 113326

5 rows × 28 columns

The Religion data was loaded in the same way as the data for Ethnicity.

In [28]:
#LOADS A CSV FROM ONS WEBSITE ABOUT RELIGION BREAKDOWN IN AREAS, CHANGE ENCODING AGAIN
rel_xl = pd.read_csv('ReligionLAD.csv', header=10,encoding = "ISO-8859-1", low_memory=False)
rel_df = DataFrame(rel_xl)
#RENAMING AND DROPPING
rel_df.rename(columns={'Unnamed: 2':'Area code'}, inplace=True)
rel_df.drop(['Unnamed: 0','Unnamed: 1','Unnamed: 4', 'Unnamed: 3'], axis=1, inplace=True)
rel_df.dropna(axis=0, inplace=True)
rel_df.head()
Out[28]:
Area code All categories: Religion Christian Buddhist Hindu Jewish Muslim (Islam) Sikh Other religion: Total Other religion: Animism ... No religion: Total No religion: No religion No religion: Agnostic No religion: Atheist No religion: Free Thinker No religion: Heavy Metal No religion: Humanist No religion: Jedi Knight No religion: Realist Religion not stated
2 E06000047 513,242 369,715 1,001 607 208 1,934 609 1,525 5 ... 107,281 105,468 188 201 4 34 136 1,248 2 30,362
3 E06000005 105,564 71,122 307 317 45 971 361 310 2 ... 25,415 24,977 56 43 4 9 25 301 0 6,716
4 E06000001 92,028 64,349 152 168 9 689 97 178 1 ... 20,507 20,217 34 38 0 5 15 198 0 5,879
5 E06000002 138,412 87,511 370 591 45 9,757 564 246 0 ... 30,797 30,340 53 46 1 10 16 331 0 8,531
6 E06000048 316,028 216,673 578 335 169 1,018 494 980 1 ... 75,620 74,515 127 113 2 23 100 739 1 20,161

5 rows × 60 columns

The numbers greater than 999 were stored as text, however the columns with smaller numbers were already aggregated into the 'Other religion: Total', 'No religion: Total' and 'Religion not stated' columns. Therefore, I dropped every column but these and the columns for the big six religions (Christianity, Buddhism, Hinduism, Islam, Judaism and Sikhism). I then merged this DataFrame with the main DataFrame.

In [29]:
#TAKES A LIST OF COLUMNS, GOING TO CONVERT CHAR TO NUMERIC AND SUM AS WITH PREVIOUS DATAFRAME
rel_orig_col_list = list(rel_df)
rel_sum_col_list = rel_orig_col_list[8:]
#OTHER RELIGION:TOTAL IS THE SUM OF ALL COLUMNS BEGINNING WITH 'OTHER RELIGION:' 
#WILL DROP THEM AS THEY HAVE ALREADY BEEN SUMMED IN THE DATA
for col in rel_sum_col_list:
    if col not in ['Other religion: Total','No religion: Total','Religion not stated']:
        rel_df.drop([col], axis=1, inplace=True)
        
#REPLACES THE COLUMN LIST WITH ONE FROM THE UPDATED DATAFRAME (AFTER DROPPING UNNECESSARY COLUMNS)
rel_orig_col_list = list(rel_df)
rel_num_col_list = rel_orig_col_list[1:]

#CONVERTS CHARACTERS TO NUMERIC
for col in rel_num_col_list:
    if rel_df[col].dtype == object:
        rel_df[col] = rel_df[col].str.replace(',','')
rel_df[rel_num_col_list] = rel_df[rel_num_col_list].convert_objects(convert_numeric=True).astype('float32')
rel_df.head()

#MERGES
rel_eth_age_com_dep_df = eth_age_com_dep_df.merge(rel_df, how='left', left_on='LAD',right_on='Area code')
rel_eth_age_com_dep_df.head()
Out[29]:
MEAN DEP SCORE LA CODE STD DEP SCORE LAD LAD CODE WARD CODE Area code_x All categories: Age Children Adults ... All categories: Religion Christian Buddhist Hindu Jewish Muslim (Islam) Sikh Other religion: Total No religion: Total Religion not stated
0 11.515124 00AA 6.405421 E09000001 00AA 00AAFA E09000001 7375 692 5721 ... 7375 3344 92 145 166 409 18 28 2522 651
1 34.094765 00AB 8.255361 E09000002 00AB 00ABFX E09000002 185911 53544 114160 ... 185911 104101 842 4464 425 25520 2952 533 35106 11968
2 16.488612 00AC 8.427992 E09000003 00AC 00ACFX E09000003 356386 83073 228823 ... 356386 146866 4521 21924 54084 36744 1269 3764 57297 29917
3 16.520170 00AD 10.268777 E09000004 00AD 00ADGA E09000004 231997 54140 142786 ... 231997 144093 1377 3547 234 5645 4156 724 55995 16226
4 30.166109 00AE 11.454649 E09000005 00AE 00AEGJ E09000005 311215 70364 210212 ... 311215 129080 4300 55449 4357 58036 1709 3768 33054 21462

5 rows × 39 columns

The data in the .xls on Population Density was loaded and wrangled in the same way as that for Communal Living. Only the data which corresponded to the Area Codes I am interested in was kept (I really could have written a function for this; a sketch of one follows below) and this DataFrame was merged with the main DataFrame.
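For the curious, here is a sketch of what that function might have looked like (refine_and_merge is a hypothetical name of my own, and it assumes the column layout used throughout this post):

#A HYPOTHETICAL HELPER WRAPPING THE REPEATED PATTERN: FILTER A CENSUS TABLE
#TO THE SHORT CODES OF INTEREST, THEN LEFT-MERGE IT ONTO THE MAIN DATAFRAME
def refine_and_merge(main_df, new_df, short_codes,
                     left_on='LAD', right_on='Area code'):
    new_df = new_df.copy()
    new_df['Short Code'] = new_df[right_on].apply(lambda x: x[:3])
    refined = new_df[new_df['Short Code'].isin(short_codes)]
    return main_df.merge(refined, how='left',
                         left_on=left_on, right_on=right_on)

#USAGE WOULD THEN BE A SINGLE LINE PER DATASET, E.G.
#final_data = refine_and_merge(rel_eth_age_com_dep_df, dens_df, unique_geo_short2)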

In [30]:
#TAKES AN EXCEL FILE ON POPULATION DENSITY WHICH I DOWNLOADED FROM ONS WEBSITE
dens_xl = pd.ExcelFile('Population Density.xls')
dens_df = dens_xl.parse('QS102EW_Numbers', header=10)
dens_df = DataFrame(dens_df)
#RENAMES AND DROPS COLUMNS AND NA
dens_df.drop(['Unnamed: 2','Unnamed: 3','Area name'], axis=1, inplace=True)

#TAKES THE SHORT CODES AND REFINES THE DATAFRAME TO ONLY CONTAIN OBSERVATIONS WHICH CAN BE MERGED ON 'AREA CODE'
#REPEAT OF PROCESS FOR COMMUNAL_DF AND AGE_DF
dens_df.dropna(axis=0, inplace=True)
dens_df['Short Code'] = dens_df['Area code'].apply(lambda x: x[:3])
unique_dens = list(pd.unique(dens_df['Short Code'].ravel()))

unique_D = [x for x in unique_dens if x[0]=='E']
unique_ddict = {x:dens_df.loc[dens_df['Short Code']==x]for x in unique_D}
dens_keys = list(dens_df.loc[:,:])
dens_df_refined = DataFrame()
for key in unique_ddict.keys():
    if key in unique_geo_short2:
        dens_df_refined = pd.concat([dens_df_refined, unique_ddict[key]])  

#MERGES WITH EVERYTHING ELSE
final_data = rel_eth_age_com_dep_df.merge(dens_df_refined, how='left', left_on='LAD',right_on='Area code')
final_data.head(5)
Out[30]:
MEAN DEP SCORE LA CODE STD DEP SCORE LAD LAD CODE WARD CODE Area code_x All categories: Age Children Adults ... Muslim (Islam) Sikh Other religion: Total No religion: Total Religion not stated Area code All usual residents Area Density Short Code
0 11.515124 00AA 6.405421 E09000001 00AA 00AAFA E09000001 7375 692 5721 ... 409 18 28 2522 651 E09000001 7375 290 25.5 E09
1 34.094765 00AB 8.255361 E09000002 00AB 00ABFX E09000002 185911 53544 114160 ... 25520 2952 533 35106 11968 E09000002 185911 3609 51.5 E09
2 16.488612 00AC 8.427992 E09000003 00AC 00ACFX E09000003 356386 83073 228823 ... 36744 1269 3764 57297 29917 E09000003 356386 8674 41.1 E09
3 16.520170 00AD 10.268777 E09000004 00AD 00ADGA E09000004 231997 54140 142786 ... 5645 4156 724 55995 16226 E09000004 231997 6056 38.3 E09
4 30.166109 00AE 11.454649 E09000005 00AE 00AEGJ E09000005 311215 70364 210212 ... 58036 1709 3768 33054 21462 E09000005 311215 4324 72 E09

5 rows × 44 columns

The merging process created a few duplicate Area Code columns, which are dropped in this cell; the LA and LAD Codes still serve to identify the individual geographic areas.

In [31]:
#DROPS COLUMNS WHICH HAVE BEEN DUPLICATED DURING THE MERGES, CREATES A MASTER LIST OF ALL COLUMNS
columns =  pd.unique(final_data.columns.values.tolist())
for column in columns:
    if 'Area code' in column or 'Short Code' in column:
        final_data.drop([column], axis=1, inplace=True)
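For what it's worth, the same thing can be done without an explicit loop: DataFrame.filter with a regex picks out the matching column labels, and a single .drop then removes them (a sketch, not the code I ran):

#A SKETCH OF A LOOP-FREE ALTERNATIVE USING A REGEX OVER THE COLUMN LABELS
dup_cols = final_data.filter(regex='Area code|Short Code').columns
final_data.drop(dup_cols, axis=1, inplace=True)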

I created a list of all the column names, chopping this down to a list containing only the variables of interest; I then split this list further into the columns which correspond to each type of data that I loaded (Ethnicity, Religion, etc.).

In [32]:
#CREATES A LIST OF ALL THE COLUMNS WHICH CONTAIN VARIABLES OF INTEREST
columns2 = list(final_data)
cols2 = columns2[6:]
#SPLITS THE MASTER LIST OF COLUMNS INTO SECTIONS FOR EACH CATEGORY
age_cols = cols2[:4]
comm_cols = cols2[4:8]
eth_cols = cols2[8:17]
rel_cols = cols2[17:27]
res_cols = cols2[27:]
age_cols
Out[32]:
['All categories: Age', 'Children', 'Adults', 'Pensioners']

At this point I saved the final_data DataFrame to .csv as it had become time consuming to run all the previous code every time I restarted work on the project.

In [33]:
#SAVES TO CSV TO SPEED UP LOADING PROCESS WHEN RETURNING TO ANALYSIS
final_data.to_csv('Predictors of Deprivation.csv')
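One small tip: passing index=False stops to_csv from writing the row index, which is exactly the sort of 'Unnamed: 0' column I had to drop from the geography CSV at the start of this post.

#SAVING WITHOUT THE INDEX AVOIDS AN 'Unnamed: 0' COLUMN ON RELOAD
final_data.to_csv('Predictors of Deprivation.csv', index=False)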
You can find the second part of this analysis here. You can also find both analyses (as well as many more) on my GitHub. Thanks for reading!