Indicators of Deprivation Part 2

By the end of the previous part of this analysis, I had loaded and cleaned all of the different datasets and collected them together into a single DataFrame.

I have kept the same set of imports as in Part 1; I won't use them all, but I do like consistency.

In [6]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
from sklearn import linear_model, cross_validation
from scipy import stats
import statsmodels.api as sm
import itertools
import copy
%matplotlib inline

Analysing the Data

Loading the previously saved data from the .csv . . .

In [35]:
#LOADS DATA FROM CSV
data = pd.read_csv("Predictors of Deprivation.csv", encoding = "ISO-8859-1", low_memory=False)
data_df = DataFrame(data)
data_df.head()
Out[35]:
Unnamed: 0 MEAN DEP SCORE LA CODE STD DEP SCORE LAD LAD CODE WARD CODE All categories: Age Children Adults ... Hindu Jewish Muslim (Islam) Sikh Other religion: Total No religion: Total Religion not stated All usual residents Area Density
0 0 11.515124 00AA 6.405421 E09000001 00AA 00AAFA 7375 692 5721 ... 145 166 409 18 28 2522 651 7375 290 25.5
1 1 34.094765 00AB 8.255361 E09000002 00AB 00ABFX 185911 53544 114160 ... 4464 425 25520 2952 533 35106 11968 185911 3609 51.5
2 2 16.488612 00AC 8.427992 E09000003 00AC 00ACFX 356386 83073 228823 ... 21924 54084 36744 1269 3764 57297 29917 356386 8674 41.1
3 3 16.520170 00AD 10.268777 E09000004 00AD 00ADGA 231997 54140 142786 ... 3547 234 5645 4156 724 55995 16226 231997 6056 38.3
4 4 30.166109 00AE 11.454649 E09000005 00AE 00AEGJ 311215 70364 210212 ... 55449 4357 58036 1709 3768 33054 21462 311215 4324 72.0

5 rows × 37 columns

It became apparent that a few columns give essentially the same information, namely the total population in each area ('All categories: Age' and 'All usual residents', for example). I dropped these columns as they do not satisfy the assumption of independence of variables for inclusion in a multiple regression. I am aware that Religion and Ethnicity may not be truly independent; however, the 'Other ethnic groups' aggregate category contains so many different ethnicities that I think the assumption holds true enough for this analysis.
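As a quick check of the redundancy claim (a one-off sanity check, not part of the main analysis; note the trailing space in 'All usual residents ', which is present in the original header):

#COMPARES THE TWO POPULATION COLUMNS - I'D EXPECT True IF THEY CARRY THE SAME INFORMATION
print((data_df['All categories: Age'] == data_df['All usual residents ']).all())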

After dropping the columns, I made a new list of column names for each data source (making the cell a few lines up an exercise in pointlessness . . .); the explanatory variables in these lists will each be individually regressed against the response variable, MEAN DEP SCORE . . .

In [36]:
#DROPS COLUMNS WHICH ONLY GIVE THE POPULATION INFORMATION 
#EG 'ALL USUAL RESIDENTS' GIVES THE SAME INFO AS 'ALL CATEGORIES: RELIGION'
#SPLITS THE COLUMN NAMES INTO THE ORIGINAL CATEGORIES
x_vals = data_df.drop(['MEAN DEP SCORE','All usual residents ','All categories: Religion','All categories: Ethnic group (detailed)','All categories: Residence type','All categories: Age','Unnamed: 0','LA CODE','STD DEP SCORE','LAD','LAD CODE','WARD CODE'], axis=1)
x_cols = list(x_vals.columns)
age_vals = x_cols[:3]
com_vals = x_cols[3:6]
eth_vals = x_cols[6:14]
rel_vals = x_cols[14:23]
dens_vals = x_cols[23:]
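A quick printout serves as a sanity check that the slices line up with their categories (the exact names depend on the column order carried over from Part 1):

#SANITY CHECK - EACH SLICE SHOULD CONTAIN THE COLUMNS OF EXACTLY ONE CATEGORY
print(age_vals)   #EXPECTING THE AGE COLUMNS: 'Children', 'Adults', 'Pensioners'
print(dens_vals)  #EXPECTING THE GEOGRAPHY COLUMNS: 'Area' AND 'Density'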

The cell below defines a function which takes as inputs a list of columns containing explanatory variables, the DataFrame from which they came, and the response variable.

It returns a DataFrame describing a simple regression of each variable against the response variable. This DataFrame contains:

  • the Regression Coefficient (the slope of the line - how much the Y variable changes for a given change in X)
  • the Intercept (where the regression line crosses the Y-axis)
  • the RSS (strictly, the mean squared error of the predictions on the held-out test data - lower values are better)
  • the Variance score (sklearn's .score() on the test data, i.e. the R-squared of the predictions - higher is better, and it can go negative for a poor fit)
  • R-Squared (the proportion of the variance in Y that is explained by knowing the X variable - higher values are better)
  • the P-value (helps to determine if the regression equation is statistically significant - lower values are better, under 0.05 is pretty good)
  • and the Standard Error (helps to assess the precision of any predictions).

Each item above is suffixed in the DataFrame with either (sk) - indicating I used Sklearn for the calculation - or (sci) - indicating I used SciPy.

The explanatory variables are grouped in the categories they were loaded from as I intend to pick one variable from each group to be included in a multiple regression.
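As a minimal illustration of the SciPy call used inside the function (toy numbers, nothing to do with the census data):

#TOY EXAMPLE OF stats.linregress - THE SAME CALL THE FUNCTION BELOW MAKES ON EACH COLUMN
slope, intercept, r_val, p_val, stderr = stats.linregress(x=[1, 2, 3, 4, 5], y=[2.1, 3.9, 6.2, 7.8, 10.1])
print(r_val**2, p_val)  #AN R-SQUARED NEAR 1 AND A TINY P-VALUE, AS THE TOY DATA IS NEARLY LINEAR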

In [37]:
#GETS THE INDIVIDUAL REGRESSION FOR EACH VARIABLE AGAINST THE MEAN DEPRIVATION SCORE
def get_regs(x_cols, x_variables, y_variable):
    """Takes a list of columns, a dataframe of x_variables and a y_variables (Series or DataFrame), and returns a DataFrame
    of the Intercept, Coerfficient, RSS and Variable for each x_variable and the y_variable. The list x_cols should be 
    the names of a subset of columns from the x_variable DataFrame.
    """
    outputlist = []
    regression_funcs = []
    for i in range(len(x_cols)):
        #CREATES SK LEARN REGRESSION MODEL
        reg_func = linear_model.LinearRegression()
        #TRAINING/TESTING SPLIT - TEST SIZE IS DEFAULT
        x_train, x_test, y_train, y_test = cross_validation.train_test_split(x_variables[x_cols[i]],y_variable, test_size = 0.25)
        #CREATES SCIPY REGRESSION MODEL FOR R, P AND STD ERR
        slope,intercept,r_val,p_val,stderr = stats.linregress(y=y_train,x=x_train)
        #ADDS THE DATA ABOUT THE REGRESSION MODELS TO A DICTIONARY
        outputlist.append({'X-Variable':str(x_cols[i]).replace("'",''),
                           'Coef(sk)':reg_func.fit(np.reshape(x_train,((len(x_train),1))), y_train).coef_[0],
                           'Intercept(sk)':reg_func.fit(np.reshape(x_train,((len(x_train),1))), y_train).intercept_,
                           'RSS(sk)':np.mean((reg_func.predict(np.reshape(x_test,((len(x_test),1))))-y_test)**2),
                           'Variance(sk)':reg_func.score(np.reshape(x_test,((len(x_test),1))), y_test),
                          'R-value(sci)':r_val,
                          'R-squared(sci)':r_val**2,
                          'P-value(sci)':p_val,
                          'Std Err(sci)':stderr})
        
        regression_funcs.append({'X-Variable':str(x_cols[i]).replace("'",''),
                                'Regression Object':reg_func})
    #OUTPUTS A DATAFRAME AND REORDERS THE VARIABLES SENSIBLY
    df = DataFrame(outputlist)
    df = df[['X-Variable','Intercept(sk)','Coef(sk)',  'RSS(sk)', 'Variance(sk)','R-value(sci)','R-squared(sci)','P-value(sci)','Std Err(sci)' ]]
    return df,regression_funcs

This cell runs the lists of variables through the function defined above:

In [38]:
#GETS REGRESSIONS ON EACH VARIABLE AGAINST THE MEAN DEPRIVATION SCORE
age_reg_df,age_regs = get_regs(age_vals,x_vals,data_df['MEAN DEP SCORE'])
com_reg_df,com_regs = get_regs(com_vals,x_vals,data_df['MEAN DEP SCORE'])
eth_reg_df,eth_regs = get_regs(eth_vals,x_vals,data_df['MEAN DEP SCORE'])
rel_reg_df,rel_regs = get_regs(rel_vals,x_vals,data_df['MEAN DEP SCORE'])
dens_reg_df,dens_regs = get_regs(dens_vals,x_vals,data_df['MEAN DEP SCORE'])

# now to decide which variable from each group to include in the final multiple regression          

This DataFrame shows the values from the regression for the Population Density variables. I decided to take Density as it has a higher R-squared, a lower p-value and a smaller RSS than Area.

In [39]:
dens_reg_df #take Density - high R^2, lower p-value, smaller RSS
Out[39]:
X-Variable Intercept(sk) Coef(sk) RSS(sk) Variance(sk) R-value(sci) R-squared(sci) P-value(sci) Std Err(sci)
0 Area 20.364735 -0.000034 69.332510 0.055812 -0.218199 0.047611 5.984320e-04 0.000010
1 Density 16.222283 0.178736 61.649141 0.276955 0.493837 0.243875 2.091625e-16 0.020231

This DataFrame shows the values from the regression for the Religion variables. I decided to take 'No religion: Total' and 'Religion not stated' as they have reasonable R-squared values, low p-values and small RSS values. There could be a case for including almost any of the other variables, but I wanted to limit the number I would consider for inclusion in the multiple regression. Please feel free to try the analysis yourself with different combinations!

In [40]:
rel_reg_df # take no religion: total and religion not stated - lowest RSS, reasonable R^2, lowest p-value
Out[40]:
X-Variable Intercept(sk) Coef(sk) RSS(sk) Variance(sk) R-value(sci) R-squared(sci) P-value(sci) Std Err(sci)
0 Christian 14.346330 0.000051 72.059505 0.102854 0.394980 0.156009 1.551414e-10 0.000008
1 Buddhist 15.987272 0.004164 59.519725 0.058685 0.348671 0.121572 2.205961e-08 0.000720
2 Hindu 19.054903 0.000199 51.209229 0.057785 0.150973 0.022793 1.828966e-02 0.000084
3 Jewish 18.935744 0.000127 71.406320 0.001787 0.058712 0.003447 3.611468e-01 0.000139
4 Muslim (Islam) 17.042289 0.000290 83.951083 -0.082738 0.470447 0.221320 7.681220e-15 0.000035
5 Sikh 19.050486 0.000481 70.367367 -0.009192 0.255785 0.065426 5.290271e-05 0.000117
6 Other religion: Total 17.283141 0.003222 53.723960 0.108819 0.267568 0.071593 2.279769e-05 0.000746
7 No religion: Total 14.871529 0.000109 71.945679 0.114103 0.354501 0.125671 1.233217e-08 0.000019
8 Religion not stated 14.504361 0.000444 42.887706 0.121009 0.406599 0.165322 3.941129e-11 0.000064

This DataFrame shows the values from the regression for the Ethnicity variables. I decided to take 'Black/African/Caribbean/Black British: African' and 'Other ethnic group: Any other ethnic group' as they both have high R-squared values, low p-values and (in the case of 'Black/African/Caribbean/Black British: African') a small RSS. Once again, I could easily have chosen more, or other, variables for inclusion in the multiple regression.

In [41]:
eth_reg_df # take African and other ethnic group - lowest RSS, highest r^2 and lowest p-value
Out[41]:
X-Variable Intercept(sk) Coef(sk) RSS(sk) Variance(sk) R-value(sci) R-squared(sci) P-value(sci) Std Err(sci)
0 White: English/Welsh/Scottish/Northern Irish/B... 15.347880 0.000031 61.234447 0.097515 0.282606 0.079866 7.342738e-06 0.000007
1 White: Irish 17.187757 0.001217 71.437101 0.098294 0.329031 0.108261 1.437104e-07 0.000224
2 White: Polish 16.507810 0.002110 65.887255 0.058538 0.398824 0.159061 9.916653e-11 0.000312
3 Asian/Asian British: Indian or British Indian 17.731878 0.000225 79.777338 -0.016017 0.302906 0.091752 1.428503e-06 0.000046
4 Asian/Asian British: Pakistani/British Pakistani 18.672440 0.000234 55.443517 0.021578 0.350451 0.122816 1.849438e-08 0.000040
5 Black/African/Caribbean/Black British: African 17.800460 0.000533 62.589540 0.214160 0.408438 0.166821 3.157152e-11 0.000077
6 Black/African/Caribbean/Black British: Caribbean 18.172565 0.000627 54.076499 0.127721 0.380181 0.144537 8.240526e-10 0.000098
7 Other ethnic group: Any other ethnic group 16.572930 0.000153 61.735524 0.104961 0.441273 0.194722 4.756173e-13 0.000020

This DataFrame shows the values from the regression for the Communal Living variables. I decided to take 'Lives in a household' as it has a higher R-squared, a lower p-value and a smaller RSS than the others. This may be because the magnitudes of the counts recorded for 'Lives in a communal establishment' and 'Communal establishments with persons sleeping rough' are much lower than those for 'Lives in a household'.

In [42]:
com_reg_df # take lives in a household - lowest RSS, p-value and highest r^2. 
Out[42]:
X-Variable Intercept(sk) Coef(sk) RSS(sk) Variance(sk) R-value(sci) R-squared(sci) P-value(sci) Std Err(sci)
0 Lives in a household 13.156717 0.000037 60.686792 0.177968 0.428568 0.183671 2.544955e-12 0.000005
1 Lives in a communal establishment 17.486256 0.000656 73.405096 0.073749 0.257198 0.066151 4.792102e-05 0.000158
2 Communal establishments with persons sleeping ... 18.586784 1.712720 65.731140 -0.031709 0.225864 0.051015 3.764627e-04 0.474855

This DataFrame shows the values from the regression for the Age variables. I decided to take Children as the variable from this list, as it has a high R-squared and a low p-value and RSS.

In [43]:
age_reg_df # take children, low RSS and p-value, highest R^2
Out[43]:
X-Variable Intercept(sk) Coef(sk) RSS(sk) Variance(sk) R-value(sci) R-squared(sci) P-value(sci) Std Err(sci)
0 Children 14.157832 0.000152 53.832585 0.227314 0.407424 0.165994 3.568239e-11 0.000022
1 Adults 14.138562 0.000050 48.294427 0.245643 0.437462 0.191373 7.924612e-13 0.000007
2 Pensioners 14.931408 0.000178 66.742276 -0.055553 0.325079 0.105677 2.062952e-07 0.000033

This cell takes the chosen variables from each category and saves them in a list to be plugged into a multiple regression:

In [44]:
#TAKES THE INDIVIDUAL COLUMNS WHICH I DECIDED UPON AND ADDS THEM TO THE LIST OF POTENTIAL VARIABLES TO BE INCLUDED IN A
#MULTIPLE LINEAR REGRESSION
multimodel_columns = [age_vals[0],com_vals[0],eth_vals[5],eth_vals[7],rel_vals[7], rel_vals[8],dens_vals[1]]
multimodel_columns
Out[44]:
['Children',
 'Lives in a household',
 'Black/African/Caribbean/Black British: African',
 'Other ethnic group: Any other ethnic group',
 'No religion: Total',
 'Religion not stated',
 'Density']

In preparation for the multiple regression, I made a function which creates a matrix of the different combinations of variables. For example, with 3 variables, there would be 2^3 = 8 possible combinations:

  1. [0][0][0]
  2. [0][0][1]
  3. [0][1][0]
  4. [0][1][1]
  5. [1][0][0]
  6. [1][0][1]
  7. [1][1][0]
  8. [1][1][1]

The list above shows all the possible combinations of three different variables; wherever a 1 appears, that variable would be included in the model (this is also the order in which itertools.product([0,1], repeat=3) generates them).

Disregarding the combination which has 0 variables, I will have to run (2^7)-1 = 127 different combinations. One more variable would have resulted in 255 combinations; you can see why I wanted to restrict them somewhat.
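A one-liner confirms the counting (just a sanity check using the same itertools.product trick as the function below):

#COUNTS THE NON-EMPTY COMBINATIONS OF 7 VARIABLES
print(len([c for c in itertools.product([0, 1], repeat=7) if sum(c) > 0]))  #127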

For details on how the function works, check the comments on each line.

In [45]:
#MAKES A MATRIX OF ALL THE POSSIBLE COMBINATIONS OF THE POTENTIAL REGRESSION VARIABLES, AND SAVES EACH AS A DATAFRAME
#BE CAREFUL, THIS GIVES (2^N)-1 COMBINATIONS

def create_dataframes_for_multi_linear_model(inputlist, inputdata):
    '''Takes an inputlist of column names which should be a subset of columns in inputdata, calculates every possible combination
    of the columns and returns a list of dataframes, each containing a different combination of the input data'''
    
    output_dataframes = []
    #MAKES THE MATRIX OF 1s AND 0s - EACH ROW IS ONE COMBINATION OF COLUMNS,
    #SKIPPING THE ALL-ZEROS ROW (A MODEL WITH NO VARIABLES); A DATAFRAME OF THE
    #MATCHING COLUMN WILL BE SUBSTITUTED WHEREVER A 1 APPEARS
    template = [list(i) for i in itertools.product([0,1], repeat = len(inputlist)) if sum(i) > 0]

    #ASSIGNS THE DATAFRAME COLUMNS TO EACH '1'
    for j in range(len(template)):
        for k in range(len(template[j])):
            if template[j][k] == 1:
                template[j][k] = DataFrame(inputdata[inputlist[k]])
                
    # takes the matrix of dataframes and concats them together into a single dataframe (where there is more than 1)
    # I should now be able to perform a linear regression on every combination of dataframe
    for i in range(len(template)):
        j=0
        temp_df = DataFrame()
        while j < len(template[i]):
            if isinstance(template[i][j], pd.DataFrame):
                if len(temp_df)==0:
                    temp_df = DataFrame(template[i][j])
                else:
                    temp_df = DataFrame(pd.concat([temp_df,template[i][j]], axis=1))
            j+=1
        output_dataframes.append(temp_df)
    
    return output_dataframes

This runs the list of chosen columns through the function above:

In [46]:
#ALL POSSIBLE COMBINATIONS OF VARIABLES
multi_dataframes = create_dataframes_for_multi_linear_model(multimodel_columns, data_df)    
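If everything worked, there should be one DataFrame per non-empty combination:

#SANITY CHECK ON THE NUMBER OF COMBINATIONS
print(len(multi_dataframes))  #SHOULD BE (2**7)-1 = 127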

This function does a multiple regression on every combination of variables produced by the function above. If you don't read the code, I just want you to know it was hard.

In [47]:
def multiple_linear_regression(xvar_combinations, y_variable):
    '''Does a multiple regression on each item in the list of xvar_combinations. 
    This should be the output from create_dataframes_for_multi_linear_model.
    Saves the information about each multiple regression to a DataFrame to allow the models to be compared.
    Outputs a DataFrame containing the Coefficients, Intercept, RSS, Variance, P-Value, R^2 and SE, 
    as well as a dictionary of the summaries from the statsmodel OLS function.'''
    regression_outputs = []
    regression_summaries= []
    
    for i in range(len(xvar_combinations)):
        #SKLEARN MODEL - DON'T NEED TO DO A TRAIN-TEST SPLIT AS I'M NOT TRYING TO MAKE A PREDICTIVE MODEL
        reg_func = linear_model.LinearRegression()
        
        
        #STATSMODELS MODEL - NEED FOR R-VALUE, P-VALUE ETC
        #USEFUL PARAMETERS FROM THE STATSMODELS.OLS - http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.RegressionResults.html
        #MAY NOT WANT TO ADD A CONSTANT IF REGRESSION SHOULD BE CONSTRAINED THROUGH THE ORIGIN
        xvariables = sm.add_constant(xvar_combinations[i], prepend=False)
        model = sm.OLS(y_variable,xvariables)
        results = model.fit()
        #GETS THE NAMES OF THE VARIABLES IN THE MODEL
        x_variable_names_start = str(xvar_combinations[i].columns).find('[')
        x_variable_names_end = str(xvar_combinations[i].columns).find(']')
        x_variable_names = str(xvar_combinations[i].columns)[x_variable_names_start+1:x_variable_names_end].replace("'",'')
        
        #FITS THE SKLEARN MODEL ONCE - NOTE xvariables INCLUDES THE add_constant COLUMN,
        #WHICH SKLEARN (FITTING ITS OWN INTERCEPT) ASSIGNS A COEFFICIENT OF EXACTLY 0.0;
        #THAT IS THE TRAILING 0.0 VISIBLE IN THE Coef(sk) LISTS BELOW
        reg_func.fit(xvariables, y_variable)
        regression_outputs.append({'X-Variables':x_variable_names,
                   'Coef(sk)':reg_func.coef_,
                   'Intercept(sk)':reg_func.intercept_,
                   'RSS(sm)':np.mean((results.resid**2)),
                          'P-Value(sm)':results.f_pvalue,
                          'R-Squared(sm)':results.rsquared,
                                  'Std Err(sm)':results.scale**0.5,
                                  'ID':i})
        regression_summaries.append({'X-Variables':x_variable_names,'Summary':results.summary(),'ID':i})
    #OUTPUTS DATAFRAME AND ORDER COLUMNS SENSIBLY
    df = DataFrame(regression_outputs)
    df = df[['X-Variables','Intercept(sk)','Coef(sk)',  'RSS(sm)', 'R-Squared(sm)','P-Value(sm)','Std Err(sm)','ID' ]]
        
    return df, regression_summaries
    

This runs the combinations of DataFrames through the multiple regression function above:

In [50]:
multiple_linear_models,summaries = multiple_linear_regression(multi_dataframes,data_df['MEAN DEP SCORE'])

It would be a pain to sift through all 127 different combinations of variables, so I devised a crude method to help me determine which combinations I would consider suitable candidates for a multiple regression.

Using the same measures I used to assess the single regressions, I created 4 copies of the regression data, each sorted on a different measure. I then ranked the variable combinations in order of 'goodness', merged the DataFrames back together and summed the ranks. This allowed me to discount the models that performed really badly and spend more time considering the 5 or so models at the top of the list.
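To make the rank-sum idea concrete, here is a toy version on three made-up 'models' and two metrics (names and numbers are invented; I use sort_values here for brevity):

#TOY RANK-SUM: 'B' IS BEST ON BOTH METRICS, SO IT SHOULD END UP WITH THE LOWEST RANK SUM
toy = DataFrame({'X-Variables': ['A', 'B', 'C'],
                 'RSS': [3.0, 1.0, 2.0],
                 'R-Squared': [0.2, 0.5, 0.4]})
by_rss = toy.sort_values('RSS', ascending=True).reset_index(drop=True)        #BEST (LOWEST) RSS FIRST
by_r2 = toy.sort_values('R-Squared', ascending=False).reset_index(drop=True)  #BEST (HIGHEST) R^2 FIRST
by_rss['rank'] = by_rss.index.values
by_r2['rank'] = by_r2.index.values
print(pd.concat([by_rss, by_r2]).groupby('X-Variables')['rank'].sum().sort_values())  #B: 0, C: 2, A: 4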

In [51]:
#CREATES SEPARATE DATAFRAMES, EACH SORTED BY A DIFFERENT MEASURE OF 'GOODNESS' OF THE MODEL
by_p = multiple_linear_models.sort(['P-Value(sm)'], ascending=True)
by_r = multiple_linear_models.sort(['R-Squared(sm)'],ascending=False)
by_RSS = multiple_linear_models.sort(['RSS(sm)'],ascending=True)
by_se = multiple_linear_models.sort(['Std Err(sm)'], ascending=True)

#RANKS THE MODELS BY HOW WELL THEY PERFORM IN EACH METRIC
by_p['rank']=by_p.reset_index().index.values
by_r['rank']=by_r.reset_index().index.values
by_RSS['rank']=by_RSS.reset_index().index.values
by_se['rank']=by_se.reset_index().index.values
In [52]:
#KEEPS ONLY THE X-VARIABLES AND RANK COLUMNS
def get_ranks(dataframe):
    output = dataframe.loc[:,['X-Variables','rank']]
    return output

ranked_by_p = get_ranks(by_p)
ranked_by_r = get_ranks(by_r)
ranked_by_RSS = get_ranks(by_RSS)
ranked_by_se = get_ranks(by_se)
In [53]:
#STICKS THE MODELS TOGETHER, SUMMING THE RANKS OF EACH, FINALLY MERGING WITH THE FULL INFO ABOUT THE MODELS AND SORTING
#TO REVEAL WHICH ARE THE BEST CANDIDATES FOR PREDICTING DEPRIVATION
final_multimodels = pd.concat([ranked_by_r,ranked_by_p, ranked_by_RSS, ranked_by_se]).groupby(['X-Variables'], as_index=False)['rank'].sum()
final_model_and_data = final_multimodels.merge(multiple_linear_models,how='left',on='X-Variables')


final_model_and_data.rename(columns={'rank':'Sum of Rank'},inplace=True)
final_model_and_data.sort('Sum of Rank')
Out[53]:
X-Variables Sum of Rank Intercept(sk) Coef(sk) RSS(sm) R-Squared(sm) P-Value(sm) Std Err(sm) ID
87 Lives in a household,\n Black/African/Ca... 7 12.210549 [8.30532615656e-05, 0.000188751423961, -0.0001... 41.987483 0.409067 8.023450e-34 6.550484 62
41 Children, Lives in a household,\n Black/... 17 12.234387 [0.000109624833768, 5.40861933451e-05, 0.00017... 41.862720 0.410823 3.211866e-33 6.551021 126
56 Children, Lives in a household,\n Other ... 17 12.144967 [0.000134059573103, 4.9897005847e-05, -0.00015... 42.219062 0.405808 1.898987e-33 6.568523 110
108 Lives in a household, Other ethnic group: Any ... 20 12.106615 [8.55408862423e-05, -0.000134155979803, -7.573... 42.408945 0.403135 5.519395e-34 6.572984 46
28 Children, Black/African/Caribbean/Black Britis... 21 12.395731 [0.000286055396495, 0.000160819423157, -0.0002... 42.237636 0.405546 2.034417e-33 6.569968 94
26 Children, Black/African/Caribbean/Black Britis... 24 12.261860 [0.000268253747312, 0.000199268789424, -0.0002... 42.409197 0.403132 5.524580e-34 6.573003 92
39 Children, Lives in a household,\n Black/... 25 12.120359 [0.000163508727568, 2.99890064633e-05, 0.00022... 42.268010 0.405119 2.276844e-33 6.572330 124
74 Children, Other ethnic group: Any other ethnic... 30 12.301525 [0.000296010110107, -0.00016894797577, -4.7616... 42.540177 0.401288 8.987470e-34 6.583146 78
43 Children, Lives in a household,\n Black/... 34 12.332372 [0.000152718314797, 3.27369518917e-05, 0.00013... 42.393435 0.403354 3.621134e-33 6.582074 122
30 Children, Black/African/Caribbean/Black Britis... 34 12.423524 [0.00026454211561, 0.000129254275418, -0.00017... 42.547776 0.401182 9.244410e-34 6.583734 90
76 Children, Other ethnic group: Any other ethnic... 38 12.340408 [0.000276295246205, -0.000147801653487, -0.000... 42.750840 0.398324 2.513095e-34 6.589138 74
85 Lives in a household,\n Black/African/Ca... 38 12.054146 [7.07904421749e-05, 0.000255613236057, -0.0002... 42.567868 0.400899 9.959386e-34 6.585288 60
58 Children, Lives in a household,\n Other ... 42 12.249633 [0.000166018211674, 3.23169384107e-05, -0.0001... 42.600415 0.400441 1.123590e-33 6.587805 106
110 Lives in a household, Other ethnic group: Any ... 45 12.216589 [7.50839787915e-05, -0.000104652644533, -0.000... 42.900103 0.396223 4.374151e-34 6.600631 42
89 Lives in a household,\n Black/African/Ca... 47 12.311002 [7.17600783268e-05, 0.000144299368976, -0.0001... 42.644695 0.399817 1.323724e-33 6.591228 58
72 Children, Other ethnic group: Any other ethnic... 49 12.054269 [0.000271628824657, -0.000190222203445, -8.934... 42.925767 0.395862 4.810512e-34 6.602605 76
54 Children, Lives in a household,\n Other ... 59 11.962742 [0.000213850450105, 1.66035971743e-05, -0.0001... 42.880405 0.396500 3.158443e-33 6.609418 108
106 Lives in a household, Other ethnic group: Any ... 65 11.839338 [6.97969111538e-05, -0.000166033551719, -0.000... 43.418297 0.388930 2.950336e-33 6.640376 44
37 Children, Lives in a household,\n Black/... 76 12.191647 [0.00027917746284, -2.52097757868e-05, 0.00018... 43.483164 0.388017 2.854744e-32 6.655710 120
24 Children, Black/African/Caribbean/Black Britis... 76 12.016612 [0.000168257551355, 0.000206669920988, -0.0001... 43.647707 0.385701 6.817659e-33 6.657896 88
52 Children, Lives in a household,\n Other ... 80 12.053042 [0.00031376412078, -3.26964330241e-05, -0.0001... 43.917326 0.381906 1.814100e-32 6.678428 104
77 Children, Other ethnic group: Any other ethnic... 81 11.797244 [0.000170152464587, -0.000154788494646, 0.2566... 44.203682 0.377876 5.779649e-33 6.689753 72
96 Lives in a household,\n Black/African/Ca... 96 13.166749 [7.25656477299e-05, -8.16610493146e-05, -0.000... 44.430515 0.374684 1.148944e-31 6.717334 50
83 Lives in a household,\n Black/African/Ca... 100 12.086906 [3.5499977055e-05, 0.000238627906776, -0.00017... 44.480493 0.373980 1.373600e-31 6.721111 56
95 Lives in a household,\n Black/African/Ca... 101 13.182341 [7.35716351649e-05, -8.42174592523e-05, -7.765... 44.424090 0.374774 8.334839e-31 6.727335 54
112 Lives in a household, Religion not stated, Den... 103 13.416074 [6.98258704218e-05, -0.000725785369186, 0.1921... 44.556491 0.372911 2.065068e-32 6.716397 34
50 Children, Lives in a household,\n Black/... 105 13.146554 [-1.92978146434e-05, 7.74802254277e-05, -7.529... 44.425756 0.374751 8.384232e-31 6.727462 114
65 Children, Lives in a household, Religion not s... 106 13.311525 [-5.09215414146e-05, 8.33573648547e-05, -0.000... 44.518034 0.373452 1.570585e-31 6.723947 98
49 Children, Lives in a household,\n Black/... 107 13.157648 [-3.05321298551e-05, 8.18155402933e-05, -7.534... 44.413571 0.374922 5.260139e-30 6.737074 118
102 Lives in a household, No religion: Total, Reli... 110 13.424939 [7.01745099356e-05, -2.94223009392e-06, -0.000... 44.555551 0.372924 1.795453e-31 6.726780 38
... ... ... ... ... ... ... ... ... ...
6 Black/African/Caribbean/Black British: African... 384 15.251536 [0.000347935919091, 1.14490785569e-06, 0.00025... 55.376910 0.220624 2.538852e-17 7.487645 25
51 Children, Lives in a household,\n Other ... 390 14.477278 [-4.91367693477e-05, 3.21142980354e-05, 7.9170... 55.394083 0.220382 2.667388e-17 7.488806 103
4 Black/African/Caribbean/Black British: African... 391 15.237534 [0.000335660561604, 9.70867261091e-06, 1.61948... 55.350430 0.220997 1.425594e-16 7.497505 29
70 Children, Other ethnic group: Any other ethnic... 395 14.886391 [8.77045302705e-05, 7.68908408414e-05, 0.0] 55.700358 0.216072 8.426044e-18 7.497846 71
2 Black/African/Caribbean/Black British: African... 399 15.460045 [0.000299757890396, 3.93748817987e-05, 5.53587... 55.538877 0.218344 4.042516e-17 7.498587 27
9 Black/African/Caribbean/Black British: African... 401 15.513349 [0.000418569278657, 6.21249007559e-05, 0.0] 55.825988 0.214304 1.212383e-17 7.506297 19
75 Children, Other ethnic group: Any other ethnic... 406 15.064775 [0.000108414307881, 8.31437513155e-05, -8.6285... 55.630706 0.217052 5.259115e-17 7.504783 73
71 Children, Other ethnic group: Any other ethnic... 412 14.940154 [9.65821945981e-05, 7.57150814383e-05, -8.5048... 55.687744 0.216249 6.191331e-17 7.508630 75
73 Children, Other ethnic group: Any other ethnic... 414 15.062016 [0.000105267936714, 8.63280059678e-05, 9.11812... 55.622801 0.217163 3.081554e-16 7.515930 77
119 Other ethnic group: Any other ethnic group, No... 425 15.239006 [0.000103637146942, 5.55399591213e-05, 0.0] 56.721098 0.201706 1.582403e-16 7.566236 11
121 Other ethnic group: Any other ethnic group, No... 432 15.182217 [9.86263056252e-05, 4.66122823978e-05, 4.30797... 56.710413 0.201856 1.121159e-15 7.577261 13
123 Other ethnic group: Any other ethnic group, Re... 434 15.219239 [8.22096629359e-05, 0.00022495940931, 0.0] 56.945957 0.198541 2.997900e-16 7.581218 9
99 Lives in a household, No religion: Total 442 14.192914 [5.18507419363e-05, -7.96776992493e-05, 0.0] 57.046757 0.197123 3.988953e-16 7.587925 35
101 Lives in a household, No religion: Total, Reli... 442 14.158646 [4.36027426333e-05, -9.1087968621e-05, 0.00015... 56.871665 0.199587 1.761243e-15 7.588026 37
60 Children, Lives in a household, No religion: T... 449 14.069930 [-7.24846612718e-05, 7.12407417644e-05, -9.093... 56.976599 0.198110 2.361281e-15 7.595024 99
61 Children, Lives in a household, No religion: T... 450 14.073009 [-5.2258707226e-05, 5.83115698384e-05, -9.8192... 56.836567 0.200081 9.097169e-15 7.597491 101
13 Black/African/Caribbean/Black British: African... 450 17.178786 [0.000301198511642, 6.84743478572e-05, 0.0] 57.274927 0.193911 7.600338e-16 7.603084 23
81 Lives in a household 463 13.941445 [3.33541241571e-05, 0.0] 57.856942 0.185720 3.561233e-16 7.629815 31
68 Children, No religion: Total, Religion not stated 463 14.419911 [0.000128645395058, -6.3011500255e-05, 0.00024... 57.318238 0.193302 6.109160e-15 7.617760 69
34 Children, Lives in a household 465 14.022751 [3.71815827915e-05, 2.47480441349e-05, 0.0] 57.834228 0.186040 3.651129e-15 7.640117 95
16 Children 469 14.350270 [0.000141407947933, 0.0] 58.017322 0.183463 5.609574e-16 7.640383 63
111 Lives in a household, Religion not stated 470 13.919790 [2.99965724307e-05, 4.7804524743e-05, 0.0] 57.838741 0.185976 3.697434e-15 7.640415 33
66 Children, No religion: Total 474 14.576551 [0.000170830084733, -3.10597758658e-05, 0.0] 57.841870 0.185932 3.729877e-15 7.640622 67
64 Children, Lives in a household, Religion not s... 475 14.022413 [5.14340857732e-05, 1.65644028967e-05, 6.95485... 57.799042 0.186535 2.304426e-14 7.649643 97
78 Children, Religion not stated 478 14.161881 [0.00010635789973, 0.000120600276727, 0.0] 57.851320 0.185799 3.829594e-15 7.641246 65
0 Black/African/Caribbean/Black British: African 478 17.682280 [0.00052866879711, 0.0] 58.229440 0.180477 1.021267e-15 7.654337 15
117 Other ethnic group: Any other ethnic group 484 16.962333 [0.000133142164971, 0.0] 58.468558 0.177112 2.001790e-15 7.670038 7
125 Religion not stated 497 14.323400 [0.000423725525574, 0.0] 59.117419 0.167980 1.227469e-14 7.712480 1
115 No religion: Total, Religion not stated 498 14.456164 [-2.86616298637e-05, 0.000511155879179, 0.0] 58.995621 0.169694 9.056339e-14 7.716448 5
113 No religion: Total 504 14.983287 [0.00010650816686, 0.0] 62.353850 0.122430 8.033960e-11 7.920780 3

127 rows × 9 columns

The model with the lowest 'Sum of Rank' appears to be reasonable. The Coef(sk) column shows a list of values; each one corresponds, in order, to one of the X-Variables (with the redundant constant column's 0.0 at the end).

In [54]:
a=final_model_and_data.loc[[87]]
a
Out[54]:
X-Variables Sum of Rank Intercept(sk) Coef(sk) RSS(sm) R-Squared(sm) P-Value(sm) Std Err(sm) ID
87 Lives in a household,\n Black/African/Ca... 7 12.210549 [8.30532615656e-05, 0.000188751423961, -0.0001... 41.987483 0.409067 8.023450e-34 6.550484 62
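To read the coefficient list more easily, the names and values can be zipped together (a small convenience sketch; the stored X-Variables string contains line breaks from the Index repr, hence the strip):

#PAIRS EACH COEFFICIENT WITH ITS VARIABLE NAME - THE FINAL 0.0 BELONGS TO THE CONSTANT COLUMN
best = final_model_and_data.loc[87]
names = [n.strip() for n in best['X-Variables'].split(',')] + ['const']
for name, coef in zip(names, best['Coef(sk)']):
    print(name, coef)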

This is the full regression summary of the best-performing model (by my crude metric!). Of all the variables included, population density appears to have the biggest effect on the MEAN DEP SCORE of an area, with a coefficient of 0.2641 - each unit increase in density adds roughly 0.26 to the mean score (bearing in mind that Density is on a much smaller scale than the raw population counts, which makes its coefficient look larger). The R^2 of the model itself is quite low, but I can accept that as the data is quite messy!

In [55]:
summaries[62]['Summary']
Out[55]:
OLS Regression Results
Dep. Variable: MEAN DEP SCORE R-squared: 0.409
Model: OLS Adj. R-squared: 0.398
Method: Least Squares F-statistic: 36.80
Date: Tue, 26 Jan 2016 Prob (F-statistic): 8.02e-34
Time: 21:44:19 Log-Likelihood: -1071.8
No. Observations: 326 AIC: 2158.
Df Residuals: 319 BIC: 2184.
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Lives in a household 8.305e-05 1.19e-05 6.975 0.000 5.96e-05 0.000
Black/African/Caribbean/Black British: African 0.0002 0.000 1.789 0.074 -1.88e-05 0.000
Other ethnic group: Any other ethnic group -0.0002 4.16e-05 -4.303 0.000 -0.000 -9.71e-05
No religion: Total -8.921e-05 3.99e-05 -2.235 0.026 -0.000 -1.07e-05
Religion not stated -0.0004 0.000 -2.100 0.037 -0.001 -2.4e-05
Density 0.2641 0.028 9.453 0.000 0.209 0.319
const 12.2105 0.736 16.594 0.000 10.763 13.658
Omnibus: 15.877 Durbin-Watson: 1.560
Prob(Omnibus): 0.000 Jarque-Bera (JB): 17.394
Skew: 0.561 Prob(JB): 0.000167
Kurtosis: 2.854 Cond. No. 4.08e+05

Thank you for reading! If you have any questions, comments or recommendations, please don't hesitate to say so. If you would like to continue or expand upon the analysis, you can fork it at my GitHub account.