Data Analysis and a bit on Democracy

By Samuel Muiruri | Dec. 20, 2018 | Python Scripts

I’ve been using pandas to analyze csv, xls “excel” data as a dataframe, do some manipulation to it and even plot it as a graph to better understand the data through visualization. I’ve got one with history of some information per country such as the Democratic Index, Perception on Corruption and so on. I’m going to show some example on how I got the result but will not deal with explaining how pandas works, there’s plenty of well written tutorials on this and pandas docs is also easy to understand.

Reading an excel file with pandas and displaying the first five rows, trimmed because it has a lot of rows.

df9 = pd.read_excel('data/happiness index.xls')

First thing to note is there’s a column for year so this spans multiple years and if you wanted to analyze a country you’ll likely find data from it that spans a numbet of years, so a simple per country graph wouldn’t be as effective, but groupby will allow the data to be sorted by the year and also plot the data.

df9.groupby('year')['Life Ladder'].agg(['mean','max','min']).plot(kind='bar')

Grouping by year, getting the column Life Ladder and getting the aggregrate “agg” mean, max and min is a quick way to view in general in all the countries what the trend is. Life Ladder likely in reference how hard it’s like to make it in life, in interest is the ranked hardest “min” and easiest “max” differ by how much. The values from 0 to 8 might be in reference to how easy it might be make it from a scale of 0 to 10 can’t be sure though since I collected some datasets to analyze, some from querying wikipedia with pandas and saving the tables as csv or xls file. This file however I got it from a site I didn’t mark from, that said the validity of the data might be questioned but the intent of this article and my analysis was to get familiar with data manipulation and analysis a key step before Machine Learning since you can’t simply throw data at a model and expect it to make a smart analysis and if it fails keep adding more layers expecting better results.

Moving on there’s another for GDP per capita

df9.groupby('year')['Log GDP per capita'].agg(['mean','max','min']).plot(kind='bar')

The minimum seems to be steady, this for example could go hand in hand with the understanding that life in general is getting better worldwide.

Then here’s Social Support

df9.groupby('year')['Social support'].agg(['mean','max','min']).plot(kind='bar')

There’s a high amount of deviation in the minimum, this should make you curious. Maybe as throughout the years the author was able to collect data from more countries so that causes the fluctuation.

Getting the shape of the dataframe gives you back the number of rows and columns respectively.

>>> (1562, 19)

The code below gets the sum of null values per column, Life Ladder has 0 null values, GDP has 27 and Social support has 13, likely not the cause

country                                                       0
year                                                          0
Life Ladder                                                   0
Log GDP per capita                                           27
Social support                                               13
Healthy life expectancy at birth                              9
Freedom to make life choices                                 29
Generosity                                                   80
Perceptions of corruption                                    90
Positive affect                                              18
Negative affect                                              12
Confidence in national government                           161
Democratic Quality                                          171
Delivery Quality                                            171
Standard deviation of ladder by country-year                  0
Standard deviation/Mean of ladder by country-year             0
GINI index (World Bank estimate)                            979
GINI index (World Bank estimate), average 2000-15           176
gini of household income reported in Gallup, by wp5-year    357
dtype: int64

I can dive in deeper to know those with nulls for more insight, I also sort by year to make it easier to confirm any suspicions

df9[df9['Social support'].isnull()].sort_values(by=['year'])

Then I check for number of countries per year by counties the number of values for the year column

2005     27
2006     89
2007    102
2008    110
2009    114
2010    124
2011    146
2012    142
2013    137
2014    145
2015    143
2016    142
2017    141
Name: year, dtype: int64

Comparing it with the graph again below, 2011 had the most countries, there’s some fluctuation but I wouldn’t put it on new countries, by 2010 there was already most countries on the list. I could drill down further into the years countries that deviate but at this point I feel like I’ve already understood mostly what goes on in the graph.

Then there’s life expectancy

df9.groupby('year')['Healthy life expectancy at birth'].agg(['mean','max','min']).plot(kind='bar')

The minimum is steadily increasing as is the maximum life expectancy, in line with what you’d expect.

Then there’s Freedom to make life choices

df9.groupby('year')['Freedom to make life choices'].agg(['mean','max','min']).plot(kind='bar')

There’s some spiking on the minimum, to be expected since certain events in some countries could lead to some of their freedom being curtailed, internal conflicts for example.

Then there’s generosity


Which I found interesting, there’s a negative index, so some of these countries are negative compared with positive ones.

I can get the positive countries

df9[df9['Generosity'] > 0]['country'].unique()
array(['Afghanistan', 'Angola', 'Australia', 'Austria', 'Bahrain',
       'Bangladesh', 'Belgium', 'Belize', 'Benin', 'Bhutan', 'Bolivia',
       'Bosnia and Herzegovina', 'Burkina Faso', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'Comoros',
       'Congo (Kinshasa)', 'Costa Rica', 'Croatia', 'Cyprus', 'Denmark',
       'Djibouti', 'Dominican Republic', 'Ethiopia', 'Finland', 'France',
       'Germany', 'Ghana', 'Guatemala', 'Guinea', 'Guyana', 'Haiti',
       'Honduras', 'Hong Kong S.A.R. of China', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Ireland', 'Israel', 'Italy', 'Kenya',
       'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Lebanon', 'Liberia',
       'Luxembourg', 'Macedonia', 'Malawi', 'Malaysia', 'Mali', 'Malta',
       'Mauritania', 'Mauritius', 'Moldova', 'Mongolia', 'Montenegro',
       'Mozambique', 'Myanmar', 'Nepal', 'Netherlands', 'New Zealand',
       'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Pakistan', 'Panama',
       'Paraguay', 'Philippines', 'Poland', 'Qatar', 'Rwanda', 'Senegal',
       'Serbia', 'Sierra Leone', 'Singapore', 'Slovenia', 'South Korea',
       'South Sudan', 'Sri Lanka', 'Sudan', 'Sweden', 'Switzerland',
       'Syria', 'Tajikistan', 'Tanzania', 'Thailand',
       'Trinidad and Tobago', 'Turkmenistan', 'Uganda', 'Ukraine',
       'United Arab Emirates', 'United Kingdom', 'United States',
       'Uruguay', 'Uzbekistan', 'Vietnam', 'Zambia'], dtype=object)

And negatively indexed countries

df9[df9['Generosity'] < 0]['country'].unique()
array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Armenia', 'Azerbaijan', 'Bahrain', 'Bangladesh', 'Belarus',
       'Belgium', 'Benin', 'Bolivia', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'Bulgaria', 'Burkina Faso', 'Burundi',
       'Cameroon', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo (Brazzaville)', 'Congo (Kinshasa)',
       'Costa Rica', 'Croatia', 'Cyprus', 'Czech Republic', 'Djibouti',
       'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Estonia',
       'Ethiopia', 'Finland', 'France', 'Gabon', 'Georgia', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Honduras', 'Hungary', 'India',
       'Iraq', 'Italy', 'Ivory Coast', 'Jamaica', 'Japan', 'Jordan',
       'Kazakhstan', 'Kenya', 'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Latvia',
       'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Lithuania',
       'Luxembourg', 'Macedonia', 'Madagascar', 'Malaysia', 'Mali',
       'Mauritania', 'Mexico', 'Moldova', 'Montenegro', 'Morocco',
       'Mozambique', 'Namibia', 'Nicaragua', 'Niger', 'Nigeria', 'Norway',
       'Oman', 'Palestinian Territories', 'Panama', 'Paraguay', 'Peru',
       'Philippines', 'Poland', 'Portugal', 'Qatar', 'Romania', 'Russia',
       'Rwanda', 'Saudi Arabia', 'Senegal', 'Serbia', 'Sierra Leone',
       'Singapore', 'Slovakia', 'Slovenia', 'South Africa', 'South Korea',
       'Spain', 'Sudan', 'Suriname', 'Swaziland',
       'Taiwan Province of China', 'Tajikistan', 'Tanzania', 'Togo',
       'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan',
       'Uganda', 'Ukraine', 'United Arab Emirates', 'Uruguay',
       'Uzbekistan', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia',
       'Zimbabwe'], dtype=object)

And compare generous and not generous, first always generous countries

generous = set(df9[df9['Generosity'] > 0]['country'].unique().tolist())
not_generous = set(df9[df9['Generosity'] < 0]['country'].unique().tolist())
#always generous
generous - not_generous
 'Hong Kong S.A.R. of China',
 'New Zealand',
 'South Sudan',
 'Sri Lanka',
 'United Kingdom',
 'United States'}

And never generous countries

#never generous
not_generous - generous
 'Congo (Brazzaville)',
 'Czech Republic',
 'El Salvador',
 'Ivory Coast',
 'Palestinian Territories',
 'Saudi Arabia',
 'South Africa',
 'Taiwan Province of China',

And those that have been on both sides

#been on both sides
generous & not_generous
 'Bosnia and Herzegovina',
 'Burkina Faso',
 'Central African Republic',
 'Congo (Kinshasa)',
 'Costa Rica',
 'Dominican Republic',
 'Sierra Leone',
 'South Korea',
 'Trinidad and Tobago',
 'United Arab Emirates',

You can also plot the negative index only to get the variation of these alone

df9[df9['Generosity'] < 0].groupby('year')['Generosity'].agg(['mean','max','min']).plot(kind='bar')

And positive index countries

df9[df9['Generosity'] > 0].groupby('year')['Generosity'].agg(['mean','max','min']).plot(kind='bar')

The next column is corruption

df9.groupby('year')['Perceptions of corruption'].agg(['mean','max','min']).plot(kind='bar')

hard to make head or tails with this, there’s some dip maybe where the markets crashed in US “I’m guessting” and 2016 seems to have another low of all lows, the mean change is a bit interesting.

Note that this column though is perception of corruption so it’s likely how the general public views how corrupt the government and institutions are. Filtering it down by country starting with Kenya first.

df9[df9['country'] == 'Kenya'].groupby('year')['Perceptions of corruption'].agg(['max']).plot(kind='bar')

Really high and it stays high generally, not that suprising.

Next is Uganda

df9[df9['country'] == 'Uganda'].groupby('year')['Perceptions of corruption'].agg(['max']).plot(kind='bar')

Also as high and with little variation, at this point the more countries we poll the more we can consider either the data is skewed, incorrect or people just generally view their countries as corrupt.

Here’s Tanzania, another high scorer with similar stats

df9[df9['country'] == 'Tanzania'].groupby('year')['Perceptions of corruption'].agg(['max']).plot(kind='bar')

Next one is a country known in the region to be least corrupt, Rwanda

df9[df9['country'] == 'Rwanda'].groupby('year')['Perceptions of corruption'].agg(['max']).plot(kind='bar')

With low scores from the get go and a significant drop still later on.

Libya on the other hand unfortunately has too many NAN’s so the graph only has two entries

df9[df9['country'] == 'Libya'].groupby('year')['Perceptions of corruption'].agg(['max']).plot(kind='bar')

Then there’s US with also high figures but it doesn’t go up to 0.8 like the others in the region

df9[df9['country'] == 'United States'].groupby('year')['Perceptions of corruption'].agg(['max']).plot(kind='bar')

I’ve also done some comparison of activity per country but I’ll concentrate with Kenya starting with GDP with a constant increase

df9[df9['country'] == 'Kenya'].groupby('year')['Log GDP per capita'].agg(['max']).plot(kind='bar')

Social support which seems to be on the decline

df9[df9['country'] == 'Kenya'].groupby('year')['Social support'].agg(['max']).plot(kind='bar')

Life Ladder which increases with some dips in between

df9[df9['country'] == 'Kenya'].groupby('year')['Life Ladder'].agg(['max']).plot(kind='bar')

Life expectancy on the increase almost hiting the max plateau of 60

df9[df9['country'] == 'Kenya'].groupby('year')['Healthy life expectancy at birth'].agg(['max']).plot(kind='bar')

Freedom to make life choices with some spikes but generally on the increase

df9[df9['country'] == 'Kenya'].groupby('year')['Freedom to make life choices'].agg(['max']).plot(kind='bar')

And finally generosity with the first row 2006 seems to be on the negative but since then on the increase

df9[df9['country'] == 'Kenya'].groupby('year')['Generosity'].agg(['max']).plot(kind='bar')

Then there’s Uganda with a clear negative and positive index

df9[df9['country'] == 'Uganda'].groupby('year')['Generosity'].agg(['max']).plot(kind='bar')

Tanzania with the same but less instances on the negative index

df9[df9['country'] == 'Tanzania'].groupby('year')['Generosity'].agg(['max']).plot(kind='bar')

Libya with fewer data but always on the negative index

And South Africa with a clear stronger negative index even more than Libya

df9[df9['country'] == 'South Africa'].groupby('year')['Generosity'].agg(['max']).plot(kind='bar')

US with a clear top index with some spikes but there’s a large decline in 2016

df9[df9['country'] == 'United States'].groupby('year')['Generosity'].agg(['max']).plot(kind='bar')

Then there’s Canada with a steady decrease from 2013 and no data for 2017

df9[df9['country'] == 'Canada'].groupby('year')['Generosity'].agg(['max']).plot(kind='bar')

And finally Saudi Arabia with a constant negative index that seemed to be decreasing to the positive then went back to the high negatives

df9[df9['country'] == 'Saudi Arabia'].groupby('year')['Generosity'].agg(['max']).plot(kind='bar')