The United States disburses the most foreign aid in the world (World Economic Forum) and it is of paramount importance that the countries who are in greatest need are able to get it. We examine US foreign aid disbursements to six countires of middle to low economic rank to determine the effect of monetary aid increases on five health indicators representative of overall health outcomes. The countries of Ethiopia, El Salvador, Ecuador, Haiti, Moldova, and Yemen represent a sample of countries exhibiting some of the greatest overall need for humanitarian aid. Geopolitical and environmental factors have rendered Ethiopia, Haiti, and Yemen dependent on foreign aid for humanitarian support. It is anecdotally acknowledged that overall increases in foreign aid disbursements would have a positive increase in overall welfare. We aim to prove this and determine which US Agencies have the greatest direct impact on improving health outcomes.
Our analysis will explore and answer the following questions:
For this project, we utilized the python libraries pandas, maplotlib, numpy, and more. The first few code cells will be installing seaborn, a plotting library, and importing other necessary libraries.
%pip install seaborn
Requirement already satisfied: seaborn in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (0.12.1) Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from seaborn) (3.5.2) Requirement already satisfied: numpy>=1.17 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from seaborn) (1.23.4) Requirement already satisfied: pandas>=0.25 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from seaborn) (1.5.1) Requirement already satisfied: pillow>=6.2.0 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (9.2.0) Requirement already satisfied: pyparsing>=2.2.1 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (3.0.9) Requirement already satisfied: fonttools>=4.22.0 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (4.25.0) Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.4.2) Requirement already satisfied: python-dateutil>=2.7 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (2.8.2) Requirement already satisfied: packaging>=20.0 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (21.3) Requirement already satisfied: cycler>=0.10 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (0.11.0) Requirement already satisfied: pytz>=2020.1 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from pandas>=0.25->seaborn) (2022.1) Requirement already satisfied: six>=1.5 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.1->seaborn) (1.16.0) Note: you may need to restart the kernel to use updated packages.
WARNING: Error parsing requirements for pywinpty: [Errno 2] No such file or directory: 'c:\\users\\perso\\anaconda3\\envs\\landlab_dev\\lib\\site-packages\\pywinpty-2.0.6.dist-info\\METADATA'
## IMPORTS ##
# Load Numpy
import numpy as np
# Load os
import os
# Load Pandas
import pandas as pd
# Load MatPlotLib
import matplotlib
import matplotlib.pyplot as plt
# Load Seaborn
import seaborn as sns; sns.set_theme()
# These two things are for Pandas, it widens the notebook and lets us display data easily.
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
Let's change the directory to the folder containing the foreign aid data.
os.chdir("Foreign-Aid-Country-Data")
We obtained data from two primary sources: US Gov Foreign Assistance and WHO GW13 Dataset. These two datasets contain the publicly avaliable US foreign aid disbursements per country and health indicators by country over the past 20 years, respectively.
Our first dataset comes from the US Foreign Assistance webpage. With this dataset, we will be focusing on quantifying the amount of foreign aid the United States gives to the six countries in question, per year, and which agencies provide this support.
Analysis of this dataset will answer the following questions:
Figure 1: The image above is an example query of the foreign aid given to Ethiopia from 2018 to 2022
For each country, we created a query, similar to that of figure 1, obtaining the non-military foreign aid given from 2002 to 2022. Since we are examining how foreign aid impacts health, we assume that military foreign aid has no impact on the health outcomes of the country and will be removed from the dataframes. Each of these datasets can be found in the "Foreign-Aid-Country-Data" folder.
Here, we are reading in the country data from our queries on ForeignAssistance.gov. Each of the queries encompass non-military foreign aid from 2002 to 2022 fiscal years.
We chose countries in low to middle income groups with one to two per continent/region and low military assistance.
Let's read in our data into Pandas DataFrames.
Ethiopia_foreign_aid_data = pd.read_csv(os.path.join("Africa","Ethiopia_2022-2002_Econ_Other.csv"))
Moldova_foreign_aid_data = pd.read_csv(os.path.join("Europe", "Moldova_2022-2002_Econ_Other.csv"))
Yemen_foreign_aid_data = pd.read_csv(os.path.join("Middle-East","Yemen_2022-2002_Econ_Other.csv"))
Ecuador_foreign_aid_data = pd.read_csv(os.path.join("South-America","Ecuador_2022-2002_Econ_Other.csv"))
El_Salvador_foreign_aid_data = pd.read_csv(os.path.join("North-America","El_Salvador_2022-2002_Econ_Other.csv"))
Haiti_foreign_aid_data = pd.read_csv(os.path.join("North-America","Haiti_2022-2002_Econ_Other.csv"))
countries = [
Ethiopia_foreign_aid_data,
Moldova_foreign_aid_data,
Yemen_foreign_aid_data,
Ecuador_foreign_aid_data,
El_Salvador_foreign_aid_data,
Haiti_foreign_aid_data
] # For ease of looping
Ethiopia_foreign_aid_data.head(3)
Country ID | Country Code | Country Name | Region ID | Region Name | Income Group ID | Income Group Name | Income Group Acronym | Managing Agency ID | Managing Agency Acronym | ... | Activity Name | Activity Description | Activity Project Number | Activity Start Date | Activity End Date | Transaction Type ID | Transaction Type Name | Fiscal Year | Current Dollar Amount | Constant Dollar Amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 231 | ETH | Ethiopia | 5 | Sub-Saharan Africa | 1 | Low Income Country | LIC | 1 | USAID | ... | Title II Emergency Program | Title II Emergency Program through a partner o... | 720BHA22CA00005 | NaN | NaN | 2 | Obligations | 2022 | 113052149 | 109155305 |
1 | 231 | ETH | Ethiopia | 5 | Sub-Saharan Africa | 1 | Low Income Country | LIC | 1 | USAID | ... | Emergency food assistance | Emergency humanitarian food and nutirition ass... | 720BHA22IO00038 | NaN | 2023-03-09 | 2 | Obligations | 2022 | 77740000 | 75060346 |
2 | 231 | ETH | Ethiopia | 5 | Sub-Saharan Africa | 1 | Low Income Country | LIC | 1 | USAID | ... | Title II Emergency Program | Title II Emergency Program through a partner o... | 720BHA22IO00034 | NaN | NaN | 2 | Obligations | 2022 | 57420571 | 55441315 |
3 rows × 51 columns
We will be focusing on the country-specific datasets from the ForeignAssistance.gov website. Each row in the summary describes a country and the fiscal year. Each entity contains data on the Transaction Type, Fiscal Year, and Constant Amount (in USD). We are planning on using this data from the summary as a reference to determine how much money is irrelevant to our study (non-military/defense spending).
As you can see, there are 51 columns in the DataFrame. Many of these columns are redundant or irrelevant to our analysis. Below, we will remove these columns, maintaining the interesting features of the dataset.
columns_to_drop = ['Transaction Type ID',
'Foreign Assistance Objective ID',
'Implementing Partner Category ID',
'Implementing Partner Category Name',
'Implementing Partner Sub-category ID',
'Implementing Partner Sub-category Name',
'Implementing Partner ID',
'Implementing Partner Name',
'International Category ID',
'International Category Name',
'International Sector Code',
'International Sector Name',
'International Purpose Code',
'Managing Agency ID',
'Managing Sub-agency or Bureau ID',
'Activity Project Number',
'Managing Sub-agency or Bureau Name',
'Aid Type Group ID',
'Funding Agency ID',
'Submission ID',
'Funding Acount ID',
'Activity ID',
'Income Group Acronym',
'Managing Sub-agency or Bureau Acronym',
'US Category ID',
'US Category Name',
]
for country in countries:
for col in columns_to_drop:
country.drop(labels= col, axis= 'columns', inplace= True)
Ethiopia_foreign_aid_data.columns
Index(['Country ID', 'Country Code', 'Country Name', 'Region ID', 'Region Name', 'Income Group ID', 'Income Group Name', 'Managing Agency Acronym', 'Managing Agency Name', 'International Purpose Name', 'US Sector ID', 'US Sector Name', 'Funding Account Name', 'Funding Agency Name', 'Funding Agency Acronym', 'Foreign Assistance Objective Name', 'Aid Type Group Name', 'Activity Name', 'Activity Description', 'Activity Start Date', 'Activity End Date', 'Transaction Type Name', 'Fiscal Year', 'Current Dollar Amount', 'Constant Dollar Amount'], dtype='object')
Next, we want to highlight the income of each of the countries to verify that the countries chosen are likely to receive a sizeable amount and diverse range of US foreign aid.
for country in countries:
display(country['Country Name'][0] + ': '+ country['Income Group Name'][0])
'Ethiopia: Low Income Country'
'Moldova: Lower Middle Income Country'
'Yemen: Low Income Country'
'Ecuador: Upper Middle Income Country'
'El Salvador: Lower Middle Income Country'
'Haiti: Low Income Country'
As we can see, most of these countries fall into the low / middle income countries. We will see that, with the exception of Yemen, all of these countries receive roughly the same amount of foreign aid.
Before we visuallize the aid amounts, we want to get rid of entries / aid projects that come from military sources since they would not have impacts on any health metrics we are interested in.
# Drop rows unrelated to economic/health development
for i, country in enumerate(countries):
country.drop(country[country["Managing Agency Name"] == 'Department of Defense'].index, inplace = True)
country.drop(country[country["Managing Agency Name"] == 'Department of Homeland Security'].index, inplace = True)
country.drop(country[country["Managing Agency Name"] == 'Department of Justice'].index, inplace = True)
country.drop(country[country["Managing Agency Name"] == 'Department of the Army'].index, inplace = True)
country.drop(country[country["Managing Agency Name"] == 'Department of the Interior'].index, inplace = True)
# Drop row if dollar amount is less than 0 as this is usually a redaction of funds
country.drop(country[country["Constant Dollar Amount"] < 0 ].index, inplace = True)
Now we are going to bring in our second datatset which comes from the GW13 WHO Dataset, this dataset contains health markers of interest, stratified by country and year. This dataset comes from a paper by the World Health Organization about the organization's five year strategy to achieve measureable heatlh outcomes.
The strategy includes a "triple billion" plan which sets to achieve
The dataset includes the health impacts that are measured. We did basic filtering on this dataset to only select data from our target countries, and begin filtering unnecessary health indicators. We plan on using this data to contextualize and predict the importance of various funding and their quantifible impacts on health outcomes in our countries of interest.
gw19_df = pd.read_excel(os.path.join("..","WHO_datasets", "GW19_2002_2020_filtered.xlsx"))
gw19_df
setting | year | indicator_abbr | indicator_name | dimension | subgroup | estimate | population | indicator_scale | |
---|---|---|---|---|---|---|---|---|---|
0 | Ecuador | 2002 | alcohol | Total alcohol per capita consumption in adults... | Sex | Female | 2.0 | NaN | 100 |
1 | Ecuador | 2002 | alcohol | Total alcohol per capita consumption in adults... | Sex | Male | 7.0 | NaN | 100 |
2 | Ecuador | 2002 | bp | Prevalence of hypertension among adults aged 3... | Sex | Female | 25.0 | NaN | 100 |
3 | Ecuador | 2002 | bp | Prevalence of hypertension among adults aged 3... | Sex | Male | 23.0 | NaN | 100 |
4 | Ecuador | 2002 | hh_san_sm | Population using safely managed sanitation ser... | Place of residence | Rural | 44.0 | 5110442.0 | 100 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5438 | Yemen | 2020 | tb_incidence | TB incidence (new infections per 100 000 popul... | Sex | Female | 44.0 | 14800983.0 | 100000 |
5439 | Yemen | 2020 | tb_incidence | TB incidence (new infections per 100 000 popul... | Sex | Male | 53.0 | 15024985.0 | 100000 |
5440 | Yemen | 2020 | u5mr | Under-five mortality rate (deaths per 1000 liv... | Sex | Female | 55.0 | NaN | 1000 |
5441 | Yemen | 2020 | u5mr | Under-five mortality rate (deaths per 1000 liv... | Sex | Male | 64.0 | NaN | 1000 |
5442 | Yemen | 2020 | life_ex | Life Expectancy | NaN | NaN | 67.9 | NaN | 100 |
5443 rows × 9 columns
# Here are the different indicators in this dataset.
display(gw19_df["indicator_name"].unique())
array(['Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)', 'Prevalence of hypertension among adults aged 30-79 years (age-standardized) (%)', 'Population using safely managed sanitation services (%)', 'Population using safely managed drinking water services (%)', 'Obesity prevalence among adults (%)', 'Obesity prevalence among children and adolescents (5-19) (%)', 'Population using at least basic sanitation services (%)', 'Suicide mortality rate (per 100 000 population)', 'Under-five mortality rate (deaths per 1000 live births)', 'Life Expectancy', 'Antenatal care coverage - at least four visits (in the five years preceding the survey) (%)', 'Overweight prevalence in children aged < 5 years (%)', 'Stunting prevalence in children aged < 5 years (%)', 'DTP3 immunization coverage among one-year-olds (%)', 'Measles immunization coverage among one-year-olds (%)', 'Polio immunization coverage among one-year-olds (%)', 'Healthy life expectancy (HALE) at birth (years)', 'Population with >10% household expenditures on health (%)', 'Population with >25% household expenditures on health (%)', 'Mortality rate attributed to unsafe water, unsafe sanitation and lack of hygiene (per 100 000 population)', 'TB incidence (new infections per 100 000 population)', 'Neonatal mortality rate (deaths per 1000 live births)'], dtype=object)
After filtering the data to get only the countries we wanted, we decided to reduce the indicators down to 5 as many of the other indicators had inconsistent data or none at all for the countries we selected. We decided to select indicators we determined to be very important to a country's health, which include, Life expectancy, population with access to clean drinking water, and the under-five mortality rate. We also included more generally related indicators: total alcohol consumption and obesity rate to see if these indicators have a significant effect as well.
gw_five = pd.read_excel(os.path.join("..", "WHO_datasets", "GW_five_combined_subgroups.xlsx"))
# add combined for merging another dataset by setting and year
gw_five["combined"] = gw_five["setting"].astype(str) + "_" + gw_five["year"].astype(str)
display(gw_five["indicator_name"].unique())
display(gw_five)
array(['Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)', 'Population using safely managed drinking water services (%)', 'Life Expectancy', 'Obesity prevalence among adults (%)', 'Under-five mortality rate'], dtype=object)
setting | year | indicator_abbr | indicator_name | Avg estimate | combined | |
---|---|---|---|---|---|---|
0 | Ecuador | 2002 | alcohol | Total alcohol per capita consumption in adults... | -4.50 | Ecuador_2002 |
1 | Ecuador | 2002 | hh_wat_sm | Population using safely managed drinking water... | 56.50 | Ecuador_2002 |
2 | Ecuador | 2002 | life_ex | Life Expectancy | 73.90 | Ecuador_2002 |
3 | Ecuador | 2002 | obesitya | Obesity prevalence among adults (%) | -13.00 | Ecuador_2002 |
4 | Ecuador | 2002 | u5mr | Under-five mortality rate | -2.65 | Ecuador_2002 |
... | ... | ... | ... | ... | ... | ... |
544 | Yemen | 2019 | alcohol | Total alcohol per capita consumption in adults... | 0.00 | Yemen_2019 |
545 | Yemen | 2019 | life_ex | Life Expectancy | 67.80 | Yemen_2019 |
546 | Yemen | 2019 | u5mr | Under-five mortality rate | -6.10 | Yemen_2019 |
547 | Yemen | 2020 | life_ex | Life Expectancy | 67.90 | Yemen_2020 |
548 | Yemen | 2020 | u5mr | Under-five mortality rate | -5.95 | Yemen_2020 |
549 rows × 6 columns
You may notice that some of the Avg estimate values are negative. This is done to insure that all indicators all have the same directionality. We changed the directionality of indicators such that a desireable movement of an estimate will be in the positive direction and an undesireable movement will be in the negative direction. Since we want alcohol, obesity, and under five mortality to go down, their estimates are turned negative, so that a lower estimate of this estimate is in the positive direction.
Now that the health indicators are chosen, we will group the agencies funding by year and country. This will help create a dataframe of the funding amounts by agency and year.
# Create dataframe of the funding amounts in each country per year
masterdf = pd.DataFrame()
for i, country in enumerate(countries):
per_country_df = pd.DataFrame()
grouped_df = country.groupby(["Funding Agency Acronym","Fiscal Year", "Country Name"])
# Create a new column based on the grouped data
per_country_df["Constant Dollar Amount"] = grouped_df["Constant Dollar Amount"].sum()
per_country_df = pd.DataFrame(per_country_df)
masterdf = pd.concat([masterdf, per_country_df])
masterdf.reset_index(inplace=True)
# Create column for a key to merge on based on country and year.
masterdf["combined"] = masterdf["Country Name"].astype(str) + "_" + masterdf["Fiscal Year"].astype(str)
masterdf
Funding Agency Acronym | Fiscal Year | Country Name | Constant Dollar Amount | combined | |
---|---|---|---|---|---|
0 | ADF | 2011 | Ethiopia | 58070 | Ethiopia_2011 |
1 | ADF | 2012 | Ethiopia | 59155 | Ethiopia_2012 |
2 | ADF | 2013 | Ethiopia | 91947 | Ethiopia_2013 |
3 | ADF | 2014 | Ethiopia | 375920 | Ethiopia_2014 |
4 | ADF | 2015 | Ethiopia | 210735 | Ethiopia_2015 |
... | ... | ... | ... | ... | ... |
773 | USAID | 2018 | Haiti | 160983489 | Haiti_2018 |
774 | USAID | 2019 | Haiti | 137906915 | Haiti_2019 |
775 | USAID | 2020 | Haiti | 130047717 | Haiti_2020 |
776 | USAID | 2021 | Haiti | 174539218 | Haiti_2021 |
777 | USAID | 2022 | Haiti | 82660892 | Haiti_2022 |
778 rows × 5 columns
# merge health indicator data with agency funding amounts.
merged_df = gw_five.merge(masterdf,how="left", on="combined")
merged_df.drop(["combined", "Fiscal Year", "Country Name"], axis=1, inplace=True)
merged_df.to_csv(os.path.join("..", "WHO_datasets", "merged_data.csv"))
merged_df["Funding Agency Acronym"].unique()
array(['AGR', 'IAF', 'PC', 'STATE', 'TDA', 'USAID', 'DOL', 'TREAS', 'DOC', 'DOE', 'HHS', 'FTC', 'MCC', 'EPA', 'ADF', 'DFC', 'DOT', 'NSF', 'LIC', nan], dtype=object)
# Create country/year combinations to add all agencies and other missing indicators not in data
merged_df["country_year"] = merged_df["setting"].astype(str) + "_" + merged_df["year"].astype(str)
merged_df["country_year_indicator"] = merged_df["setting"].astype(str) + "_" + merged_df["year"].astype(str) + "_" + merged_df["indicator_abbr"].astype(str)
merged_df["country_year_indicator_agency"] = merged_df["setting"].astype(str) + "_" + merged_df["year"].astype(str) + "_" + merged_df["indicator_abbr"].astype(str) + "_" + merged_df["Funding Agency Acronym"]
merged_df
setting | year | indicator_abbr | indicator_name | Avg estimate | Funding Agency Acronym | Constant Dollar Amount | country_year | country_year_indicator | country_year_indicator_agency | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Ecuador | 2002 | alcohol | Total alcohol per capita consumption in adults... | -4.50 | AGR | 9241684.0 | Ecuador_2002 | Ecuador_2002_alcohol | Ecuador_2002_alcohol_AGR |
1 | Ecuador | 2002 | alcohol | Total alcohol per capita consumption in adults... | -4.50 | IAF | 1948562.0 | Ecuador_2002 | Ecuador_2002_alcohol | Ecuador_2002_alcohol_IAF |
2 | Ecuador | 2002 | alcohol | Total alcohol per capita consumption in adults... | -4.50 | PC | 3948070.0 | Ecuador_2002 | Ecuador_2002_alcohol | Ecuador_2002_alcohol_PC |
3 | Ecuador | 2002 | alcohol | Total alcohol per capita consumption in adults... | -4.50 | STATE | 35779088.0 | Ecuador_2002 | Ecuador_2002_alcohol | Ecuador_2002_alcohol_STATE |
4 | Ecuador | 2002 | alcohol | Total alcohol per capita consumption in adults... | -4.50 | TDA | 1638316.0 | Ecuador_2002 | Ecuador_2002_alcohol | Ecuador_2002_alcohol_TDA |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3116 | Yemen | 2020 | life_ex | Life Expectancy | 67.90 | STATE | 82848136.0 | Yemen_2020 | Yemen_2020_life_ex | Yemen_2020_life_ex_STATE |
3117 | Yemen | 2020 | life_ex | Life Expectancy | 67.90 | USAID | 248473546.0 | Yemen_2020 | Yemen_2020_life_ex | Yemen_2020_life_ex_USAID |
3118 | Yemen | 2020 | u5mr | Under-five mortality rate | -5.95 | AGR | 224994779.0 | Yemen_2020 | Yemen_2020_u5mr | Yemen_2020_u5mr_AGR |
3119 | Yemen | 2020 | u5mr | Under-five mortality rate | -5.95 | STATE | 82848136.0 | Yemen_2020 | Yemen_2020_u5mr | Yemen_2020_u5mr_STATE |
3120 | Yemen | 2020 | u5mr | Under-five mortality rate | -5.95 | USAID | 248473546.0 | Yemen_2020 | Yemen_2020_u5mr | Yemen_2020_u5mr_USAID |
3121 rows × 10 columns
Since some agencies had $0 in funding some years, these records did not exist in our dataset, but we want them in our data, so we will add records of every possible combination.
# Make set of all combination of possible rows containing a country's name, the year, indicator, and agency.
all_possible_agencies = ['AGR', 'IAF', 'PC', 'STATE', 'TDA', 'USAID', 'DOL', 'TREAS', 'DOC',
'DOE', 'HHS', 'FTC', 'MCC', 'EPA', 'ADF', 'DFC', 'DOT', 'NSF','LIC']
all_possible_indicators = ["alcohol", "hh", "life", "obesitya", "u5mr"]
all_combos = set()
for index, row in gw_five.iterrows():
# from gw_five combined is the country and year
value = row['combined']
indic = row['indicator_abbr']
for agency in all_possible_agencies:
for indic in all_possible_indicators:
all_combos.add(value + "_" + indic + "_" + agency)
all_combos = sorted(all_combos)
all_combos_df = pd.DataFrame(all_combos, columns=["cyia_string"])
all_combos_df
cyia_string | |
---|---|
0 | Ecuador_2002_alcohol_ADF |
1 | Ecuador_2002_alcohol_AGR |
2 | Ecuador_2002_alcohol_DFC |
3 | Ecuador_2002_alcohol_DOC |
4 | Ecuador_2002_alcohol_DOE |
... | ... |
12630 | Yemen_2020_u5mr_PC |
12631 | Yemen_2020_u5mr_STATE |
12632 | Yemen_2020_u5mr_TDA |
12633 | Yemen_2020_u5mr_TREAS |
12634 | Yemen_2020_u5mr_USAID |
12635 rows × 1 columns
# Temp dataframe used to add avg estimate column to merged_df later.
setting_indicator_estimate = pd.DataFrame(gw_five, columns=["setting", "year", "indicator_abbr", "Avg estimate"])
setting_indicator_estimate
setting | year | indicator_abbr | Avg estimate | |
---|---|---|---|---|
0 | Ecuador | 2002 | alcohol | -4.50 |
1 | Ecuador | 2002 | hh_wat_sm | 56.50 |
2 | Ecuador | 2002 | life_ex | 73.90 |
3 | Ecuador | 2002 | obesitya | -13.00 |
4 | Ecuador | 2002 | u5mr | -2.65 |
... | ... | ... | ... | ... |
544 | Yemen | 2019 | alcohol | 0.00 |
545 | Yemen | 2019 | life_ex | 67.80 |
546 | Yemen | 2019 | u5mr | -6.10 |
547 | Yemen | 2020 | life_ex | 67.90 |
548 | Yemen | 2020 | u5mr | -5.95 |
549 rows × 4 columns
# Used to generate indicator descriptions when only given abbreviations
dict_for_abbrs = {"alcohol": 'Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)', 'life': 'Life Expectancy',
'obesitya': 'Obesity prevalence among adults (%)', 'hh':'Population using safely managed drinking water services (%)',
'u5mr':'Under-five mortality rate'}
The cell below adds the average estimate of an indicator for a year to all the agencies which will help with our model. It also allows us to shift the data down a year easily, such that cells with 2003 data constant dollars now have 2002 constant dollars as our model wants to use the money from previous years funding to predict health indicators. Having current estimates with older funding it what we want to use for our model.
# Takes while to run, so we already ran it and have the finished dataset saved in merged_df.csv
# Change run_function to True to run this
import warnings
# Suppress warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
run_function = False
if(run_function):
merged_df = merged_df.sort_values('country_year_indicator_agency', ascending=True)
# Fill in dataframe using all possible setting, agency, year, indicators from all_combos_df
for index, row in all_combos_df.iterrows():
value = row["cyia_string"]
row_for_estimate = []
if (merged_df["country_year_indicator_agency"] != value).all():
# split long combo strings into their separate parts
country_year_indicator_agency_list = value.split('_')
value_name = country_year_indicator_agency_list[0]
value_year = int(country_year_indicator_agency_list[1])
value_indicator = country_year_indicator_agency_list[2]
value_agency = country_year_indicator_agency_list[3]
# get indicator name from abbr dict
full_name_ind = dict_for_abbrs[value_indicator]
# get value indicator for same country, year, indicator as current record
value_estimate_series = merged_df.loc[(merged_df["setting"] == value_name) & (merged_df["year"] == value_year) & (merged_df["indicator_abbr"] == value_indicator) , "Avg estimate"]
# if value estimate series is empty there is not data on this combination and it should be 0 for the year.
if value_estimate_series.empty:
value_estimate = 0
else:
value_estimate = value_estimate_series.values[0]
# adds rows that are missing
merged_df = merged_df.append({'country_year_indicator_agency':value, "setting":value_name,
"year":value_year, "indicator_abbr":value_indicator, "indicator_name":full_name_ind,
"Avg estimate":value_estimate,"Funding Agency Acronym":value_agency,
"Constant Dollar Amount": 0}, ignore_index =True)
else:
merged_df = pd.read_csv(os.path.join("..", "WHO_datasets", "merged_data.csv"))
print("Function skipped, change run_function variable as needed")
Function skipped, change run_function variable as needed
# saves dataframe as csv
merged_df.to_csv(os.path.join("..", "WHO_datasets", "merged_data.csv"))
merged_df
Unnamed: 0 | setting | year | indicator_abbr | indicator_name | Avg estimate | Funding Agency Acronym | Constant Dollar Amount | |
---|---|---|---|---|---|---|---|---|
0 | 0 | Ecuador | 2002 | alcohol | Total alcohol per capita consumption in adults... | -4.50 | AGR | 9241684.0 |
1 | 1 | Ecuador | 2002 | alcohol | Total alcohol per capita consumption in adults... | -4.50 | IAF | 1948562.0 |
2 | 2 | Ecuador | 2002 | alcohol | Total alcohol per capita consumption in adults... | -4.50 | PC | 3948070.0 |
3 | 3 | Ecuador | 2002 | alcohol | Total alcohol per capita consumption in adults... | -4.50 | STATE | 35779088.0 |
4 | 4 | Ecuador | 2002 | alcohol | Total alcohol per capita consumption in adults... | -4.50 | TDA | 1638316.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
3116 | 3116 | Yemen | 2020 | life_ex | Life Expectancy | 67.90 | STATE | 82848136.0 |
3117 | 3117 | Yemen | 2020 | life_ex | Life Expectancy | 67.90 | USAID | 248473546.0 |
3118 | 3118 | Yemen | 2020 | u5mr | Under-five mortality rate | -5.95 | AGR | 224994779.0 |
3119 | 3119 | Yemen | 2020 | u5mr | Under-five mortality rate | -5.95 | STATE | 82848136.0 |
3120 | 3120 | Yemen | 2020 | u5mr | Under-five mortality rate | -5.95 | USAID | 248473546.0 |
3121 rows × 8 columns
After setting up our dataset for our model, we imputed data on indicators that may be missing for years, typically replacing the average estimate for the last year's estimate if the estimate was missing for this year. If indicator data for a particular country was completely missing for all years, we kept the average estimate at 0. Additionally, we moved the indicators up a year to show how money from previous years (starting from 2002) impacts later years.
With this data imputation and adjusting the consant dollar amounts up a year, our dataset is now ready for a model.
final_dataset = pd.read_csv(os.path.join("..","WHO_datasets", "final_dataset.csv"))
final_dataset
setting | Previous year | indicator_abbr | indicator_name | Avg estimate | Funding Agency Acronym | Constant Dollar Amount | |
---|---|---|---|---|---|---|---|
0 | Ecuador | 2003 | alcohol | Total alcohol per capita consumption in adults... | -4.00 | ADF | 0 |
1 | Ecuador | 2003 | alcohol | Total alcohol per capita consumption in adults... | -4.00 | AGR | 717869 |
2 | Ecuador | 2003 | alcohol | Total alcohol per capita consumption in adults... | -4.00 | DFC | 0 |
3 | Ecuador | 2003 | alcohol | Total alcohol per capita consumption in adults... | -4.00 | DOC | 0 |
4 | Ecuador | 2003 | alcohol | Total alcohol per capita consumption in adults... | -4.00 | DOE | 0 |
... | ... | ... | ... | ... | ... | ... | ... |
10332 | Yemen | 2020 | u5mr | Under-five mortality rate | -5.95 | PC | 0 |
10333 | Yemen | 2020 | u5mr | Under-five mortality rate | -5.95 | STATE | 82848136 |
10334 | Yemen | 2020 | u5mr | Under-five mortality rate | -5.95 | TDA | 0 |
10335 | Yemen | 2020 | u5mr | Under-five mortality rate | -5.95 | TREAS | 0 |
10336 | Yemen | 2020 | u5mr | Under-five mortality rate | -5.95 | USAID | 248473546 |
10337 rows × 7 columns
First, we're going to quickly make another column for the purpose of scaling the data. Then, we're going to plot the amount of foreign aid received by country. This exploration will help us determine which of the countries chosen recieve the most aid. Make sure to pay attention to the axis labels!
for country in countries:
country['Constant Dollar Amount ($1,000,000s)'] = country['Constant Dollar Amount']
country['Constant Dollar Amount ($1,000,000s)'] = country['Constant Dollar Amount ($1,000,000s)'].apply(lambda x: x / 1000000) # Scale by one million $
country['Year Abv.'] = country['Fiscal Year']
country['Year Abv.'] = country['Year Abv.'].apply(lambda x: int(str(x)[2:])) # Get last two digits (ex. 2020 --> 20)
# Get the top 5 agencies by money spent
country_by_fiscal_yr = []
for i, country in enumerate(countries):
# Groupby funding agency to get money spent per agency.
country_by_fiscal_yr.append(pd.DataFrame(country.groupby('Year Abv.')['Constant Dollar Amount ($1,000,000s)'].sum()))
country_by_fiscal_yr[i].reset_index(inplace=True) # Reset index for plotting
# 0: Ethiopia
# 1: Moldova
# 2: Yemen
# 3: Ecuador
# 4: El Salvador
# 5: Haiti
fig1, axes1 = plt.subplots(1, 3, figsize=(15,5) , sharey= True)
fig1.tight_layout(h_pad= 5)
sns.barplot(country_by_fiscal_yr[0], x= 'Year Abv.', y= 'Constant Dollar Amount ($1,000,000s)', ax= axes1[0], width= 1, errorbar= None).set_title('Ethiopia')
sns.barplot(country_by_fiscal_yr[5], x= 'Year Abv.', y= 'Constant Dollar Amount ($1,000,000s)', ax= axes1[1], width= 1, errorbar= None).set_title('Haiti')
sns.barplot(country_by_fiscal_yr[2], x= 'Year Abv.', y= 'Constant Dollar Amount ($1,000,000s)', ax= axes1[2], width= 1, errorbar= None).set_title('Yemen')
fig2, axes2 = plt.subplots(1, 3, figsize=(15,5) , sharey= True)
fig2.tight_layout(h_pad= 5)
sns.barplot(country_by_fiscal_yr[1], x= 'Year Abv.', y= 'Constant Dollar Amount ($1,000,000s)', ax= axes2[0], width= 1, errorbar= None).set_title('Moldova')
sns.barplot(country_by_fiscal_yr[3], x= 'Year Abv.', y= 'Constant Dollar Amount ($1,000,000s)', ax= axes2[1], width= 1, errorbar= None).set_title('Ecuador')
sns.barplot(country_by_fiscal_yr[4], x= 'Year Abv.', y= 'Constant Dollar Amount ($1,000,000s)', ax= axes2[2], width= 1, errorbar= None).set_title('El Salvador')
Text(0.5, 1.0, 'El Salvador')
Graph 1: Constant Dollar Amount in $1,000,000 of Foreign Aid by Fiscal Year. The countries are separated in groups of 3 for scale purposes.
The countries chosen represent two distinct groups. One group that receives a vary large amount of foreign aid and one that receives less, but still a significant amount of aid. There is a noticable upward trend in Ethiopia and a peak in Haiti around 2010. The upward trend in Ethiopia is most likely due to a combination of drought, famine, civil unrest, and a civil war. Similarly, Yemen receives the a large amount of US aid in recent years due to the ourbreak of the Yemeni Civil War in late 2014, demarking the upward trend.
Next, let's examine which agencies give the most aid. This will allow us to determine which category, health or economic, the country receives the most aid in.
# Get the top 5 agencies by money spent
country_by_funding_agency = []
for i, country in enumerate(countries):
# Groupby funding agency to get money spent per agency.
country_by_funding_agency.append(pd.DataFrame(country.groupby('Funding Agency Acronym')['Constant Dollar Amount ($1,000,000s)'].sum()))
country_by_funding_agency[i].reset_index(inplace=True) # Reset index for plotting
# Sort and locate the top 3 agencies by money spent
country_by_funding_agency[i] = country_by_funding_agency[i].sort_values(ascending= False, by= 'Constant Dollar Amount ($1,000,000s)')
country_by_funding_agency[i] = country_by_funding_agency[i].iloc[0:3]
# Plotting the top 3 agencies by money spent
fig, axes = plt.subplots(1, 3, figsize=(15,5), sharey= True)
sns.barplot(data= country_by_funding_agency[0] , x= 'Funding Agency Acronym', y= 'Constant Dollar Amount ($1,000,000s)', hue= 'Funding Agency Acronym' , ax= axes[0]).set_title('Ethiopia')
sns.barplot(data= country_by_funding_agency[5] , x= 'Funding Agency Acronym', y= 'Constant Dollar Amount ($1,000,000s)', hue= 'Funding Agency Acronym' , ax= axes[1]).set_title('Haiti')
sns.barplot(data= country_by_funding_agency[2] , x= 'Funding Agency Acronym', y= 'Constant Dollar Amount ($1,000,000s)', hue= 'Funding Agency Acronym' , ax= axes[2]).set_title('Yemen')
fig1, axes1 = plt.subplots(1, 3, figsize=(15,5), sharey= True)
sns.barplot(data= country_by_funding_agency[1] , x= 'Funding Agency Acronym', y= 'Constant Dollar Amount ($1,000,000s)', hue= 'Funding Agency Acronym' ,ax= axes1[0]).set_title('Moldova')
sns.barplot(data= country_by_funding_agency[3] , x= 'Funding Agency Acronym', y= 'Constant Dollar Amount ($1,000,000s)', hue= 'Funding Agency Acronym' ,ax= axes1[1]).set_title('Ecuador')
sns.barplot(data= country_by_funding_agency[4] , x= 'Funding Agency Acronym', y= 'Constant Dollar Amount ($1,000,000s)', hue= 'Funding Agency Acronym' ,ax= axes1[2]).set_title('El Salvador')
Text(0.5, 1.0, 'El Salvador')
Graph 2: Constant Dollar Amount in millions vs Funding Agency. The countries are separated in groups of 3 for scale purposes.
We see four main aid agencies:
See appendix for more information
To highlight and explore the GW13 dataset, we are going to show how the five indicators change over time for Ecuador.
alc_dataset = final_dataset[final_dataset["indicator_abbr"] == "alcohol"]
alc_dataset = alc_dataset[alc_dataset["setting"] == "Ecuador"]
hh_dataset = final_dataset[final_dataset["indicator_abbr"] == "hh"]
hh_dataset = hh_dataset[hh_dataset["setting"] == "Ecuador"]
life_dataset = final_dataset[final_dataset["indicator_abbr"] == "life"]
life_dataset = life_dataset[life_dataset["setting"] == "Ecuador"]
obesity_dataset = final_dataset[final_dataset["indicator_abbr"] == "obesitya"]
obesity_dataset = obesity_dataset[obesity_dataset["setting"] == "Ecuador"]
U5_dataset = final_dataset[final_dataset["indicator_abbr"] == "u5mr"]
U5_dataset = U5_dataset[U5_dataset["setting"] == "Ecuador"]
fig3, axes3 = plt.subplots(1, 5, figsize=(15,8), sharey= False)
fig3.tight_layout()
sns.lineplot(data= alc_dataset, x = "Previous year", y= "Avg estimate", ax=axes3[0]).set_title('Alcohol Consumption by liters weekly')
sns.lineplot(data= hh_dataset, x = "Previous year", y= "Avg estimate", ax=axes3[1]).set_title('Access to Clean Drinking Water')
sns.lineplot(data= life_dataset, x = "Previous year", y= "Avg estimate", ax=axes3[2]).set_title('Life Expectancy')
sns.lineplot(data= obesity_dataset, x = "Previous year", y= "Avg estimate", ax=axes3[3]).set_title('Obesity Rate')
sns.lineplot(data= U5_dataset, x = "Previous year", y= "Avg estimate", ax=axes3[4]).set_title('Under Five Mortality Rate')
Text(0.5, 1.0, 'Under Five Mortality Rate')
Graph 3: Average estimate of each indicator versus the year. Differentiated by indicator.
Since the early 2000s, all indicators except for obesity have increased since the early 2000s, meaning that based on these indicators, Ecuador's overall health should be increasing. Next, we are going to look at the life expectancy for all our countries over time.
life_df = final_dataset[final_dataset["indicator_abbr"] == "life"]
fig4, ax4 = plt.subplots(figsize=(10,6))
plt.xlabel('xlabel', fontsize=18)
plt.ylabel('ylabel', fontsize=16)
ax4.set_xlabel('Year', fontsize=16)
ax4.set_ylabel('Life Expectancy (in years)', fontsize=12)
sns.scatterplot(data= life_df, x = "Previous year", y= "Avg estimate", hue="setting", ax=ax4).set_title('Life Expectancy by Country')
Text(0.5, 1.0, 'Life Expectancy by Country')
Graph 4: Life Expectancy versus year. Points colored by country.
Generally, life expectancies have increased over time, with an outlier point in Haiti, which comes from the devastating earthquake in 2010, estimated to kill between 100k-300k people which brings the average life expectancy very significantly.
Now that we have collected, cleaned, transformed, and explored the data it is time to answer some of the questions which we set out to answer.
Those questions are:
To answer these questions, we are going to turn to machine learning to learn our data and predict new values based on our data. We are going to use a simple linear regression to fit our model. We are working under the assumption that increased aid leads to general heath improvement. It is for this reason that we chose a simple linear regression as our predictor of choice.
We opt to separate the indicators into their own models based on previous conclusions and under the assumption that the chosen indicators have no predictive value on each other.
# IMPORTS for model
import sklearn
import sklearn.datasets
assert sklearn.__version__ >= "0.20"
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
First, we're going to get the X and y data for the model. This will consist of the country('setting), the year before, which corresponds to the constant dollar amount, the indicator name in question, the agency who is funding the aid, and the amount they are contributing each year. Our predictor value will be the average estimate gathered from the GW13 dataset.
final_dataset['indicator_name'].unique()
array(['Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)', 'Population using safely managed drinking water services (%)', 'Life Expectancy', 'Obesity prevalence among adults (%)', 'Under-five mortality rate'], dtype=object)
model_df_under5 = final_dataset[final_dataset['indicator_name'] == 'Under-five mortality rate']
X_under5 = model_df_under5[['setting' ,'Previous year', 'indicator_name', 'Funding Agency Acronym', 'Constant Dollar Amount']]
X_under5 = pd.get_dummies(X_under5)
y_under5 = model_df_under5[['Avg estimate']]
Next, we're going to create the train, validation, and test sets. After that, we're going to fit and predict our model on the final test set. Previous adjustments to the code were done on the validation sets.
X_train_full_5, X_test_5, y_train_full_5, y_test_5 = train_test_split(X_under5, y_under5, test_size=0.2, random_state=0)
#X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=0)
model_under5 = LinearRegression()
model_under5.fit(X_train_full_5, y_train_full_5)
predicts_under5 = model_under5.predict(X_test_5)
mae_5 = mean_absolute_error(predicts_under5, y_test_5)
r_squared_5 = model_under5.score(X_test_5 , y_test_5 )
print('R-squared value: ', r_squared_5 )
print('Model MAE: ', mae_5)
R-squared value: 0.7903251858464739 Model MAE: 0.7939630674179491
Since the estimate is measured in terms of %, we can see from our MAE that for our under 5 mortality model is on average 0.8% off from the actual estimate.
print('model intercept:', model_under5.intercept_)
coef = model_under5.coef_
coef_series = pd.DataFrame(coef, columns= X_under5.columns).iloc[0]
coef_series[9:]
model intercept: [-334.99343318]
Funding Agency Acronym_ADF -0.054791 Funding Agency Acronym_AGR 0.236369 Funding Agency Acronym_DFC -0.131322 Funding Agency Acronym_DOC 0.046639 Funding Agency Acronym_DOE -0.073089 Funding Agency Acronym_DOL -0.095695 Funding Agency Acronym_DOT -0.070006 Funding Agency Acronym_EPA -0.096783 Funding Agency Acronym_FTC 0.012000 Funding Agency Acronym_HHS -0.028213 Funding Agency Acronym_IAF 0.060164 Funding Agency Acronym_LIC -0.074186 Funding Agency Acronym_MCC 0.053703 Funding Agency Acronym_NSF -0.053509 Funding Agency Acronym_PC 0.111324 Funding Agency Acronym_STATE 0.154022 Funding Agency Acronym_TDA -0.059771 Funding Agency Acronym_TREAS -0.038698 Funding Agency Acronym_USAID 0.101842 Name: 0, dtype: float64
Each coefficient (Beta) represents the difference in the predicted value of y (the average estimate of under 5 mortality) for each unit difference in the corresponding feature, if the other features stay constant. In essence, this means that if X increases or decreases by one unit, the average estimate of under 5 mortality will change by beta units.
Most impactful agency
coef_series.loc[coef_series.index[9:]].idxmax()
'Funding Agency Acronym_AGR'
model_df_alc = final_dataset[final_dataset['indicator_name'] == 'Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)']
X_alc = model_df_alc[['setting' ,'Previous year', 'indicator_name', 'Funding Agency Acronym', 'Constant Dollar Amount']]
X_alc = pd.get_dummies(X_alc)
y_alc = model_df_alc[['Avg estimate']]
Next, we're going to create the train, validation, and test sets. After that, we're going to fit and predict our model on the final test set. Previous adjustments to the code were done on the validation sets.
X_train_full_alc, X_test_alc, y_train_full_alc, y_test_alc = train_test_split(X_alc, y_alc, test_size=0.2, random_state=0)
model_alc = LinearRegression()
model_alc.fit(X_train_full_alc, y_train_full_alc)
predicts_alc = model_alc.predict(X_test_alc)
mae_alc = mean_absolute_error(predicts_alc, y_test_alc)
r_squared_alc = model_alc.score(X_test_alc , y_test_alc )
print('R-squared value: ', r_squared_alc )
print('Model MAE: ', mae_alc)
R-squared value: 0.8469852802786237 Model MAE: 0.7150067905134573
Since the estimate is measured in terms of liter, we can see from our MAE that for our alcohol consumption model is on average 0.7 liters/week off from the actual estimate.
print('model intercept:', model_alc.intercept_)
coef_alc = model_alc.coef_
coef_series_alc = pd.DataFrame(coef_alc, columns= X_alc.columns).iloc[0]
coef_series_alc[9:]
model intercept: [-98.99754148]
Funding Agency Acronym_ADF -0.139566 Funding Agency Acronym_AGR 0.039903 Funding Agency Acronym_DFC 0.047342 Funding Agency Acronym_DOC 0.044578 Funding Agency Acronym_DOE -0.128638 Funding Agency Acronym_DOL 0.029335 Funding Agency Acronym_DOT 0.060120 Funding Agency Acronym_EPA -0.041348 Funding Agency Acronym_FTC 0.047553 Funding Agency Acronym_HHS -0.166892 Funding Agency Acronym_IAF 0.015767 Funding Agency Acronym_LIC 0.114250 Funding Agency Acronym_MCC 0.085641 Funding Agency Acronym_NSF -0.122542 Funding Agency Acronym_PC 0.003367 Funding Agency Acronym_STATE 0.098357 Funding Agency Acronym_TDA -0.032193 Funding Agency Acronym_TREAS 0.005792 Funding Agency Acronym_USAID 0.039176 Name: 0, dtype: float64
Most impactful agency
coef_series_alc.loc[coef_series_alc.index[9:]].idxmax()
'Funding Agency Acronym_LIC'
model_df_life = final_dataset[final_dataset['indicator_name'] == 'Life Expectancy']
X_life = model_df_life[['setting' ,'Previous year', 'indicator_name', 'Funding Agency Acronym', 'Constant Dollar Amount']]
X_life = pd.get_dummies(X_life)
y_life = model_df_life[['Avg estimate']]
Next, we're going to create the train, validation, and test sets. After that, we're going to fit and predict our model on the final test set. Previous adjustments to the code were done on the validation sets.
X_train_full_life, X_test_life, y_train_full_life, y_test_life = train_test_split(X_life, y_life, test_size=0.2, random_state=0)
model_life = LinearRegression()
model_life.fit(X_train_full_life, y_train_full_life)
predicts_life = model_life.predict(X_test_life)
mae_life = mean_absolute_error(predicts_life, y_test_life)
r_squared_life = model_life.score(X_test_life, y_test_life)
print('R-squared value: ', r_squared_life)
print('Model MAE: ', mae_life)
R-squared value: 0.8011108446414228 Model MAE: 1.4176280750898032
Since the estimate is measured in terms of years, we can see from our MAE that for our life expectancy model is on average 1.4 years off from the actual estimate.
print('model intercept:', model_life.intercept_)
coef_life = model_life.coef_
coef_series_life = pd.DataFrame(coef_life, columns= X_life.columns).iloc[0]
coef_series_life[9:]
model intercept: [21715.70476767]
Funding Agency Acronym_ADF 2852.754682 Funding Agency Acronym_AGR 2852.471314 Funding Agency Acronym_DFC 2852.574361 Funding Agency Acronym_DOC 2852.488663 Funding Agency Acronym_DOE 2852.626195 Funding Agency Acronym_DOL 2852.913651 Funding Agency Acronym_DOT 2852.419819 Funding Agency Acronym_EPA 2852.835268 Funding Agency Acronym_FTC 2852.541626 Funding Agency Acronym_HHS 2852.480125 Funding Agency Acronym_IAF 2852.443298 Funding Agency Acronym_LIC 2852.390947 Funding Agency Acronym_MCC 2852.415794 Funding Agency Acronym_NSF 2852.458384 Funding Agency Acronym_PC 2852.545699 Funding Agency Acronym_STATE 2852.365797 Funding Agency Acronym_TDA 2852.411273 Funding Agency Acronym_TREAS 2852.500019 Funding Agency Acronym_USAID 2852.488978 Name: 0, dtype: float64
Most impactful agency
coef_series_life.loc[coef_series_life.index[9:]].idxmax()
'Funding Agency Acronym_DOL'
model_df_obesity = final_dataset[final_dataset['indicator_name'] == 'Obesity prevalence among adults (%)']
X_obesity = model_df_obesity[['setting' ,'Previous year', 'indicator_name', 'Funding Agency Acronym', 'Constant Dollar Amount']]
X_obesity = pd.get_dummies(X_obesity)
y_obesity = model_df_obesity[['Avg estimate']]
Next, we're going to create the train, validation, and test sets. After that, we're going to fit and predict our model on the final test set. Previous adjustments to the code were done on the validation sets.
X_train_full_obesity , X_test_obesity , y_train_full_obesity , y_test_obesity = train_test_split(X_obesity, y_obesity, test_size=0.2, random_state=0)
model_obesity = LinearRegression()
model_obesity.fit(X_train_full_obesity, y_train_full_obesity)
predicts_obesity = model_obesity.predict(X_test_obesity)
mae_obesity = mean_absolute_error(predicts_obesity, y_test_obesity)
r_squared_obesity = model_obesity.score(X_test_obesity, y_test_obesity)
print('R-squared value: ', r_squared_obesity)
print('Model MAE: ', mae_obesity)
R-squared value: 0.978766769693189 Model MAE: 0.6402113938263317
Since the estimate is measured in terms of %, we can see from our MAE that our for obesity model is on average 0.64% off from the actual estimate.
print('model intercept:', model_obesity.intercept_)
coef_obesity = model_obesity.coef_
coef_series_obesity = pd.DataFrame(coef_obesity, columns= X_obesity.columns).iloc[0]
coef_series_obesity[9:]
model intercept: [757.55612393]
Funding Agency Acronym_ADF -0.007747 Funding Agency Acronym_AGR 0.092102 Funding Agency Acronym_DFC 0.034112 Funding Agency Acronym_DOC -0.013388 Funding Agency Acronym_DOE -0.027756 Funding Agency Acronym_DOL 0.002047 Funding Agency Acronym_DOT -0.076675 Funding Agency Acronym_EPA -0.096046 Funding Agency Acronym_FTC 0.016318 Funding Agency Acronym_HHS -0.000027 Funding Agency Acronym_IAF -0.017796 Funding Agency Acronym_LIC -0.032623 Funding Agency Acronym_MCC 0.024782 Funding Agency Acronym_NSF -0.038777 Funding Agency Acronym_PC 0.016032 Funding Agency Acronym_STATE 0.085456 Funding Agency Acronym_TDA -0.072786 Funding Agency Acronym_TREAS 0.044613 Funding Agency Acronym_USAID 0.068160 Name: 0, dtype: float64
Most impactful agency
coef_series_obesity.loc[coef_series_obesity.index[9:]].idxmax()
'Funding Agency Acronym_AGR'
model_df_water = final_dataset[final_dataset['indicator_name'] == 'Population using safely managed drinking water services (%)']
X_water = model_df_water[['setting' ,'Previous year', 'indicator_name', 'Funding Agency Acronym', 'Constant Dollar Amount']]
X_water = pd.get_dummies(X_water)
y_water = model_df_water[['Avg estimate']]
Next, we're going to create the train, validation, and test sets. After that, we're going to fit and predict our model on the final test set. Previous adjustments to the code were done on the validation sets.
X_train_full_water, X_test_water, y_train_full_water, y_test_water = train_test_split(X_water, y_water, test_size=0.2, random_state=0)
model_water = LinearRegression()
model_water.fit(X_train_full_water, y_train_full_water)
predicts_water = model_water.predict(X_test_water)
mae_water = mean_absolute_error(predicts_water, y_test_water)
r_squared_water = model_water.score(X_test_water, y_test_water)
print('R-squared value: ', r_squared_water)
print('Model MAE: ', mae_water)
R-squared value: 0.9982495503032212 Model MAE: 0.7150255979592951
Since the estimate is measured in terms of %, we can see from our MAE that our for safe drinking water model is on average 0.7% off from the actual estimate.
print('model intercept:', model_water.intercept_)
coef_water = model_water.coef_
coef_series_water = pd.DataFrame(coef_water, columns= X_water.columns).iloc[0]
coef_series_water[9:]
model intercept: [-194636.53814263]
Funding Agency Acronym_ADF 33200.480720 Funding Agency Acronym_AGR 33200.411869 Funding Agency Acronym_DFC 33200.440314 Funding Agency Acronym_DOC 33200.427906 Funding Agency Acronym_DOE 33200.467663 Funding Agency Acronym_DOL 33200.437283 Funding Agency Acronym_DOT 33200.356895 Funding Agency Acronym_EPA 33200.360992 Funding Agency Acronym_FTC 33200.463437 Funding Agency Acronym_HHS 33200.533920 Funding Agency Acronym_IAF 33200.369219 Funding Agency Acronym_LIC 33200.430396 Funding Agency Acronym_MCC 33200.399428 Funding Agency Acronym_NSF 33200.370769 Funding Agency Acronym_PC 33200.467869 Funding Agency Acronym_STATE 33200.459103 Funding Agency Acronym_TDA 33200.437018 Funding Agency Acronym_TREAS 33200.456665 Funding Agency Acronym_USAID 33200.407146 Name: 0, dtype: float64
The strange nature of the coefficients can most likely be attributed to the lack of data from a few of the countries.
Most impactful agency
coef_series_water.loc[coef_series_water.index[9:]].idxmax()
'Funding Agency Acronym_HHS'
from tabulate import tabulate
print(tabulate([['Health Indicator','Most impactful agencies', 'Model Error', 'R-Squared Value'],
['Under-five mortality rate', coef_series.loc[coef_series.index[9:]].idxmax(), mae_5, r_squared_5],
['Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)',coef_series_alc.loc[coef_series_alc.index[9:]].idxmax(), mae_alc, r_squared_alc],
[ 'Life Expectancy', coef_series_life.loc[coef_series_life.index[9:]].idxmax(), mae_life, r_squared_life],
['Obesity prevalence among adults (%)', coef_series_obesity.loc[coef_series_obesity.index[9:]].idxmax(), mae_obesity, r_squared_obesity],
['Population using safely managed drinking water services (%)', coef_series_water.loc[coef_series_water.index[9:]].idxmax(), mae_water, r_squared_water]],
headers='firstrow', tablefmt= 'simple_grid'
))
Health Indicator Most impactful agencies Model Error R-Squared Value -------------------------------------------------------------------------------- -------------------------- ------------- ----------------- Under-five mortality rate Funding Agency Acronym_AGR 0.793963 0.790325 Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol) Funding Agency Acronym_LIC 0.715007 0.846985 Life Expectancy Funding Agency Acronym_DOL 1.41763 0.801111 Obesity prevalence among adults (%) Funding Agency Acronym_AGR 0.640211 0.978767 Population using safely managed drinking water services (%) Funding Agency Acronym_HHS 0.715026 0.99825
The United States is the number one distributor of foregin aid in the world. Due to this, it is extremely important to determine how foreign aid is being used in order to distribute aid to achieve maximum impact.
Have we gathered enough evidence to answer the questions we set out to answer?
In all predictions for all indicators, we have high to extremely high R-squared values indicating a direct positive correlation between foreing aid and health outcomes.
As described in the model analysis section, each indicator has a corresponding agency which has the greatest impact on the indicator. For example, if the US gave the Department of Agriculture one more dollar to combat Under-five Mortality, it would be more effective than any other agency.
As we can see, performing this type of analysis effectively determines which US agency should spearhead health improvement initiatives.
We can see that certain agencies are better equipped to affect change in health outcomes. We hope that this data can be used as a preliminary way of exploring how the United States can effectively distribute foreign aid. We hope that this analysis could be used to highlight inefficiencies or spark conversations about where foreign aid could be most effective. We'd like to thank Professor Mattei for the guidance and many office hours spent "aiding" us in our efforts.
AGR: Department of Agriculture
USAID: US Agency for International Development
STATE: Department of State
MCC: Millennium (for more info: https://www.mcc.gov/)
HHS: Department of Health and Human Services
LIC: Open World Leadership Center (for more info: https://www.openworld.gov/)
ADF: US African Development Foundation
DFC: US International Development Finance Corporation
DOC: Department of Commerce
DOE: Department of Energy
DOL: Department of Labor
DOT: Department of Transportation
EPA: US Environmental Protection Agency
FTC: US Federal Trade Comission
IAF: US Inter-American Foundation (for more info: https://www.iaf.gov/)
NSF: US National Science Foundation
PC : Peace Corps
TDA: US Trade and Development Agency
TREAS: Department of the Treasury
Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)
Population using safely managed drinking water services (%)
Life Expectancy
Obesity prevalence among adults (%)
Under-five mortality rate