Website Link: https://chriscallahan99.github.io/DataScience_Final_Project/

How does US Foreign Aid Impact Health Outcomes?


By Chris Callahan and Mikey Sison
1. Introduction
2. Data: Extraction, Transform and Load
3. Exploratory Data Analysis
4. Model and Analysis
5. Conclusion
6. Appendix

Introduction

Abstract¶

The United States disburses the most foreign aid in the world (World Economic Forum), and it is of paramount importance that the countries in greatest need are able to receive it. We examine US foreign aid disbursements to six low- to middle-income countries to determine the effect of monetary aid increases on five health indicators representative of overall health outcomes. Ethiopia, El Salvador, Ecuador, Haiti, Moldova, and Yemen represent a sample of countries exhibiting some of the greatest overall need for humanitarian aid. Geopolitical and environmental factors have rendered Ethiopia, Haiti, and Yemen dependent on foreign aid for humanitarian support. It is commonly assumed that increases in foreign aid disbursements improve overall welfare. We aim to test this assumption and determine which US agencies have the greatest direct impact on improving health outcomes.

Our analysis will explore and answer the following questions:

  • How does US foreign aid impact the overall health of a country?
  • Does a marginal dollar given to one agency improve health indicators more or less than a dollar given to a different agency?
  • Since US agencies have budgets set by Congress, how can they best utilize a limited budget for maximum impact?

Data: Extraction, Transform, and Load

Imports¶

For this project, we utilized the Python libraries pandas, matplotlib, numpy, and more. The first few code cells install seaborn, a plotting library, and import the other necessary libraries.

In [1]:
%pip install seaborn
Requirement already satisfied: seaborn in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (0.12.1)
Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from seaborn) (3.5.2)
Requirement already satisfied: numpy>=1.17 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from seaborn) (1.23.4)
Requirement already satisfied: pandas>=0.25 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from seaborn) (1.5.1)
Requirement already satisfied: pillow>=6.2.0 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (9.2.0)
Requirement already satisfied: pyparsing>=2.2.1 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (3.0.9)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (4.25.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.4.2)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (2.8.2)
Requirement already satisfied: packaging>=20.0 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (21.3)
Requirement already satisfied: cycler>=0.10 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (0.11.0)
Requirement already satisfied: pytz>=2020.1 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from pandas>=0.25->seaborn) (2022.1)
Requirement already satisfied: six>=1.5 in c:\users\perso\anaconda3\envs\landlab_dev\lib\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.1->seaborn) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
WARNING: Error parsing requirements for pywinpty: [Errno 2] No such file or directory: 'c:\\users\\perso\\anaconda3\\envs\\landlab_dev\\lib\\site-packages\\pywinpty-2.0.6.dist-info\\METADATA'
In [2]:
## IMPORTS ##

# Load Numpy
import numpy as np

# Load os
import os

# Load Pandas
import pandas as pd

# Load MatPlotLib
import matplotlib
import matplotlib.pyplot as plt

# Load Seaborn
import seaborn as sns; sns.set_theme()

# These two things are for Pandas, it widens the notebook and lets us display data easily.
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

Let's change the directory to the folder containing the foreign aid data.

In [3]:
os.chdir("Foreign-Aid-Country-Data")

Data Sourcing & Extraction¶

We obtained data from two primary sources: US Gov Foreign Assistance and the WHO GW13 Dataset. These two datasets contain, respectively, the publicly available US foreign aid disbursements per country and health indicators by country over the past 20 years.

Foreign Aid by Country¶

Our first dataset comes from the US Foreign Assistance webpage. With this dataset, we will be focusing on quantifying the amount of foreign aid the United States gives to the six countries in question, per year, and which agencies provide this support.

Analysis of this dataset will answer the following questions:

  • How much does a country receive from the US in foreign aid per year?
  • How much aid is disbursed each year by agency?

Example Query¶

FAgovEthiopiaQuery.png

Figure 1: The image above is an example query of the foreign aid given to Ethiopia from 2018 to 2022

Obtaining the data¶

For each country, we created a query, similar to that of Figure 1, obtaining the non-military foreign aid given from 2002 to 2022. Since we are examining how foreign aid impacts health, we assume that military foreign aid has no impact on a country's health outcomes, so it will be removed from the dataframes. Each of these datasets can be found in the "Foreign-Aid-Country-Data" folder.

Reading in Country data (Loading)¶

Here, we read in the country data from our queries on ForeignAssistance.gov. Each query encompasses non-military foreign aid for the 2002 to 2022 fiscal years.

We chose countries in low to middle income groups, with one or two per continent/region and low levels of military assistance.

  • Europe: Moldova
  • Middle East: Yemen
  • Africa: Ethiopia
  • South America: Ecuador
  • North/Central America: El Salvador, Haiti

Let's read in our data into Pandas DataFrames.

In [4]:
Ethiopia_foreign_aid_data = pd.read_csv(os.path.join("Africa","Ethiopia_2022-2002_Econ_Other.csv"))
Moldova_foreign_aid_data = pd.read_csv(os.path.join("Europe", "Moldova_2022-2002_Econ_Other.csv"))
Yemen_foreign_aid_data = pd.read_csv(os.path.join("Middle-East","Yemen_2022-2002_Econ_Other.csv"))
Ecuador_foreign_aid_data = pd.read_csv(os.path.join("South-America","Ecuador_2022-2002_Econ_Other.csv"))
El_Salvador_foreign_aid_data = pd.read_csv(os.path.join("North-America","El_Salvador_2022-2002_Econ_Other.csv"))
Haiti_foreign_aid_data = pd.read_csv(os.path.join("North-America","Haiti_2022-2002_Econ_Other.csv"))

countries = [
             Ethiopia_foreign_aid_data,
             Moldova_foreign_aid_data,
             Yemen_foreign_aid_data, 
             Ecuador_foreign_aid_data, 
             El_Salvador_foreign_aid_data,
             Haiti_foreign_aid_data
            ] # For ease of looping

Ethiopia_foreign_aid_data.head(3)
Out[4]:
Country ID Country Code Country Name Region ID Region Name Income Group ID Income Group Name Income Group Acronym Managing Agency ID Managing Agency Acronym ... Activity Name Activity Description Activity Project Number Activity Start Date Activity End Date Transaction Type ID Transaction Type Name Fiscal Year Current Dollar Amount Constant Dollar Amount
0 231 ETH Ethiopia 5 Sub-Saharan Africa 1 Low Income Country LIC 1 USAID ... Title II Emergency Program Title II Emergency Program through a partner o... 720BHA22CA00005 NaN NaN 2 Obligations 2022 113052149 109155305
1 231 ETH Ethiopia 5 Sub-Saharan Africa 1 Low Income Country LIC 1 USAID ... Emergency food assistance Emergency humanitarian food and nutirition ass... 720BHA22IO00038 NaN 2023-03-09 2 Obligations 2022 77740000 75060346
2 231 ETH Ethiopia 5 Sub-Saharan Africa 1 Low Income Country LIC 1 USAID ... Title II Emergency Program Title II Emergency Program through a partner o... 720BHA22IO00034 NaN NaN 2 Obligations 2022 57420571 55441315

3 rows × 51 columns

Focus Areas¶

We will be focusing on the country-specific datasets from the ForeignAssistance.gov website. Each row describes a disbursement to a country in a given fiscal year, including the Transaction Type, Fiscal Year, and Constant Dollar Amount (in USD). We will use these fields to identify and exclude spending that is irrelevant to our study (military and defense spending).

Extraneous Columns¶

As you can see, there are 51 columns in the DataFrame. Many of these columns are redundant or irrelevant to our analysis. Below, we will remove these columns, maintaining the interesting features of the dataset.

In [5]:
columns_to_drop = ['Transaction Type ID',
        'Foreign Assistance Objective ID',
        'Implementing Partner Category ID',
        'Implementing Partner Category Name',
        'Implementing Partner Sub-category ID',
        'Implementing Partner Sub-category Name', 
        'Implementing Partner ID',
        'Implementing Partner Name', 
        'International Category ID',
        'International Category Name',
        'International Sector Code',
        'International Sector Name', 
        'International Purpose Code',
        'Managing Agency ID',
        'Managing Sub-agency or Bureau ID',
        'Activity Project Number',
        'Managing Sub-agency or Bureau Name',
        'Aid Type Group ID',
        'Funding Agency ID',
        'Submission ID', 
        'Funding Acount ID',
        'Activity ID',
        'Income Group Acronym',
        'Managing Sub-agency or Bureau Acronym',
        'US Category ID',
        'US Category Name',
    ]


for country in countries:
    country.drop(columns= columns_to_drop, inplace= True)

Ethiopia_foreign_aid_data.columns
Out[5]:
Index(['Country ID', 'Country Code', 'Country Name', 'Region ID',
       'Region Name', 'Income Group ID', 'Income Group Name',
       'Managing Agency Acronym', 'Managing Agency Name',
       'International Purpose Name', 'US Sector ID', 'US Sector Name',
       'Funding Account Name', 'Funding Agency Name', 'Funding Agency Acronym',
       'Foreign Assistance Objective Name', 'Aid Type Group Name',
       'Activity Name', 'Activity Description', 'Activity Start Date',
       'Activity End Date', 'Transaction Type Name', 'Fiscal Year',
       'Current Dollar Amount', 'Constant Dollar Amount'],
      dtype='object')

Income Analysis¶

Next, we want to highlight the income of each of the countries to verify that the countries chosen are likely to receive a sizeable amount and diverse range of US foreign aid.

In [6]:
for country in countries:
    display(country['Country Name'][0] + ': '+ country['Income Group Name'][0])
'Ethiopia: Low Income Country'
'Moldova: Lower Middle Income Country'
'Yemen: Low Income Country'
'Ecuador: Upper Middle Income Country'
'El Salvador: Lower Middle Income Country'
'Haiti: Low Income Country'

As we can see, all of these countries fall into the low or middle income groups. We will see that, with the exception of Yemen, these countries receive roughly similar amounts of foreign aid.

Removing Military Aid & Negative Amounts¶

Before we visualize the aid amounts, we want to remove entries / aid projects that come from military sources, since they would not impact any of the health metrics we are interested in.

In [7]:
# Drop rows unrelated to economic/health development
for i, country in enumerate(countries):
    country.drop(country[country["Managing Agency Name"] == 'Department of Defense'].index, inplace = True)
    country.drop(country[country["Managing Agency Name"] == 'Department of Homeland Security'].index, inplace = True)
    country.drop(country[country["Managing Agency Name"] == 'Department of Justice'].index, inplace = True)
    country.drop(country[country["Managing Agency Name"] == 'Department of the Army'].index, inplace = True)
    country.drop(country[country["Managing Agency Name"] == 'Department of the Interior'].index, inplace = True)
    
    # Drop row if dollar amount is less than 0 as this is usually a redaction of funds
    country.drop(country[country["Constant Dollar Amount"] < 0 ].index, inplace = True)

GW13 WHO Dataset¶

Now we are going to bring in our second dataset, the GW13 WHO Dataset, which contains health markers of interest stratified by country and year. This dataset comes from a paper by the World Health Organization about the organization's five-year strategy to achieve measurable health outcomes.

The strategy includes a "triple billion" plan which sets to achieve

  • One billion more people are benefiting from universal health coverage
  • One billion more people are better protected from health emergencies
  • One billion more people are enjoying better health and well-being

The dataset includes the health impacts that are measured. We did basic filtering on this dataset to select only data from our target countries and to begin filtering out unnecessary health indicators. We plan on using this data to contextualize and predict the importance of various funding streams and their quantifiable impacts on health outcomes in our countries of interest.

In [8]:
gw19_df = pd.read_excel(os.path.join("..","WHO_datasets", "GW19_2002_2020_filtered.xlsx"))
gw19_df
Out[8]:
setting year indicator_abbr indicator_name dimension subgroup estimate population indicator_scale
0 Ecuador 2002 alcohol Total alcohol per capita consumption in adults... Sex Female 2.0 NaN 100
1 Ecuador 2002 alcohol Total alcohol per capita consumption in adults... Sex Male 7.0 NaN 100
2 Ecuador 2002 bp Prevalence of hypertension among adults aged 3... Sex Female 25.0 NaN 100
3 Ecuador 2002 bp Prevalence of hypertension among adults aged 3... Sex Male 23.0 NaN 100
4 Ecuador 2002 hh_san_sm Population using safely managed sanitation ser... Place of residence Rural 44.0 5110442.0 100
... ... ... ... ... ... ... ... ... ...
5438 Yemen 2020 tb_incidence TB incidence (new infections per 100 000 popul... Sex Female 44.0 14800983.0 100000
5439 Yemen 2020 tb_incidence TB incidence (new infections per 100 000 popul... Sex Male 53.0 15024985.0 100000
5440 Yemen 2020 u5mr Under-five mortality rate (deaths per 1000 liv... Sex Female 55.0 NaN 1000
5441 Yemen 2020 u5mr Under-five mortality rate (deaths per 1000 liv... Sex Male 64.0 NaN 1000
5442 Yemen 2020 life_ex Life Expectancy NaN NaN 67.9 NaN 100

5443 rows × 9 columns

In [9]:
# Here are the different indicators in this dataset.
display(gw19_df["indicator_name"].unique())
array(['Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)',
       'Prevalence of hypertension among adults aged 30-79 years (age-standardized) (%)',
       'Population using safely managed sanitation services (%)',
       'Population using safely managed drinking water services (%)',
       'Obesity prevalence among adults (%)',
       'Obesity prevalence among children and adolescents (5-19) (%)',
       'Population using at least basic sanitation services (%)',
       'Suicide mortality rate (per 100 000 population)',
       'Under-five mortality rate (deaths per 1000 live births)',
       'Life Expectancy',
       'Antenatal care coverage - at least four visits (in the five years preceding the survey) (%)',
       'Overweight prevalence in children aged < 5 years (%)',
       'Stunting prevalence in children aged < 5 years (%)',
       'DTP3 immunization coverage among one-year-olds (%)',
       'Measles immunization coverage among one-year-olds (%)',
       'Polio immunization coverage among one-year-olds (%)',
       'Healthy life expectancy (HALE) at birth (years)',
       'Population with >10% household expenditures on health (%)',
       'Population with >25% household expenditures on health (%)',
       'Mortality rate attributed to unsafe water, unsafe sanitation and lack of hygiene (per 100 000 population)',
       'TB incidence (new infections per 100 000 population)',
       'Neonatal mortality rate (deaths per 1000 live births)'],
      dtype=object)

After filtering the data to our target countries, we reduced the indicators to five, since many of the others had inconsistent data or none at all for the countries we selected. We chose indicators we determined to be very important to a country's health: life expectancy, population with access to safely managed drinking water, and the under-five mortality rate. We also included two more generally related indicators, total alcohol consumption and adult obesity prevalence, to see whether they show a significant effect as well.
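
The reduced dataset is read in below from a file we prepared separately (GW_five_combined_subgroups.xlsx). Roughly, the subsetting and subgroup averaging could be reproduced from gw19_df as in the sketch below; note that the prepared file additionally rescales some estimates and flips the sign of the "lower is better" indicators (see Directionality).

# Sketch only: rebuild the five-indicator subset from gw19_df by averaging
# over subgroups. Abbreviations are assumed to match gw19_df; the prepared
# file also rescales and re-signs some estimates, which is not shown here.
keep_abbrs = ["alcohol", "hh_wat_sm", "life_ex", "obesitya", "u5mr"]

gw_five_sketch = (
    gw19_df[gw19_df["indicator_abbr"].isin(keep_abbrs)]
    .groupby(["setting", "year", "indicator_abbr", "indicator_name"], as_index=False)["estimate"]
    .mean()
    .rename(columns={"estimate": "Avg estimate"})
)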

In [10]:
gw_five = pd.read_excel(os.path.join("..", "WHO_datasets", "GW_five_combined_subgroups.xlsx"))

# add combined for merging another dataset by setting and year
gw_five["combined"] = gw_five["setting"].astype(str) + "_" + gw_five["year"].astype(str)

display(gw_five["indicator_name"].unique())
display(gw_five)
array(['Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)',
       'Population using safely managed drinking water services (%)',
       'Life Expectancy', 'Obesity prevalence among adults (%)',
       'Under-five mortality rate'], dtype=object)
setting year indicator_abbr indicator_name Avg estimate combined
0 Ecuador 2002 alcohol Total alcohol per capita consumption in adults... -4.50 Ecuador_2002
1 Ecuador 2002 hh_wat_sm Population using safely managed drinking water... 56.50 Ecuador_2002
2 Ecuador 2002 life_ex Life Expectancy 73.90 Ecuador_2002
3 Ecuador 2002 obesitya Obesity prevalence among adults (%) -13.00 Ecuador_2002
4 Ecuador 2002 u5mr Under-five mortality rate -2.65 Ecuador_2002
... ... ... ... ... ... ...
544 Yemen 2019 alcohol Total alcohol per capita consumption in adults... 0.00 Yemen_2019
545 Yemen 2019 life_ex Life Expectancy 67.80 Yemen_2019
546 Yemen 2019 u5mr Under-five mortality rate -6.10 Yemen_2019
547 Yemen 2020 life_ex Life Expectancy 67.90 Yemen_2020
548 Yemen 2020 u5mr Under-five mortality rate -5.95 Yemen_2020

549 rows × 6 columns

Directionality¶

You may notice that some of the Avg estimate values are negative. This is done to ensure that all indicators have the same directionality. We changed the directionality of the indicators so that a desirable movement of an estimate is in the positive direction and an undesirable movement is in the negative direction. Since we want alcohol consumption, obesity, and under-five mortality to decrease, their estimates are negated, so an improvement (a decrease in the raw value) appears as movement in the positive direction.
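
As an illustration (the file read in above already has this applied, so running it changes nothing), the sign flip might look like the sketch below, using the indicator abbreviations from gw_five:

# Illustrative sketch: flip the sign of "lower is better" indicators so that
# improvement is always in the positive direction. gw_five already stores the
# flipped values, so this operation is idempotent.
lower_is_better = ["alcohol", "obesitya", "u5mr"]

flip = gw_five["indicator_abbr"].isin(lower_is_better)
gw_five.loc[flip, "Avg estimate"] = -gw_five.loc[flip, "Avg estimate"].abs()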

Now that the health indicators are chosen, we will group agency funding by year and country. This will give us a dataframe of funding amounts by agency and year.

In [11]:
# Create dataframe of the funding amounts in each country per year 

masterdf = pd.DataFrame()

for i, country in enumerate(countries):
    per_country_df = pd.DataFrame()    
    grouped_df = country.groupby(["Funding Agency Acronym","Fiscal Year", "Country Name"])
    
    # Create a new column based on the grouped data
    per_country_df["Constant Dollar Amount"] = grouped_df["Constant Dollar Amount"].sum()
    per_country_df = pd.DataFrame(per_country_df)
    
    
    masterdf = pd.concat([masterdf, per_country_df])
    
masterdf.reset_index(inplace=True)

# Create column for a key to merge on based on country and year.
masterdf["combined"] = masterdf["Country Name"].astype(str) + "_" + masterdf["Fiscal Year"].astype(str)

masterdf
Out[11]:
Funding Agency Acronym Fiscal Year Country Name Constant Dollar Amount combined
0 ADF 2011 Ethiopia 58070 Ethiopia_2011
1 ADF 2012 Ethiopia 59155 Ethiopia_2012
2 ADF 2013 Ethiopia 91947 Ethiopia_2013
3 ADF 2014 Ethiopia 375920 Ethiopia_2014
4 ADF 2015 Ethiopia 210735 Ethiopia_2015
... ... ... ... ... ...
773 USAID 2018 Haiti 160983489 Haiti_2018
774 USAID 2019 Haiti 137906915 Haiti_2019
775 USAID 2020 Haiti 130047717 Haiti_2020
776 USAID 2021 Haiti 174539218 Haiti_2021
777 USAID 2022 Haiti 82660892 Haiti_2022

778 rows × 5 columns

In [12]:
# merge health indicator data with agency funding amounts.

merged_df = gw_five.merge(masterdf,how="left", on="combined")
merged_df.drop(["combined", "Fiscal Year", "Country Name"], axis=1, inplace=True)

merged_df.to_csv(os.path.join("..", "WHO_datasets", "merged_data.csv"))
In [13]:
merged_df["Funding Agency Acronym"].unique()
Out[13]:
array(['AGR', 'IAF', 'PC', 'STATE', 'TDA', 'USAID', 'DOL', 'TREAS', 'DOC',
       'DOE', 'HHS', 'FTC', 'MCC', 'EPA', 'ADF', 'DFC', 'DOT', 'NSF',
       'LIC', nan], dtype=object)
In [14]:
# Create country/year combinations to add all agencies and other missing indicators not in data

merged_df["country_year"] = merged_df["setting"].astype(str) + "_" + merged_df["year"].astype(str) 
merged_df["country_year_indicator"] = merged_df["setting"].astype(str) + "_" + merged_df["year"].astype(str) + "_" + merged_df["indicator_abbr"].astype(str)
merged_df["country_year_indicator_agency"] = merged_df["setting"].astype(str) + "_" + merged_df["year"].astype(str) + "_" + merged_df["indicator_abbr"].astype(str) + "_" + merged_df["Funding Agency Acronym"]
merged_df
Out[14]:
setting year indicator_abbr indicator_name Avg estimate Funding Agency Acronym Constant Dollar Amount country_year country_year_indicator country_year_indicator_agency
0 Ecuador 2002 alcohol Total alcohol per capita consumption in adults... -4.50 AGR 9241684.0 Ecuador_2002 Ecuador_2002_alcohol Ecuador_2002_alcohol_AGR
1 Ecuador 2002 alcohol Total alcohol per capita consumption in adults... -4.50 IAF 1948562.0 Ecuador_2002 Ecuador_2002_alcohol Ecuador_2002_alcohol_IAF
2 Ecuador 2002 alcohol Total alcohol per capita consumption in adults... -4.50 PC 3948070.0 Ecuador_2002 Ecuador_2002_alcohol Ecuador_2002_alcohol_PC
3 Ecuador 2002 alcohol Total alcohol per capita consumption in adults... -4.50 STATE 35779088.0 Ecuador_2002 Ecuador_2002_alcohol Ecuador_2002_alcohol_STATE
4 Ecuador 2002 alcohol Total alcohol per capita consumption in adults... -4.50 TDA 1638316.0 Ecuador_2002 Ecuador_2002_alcohol Ecuador_2002_alcohol_TDA
... ... ... ... ... ... ... ... ... ... ...
3116 Yemen 2020 life_ex Life Expectancy 67.90 STATE 82848136.0 Yemen_2020 Yemen_2020_life_ex Yemen_2020_life_ex_STATE
3117 Yemen 2020 life_ex Life Expectancy 67.90 USAID 248473546.0 Yemen_2020 Yemen_2020_life_ex Yemen_2020_life_ex_USAID
3118 Yemen 2020 u5mr Under-five mortality rate -5.95 AGR 224994779.0 Yemen_2020 Yemen_2020_u5mr Yemen_2020_u5mr_AGR
3119 Yemen 2020 u5mr Under-five mortality rate -5.95 STATE 82848136.0 Yemen_2020 Yemen_2020_u5mr Yemen_2020_u5mr_STATE
3120 Yemen 2020 u5mr Under-five mortality rate -5.95 USAID 248473546.0 Yemen_2020 Yemen_2020_u5mr Yemen_2020_u5mr_USAID

3121 rows × 10 columns

Since some agencies disbursed $0 in some years, those records do not exist in our dataset. We want them represented, so we will add records for every possible country/year/indicator/agency combination.

In [15]:
# Make set of all combination of possible rows containing a country's name, the year, indicator, and agency. 

all_possible_agencies = ['AGR', 'IAF', 'PC', 'STATE', 'TDA', 'USAID', 'DOL', 'TREAS', 'DOC',
       'DOE', 'HHS', 'FTC', 'MCC', 'EPA', 'ADF', 'DFC', 'DOT', 'NSF','LIC']

all_possible_indicators = ["alcohol", "hh", "life", "obesitya", "u5mr"]

all_combos = set()

for index, row in gw_five.iterrows():
    # from gw_five combined is the country and year
    value = row['combined']
    indic = row['indicator_abbr']
    for agency in all_possible_agencies:
        for indic in all_possible_indicators:
            all_combos.add(value + "_" + indic + "_" + agency)
            
all_combos = sorted(all_combos)
all_combos_df = pd.DataFrame(all_combos, columns=["cyia_string"])
all_combos_df
Out[15]:
cyia_string
0 Ecuador_2002_alcohol_ADF
1 Ecuador_2002_alcohol_AGR
2 Ecuador_2002_alcohol_DFC
3 Ecuador_2002_alcohol_DOC
4 Ecuador_2002_alcohol_DOE
... ...
12630 Yemen_2020_u5mr_PC
12631 Yemen_2020_u5mr_STATE
12632 Yemen_2020_u5mr_TDA
12633 Yemen_2020_u5mr_TREAS
12634 Yemen_2020_u5mr_USAID

12635 rows × 1 columns

In [16]:
# Temp dataframe used to add avg estimate column to merged_df later.
setting_indicator_estimate = pd.DataFrame(gw_five, columns=["setting", "year", "indicator_abbr", "Avg estimate"])
setting_indicator_estimate
Out[16]:
setting year indicator_abbr Avg estimate
0 Ecuador 2002 alcohol -4.50
1 Ecuador 2002 hh_wat_sm 56.50
2 Ecuador 2002 life_ex 73.90
3 Ecuador 2002 obesitya -13.00
4 Ecuador 2002 u5mr -2.65
... ... ... ... ...
544 Yemen 2019 alcohol 0.00
545 Yemen 2019 life_ex 67.80
546 Yemen 2019 u5mr -6.10
547 Yemen 2020 life_ex 67.90
548 Yemen 2020 u5mr -5.95

549 rows × 4 columns

In [17]:
# Used to generate indicator descriptions when only given abbreviations
dict_for_abbrs = {"alcohol": 'Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)', 'life': 'Life Expectancy', 
                  'obesitya': 'Obesity prevalence among adults (%)', 'hh':'Population using safely managed drinking water services (%)', 
                  'u5mr':'Under-five mortality rate'}

The cell below adds the average estimate of an indicator for a given year to every agency record, which will help with our model. It also lets us easily shift the data by one year, so that rows with 2003 health estimates carry 2002 constant dollar amounts: our model uses the previous year's funding to predict the current year's health indicators. Pairing current estimates with the prior year's funding is what we want for our model.

In [18]:
# Takes a while to run, so we already ran it and saved the finished dataset in merged_data.csv
# Change run_function to True to run this 

import warnings

# Suppress warnings
warnings.filterwarnings('ignore', category=DeprecationWarning) 
warnings.filterwarnings('ignore', category=FutureWarning) 

run_function = False

if(run_function):
    merged_df = merged_df.sort_values('country_year_indicator_agency', ascending=True)

    # Fill in dataframe using all possible setting, agency, year, indicators from all_combos_df 
    for index, row in all_combos_df.iterrows():
        value = row["cyia_string"]
        row_for_estimate = []

        if (merged_df["country_year_indicator_agency"] != value).all():
            # split long combo strings into their separate parts
            country_year_indicator_agency_list = value.split('_')

            value_name = country_year_indicator_agency_list[0]
            value_year = int(country_year_indicator_agency_list[1])
            value_indicator = country_year_indicator_agency_list[2]
            value_agency = country_year_indicator_agency_list[3]

            # get indicator name from abbr dict
            full_name_ind = dict_for_abbrs[value_indicator]

            # get value indicator for same country, year, indicator as current record
            value_estimate_series = merged_df.loc[(merged_df["setting"] == value_name) & (merged_df["year"] == value_year) & (merged_df["indicator_abbr"] == value_indicator) , "Avg estimate"]

            # if the value estimate series is empty, there is no data for this combination and it should be 0 for the year.
            if value_estimate_series.empty:
                value_estimate = 0
            else:
                value_estimate = value_estimate_series.values[0]

            # adds rows that are missing 
            merged_df = merged_df.append({'country_year_indicator_agency':value, "setting":value_name, 
                                          "year":value_year, "indicator_abbr":value_indicator, "indicator_name":full_name_ind, 
                                          "Avg estimate":value_estimate,"Funding Agency Acronym":value_agency, 
                                          "Constant Dollar Amount": 0}, ignore_index =True)
else:
    merged_df = pd.read_csv(os.path.join("..", "WHO_datasets", "merged_data.csv"))
    print("Function skipped, change run_function variable as needed")
        
Function skipped, change run_function variable as needed
In [19]:
# saves dataframe as csv
merged_df.to_csv(os.path.join("..", "WHO_datasets", "merged_data.csv"))
merged_df
Out[19]:
Unnamed: 0 setting year indicator_abbr indicator_name Avg estimate Funding Agency Acronym Constant Dollar Amount
0 0 Ecuador 2002 alcohol Total alcohol per capita consumption in adults... -4.50 AGR 9241684.0
1 1 Ecuador 2002 alcohol Total alcohol per capita consumption in adults... -4.50 IAF 1948562.0
2 2 Ecuador 2002 alcohol Total alcohol per capita consumption in adults... -4.50 PC 3948070.0
3 3 Ecuador 2002 alcohol Total alcohol per capita consumption in adults... -4.50 STATE 35779088.0
4 4 Ecuador 2002 alcohol Total alcohol per capita consumption in adults... -4.50 TDA 1638316.0
... ... ... ... ... ... ... ... ...
3116 3116 Yemen 2020 life_ex Life Expectancy 67.90 STATE 82848136.0
3117 3117 Yemen 2020 life_ex Life Expectancy 67.90 USAID 248473546.0
3118 3118 Yemen 2020 u5mr Under-five mortality rate -5.95 AGR 224994779.0
3119 3119 Yemen 2020 u5mr Under-five mortality rate -5.95 STATE 82848136.0
3120 3120 Yemen 2020 u5mr Under-five mortality rate -5.95 USAID 248473546.0

3121 rows × 8 columns

After setting up our dataset for the model, we imputed indicator values that were missing for some years, typically replacing a missing estimate with the previous year's estimate. If indicator data for a particular country was missing for all years, we kept the average estimate at 0. Additionally, we shifted the indicators forward a year to capture how money from previous years (starting in 2002) impacts later years.

With this imputation and the one-year shift of the constant dollar amounts, our dataset is now ready for a model.
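
That preparation was done outside this notebook (the result is read from final_dataset.csv below), but a minimal sketch of the forward-fill imputation, assuming the merged_data.csv column names, might look like:

# Hypothetical sketch of the imputation described above; the actual
# final_dataset.csv was prepared separately.
group_cols = ["setting", "indicator_abbr", "Funding Agency Acronym"]
imputed = merged_df.sort_values(group_cols + ["year"]).copy()

# Replace a missing estimate with the previous year's estimate; indicators
# with no data at all for a country stay at 0.
imputed["Avg estimate"] = (imputed.groupby(group_cols)["Avg estimate"]
                                  .ffill()
                                  .fillna(0))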

In [20]:
final_dataset = pd.read_csv(os.path.join("..","WHO_datasets", "final_dataset.csv"))
final_dataset
Out[20]:
setting Previous year indicator_abbr indicator_name Avg estimate Funding Agency Acronym Constant Dollar Amount
0 Ecuador 2003 alcohol Total alcohol per capita consumption in adults... -4.00 ADF 0
1 Ecuador 2003 alcohol Total alcohol per capita consumption in adults... -4.00 AGR 717869
2 Ecuador 2003 alcohol Total alcohol per capita consumption in adults... -4.00 DFC 0
3 Ecuador 2003 alcohol Total alcohol per capita consumption in adults... -4.00 DOC 0
4 Ecuador 2003 alcohol Total alcohol per capita consumption in adults... -4.00 DOE 0
... ... ... ... ... ... ... ...
10332 Yemen 2020 u5mr Under-five mortality rate -5.95 PC 0
10333 Yemen 2020 u5mr Under-five mortality rate -5.95 STATE 82848136
10334 Yemen 2020 u5mr Under-five mortality rate -5.95 TDA 0
10335 Yemen 2020 u5mr Under-five mortality rate -5.95 TREAS 0
10336 Yemen 2020 u5mr Under-five mortality rate -5.95 USAID 248473546

10337 rows × 7 columns

Exploratory Data Analysis

Visualizing Total Foreign Aid Given by Year¶

First, we're going to quickly make another column for the purpose of scaling the data. Then, we're going to plot the amount of foreign aid received by country. This exploration will help us determine which of the chosen countries receive the most aid. Make sure to pay attention to the axis labels!

In [21]:
for country in countries:
    # Scale dollar amounts to millions of dollars for readability
    country['Constant Dollar Amount ($1,000,000s)'] = country['Constant Dollar Amount'] / 1_000_000

    # Keep only the last two digits of the fiscal year for compact axis labels (ex. 2020 --> 20)
    country['Year Abv.'] = country['Fiscal Year'] % 100
In [22]:
# Total foreign aid disbursed per fiscal year for each country

country_by_fiscal_yr = []
for i, country  in enumerate(countries): 
    # Group by year to get the total money disbursed per fiscal year.
    country_by_fiscal_yr.append(pd.DataFrame(country.groupby('Year Abv.')['Constant Dollar Amount ($1,000,000s)'].sum()))    
    country_by_fiscal_yr[i].reset_index(inplace=True) # Reset index for plotting
    
# 0: Ethiopia
# 1: Moldova
# 2: Yemen
# 3: Ecuador
# 4: El Salvador
# 5: Haiti
In [23]:
fig1, axes1 = plt.subplots(1, 3, figsize=(15,5) , sharey= True)
fig1.tight_layout(h_pad= 5)

sns.barplot(country_by_fiscal_yr[0], x= 'Year Abv.', y= 'Constant Dollar Amount ($1,000,000s)', ax= axes1[0], width= 1, errorbar= None).set_title('Ethiopia')
sns.barplot(country_by_fiscal_yr[5], x= 'Year Abv.', y= 'Constant Dollar Amount ($1,000,000s)', ax= axes1[1], width= 1, errorbar= None).set_title('Haiti')
sns.barplot(country_by_fiscal_yr[2], x= 'Year Abv.', y= 'Constant Dollar Amount ($1,000,000s)', ax= axes1[2], width= 1, errorbar= None).set_title('Yemen')


fig2, axes2 = plt.subplots(1, 3, figsize=(15,5) , sharey= True)
fig2.tight_layout(h_pad= 5)

sns.barplot(country_by_fiscal_yr[1], x= 'Year Abv.', y= 'Constant Dollar Amount ($1,000,000s)', ax= axes2[0], width= 1, errorbar= None).set_title('Moldova')
sns.barplot(country_by_fiscal_yr[3], x= 'Year Abv.', y= 'Constant Dollar Amount ($1,000,000s)', ax= axes2[1], width= 1, errorbar= None).set_title('Ecuador')
sns.barplot(country_by_fiscal_yr[4], x= 'Year Abv.', y= 'Constant Dollar Amount ($1,000,000s)', ax= axes2[2], width= 1, errorbar= None).set_title('El Salvador')
Out[23]:
Text(0.5, 1.0, 'El Salvador')

Graph 1: Constant Dollar Amount in $1,000,000 of Foreign Aid by Fiscal Year. The countries are separated in groups of 3 for scale purposes.

Analysis of Foreign Aid Amounts¶

The countries chosen represent two distinct groups: one that receives a very large amount of foreign aid and one that receives less, but still a significant amount. There is a noticeable upward trend in Ethiopia and a peak in Haiti around 2010. The upward trend in Ethiopia is most likely due to a combination of drought, famine, civil unrest, and civil war. Similarly, Yemen has received a large amount of US aid in recent years due to the outbreak of the Yemeni Civil War in late 2014, which marks the start of its upward trend.

Visualizing Foreign Aid by Agency¶

Next, let's examine which agencies give the most aid. This will allow us to determine which category, health or economic, the country receives the most aid in.

In [24]:
# Get the top 3 agencies by money spent 

country_by_funding_agency = []
for i, country  in enumerate(countries):
    # Groupby funding agency to get money spent per agency.
    country_by_funding_agency.append(pd.DataFrame(country.groupby('Funding Agency Acronym')['Constant Dollar Amount ($1,000,000s)'].sum()))
    country_by_funding_agency[i].reset_index(inplace=True) # Reset index for plotting
    
    # Sort and locate the top 3 agencies by money spent
    country_by_funding_agency[i] = country_by_funding_agency[i].sort_values(ascending= False, by= 'Constant Dollar Amount ($1,000,000s)')
    country_by_funding_agency[i] = country_by_funding_agency[i].iloc[0:3] 
In [25]:
# Plotting the top 3 agencies by money spent
fig, axes = plt.subplots(1, 3, figsize=(15,5), sharey= True)


sns.barplot(data= country_by_funding_agency[0] , x= 'Funding Agency Acronym', y= 'Constant Dollar Amount ($1,000,000s)', hue= 'Funding Agency Acronym' , ax= axes[0]).set_title('Ethiopia')
sns.barplot(data= country_by_funding_agency[5] , x= 'Funding Agency Acronym', y= 'Constant Dollar Amount ($1,000,000s)', hue= 'Funding Agency Acronym' , ax= axes[1]).set_title('Haiti')
sns.barplot(data= country_by_funding_agency[2] , x= 'Funding Agency Acronym', y= 'Constant Dollar Amount ($1,000,000s)', hue= 'Funding Agency Acronym' , ax= axes[2]).set_title('Yemen')


fig1, axes1 = plt.subplots(1, 3, figsize=(15,5), sharey= True)
sns.barplot(data= country_by_funding_agency[1] , x= 'Funding Agency Acronym', y= 'Constant Dollar Amount ($1,000,000s)', hue= 'Funding Agency Acronym' ,ax= axes1[0]).set_title('Moldova')
sns.barplot(data= country_by_funding_agency[3] , x= 'Funding Agency Acronym', y= 'Constant Dollar Amount ($1,000,000s)', hue= 'Funding Agency Acronym' ,ax= axes1[1]).set_title('Ecuador')
sns.barplot(data= country_by_funding_agency[4] , x= 'Funding Agency Acronym', y= 'Constant Dollar Amount ($1,000,000s)', hue= 'Funding Agency Acronym' ,ax= axes1[2]).set_title('El Salvador')
Out[25]:
Text(0.5, 1.0, 'El Salvador')

Graph 2: Constant Dollar Amount in millions vs Funding Agency. The countries are separated in groups of 3 for scale purposes.

Analysis of Aid per Agency¶

We see four main aid agencies:

  • USAID: US Agency for International Development
  • AGR: Department of Agriculture
  • STATE: Department of State
  • MCC: Millennium Challenge Corporation (for more info: see https://www.mcc.gov/)

See appendix for more information

GW13 EDA¶

To highlight and explore the GW13 dataset, we are going to show how the five indicators change over time for Ecuador.

In [26]:
alc_dataset = final_dataset[final_dataset["indicator_abbr"] == "alcohol"]
alc_dataset = alc_dataset[alc_dataset["setting"] == "Ecuador"]

hh_dataset = final_dataset[final_dataset["indicator_abbr"] == "hh"]
hh_dataset = hh_dataset[hh_dataset["setting"] == "Ecuador"]

life_dataset = final_dataset[final_dataset["indicator_abbr"] == "life"]
life_dataset = life_dataset[life_dataset["setting"] == "Ecuador"]

obesity_dataset = final_dataset[final_dataset["indicator_abbr"] == "obesitya"]
obesity_dataset = obesity_dataset[obesity_dataset["setting"] == "Ecuador"]

U5_dataset = final_dataset[final_dataset["indicator_abbr"] == "u5mr"]
U5_dataset = U5_dataset[U5_dataset["setting"] == "Ecuador"]


fig3, axes3 = plt.subplots(1, 5, figsize=(15,8), sharey= False)
fig3.tight_layout()
    
sns.lineplot(data= alc_dataset, x = "Previous year", y= "Avg estimate", ax=axes3[0]).set_title('Alcohol Consumption (litres per capita)')
sns.lineplot(data= hh_dataset, x = "Previous year", y= "Avg estimate", ax=axes3[1]).set_title('Access to Clean Drinking Water')
sns.lineplot(data= life_dataset, x = "Previous year", y= "Avg estimate", ax=axes3[2]).set_title('Life Expectancy')
sns.lineplot(data= obesity_dataset, x = "Previous year", y= "Avg estimate", ax=axes3[3]).set_title('Obesity Rate')
sns.lineplot(data= U5_dataset, x = "Previous year", y= "Avg estimate", ax=axes3[4]).set_title('Under Five Mortality Rate')
Out[26]:
Text(0.5, 1.0, 'Under Five Mortality Rate')

Graph 3: Average estimate of each indicator versus the year. Differentiated by indicator.

Since the early 2000s, all indicators except obesity have trended upward, meaning that by these measures Ecuador's overall health should be improving. Next, we will look at the life expectancy for all our countries over time.

In [27]:
life_df = final_dataset[final_dataset["indicator_abbr"] == "life"]

fig4, ax4 = plt.subplots(figsize=(10,6))

ax4.set_xlabel('Year', fontsize=16)
ax4.set_ylabel('Life Expectancy (in years)', fontsize=12)

sns.scatterplot(data= life_df, x = "Previous year", y= "Avg estimate", hue="setting", ax=ax4).set_title('Life Expectancy by Country')
Out[27]:
Text(0.5, 1.0, 'Life Expectancy by Country')

Graph 4: Life Expectancy versus year. Points colored by country.

Generally, life expectancies have increased over time, with an outlier in Haiti corresponding to the devastating 2010 earthquake, estimated to have killed between 100,000 and 300,000 people, which lowered the average life expectancy significantly.


Modeling and Analysis

Now that we have collected, cleaned, transformed, and explored the data, it is time to answer some of the questions we set out to answer.

Those questions are:

  • Does a marginal dollar given to one agency improve health indicators more or less than a dollar given to a different agency?
  • Since US agencies have budgets set by Congress, how can they best utilize a limited budget for maximum impact?

To answer these questions, we are going to turn to machine learning to model our data and predict new values. We are going to use a simple linear regression to fit our model. We are working under the assumption that increased aid leads to general health improvement, which is why we chose a simple linear regression as our predictor of choice.

We fit a separate model for each indicator, based on our earlier observations and under the assumption that the chosen indicators have no predictive value for one another.
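
Because the same fit-and-evaluate block is repeated for each indicator below, it could also be wrapped in a small helper. The sketch below mirrors the feature and target choices used in the rest of this section; the function name is ours, not part of the original notebook.

# Sketch: a reusable helper for the per-indicator regressions that follow.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def fit_indicator_model(df, indicator_name, test_size=0.2, random_state=0):
    subset = df[df["indicator_name"] == indicator_name]
    X = pd.get_dummies(subset[["setting", "Previous year", "indicator_name",
                               "Funding Agency Acronym", "Constant Dollar Amount"]])
    y = subset[["Avg estimate"]]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    model = LinearRegression().fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    r2 = model.score(X_test, y_test)
    return model, X.columns, mae, r2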

In [28]:
# IMPORTS for model
import sklearn
import sklearn.datasets
assert sklearn.__version__ >= "0.20"

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

First, we're going to construct the X and y data for the model. The features consist of the country ('setting'), the previous year (which the constant dollar amount corresponds to), the indicator name in question, the agency funding the aid, and the amount contributed each year. Our target value will be the average estimate gathered from the GW13 dataset.

In [29]:
final_dataset['indicator_name'].unique()
Out[29]:
array(['Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)',
       'Population using safely managed drinking water services (%)',
       'Life Expectancy', 'Obesity prevalence among adults (%)',
       'Under-five mortality rate'], dtype=object)

Under-Five Mortality¶

In [30]:
model_df_under5 = final_dataset[final_dataset['indicator_name'] == 'Under-five mortality rate']
In [31]:
X_under5 = model_df_under5[['setting' ,'Previous year', 'indicator_name', 'Funding Agency Acronym', 'Constant Dollar Amount']]

X_under5 = pd.get_dummies(X_under5)

y_under5 = model_df_under5[['Avg estimate']]

Next, we're going to create the train, validation, and test sets. After that, we're going to fit and predict our model on the final test set. Previous adjustments to the code were done on the validation sets.

In [32]:
X_train_full_5, X_test_5, y_train_full_5, y_test_5 = train_test_split(X_under5, y_under5, test_size=0.2, random_state=0) 
#X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=0)

model_under5 = LinearRegression()
model_under5.fit(X_train_full_5, y_train_full_5)

predicts_under5 = model_under5.predict(X_test_5)

mae_5 = mean_absolute_error(predicts_under5, y_test_5)
r_squared_5  = model_under5.score(X_test_5 , y_test_5 )
print('R-squared value: ', r_squared_5 )
print('Model MAE: ', mae_5)
R-squared value:  0.7903251858464739
Model MAE:  0.7939630674179491

Since the estimate is measured as a percentage, we can see from the MAE that our under-five mortality model is on average about 0.8 percentage points off from the actual estimate.

Analyzing our Coefficients¶

In [33]:
print('model intercept:', model_under5.intercept_)
coef = model_under5.coef_
coef_series = pd.DataFrame(coef, columns= X_under5.columns).iloc[0]

coef_series[9:]
model intercept: [-334.99343318]
Out[33]:
Funding Agency Acronym_ADF     -0.054791
Funding Agency Acronym_AGR      0.236369
Funding Agency Acronym_DFC     -0.131322
Funding Agency Acronym_DOC      0.046639
Funding Agency Acronym_DOE     -0.073089
Funding Agency Acronym_DOL     -0.095695
Funding Agency Acronym_DOT     -0.070006
Funding Agency Acronym_EPA     -0.096783
Funding Agency Acronym_FTC      0.012000
Funding Agency Acronym_HHS     -0.028213
Funding Agency Acronym_IAF      0.060164
Funding Agency Acronym_LIC     -0.074186
Funding Agency Acronym_MCC      0.053703
Funding Agency Acronym_NSF     -0.053509
Funding Agency Acronym_PC       0.111324
Funding Agency Acronym_STATE    0.154022
Funding Agency Acronym_TDA     -0.059771
Funding Agency Acronym_TREAS   -0.038698
Funding Agency Acronym_USAID    0.101842
Name: 0, dtype: float64

Each coefficient (beta) represents the difference in the predicted value of y (the average estimate of under-five mortality) for each one-unit difference in the corresponding feature, holding the other features constant. In essence, if X increases or decreases by one unit, the predicted average estimate of under-five mortality changes by beta units.
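
As a quick sanity check (a sketch, not part of the original analysis), we can compare predictions for two feature rows that are identical except for which agency dummy is set; the difference in predicted estimates equals the difference between the two agency coefficients.

# Sketch: verify the dummy-coefficient interpretation above.
row_agr = X_test_5.iloc[[0]].copy()
row_state = X_test_5.iloc[[0]].copy()

for col in X_test_5.columns:
    if col.startswith("Funding Agency Acronym_"):
        row_agr[col] = 0
        row_state[col] = 0

row_agr["Funding Agency Acronym_AGR"] = 1
row_state["Funding Agency Acronym_STATE"] = 1

# Difference in predictions equals coef(AGR) - coef(STATE)
print(model_under5.predict(row_agr) - model_under5.predict(row_state))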

Most impactful agency

In [34]:
coef_series.loc[coef_series.index[9:]].idxmax()
Out[34]:
'Funding Agency Acronym_AGR'

Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)¶

In [35]:
model_df_alc = final_dataset[final_dataset['indicator_name'] == 'Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)']
In [36]:
X_alc = model_df_alc[['setting' ,'Previous year', 'indicator_name', 'Funding Agency Acronym', 'Constant Dollar Amount']]

X_alc = pd.get_dummies(X_alc)

y_alc = model_df_alc[['Avg estimate']]

Next, we're going to create the train, validation, and test sets. After that, we're going to fit and predict our model on the final test set. Previous adjustments to the code were done on the validation sets.

In [37]:
X_train_full_alc, X_test_alc, y_train_full_alc, y_test_alc = train_test_split(X_alc, y_alc, test_size=0.2, random_state=0) 

model_alc = LinearRegression()
model_alc.fit(X_train_full_alc, y_train_full_alc)

predicts_alc = model_alc.predict(X_test_alc)

mae_alc = mean_absolute_error(predicts_alc, y_test_alc)
r_squared_alc  = model_alc.score(X_test_alc , y_test_alc )
print('R-squared value: ', r_squared_alc )

print('Model MAE: ', mae_alc)
R-squared value:  0.8469852802786237
Model MAE:  0.7150067905134573

Since the estimate is measured in litres of pure alcohol per capita, we can see from the MAE that our alcohol consumption model is on average about 0.7 litres off from the actual estimate.

Analyzing our Coefficients¶

In [38]:
print('model intercept:', model_alc.intercept_)
coef_alc = model_alc.coef_
coef_series_alc = pd.DataFrame(coef_alc, columns= X_alc.columns).iloc[0]

coef_series_alc[9:]
model intercept: [-98.99754148]
Out[38]:
Funding Agency Acronym_ADF     -0.139566
Funding Agency Acronym_AGR      0.039903
Funding Agency Acronym_DFC      0.047342
Funding Agency Acronym_DOC      0.044578
Funding Agency Acronym_DOE     -0.128638
Funding Agency Acronym_DOL      0.029335
Funding Agency Acronym_DOT      0.060120
Funding Agency Acronym_EPA     -0.041348
Funding Agency Acronym_FTC      0.047553
Funding Agency Acronym_HHS     -0.166892
Funding Agency Acronym_IAF      0.015767
Funding Agency Acronym_LIC      0.114250
Funding Agency Acronym_MCC      0.085641
Funding Agency Acronym_NSF     -0.122542
Funding Agency Acronym_PC       0.003367
Funding Agency Acronym_STATE    0.098357
Funding Agency Acronym_TDA     -0.032193
Funding Agency Acronym_TREAS    0.005792
Funding Agency Acronym_USAID    0.039176
Name: 0, dtype: float64

Most impactful agency

In [39]:
coef_series_alc.loc[coef_series_alc.index[9:]].idxmax()
Out[39]:
'Funding Agency Acronym_LIC'

Life Expectancy¶

In [40]:
model_df_life = final_dataset[final_dataset['indicator_name'] == 'Life Expectancy']
In [41]:
X_life = model_df_life[['setting' ,'Previous year', 'indicator_name', 'Funding Agency Acronym', 'Constant Dollar Amount']]

X_life = pd.get_dummies(X_life)

y_life = model_df_life[['Avg estimate']]

Next, we're going to create the train, validation, and test sets. After that, we're going to fit and predict our model on the final test set. Previous adjustments to the code were done on the validation sets.

In [42]:
X_train_full_life, X_test_life, y_train_full_life, y_test_life = train_test_split(X_life, y_life, test_size=0.2, random_state=0) 

model_life = LinearRegression()
model_life.fit(X_train_full_life, y_train_full_life)

predicts_life = model_life.predict(X_test_life)

mae_life = mean_absolute_error(predicts_life, y_test_life)
r_squared_life = model_life.score(X_test_life, y_test_life)
print('R-squared value: ', r_squared_life)

print('Model MAE: ', mae_life)
R-squared value:  0.8011108446414228
Model MAE:  1.4176280750898032

Since the estimate is measured in years, we can see from the MAE that our life expectancy model is on average about 1.4 years off from the actual estimate.

Analyzing our Coefficients¶

In [43]:
print('model intercept:', model_life.intercept_)
coef_life = model_life.coef_
coef_series_life = pd.DataFrame(coef_life, columns= X_life.columns).iloc[0]

coef_series_life[9:]
model intercept: [21715.70476767]
Out[43]:
Funding Agency Acronym_ADF      2852.754682
Funding Agency Acronym_AGR      2852.471314
Funding Agency Acronym_DFC      2852.574361
Funding Agency Acronym_DOC      2852.488663
Funding Agency Acronym_DOE      2852.626195
Funding Agency Acronym_DOL      2852.913651
Funding Agency Acronym_DOT      2852.419819
Funding Agency Acronym_EPA      2852.835268
Funding Agency Acronym_FTC      2852.541626
Funding Agency Acronym_HHS      2852.480125
Funding Agency Acronym_IAF      2852.443298
Funding Agency Acronym_LIC      2852.390947
Funding Agency Acronym_MCC      2852.415794
Funding Agency Acronym_NSF      2852.458384
Funding Agency Acronym_PC       2852.545699
Funding Agency Acronym_STATE    2852.365797
Funding Agency Acronym_TDA      2852.411273
Funding Agency Acronym_TREAS    2852.500019
Funding Agency Acronym_USAID    2852.488978
Name: 0, dtype: float64

Most impactful agency

In [44]:
coef_series_life.loc[coef_series_life.index[9:]].idxmax()
Out[44]:
'Funding Agency Acronym_DOL'

Obesity prevalence among adults¶

In [45]:
model_df_obesity = final_dataset[final_dataset['indicator_name'] == 'Obesity prevalence among adults (%)']
In [46]:
X_obesity = model_df_obesity[['setting' ,'Previous year', 'indicator_name', 'Funding Agency Acronym', 'Constant Dollar Amount']]

X_obesity = pd.get_dummies(X_obesity)

y_obesity = model_df_obesity[['Avg estimate']]

Next, we're going to create the train, validation, and test sets. After that, we're going to fit and predict our model on the final test set. Previous adjustments to the code were done on the validation sets.

In [47]:
X_train_full_obesity , X_test_obesity , y_train_full_obesity , y_test_obesity  = train_test_split(X_obesity, y_obesity, test_size=0.2, random_state=0) 

model_obesity = LinearRegression()
model_obesity.fit(X_train_full_obesity, y_train_full_obesity)

predicts_obesity = model_obesity.predict(X_test_obesity)

mae_obesity = mean_absolute_error(predicts_obesity, y_test_obesity)
r_squared_obesity = model_obesity.score(X_test_obesity, y_test_obesity)
print('R-squared value: ', r_squared_obesity)
print('Model MAE: ', mae_obesity)
R-squared value:  0.978766769693189
Model MAE:  0.6402113938263317

Since the estimate is measured as a percentage, we can see from the MAE that our obesity model is on average about 0.64 percentage points off from the actual estimate.

Analyzing our Coefficients¶

In [48]:
print('model intercept:', model_obesity.intercept_)
coef_obesity = model_obesity.coef_
coef_series_obesity = pd.DataFrame(coef_obesity, columns= X_obesity.columns).iloc[0]

coef_series_obesity[9:]
model intercept: [757.55612393]
Out[48]:
Funding Agency Acronym_ADF     -0.007747
Funding Agency Acronym_AGR      0.092102
Funding Agency Acronym_DFC      0.034112
Funding Agency Acronym_DOC     -0.013388
Funding Agency Acronym_DOE     -0.027756
Funding Agency Acronym_DOL      0.002047
Funding Agency Acronym_DOT     -0.076675
Funding Agency Acronym_EPA     -0.096046
Funding Agency Acronym_FTC      0.016318
Funding Agency Acronym_HHS     -0.000027
Funding Agency Acronym_IAF     -0.017796
Funding Agency Acronym_LIC     -0.032623
Funding Agency Acronym_MCC      0.024782
Funding Agency Acronym_NSF     -0.038777
Funding Agency Acronym_PC       0.016032
Funding Agency Acronym_STATE    0.085456
Funding Agency Acronym_TDA     -0.072786
Funding Agency Acronym_TREAS    0.044613
Funding Agency Acronym_USAID    0.068160
Name: 0, dtype: float64

Most impactful agency

In [49]:
coef_series_obesity.loc[coef_series_obesity.index[9:]].idxmax()
Out[49]:
'Funding Agency Acronym_AGR'

Population using safely managed drinking water services (%)¶

In [50]:
model_df_water = final_dataset[final_dataset['indicator_name'] == 'Population using safely managed drinking water services (%)']
In [51]:
X_water = model_df_water[['setting' ,'Previous year', 'indicator_name', 'Funding Agency Acronym', 'Constant Dollar Amount']]

X_water = pd.get_dummies(X_water)

y_water = model_df_water[['Avg estimate']]

Next, we're going to create the train, validation, and test sets. After that, we're going to fit and predict our model on the final test set. Previous adjustments to the code were done on the validation sets.

In [52]:
X_train_full_water, X_test_water, y_train_full_water, y_test_water = train_test_split(X_water, y_water, test_size=0.2, random_state=0) 

model_water = LinearRegression()
model_water.fit(X_train_full_water, y_train_full_water)

predicts_water = model_water.predict(X_test_water)

mae_water = mean_absolute_error(predicts_water, y_test_water)
r_squared_water = model_water.score(X_test_water, y_test_water)

print('R-squared value: ', r_squared_water)
print('Model MAE: ', mae_water)
R-squared value:  0.9982495503032212
Model MAE:  0.7150255979592951

Since the estimate is measured as a percentage, we can see from the MAE that our safe drinking water model is on average about 0.7 percentage points off from the actual estimate.

Analyzing our Coefficients¶

In [53]:
print('model intercept:', model_water.intercept_)
coef_water = model_water.coef_
coef_series_water = pd.DataFrame(coef_water, columns= X_water.columns).iloc[0]

coef_series_water[9:]
model intercept: [-194636.53814263]
Out[53]:
Funding Agency Acronym_ADF      33200.480720
Funding Agency Acronym_AGR      33200.411869
Funding Agency Acronym_DFC      33200.440314
Funding Agency Acronym_DOC      33200.427906
Funding Agency Acronym_DOE      33200.467663
Funding Agency Acronym_DOL      33200.437283
Funding Agency Acronym_DOT      33200.356895
Funding Agency Acronym_EPA      33200.360992
Funding Agency Acronym_FTC      33200.463437
Funding Agency Acronym_HHS      33200.533920
Funding Agency Acronym_IAF      33200.369219
Funding Agency Acronym_LIC      33200.430396
Funding Agency Acronym_MCC      33200.399428
Funding Agency Acronym_NSF      33200.370769
Funding Agency Acronym_PC       33200.467869
Funding Agency Acronym_STATE    33200.459103
Funding Agency Acronym_TDA      33200.437018
Funding Agency Acronym_TREAS    33200.456665
Funding Agency Acronym_USAID    33200.407146
Name: 0, dtype: float64

The strange nature of the coefficients can most likely be attributed to the lack of data from a few of the countries.
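
A quick way to check this (a sketch using the final dataset loaded above) is to count how many drinking-water observations each country actually contributes:

# Sketch: count the water-indicator rows per country in final_dataset.
print(final_dataset[final_dataset["indicator_abbr"] == "hh"]
      .groupby("setting")["Avg estimate"]
      .count())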

Most impactful agency

In [54]:
coef_series_water.loc[coef_series_water.index[9:]].idxmax()
Out[54]:
'Funding Agency Acronym_HHS'

Analysis¶

In [55]:
from tabulate import tabulate

print(tabulate([['Health Indicator','Most impactful agencies', 'Model Error', 'R-Squared Value'], 
        ['Under-five mortality rate', coef_series.loc[coef_series.index[9:]].idxmax(), mae_5, r_squared_5],
        ['Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)',coef_series_alc.loc[coef_series_alc.index[9:]].idxmax(), mae_alc, r_squared_alc],
        [ 'Life Expectancy', coef_series_life.loc[coef_series_life.index[9:]].idxmax(), mae_life, r_squared_life],
        ['Obesity prevalence among adults (%)', coef_series_obesity.loc[coef_series_obesity.index[9:]].idxmax(), mae_obesity, r_squared_obesity],
        ['Population using safely managed drinking water services (%)', coef_series_water.loc[coef_series_water.index[9:]].idxmax(), mae_water, r_squared_water]], 
        headers='firstrow', tablefmt= 'simple_grid'
    ))
Health Indicator                                                                  Most impactful agencies       Model Error    R-Squared Value
--------------------------------------------------------------------------------  --------------------------  -------------  -----------------
Under-five mortality rate                                                         Funding Agency Acronym_AGR       0.793963           0.790325
Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)  Funding Agency Acronym_LIC       0.715007           0.846985
Life Expectancy                                                                   Funding Agency Acronym_DOL       1.41763            0.801111
Obesity prevalence among adults (%)                                               Funding Agency Acronym_AGR       0.640211           0.978767
Population using safely managed drinking water services (%)                       Funding Agency Acronym_HHS       0.715026           0.99825

Conclusion

The United States is the number one distributor of foreign aid in the world. It is therefore extremely important to determine how foreign aid is being used so that it can be distributed to achieve maximum impact.

Have we gathered enough evidence to answer the questions we set out to answer?

  • How does US foreign aid impact the overall health of a country?
  • Does a marginal dollar given to one agency improve health indicators more or less than a dollar given to a different agency?
  • Since US agencies have budgets set by Congress, how can they best utilize a limited budget for maximum impact?

How does US foreign aid impact the overall health of a country?¶

In all predictions for all indicators, we have high to extremely high R-squared values, indicating a strong relationship between foreign aid and health outcomes.

Does a marginal dollar given to one agency improve health indicators more or less than a dollar given to a different agency?¶

As described in the model analysis section, each indicator has a corresponding agency with the greatest impact on it. For example, giving the Department of Agriculture one more dollar to combat under-five mortality would be more effective than giving that dollar to any other agency.

Since US agencies have budgets set by Congress, how can they best utilize a limited budget for maximum impact?¶

As we can see, performing this type of analysis effectively determines which US agency should spearhead health improvement initiatives.

Final Notes¶

We can see that certain agencies are better equipped to effect change in health outcomes. We hope that this data can serve as a preliminary way of exploring how the United States can effectively distribute foreign aid, and that this analysis can highlight inefficiencies or spark conversations about where foreign aid could be most effective. We'd like to thank Professor Mattei for the guidance and many office hours spent "aiding" us in our efforts.


Appendix

Funding Agency Acronyms¶

AGR: Department of Agriculture

USAID: US Agency for International Development

STATE: Department of State

MCC: Millennium (for more info: https://www.mcc.gov/)

HHS: Department of Health and Human Services

LIC: Open World Leadership Center (for more info: https://www.openworld.gov/)

ADF: US African Development Foundation

DFC: US International Development Finance Corporation

DOC: Department of Commerce

DOE: Department of Energy

DOL: Department of Labor

DOT: Department of Transportation

EPA: US Environmental Protection Agency

FTC: US Federal Trade Commission

IAF: US Inter-American Foundation (for more info: https://www.iaf.gov/)

NSF: US National Science Foundation

PC : Peace Corps

TDA: US Trade and Development Agency

TREAS: Department of the Treasury

Indicators Used¶

Total alcohol per capita consumption in adults aged 15+ (litres of pure alcohol)

Population using safely managed drinking water services (%)

Life Expectancy

Obesity prevalence among adults (%)

Under-five mortality rate
