The COVID-19 pandemic has affected us all, for better or worse, yet Americans hold vastly different opinions about it; COVID-19 has become a political issue. So, I decided to dig deeper into that relationship, using techniques from data science to build a machine learning algorithm that makes predictions based upon this information. A high accuracy will signal a likely relationship between a county's demographics and COVID-19 statistics and its political affiliation.
In this notebook, I will create a k-NN (k-nearest neighbors) classifier whose features are COVID-19 statistics for a given county along with demographic information about that county. The classifier will return a prediction for who won the 2020 Presidential election in that county. The techniques I will use in this project include, but are not limited to: Markdown, $\LaTeX$, importing libraries and .csv files, table manipulation, data cleaning and filtering, defining functions, statistical distribution analysis, for loops, and basic machine learning.
Note: I will use first-person-plural pronouns, such as "we" and "our". In doing so, I refer only to myself and the reader.
Definitions:
Classifiers: A classifier in machine learning is an algorithm that automatically orders or categorizes data into one or more of a set of “classes.” One of the most common examples is an email classifier that scans emails to filter them by class label: Spam or Not Spam.
Per Capita: Per capita means per person. It is a Latin term that translates to "by the head." It is commonly used in statistics, economics, and business to report an average per person. For example, if a county of 50,000 residents records 5,000 cases, it has 5,000/50,000 = 0.1 cases per capita, or 10 cases per 100 residents.
Note: For further clarification of any terms in this project or for information on any basic topic in data science, please refer to Computational and Inferential Thinking: The Foundations of Data Science, by Ani Adhikari and John DeNero, with contributions from David Wagner and Henry Milner.
Here, we import all of the libraries necessary to complete this project.
from datascience import *
import numpy as np
import math
import scipy.stats
import pandas as pd
from IPython.display import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
Next, we will load three .csv files with the information needed to build the classifier (dropping unnecessary data), acquired from the New York Times; Politico and the New York Times; and the Census Bureau, respectively. We will then import a .csv file from the EPA to validate the population data from our other tables. This validation will be done by randomly selecting FIPS numbers and checking whether the data from our two sources on population look plausible; we expect to see a small difference, because the data are from two different years (2015 and 2017).
covid_county = Table().read_table('us-counties.csv').drop(0,6,7,8,9)
covid_county.show(5)
county | state | fips | cases | deaths |
---|---|---|---|---|
Autauga | Alabama | 1001 | 7150 | 111 |
Baldwin | Alabama | 1003 | 21661 | 311 |
Barbour | Alabama | 1005 | 2337 | 59 |
Bibb | Alabama | 1007 | 2665 | 64 |
Blount | Alabama | 1009 | 6887 | 139 |
... (3242 rows omitted)
pop_county = Table().read_table('county-population.csv').select(7,5)
pop_county.show(5)
fips | 2015 POPULATION |
---|---|
1001 | 55,347 |
1003 | 203,709 |
1005 | 26,489 |
1007 | 22,583 |
1009 | 57,673 |
... (3229 rows omitted)
elect_county = Table().read_table('county-elections.csv').select(1,7,8).relabel(0,'fips')
elect_county.show(5)
fips | per_gop | per_dem |
---|---|---|
1001 | 0.714368 | 0.270184 |
1003 | 0.761714 | 0.22409 |
1005 | 0.534512 | 0.457882 |
1007 | 0.784263 | 0.206983 |
1009 | 0.895716 | 0.0956938 |
... (3147 rows omitted)
complete = Table().read_table('county_complete.csv').drop(3,4,5,6,7,8,9,10).select(0, 3, 'median_age_2019', 'white_2019', 'hs_grad_2019', 'bachelors_2019', 'median_household_income_2019', 'poverty_2019', 'unemployment_rate_2019')
complete.show(5)
fips | pop2017 | median_age_2019 | white_2019 | hs_grad_2019 | bachelors_2019 | median_household_income_2019 | poverty_2019 | unemployment_rate_2019 |
---|---|---|---|---|---|---|---|---|
1001 | 55504 | 38.2 | 76.8 | 88.5 | 26.6 | 58731 | 15.2 | 3.5 |
1003 | 212628 | 43 | 86.2 | 90.8 | 31.9 | 58320 | 10.4 | 4 |
1005 | 25270 | 40.4 | 46.8 | 73.2 | 11.6 | 32525 | 30.7 | 9.4 |
1007 | 22668 | 40.9 | 76.8 | 79.1 | 10.4 | 47542 | nan | 7 |
1009 | 58013 | 40.7 | 95.5 | 80.5 | 13.1 | 49358 | 13.6 | 3.1 |
... (3137 rows omitted)
We can see below that the two sets of population figures are very similar, so it is fair to say that the data are likely accurate.
pop_county.join('fips', complete.select(0,1)).sample(10).show(10)
fips | 2015 POPULATION | pop2017 |
---|---|---|
40047 | 63,569 | 61581 |
5137 | 12,456 | 12537 |
16047 | 15,284 | 15124 |
46045 | 3,999 | 3919 |
51041 | 335,687 | 343599 |
40001 | 22,004 | 21909 |
1055 | 103,057 | 102755 |
48121 | 780,612 | 836210 |
15009 | 164,637 | 166260 |
47085 | 18,135 | 18484 |
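To put a number on "very similar," here is a quick check (a small addition to the original analysis; it assumes the 2015 figures were read in as comma-separated strings, which is why they are cleaned before comparing):
joined = pop_county.join('fips', complete.select(0, 1))
pop_2015 = joined.apply(lambda s: int(str(s).replace(',', '')), '2015 POPULATION')
# Median relative difference; a few percent is consistent with two years of change
np.median(abs(joined.column('pop2017') - pop_2015) / pop_2015)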
Now, we'll clean up the data from these tables and join them into one table that can be used to build the classifier.
# Keep rows with non-negative case/death counts, and exclude territories
# (FIPS codes of 60000 and above are not states or the District of Columbia)
county = covid_county.where('cases', are.above_or_equal_to(0)).where('deaths', are.above_or_equal_to(0)).where('fips', are.below(60000))
counties = county.join('fips', elect_county).join('fips', complete)
counties.show(5)
fips | county | state | cases | deaths | per_gop | per_dem | pop2017 | median_age_2019 | white_2019 | hs_grad_2019 | bachelors_2019 | median_household_income_2019 | poverty_2019 | unemployment_rate_2019 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1001 | Autauga | Alabama | 7150 | 111 | 0.714368 | 0.270184 | 55504 | 38.2 | 76.8 | 88.5 | 26.6 | 58731 | 15.2 | 3.5 |
1003 | Baldwin | Alabama | 21661 | 311 | 0.761714 | 0.22409 | 212628 | 43 | 86.2 | 90.8 | 31.9 | 58320 | 10.4 | 4 |
1005 | Barbour | Alabama | 2337 | 59 | 0.534512 | 0.457882 | 25270 | 40.4 | 46.8 | 73.2 | 11.6 | 32525 | 30.7 | 9.4 |
1007 | Bibb | Alabama | 2665 | 64 | 0.784263 | 0.206983 | 22668 | 40.9 | 76.8 | 79.1 | 10.4 | 47542 | nan | 7 |
1009 | Blount | Alabama | 6887 | 139 | 0.895716 | 0.0956938 | 58013 | 40.7 | 95.5 | 80.5 | 13.1 | 49358 | 13.6 | 3.1 |
... (3102 rows omitted)
Next, we will relabel the columns of the counties table.
counties = counties.relabel('pop2017', 'pop').relabel('median_age_2019', 'median age').relabel('white_2019', 'white').relabel('hs_grad_2019', 'hs grad').relabel('bachelors_2019', 'bachelors')
counties = counties.relabel('median_household_income_2019', 'median household income').relabel('poverty_2019', 'poverty').relabel('unemployment_rate_2019', 'unemployment')
counties.show(5)
fips | county | state | cases | deaths | per_gop | per_dem | pop | median age | white | hs grad | bachelors | median household income | poverty | unemployment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1001 | Autauga | Alabama | 7150 | 111 | 0.714368 | 0.270184 | 55504 | 38.2 | 76.8 | 88.5 | 26.6 | 58731 | 15.2 | 3.5 |
1003 | Baldwin | Alabama | 21661 | 311 | 0.761714 | 0.22409 | 212628 | 43 | 86.2 | 90.8 | 31.9 | 58320 | 10.4 | 4 |
1005 | Barbour | Alabama | 2337 | 59 | 0.534512 | 0.457882 | 25270 | 40.4 | 46.8 | 73.2 | 11.6 | 32525 | 30.7 | 9.4 |
1007 | Bibb | Alabama | 2665 | 64 | 0.784263 | 0.206983 | 22668 | 40.9 | 76.8 | 79.1 | 10.4 | 47542 | nan | 7 |
1009 | Blount | Alabama | 6887 | 139 | 0.895716 | 0.0956938 | 58013 | 40.7 | 95.5 | 80.5 | 13.1 | 49358 | 13.6 | 3.1 |
... (3102 rows omitted)
In this cell, we will convert the per_gop and per_dem proportions into percentage columns named trump and biden. The fix_votes function does this conversion.
def fix_votes(x):
    """Return a proportion as a rounded percentage"""
    return round(x * 100, 3)
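# A quick check of fix_votes (a small addition): Autauga County's per_gop
# value of 0.714368 should become 71.437
fix_votes(0.714368)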
counties['trump'] = counties.apply(fix_votes, 'per_gop')
counties['biden'] = counties.apply(fix_votes, 'per_dem')
# Drop the raw proportions (and poverty, which we will not use), then reorder
data_1 = counties.drop('poverty', 'per_gop', 'per_dem')
data = data_1.select(0,1,2,5,3,4,12,13,10,11,6,7,8,9).relabel('unemployment', 'unemployed')
# Convert raw case/death counts to per-capita rates, then to rounded percentages
data = data.with_columns('cases1', data.column('cases')/data.column('pop'), 'deaths1', data.column('deaths')/data.column('pop')).drop('cases', 'deaths')
data = data.relabel('cases1', 'cases').relabel('deaths1', 'deaths')
c = data.apply(fix_votes, 'cases')
d = data.apply(fix_votes, 'deaths')
data = data.drop(12, 13).with_columns('cases', c, 'deaths', d)
data = data.select(0, 1, 2, 3, 'cases', 'deaths', 4, 5, 6, 7, 8, 9, 10, 11)
data.show(5)
fips | county | state | pop | cases | deaths | trump | biden | median household income | unemployed | median age | white | hs grad | bachelors |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1001 | Autauga | Alabama | 55504 | 12.882 | 0.2 | 71.437 | 27.018 | 58731 | 3.5 | 38.2 | 76.8 | 88.5 | 26.6 |
1003 | Baldwin | Alabama | 212628 | 10.187 | 0.146 | 76.171 | 22.409 | 58320 | 4 | 43 | 86.2 | 90.8 | 31.9 |
1005 | Barbour | Alabama | 25270 | 9.248 | 0.233 | 53.451 | 45.788 | 32525 | 9.4 | 40.4 | 46.8 | 73.2 | 11.6 |
1007 | Bibb | Alabama | 22668 | 11.757 | 0.282 | 78.426 | 20.698 | 47542 | 7 | 40.9 | 76.8 | 79.1 | 10.4 |
1009 | Blount | Alabama | 58013 | 11.871 | 0.24 | 89.572 | 9.569 | 49358 | 3.1 | 40.7 | 95.5 | 80.5 | 13.1 |
... (3102 rows omitted)
In the data table below, the following columns are represented:

Note: All Alaskan counties have been excluded, as Alaskan election results are not reported by county.

- fips: An identification number given to each county by the federal government
- county: The county name
- state: The state the county is in, or the District of Columbia
- pop: The population of the county (as of 2017)
- cases: Cumulative COVID-19 cases in that county per capita, expressed as a percent of its population (as of June 2nd, 2021)
- deaths: Cumulative COVID-19 deaths in that county per capita, expressed as a percent of its population (as of June 2nd, 2021)
- trump: The percent of votes cast for Donald J. Trump in that county in the 2020 Presidential Election
- biden: The percent of votes cast for Joseph R. Biden in that county in the 2020 Presidential Election
- median household income: The median household income of that county (as of 2019)
- unemployed: The unemployment rate in that county (as of 2019)
- median age: The median age of that county (as of 2019)
- white: The percent of residents of that county who are white (as of 2019)
- hs grad: The percent of residents of that county who have graduated high school (as of 2019)
- bachelors: The percent of residents of that county who have graduated college with a bachelor's degree (as of 2019)
data.show(5)
fips | county | state | pop | cases | deaths | trump | biden | median household income | unemployed | median age | white | hs grad | bachelors |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1001 | Autauga | Alabama | 55504 | 12.882 | 0.2 | 71.437 | 27.018 | 58731 | 3.5 | 38.2 | 76.8 | 88.5 | 26.6 |
1003 | Baldwin | Alabama | 212628 | 10.187 | 0.146 | 76.171 | 22.409 | 58320 | 4 | 43 | 86.2 | 90.8 | 31.9 |
1005 | Barbour | Alabama | 25270 | 9.248 | 0.233 | 53.451 | 45.788 | 32525 | 9.4 | 40.4 | 46.8 | 73.2 | 11.6 |
1007 | Bibb | Alabama | 22668 | 11.757 | 0.282 | 78.426 | 20.698 | 47542 | 7 | 40.9 | 76.8 | 79.1 | 10.4 |
1009 | Blount | Alabama | 58013 | 11.871 | 0.24 | 89.572 | 9.569 | 49358 | 3.1 | 40.7 | 95.5 | 80.5 | 13.1 |
... (3102 rows omitted)
data.to_csv('politics_data.csv')
We will now modify the data table to make our classification easier. The first step is to identify which candidate won in each county. To do this, we will define the vote function below and create a votes table with only the voting percentages. Then, we will save the resulting winner array.
def vote(x, y):
    """Return the plurality winner of the Presidential Election"""
    if x > y:
        return 'Trump'
    elif y > x:
        return 'Biden'
    else:
        return 'Tie'
votes = data.select('trump', 'biden')
winner = votes.apply(vote,0,1)
fips = data.column('fips')
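Before moving on, it is worth a quick look at the class balance of the winner array, since k-NN predictions lean toward the majority class (a small check, not part of the original pipeline):
Table().with_column('winner', winner).group('winner')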
We now create two tables, one with the categorical data and one with the numerical data. We will also drop the vote percentage of each candidate; since we have the winner array, they are no longer relevant.
cat = data.select(0,1,2).with_column('winner', winner)
num = data.select(3,4,5,8,9,10,11,12,13)
Now, because the units in each column are different, we want to convert the data into standard units (z-scores) so that all features carry the same weight in our classifier. We will do this by defining the s_u and standardize functions.
def s_u(array):
    """Return array in standard units"""
    return (array - np.mean(array)) / np.std(array)

def standardize(table):
    """Return a table in standard units"""
    t = Table()
    for label in table.labels:
        t = t.with_column(label, s_u(table.column(label)))
    return t
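As a quick sanity check of s_u (a small addition): an array in standard units should have mean 0 and standard deviation 1.
s_u(make_array(1, 2, 3))  # array([-1.22474487,  0.        ,  1.22474487])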
Now, each value represents how many standard deviations it is above or below the mean of its column. We can now call the standardize function on the num table and add the fips column so we can join all the data back into one table.
stan_num = standardize(num)
stan_num_fips = stan_num.with_column('fips', fips)
full = cat.join('fips', stan_num_fips)
full.show(5)
fips | county | state | winner | pop | cases | deaths | median household income | unemployed | median age | white | hs grad | bachelors |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1001 | Autauga | Alabama | Trump | -0.141733 | 0.909967 | -0.0287858 | 0.383516 | -0.550973 | -0.608552 | -0.399307 | 0.251456 | 0.487353 |
1003 | Baldwin | Alabama | Trump | 0.338787 | -0.0038243 | -0.510629 | 0.354379 | -0.354464 | 0.283253 | 0.176396 | 0.618838 | 1.04212 |
1005 | Barbour | Alabama | Trump | -0.234195 | -0.32221 | 0.265674 | -1.47431 | 1.76784 | -0.199808 | -2.23666 | -2.19243 | -1.08275 |
1007 | Bibb | Alabama | Trump | -0.242152 | 0.528514 | 0.702903 | -0.409707 | 0.824593 | -0.106912 | -0.399307 | -1.25002 | -1.20836 |
1009 | Blount | Alabama | Trump | -0.134059 | 0.567168 | 0.328135 | -0.280965 | -0.708181 | -0.14407 | 0.745974 | -1.02639 | -0.925743 |
... (3101 rows omitted)
As is standard practice in machine learning, we will now randomly split the data into two sets: training, which will be used to train our algorithm, and test, which will be used to evaluate the accuracy of our classifier.
# Shuffle the rows, then split them roughly in half: 1,606 training rows
# and 1,500 test rows
shuf_full = full.sample(with_replacement = False)
training = shuf_full.take(np.arange(1606))
test = shuf_full.take(np.arange(1606, 3106))
We'll now work only with the training table until we have built our classifier. But first, let's define the functions we will need:

- distance: finds the Euclidean distance between two points
- row_distance: uses the previous function to find the distance between two row objects
- distances: creates a copy of training with one more column, which contains the distance from each row to the given example
- closest: creates a new table with only the $k$ closest rows
- majority_class: finds the class of the majority of the rows in a given table
- classify_1: calls majority_class on the table from closest to return our classifier's prediction
def distance(pt1, pt2):
    """Return the distance between two points, represented as arrays"""
    return np.sqrt(sum((pt1 - pt2)**2))

def row_distance(row1, row2):
    """Return the distance between two numerical rows of a table"""
    return distance(np.array(row1), np.array(row2))

def distances(training, example):
    """Return training table with distances column"""
    distances = make_array()
    attributes_only = training.drop('winner', 0, 1, 2)
    for row in attributes_only.rows:
        distances = np.append(distances, row_distance(row, example))
    return training.with_column('Distance', distances)

def closest(training, example, k):
    """Return a table of the k closest neighbors to example"""
    return distances(training, example).sort('Distance').take(np.arange(k))

def majority_class(topk):
    """Return the class with the highest count"""
    return topk.group('winner').sort('count', descending=True).column(0).item(0)

def classify_1(training, example, k):
    """Return the majority class among the k nearest neighbors of example"""
    return majority_class(closest(training, example, k))
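As a quick smoke test before measuring accuracy in bulk (a small addition; any row index would do), we can classify a single test-set row and compare the prediction with its recorded winner:
example_row = test.drop('winner', 0, 1, 2).row(0)
classify_1(training, example_row, 5), test.column('winner').item(0)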
Here, we define one more function, accuracy, which evaluates the performance of our classifier on the test set by comparing its predictions to the actual winners and returning the percentage of examples that classify_1 got correct.
def accuracy(training, test, k):
    """Return the percentage of correctly classified examples in the test set"""
    test_attributes = test.drop('winner', 0, 1, 2)
    num_correct = 0
    for i in np.arange(test.num_rows):
        c = classify_1(training, test_attributes.row(i), k)
        num_correct = num_correct + (c == test.column('winner').item(i))
    return (num_correct / test.num_rows) * 100
With some trial and error, we can find that the optimal number of neighbors to base our prediction on is $k=13$; this $k$ leads to an accuracy on the test set of $92.6\%$.
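That trial and error can be scripted. A minimal sketch of the search is below (odd values of k avoid ties; the loop is slow, since each call classifies every test row, and the exact accuracies vary with the random train/test split):
for k in np.arange(1, 26, 2):
    print(k, accuracy(training, test, k))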
accuracy(training, test, 13)
92.6
Thus, we will define the final k-NN function classify, which makes its prediction using a fixed $k=13$ and the full table.
def classify(example):
    """Return the k-NN prediction for example using k = 13 and the full table"""
    return classify_1(full, example, 13)
Finally, we can create a new function predict which, given a state and county name, will make a prediction using $k=13$ and return the actual result for comparison. This will also use the full table.
def predict(state, county):
    """Print a prediction for the given county and return the actual result"""
    ex = full.where('state', state).where('county', county)
    example = ex.drop(0, 1, 2, 3).row(0)
    result = classify(example)
    print('My algorithm predicts that', result, 'won the 2020 Presidential election in', county + ',', state)
    return ex.select(0, 2, 1, 3)
Below, I show a few examples of the predict function in action.
predict('California','Yuba')
My algorithm predicts that Trump won the 2020 Presidential election in Yuba, California
fips | state | county | winner |
---|---|---|---|
6115 | California | Yuba | Trump |
predict('Texas','Frio')
My algorithm predicts that Trump won the 2020 Presidential election in Frio, Texas
fips | state | county | winner |
---|---|---|---|
48163 | Texas | Frio | Trump |
predict('New York','Nassau')
My algorithm predicts that Biden won the 2020 Presidential election in Nassau, New York
fips | state | county | winner |
---|---|---|---|
36059 | New York | Nassau | Biden |
predict('Nebraska','Brown')
My algorithm predicts that Trump won the 2020 Presidential election in Brown, Nebraska
fips | state | county | winner |
---|---|---|---|
31017 | Nebraska | Brown | Trump |
predict('Illinois','Lee')
My algorithm predicts that Trump won the 2020 Presidential election in Lee, Illinois
fips | state | county | winner |
---|---|---|---|
17103 | Illinois | Lee | Trump |
predict('Mississippi','Bolivar')
My algorithm predicts that Biden won the 2020 Presidential election in Bolivar, Mississippi
fips | state | county | winner |
---|---|---|---|
28011 | Mississippi | Bolivar | Biden |
As the $92.6\%$ accuracy of the classify function shows, given basic demographic information about a county and its cumulative per-capita COVID-19 case and death counts, we can reliably predict how it voted. We can conclude that the makeup of a county's residents and its experience of the COVID-19 pandemic likely correlate with its political affiliation.
It is possible that the algorithm's ability to predict each county's winner was due only to the demographics, or only to the COVID-19 data. A follow-up project that built a separate classifier from each type of data could compare their accuracies to identify which set of features drives the prediction. If both have relatively low accuracy, then the association depends on both sets together; if only one is high, it can be inferred that the classifier in this project succeeded largely because of that set of variables.
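As a rough sketch of that follow-up (hypothetical and not run here; it reuses the accuracy function from above with a fresh random split for each feature subset):
demographics = full.select('fips', 'county', 'state', 'winner', 'pop', 'median household income', 'unemployed', 'median age', 'white', 'hs grad', 'bachelors')
covid_only = full.select('fips', 'county', 'state', 'winner', 'cases', 'deaths')

def subset_accuracy(tbl, k):
    """Shuffle tbl, split it in half, and return the k-NN accuracy on the held-out half"""
    shuffled = tbl.sample(with_replacement=False)
    half = shuffled.num_rows // 2
    return accuracy(shuffled.take(np.arange(half)), shuffled.take(np.arange(half, shuffled.num_rows)), k)

subset_accuracy(demographics, 13), subset_accuracy(covid_only, 13)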
Another application of this algorithm is evaluating the accuracy of this classifier on other elections: past Presidential races, or even future contests such as the 2022 Midterm Elections or the 2024 Presidential Election.
While this project is limited in scope to these two sets of variables and the 2020 Presidential Election, there are endless variations that could give even deeper insight into the nature of American politics in the context of the two-party system.
If you have any questions about this project, need clarification on any concepts or topics, or would like a walkthrough of my code, do not hesitate to reach out to me at jonathanferrari@berkeley.com. Cheers!