COVID-19, Demographics, and Political Affiliation

A Machine Learning Project focused on the 2020 Presidential Election

by Jonathan Ferrari

Introduction

During the pandemic, it has become glaringly obvious that COVID-19 has affected us all, for better or worse. However, Americans hold vastly different opinions on the issue; COVID-19 has become a political one. So, I decided to dig deeper into this relationship, using techniques from data science to build a machine learning algorithm that predicts a county's political affiliation from its demographics and COVID-19 statistics. A highly accurate algorithm will signal that there is a likely relationship between a county's demographics and COVID-19 statistics and its political affiliation.

Abstract

In this notebook, I will create a k-NN (k-nearest neighbors) classifier which takes as its features information about COVID-19 in a given county along with demographic information about that county. The classifier will return a prediction for who won the 2020 Presidential election in that county. The techniques I will use in this project include, but are not limited to: markdown, $\LaTeX$, importing libraries and .csv files, table manipulation, data cleaning and filtering, defining functions, statistical distribution analysis, for loops, and basic machine learning.

Note: I will use first-person-plural pronouns, such as "we" and "our". In doing so, I refer only to myself and the reader.

Definitions:

Classifiers: A classifier in machine learning is an algorithm that automatically orders or categorizes data into one or more of a set of “classes.” One of the most common examples is an email classifier that scans emails to filter them by class label: Spam or Not Spam.

Per Capita: Per capita means per person. It is a Latin term that translates to "by the head." It is commonly used in statistics, economics, and business to report an average per person, which makes figures comparable across countries, states, or counties of different sizes.

Note: For further clarification of any terms in this project or for information on any basic topic in data science, please refer to Computational and Inferential Thinking: The Foundations of Data Science, by Ani Adhikari, and John DeNero, with contributions from David Wagner and Henry Milner.

Set-up

Here, we import all of the libraries necessary to complete this project.
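The import cell itself is not reproduced here; a minimal sketch of what it might contain, assuming a NumPy/pandas workflow (the original notebook may instead rely on the Berkeley datascience library), is:

```python
# Minimal import sketch (assumption: NumPy/pandas workflow; the original
# notebook may use the Berkeley `datascience` library instead).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```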

Next, we will load 3 .csv files containing the information needed to build the classifier (and drop unnecessary data), acquired from The New York Times, Politico and The New York Times, and the Census Bureau, respectively. We will then import a .csv file from the EPA to validate the population data from our other tables. This validation will be done by randomly selecting FIPS codes and checking whether the population data from our two sources look plausible; we will, however, expect to see a small difference, because the data come from two different years (2015 and 2017).
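As a rough sketch of this step (the file names and column labels below are placeholders, not the originals), the loading and spot-check might look like:

```python
import numpy as np
import pandas as pd

# Placeholder file names; the real notebook loads the NYT COVID-19 data,
# the Politico/NYT election results, the Census demographics, and the EPA file.
covid = pd.read_csv("nyt_covid_counties.csv")
election = pd.read_csv("election_results_2020.csv")
census = pd.read_csv("census_demographics.csv")
epa = pd.read_csv("epa_population.csv")

# Spot-check: compare populations for a few randomly chosen FIPS codes.
sample = np.random.choice(census["fips"].unique(), size=5, replace=False)
check = (census.loc[census["fips"].isin(sample), ["fips", "pop"]]
         .merge(epa[["fips", "pop"]], on="fips", suffixes=("_2017", "_2015")))
print(check)
```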

We can see below that the population figures from the two sources are very similar, so it is fair to say that the data are likely accurate.

Now, we'll clean up the data from these tables and join them into one table which can be used to build the classifier.

Next, we will relabel the columns of the counties table.

In this cell, we will convert the trump and biden columns from proportions to percentages. The fix_votes function does this.
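fix_votes is named in the notebook but its body is not shown; a minimal sketch of what such a function could look like is:

```python
def fix_votes(proportion):
    """Convert a vote share given as a proportion (0-1) to a percentage (0-100)."""
    return round(proportion * 100, 2)

# Hypothetical usage on a single value:
print(fix_votes(0.526))  # 52.6
```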

The Data

In the data table below, the following columns are represented:

Note: All Alaskan counties have been excluded, as Alaskan election information is not collected by county

fips: An identification number given to each county by the Federal Government

county: The county name

state: The state that county is in, or the District of Columbia

pop: The population of the county (as of 2017)

cases: The number of cumulative COVID-19 cases in that county per capita (as of June 2nd, 2021)

deaths: The number of cumulative COVID-19 deaths in that county per capita (as of June 2nd, 2021)

trump: The percent of votes that were cast for Donald J. Trump in that county in the 2020 Presidential Election

biden: The percent of votes that were cast for Joseph R. Biden in that county in the 2020 Presidential Election

median household income: The median household income of that county (as of 2019)

unemployed: The unemployment rate in that county (as of 2019)

median age: The median age of that county (as of 2019)

white: The percent of residents of that county that are white (as of 2019)

hs grad: The percent of residents of that county that have graduated high school (as of 2019)

bachelors: The percent of residents of that county that have graduated college with a bachelor's degree (as of 2019)

We will now modify the data table to make our classification easier. The first step is to identify which candidate won each county. To do this, we will define the votes function below and create an election table containing only the voting percentages. Then, we will save the resulting winners as a new array.
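The votes function itself is not shown above; a minimal sketch, assuming it compares the two candidates' vote shares row by row, is:

```python
import numpy as np

def votes(trump_pct, biden_pct):
    """Return the name of the candidate with the larger vote share in a county."""
    return "Trump" if trump_pct > biden_pct else "Biden"

# Hypothetical example: (trump %, biden %) pairs for three counties.
election_example = np.array([[52.6, 45.9], [31.2, 66.8], [49.9, 48.3]])
winner = np.array([votes(t, b) for t, b in election_example])
print(winner)  # ['Trump' 'Biden' 'Trump']
```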

We now create two tables, one with the categorical data and one with the numerical data. We will also drop the vote percentage of each candidate, as we have the winner array, so they are no longer relevant.

Now, because the units in each column are different, we want to convert the data into standard units (z-scores) so that every column carries the same weight in our classifier. We will do this by defining the s_u and standardize functions.
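s_u and standardize come from the notebook; the bodies below are a sketch of the standard-units conversion they describe, using a pandas table (an assumption):

```python
import numpy as np
import pandas as pd

def s_u(values):
    """Convert an array of values to standard units (z-scores)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def standardize(table):
    """Return a copy of a numeric table with every column in standard units."""
    return table.apply(s_u)

# Tiny illustrative example with made-up numbers.
num_example = pd.DataFrame({"cases": [0.08, 0.11, 0.05], "median age": [38, 41, 35]})
print(standardize(num_example))
```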

We can now call the standardize function on the num table; afterwards, each value represents how many standard deviations it lies above or below the mean of its column. We then add back the fips column so we can join all the data into one table.

As is standard practice in machine learning, we will now randomly split the data into two sets: train, which will be used to train our algorithm, and test, which will be used to evaluate the performance of our classifier.
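A sketch of such a random split (the 80/20 proportion and the function name here are assumptions, not necessarily the notebook's exact choices):

```python
import pandas as pd

def train_test_split_table(table, train_fraction=0.8, seed=42):
    """Shuffle the rows of a table and split it into a training and a test set."""
    shuffled = table.sample(frac=1, random_state=seed).reset_index(drop=True)
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled.iloc[:cutoff], shuffled.iloc[cutoff:]

# Hypothetical usage, assuming `counties` is the standardized table:
# train, test = train_test_split_table(counties)
```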

We'll now work only with the training table until we have built our classifier. But first, let's define some functions we will need; a sketch of these helpers follows the list below.

distance: finds the Euclidean distance between two points

row_distance: uses the previous function to find the distance between two row objects

distances: returns a copy of the training table with one more column, which contains the distance from each row to the given example

closest: creates a new table with only the $k$ closest rows

majority_class: finds the class of the majority of the rows in a given table

classify_1: calls majority_class on the table returned by closest to return our classifier's prediction
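A sketch of these helpers, assuming a pandas table whose class label lives in a column called winner (the column name and feature handling are assumptions):

```python
import numpy as np
import pandas as pd

def distance(point1, point2):
    """Euclidean distance between two numeric arrays."""
    return np.sqrt(np.sum((np.asarray(point1) - np.asarray(point2)) ** 2))

def row_distance(row1, row2):
    """Distance between two rows of numeric features."""
    return distance(np.asarray(row1, dtype=float), np.asarray(row2, dtype=float))

def distances(training, example, feature_columns):
    """Copy of the training table with a 'distance' column giving how far
    each row's features are from the given example."""
    out = training.copy()
    out["distance"] = [row_distance(row, example)
                       for row in training[feature_columns].to_numpy()]
    return out

def closest(training, example, k, feature_columns):
    """The k rows of the training table closest to the example."""
    return distances(training, example, feature_columns).nsmallest(k, "distance")

def majority_class(neighbors, class_column="winner"):
    """The most common class label among the given rows."""
    return neighbors[class_column].mode().iloc[0]

def classify_1(training, example, k, feature_columns, class_column="winner"):
    """Predict the class of the example from its k nearest neighbors."""
    return majority_class(closest(training, example, k, feature_columns), class_column)
```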

Here, we define one more function, accuracy, which evaluates the performance of the classifier on the test set by comparing its predictions with the actual results and returning the percentage of rows it classified correctly.
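Building on the helper sketches above, accuracy could look roughly like this:

```python
import numpy as np

def accuracy(training, test, k, feature_columns, class_column="winner"):
    """Fraction of test rows that classify_1 (sketched above) labels correctly."""
    predictions = [classify_1(training, example, k, feature_columns, class_column)
                   for example in test[feature_columns].to_numpy()]
    return np.mean(np.asarray(predictions) == test[class_column].to_numpy())
```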

With some trial and error, we can find that the optimal number of neighbors to base our prediction on is $k=13$; this $k$ leads to an accuracy on the test set of $92.6\%$.
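The trial and error can be made systematic with a small search over candidate values of $k$; for example, a sketch built on the accuracy function above:

```python
def best_k(training, test, feature_columns, candidates=range(1, 30, 2)):
    """Return the odd k with the highest test-set accuracy among the candidates."""
    scores = {k: accuracy(training, test, k, feature_columns) for k in candidates}
    return max(scores, key=scores.get)
```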

Thus, we will define the final k-NN function, classify, which makes its prediction using $k=13$ by default and the full table.

Finally, we can create a new function predict which, given a state and county name, will make a prediction using $k=13$ and return the actual result for comparison. This will use the full table.
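A sketch of the final classify and predict functions under the same assumptions (the column names state, county, and winner are assumptions, and classify_1 is the helper sketched earlier):

```python
K = 13  # value found by the search above

def classify(example, full_table, feature_columns, class_column="winner"):
    """Final classifier: predict a county's winner using k = 13 and the full table."""
    return classify_1(full_table, example, K, feature_columns, class_column)

def predict(state, county, full_table, feature_columns, class_column="winner"):
    """Look up a county by state and name, predict its winner, and also return
    the actual result for comparison."""
    row = full_table[(full_table["state"] == state) & (full_table["county"] == county)]
    example = row[feature_columns].to_numpy()[0]
    prediction = classify(example, full_table, feature_columns, class_column)
    actual = row[class_column].iloc[0]
    return prediction, actual
```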

Below I show a few examples of the predict function in action.

Conclusion and Discussion

As we can see from the $92.6\%$ accuracy of the classify function, which uses only basic demographic information about a county and its cumulative COVID-19 case and death counts per capita, it is likely that the makeup of a county's residents and its response to the COVID-19 pandemic do correlate with its political affiliation.

It is possible that the algorithm's ability to predict the winner of each county was due to the demographics alone, or to the COVID-19 data alone. A follow-up project that built a separate classifier for each type of data could compare their accuracies to identify which set of features drives the prediction. If both have relatively low accuracy on their own, then the association likely involves both; if only one is high, it can be inferred that the classifier in this project was successful largely because of that set of variables.

Another application of this algorithm would be to evaluate the accuracy of the classifier on other elections, for instance past Presidential elections, or even future ones, such as the 2024 Presidential Election or the 2022 Midterm Elections.

While this project is limited in scope to these two sets of variables and the 2020 Presidential Election, there are endless variations that could give even deeper insight into the nature of American politics in the context of the two-party system.

If you have any questions about this project or require any clarification on concepts/topics, or need a walkthrough of any of my code, do not hesitate to reach out to me at jonathanferrari@berkeley.com. Cheers!