Motivation

A highly anticipated presidential election is less than one year away, with the first primary votes happening in Iowa in just two months. We were interested in exploring factors that affect candidate popularity in the 2020 democratic primary over the last year in order to better understand how candidates might perform in upcoming votes. We thought that factors such as debate performance and campaign events might help explain fluctuations in the polls or a particular candidates’ success. In order to assess candidate popularity we chose to use polling data as an indicator of potential election success as well as Google Trends data. We wanted to assess the impacts of debate performance, measured by variables such as speaking time, as well as examine trends in polling and google searches of candidates after newsworthy events at the state and national level.

Data

Polling and Key Events Data

National and state level polling for all candidates from January 24, 2019 to November 3, 2019 were obtained from FiveThirtyEight. These data contained information on the ranking of the poll (letter grades). Data was cleaned to include only polls with a grade of A or A+ and only democratic candidates that participated in each of the four debates. A mean polling percent was taken from all remaining polls for each available time point. Additional data on key campaign events were obtained from news sources such as the New York Times, Vox, Politico, and The Cut.

Debate Indicators Data

For the plots included in the dashboard, polling data was filtered to include only the polls directly before and directly after the fourth primary debate. A percent change variable was created to show the change in polling numbers before and after the debate.

Trump mentions and words spoken per candidate were scraped from FiveThirtyEight using rvest.

Airtime was scraped from Wikipedia using rvest.

Data on candidate attacks during the debate was taken from NBC News. Data were presented in graphic form and were therefore entered into a .CSV file manually due to scraping difficulties. The data were checked by all team members to ensure no mistakes were made.

The resulting dataframes were filtered to include only the candidates of interest for this project and joined to create a single dataframe including polling percent change and all debate variables of interest.

Exploratory Analysis

We explored the relationship between key campaign events including controversies, medical events, and endorsements and a candidate’s polling performance. The list of candidates was narrowed down to the three frontrunner candidates with most of the media coverage. We had originally planned to look at each candidate but it became difficult to find major campaign events for low polling candidates. We also decided to look into the relationship between the number of endorsements per candidate and their current polling performance after reading additional research on how well the number of endorsements predicts nomination to the general election. It is hard to determine whether these events were truly the reason for a drop or gain in polling percent because other factors such as debate performance, candidate fatigue and advertising and therefore our analysis is purely exploratory.

We also discussed several approaches to the exploratory analysis of debate factors affecting polling performance. Originally, we chose an approach in which we would compile debate indicators across debates one through four and examine their association with candidates’ polling percent change over that time period. However, we realized that many other factors besides debate-related variables were likely to affect polling numbers over that period of time. We ultimately decided to focus on the fourth debate because we believe the percent change in polling numbers directly before and after a single debate is more likely to reflect debate performance.

Although we believe focusing on one debate is valuable, we anticipated that it may be difficult to observe meaningful trends due to the small sample size. Additionally, we understand that certain debate-related factors that influence voting behavior may be less quantifiable than variables such as airtime and Trump mentions. We therefore chose to explore Google Search behavior as a secondary indicator of the impact that debate performance has on public perceptions of Democratic presidential candidates. We decided to choose key events from each debate to look at using Google Search behavior in the days immediately following the debates. These events were as follows: Kamala Harris’ comments toward Joe Biden in the first debate about his previous opposition to school busing programs, Bernie Sanders’ memorable comment that he “wrote the damn bill” in the second debate, Beto O’Rourke’s comment of “hell yes” to gun buy-back programs in the third debate, and Andrew Yang and Kamala Harris bringing up Universal Basic Income and reproductive rights, respectively, in the fourth debate. We also decided to compare Yang and Harris after the fourth debate to further compare their debate performances. We initially thought about instead looking at the 24 hours preceding and following the debates. However, we realized that a longer timeframe would give us a better understanding of behavior changes by allowing us to see if they were sustained over time so we looked at daily data for the week preceding and following each debate. Analysis was very exploratory in nature, and search volume was visualized using spaghetti plots.

Finally, there is significant interest in which democratic candidate will eventually go head to head with President Trump in the 2020 presidential election, so we explored polling results in two key states, IA and PA. The list of candidates was limited to the four polling highest in the states who are still in the race: Biden, Sanders, Warren, and Buttigieg. It became difficult to assess polling for candidates in Pennsylvania as opposed to Iowa, as polling data for several candidates was not as varied, which means the spread of polling results was not easily visualizable. This may be due to voters not having as clear of a picture of who will be a realistic contender that late in the race, as Pennsylvania’s primary will happen in April, as opposed to Iowa’s early February caucus. In addition to looking at Google Search behavior following debates at the national level, we also explored how searches for these top candidates varied (or didn’t) between the two states over the course of this year to see if voters exhibited a strong interest or preference in any particular candidate and if these fluctuations coincided with key events like debates.

Additional Analysis

This project was intended to provide useful real-world insight into factors that may be affecting perceptions of candidates in the upcoming 2020 presidential election. Given our decision to focus on this election, many of our analyses do not have a great enough sample size to support further confirmatory analysis and regression; this would require a compilation of data from numerous election cycles. Therefore, these analyses were purely exploratory and no further analysis was conducted.

Discussion

Key Events

Key campaign events were associated with changes in polling status for several candidates:

  • Flores’s essay about Joe Biden’s inappropriate touching coincided with one of his largest drops in the polls, while the conspiracy theory about his involvement in Ukraine occurred right before his biggest gain in the polls. The Feinstein endorsement occurred shortly before the former Vice President saw a gain in the polls but this was short-lived.

  • Elizabeth Warren has seen mostly consistent gains in the polls since announcing her bid for the presidency so it is more difficult to determine how events may have impacted her polling performance. One of the only drops in her polling after an event captured in our data was after her apology for the release of her ancestry results, but this was quickly followed by an increase in the polls around the same time as her endorsement from the Working Family Party. Due to the lack of polling information after the New York Times article on Warren’s time as a corporate lawyer, it is hard to tell whether this impacted her polling performance.

  • With events that occurred around the same time period, it was difficult to determine the impact of individual events on Sanders’ polling performance. Polling following Biden’s announcement of his bid for presidency as well as news of Sanders becoming a millionaire showed his largest decline in polling percent. Polling after his heart attack and endorsements by Ilhan Omar, Alexandra Ocasio-Cortez, and Rashida Tlaib showed modest gains. A limited number of data points for polling means we cannot see the immediate effects of each event. Polls only occurred once to twice a month, meaning these events cannot be directly linked to polling.

The Fourth Debate

When looking at candidate polling before and after the fourth debate, we see that Pete Buttigieg, Andrew Yang and Julian Castro’s numbers were unchanged. Compared to polls directly before this debate, Joe Biden, Corey Booker, Kamala Harris, Tulsi Gabbard, Amy Klobuchar and Bernie Sanders moved up in those polls following the debate, while Elizabeth Warren was the only candidate to move down. Biden and Warren had the largest percent change in polling numbers (Biden = 28%, Warren = -21%).

We did not observe meaningful trends in polling performance before and after the debates based on words spoken, airtime, Trump mentions, or candidate attacks. However, a few points of interest are worth noting:

  • Elizabeth Warren, who was the only candidate to move down in the polls, spoke the greatest number of words, had the longest amount of airtime, and was attacked by other candidates far more than anyone else during the debate.

  • Both Elizabeth Warren and Joe Biden, the two candidates with the greatest number of words spoken and longest amount of airtime, had the greatest percent change in polling numbers comparing pre- and post-debate polls.

  • Andrew Yang, Pete Buttigieg, and Julian Castro were the only candidates who did not mention Trump during the debate and were also the only candidates that did not move in the polls following the debate.

The lack of observable trends is not unusual, as experts have suggested that less quantifiable factors such as charisma and “likeability” may play a large role in changing voting behavior following debates. Our analysis indicates that the variables we’ve explored can have varying effects on polling performance; further research with a larger sample size is needed to better characterize these relationships.

Battleground States

In Iowa and Pennsylvania, polling results among top candidates Biden, Warren, Sanders, and Buttigieg were fairly consistent. From 33 polls among Iowa voters, Biden leads the field, followed by Warren, Sanders, and finally Buttigieg. Warren and Sanders are close with just a 2 percentage point separation while Buttigieg sits at least 5 points behind. However, Buttigieg recently rose into the double digits at 13% in an October poll from Suffolk University/USA Today among potential Iowa caucus goers, which pushed him past Sanders. That poll was not included in this sample, but is considered to be a sign of Iowans’ favor towards Buttigieg in the coming caucus. He would still need to see a substantial increase in support to close the 12 point gap between himself and Biden though. In Pennsylvania, Biden still appears to be the frontrunner with a median 30 percentage points, while the gaps between Warren, Sanders, and Buttigieg have closed, as they are all within 3.2 percentage points of each other. Data for Pennsylvania was only available from 7 polls, which meant there was a lower sample size and overall variability in data points to construct accurate estimates of polling performance, particularly for Biden and Buttigieg.

Candidacy announcements and debates were associated with the greatest increases in relative search volume. Both Sanders and Biden experienced large spikes in search volume in the spring in both states at the time of their announcements. With the exception of these spikes, queries for Biden, Warren, and Sanders largely mirror one another throughout the year and increase together during debates. In Iowa, relative search volume trended upward through to the fall, while in Pennsylvania volume remained quite level outside of occasional spikes. Higher relative search volumes during debates suggest voter interest in performance, although there isn’t a clear candidate preference moving forward in either Iowa of Pennsylvania. Buttigieg exhibited the lowest overall results in relative search queries, and changes in search behavior due to factors such as debate performances are masked by the significantly greater search volume associated with the other three candidates.