I. Introduction

Authors and Acknowledgments

Authors: Ammar Plumber, Elaina Lin, Kim Nguyen, Ryan Karbowicz, and Meghan Aines

This website was produced as a final project for BDS 516: Data Science and Quantitative Modeling, a graduate course taught by Alex Shpenev at the University of Pennsylvania.

The tweets that we use in this analysis were obtained from the following GitHub repository by Emily Chen, a computer science Ph.D. student at USC: https://github.com/echen102/COVID-19-TweetIDs

A. Motivation

COVID-19 has wreaked unprecedented havoc around the world. From a data analytics perspective, never before has a pandemic occurred during a time in history when almost any human can publicly share their thoughts on a global platform. More specifically, Twitter offers real-time insight on the attitudes, beliefs, and general moods of a populace. In this analysis, we compare the sentiments of COVID-19 related tweets at the beginning of the pandemic on March 30, 2020, to the sentiments exactly one year later on March 30, 2021. We select this time frame to capture two of the key events during this pandemic, the onset of stay-at-home orders and vaccine availability. As opposed to other data collection methods, such as interviews and surveys (and the numerous response biases that come along with them), sentiment analysis through Twitter is better able to capture the raw and unfiltered emotions of people who feel the need to express their views.

By gaining a better understanding of the general sentiment of a given population, policy leaders can become better informed regarding how to more effectively govern people during times of crisis. For instance, if feelings of fear are high, politicians can offer words of reassurance to instill feelings of calmness and ease. Or, if feelings of trust are low, politicians can attempt to mend the public trust by strengthening accountability and transparency within the government. At the end of the day, essentially any policy decision can be better informed by knowing how the general populace feels about the issue at hand.

We also examine which COVID-19 topics are most talked about. Similar to sentiment analysis, topic analysis can help inform policy leaders about which topics garner the most interest and need to be addressed. For instance, in the case of COVID-19, if a popular topic is the lack of ventilators, a good policy leader would be wise to offer updates on the distribution, as well as the known efficacy of ventilators.

We also examine differences in geographic attention between the two dates. More specifically, we look at how often each country is mentioned in each sample period, as well as which words are associated with each country. In the case of COVID-19, this information can be very helpful for understanding general attitudes towards China. Because the coronavirus originated in Wuhan, China, people around the world have unfortunately expressed negative sentiment towards Chinese people. Understanding how these attitudes have changed over time can help inform policy leaders as to whether extra measures need to be taken to protect and defend people of Chinese origin.

Lastly, we examine which features are most predictive of how many retweets a tweet gets. In general, it is believed that social media posts with more extreme positive/negative valence tend to be more likely to go viral. In fact, the best-selling author Seth Godin has remarked that “One of the problems with social media is that the stronger the view you express, the more likely it will become amplified.” By examining the kinds of sentiments associated with more viral tweets, as well as the sources of these tweets and the textual elements of the tweets, we can either confirm or disconfirm this common belief. The results that we find can help inform policy leaders about how to craft tweets containing important information in a way that maximizes the likelihood that a large number of people will be exposed to that content. This analysis will also provide valuable insights regarding how the virality of tweets has changed over time, so that policy leaders can adjust their approaches to the climate of the times.

B. Research Questions

  1. Are there differences in sentiments between the two sample periods—both in original tweets and in retweets?

  2. Which topics are there the most original tweets about, and which are more often the subject of retweets?

  3. Is there a difference in geographic attention between the two dates? For example, was China being discussed more in 2020 or 2021?

  4. Which of these features are most predictive of how many retweets a tweet gets?

  5. Are there certain sentiments or topic-specific words that are most likely to attract retweets?

II. Methods

Methods

After hydrating the tweets using Python, we segregated each period’s tweets into original and retweet sets and prepared R-ready CSV files. Then, using R, we determined the frequency of words in the tweets, tweet lengths, common hashtags, and mentions. Many of these steps required our data to be in tidytext format.
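The counting step described above can be sketched in Python. This is a minimal illustration, not the code used for the analysis (which was done in R); the function name `tally`, the regular expression, and the two sample tweets are assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical tokenizer: an optional leading # or @ followed by word characters.
TOKEN_RE = re.compile(r"[#@]?\w+")

def tally(tweets):
    """Return (word, hashtag, mention) frequency counters for a list of tweet texts."""
    words, hashtags, mentions = Counter(), Counter(), Counter()
    for text in tweets:
        for token in TOKEN_RE.findall(text.lower()):
            if token.startswith("#"):
                hashtags[token] += 1
            elif token.startswith("@"):
                mentions[token] += 1
            else:
                words[token] += 1
    return words, hashtags, mentions

# Made-up sample tweets, not drawn from the dataset.
sample = [
    "Stay home and stay safe #StayHome @WHO",
    "Testing update from @WHO #COVID19 #StayHome",
]
words, hashtags, mentions = tally(sample)
print(hashtags.most_common(2))
```

A real pipeline would also remove stop words and URLs before counting; that step is omitted here for brevity.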

Then, we conducted sentiment analysis to further characterize the two periods. We joined the NRC Word-Emotion Association Lexicon to our data, which allowed us to tag words with eight basic emotions and two sentiments (positive and negative). We also joined the AFINN lexicon to rate the emotional valence (positive or negative) of each word.
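To make the lexicon join concrete, here is a small Python sketch of tagging tokens with emotions. The three-entry `TOY_NRC` dictionary is a hypothetical stand-in whose entries are illustrative only; the real NRC lexicon is much larger and its exact tags may differ.

```python
from collections import Counter

# Hypothetical stand-in for the NRC lexicon: word -> set of emotions/sentiments.
TOY_NRC = {
    "hope": {"anticipation", "joy", "positive", "trust"},
    "death": {"fear", "sadness", "negative"},
    "pandemic": {"fear", "sadness", "negative"},
}

def emotion_counts(tokens, lexicon):
    """Count how often each emotion is triggered by the tokens found in the lexicon."""
    counts = Counter()
    for tok in tokens:
        for emotion in lexicon.get(tok, ()):
            counts[emotion] += 1
    return counts

tokens = "we hope the pandemic ends".split()
counts = emotion_counts(tokens, TOY_NRC)
print(sorted(counts.items()))
```

Words absent from the lexicon contribute nothing, which mirrors an inner join between the token table and the lexicon.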

Next, we use the quanteda package for Latent Dirichlet Allocation (LDA) topic modeling. This algorithm determines the clusters of words that are likely to co-occur, thus defining the topics. Topic modeling helps to illuminate the different conversations surrounding COVID-19 in each period.

We also seek to identify the features that predict a tweet’s virality—defined as the ratio of retweets to followers of the original account. For example, a user with two followers who gets eight retweets on a given tweet would receive a virality score of four. For our two modeling methods, we use the sentiments, mean AFINN score, and topic-specific words as inputs. Our methods include the random forests method and backward stepwise regression optimized using the Akaike Information Criterion (AIC).
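The virality definition above reduces to a single division. The sketch below assumes a hypothetical helper named `virality`; how to treat accounts with zero followers is also an assumption, chosen to match the `Inf` value that appears in a later table.

```python
def virality(retweets, followers):
    """Retweet-to-follower ratio for a single tweet."""
    if followers == 0:
        # Assumption: a retweeted tweet from a zero-follower account is infinitely viral.
        return float("inf") if retweets > 0 else 0.0
    return retweets / followers

# The worked example from the text: eight retweets, two followers.
print(virality(8, 2))
```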

We use random forests because the method is an ensemble algorithm that runs well on large datasets and has a relatively low risk of overfitting. We use backward stepwise regression so that we can start with a complete model containing all of our selected variables and remove those that contribute little predictive value.
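For readers unfamiliar with backward stepwise selection, the following pure-Python sketch shows the general idea: fit the full model, then repeatedly drop the single feature whose removal lowers AIC the most, stopping when no removal helps. It is an illustration under stated assumptions and not the R routine used for the analysis; the feature names `signal` and `noise`, the synthetic data, and all function names are hypothetical.

```python
import math

def ols_rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

def aic(rss, n, k):
    """Gaussian AIC up to an additive constant; k counts fitted coefficients."""
    return n * math.log(rss / n) + 2 * k

def backward_stepwise(features, y):
    """Drop one feature at a time for as long as doing so lowers AIC."""
    kept = dict(features)
    n = len(y)
    best = aic(ols_rss(list(kept.values()), y), n, len(kept) + 1)
    while kept:
        trials = {
            name: aic(ols_rss([c for m, c in kept.items() if m != name], y), n, len(kept))
            for name in kept
        }
        drop = min(trials, key=trials.get)
        if trials[drop] >= best:
            break
        best = trials[drop]
        del kept[drop]
    return sorted(kept)

# Synthetic data: y depends on `signal` only; `noise` is unrelated to y by construction.
n_obs = 40
signal = [float(i) for i in range(n_obs)]
noise = [float((7 * i) % 5) for i in range(n_obs)]
y = [2.0 * s + math.sin(i) for i, s in enumerate(signal)]
selected = backward_stepwise({"signal": signal, "noise": noise}, y)
print(selected)
```

Solving the normal equations directly is fine for a handful of well-scaled predictors; a production implementation would use a numerically stabler least-squares solver.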

III. Original Tweets

What words appear most frequently in tweets from each sample period?



It appears that by 2021, the term “coronavirus” has become a fairly uncommon way to refer to the virus, with “COVID” becoming the reference of choice. It also seems that over the time frame, messaging about staying home, social distancing, and Trump has died down and no longer stands out from the many other COVID-19-related topics being discussed.

Is there a difference in tweet lengths between the two samples?

30 Mar 2020 Mean Tweet Length: 149.37
30 Mar 2021 Mean Tweet Length: 164.94



We check whether there is a difference in tweet lengths between the two samples.

Tweets from March 2020 were on average shorter than those from 2021. It seems that there is a greater share of longer tweets (around 300 characters) in 2021.

The difference in mean tweet length is statistically significant: the p-value from the one-sided unpaired two-sample t-test is much less than 0.01. The difference is also substantial, with 2020 tweets about sixteen characters shorter on average.
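A one-sided unpaired two-sample test of this kind can be sketched with the Python standard library. The sketch assumes unequal variances (Welch's form) and approximates the p-value with a normal distribution, which is reasonable only for large samples such as a day of tweets; the two short lists of lengths are made up for illustration and are not the actual data.

```python
import math
from statistics import NormalDist, mean, variance

def welch_one_sided(a, b):
    """Welch's t statistic for H1: mean(a) < mean(b), with a normal-approximation p-value."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    p = NormalDist().cdf(t)  # large-sample approximation to the t distribution
    return t, p

# Hypothetical tweet lengths (characters) for the two dates.
lengths_2020 = [120, 140, 150, 135, 160, 145]
lengths_2021 = [170, 165, 180, 175, 160, 185]
t_stat, p_value = welch_one_sided(lengths_2020, lengths_2021)
print(round(t_stat, 2), p_value < 0.01)
```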

What differences do we observe in sentiments between the two periods?



We joined the NRC Word-Emotion Association Lexicon to our data, which allowed us to identify words associated with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).

We produce visualizations comparing the sentiments being expressed in each sample period.

Compared to our 2020 tweets, the 2021 tweets express less trust, less surprise, less joy, less disgust, less anticipation, and less anger, but more sadness, more fear, and more positivity.

Looking at the specific words underlying the 2020 and 2021 sentiments, we can see that “pandemic” is the most used word in both sample periods, though with a different frequency in each. In 2021, negatively valenced words such as “bad” and “sh*t” became more common, as did positively valenced words such as “hope” and “love.” This suggests that after a year, people were more expressive, perhaps because of the fallout and exhaustion of the previous year.

How positive vs. negative are the tweets from each year?

Average AFINN scores for all words by date

30 Mar 2020: -0.61
30 Mar 2021: -0.608


Next, we join the AFINN sentiment lexicon, a list of English terms manually rated for valence with an integer between -5 (negative) and +5 (positive) by Finn Årup Nielsen between 2009 and 2011. We use this lexicon to compute mean positivity scores for all words tweeted in each sample year.
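The mean-valence computation can be sketched as follows. The four-word `TOY_AFINN` dictionary is a hypothetical excerpt with illustrative scores, not the full lexicon, and `mean_valence` is a name invented for this example.

```python
from statistics import mean

# Illustrative scores only; the real AFINN lexicon rates thousands of terms from -5 to +5.
TOY_AFINN = {"support": 2, "hope": 2, "stop": -1, "death": -2}

def mean_valence(tokens, lexicon):
    """Average score over the tokens that appear in the lexicon, or None if none do."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return mean(scores) if scores else None

tokens = ["stop", "the", "death", "toll", "hope"]
print(mean_valence(tokens, TOY_AFINN))
```

Only words found in the lexicon enter the average, so the score describes rated words rather than whole tweets.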

The tweets from 2021 are slightly more positive, but the difference appears negligible.

In 2020, the word “support” (positively valenced) was the most frequently appearing word from the lexicon, whereas in 2021, the word “stop” (negatively valenced) appeared the most frequently. Note that “support” and “stop” are roughly opposite in meaning. Perhaps initially, there were certain efforts people wanted to promote to mitigate the effects of the pandemic. It could be that people grew exhausted by the pandemic and became more opposed to certain phenomena than supportive of others.

What topics/discussions are prevalent in tweets published on 30 Mar 2020?

2020 Topic Key

topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8 topic9 topic10
covid covid pandem corona lockdown stay coronavirus china deliv #coronavirus
death pandem trump virus @drtedro home covid virus #covid19 #covid
coronavirus coronavirus coronavirus lockdown world social health trump support #covid19
test call @realdonaldtrump shit god distanc mask peopl offici read
week time presid time @who peopl hospit spread act #stayhom
die nation peopl day coronavirus safe care world ll quarantin
infect due live hope follow friend fight chines sign day
report busi news famili bad month worker countri senat post
york respons media don india fuck patient @realdonaldtrump copi #socialdistanc
updat pay american fuck @ladygaga april protect blame repres watch
posit countri die peopl pandem don medic stop @govkemp video
rate question real covid student day doctor govern #gapol share
confirm crisi global wait thousand live take lie @rondesantisfl #quarantinelif
dr person watch love lot time save call #sayfi stori
total govern respons die game week ventil travel #flapol time
million school #coronavirus gonna feel hous crisi hoax parent check
gt issu listen start readi hand nurs america beach link
counti continu stop kill univers extend donat start middl sunday
hospit amid brief ve covid19 practic healthcar ban @scgovernorpress book
flu move press job usa sick line suppli #scpolit play


Now, we use the quanteda package’s implementation of topic modeling to identify what themes/discussions are prevalent in each year. Underlying this topic modeling implementation is Latent Dirichlet allocation (LDA), a machine learning algorithm that learns clusters of words that tend to occur together (topics). Tweets, therefore, are understood as heterogeneous mixtures of these topics. For each tweet, probabilities are assigned for each topic that the document may or may not include, and we will assume that the topic assigned the highest probability by the algorithm is the focus of the tweet.
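The assignment rule described above (each tweet is labeled with its highest-probability topic) can be sketched as follows. The probability rows are made up and use three topics rather than ten; fitting the LDA model itself is outside the scope of this sketch.

```python
from collections import Counter

def dominant_topic(topic_probs):
    """Return the 1-based index of the highest-probability topic for one tweet."""
    return max(range(len(topic_probs)), key=topic_probs.__getitem__) + 1

# Hypothetical per-tweet topic probabilities, as an LDA model might output.
doc_topics = [
    [0.10, 0.70, 0.20],
    [0.55, 0.05, 0.40],
    [0.15, 0.60, 0.25],
]
assigned = [dominant_topic(p) for p in doc_topics]
tweets_per_topic = Counter(assigned)
print(assigned, dict(tweets_per_topic))
```

Counting the assigned labels, as in `tweets_per_topic`, is one way to compare how prevalent each topic is.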

While these topics may initially seem to make little sense, there are some patterns we can pick out.

Topic 1 seems to be the informational topic concerning outbreaks, data, hospitalizations, deaths, etc.

Topic 2, it would seem, is focused on the crisis’ impact on the nation: businesses, governments, schools, and people.

Topic 3 appears to be focused on Trump. More specifically, it seems to be about media-related topics such as press briefings, live news, etc. Topic 3 was among the most frequently discussed topics, as a lot of attention was focused on the president.

Topic 4 seems to encompass emotionally intense tweets reflecting fear, anger, and hope. It includes multiple curse words along with words like “love,” “hope,” “kill,” and “die.”

Topic 5 is puzzling; there is no apparent connection between the WHO, Dr. Tedros, and Lady Gaga. We eventually found out that these words correspond to a topic that was trending on 30 Mar 2020: a phone call between WHO Director Dr. Tedros and Lady Gaga. See here: https://twitter.com/drtedros/status/1244008665251708929?lang=en

Topic 6 encompasses tweets urging social distancing.

Topic 7 is the most distinct, as it clearly focuses on the health care situation: mask and ventilator shortages, risks posed to doctors and nurses, and inadequate testing.

Topic 8 centers on China. If we piece together the words, it seems that some of the tweets likely discuss whether it is apt to blame China (note that one of the keywords is “stop”). Additional terms include “travel,” “ban,” “hoax,” and “lie,” which together imply that conversations centered on China are interwoven with virus skepticism.

Topic 9 is clearly a political topic; it seems to be discussing national- and state-level policies. For instance, it includes Governor Kemp of Georgia (GA) and the hashtag #gapol, Governor Ron DeSantis of Florida (FL) and the hashtag #flapol, the South Carolina governor press and SC-related hashtags, and, at a national level, the Senate and representatives.

Topic 10 seems to focus on things people are doing at home while quarantining, such as reading, watching videos, and playing: “#stayhom,” “#quarantinelif,” “read,” “play,” “watch,” “video,” “book,” and “day.” This was the second most prevalent topic. It seems that many people were tweeting about their day-to-day experience in quarantine.

What topics/discussions are prevalent in tweets published on 30 Mar 2021?

2021 Topic Key

topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8 topic9 topic10
#covid19 home covid pandem vaccin pandem covid mask covid pandem
virus covid lockdown post covid trump cdc peopl death covid
#covid stay time link week respons biden wear test peopl
#coronavirus school watch #covid19 shot covid american social peopl fuck
dr time support learn effect fauci news pandem virus feel
concern live month servic dose bad doom distanc day love
base kid day share risk tweet border care report shit
plan children due check million situat impend don posit don
@potus safe play citi receiv birx director stop hospit talk
evict break busi music immun access warn continu rate ve
mar famili close promo prevent origin die health die time
#vaccin student restrict world read call presid covid china guy
#ccpvirus due brisban websit coronavirus amount thousand control flu take
alarm nurs game market studi histori passport public lab god
tax week quarantin leader health reach trump complet patient hope
transmiss travel start live elig blame travel scienc coronavirus happen
peter line local read uk rt america life increas make
fund women ago promot appoint dr surg busi updat day
navarro job season recent pfizer pm top data wuhan start
extend person lock sign develop great illeg rule counti real


The topics from 2021 are harder to interpret.

Topic 1 mentions the Peter Navarro scandal, wherein he—a Trump advisor—allegedly personally profited from questionable/corrupt COVID-19 vaccine investment decisions. None of the other words capture a distinct theme; there are also mentions of taxes, the POTUS, China blaming, etc.

Topic 2 discusses the handling of children and families with respect to schools, travel, and jobs.

Topic 3 seems to be talking about a sports game featuring Brisbane, which people were presumably tuning into. This inference is based on the following words: “watch,” “play,” “brisban,” “game,” and “season.”

We cannot make sense of topic 4.

Topic 5 focuses on the vaccine rollout.

Topic 6 seems focused on Deborah Birx’s comments right around that time, when she claimed that most COVID-19 deaths could have been prevented by Trump and Fauci.

Topic 7 concerns the border crisis—the surge of illegal migrants coming to the US from Mexico, which possibly raises public health fears.

Topic 8 encourages mask wearing and social distancing.

Topic 9 seems informational, focused partly on the COVID situation in China, though the mention of “lab” suggests that this topic also captures conspiracy theories.

Topic 10 is the emotional topic, with angry curse words and words like “love,” “feel,” “hope,” and “God.”

What are the most common hashtags in each sample period?

2020

feature frequency
#covid19 328
#coronavirus 216
#covid 124
#gapol 92
#stayhome 57
#socialdistancing 44
#sayfie 31
#flapol 31
#quarantinelife 31
#stayathome 29
#lockdown 18
#boycotttrumppressconferences 17
#quarantine 13
#corona 13
#trump 12
#coronalockdown 12
#pandemic 12
#coronaviruspandemic 12
#flattenthecurve 12
#scpolitics 12

2021

feature frequency
#covid19 285
#covid 81
#coronavirus 54
#ccpvirus 34
#drlimengyan1 25
#takedowntheccp 25
#wuhanvirus 25
#ludemedia 25
#business 23
#blackownedbusiness 21
#unitedstates 21
#digitalmarketing 20
#ecommerce 20
#ucl 20
#usadetainsoilships3 20
#usanewstoday 20
#customerservice 19
#vaccine 18
#wearamask 17
#pandemic 17

In 2020, the most popular hashtags, of course, are #COVID19 and #coronavirus, but after these, the hot topics seem to be political topics (#gapol, #flapol, #scpolitics, #boycotttrumppressconferences) and social distancing messaging (#stayhome, #socialdistancing, #quarantinelife, #flattenthecurve).

In 2021, some of the most popular hashtags are anti-CCP, seemingly driven by the theory that China developed/released the virus intentionally. Dr. Li-Meng Yan is one of the key conveyors of this theory, as you can see on her Twitter page: https://twitter.com/DrLiMengYAN1

How often is each country mentioned in each sample period?


We examine this question for both March of 2020 and 2021. We use the newsmap model as described on the quanteda package website: https://tutorials.quanteda.io/machine-learning/newsmap/

After formatting the data into country-level document feature matrices, we show the estimated number of mentions for each country.

It appears that for both datasets, the most mentions concern the United States. However, in 2020, a greater share of attention is centered on China than on the next two most mentioned locations (Britain and Canada).

We can visualize this using geographic heatmaps.

As you can see, China is brighter in the 2020 map (as are India and Australia, to an extent), which indicates a higher frequency of mentions.

In 2021, the English-speaking Twitter user-base seems to be slightly more focused on its home countries than on China, though China still receives substantial attention.

What words are associated with each country?

2020

US

word coef
americans 6.64
american 6.36
wtf 2.12
propaganda 2.00
dangerous 2.00
scientists 1.71
twitter 1.71
relief 1.71
irresponsible 1.71
citizens 1.62

China

word coef
china 7.79
chinese 6.04
tons 4.82
communist 3.08
wing 2.98
february 2.91
january 2.57
blocked 2.47
racist 2.37
supplies 2.32

Great Britain

word coef
uk 6.65
ventilator 2.20
son 1.91
street 1.60
write 1.60
building 1.60
usa 1.56
police 1.54
note 1.51
economic 1.51

Canada

word coef
canada 6.46
recovered 1.77
buy 1.70
security 1.70
mo 1.63
difficult 1.63
wall 1.63
leading 1.63
write 1.63
spend 1.63


2021

US

word coef
americans 6.95
american 6.25
washington 5.43
bless 3.12
saved 2.92
facilities 2.09
borders 1.90
reopen 1.78
died 1.69
elected 1.69

China

word coef
china 7.18
chinese 6.30
origins 3.52
theory 3.27
obama 2.93
animal 2.42
lab 2.33
believes 2.24
wuhan 2.17
investigation 2.12

Great Britain

word coef
uk 6.85
eu 3.36
grow 2.77
worry 2.77
strain 2.64
az 2.30
delay 1.90
ireland 1.90
simple 1.90
struggling 1.81

Canada

word coef
canada 6.66
astrazeneca 2.34
hotel 2.07
simple 1.96
book 1.87
national 1.61
ford 1.56
ridiculous 1.56
increasing 1.56
doubt 1.56

For each of the two sample periods, we would like to look at what words are associated with each country. We specifically look at the four most mentioned countries in each dataset: the US, China, Great Britain, and Canada.

2020 Interpretation

It is difficult to make sense of these words, but there are a few whose meanings are obvious. There seems to be a lot of focus on expertism in the US with words like “scientists,” “propaganda,” and “expert.” There’s also discussion of relief bills and certain people being irresponsible.

The China conversation centers around communism, racism, and international travel—to Europe and India.

The Britain-/Canada-related words don’t show any obvious themes, but it seems that the ventilator shortage in the UK was one salient topic.

2021 Interpretation

It is clear that the conversations surrounding these countries have changed in the past year. Rochelle Walensky, the new CDC Director, is one term that stands out. Others are the discussion of the country reopening, energy policy, and borders. We see that all of these terms are more specific than the general focus on scientists and experts that we saw in 2020. Perhaps our national fog is clearing as our country distributes the vaccine and progress is made.

The China discussion still centers on theories about the origin of COVID-19, with mentions of a lab, bats, and Wuhan. Racism is still a common topic, especially given the recent hate crimes in the US.

The UK mentions occur in the contexts of relations with the EU, worries about a new strain, and the AstraZeneca vaccine (“az”). Concerning Canada, we see talk of the AstraZeneca vaccine, which recently rolled out there. The other words listed are less easy to interpret.

IV. Analyzing Retweets

Which are the most followed accounts being retweeted in our sample?


2020

Obama was the most followed person getting retweeted on that day, and it seems that Katy Perry was second. Indian PM Narendra Modi was also getting retweeted at the time, along with several news outlets, politicians, and celebrities.

2021

In 2021, it seems that Obama was far and away the most followed person getting retweeted, with nobody else coming near. News outlets make up a greater share of the top 20 by follower count, perhaps because there are fewer celebrities talking about COVID-19 in March of 2021 than in 2020.

Whose tweets reached the highest aggregate retweet counts in each of the two sample periods?

2020

ScreenName rt_total
moreki_mo 369869
SethAbramson 236566
ChicagoTraderrr 204214
TechInsider 184160
siravariety 176882
a_new_hopee 144919
SinghLions 141862
_caitlingeorgia 133052
JoeBiden 128372
MLKChefLean 126289
dglo4me 118316
CorneliaLG 114712
BarackObama 111313
comiketofficial 110311
sin_xia 101894
FaveEngineerJen 101459
b0mbchell_ 93733
emmabethgall 90161
jeremycyoung 89643
quenblackwell 85545

2021

ScreenName rt_total
JRKSB_ 140677
Marco_Acortes 91265
Mippcivzla 60549
JoeBiden 59563
BIGHIT_MUSIC 47090
NicolasMaduro 40822
aj_buu 39485
KatPapaJohns 38968
leelecarvalho_ 38215
Mikel_Jollett 37101
wwxwashere 35973
843KT 35627
__Jones__ 32319
Ric3townFinest 31910
VTVcanal8 31656
mordomoeugenio 28083
BarackObama 25413
tattyhassan 24946
Mediavenir 22708
DanPriceSeattle 21799

For both lists, most of these names are not particularly recognizable with the exception of Joe Biden, Barack Obama, and Nicolas Maduro (2021).

Who is the best at getting retweeted (who gets retweeted by the greatest share of followers for each tweet)?

2020

ScreenName Followers Retweets n rt_index
camilacousseau 616 7787 2 6.32
Shirley20188539 602 4154 2 3.45
CovidKidBuak 1330 16624 4 3.12
CorneliaLG 17579 114712 3 2.18
TobiRachel_ 10866 53258 3 1.63
soulzoul 5601 15326 2 1.37
NatalieElsberg 3713 7122 2 0.96
TanushJagdish 688 1260 2 0.92
PrakritiGaba 11893 21353 2 0.90
InactionNever 37452 62671 2 0.84

2021

ScreenName Followers Retweets n rt_index
richardajabu 0 3 2 Inf
_mariagaabriela 869 7468 2 4.30
cesarRG86 756 4561 2 3.02
Jos42121217 4 36 3 3.00
soynanne 192 932 2 2.43
1raciad411 109 339 2 1.56
abg_escalona 229 475 3 0.69
PN_1984 2231 3040 2 0.68
WenkertKai 1138 1331 2 0.58
LaMagnifica25 2153 1756 2 0.41

30 Mar 2020 Mean RTs: 1348.66
30 Mar 2021 Mean RTs: 369.97


We compute an index for this—a proxy for “Twitter game.” First, we determine the number of unique tweets for each screen name. Then, we produce a ratio of number of retweets per tweet to number of followers.

We restrict this analysis to people who have multiple original tweets getting retweeted in our samples. This should eliminate from our top ten those users who had one tweet take off but received no attention on their others.
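The index described above can be sketched in a few lines of Python. The three rows are taken from the tables in this section (two from 2020, one from 2021); the function name `rt_index` and the tuple layout are assumptions made for the example.

```python
def rt_index(total_retweets, n_tweets, followers):
    """Average retweets per tweet, divided by follower count."""
    per_tweet = total_retweets / n_tweets
    # Assumption: zero followers yields an infinite index, as in the 2021 table.
    return per_tweet / followers if followers else float("inf")

# (screen name, followers, total retweets, number of retweeted original tweets)
accounts = [
    ("camilacousseau", 616, 7787, 2),
    ("Shirley20188539", 602, 4154, 2),
    ("richardajabu", 0, 3, 2),
]

# Keep only accounts with multiple retweeted tweets, then rank by index.
ranked = sorted(
    ((name, rt_index(rts, n, fol)) for name, fol, rts, n in accounts if n >= 2),
    key=lambda row: row[1],
    reverse=True,
)
for name, index in ranked:
    print(name, round(index, 2))
```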

In 2020, camilacousseau averaged over six retweets per follower across two tweets. Nobody else exceeded 55% of this impressive ratio.

In 2021, richardajabu had an infinitely high retweet index by managing to get two tweets retweeted three times despite having zero followers. Incredible.

We also check whether there’s a difference in mean retweets for each sample. Among the tweets that were retweeted (an important specification), the average number of retweets was much higher in 2020 than in 2021. An unpaired two-sample t-test leads us to reject the null hypothesis in favor of the one-sided alternative (p<0.01) that the mean number of retweets on 30 Mar 2020 was greater than the mean number of retweets on the same date in 2021. It seems that people were hitting the retweet button a lot more in the spring of 2020 than in 2021, which makes sense given that people were stuck at home in lockdown with a lot of novel happenings to discuss. Now, with vaccines being rolled out and coronavirus in decline, there is much less panic, anger, and dramatic attention being paid to the virus.

In each year, who was the worst at getting retweeted?

2020

ScreenName Followers Retweets n rt_index*1e5
ElNacionalWeb 5147877 6 3 0.04
ndtv 15088045 71 4 0.12
Reuters 23385704 200 5 0.17
guardian 9702293 54 3 0.19
la_patilla 7067042 83 5 0.23
radiomitre 1054776 6 2 0.28
lasopa_news 447945 4 3 0.30
kompascom 8139885 146 6 0.30
CGTNOfficial 13614960 142 3 0.35
elnorte 1017073 8 2 0.39

2021

ScreenName Followers Retweets n rt_index*1e5
detikcom 16866238 6 9 0.00
TheEconomist 25665215 9 4 0.01
FinancialTimes 6956127 4 5 0.01
SSalud_mx 1239631 3 8 0.03
latimes 3813280 11 8 0.04
Independent 3557651 7 3 0.07
eleconomista 721345 1 2 0.07
HoustonChron 648473 1 2 0.08
CGTNOfficial 13614961 32 3 0.08
DolarToday 3749456 20 6 0.09

2020

ElNacionalWeb, NDTV, Reuters, and the Guardian are all terrible at getting retweeted, which makes sense because they likely tweet a lot of boring, matter-of-fact news as opposed to the clickbait-y headlines that sites like Fox or the New York Times tweet.

2021

Again, we see mostly news sites failing to get many retweets. There isn’t anything too interesting to be said about this.

Using our previous topic models, which topics were being retweeted about the most in each year?


2020

It seems that topic 9 was being retweeted about the most, which is very interesting because topic 9 had the fewest original tweets out of all topics we identified from the 2020 data.

Recall that topic 9 was about politics. Perhaps people tend to promote the views of media outlets or political influencers whose content they consume but don’t have much to personally contribute to these conversations.

It may also be the case that retweeting and publishing original tweets are zero-sum behaviors. In other words, when people are more likely to retweet about a particular topic, maybe that makes them less likely to also tweet about it themselves. Maybe others have summed up their thoughts better than they could themselves.

We will see if this pattern is also seen in the 2021 data.

2021

This hypothesized explanation is supported by the 2021 data; topic 1 is the most retweeted about but is the least originally tweeted about. Recall that topic 1 was a sort of catch-all alarmist topic, though it mentions Peter Navarro. It may be the case that retweets displace tweets about the same topic.

Is there a difference in tweet length among retweets in each sample year?

30 Mar 2020 Mean Retweeted Tweet Length: 188.46
30 Mar 2021 Mean Retweeted Tweet Length: 196.49



We see that 2020 retweets are slightly shorter and conduct a t-test.

It seems the difference in tweet lengths is statistically significant. 2020 tweets (those being retweeted) were slightly shorter.

Which sentiments are prevalent among retweeted tweets?


The sentiments observed are not strikingly different, but there is more sadness, less surprise, more negative sentiments, and less joy being expressed in 2021—over a year from the start of the pandemic in the US.

Is there a difference between RT AFINN scores in each period?

Average AFINN scores for all retweeted words by date

30 Mar 2020: -0.525
30 Mar 2021: -0.608


The 2021 retweets appear to be more negatively valenced.

The most common word in each period is “death.” It seems that this word is much more common among retweets than among original tweets. Perhaps the topic is too heavy for people to feel compelled to write their own tweets about it, but they are willing to retweet others’ words. "F***" is much more common in 2020 than in 2021. Perhaps this is because people were more panicked at the start of the pandemic.

What textual features (sentiments and words) predict how viral a tweet became in 2020?

2020 Stepwise Regression


  virality
Predictors Estimates CI p
(Intercept) 0.48 0.14 – 0.81 0.005
hoax 5.66 3.19 – 8.12 <0.001
afinn -0.14 -0.29 – 0.01 0.061
joy 0.36 -0.00 – 0.72 0.051
surprise -0.47 -0.80 – -0.13 0.007
gonna 2.86 0.23 – 5.48 0.033
Observations 4042
R2 / R2 adjusted 0.008 / 0.007

2021 Stepwise Regression


  virality
Predictors Estimates CI p
(Intercept) 0.09 -0.09 – 0.27 0.352
take 1.25 0.29 – 2.20 0.010
positive -0.14 -0.26 – -0.02 0.019
trust 0.40 0.27 – 0.52 <0.001
fear -0.08 -0.18 – 0.03 0.143
joy -0.24 -0.43 – -0.05 0.012
Observations 2818
R2 / R2 adjusted 0.017 / 0.015


Let’s see if using sentiments, mean AFINN score, and topic-specific words, we can predict a given tweet’s retweet-to-follower ratio (a measure of how viral a tweet was—to what extent it took off). We produce two models for each year—a random forests model and a stepwise regression optimized using AIC. We could include tweet length but opt not to because we are not interested in generating a highly predictive model but understanding what semantic content is being retweeted, and tweet length does not speak strongly to that question.

We are not interested in knowing the predictive efficacy or goodness of fit of the random forests model. We only want to know, out of the features we have selected, which seem the most important to predicting virality. As such, we display variable importance plots for each model and use these for interpretation.

2020

According to our random forests model, the most predictive variables are positive sentiments, disgust, and the words “sick,” “hoax,” and “country” in all of its variations.

Because the model is predicting both high and low values, the sign of each predictor (positive or negative effect on virality) is unclear. However, we would surmise that features with a strong positive effect would be the most likely to be selected given that most of the variation is positive.

However, because we can’t be sure about this, we build a simple backward stepwise linear regression (optimized on AIC) to determine which effects were positive and which were likely negative.

It appears that using the word “hoax” predicts a substantial increase in virality—to the tune of over five retweets per follower. The AFINN variable has a negative coefficient, which implies that more positive tweets go less viral. Confusingly, the opposite effect is seen with the joy sentiment. Perhaps joy denotes something more specific than a positive AFINN score would imply. Surprise sentiments predict less virality, and, for whatever reason, the use of the word “gonna” predicts a substantial increase in virality.

2021

For our 2021 random forests model, the word “take” was the most predictive, followed by positive sentiments, trust, anger, surprise, fear, and joy. It seems that particular words were much less predictive on March 30, 2021. Perhaps this means that there are fewer “hot topics,” so mood is now a better way to keep a finger on the pulse than it was early in the pandemic.

Again, we also produce a backward stepwise regression, which confirms the predictive significance of the word “take” and shows that it predicts more retweets per follower. Positive sentiments, fear, and joy predict less virality, whereas trust predicts more.

V. Conclusion

Conclusion

By examining tweets from these two dates, we were able to uncover a number of insights about how COVID-19 communications have changed, both in sentiment and in content. We see that the focus on China has abated slightly. The topics that people are discussing have changed as well: conversations about lockdown are less prevalent, and political discussions have shifted. The kinds of tweets going viral (getting heavily retweeted) differ substantially between the two years. In fact, the average number of retweets has itself shifted dramatically, with far fewer retweets overall in 2021. While the presence of a particular word can’t by itself predict how viral a tweet is, there are clear patterns in which words attracted attention during each period.

These findings may give clues about how policymakers and influencers can craft their messages to reach a wider audience during similar public health crises. Because determinants of virality are ever-changing, influencers should keep their finger on the pulse using methods similar to the ones we use here.

Limitations

First, we ultimately compared one day of tweets to another day of tweets one year later. While this analysis already consumed a massive amount of computational power and memory, a more robust analysis might have compared a full week or even a month of tweets to another week or month of tweets. Because we analyzed only one day of tweets from each year, there is a chance that our analysis is capturing idiosyncratic variation for a given day rather than the general underlying sentiment of a prolonged time period. That said, a full day of tweets is certainly more robust than one hour (or even one minute) of tweets.

Second, only certain kinds of people are drawn to Twitter, so a Twitter analysis does not necessarily offer an accurate representation of how the population as a whole feels. More outspoken and extroverted people are more likely to express their views on Twitter, which leads to the underrepresentation of more soft-spoken and introverted people. Therefore, this Twitter analysis would be best used in combination with other data collection and analysis methods.

Contributions

Ammar Plumber did the data retrieval and cleaning along with the topic modeling, newsmaps, exploratory/predictive analysis of virality, and interpretations of these sections.

Elaina Lin produced exploratory visualizations including word frequencies and sentiment analyses, as well as the CSS for the webpage and the writing/interpretations.

Kim Nguyen assisted with analyses of tweet length in both original tweets and retweets. She also contributed to the writing and webpage formatting efforts.

Ryan Karbowicz produced hashtag counts, topic frequency bar graphs for each sample, and much of the writing.

Meghan Aines helped with the writing and formatting of the webpage.