Project as part of Data Structures course
- Worked with a large-scale dataset (~1 TB; all tweets sent over the year with the targeted tags)
- Worked with multilingual text
- Used the MapReduce divide-and-conquer paradigm to create parallel code
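
The MapReduce approach mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It assumes each input line is one JSON-encoded tweet with `text` and `lang` fields, and the hashtag list, `map_tweets`, and `reduce_counts` are hypothetical names chosen for this example.

```python
import json
import re
from collections import Counter, defaultdict

# Assumed subset of tracked hashtags, lowercased for matching.
HASHTAGS = {"#covid19", "#corona", "#coronavirus"}

# \w is Unicode-aware in Python 3, so hashtags in non-Latin scripts also match.
HASHTAG_RE = re.compile(r"#\w+")


def map_tweets(lines, hashtags=HASHTAGS):
    """Map step: for one chunk of tweets, count tweets per hashtag and language."""
    counts = defaultdict(Counter)
    for line in lines:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        if not isinstance(tweet, dict):
            continue
        text = (tweet.get("text") or "").lower()
        lang = tweet.get("lang") or "und"  # "und" stands in for an unknown language
        # A set ensures each tweet is counted at most once per hashtag.
        for tag in set(HASHTAG_RE.findall(text)) & hashtags:
            counts[tag][lang] += 1
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into a single total."""
    total = defaultdict(Counter)
    for partial in partials:
        for tag, by_lang in partial.items():
            total[tag].update(by_lang)
    return total
```

Because each chunk is mapped independently, the map step can run in parallel (for example, one process per input file), and the reduce step only has to merge the small per-chunk counters.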
The full set of findings is in the two analysis_ folders, which break down tweets with the specified keyword by language and by country. Some interesting findings from analyzing these data include:
The coronavirus goes by many names, and several of them appear as hashtags. Among English-speaking Twitter users, Covid19 was the most popular (617695 hashtag instances at the time of data collection), followed by Corona (529764) and Coronavirus (422394).
For the most popular hashtag, covid19, the US produced the most tweets (283149), followed by India (country code IN) with 88590 and the United Kingdom (country code GB) with 88178. This may indicate which countries have the most online discussion about the virus – though these numbers should be put into context by comparing them with the number of Twitter users in each country.
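
One way to put the country counts into context is to divide each by the number of Twitter users in that country. The sketch below is only an illustration of that normalization. `tweets_per_user` is a hypothetical helper, and the user counts are deliberately left as an input because the analysis does not provide them.

```python
# Tweet counts for the covid19 hashtag, taken from the findings above.
COVID19_TWEETS = {"US": 283149, "IN": 88590, "GB": 88178}


def tweets_per_user(tweet_counts, user_counts):
    """Return tweets per user for each country that has a known, non-zero user count."""
    rates = {}
    for country, tweets in tweet_counts.items():
        users = user_counts.get(country)
        if users:  # skip countries with a missing or zero user count
            rates[country] = tweets / users
    return rates
```

Calling `tweets_per_user(COVID19_TWEETS, user_counts)` with a `user_counts` dictionary from an external source (keyed by the same country codes) gives a per-user rate that can be compared across countries.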