Please see presentation slides in the repository.
I. Problem Statement
II. Data Dictionary
III. Methodology
a. Acquisition
b. Processing
c. Exploratory Data Analysis
d. Modeling
IV. Conclusion
For this project, I'm an analyst at a political targeting firm. In light of the pandemic, a larger share of direct voter contact is over the internet, rather than in person. However, it isn't easy to connect social media users to their voter file records and therefore their ideological support and voter turnout scores as determined by other models. So, in order to identify persuadable targets and activatable supporters, my firm wants to create a model that can differentiate between language used by people on varying ends of the political spectrum. Eventually, they will use this model to identify target accounts on platforms like Twitter to whom campaigns will conduct outreach.
Question I'm seeking to answer: Do Reddit users on varying ends of the economic political spectrum use vocabularies distinct enough for a model to differentiate between them?
Titles and content of posts requested from r/Socialism and r/Capitalism using the Pushshift API.
Feature | Type | Description |
---|---|---|
subreddit_s | int64 | Binary indicator of the subreddit from which a post was pulled (1 = r/Socialism, 0 = r/Capitalism) |
sentences | int64 | Number of sentences in post as determined by nltk's PunktSentenceTokenizer |
avg_sent_len | float64 | Average number of words in each sentence of the post |
words | int64 | Total number of words in the post |
avg_word_len | float64 | Average number of characters in each word of the post |
ADJ_prop | float64 | Proportion of the content made up of adjectives |
ADP_prop | float64 | Proportion of the content made up of adpositions |
ADV_prop | float64 | Proportion of the content made up of adverbs |
AUX_prop | float64 | Proportion of the content made up of auxiliary words |
CCONJ_prop | float64 | Proportion of the content made up of coordinating conjunctions |
DET_prop | float64 | Proportion of the content made up of determiners |
INTJ_prop | float64 | Proportion of the content made up of interjections |
NOUN_prop | float64 | Proportion of the content made up of nouns |
NUM_prop | float64 | Proportion of the content made up of numbers |
PART_prop | float64 | Proportion of the content made up of particles |
PRON_prop | float64 | Proportion of the content made up of pronouns |
PROPN_prop | float64 | Proportion of the content made up of proper nouns |
PUNCT_prop | float64 | Proportion of the content made up of punctuation |
SCONJ_prop | float64 | Proportion of the content made up of subordinating conjunctions |
SPACE_prop | float64 | Proportion of the content made up of spaces |
SYM_prop | float64 | Proportion of the content made up of symbols |
VERB_prop | float64 | Proportion of the content made up of verbs |
X_prop | float64 | Proportion of the content made up of uncategorizable words |
vader_neg | float64 | Average negativity score for the words in the post as determined by vaderSentiment's SentimentIntensityAnalyzer |
vader_pos | float64 | Average positivity score for the words in the post as determined by vaderSentiment's SentimentIntensityAnalyzer |
vader_neu | float64 | Average neutrality score for the words in the post as determined by vaderSentiment's SentimentIntensityAnalyzer |
vader_compound | float64 | Average compound score for the words in the post as determined by vaderSentiment's SentimentIntensityAnalyzer |
Check the documentation for spaCy and vaderSentiment for more information.
To acquire content for this project, I used the Pushshift API to request posts from two subreddits, r/Socialism and r/Capitalism. Because the Pushshift API allows a maximum of 100 posts per request, I wrote a function that takes a subreddit name, a number of iterations, and an optional existing list to extend, and pulls 100 posts into that list on each iteration. To be respectful of Reddit's servers, I added a 5-second cushion between requests so I wouldn't inundate the servers with consecutive calls. I used a while loop to request ~2,800 posts from each subreddit that appeared to have substantive content (i.e., the body didn't read "deleted" or ""). I then combined these into one dataframe by narrowing the columns to only the ones the two sets had in common and adding the subreddit_s column to differentiate between them. Then I was ready to start processing!
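As a rough illustration, the acquisition loop looked something like the sketch below. The helper name get_posts and the exact request parameters are my own reconstruction; only the Pushshift submission endpoint, the 100-post batch size, and the 5-second pause come from the description above.

```python
import time
import requests

PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission"

def get_posts(subreddit, n_requests, posts=None):
    """Pull up to 100 posts per request from a subreddit via Pushshift,
    walking backwards in time and pausing 5 seconds between requests."""
    posts = posts if posts is not None else []
    before = posts[-1]["created_utc"] if posts else None
    for _ in range(n_requests):
        params = {"subreddit": subreddit, "size": 100}
        if before is not None:
            params["before"] = before
        batch = requests.get(PUSHSHIFT_URL, params=params).json()["data"]
        if not batch:
            break
        posts.extend(batch)
        before = batch[-1]["created_utc"]  # continue from the oldest post seen so far
        time.sleep(5)  # 5-second cushion to be respectful of the servers
    return posts

# ~2,800 posts per subreddit = 28 requests of 100 posts each
socialism_posts = get_posts("Socialism", 28)
capitalism_posts = get_posts("Capitalism", 28)
```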
Before I altered the content in any way, I wanted to tokenize each post into sentences and words to get the sentences, avg_sent_len, words, and avg_word_len columns. After that, I converted all the text to lowercase and removed punctuation to make the content a bit more uniform across the dataset. I used spaCy to tag each post by part of speech and create the _prop columns to be used in the models later, then used vaderSentiment's SentimentIntensityAnalyzer to create sentiment scores for each post. From there, I moved on to feature selection in terms of which words I would include in my model.
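A condensed sketch of that processing step is below. It assumes spaCy's small English model and a hypothetical featurize helper; the actual notebook may structure this differently.

```python
from collections import Counter

import spacy
from nltk.tokenize import sent_tokenize, word_tokenize
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

nlp = spacy.load("en_core_web_sm")
sia = SentimentIntensityAnalyzer()

def featurize(text):
    """Sentence/word counts, part-of-speech proportions, and VADER sentiment for one post."""
    sents = sent_tokenize(text)
    words = word_tokenize(text)
    feats = {
        "sentences": len(sents),
        "avg_sent_len": len(words) / len(sents) if sents else 0,
        "words": len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0,
    }
    # Proportion of tokens carrying each spaCy POS tag (ADJ_prop, NOUN_prop, ...)
    doc = nlp(text)
    pos_counts = Counter(tok.pos_ for tok in doc)
    for pos, count in pos_counts.items():
        feats[f"{pos}_prop"] = count / len(doc)
    # VADER sentiment scores for the whole post (neg, neu, pos, compound)
    feats.update({f"vader_{k}": v for k, v in sia.polarity_scores(text).items()})
    return feats
```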
To start, I used sklearn's CountVectorizer to get a dataframe of unique words and their frequency of use. I couldn't use all 26,518 unique words in these posts (not including stop words) as features in my models without waiting hours for them to run and probably overfitting to my data. To narrow the list down, I used nltk's PorterStemmer to group similar words together and replaced the words in the content with their stemmed versions. Then I concatenated all posts into two very long strings, one per subreddit, and used sklearn's TfidfVectorizer to glean the importance of each word to its respective "document" (the super long strings!). I kept the top 1,000 most important stems from each subreddit, which came out to only 1,197 stems total because there was so much overlap. Then I used CountVectorizer again to get a new dataframe with only that smaller group of stemmed words and their frequency of use. Finally, I merged that dataframe with the sentences, avg_sent_len, words, avg_word_len, and _prop columns from earlier to create my final dataframe of features and the target variable.
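Here is a rough sketch of that selection step, assuming a recent scikit-learn and a dataframe df with the cleaned post text in a content column (both names are placeholders):

```python
import numpy as np
import pandas as pd
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

stemmer = PorterStemmer()

def stem_text(text):
    """Replace each word with its Porter stem."""
    return " ".join(stemmer.stem(w) for w in text.split())

df["stemmed"] = df["content"].apply(stem_text)

# Treat each subreddit as one long "document" and score stems by TF-IDF
docs = [
    " ".join(df.loc[df["subreddit_s"] == 1, "stemmed"]),  # r/Socialism
    " ".join(df.loc[df["subreddit_s"] == 0, "stemmed"]),  # r/Capitalism
]
tfidf = TfidfVectorizer(stop_words="english")
scores = tfidf.fit_transform(docs).toarray()
vocab = np.array(tfidf.get_feature_names_out())

# Union of the top 1,000 stems from each subreddit (~1,197 after overlap)
top_stems = set(vocab[scores[0].argsort()[-1000:]]) | set(vocab[scores[1].argsort()[-1000:]])

# Count only those stems per post; this frame gets merged with the engineered features
cvec = CountVectorizer(vocabulary=sorted(top_stems))
word_counts = pd.DataFrame(
    cvec.fit_transform(df["stemmed"]).toarray(),
    columns=cvec.get_feature_names_out(),
    index=df.index,
)
```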
I started off my exploratory analysis as I often do: by looking at a correlation heatmap. But this time, with over 1,200 features, I narrowed it to only the features that weren't individual words. It was a bit daunting that all of the correlations were weaker than 0.1 or -0.1 except for NOUN_prop, which had a -0.12 correlation with the r/Socialism subreddit.
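For reference, the heatmap restricted to the non-word features can be drawn along these lines (the exact column list here is illustrative, not the full set):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Only the engineered (non-word) features plus the target
non_word_cols = [
    "subreddit_s", "sentences", "avg_sent_len", "words", "avg_word_len",
    "NOUN_prop", "VERB_prop", "ADJ_prop", "PRON_prop",
    "vader_neg", "vader_neu", "vader_pos", "vader_compound",
]
plt.figure(figsize=(10, 8))
sns.heatmap(df[non_word_cols].corr(), annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlations among non-word features")
plt.show()
```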
To get an idea of how different the vocabularies were across the subreddits, since a heatmap of individual words didn't seem feasible, I created a Venn diagram of the overlap between the 100 most common words in each subreddit. This was also a bit daunting: they shared 73 of their top 100 words, leaving only 27 words unique to each subreddit's top 100.
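One way to draw such a diagram is with the third-party matplotlib_venn package; whether that was the exact tool used is an assumption, but the counts below are the ones reported above.

```python
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

# (unique to r/Socialism, unique to r/Capitalism, shared) among each top 100
venn2(subsets=(27, 27, 73), set_labels=("r/Socialism top 100", "r/Capitalism top 100"))
plt.title("Overlap of the 100 most common words per subreddit")
plt.show()
```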
Then I went on to look at the mentions of the topics of each subreddit to see if they mentioned each other as much as they mentioned themselves. I found that the subreddits were significantly more likely to mention their own topic than the other topic but still mentioned the other topic quite a bit.
I also wanted to see if there were any strong disparities in the sentiment of posts in these subreddits. I found that they had largely similar average positivity, neutrality, negativity, and compound scores, but if anything, r/Capitalism was more emotional and less neutral in its posts.
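Comparing those averages amounts to a simple group-by on the VADER columns, something like:

```python
# Mean VADER sentiment scores by subreddit (1 = r/Socialism, 0 = r/Capitalism)
sentiment_cols = ["vader_pos", "vader_neu", "vader_neg", "vader_compound"]
print(df.groupby("subreddit_s")[sentiment_cols].mean())
```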
After this exploration, I was not particularly encouraged that I'd be able to build an especially accurate model to differentiate between these subreddits; the language and sentiment seem quite similar. But I was hoping the combination of all of these features would be more meaningful than any of them on their own.
The baseline accuracy is 0.510 for comparison. For all modeling, I used a random_state of 26.
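A minimal sketch of the setup, assuming the merged feature dataframe from the processing step (here called features_df, a placeholder name):

```python
from sklearn.model_selection import train_test_split

X = features_df.drop(columns="subreddit_s")
y = features_df["subreddit_s"]

# Baseline accuracy: always guess the majority class (~0.510 here)
print(y.value_counts(normalize=True).max())

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=26)
```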
I tried out models in order of increasing complexity for this project. The first model I tried was a LogisticRegression. Any parameters I tweaked resulted in lower scores, so I used the default parameters and got a score of 0.877 on the training data and 0.726 on the testing data. This model was definitely overfit, but it was encouraging to see the simplest model outperform the baseline accuracy by over 40%. I had less success with KNeighborsClassifier, DecisionTreeClassifier, and BaggingClassifier, all of which had testing scores below 0.700.
But as I tried other ensemble models, I had increasing success. The best of these were ExtraTrees and AdaBoost with ExtraTrees as the base_estimator. Then I combined the top five performing models in a VotingClassifier for my best testing score of 0.773. I was able to bring that testing score up very slightly by trimming features based on the coefficients from my top performing models: when I reduced the features from 1,200+ to 800, I got a final testing score of 0.774. All scores are below for your comparison.
Model | Parameters | Training Score | Testing Score |
---|---|---|---|
LogisticRegression | All default parameters | 0.877 | 0.726 |
KNeighbors | algorithm = 'ball_tree', n_neighbors = 15, p = 1 | 0.684 | 0.615 |
DecisionTree | min_samples_leaf = 5, min_samples_split = 10, ccp_alpha = 0.001 | 0.892 | 0.663 |
Bagging | All default parameters | 0.988 | 0.692 |
RandomForest | n_estimators = 150 | 0.999 | 0.738 |
ExtraTrees | min_samples_split = 5, n_estimators = 200, warm_start = True | 0.998 | 0.756 |
AdaBoost | base_estimator = ExtraTreesClassifier(), learning_rate = 0.9, algorithm = 'SAMME' | 0.999 | 0.768 |
GradientBoost | learning_rate = 0.08, max_depth = 4, n_estimators = 150 | 0.874 | 0.753 |
Voting | estimators = [LogisticRegression(), RandomForestClassifier(), ExtraTreesClassifier(), AdaBoostClassifier(), GradientBoostingClassifier()], weights proportional to each model's testing score | 0.998 | 0.773 |
Voting2 | Same as above but with fewer features | 0.998 | 0.774 |
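The final ensemble can be assembled roughly as below. The voting='soft' setting and the exact weight values are my assumptions based on the table; the hyperparameters are the ones listed above.

```python
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression

# The top five models, weighted roughly in proportion to their individual testing scores
estimators = [
    ("logreg", LogisticRegression()),
    ("rf", RandomForestClassifier(n_estimators=150, random_state=26)),
    ("et", ExtraTreesClassifier(min_samples_split=5, n_estimators=200,
                                warm_start=True, random_state=26)),
    # Note: base_estimator is renamed to estimator in newer scikit-learn versions
    ("ada", AdaBoostClassifier(base_estimator=ExtraTreesClassifier(),
                               learning_rate=0.9, algorithm="SAMME", random_state=26)),
    ("gb", GradientBoostingClassifier(learning_rate=0.08, max_depth=4,
                                      n_estimators=150, random_state=26)),
]
weights = [0.726, 0.738, 0.756, 0.768, 0.753]  # individual testing scores from the table

voter = VotingClassifier(estimators=estimators, voting="soft", weights=weights)
voter.fit(X_train, y_train)
print(voter.score(X_train, y_train), voter.score(X_test, y_test))
```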
In the end, my best model had a training score of 0.998 and a testing score of 0.774, so it was quite overfit but also outperformed the baseline accuracy by 51.7% on unseen data. The following features were the most significant across the top performing models used in the final VotingClassifier.
The feature with the strongest association with r/Socialism was words with the stem 'engag' ('engaged', 'engagement', 'engaging'), with 'comrad' ('comrade', 'comradery') following closely behind. Then came words like 'sector', 'pass', 'leftist', and 'class'. Interestingly, r/Socialism seems to use more proper nouns and pronouns than r/Capitalism, which I think can also explain the greater use of particles (a possessive apostrophe at the end of a word may get removed, for example, leaving an 's' counted as a standalone token). For r/Capitalism, fascinatingly, the number one strongest feature associated with that subreddit was use of the word 'stupid'. After that came words/stems like 'poll', 'factor', 'market', 'venezuela', and 'tax'. There also seems to be more use of verbs, nouns, spaces, and possibly usernames ('latestagecapit' and 'modelusgov') in this subreddit.
So to conclude, it seems like there are overlapping vocabularies between r/Socialism and r/Capitalism but the way in which the language is used can help create a fairly accurate classification model. As for the problem statement, I think this could be further refined with more data and further modeling to put a finer point on the difference between the way socialists and capitalists speak. With more work, I think this type of model could be used for targeting social media users for various calls to action.
Thank you for reading all of this, dear reader! I hope it was interesting and useful for any of your future endeavors. Godspeed :^)