Natural Language Processing Analysis

Executive Summary

In this round of analysis we worked with three Natural Language Processing (NLP) techniques, namely:

Topic Modeling
Named Entity Recognition (NER)
Sentiment Analysis

Our results from both Topic Modeling and NER indicate widespread consumption of news on the ongoing Russia-Ukraine Conflict. In both cases, the topics our model identified and the entities referenced most often were either directly involved in or adjacent to the conflict. When we subset the data around key events of the conflict, the topics our models produced overlapped less. Lastly, our Sentiment Analysis on submissions and comments related to the conflict suggested that submissions tended to be more negative, while comments were split more evenly between negative and positive sentiments. The models classified relatively few submissions or comments as neutral.

Text Pre-Processing

We started our analysis by assessing the number of words in the title and body columns of our datasets. The distributions are shown in the two graphs below.

Both graphs, in figures 1 and 2, are right skewed, reflecting that the majority of both titles and comments are below 50 words.
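As a rough illustration of the word-count check described above, the sketch below counts whitespace-separated words per text and reports the share that falls below a threshold. The helper names (word_count, share_below) and the sample texts are hypothetical and are not taken from our pipeline.

```python
def word_count(text):
    """Count whitespace-separated tokens in a text."""
    return len(text.split())


def share_below(texts, threshold=50):
    """Return the fraction of texts with fewer than `threshold` words."""
    if not texts:
        return 0.0
    short = sum(1 for t in texts if word_count(t) < threshold)
    return short / len(texts)


# Hypothetical sample: three short texts and one 60-word text.
sample_texts = [
    "Russia begins withdrawing troops",
    "word " * 60,
    "Short comment",
    "",
]
short_share = share_below(sample_texts)
```

A right-skewed distribution like the one in the graphs would show up here as a share close to 1 for a threshold of 50.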

Next, we moved on to preprocessing. To carry out the natural language processing (NLP) tasks, the corpora of comments and submissions had to go through several processing steps, listed below:

Remove stop words
Remove other languages
Remove special characters
Convert text to lower case
Lemmatize words

The preprocessing steps listed above were applied to the text portions of our submissions and comments datasets, on the columns title and body respectively. We used the johnsnowlabs package and implemented the preprocessing as a pipeline, in which the text was systematically converted to document types and then tokenized. This transformed the data into a bag-of-words form, which was used to accomplish the business goals below.
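The listed steps can be sketched in plain Python as follows. This is a minimal illustration and not the johnsnowlabs pipeline itself: the stop-word set and the lemma lookup are tiny hypothetical stand-ins, language filtering is not shown, and lowercasing is applied first so that stop-word matching is case-insensitive.

```python
import re

# Hypothetical, deliberately tiny resources; a real pipeline would rely on
# full stop-word lists and a trained lemmatizer.
STOP_WORDS = {"the", "a", "an", "is", "are", "in", "of", "to", "and"}
LEMMAS = {"troops": "troop", "killed": "kill", "attacks": "attack", "reports": "report"}


def preprocess(text):
    """Lowercase, strip special characters, drop stop words, and lemmatize."""
    text = text.lower()
    # Replace everything except ASCII letters and whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Map each token to its lemma when the lookup knows it.
    return [LEMMAS.get(t, t) for t in tokens]


cleaned = preprocess("Russian troops KILLED in the attacks, reports say!")
```

The result is a list of tokens per text, which is the bag-of-words form the later steps operate on.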

After the preprocessing, we assessed the 15 most commonly occurring words in both the submissions and comments datasets, as displayed in tables 3.1 and 3.2.

Table 3.1 : Most Common Words in the Submissions Dataset

╒════════════════════════════╤════════════════════════════╕
│ Top Words in Submissions   │  Occurrence in Submissions │
╞════════════════════════════╪════════════════════════════╡
│ ukraine                    │                      26704 │
│ russian                    │                      19298 │
│ russia                     │                      17456 │
│ war                        │                       9411 │
│ ukrainian                  │                       6740 │
│ putin                      │                       6683 │
│ china                      │                       5264 │
│ world                      │                       4722 │
│ kill                       │                       4711 │
│ news                       │                       4638 │
│ attack                     │                       4036 │
│ we                         │                       4024 │
│ military                   │                       3670 │
│ force                      │                       3626 │
│ report                     │                       3568 │
╘════════════════════════════╧════════════════════════════╛

Table 3.2 : Most Common Words in the Comments Dataset

╒═════════════════════════╤═════════════════════════╕
│ Top Words in Comments   │  Occurrence in Comments │
╞═════════════════════════╪═════════════════════════╡
│ russia                  │                 2402776 │
│ people                  │                 2123906 │
│ russian                 │                 1659100 │
│ ukraine                 │                 1618223 │
│ war                     │                 1496956 │
│ make                    │                 1477347 │
│ country                 │                 1377832 │
│ putin                   │                  989950 │
│ time                    │                  978403 │
│ thing                   │                  889382 │
│ world                   │                  842894 │
│ year                    │                  804429 │
│ it                      │                  707501 │
│ nato                    │                  669443 │
│ good                    │                  668103 │
╘═════════════════════════╧═════════════════════════╛

It is evident from tables 3.1 and 3.2 that the Russia-Ukraine Conflict dominates both the submissions and comments datasets. This result agrees with our exploratory data analysis (EDA). A Word Cloud constructed from the Submissions dataset showed similar trends, as displayed in figure 3.3 below.

Figure 3.3 : Word Cloud of Submissions Dataset
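The word frequencies behind tables 3.1 and 3.2 amount to counting tokens across all cleaned texts and keeping the largest counts. A minimal sketch, assuming each text has already been reduced to a list of tokens; the function name top_words and the toy token lists are hypothetical.

```python
from collections import Counter


def top_words(token_lists, k=15):
    """Return the k most frequent tokens across all token lists."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(k)


# Hypothetical toy corpus of three cleaned titles.
toy_titles = [
    ["ukraine", "war", "russia"],
    ["ukraine", "putin"],
    ["ukraine", "war"],
]
most_common = top_words(toy_titles, k=2)
```

The same counts can feed a word cloud, where the font size of each word scales with its frequency.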

Digging in further, we also evaluated the most important words using the Term Frequency - Inverse Document Frequency (TF-IDF) methodology and obtained the following results.

Table 3.3 : Most Important Words in the Datasets

╒══════════════════════════════════════╤═══════════════════════════════════╕
│ Top Important Words in Submissions   │ Top Important Words in Comments   │
╞══════════════════════════════════════╪═══════════════════════════════════╡
│ work                                 │ god                               │
│ web                                  │ decade                            │
│ fire                                 │ whale                             │
│ stream                               │ ween                              │
│ health                               │ 10110000                          │
│ cool                                 │ queen                             │
│ jika                                 │ save                              │
│ bot                                  │ boris                             │
│ pump                                 │ diplomat                          │
│ glaub                                │ johnson                           │
│ vitamin                              │ understand                        │
│ film                                 │ tug                               │
╘══════════════════════════════════════╧═══════════════════════════════════╛

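The TF-IDF results do not reflect the dominance of war-related submissions and comments seen above. This is plausible, since TF-IDF down-weights terms that appear in a large share of documents, so pervasive war-related words receive low scores. We explore this further through the following analysis of dominant topics in the data.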
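To illustrate why a dominant word can vanish from a TF-IDF ranking, the sketch below computes an unsmoothed TF-IDF score, tf * log(N / df), on a toy corpus. The unsmoothed formula and the toy documents are assumptions made for illustration; library implementations often use smoothed variants, and this is not the exact computation behind table 3.3.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return scores


# Hypothetical toy corpus: "ukraine" appears in every document.
toy_corpus = [
    ["ukraine", "war", "nuclear"],
    ["ukraine", "war", "oil"],
    ["ukraine", "vitamin", "health"],
]
tfidf_scores = tf_idf(toy_corpus)
```

Because "ukraine" occurs in every toy document, its inverse document frequency is log(1) = 0 and its score is zero, while a rare word such as "vitamin" scores highest in its document.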

Key Topics in Submissions

We applied topic modeling through Latent Dirichlet Allocation (LDA) to the cleaned titles of our submissions dataset and obtained 7 topic groups, most of which dealt with different facets of the Russia-Ukraine Conflict. The topics were:

Topic 1: deals with the Russia-Ukraine Conflict and nuclear weapons, power plants
Topic 2: deals with the Russia-Ukraine Conflict and the media/news coverage on the political leaders
Topic 3: deals with the Russia-Ukraine Conflict and oil imports, the EU, laws, and regulation
Topic 4: deals with the Russia-Ukraine Conflict as well as Twitter and Elon Musk
Topic 5: deals with news related to COVID, China, and Korea
Topic 6: deals with the Russia-Ukraine Conflict and neighbouring countries
Topic 7: deals with the Russia-Ukraine Conflict and the Qatar World Cup

As stated, several of these topics related directly to the Russia-Ukraine Conflict. This high concentration of subject matter seemed to affect topics 4 and 7, which appeared to combine disparate subjects. The LDA topic visualization is displayed below.

Figure 3.4 : LDA Topic Visualization for Submissions Dataset
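For readers unfamiliar with LDA, the sketch below fits a small model with collapsed Gibbs sampling in plain Python. It is a swapped-in, simplified technique for illustration only and not the implementation we ran; the function names, hyperparameters (alpha, beta, iterations), and toy documents are all assumptions.

```python
import random


def lda_gibbs(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; returns (vocab, doc_topic, topic_word)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]              # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(num_topics)]   # tokens per (topic, word)
    topic_total = [0] * num_topics                            # tokens per topic
    assignments = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            z = rng.randrange(num_topics)
            doc_assign.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(doc_assign)
    topics = list(range(num_topics))
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Unnormalized probability of each topic given all other tokens.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in topics
                ]
                z = rng.choices(topics, weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    return vocab, doc_topic, topic_word


def top_terms(vocab, topic_word, k=3):
    """List the k highest-count words for each topic."""
    ranked = []
    for row in topic_word:
        order = sorted(range(len(vocab)), key=lambda i: -row[i])
        ranked.append([vocab[i] for i in order[:k]])
    return ranked


# Hypothetical toy titles, already tokenized and cleaned.
toy_docs = [
    ["ukraine", "nuclear", "plant", "russia"],
    ["russia", "oil", "eu", "sanction"],
    ["world", "cup", "qatar", "football"],
    ["qatar", "cup", "football", "fan"],
]
vocab, doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2)
terms_per_topic = top_terms(vocab, topic_word, k=3)
```

Each topic is then described by its highest-count words, which is how topic labels such as those listed above are typically read off.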

Going further, we used Google Trends data to determine which events in the conflict most affected internet traffic. With these, we sought to examine how such key events affected the subreddit's submissions, i.e. news coverage. The first event selected was the start of the war: we created a subset of the data covering the first two weeks of the Russia-Ukraine Conflict and modeled it with a topic count of only 5. The results showed the following topics:

Topics 1, 2, 3, 5 all relating to the Russia-Ukraine Conflict, with repeated terms (political leaders of both countries, cities in both countries, and weapons - missiles, nuclear, military, troops)
Topic 4 provided an interesting insight - it includes news related to Indian students in Ukraine, as well as Starlink and Elon Musk.

To provide context, topic 4 highlighted the early involvement of Musk in providing his satellite services to the country and the Nazi Azov battalion accosting foreign students attempting to flee.

Figure 3.5 : LDA Topic Visualization for Submissions Dataset During the Start of the War
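Subsetting the submissions to the two weeks after an event can be sketched as a simple date filter. The start date of 24 February 2022, the half-open window, the record layout, and the name event_window are assumptions for illustration; the exact cut-offs we used are not restated here.

```python
from datetime import date, timedelta


def event_window(records, start, days=14):
    """Keep records dated in the half-open interval [start, start + days)."""
    end = start + timedelta(days=days)
    return [r for r in records if start <= r["date"] < end]


# Assumed start of the war, used only to anchor the example window.
WAR_START = date(2022, 2, 24)

# Hypothetical submissions with a title and a posting date.
sample_records = [
    {"title": "before", "date": date(2022, 2, 20)},
    {"title": "first day", "date": date(2022, 2, 24)},
    {"title": "last day", "date": date(2022, 3, 9)},
    {"title": "after", "date": date(2022, 3, 10)},
]
first_two_weeks = event_window(sample_records, WAR_START)
```

The filtered records would then go through the same cleaning and topic modeling steps as the full dataset, with a smaller topic count.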

The second event we took from the Google Trends coverage was when Russia began withdrawing. After applying LDA once more, the topic modeling results indicated:

Topics 1, 2, 3, 4 all relating to the Russia-Ukraine Conflict, with repeated terms (political leaders of both countries, cities in both countries, and weapons - missiles, nuclear, military, troops)
Topic 5: news related to North Korea firing a ballistic missile over Japan.

As stated, North Korea fired a ballistic missile over Japan in the same time period and had been rapidly increasing its rate of testing. This suggests that smaller subsets of the data might be more useful for capturing specific stories without blending the conflict with other topics.

Figure 3.6 : LDA Topic Visualization for Submissions Dataset When Russian Troops Started Receding

Identifying Key Entities in the Datasets

The top 12 entities for person, location, and organization are displayed in table 3.4. We used pretrained BERT Named Entity Recognition (NER) models to accomplish this. Initially, when feeding the pretrained model our pre-processed data, we found that Russia and Ukraine did not feature among the top entities. A likely reason is that lowercasing and lemmatization remove cues, such as capitalization, that pretrained NER models rely on. Upon re-running the model pipeline with unprocessed data, the results were more familiar.

Given the density of topics demonstrated by the previous section’s analysis, the focus of the entities on the conflict should not come as a surprise. Using NER, we saw that just as the topics were heavily saturated with the Russia-Ukraine Conflict, the results for most frequently mentioned entities demonstrated this as well.

In table 3.4, we can clearly see the prevalence of Putin, Biden, and Zelensky, but these persons are not the only noteworthy results. We can also see British political figures, Boris Johnson and Liz Truss, taking a prominent role. According to the domain table from our EDA, British news outlets were among the most popular sources for posts, which could be the reason the NER model highlighted these figures.

When examining the location results, we saw Ukraine and Russia occur with much greater frequency than the other states, followed by the United States, which appeared twice (as US and U.S), and China to a lesser extent.

Lastly, our organization-based model returned the most mixed results. The top three results were organizations that could easily indicate stories related to the Russia-Ukraine Conflict, but after these the results become less than ideal: Ukrainian cities, Russia, Covid, and Elon Musk were included. This could be seen as a shortcoming of the models, but it also testifies to the prevalence of the conflict in the subreddit's submissions. It should also be noted that different n-grams and misspellings resulted in less refined results. Table 3.4 displays the entities in decreasing order of occurrence.

Table 3.4 : Top Entities in Submissions

╒════════════════╤═════════════════╤═════════════════════╕
│ Top People     │ Top Locations   │ Top Organizations   │
╞════════════════╪═════════════════╪═════════════════════╡
│ Putin          │ Ukraine         │ EU                  │
│ Biden          │ Russia          │ NATO                │
│ Zelensky       │ US              │ UN                  │
│ Vladimir Putin │ China           │ Kyiv                │
│ Boris Johnson  │ U.S             │ CNN                 │
│ Zelenskyy      │ India           │ Elon Musk           │
│ Russia's       │ UK              │ COVID               │
│ Trump          │ Iran            │ Mariupol            │
│ Liz Truss      │ Germany         │ News                │
│ Bucha          │ Israel          │ Russia's            │
│ Blinken        │ Taiwan          │ Reuters             │
│ Shinzo Abe     │ Japan           │ Twitter             │
╘════════════════╧═════════════════╧═════════════════════╛
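Table 3.4 lists several variants of the same entity, such as Putin and Vladimir Putin, Zelensky and Zelenskyy, or US and U.S. One possible refinement, which we did not apply, is to merge such variants through an alias map before ranking. The sketch below assumes a hand-written alias map and made-up counts, since the table reports only rank order.

```python
from collections import Counter

# Hypothetical alias map from surface form to canonical name.
ALIASES = {
    "Vladimir Putin": "Putin",
    "Zelenskyy": "Zelensky",
    "U.S": "US",
}


def merge_entity_counts(counts, aliases=ALIASES):
    """Sum the counts of entity variants under their canonical name."""
    merged = Counter()
    for name, n in counts.items():
        merged[aliases.get(name, name)] += n
    return merged


# Made-up counts for illustration only.
raw_counts = {"Putin": 10, "Vladimir Putin": 4, "Zelensky": 6, "Zelenskyy": 3, "Biden": 7}
merged_counts = merge_entity_counts(raw_counts)
```

Merging variants this way would free up slots in a top-12 list for entities that are otherwise crowded out by duplicates.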

Assessing Sentiments on War Related Comments and Submissions

Based on our previous analysis in the EDA, we determined that the live threads in our subreddit were predisposed towards the Russia-Ukraine Conflict. The dominance of the conflict in the submissions was further supported by the topic modeling results. We expected that such a topic would lend itself to strong sentiments, and the pretrained sentiment models confirmed that expectation, as shown in table 3.5 below.

Table 3.5 : Sentiment Analysis Results From Live Thread Comments

╒════════════════════════════╤════════════╤═══════════╤════════════╕
│ Model                      │ negative   │ neutral   │ positive   │
╞════════════════════════════╪════════════╪═══════════╪════════════╡
│ IMDB Sentiment Analyzer    │ 52.54%     │ 4.32%     │ 43.30%     │
│ Twitter Sentiment Analyzer │ 52.54%     │ 4.32%     │ 43.30%     │
│ Vivek Sentiment Analyzer   │ 44.78%     │ 18.91%    │ 36.30%     │
╘════════════════════════════╧════════════╧═══════════╧════════════╛

Although the Vivek analyzer categorized more comments as neutral, overall users appeared to have stronger sentiments, generally of a negative sort. The IMDB and Twitter models showed this negative pattern to the same extent, returning identical percentages.

Further sentiment analysis was done on submissions related to the Russia-Ukraine Conflict and their related comments, excluding the live threads analyzed above. The models show similar trends, but it should be noted that for the Twitter and IMDB models the submission titles, those of the articles being shared, were significantly more negative than the comments. For these two models, the title results indicated 25.22 percentage points more negative sentiment than the related comment results (77.90% versus 52.68%), whereas the Vivek model rated titles and comments as about equally negative. All of these results can be seen in table 3.6 below.

Table 3.6 : Sentiment Analysis of War-Related Submissions and Comments

╒════════════════════════════╤═════════════╤════════════╤═══════════╤════════════╕
│ Model                      │ Type        │ negative   │ neutral   │ positive   │
╞════════════════════════════╪═════════════╪════════════╪═══════════╪════════════╡
│ IMDB Sentiment Analyzer    │ Comments    │ 52.68%     │ 4.26%     │ 43.16%     │
│ IMDB Sentiment Analyzer    │ Submissions │ 77.90%     │ 1.25%     │ 20.85%     │
│ Twitter Sentiment Analyzer │ Comments    │ 52.68%     │ 4.26%     │ 43.16%     │
│ Twitter Sentiment Analyzer │ Submissions │ 77.90%     │ 1.25%     │ 20.85%     │
│ Vivek Sentiment Analyzer   │ Comments    │ 46.72%     │ 17.10%    │ 36.18%     │
│ Vivek Sentiment Analyzer   │ Submissions │ 45.19%     │ 30.27%    │ 24.54%     │
╘════════════════════════════╧═════════════╧════════════╧═══════════╧════════════╛
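The gap in negative sentiment between submissions and comments can be checked directly from the percentages in table 3.6. The sketch below only restates the table's negative column; the dictionary layout and the function name negative_gap are our own illustrative choices.

```python
# Negative-sentiment percentages copied from table 3.6.
NEGATIVE_PCT = {
    ("IMDB", "Comments"): 52.68,
    ("IMDB", "Submissions"): 77.90,
    ("Twitter", "Comments"): 52.68,
    ("Twitter", "Submissions"): 77.90,
    ("Vivek", "Comments"): 46.72,
    ("Vivek", "Submissions"): 45.19,
}


def negative_gap(model):
    """Submissions minus comments, in percentage points, rounded to 2 places."""
    gap = NEGATIVE_PCT[(model, "Submissions")] - NEGATIVE_PCT[(model, "Comments")]
    return round(gap, 2)


gaps = {model: negative_gap(model) for model in ("IMDB", "Twitter", "Vivek")}
```

The IMDB and Twitter gaps are large and positive, while the Vivek gap is slightly negative, which is the contrast between the models that the table shows.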

The stark difference in results between the Vivek model and the Twitter and IMDB models is more apparent here. We had assumed that titles would demonstrate more neutral sentiment, as the Vivek model showed; but if the results of the IMDB and Twitter models are the more reliable ones, marketing tactics may be the culprit for the perceived negativity. Given how article titles have increasingly become the main point of user interaction on social media, drawing attention with inflammatory language can be a tactic for increasing user interest. This perceived negativity might also be explained by the dominance of Western media outlets in this subreddit, which was seen in the EDA.

We attempted a lexicon-based approach, but due to logistical issues we instead went forward with the pretrained models sentimentdl_use_twitter, sentimentdl_use_imdb, and VivekSentimentAnalyzer. In the future, we believe that training our own sentiment analyzer could improve the results.
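For context, a lexicon-based approach scores a text by summing per-word polarity values from a fixed dictionary. The sketch below shows the general idea only; the lexicon entries, the tie-breaking rule, and the function name are hypothetical and do not reproduce the approach we attempted.

```python
import re

# Hypothetical polarity lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"good": 1, "peace": 1, "support": 1, "kill": -1, "attack": -1, "war": -1}


def lexicon_sentiment(text, lexicon=LEXICON):
    """Label a text by the sign of its summed word polarities."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(lexicon.get(t, 0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


labels = [
    lexicon_sentiment("Missile attack hits the city"),
    lexicon_sentiment("Talks bring hope for peace"),
    lexicon_sentiment("The meeting is on Tuesday"),
]
```

Such a scorer is transparent but depends entirely on the coverage of its lexicon, which is one reason a trained analyzer could perform better.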