Natural Language Processing Analysis
Executive Summary
In this round of analysis we worked with three Natural Language Processing (NLP) techniques, namely:
- Topic Modeling
- Named Entity Recognition (NER)
- Sentiment Analysis
Our results from both Topic Modeling and NER indicate widespread consumption of news on the ongoing Russia-Ukraine Conflict. In both cases, the topics our model identified and the entities referenced most often were either directly involved in or adjacent to the conflict. When we subset the data around specific events of the conflict, the topics our models produced overlapped less. Lastly, our Sentiment Analysis on submissions and comments related to the conflict suggested that submissions tended to be more negative, while comments were split roughly equally between negative and positive sentiment. The models classified few submissions or comments as neutral.
Text Pre-Processing
We started our analysis by assessing the number of words in the title and body columns of our datasets. The distributions are shown in the two graphs below.
Both graphs, in figures 1 and 2, are right-skewed, reflecting that most titles and comments are under 50 words.
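A word-count check of this kind can be sketched in a few lines. The snippet below is only an illustration: the sample titles are hypothetical, and splitting on whitespace is an assumed tokenization rather than the one used for the figures.

```python
def word_count(text):
    """Count whitespace-separated tokens in a string."""
    return len(text.split())

def share_below(texts, limit=50):
    """Fraction of texts with fewer than `limit` words."""
    counts = [word_count(t) for t in texts]
    return sum(c < limit for c in counts) / len(counts)

# Hypothetical sample titles, not taken from the dataset.
sample_titles = [
    "Ukraine says Russian forces shelled the city overnight",
    "EU agrees new sanctions package",
    "word " * 60,  # an artificially long entry to represent the right tail
]
counts = [word_count(t) for t in sample_titles]
share = share_below(sample_titles)
```

Plotting a histogram of `counts` for the full dataset would give a distribution comparable to the figures above.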
Next, we moved on to preprocessing. To carry out the NLP tasks, the corpora of comments and submissions went through several processing steps, listed below:
- Remove stop words
- Remove other languages
- Remove special characters
- Convert text to lower case
- Lemmatize words
The preprocessing steps listed above were applied to the text portions of our submissions and comments datasets, on the columns title and body respectively. We used the johnsnowlabs package and implemented the preprocessing as a pipeline. The text was systematically converted to document types and then tokenized. This transformed the data into a bag-of-words form, which was used to accomplish the business goals below.
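The steps above can be illustrated with a small, library-free sketch. This is not the johnsnowlabs pipeline used in the analysis: the stop-word list and lemma map are tiny hypothetical stand-ins, and dropping every character outside a-z is only a crude approximation of removing special characters and other languages.

```python
import re

# Tiny illustrative lists (assumptions, not the resources used in the analysis).
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "is", "are", "to", "on"}
LEMMAS = {"killed": "kill", "attacks": "attack", "forces": "force", "reports": "report"}

def preprocess(text):
    """Lowercase, strip non-letters, drop stop words, and map tokens to lemmas."""
    text = text.lower()
    # Replace digits, punctuation, and non a-z characters with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Dictionary lookup stands in for a real lemmatizer.
    return [LEMMAS.get(t, t) for t in tokens]

tokens = preprocess("Russian forces killed 3 in attacks on the city, reports say!")
```

The resulting token list is the bag-of-words input that the later counting and topic-modeling steps would consume.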
After preprocessing, we assessed the 15 most commonly occurring words in both the submissions and comments datasets, as displayed in tables 3.1 and 3.2.
Table 3.1 : Most Common Words in the Submissions Dataset
╒════════════════════════════╤════════════════════════════╕
│ Top Words in Submissions │ Occurrences in Submissions │
╞════════════════════════════╪════════════════════════════╡
│ ukraine │ 26704 │
│ russian │ 19298 │
│ russia │ 17456 │
│ war │ 9411 │
│ ukrainian │ 6740 │
│ putin │ 6683 │
│ china │ 5264 │
│ world │ 4722 │
│ kill │ 4711 │
│ news │ 4638 │
│ attack │ 4036 │
│ we │ 4024 │
│ military │ 3670 │
│ force │ 3626 │
│ report │ 3568 │
╘════════════════════════════╧════════════════════════════╛
Table 3.2 : Most Common Words in the Comments Dataset
╒═════════════════════════╤═════════════════════════╕
│ Top Words in Comments │ Occurrences in Comments │
╞═════════════════════════╪═════════════════════════╡
│ russia │ 2402776 │
│ people │ 2123906 │
│ russian │ 1659100 │
│ ukraine │ 1618223 │
│ war │ 1496956 │
│ make │ 1477347 │
│ country │ 1377832 │
│ putin │ 989950 │
│ time │ 978403 │
│ thing │ 889382 │
│ world │ 842894 │
│ year │ 804429 │
│ it │ 707501 │
│ nato │ 669443 │
│ good │ 668103 │
╘═════════════════════════╧═════════════════════════╛
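Counts like those in tables 3.1 and 3.2 can be obtained by tallying tokens across the cleaned documents. The following is a minimal sketch using hypothetical tokenized titles; the function name `top_words` is an assumption for illustration.

```python
from collections import Counter

def top_words(token_lists, n=15):
    """Return the n most frequent tokens across all documents."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)

# Hypothetical tokenized titles, not taken from the dataset.
docs = [
    ["ukraine", "war", "russia"],
    ["russia", "ukraine", "putin"],
    ["ukraine", "china"],
]
top = top_words(docs, n=2)
```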
It is evident from tables 3.1 and 3.2 that the Russia-Ukraine Conflict dominates both the submissions and comments datasets. This result agrees with our exploratory data analysis (EDA). A Word Cloud constructed from the Submissions dataset showed similar trends, as displayed in figure 3.3 below.
Figure 3.3 : Word Cloud of Submissions Dataset
Digging in further, we also evaluated the most important words using the Term Frequency - Inverse Document Frequency (TF-IDF) methodology and obtained the following results.
Table 3.3 : Most Important Words in the Datasets
╒══════════════════════════════════════╤═══════════════════════════════════╕
│ Top Important Words in Submissions │ Top Important Words in Comments │
╞══════════════════════════════════════╪═══════════════════════════════════╡
│ work │ god │
│ web │ decade │
│ fire │ whale │
│ stream │ ween │
│ health │ 10110000 │
│ cool │ queen │
│ jika │ save │
│ bot │ boris │
│ pump │ diplomat │
│ glaub │ johnson │
│ vitamin │ understand │
│ film │ tug │
╘══════════════════════════════════════╧═══════════════════════════════════╛
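The TF-IDF results do not reflect the dominance of war-related submissions and comments seen above. This is expected to some degree, since TF-IDF down-weights terms that appear in many documents, so ubiquitous words such as "ukraine" and "russia" score low. We explore this further in the following analysis of dominant topics in the data.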
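To make the weighting concrete, here is a minimal sketch of TF-IDF in its unsmoothed textbook form, tf * log(N / df). Libraries often use smoothed variants, so this is not necessarily the formula behind table 3.3, and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return scores

# Hypothetical tokenized documents.
docs = [
    ["ukraine", "war", "vitamin"],
    ["ukraine", "russia", "war"],
    ["ukraine", "film", "russia"],
]
scores = tf_idf(docs)
```

In this toy corpus "ukraine" appears in every document, so its score is zero, while the rare word "vitamin" scores highest in the first document. The same effect would push conflict vocabulary out of a TF-IDF ranking.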
Key Topics in Submissions
We employed topic modeling through Latent Dirichlet Allocation (LDA) on the cleaned titles of our submissions dataset and obtained 7 topic groups, most of which dealt with different facets of the Russia-Ukraine Conflict. The topics were:
- Topic 1: deals with the Russia-Ukraine Conflict and nuclear weapons, power plants
- Topic 2: deals with the Russia-Ukraine Conflict and the media/news coverage on the political leaders
- Topic 3: deals with the Russia-Ukraine Conflict and oil imports, the EU, laws, and regulation
- Topic 4: deals with the Russia-Ukraine Conflict as well as Twitter and Elon Musk
- Topic 5: news related to covid, China, and Korea
- Topic 6: deals with the Russia-Ukraine Conflict and neighbouring countries
- Topic 7: the Russia-Ukraine Conflict and the Qatar World Cup
As stated, several of these topics related directly to the Russia-Ukraine Conflict. This concentration of subject matter seemed to affect topics 4 and 7, which appeared to combine disparate subjects. The LDA topic visualization is displayed below.
Figure 3.4 : LDA Topic Visualization for Submissions Dataset
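For readers unfamiliar with LDA, the sketch below fits a tiny model by collapsed Gibbs sampling using only the standard library. It is an illustration of the technique, not the implementation used in this analysis; the function names, hyperparameters (`alpha`, `beta`), iteration count, and toy documents are all assumptions.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    Returns (doc_topic counts, topic_word counts, vocabulary list).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tv + beta) / (n_t + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab

def top_terms(topic_word, vocab, n=3):
    """List the n highest-count words for each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]

# Hypothetical cleaned titles.
docs = [
    ["ukraine", "russia", "war", "missile"],
    ["russia", "ukraine", "troops", "war"],
    ["covid", "china", "lockdown", "covid"],
    ["china", "covid", "vaccine", "lockdown"],
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)
terms = top_terms(topic_word, vocab)
```

Inspecting `terms` gives the top words per topic, which is the kind of output that topic labels like those listed above are read from. When one subject dominates the corpus, several topics can end up sharing its vocabulary, which is consistent with the mixed topics noted above.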
- Topics 1, 2, 3, and 5 all relate to the Russia-Ukraine Conflict, with repeated terms (political leaders of both countries, cities in both countries, and weapons - missiles, nuclear, military, troops)
- Topic 4 provided an interesting insight: it includes news related to Indian students in Ukraine, as well as Starlink and Elon Musk.
To provide context, topic 4 highlighted the early involvement of Musk in providing his satellite services to the country and the Nazi Azov battalion accosting foreign students attempting to flee.
Figure 3.5 : LDA Topic Visualization for Submissions Dataset During the Start of the War
The second event we used from the trends coverage was when Russia began withdrawing. After applying LDA once more, the topic modeling results indicated:
- Topics 1, 2, 3, and 4 all relate to the Russia-Ukraine Conflict, with repeated terms (political leaders of both countries, cities in both countries, and weapons - missiles, nuclear, military, troops)
- Topic 5: news related to North Korea firing a ballistic missile over Japan.
For context, North Korea fired a ballistic missile over Japan in the same time period and had been rapidly increasing its rate of testing. This suggests that smaller subsets of the data may be more useful for capturing specific stories without the conflict overlapping other topics.
Figure 3.6 : LDA Topic Visualization for Submissions Dataset When Russian Troops Started Receding
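Subsetting around an event amounts to filtering submissions by a date window before running LDA. The sketch below shows one way to do this; the field name `created`, the sample rows, and the window boundaries are hypothetical, since the exact windows used are not given here.

```python
from datetime import date

def in_window(rows, start, end):
    """Keep rows whose `created` date falls within [start, end]."""
    return [r for r in rows if start <= r["created"] <= end]

# Hypothetical rows; titles are illustrative only.
rows = [
    {"title": "Russia launches invasion of Ukraine", "created": date(2022, 2, 24)},
    {"title": "Qatar World Cup kicks off", "created": date(2022, 11, 20)},
    {"title": "Explosions reported in Kyiv", "created": date(2022, 2, 25)},
]
# Assumed window around the start of the war.
start_of_war = in_window(rows, date(2022, 2, 20), date(2022, 3, 10))
titles = [r["title"] for r in start_of_war]
```

Only the titles inside the window would then be passed to the topic model, which keeps unrelated stories from other periods out of the resulting topics.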
Identifying Key Entities in the Datasets
The top 12 entities for person, location, and organization are displayed in table 3.4. We used pretrained BERT Named Entity Recognition (NER) models to accomplish this. Initially, when feeding the pretrained model our pre-processed data, we found that Russia and Ukraine did not feature among the top entities. A likely reason is that lowercasing and lemmatization remove the capitalization and context cues that pretrained NER models rely on. Upon re-running the model pipeline with unprocessed data, the results were more familiar.
Given the density of topics demonstrated by the previous section’s analysis, the focus of the entities on the conflict should not come as a surprise. Using NER, we saw that just as the topics were heavily saturated with the Russia-Ukraine Conflict, the results for most frequently mentioned entities demonstrated this as well.
In table 3.4, we can clearly see the prevalence of Putin, Biden, and Zelensky, but these persons are not the only noteworthy results. We can also see British political figures, Boris Johnson and Liz Truss, taking a prominent role in the results. According to the domain table from our EDA, British news outlets were among the most popular sources for posts. This could be the reason for the British political figures being highlighted by the NER model.
When examining the location results, we saw Ukraine and Russia occur with much greater frequency than the other states, followed by the United States, which appeared twice (as "US" and "U.S"), and China to a lesser extent.
Lastly, our organization-based model appeared to return the most mixed results. The top three results were organizations that could easily indicate stories related to the Russia-Ukraine Conflict, but after this the results become less than ideal: Ukrainian cities, Russia, Covid, and Elon Musk were included. This could be seen as a shortcoming of the models, but it also testifies to the prevalence of the conflict in the subreddits' submissions. It should also be noted that different n-grams and misspellings resulted in less refined results. Table 3.4 displays the entities in decreasing order of occurrence.
Table 3.4 : Top Entities in Submissions
╒════════════════╤═════════════════╤═════════════════════╕
│ Top People │ Top Locations │ Top Organizations │
╞════════════════╪═════════════════╪═════════════════════╡
│ Putin │ Ukraine │ EU │
│ Biden │ Russia │ NATO │
│ Zelensky │ US │ UN │
│ Vladimir Putin │ China │ Kyiv │
│ Boris Johnson │ U.S │ CNN │
│ Zelenskyy │ India │ Elon Musk │
│ Russia's │ UK │ COVID │
│ Trump │ Iran │ Mariupol │
│ Liz Truss │ Germany │ News │
│ Bucha │ Israel │ Russia's │
│ Blinken │ Taiwan │ Reuters │
│ Shinzo Abe │ Japan │ Twitter │
╘════════════════╧═════════════════╧═════════════════════╛
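Several rows in table 3.4 are variants of the same entity (for example "Putin" and "Vladimir Putin", or "US" and "U.S"). One possible refinement, sketched below, is to map known variants to a canonical name before counting. The alias map and the sample mentions are hypothetical, and this step was not part of the analysis reported above.

```python
from collections import Counter

# Hypothetical alias map; the variants appear in table 3.4,
# the choice of canonical form is an assumption.
ALIASES = {
    "Vladimir Putin": "Putin",
    "Zelenskyy": "Zelensky",
    "U.S": "US",
}

def count_entities(mentions):
    """Count entity mentions after mapping known variants to one canonical name."""
    return Counter(ALIASES.get(m, m) for m in mentions)

# Hypothetical NER output for a handful of titles.
mentions = ["Putin", "Vladimir Putin", "US", "U.S", "Zelenskyy", "Zelensky", "Putin"]
counts = count_entities(mentions)
```

Merging variants this way would consolidate split counts, though the alias map has to be curated by hand and would not fix misclassified entries such as cities listed as organizations.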