Exploratory Data Analysis

Preprocessing and Data Dictionary

We assessed the basic specifications of the dataset, removed duplicates and anomalies, and dropped unneeded columns, leaving 170,144 submissions and 18,548,934 comments in the respective datasets. The variables of interest after preprocessing are listed below:

In the Submissions dataset :

author : The user who created the post.
created_utc : The time the submission was posted. It was used in the time series analysis section of our project.
domain : The news site posted in the submission.
id : The unique identifier of each post.
num_comments : The number of comments under each submission. This count depends on the day the data was retrieved, so it may not match the current total.
url : The URL associated with the post. For most submissions, this links to the news article.
score : The Karma score awarded to each post.

In the Comments dataset :

author : The user who posted the comment.
created_utc : The time the comment was posted. It was used in the time series analysis section of our project.
body : The text in the comment.
id : The unique identifier of each comment.
link_id : The id of the submission under which the comment exists.
controversiality : Whether the comment was classified as ‘controversial’.
gilded : Whether the comment has been gilded (awarded).
distinguished : Whether the comment was distinguished as a moderator comment.
score : The Karma score awarded to each comment.

During this process, several dummy variables were created to aid in our analysis. A foreign-key-like variable called submission_id was also created in the comments dataset, linking each comment to the submission it was made under.
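The link between comments and submissions can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes that submission_id was derived from link_id by removing Reddit's "t3_" type prefix, and the helper name and sample records are hypothetical.

```python
def submission_id_from_link_id(link_id: str) -> str:
    """Strip Reddit's 't3_' type prefix so the value matches a submission's id."""
    prefix = "t3_"
    return link_id[len(prefix):] if link_id.startswith(prefix) else link_id


# Hypothetical sample comments; only the fields needed here are shown.
comments = [
    {"id": "c1", "link_id": "t3_abc123"},
    {"id": "c2", "link_id": "t3_xyz789"},
]

# Add the foreign-key-like column to every comment.
for comment in comments:
    comment["submission_id"] = submission_id_from_link_id(comment["link_id"])
```

With this column in place, a comment can be joined to its parent by matching submission_id against the id column of the submissions dataset.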

Analyzing User Activity

In this section, we dig deeper into how users act on this subreddit. The subreddit currently has 31.5 million subscribers. However, we found that only around 27,000 users created submissions and around 1.2 million users commented in the past year. Of these, only a small proportion posted actively, as evidenced by figure 2.1 below.

Figure 2.1 shows the percentage of all distinct users in each dataset who posted in a given month. A higher proportion of users commented on posts than created posts themselves. In both cases, however, fewer than half of the distinct users were active in any month.
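The monthly activity measure described above can be sketched in a few lines. This is an illustrative reconstruction under two assumptions: created_utc holds Unix epoch seconds, and the denominator is the number of distinct authors over the whole period. The function name and sample records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def monthly_active_share(records):
    """Return {YYYY-MM: percent of all distinct authors who posted that month}."""
    all_authors = set()
    authors_by_month = defaultdict(set)
    for author, created_utc in records:
        month = datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m")
        all_authors.add(author)
        authors_by_month[month].add(author)
    total = len(all_authors)
    return {
        month: 100.0 * len(authors) / total
        for month, authors in sorted(authors_by_month.items())
    }


# Hypothetical (author, created_utc) pairs.
records = [
    ("alice", 1643673600),  # 2022-02-01 UTC
    ("bob", 1643673600),    # 2022-02-01 UTC
    ("alice", 1646092800),  # 2022-03-01 UTC
    ("carol", 1648771200),  # 2022-04-01 UTC
]
shares = monthly_active_share(records)
```

Running the same function on the submissions and the comments separately gives the two series being compared.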

We also identified the top 10 posters and commenters, along with the users who wrote the most controversial comments. The controversial comment counts of these users are compared against their total gilded and distinguished comments in table 2.1 below.

Table 2.1 : Comparison of Users With the Most Controversial Comments

╒════════════════════╤══════════════════════════╤═══════════════════╤══════════════════════════╕
│ User               │   Controversial Comments │   Gilded Comments │   Distinguished Comments │
╞════════════════════╪══════════════════════════╪═══════════════════╪══════════════════════════╡
│ catsinbananahats   │                      382 │                 2 │                        0 │
│ Torifyme12         │                      355 │                 3 │                        0 │
│ green_flash        │                      335 │                 7 │                       13 │
│ ylteicz123         │                      317 │                 2 │                        0 │
│ Test19s            │                      307 │                 5 │                        0 │
│ IsraeliDonut       │                      306 │                 4 │                        0 │
│ KeyWestTime        │                      299 │                 4 │                        0 │
│ Freschledditor     │                      290 │                 1 │                        0 │
│ HamburgerEarmuff   │                      281 │                 1 │                        0 │
│ Goshdang56         │                      275 │                 0 │                        0 │
│ _Plork_            │                      271 │                 1 │                        0 │
│ stretching_holes   │                      270 │                 3 │                        0 │
│ feeltheslipstream  │                      267 │                 2 │                        0 │
│ timelyparadox      │                      260 │                 3 │                        0 │
│ Silurio1           │                      252 │                 0 │                        0 │
│ InnocentTailor     │                      245 │                 0 │                        0 │
│ JPR_FI             │                      245 │                 0 │                        0 │
│ albertnormandy     │                      241 │                 0 │                        0 │
│ Foreign-Engine8678 │                      239 │                 2 │                        0 │
│ pieter1234569      │                      238 │                 0 │                        0 │
╘════════════════════╧══════════════════════════╧═══════════════════╧══════════════════════════╛

We examined whether a high number of controversial comments was associated with a higher number of gilded or distinguished comments, but that was not the case.

Time Series Analysis

The next step in understanding the trends within the datasets was to plot several time series graphs. Much like figure 1.2, a graph of daily submission and comment frequencies was plotted, as shown below.

There are some points of interest in the plot above. The first is the occurrence of temporal gaps in the data, which could be attributed to the data collection process of the Pushshift Reddit dataset. Second, there is a peak in submissions and comments in late February; marking the start date of the Russia-Ukraine conflict on the chart suggests that this high volume of submissions and comments stemmed from the ongoing war.

Exploring this further, we filtered the datasets to retain only observations regarding the conflict. After visualizing the monthly frequencies, we found that the patterns observed in figure 2.3 were almost identical to those in figure 1.2. This corroborates our earlier findings about increased activity in the subreddit during different phases of the Russia-Ukraine conflict.

Karma scores often determine user activity in any subreddit. To gauge the overall popularity of submissions, we visualized their score quantiles per month. A score is the difference between the number of upvotes and the number of downvotes that a submission receives. The median line in the visualization shows that the monthly median score lies between 10 and 30. Since the 25th percentile remains at 0 for all months, at least a quarter of submissions each month score 0 or lower, which suggests that there are submissions (although few) with an overall negative score. The 75th percentile line reveals that there are submissions with very high overall scores as well. Additionally, these high scores do not follow the same trend as the median, indicating that a handful of submissions may account for the very high scores in a particular month.
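The per-month quantiles discussed above can be computed as in the sketch below. It is an illustration only: the report does not say which quantile method was used, so the "inclusive" method of Python's statistics module is an assumption, and the function name and sample scores are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def monthly_score_quartiles(rows):
    """Return {month: {'p25', 'median', 'p75'}} from (month, score) pairs."""
    scores_by_month = defaultdict(list)
    for month, score in rows:
        scores_by_month[month].append(score)

    result = {}
    for month, scores in sorted(scores_by_month.items()):
        if len(scores) < 2:
            # Skip months with too few points to form meaningful quartiles.
            continue
        p25, median, p75 = quantiles(scores, n=4, method="inclusive")
        result[month] = {"p25": p25, "median": median, "p75": p75}
    return result


# Hypothetical scores for one month, with a single very high outlier.
rows = [("2022-02", s) for s in [0, 0, 10, 30, 500]]
quartiles = monthly_score_quartiles(rows)
```

In this toy month the outlier of 500 leaves all three quartiles unchanged, which is why a few very popular submissions can dominate monthly totals without moving the median line.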

Finally, to further capture trends in submissions, we visualized the monthly frequencies of the five authors with the highest number of submissions in this time period. From the graph, it can be inferred that the posting patterns of some authors resemble the overall submission frequency plot. However, two other authors submitted only during some months of the year. In fact, hieronymusanonymous submitted only during the second half of the year, indicating that there might be authors who did not post much about the Russia-Ukraine conflict.

Most Shared News Stories

From the preceding analysis, a sharp peak was observed in the submission and comment frequencies, which we propose is due to the ongoing Russia-Ukraine conflict. The following analysis looks at the most highly scored news stories, the news sites that occur most often in the submissions dataset, and the presence of live threads, to see whether these reveal anything about the surge of news articles shared on this subreddit.

Upon analysis, we found that the top 10 news stories generally concerned Russia’s war on Ukraine. This prompted us to look more closely at the data pertaining to the war. The top news sites were also evaluated, and an aggregation table was generated as shown below.

Table 2.2 : Top 20 News Sources in the Subreddit

╒═════════════════════╤═══════════════╤═════════════════╤══════════════════╤═══════════════════════════╕
│ News Site           │   No of Posts │   Average Score │   Total Comments │   No. of Posts in Top 100 │
╞═════════════════════╪═══════════════╪═════════════════╪══════════════════╪═══════════════════════════╡
│ reuters.com         │         10402 │         1005.53 │           941804 │                         2 │
│ theguardian.com     │          5302 │          794.83 │           367433 │                         3 │
│ bbc.com             │          3726 │          406.33 │           150686 │                         0 │
│ youtube.com         │          3268 │            0.99 │             1395 │                         0 │
│ youtu.be            │          3155 │            0.99 │             1478 │                         0 │
│ apnews.com          │          2776 │          826.12 │           196241 │                         2 │
│ aljazeera.com       │          2121 │          410.58 │            91511 │                         0 │
│ businessinsider.com │          2107 │         3815.61 │           562551 │                        12 │
│ cnn.com             │          1944 │          805.23 │           155646 │                         3 │
│ pravda.com.ua       │          1702 │         2857.63 │           317834 │                         7 │
│ newsweek.com        │          1658 │         2516.2  │           369312 │                         4 │
│ bbc.co.uk           │          1637 │         1032.32 │           156512 │                         4 │
│ timesofisrael.com   │          1519 │         1058.52 │           143717 │                         5 │
│ twitter.com         │          1428 │            1.06 │              396 │                         0 │
│ edition.cnn.com     │          1412 │         1022.95 │           127421 │                         2 │
│ france24.com        │          1375 │          784.59 │            83869 │                         0 │
│ msn.com             │          1357 │         1242.52 │           182079 │                         2 │
│ dw.com              │          1225 │          782.35 │            76086 │                         0 │
│ nytimes.com         │          1224 │          742.27 │            74875 │                         1 │
│ news.sky.com        │          1176 │         1711.02 │           193815 │                         3 │
│ bloomberg.com       │          1119 │         1021.3  │           105288 │                         1 │
╘═════════════════════╧═══════════════╧═════════════════╧══════════════════╧═══════════════════════════╛

The table above shows that the most popular news sites on the subreddit over the past year were generally from western countries. This could potentially explain the high consumption of news related to the war within the subreddit, despite the presence of Russian media sources as well.
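An aggregation like the one in table 2.2 can be sketched from the domain, id, score and num_comments columns. This is a hedged illustration, not the project's code: it assumes that "Top 100" means the 100 highest-scoring submissions overall, and the function name and sample rows are made up.

```python
from collections import defaultdict


def summarize_domains(submissions, top_n=100):
    """Aggregate submissions per domain, ordered by number of posts."""
    # Ids of the top_n highest-scoring submissions overall (assumed definition).
    ranked = sorted(submissions, key=lambda s: s["score"], reverse=True)
    top_ids = {s["id"] for s in ranked[:top_n]}

    totals = defaultdict(lambda: {"posts": 0, "score": 0, "comments": 0, "in_top": 0})
    for s in submissions:
        t = totals[s["domain"]]
        t["posts"] += 1
        t["score"] += s["score"]
        t["comments"] += s["num_comments"]
        t["in_top"] += s["id"] in top_ids  # True counts as 1

    rows = [
        {
            "domain": domain,
            "posts": t["posts"],
            "avg_score": round(t["score"] / t["posts"], 2),
            "total_comments": t["comments"],
            "posts_in_top": t["in_top"],
        }
        for domain, t in totals.items()
    ]
    return sorted(rows, key=lambda r: r["posts"], reverse=True)


# Made-up sample rows, not taken from the dataset.
submissions = [
    {"id": "a", "domain": "reuters.com", "score": 1200, "num_comments": 300},
    {"id": "b", "domain": "reuters.com", "score": 800, "num_comments": 100},
    {"id": "c", "domain": "youtube.com", "score": 1, "num_comments": 0},
]
table = summarize_domains(submissions, top_n=2)
```

Note that grouping on the raw domain keeps variants such as bbc.com and bbc.co.uk, or cnn.com and edition.cnn.com, as separate rows, as they are in the table.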

We identified the live thread submissions and determined that all of them pertained to the war. This provided an opportunity to compare the comments on live threads against those on regular submissions that also dealt with the conflict, as shown in the table below.

Table 2.3 : Comparison of Live Thread Comments and Regular Comments on War

╒════════════════════════════════════════════════════════════╤═══════════════╤══════════════════╤══════════════════════════╤═══════════════════╤══════════════════════════╕
│ Comments                                                   │   Total Posts │   Total Comments │ Controversial Comments   │ Gilded Comments   │ Distinguished Comments   │
╞════════════════════════════════════════════════════════════╪═══════════════╪══════════════════╪══════════════════════════╪═══════════════════╪══════════════════════════╡
│ Live Thread Comments                                       │           353 │          1824577 │ 2.84%                    │ 0.03%             │ 0.05%                    │
│ Russia-Ukraine Conflict Comments (excluding Live Threads)  │         45841 │          6358941 │ 4.46%                    │ 0.06%             │ 0.1%                     │
╘════════════════════════════════════════════════════════════╧═══════════════╧══════════════════╧══════════════════════════╧═══════════════════╧══════════════════════════╛

The table above captures the percentage of comments that were controversial, gilded, and distinguished for the live threads and for other submissions dealing with the war. A higher share of controversial comments was present in regular submissions than in live threads, possibly due to the larger number of regular posts.

Most Common Words

Progressing with our analysis, we also looked at the most commonly used words and phrases in the comments on the top 3 news stories. To evaluate this, we generated word clouds as shown below.

Figure 2.6 : Word Cloud Of Comments From Top 3 News Stories

These word clouds revealed that terms relating to the Russia-Ukraine conflict and the political leaders of these countries were the most repeated. We also found terms relating to Queen Elizabeth II’s demise and the British royal family to be quite frequent.

Comparison with Other Sources

As a final task, we sought to compare the information present in the subreddit’s submissions about the events pertaining to Russia and Ukraine with the events data from the Armed Conflict Location & Event Data Project (ACLED). ACLED collects real-time data on the locations, dates, actors, fatalities, and types of all reported political violence and protest events around the world, drawing on various international and regional news sources. The ACLED data for Ukraine and Russia were aggregated to obtain daily counts of event types in the following categories:

Armed Clashes
Shelling/Artillery/Missile Attacks
Remote Explosives/Landmines/IED
Air/Drone Strikes
Disrupted Weapons Use

The submission titles were searched with regular expressions for terms related to the aforementioned event types to obtain daily counts for these events. The cosine similarity between the ACLED counts and the counts obtained from submissions was computed for each event type, as shown below in table 2.4. Our results indicate that the Reddit data is not very similar to the ACLED data, with the highest score being about 0.21. One possible reason for the low similarity is that our data has been filtered to English, whereas ACLED uses its own translation methodology and also covers regional-level news related to the conflict.

Table 2.4 : Cosine Similarity Scores for ACLED and Submissions Dataset on Different War Events

╒═════════════════════════════════╤═════════════════════╕
│ War Event Type                  │   Cosine Similarity │
╞═════════════════════════════════╪═════════════════════╡
│ Armed Clash                     │           0.076989  │
│ Shelling/Artillery/Missile      │           0.210093  │
│ Remote Explosives/Landmines/IED │           0.0882741 │
│ Air/Drone Strike                │           0.050086  │
│ Disrupted Weapons Use           │           0.115323  │
╘═════════════════════════════════╧═════════════════════╛
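
The comparison behind table 2.4 can be sketched as follows. This is a minimal illustration under stated assumptions: the report does not list the search terms, so the pattern below is hypothetical, as are the function names and sample data. Days missing from either source are treated as zero counts, which is also an assumption.

```python
import math
import re

# Hypothetical pattern; the report does not list the terms it searched for.
SHELLING = re.compile(r"\b(shell(?:ing|ed|s)?|artillery|missiles?)\b", re.IGNORECASE)


def daily_title_counts(titles_by_day, pattern):
    """Count, per day, the titles that mention at least one matching term."""
    return {
        day: sum(1 for title in titles if pattern.search(title))
        for day, titles in titles_by_day.items()
    }


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity of two {day: count} series aligned on the union of days."""
    days = sorted(set(counts_a) | set(counts_b))
    a = [counts_a.get(day, 0) for day in days]
    b = [counts_b.get(day, 0) for day in days]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up titles and ACLED counts, for illustration only.
titles_by_day = {
    "2022-03-01": ["Russian shelling hits Kharkiv", "Talks resume"],
    "2022-03-02": ["Missile strikes reported", "Artillery fire near Kyiv"],
}
reddit_counts = daily_title_counts(titles_by_day, SHELLING)
acled_counts = {"2022-03-01": 2, "2022-03-02": 4}
similarity = cosine_similarity(reddit_counts, acled_counts)
```

Cosine similarity compares the shape of the two daily series rather than their magnitudes, so a score near 0 means the days on which Reddit titles mention an event type rarely coincide with the days on which ACLED records many such events.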