Exploratory Data Analysis
Preprocessing and Data Dictionary
We assessed the basic specifications of the dataset, removed duplicates and anomalies, and dropped unneeded columns, leaving 170,144 submissions and 18,548,934 comments in the respective datasets. The variables of interest after preprocessing are listed below:
In the Submissions dataset:
- author : The user who created the post.
- created_utc : The time the submission or comment was posted. Used in the time series analysis section of our project.
- domain : The news site posted in the submission.
- id : The unique identifier of each post.
- num_comments : The number of comments under each submission. This may not capture the exact picture as it is dependent on the day the data was retrieved.
- url : The URL associated with the post. In most submissions this URL points to the news article.
- score : The Karma score awarded to each post.
In the Comments dataset:
- author : The user who posted the comment.
- created_utc : The time the submission or comment was posted. Used in the time series analysis section of our project.
- body : The text in the comment.
- id : The unique identifier of each comment.
- link_id : The id of the submission under which the comment exists.
- controversiality : Whether a comment was classified as ‘controversial’.
- gilded : Whether the comments have been gilded or awarded.
- distinguished : Whether the comment has been distinguished as a moderator comment.
- score : The Karma score awarded to each comment.
During this process, several dummy variables were created to aid our analysis. A foreign-key-like variable called submission_id was also created in the comments dataset, linking each comment to the submission it was made under.
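The report does not say how submission_id was derived. A minimal sketch, assuming link_id follows the Reddit fullname convention in which a submission id carries a "t3_" type prefix; the helper name and the sample records are hypothetical:

```python
def to_submission_id(link_id: str) -> str:
    """Strip the 't3_' type prefix (assumed) so the value matches a submission id."""
    prefix = "t3_"
    return link_id[len(prefix):] if link_id.startswith(prefix) else link_id


# Hypothetical comment records; only the field names come from the data dictionary.
comments = [
    {"id": "c1", "link_id": "t3_abc123", "body": "first comment"},
    {"id": "c2", "link_id": "t3_abc123", "body": "second comment"},
    {"id": "c3", "link_id": "t3_xyz789", "body": "third comment"},
]

# Add the foreign-key-like column to every comment.
for comment in comments:
    comment["submission_id"] = to_submission_id(comment["link_id"])
```

With this column in place, comments can be joined to submissions on submission_id and the submission's id.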
Analyzing User Activity
In this section, we dig a little deeper into how users act on this subreddit. The subreddit currently has 31.5 million subscribers. However, we found that only around 27,000 users created submissions and around 1.2 million users commented in the past year. Of these, only a small proportion posted actively, as evidenced by Figure 2.1 below.
Figure 2.1 shows the percentage of users posting in each month out of the total distinct users in the respective datasets. A higher proportion of users commented on posts than made a post themselves. In both cases, however, less than half of the distinct users were active in any given month.
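The monthly share described above could be computed along the following lines. This is a sketch only: it assumes created_utc holds epoch seconds, and the function name and sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def monthly_active_share(records):
    """Return {"YYYY-MM": percent of all distinct authors who posted that month}."""
    all_authors = set()
    authors_by_month = defaultdict(set)
    for record in records:
        # Assumption: created_utc is a Unix timestamp in seconds.
        posted = datetime.fromtimestamp(record["created_utc"], tz=timezone.utc)
        all_authors.add(record["author"])
        authors_by_month[posted.strftime("%Y-%m")].add(record["author"])
    total = len(all_authors)
    return {
        month: 100.0 * len(authors) / total
        for month, authors in sorted(authors_by_month.items())
    }


def utc_ts(year, month, day):
    """Helper for the example: epoch seconds at midnight UTC."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


# Hypothetical rows: four distinct authors over two months.
rows = [
    {"author": "a", "created_utc": utc_ts(2022, 1, 5)},
    {"author": "b", "created_utc": utc_ts(2022, 1, 9)},
    {"author": "a", "created_utc": utc_ts(2022, 2, 1)},
    {"author": "c", "created_utc": utc_ts(2022, 2, 3)},
    {"author": "d", "created_utc": utc_ts(2022, 2, 20)},
]
shares = monthly_active_share(rows)
print(shares)  # {'2022-01': 50.0, '2022-02': 75.0}
```

The denominator is the number of distinct authors over the whole period, not per month, which matches the description of the figure.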
The top 10 posters and commenters were also identified, and the users with the most controversial comments were evaluated. Their controversial comment counts were compared against their total gilded and distinguished comments, as shown below in Table 2.1.
Table 2.1 : Comparison of Users With Most Number of Controversial Comments
╒════════════════════╤══════════════════════════╤═══════════════════╤══════════════════════════╕
│ User │ Controversial Comments │ Gilded Comments │ Distinguished Comments │
╞════════════════════╪══════════════════════════╪═══════════════════╪══════════════════════════╡
│ catsinbananahats │ 382 │ 2 │ 0 │
│ Torifyme12 │ 355 │ 3 │ 0 │
│ green_flash │ 335 │ 7 │ 13 │
│ ylteicz123 │ 317 │ 2 │ 0 │
│ Test19s │ 307 │ 5 │ 0 │
│ IsraeliDonut │ 306 │ 4 │ 0 │
│ KeyWestTime │ 299 │ 4 │ 0 │
│ Freschledditor │ 290 │ 1 │ 0 │
│ HamburgerEarmuff │ 281 │ 1 │ 0 │
│ Goshdang56 │ 275 │ 0 │ 0 │
│ _Plork_ │ 271 │ 1 │ 0 │
│ stretching_holes │ 270 │ 3 │ 0 │
│ feeltheslipstream │ 267 │ 2 │ 0 │
│ timelyparadox │ 260 │ 3 │ 0 │
│ Silurio1 │ 252 │ 0 │ 0 │
│ InnocentTailor │ 245 │ 0 │ 0 │
│ JPR_FI │ 245 │ 0 │ 0 │
│ albertnormandy │ 241 │ 0 │ 0 │
│ Foreign-Engine8678 │ 239 │ 2 │ 0 │
│ pieter1234569 │ 238 │ 0 │ 0 │
╘════════════════════╧══════════════════════════╧═══════════════════╧══════════════════════════╛
We examined whether high controversiality led to a higher number of gilded or distinguished comments, but that was not the case.
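A table like Table 2.1 could be assembled by tallying the three flags per author. The sketch below assumes each flag can be treated as truthy or falsy (for example a 0/1 dummy variable); the function name and sample records are hypothetical.

```python
from collections import defaultdict


def tally_user_flags(comments, top_n=20):
    """Count controversial, gilded and distinguished comments per author,
    then return the top_n authors ranked by controversial count."""
    totals = defaultdict(
        lambda: {"controversial": 0, "gilded": 0, "distinguished": 0}
    )
    for comment in comments:
        row = totals[comment["author"]]
        # Assumption: any truthy value means the flag is set.
        row["controversial"] += 1 if comment["controversiality"] else 0
        row["gilded"] += 1 if comment["gilded"] else 0
        row["distinguished"] += 1 if comment["distinguished"] else 0
    ranked = sorted(
        totals.items(), key=lambda item: item[1]["controversial"], reverse=True
    )
    return ranked[:top_n]


# Hypothetical comment records.
sample = [
    {"author": "u1", "controversiality": 1, "gilded": 0, "distinguished": None},
    {"author": "u1", "controversiality": 1, "gilded": 1, "distinguished": None},
    {"author": "u2", "controversiality": 1, "gilded": 0, "distinguished": "moderator"},
    {"author": "u2", "controversiality": 0, "gilded": 0, "distinguished": None},
]
top_users = tally_user_flags(sample, top_n=2)
```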
Time Series Analysis
The next step in understanding the trends within the datasets was to plot multiple time series graphs. Much like Figure 1.2, a daily submission and comment frequency graph was also plotted, as shown below.
There are some points of interest evident in the plot above. The first is the occurrence of temporal gaps in the data, which could be attributed to the data collection process of the Pushshift Reddit dataset. Second, there is a peak in submissions and comments in late February. Marking the start date of the Russia-Ukraine conflict on the chart makes it apparent that this high volume of submissions and comments could have stemmed from the ongoing war.
Exploring this further, the datasets were filtered to retain only observations regarding the conflict. After visualizing the monthly frequencies of these observations, it was found that the patterns observed in Figure 2.3 were almost identical to the ones present in Figure 1.2. This corroborates our earlier findings about increased activity in the subreddit during different phases of the Russia-Ukraine conflict.
Karma scores often determine user activity in any subreddit. To gauge the overall popularity of submissions, the quantiles of their scores per month were visualized. A score is the difference between the number of upvotes and the number of downvotes that a submission receives. The median line in the visualization shows that the typical submission has a score between 10 and 30. Since the 25th percentile remains 0 in every month, it can be inferred that there are submissions (although few) with an overall negative score. The 75th percentile line reveals that there are submissions with very high overall scores as well. Additionally, these high scores do not follow the same monthly trend as the median, indicating that a handful of submissions may account for the very high overall scores in a particular month.
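The monthly quartiles discussed above can be sketched with the standard library. The report does not state which quantile method was used; this example assumes the "inclusive" interpolation method and epoch-second timestamps, and the function name and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import quantiles


def monthly_score_quartiles(submissions):
    """Return {"YYYY-MM": {"p25": ..., "p50": ..., "p75": ...}} of scores."""
    scores_by_month = defaultdict(list)
    for sub in submissions:
        posted = datetime.fromtimestamp(sub["created_utc"], tz=timezone.utc)
        scores_by_month[posted.strftime("%Y-%m")].append(sub["score"])
    result = {}
    for month, scores in sorted(scores_by_month.items()):
        # Needs at least two scores per month; n=4 yields the three quartile cuts.
        p25, p50, p75 = quantiles(scores, n=4, method="inclusive")
        result[month] = {"p25": p25, "p50": p50, "p75": p75}
    return result


# Hypothetical January scores with one very popular submission.
january = int(datetime(2022, 1, 15, tzinfo=timezone.utc).timestamp())
subs = [{"created_utc": january, "score": s} for s in (0, 0, 10, 20, 5000)]
quartiles_by_month = monthly_score_quartiles(subs)
print(quartiles_by_month)
# {'2022-01': {'p25': 0.0, 'p50': 10.0, 'p75': 20.0}}
```

In this toy example the single score of 5000 leaves all three quartiles untouched, which illustrates why a few highly scored submissions can dominate a month's total without moving the median.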
Finally, to further capture trends in submission posts, the monthly submission frequencies of the five authors with the highest number of submissions in this time period were visualized. From the graph, it can be inferred that the posting patterns of some of these authors resemble the overall submission frequency plot. However, two other authors had submissions only during some months of the year. In fact, hieronymusanonymous posted only during the second half of the year, indicating that there might be authors who did not post much about the Russia-Ukraine conflict.
Most Common Words
Progressing with our analysis, we also looked at the most commonly used words or phrases in the comments on the top 3 news stories. To evaluate this, we generated word clouds, as shown below.
Figure 2.6 : Word Cloud Of Comments From Top 3 News Stories
These word clouds revealed that terms relating to the Russia-Ukraine conflict and the political leaders of these countries were the most repeated. Terms relating to Queen Elizabeth II’s demise and the British royal family were also frequent.
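A word cloud is driven by term frequencies, so the counts behind it can be sketched as below. The tokenizer, the short stopword list and the sample comments are all illustrative assumptions, not the ones used in the project.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stopword list.
STOPWORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "it",
    "that", "this", "for", "on", "was",
}


def top_words(bodies, n=10):
    """Return the n most frequent non-stopword tokens across comment bodies."""
    counts = Counter()
    for body in bodies:
        tokens = re.findall(r"[a-z']+", body.lower())
        counts.update(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(n)


# Hypothetical comment bodies.
bodies = [
    "Russia invaded Ukraine",
    "The war in Ukraine is escalating",
    "Ukraine and Russia talks",
]
print(top_words(bodies, n=2))  # [('ukraine', 3), ('russia', 2)]
```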
Comparison with Other Sources
As a final task, we sought to compare the information present in the subreddit’s submissions about the events pertaining to Russia and Ukraine with the events data from the Armed Conflict Location & Event Data Project (ACLED). ACLED collects real-time data on locations, dates, actors, fatalities and types of all reported political violence and protest events around the world, from various international and regional news sources. The ACLED data for Ukraine and Russia were aggregated to obtain daily counts of event types in the following categories:
- Armed Clashes
- Shelling/Artillery/Missile Attacks
- Remote Explosives/Landmines/IED
- Air/Drone Strikes
- Disrupted Weapons Use
The submission titles were analyzed using regex to find terms related to the aforementioned event types and obtain daily counts for these events. The cosine similarity between the ACLED counts and the counts obtained from submissions was computed for each event type, as shown below in Table 2.4. Our results indicate that the Reddit data is not very similar to the ACLED data: every score is below 0.25, with the highest (about 0.21) for shelling, artillery and missile attacks. One possible reason for the low similarity is that our data has been filtered to English, whereas ACLED uses its own translation methodology and also covers regional-level news related to the conflict.
Table 2.4 : Cosine Similarity Scores for ACLED and Submissions Dataset on Different War Events
╒═════════════════════════════════╤═════════════════════╕
│ War Event Type │ Cosine Similarity │
╞═════════════════════════════════╪═════════════════════╡
│ Armed Clash │ 0.076989 │
│ Shelling/Artillery/Missile │ 0.210093 │
│ Remote Explosives/Landmines/IED │ 0.0882741 │
│ Air Drone Strike │ 0.050086 │
│ Disrupted Weapons Use │ 0.115323 │
╘═════════════════════════════════╧═════════════════════╛
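The comparison behind Table 2.4 can be sketched as follows. The report does not list the regex terms it matched, so the pattern here is a hypothetical stand-in for one event type, and the titles and ACLED counts are made-up sample data.

```python
import math
import re

# Hypothetical pattern for the shelling/artillery/missile event type.
SHELLING = re.compile(r"\b(shell(?:ing|ed)?|artillery|missiles?)\b", re.IGNORECASE)


def daily_counts(titles_by_day, pattern):
    """Count, per day, the titles that mention the pattern at least once."""
    return {
        day: sum(1 for title in titles if pattern.search(title))
        for day, titles in titles_by_day.items()
    }


def cosine_similarity(a, b):
    """Cosine similarity of two day->count mappings, aligned on the union of days."""
    days = sorted(set(a) | set(b))
    va = [a.get(day, 0) for day in days]
    vb = [b.get(day, 0) for day in days]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    # Two all-zero series have no defined angle; report 0.0 in that case.
    return dot / norm if norm else 0.0


# Made-up sample data for two days.
titles_by_day = {
    "2022-03-01": ["Missile strikes hit Kharkiv", "Talks resume"],
    "2022-03-02": ["Artillery fire reported", "Shelling continues overnight"],
}
reddit_counts = daily_counts(titles_by_day, SHELLING)
acled_counts = {"2022-03-01": 2, "2022-03-02": 4}
similarity = cosine_similarity(reddit_counts, acled_counts)
```

Because cosine similarity ignores overall scale, two series that rise and fall together score close to 1 even when one source reports far fewer events than the other; low scores therefore point to a mismatch in timing rather than in volume.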