Business Goals
Exploratory Data Analysis Goals
Business Goal - 1: We will begin by examining user behavior. We will check what proportion of users post submissions and comments, which will help us determine whether the subreddit has hyperactive users. We will also identify the users who have posted the most controversial comments, so that we can see whether controversiality correlates with a user’s likelihood of having gilded or distinguished comments in this subreddit.
Technical Proposal: We will generate tables from the data, aggregated by user, to answer the above questions. The number of distinct users posting or commenting will be compared against the total number of subscribers in the subreddit (obtained from an external source). Using the same aggregated data, we will sort users by their number of controversial comments and inspect their total gilded and distinguished counts.
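The aggregation described above could be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the field names (author, controversiality, gilded, distinguished), the toy records, and the subscriber count are all assumptions.

```python
from collections import defaultdict

# Toy comment records; field names are assumed, not taken from the dataset schema.
comments = [
    {"author": "alice", "controversiality": 1, "gilded": 0, "distinguished": None},
    {"author": "alice", "controversiality": 1, "gilded": 1, "distinguished": None},
    {"author": "bob", "controversiality": 0, "gilded": 0, "distinguished": "moderator"},
    {"author": "carol", "controversiality": 1, "gilded": 0, "distinguished": None},
]


def aggregate_by_user(rows):
    """Build one row of counts per author."""
    stats = defaultdict(
        lambda: {"comments": 0, "controversial": 0, "gilded": 0, "distinguished": 0}
    )
    for row in rows:
        user = stats[row["author"]]
        user["comments"] += 1
        user["controversial"] += row["controversiality"]
        user["gilded"] += row["gilded"]
        # Any non-empty distinguished value counts as a distinguished comment.
        user["distinguished"] += 1 if row["distinguished"] else 0
    return dict(stats)


def rank_by_controversial(stats):
    """Sort users by their number of controversial comments, highest first."""
    return sorted(stats.items(), key=lambda kv: kv[1]["controversial"], reverse=True)


user_stats = aggregate_by_user(comments)
ranking = rank_by_controversial(user_stats)

# Placeholder value; the real subscriber count would come from an external source.
subscriber_count = 1000
active_share = len(user_stats) / subscriber_count
```

The ranked list keeps each user's gilded and distinguished totals next to the controversial count, so the three can be compared directly.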
Business Goal - 2: We will identify the general frequency trends in submissions and comments in the r/worldnews subreddit, and examine how characteristic features of the subreddit vary over time and compare with the general trend. This could also be used to identify events that changed user behavior.
Technical Proposal: To perform time series analysis, we will use the creation timestamp of each observation. We will then visualize and analyze the posting frequency for each date.
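As a rough sketch of the counting step, the snippet below converts creation timestamps to calendar dates and counts posts per day. It assumes the timestamps are Unix epoch seconds in UTC; the variable name created_utc and the sample values are illustrative.

```python
from collections import Counter
from datetime import datetime, timezone


def daily_counts(timestamps):
    """Count observations per UTC calendar date, ordered by date."""
    days = (datetime.fromtimestamp(ts, tz=timezone.utc).date() for ts in timestamps)
    return dict(sorted(Counter(days).items()))


# Assumed input: epoch seconds, two posts on one day and one on the next.
created_utc = [1645660800, 1645664400, 1645747200]
counts = daily_counts(created_utc)
```

The resulting date-to-count mapping is the series that would be plotted and inspected for spikes.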
Business Goal - 3: We will examine the most popular (most upvoted) news sources and stories in the subreddit, count the ‘live threads,’ and check what these threads relate to.
Technical Proposal: The most popular submissions can be derived from the score variable. The most shared news sites can be found by aggregating on the domain column and examining the number of posts, comments, posts in the top 100, and the average scores. Live threads can be obtained using regex on the title column by applying an appropriate search term.
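A minimal sketch of the domain aggregation and the title search might look like this. The search term "live thread", the sample rows, and the field names are assumptions made for illustration; the actual search term would need to be chosen after inspecting the titles.

```python
import re
from collections import defaultdict

# Toy submissions; titles, domains, and scores are made up.
submissions = [
    {"title": "Live Thread: Russian invasion of Ukraine, day 10", "domain": "reddit.com", "score": 5000},
    {"title": "Leaders meet for climate summit", "domain": "bbc.com", "score": 1200},
    {"title": "Central bank raises rates", "domain": "reuters.com", "score": 800},
    {"title": "Earthquake strikes coastal region", "domain": "bbc.com", "score": 400},
]

# Assumed search term: the phrase "live thread", case-insensitive.
LIVE_THREAD = re.compile(r"\blive\s+thread\b", re.IGNORECASE)


def domain_summary(rows):
    """Number of posts and mean score per domain."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [post count, score sum]
    for row in rows:
        totals[row["domain"]][0] += 1
        totals[row["domain"]][1] += row["score"]
    return {d: {"posts": n, "mean_score": total / n} for d, (n, total) in totals.items()}


summary = domain_summary(submissions)
live_threads = [r["title"] for r in submissions if LIVE_THREAD.search(r["title"])]
top_submission = max(submissions, key=lambda r: r["score"])
```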
Business Goal - 4: We plan to examine the most frequently used words in the comments under the top five news stories in the subreddit.
Technical Proposal: By sorting on the “score” column of the dataset, the most popular submissions can be found. Word clouds can then be created to explore the most frequent words in the comments of each of the top five submissions. These visualizations summarize the popular topics in the subreddit.
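The word counts behind such word clouds could be computed roughly as below. This sketch stops at the frequency tables and does not draw the clouds; the field names (id, score, link_id, body), the stopword list, and the sample comments are assumptions.

```python
import re
from collections import Counter

# Toy data; field names and values are illustrative only.
submissions = [
    {"id": "s1", "score": 900},
    {"id": "s2", "score": 150},
]
comments = [
    {"link_id": "s1", "body": "Sanctions are the answer"},
    {"link_id": "s1", "body": "More sanctions, and soon"},
    {"link_id": "s1", "body": "The sanctions hurt everyone"},
    {"link_id": "s2", "body": "Rates went up again"},
]

# A deliberately tiny stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it", "this"}


def top_submission_ids(rows, k=5):
    """Ids of the k highest-scoring submissions."""
    ranked = sorted(rows, key=lambda r: r["score"], reverse=True)
    return [r["id"] for r in ranked[:k]]


def word_frequencies(texts):
    """Lower-cased word counts with stopwords removed."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS)
    return counts


frequencies = {
    sid: word_frequencies(c["body"] for c in comments if c["link_id"] == sid)
    for sid in top_submission_ids(submissions)
}
```

Each per-submission frequency table can then be handed to whichever word cloud tool is chosen.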
Business Goal - 5: Given the prominence of the Russia-Ukraine Conflict, we wish to see whether the r/worldnews subreddit captured all events related to it.
Technical Proposal: To compare posts regarding the Russia-Ukraine conflict against recorded events, we propose using the ACLED (Armed Conflict Location and Event Data Project) data. Using terms relevant to war-related events, we can create dummy variables for each news post and find the daily counts of each event type. The two datasets will then be merged on the event date, and their counts will be compared using suitable proximity measures.
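One way the comparison could work is sketched below. The event categories, keyword lists, post titles, and ACLED counts are all hypothetical placeholders, and cosine similarity is only one candidate proximity measure; the proposal does not commit to a specific one.

```python
import math
from collections import Counter

# Hypothetical event categories and search terms; not ACLED's own taxonomy.
EVENT_TERMS = {
    "shelling": ["shelling", "artillery"],
    "air_strike": ["air strike", "airstrike", "missile"],
}


def flag_events(title):
    """Dummy variables: 1 if the title mentions any term for the event type."""
    lowered = title.lower()
    return {event: int(any(term in lowered for term in terms))
            for event, terms in EVENT_TERMS.items()}


def daily_event_counts(posts):
    """Sum the dummy variables per (date, event type)."""
    counts = Counter()
    for post in posts:
        for event, flag in flag_events(post["title"]).items():
            counts[(post["date"], event)] += flag
    return counts


def cosine_similarity(a, b):
    """Cosine similarity of two count tables aligned on their (date, event) keys."""
    keys = sorted(set(a) | set(b))
    va = [a.get(k, 0) for k in keys]
    vb = [b.get(k, 0) for k in keys]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0


# Made-up posts and made-up ACLED daily counts, for illustration only.
posts = [
    {"date": "2022-03-01", "title": "Missile hits Kharkiv"},
    {"date": "2022-03-01", "title": "Heavy shelling reported near Mariupol"},
    {"date": "2022-03-02", "title": "Talks resume in Belarus"},
]
acled_counts = {("2022-03-01", "air_strike"): 1, ("2022-03-01", "shelling"): 1}

reddit_counts = daily_event_counts(posts)
similarity = cosine_similarity(reddit_counts, acled_counts)
```

Aligning both tables on the union of (date, event type) keys plays the role of the merge on event date, with missing days treated as zero counts.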
Natural Language Processing Goals
Business Goal - 6: We wish to see what the main topics in the submissions of the subreddit are.
Technical Proposal: To identify the major topics in submissions related to the Russia-Ukraine Conflict, we will perform topic modeling with LDA (Latent Dirichlet Allocation). LDA was chosen because its results are based on conditional probability estimates.
Business Goal - 7: We will examine what the key entities in the submissions are and which categories they can be classified into.
Technical Proposal: Named entity recognition will be performed using pre-trained models available from John Snow Labs. We will identify standard categories in the text data, such as person names, geographic locations, and organizations.
Business Goal - 8: From previous analysis, it was determined that the live threads were tied to the Russia-Ukraine Conflict. Given such a high-stakes topic, we are curious to see how the user base reacted to it. To do so, we plan to determine the sentiment of the comments.
Technical Proposal: For sentiment analysis, we will use pre-trained models from John Snow Labs. We will try several models and compare their results to verify our findings.
Machine Learning Goals
Business Goal - 9: Building on the pre-trained model results, we wish to develop a lexicon-based sentiment model using our data. This will allow us to apply sentiment labels to our data and train supervised learning models, enabling better model evaluation.
Technical Proposal: After labelling the dataset using the VADER sentiment lexicon, we can train supervised models with the sentiment labels as the target variable. A portion of the data will be labelled through lexicon-based sentiment analysis, and the remaining data can then be labelled by the trained supervised learning models.
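The labelling step could be sketched as follows. VADER produces a compound score between -1 and 1 for each text; the ±0.05 cut-offs used here are the conventional thresholds, and the split fraction is an assumption. To keep the sketch free of external dependencies, a toy stand-in scorer is used; in practice the scorer would be something like `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` from the vaderSentiment package.

```python
import random


def label_from_compound(compound, threshold=0.05):
    """Map a compound sentiment score in [-1, 1] to a class label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def split_for_labelling(texts, labelled_fraction=0.5, seed=0):
    """Split texts into a lexicon-labelled part and a part left for the trained model."""
    shuffled = list(texts)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * labelled_fraction)
    return shuffled[:cut], shuffled[cut:]


def label_texts(texts, compound_scorer):
    """Attach a sentiment label to each text using the supplied scorer."""
    return [(text, label_from_compound(compound_scorer(text))) for text in texts]


def toy_scorer(text):
    """Stand-in for a real lexicon scorer; NOT VADER, for illustration only."""
    lowered = text.lower()
    if "good" in lowered:
        return 0.6
    if "terrible" in lowered:
        return -0.6
    return 0.0


texts = ["Good news at last", "This is terrible", "Talks resume on Monday", "A good outcome"]
to_label, to_predict = split_for_labelling(texts)
training_rows = label_texts(to_label, toy_scorer)
```

The rows in training_rows would serve as the training set for the supervised models, which would then label the texts in to_predict.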
Business Goal - 10: We plan to build a model that predicts controversiality from the text data.
Technical Proposal: The text data will undergo TF-IDF weighting. The weighted text will then be used as input to predict controversiality.
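To make the weighting concrete, here is a small from-scratch sketch of one common TF-IDF variant: term frequency (count divided by document length) times log(N / document frequency). Library implementations typically apply smoothing and normalization, so their numbers will differ; the tokenizer and sample documents are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case and split into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


docs = ["war is bad", "war is over", "peace talks begin"]
weights = tfidf(docs)
```

Terms that appear in fewer documents receive higher weights, so in the first document "bad" outweighs "war". These per-document weight vectors are what a classifier for controversiality would take as input.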
Future Goals
Business Goal - 11: Karma scores can often influence a user’s actions on a particular subreddit. We wish to see whether individual user karma scores can be predicted from the variables in our dataset.
Technical Proposal: Using the scores in the comments dataset as our target variable, we plan to construct a number of predictors from the dataset, such as the number of submissions or comments posted and the number of gilded comments, to predict the score value.