Text Classification — AIExplorer

Introduction

In today's digital age, social media has become a powerful tool for businesses to connect with their target audience and drive brand awareness. Recognizing the vast potential that lies within the popular social media platform Reddit, a baby food startup company has decided to embark on a data-driven marketing campaign. As a skilled data scientist, your role is pivotal in guiding the marketing team towards success.

Reddit, consistently ranked among the 10 most popular social media sites in the USA, holds immense promise for the startup's social media campaign. With its low ad competition and cost-effective inventory, Reddit presents a unique opportunity to reach a large and engaged user base. However, advertising on Reddit is notoriously challenging, because its users, known as redditors, highly value authenticity and community engagement.

To forge a genuine connection with redditors, the key lies in understanding the platform's dynamic discussions, trending topics, user interests, and unique linguistic styles found within specific subreddits. By mining data from Reddit posts, valuable insights can be extracted. Employing the power of natural language processing (NLP) tools enables the analysis of targeted subreddits, unveiling hidden trends and identifying relevant ad-keywords. This knowledge empowers the marketing team to craft compelling promotional posts and content that resonate with the community, ensuring the ads are posted in the most suitable subreddits.

For this particular promotional campaign, the subreddits of interest are r/BabyBumps and r/beyondthebump, which cater directly to the startup's target audience. To assist the marketing team in creating an effective Reddit advertising campaign, a Logistic Regression classification model has been developed. This model predicts which subreddit a post belongs to, helping ensure that relevant ads are shown to the appropriate audience.

Furthermore, leveraging data analysis and NLP techniques, deeper insights can be derived from these target subreddits. This knowledge will inform decision-making processes, enabling the generation of relevant and engaging content that captures and retains the interest of redditors.

As a data-driven marketing campaign unfolds on Reddit, the potential for the baby food startup to captivate and connect with their target audience becomes increasingly tangible. Through the synergistic efforts of data science and marketing expertise, the stage is set for a remarkable journey of growth and success.

Datasets

In order to develop a robust classification model for our NLP project, an extensive process of data scraping and cleaning was undertaken. Leveraging the power of the Pushshift Reddit API, we scraped a substantial amount of data from two significant subreddits: r/BabyBumps and r/beyondthebump.

The API was configured to retrieve 5,000 Reddit posts from each of these subreddits, providing a diverse and comprehensive dataset for analysis.
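Collecting several thousand posts usually means requesting them page by page. The sketch below shows one way such a loop could look, assuming results arrive newest first and that each request can be bounded by a `before` timestamp. The `fetch_page` callable is a hypothetical stand-in for the actual HTTP request, and the function name and page size are illustrative, not taken from the project.

```python
def collect_posts(fetch_page, subreddit, target=5000, page_size=100):
    """Gather up to `target` posts by paging backwards in time.

    `fetch_page(subreddit, before, size)` is a hypothetical callable that
    returns a list of post dicts (newest first), each with a 'created_utc'
    timestamp. It stands in for the real API request.
    """
    posts = []
    before = None  # None means "start from the newest post"
    while len(posts) < target:
        size = min(page_size, target - len(posts))
        page = fetch_page(subreddit, before, size)
        if not page:
            break  # no older posts left
        posts.extend(page)
        # Move the cursor to the oldest timestamp seen so far.
        before = min(post["created_utc"] for post in page)
    return posts
```

Keeping the request behind a callable makes the paging logic easy to check with a fake data source before pointing it at a live API.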

Once the data was collected, the next crucial step was to clean and prepare it for further processing. Several cleaning techniques were applied to ensure the integrity and quality of the dataset:

Removal of non-unicode characters, emojis, and hyperlinks: To maintain the focus on textual content, any non-unicode characters, emojis, and hyperlinks were eliminated from the dataset. This step ensured that the analysis would be based solely on the text of the posts.

Elimination of empty posts: Posts that contained empty titles or body text were identified and removed from the dataset. This step helped filter out any posts that lacked substantial content, ensuring that our model would be trained on meaningful and informative data.

Substitution of abbreviated texts: Regular expressions were employed to substitute certain commonly abbreviated texts found within the posts. For instance, expressions like "1w2d" were transformed to "1 week 2 days," and "2mos" became "2 months." This process aimed to standardize the language and facilitate a more consistent analysis.

Filtering specific post types: Certain post types were filtered out from the dataset to refine its quality. Stickied posts, posts with zero comments, as well as deleted posts by either the moderators or authors were excluded. This step allowed us to focus on posts that were more likely to represent genuine user-generated content.
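The cleaning steps above can be sketched with a few regular expressions. This is a minimal illustration, not the project's actual code: the patterns are assumptions, "non-unicode characters and emojis" is interpreted here as "anything outside ASCII", and the post fields (`title`, `selftext`, `stickied`, `num_comments`) are assumed to follow the usual Reddit submission layout.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Week/day shorthand such as "1w2d" (assumed pattern).
WEEK_DAY_RE = re.compile(r"\b(\d+)w(\d+)d\b", flags=re.IGNORECASE)
# Month shorthand such as "2mos" or "1mo" (assumed pattern).
MONTH_RE = re.compile(r"\b(\d+)\s?mos?\b", flags=re.IGNORECASE)
REMOVED_MARKERS = {"[removed]", "[deleted]"}


def _plural(count, unit):
    """Return e.g. '1 week' or '2 weeks'."""
    return f"{count} {unit}" if int(count) == 1 else f"{count} {unit}s"


def expand_abbreviations(text):
    """Spell out week/day and month shorthand."""
    text = WEEK_DAY_RE.sub(
        lambda m: f"{_plural(m.group(1), 'week')} {_plural(m.group(2), 'day')}",
        text,
    )
    return MONTH_RE.sub(lambda m: _plural(m.group(1), "month"), text)


def clean_text(text):
    """Strip hyperlinks and non-ASCII characters, then expand shorthand."""
    text = URL_RE.sub(" ", text)
    # Assumption: dropping everything outside ASCII also removes emojis.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = expand_abbreviations(text)
    return re.sub(r"\s+", " ", text).strip()


def keep_post(post):
    """Apply the post-level filters to one post dict (assumed field names)."""
    title = (post.get("title") or "").strip()
    body = (post.get("selftext") or "").strip()
    if not title or not body or body in REMOVED_MARKERS:
        return False  # empty, removed, or deleted
    if post.get("stickied"):
        return False  # stickied posts
    return post.get("num_comments", 0) > 0  # drop posts with zero comments
```

For example, `clean_text("I am 1w2d late")` yields "I am 1 week 2 days late". Stripping to ASCII is a blunt choice that also removes accented letters and curly quotes, so a narrower emoji filter may be preferable in practice.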

By conducting this meticulous data scraping and cleaning process, we have ensured that our classification model will be trained on reliable and relevant data. These efforts lay the foundation for the subsequent stages of analysis and the development of a highly accurate and insightful NLP model.

EDA

To gain deeper insights into the posts from the r/BabyBumps and r/beyondthebump subreddits, NLP tools such as spaCy and NLTK were employed. Through exploratory data analysis (EDA), I sought to understand the nuances and uncover the similarities and differences between the two subreddits' posts. Key aspects of the text data were explored to shed light on their content and characteristics.

Investigating the distribution of the word counts for the two subreddits:

By analyzing the word count distribution in r/BabyBumps and r/beyondthebump subreddits, I gained insights into post characteristics. This investigation revealed patterns in post length, helping tailor content strategies to match each community's preferences. Comparing the distributions shed light on the level of depth and complexity in discussions.

Investigating the distribution of the post lengths for the two subreddits:

To understand the characteristics of r/BabyBumps and r/beyondthebump, I examined the distribution of post lengths in each subreddit. This analysis unveiled insights into the structure and content of the posts. By studying the distribution, I could identify trends in post lengths and determine whether the subreddits favor shorter, concise posts or longer, in-depth discussions.
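As a rough sketch of the two measurements above, the function below summarizes word counts and post lengths for a list of posts. It assumes that "word count" means whitespace-separated tokens and "post length" means number of characters. Both definitions are my assumptions, and the function name is illustrative.

```python
from statistics import mean, median


def length_summary(posts):
    """Summarize word counts and character lengths for a list of post texts."""
    word_counts = [len(post.split()) for post in posts]
    char_lengths = [len(post) for post in posts]
    return {
        "n_posts": len(posts),
        "mean_words": mean(word_counts),
        "median_words": median(word_counts),
        "mean_chars": mean(char_lengths),
        "median_chars": median(char_lengths),
    }
```

Running this once per subreddit gives directly comparable numbers. The raw `word_counts` and `char_lengths` lists are what one would pass to a histogram to see the full distributions.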

Investigating Commonly Occurring Words in Subreddits:

Through an in-depth analysis, I explored the top 25 most frequently occurring words in the r/BabyBumps and r/beyondthebump subreddits. This investigation provided valuable insights into the recurring topics and discussions within each community. By identifying the common words, I gained a deeper understanding of the interests and concerns of the subreddit users.
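A word-frequency count of this kind can be sketched with the standard library alone. The tokenizer and the tiny stopword list below are simplifying assumptions. In the project itself a fuller stopword list (for example from NLTK or spaCy) would presumably be used.

```python
import re
from collections import Counter

# Deliberately tiny, illustrative stopword list (an assumption).
STOPWORDS = {"the", "a", "an", "and", "i", "to", "of", "is", "my", "in", "it", "for"}


def top_words(posts, n=25, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    counts = Counter()
    for post in posts:
        tokens = re.findall(r"[a-z']+", post.lower())
        counts.update(token for token in tokens if token not in stopwords)
    return counts.most_common(n)
```

Calling `top_words` separately on each subreddit's posts and comparing the two lists shows which terms are shared and which are specific to one community.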

Named-Entity Recognition Using spaCy:

In spaCy, named entities are real-world objects such as persons, organizations, locations, dates, percentages, and monetary values. These entities are identifiable and can be categorized into predefined types. spaCy's named entity recognition (NER) component identifies and classifies these entities within a given text. By analyzing the context and linguistic features of the text, it predicts entity spans and their types, providing information about the specific entities mentioned in a document or corpus.

Using spaCy's entity visualization, I highlighted the prominent entities in a typical Reddit post. This representation revealed key figures, locations, and topics, giving a quicker read on the post's content and context. Such insights can guide content creation and targeted marketing, helping marketers craft relevant content that matches the audience's interests and preferences.

Distribution of Named Entities by Subreddit:

I compared the distribution of named entities identified by spaCy to uncover potential trends specific to each subreddit. This information is valuable for model setup and hyperparameter tuning. Interestingly, the DATE entity emerged as the most frequent in both subreddits, consistent with discussions about pregnancies and babies, where dates and ages come up constantly. The TIME entity showed a similar trend. These visualizations also aided in assessing data quality. Understanding entity distributions improves our understanding of subreddit content and supports effective modeling for targeted marketing campaigns.
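Once the posts have been processed, an entity distribution like the one described above can be tallied as follows. The sketch relies only on each processed document exposing an `ents` collection whose items carry a `label_` string, which is how spaCy `Doc` objects behave. The function name is illustrative.

```python
from collections import Counter


def entity_label_distribution(docs):
    """Return each entity label's share of all entities, most frequent first.

    `docs` is any iterable of objects with an `ents` attribute whose items
    expose `label_` (for example, spaCy Doc objects).
    """
    counts = Counter(ent.label_ for doc in docs for ent in doc.ents)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.most_common()}
```

With spaCy and an English model such as `en_core_web_sm` installed, `docs` could come from `nlp.pipe(texts)` after `nlp = spacy.load("en_core_web_sm")`. Computing the distribution per subreddit then allows a side-by-side comparison of labels such as DATE and TIME.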

During my investigation, I delved deeper into the trends within each named entity type identified by the spaCy model. By visualizing the distributions of commonly occurring words within each entity type, I aimed to gain a better understanding of the context and themes discussed in the subreddits. This analysis provided insights that can be leveraged during the model building phase.

The purpose behind these comparisons was to identify any discernible patterns that shed light on the discussions and enable us to utilize this information effectively. The plots showcased the distributions of commonly occurring words within a specific entity, allowing for a comparison between the two subreddits.

By examining these trends, we can gain a clearer understanding of the prevalent topics, concerns, and interests within each subreddit. This information becomes invaluable when building the model, as it helps us align the model's focus with the specific characteristics of each subreddit, enhancing its accuracy and relevance.

Overall, the investigation into the trends within named entities provided crucial insights into the context and themes of discussions, allowing for informed decision-making during the model building phase and enabling effective utilization of the gathered information for targeted marketing campaigns.

Commonly Used Acronyms and Words:

Acknowledging redditors' affinity for acronyms and specialized lingo, I investigated these abbreviations within the r/BabyBumps and r/beyondthebump subreddits. This analysis involved comparing the occurrences and distributions of these acronyms to identify trends and patterns unique to each subreddit.

Understanding and tracking these acronyms is crucial for comprehending the discussions and engaging effectively with the community. As frequent visitors are already familiar with these abbreviations, they serve as a shorthand language that facilitates efficient communication.

By comparing the trends and distributions of these acronyms, I aimed to uncover any distinctions or variations in their usage between the two subreddits. This insight helps in understanding the communication styles and preferences of each community, enabling more effective targeting and engagement strategies.

Additionally, exploring the definitions and explanations of these acronyms in the subreddit wiki provided valuable context and allowed for a deeper understanding of the discussions taking place. By incorporating this knowledge into the model building phase, we can ensure our approach aligns with the linguistic nuances and preferences of the respective subreddits.

The visualization consists of pie charts, each representing the distribution of a specific acronym or word. The slices of each pie are shaded differently to distinguish r/BabyBumps from r/beyondthebump. The size of each pie reflects the relative frequency of that word across both subreddits. The pies are arranged in a spiral, with the largest positioned at the center. As you spiral out in a counterclockwise direction, the pies gradually decrease in size.

This approach allows for a visual comparison of word distributions between the subreddits. The differing shades and pie sizes show the relative frequency and prevalence of the selected words within each subreddit. By observing the variations and patterns, you can gain insights into the linguistic characteristics and communication trends specific to r/BabyBumps and r/beyondthebump.

Sentiment Analysis:

Performing sentiment analysis on the posts from r/BabyBumps and r/beyondthebump subreddits, I compared the sentiment distributions between the two communities. This analysis aimed to understand the prevailing sentiment and emotional tone of the discussions within each subreddit.

By utilizing natural language processing techniques, I assigned sentiment scores to individual posts, categorizing them as positive, negative, or neutral. I then visualized the sentiment distributions to highlight any significant differences or similarities between the subreddits.
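The write-up does not say which sentiment tool or cut-offs were used, so the following is only a sketch of the bucketing step. It assumes each post already has a single score in the range -1 to 1 (as a lexicon-based scorer such as NLTK's VADER would produce) and uses the conventional thresholds of +0.05 and -0.05. Both are assumptions.

```python
from collections import Counter


def label_sentiment(score, threshold=0.05):
    """Map a score in [-1, 1] to 'positive', 'negative', or 'neutral'."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def sentiment_distribution(scores, threshold=0.05):
    """Return the share of posts falling into each sentiment category."""
    scores = list(scores)
    categories = ("positive", "neutral", "negative")
    if not scores:
        return {category: 0.0 for category in categories}
    labels = Counter(label_sentiment(score, threshold) for score in scores)
    return {category: labels[category] / len(scores) for category in categories}
```

Computing `sentiment_distribution` once per subreddit gives the two distributions that the comparison is based on.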

The sentiment analysis comparison provided valuable insights into the overall emotional climate and user experiences within each community. It allowed for a deeper understanding of the prevailing sentiments expressed by the subreddit members, enabling marketers and community managers to tailor their strategies accordingly.

Model Building

After scraping and cleaning the data, a baseline accuracy of 41% was established for our dataset. To improve upon this baseline, four different classification models were built to accurately identify the subreddits based on the content of the posts.

The first model employed was LogisticRegression, followed by the MultinomialNB Classifier, RandomForest Classifier, and C-Support Vector Classifier. However, these models initially showed signs of overfitting when tested with default hyperparameters.

To address the overfitting issue and enhance model performance, scikit-learn's GridSearchCV and RandomizedSearchCV classes were used. These tools search over hyperparameter combinations with cross-validation (exhaustively for GridSearchCV, by random sampling for RandomizedSearchCV), helping to identify good settings for each model. Each model was wrapped in a pipeline in which a text vectorizer, either CountVectorizer or TfidfVectorizer, converts the posts into numeric features and a final estimator performs the classification.
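A pipeline and grid search of this kind might look roughly like the sketch below. The toy corpus is invented for illustration, the step names and parameter grid are my own choices, and only one vectorizer and one estimator are shown. None of these values are the project's actual settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy corpus: 0 = r/BabyBumps, 1 = r/beyondthebump.
texts = [
    "second trimester nausea and an ultrasound next week",
    "pregnant and nervous about the glucose test this week",
    "third trimester hospital bag checklist",
    "pregnant with twins, first ultrasound this week",
    "my two month old will not nap",
    "four month old sleep regression tips",
    "LO is six months old and starting solids",
    "month old LO refuses the bottle at night",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Vectorizer and classifier are tuned together as one estimator.
pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Illustrative grid; "<step>__<param>" addresses a parameter of a pipeline step.
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__stop_words": [None, "english"],
    "classifier__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```

Putting the vectorizer inside the pipeline matters: the vocabulary is then refit on each training fold, so the cross-validation scores are not inflated by words seen only in the held-out fold. A smaller `C` means stronger regularization, which is one lever against overfitting.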

By fine-tuning the hyperparameters and implementing appropriate transformers, the models were optimized to deliver better accuracy and reduce overfitting. This iterative process ensured that the models could effectively capture the distinguishing features and patterns within the subreddit posts, enabling more accurate classification.

Through these efforts, the goal was to develop robust and reliable models that could correctly identify the subreddit based on the content of the posts, ultimately supporting data-driven marketing campaigns and decision-making processes.

Model Evaluation

Among the models built, the RandomForest Classifier exhibited the highest degree of overfitting, even after hyperparameter tuning. The SVC model demonstrated relatively less overfitting compared to RandomForest. The MNB model achieved a reasonable balance between bias and variance, although its testing scores were the lowest among the models. The LogisticRegression model emerged as the best-performing model, with training and testing scores being almost identical.

The final LogisticRegression model attained a training score and a testing score of 85.5%. The AUC score, which measures how well the model's predicted probabilities separate the two classes (1.0 is perfect separation, 0.5 is no better than chance), was 0.92. In general, the model was better at identifying posts from r/beyondthebump than posts from r/BabyBumps: the recall for r/beyondthebump was 0.90, while for r/BabyBumps it was 0.79. The model's F1 score was 0.817.
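For readers who want to see how per-class recall, precision, and F1 follow from a confusion matrix, here is a small sketch. The function names are illustrative, and any counts passed in are the caller's own. The project's actual confusion matrix is not reproduced here.

```python
def per_class_metrics(confusion):
    """Compute precision, recall, and F1 for every class.

    confusion[i][j] is the number of posts whose true class is i and
    whose predicted class is j.
    """
    n_classes = len(confusion)
    metrics = {}
    for k in range(n_classes):
        true_positives = confusion[k][k]
        actual = sum(confusion[k])  # all posts truly in class k
        predicted = sum(confusion[i][k] for i in range(n_classes))
        recall = true_positives / actual if actual else 0.0
        precision = true_positives / predicted if predicted else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics[k] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def accuracy(confusion):
    """Share of all posts that were classified correctly."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total if total else 0.0
```

With made-up counts such as `[[40, 10], [5, 45]]`, class 0 has a recall of 40/50 and class 1 a recall of 45/50, which is the kind of asymmetry between the two subreddits described above.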

To gain further insights, the coefficients of the LogisticRegression model were analyzed to identify the top predictor words for each subreddit. The top three predictor words for r/BabyBumps were 'trimester', 'pregnant', and 'week'. The top three predictor words for r/beyondthebump were 'month', 'old', and 'LO' (most likely short for 'little one', a common way of referring to a baby). These words show which topics distinguish each subreddit and can inform targeted marketing and content creation.
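Ranking words by coefficient can be sketched as below. In a binary logistic regression, the most negative coefficients push a prediction toward the first class and the most positive ones toward the second. The helper name is illustrative, and the example values in the usage note are invented, not the model's fitted coefficients.

```python
def top_predictor_words(feature_names, coefficients, k=3):
    """Return (words for the first class, words for the second class).

    The first list holds the k words with the most negative coefficients,
    the second the k words with the most positive ones, strongest first.
    """
    ranked = sorted(zip(coefficients, feature_names))
    first_class = [word for _, word in ranked[:k]]
    second_class = [word for _, word in reversed(ranked[-k:])]
    return first_class, second_class
```

With a fitted scikit-learn pipeline, `feature_names` would typically come from the vectorizer's `get_feature_names_out()` and `coefficients` from the classifier's `coef_[0]`. For invented inputs such as words `["trimester", "month", "the"]` with coefficients `[-2.1, 1.9, 0.01]` and `k=1`, the function returns `(["trimester"], ["month"])`.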

Conclusions

In conclusion, the project aimed to design a data-driven marketing campaign for a baby food startup by leveraging the popular social media platform Reddit. Through extensive data scraping, cleaning, and analysis, valuable insights were gained to guide the marketing team in their campaign strategy.

Key findings and conclusions include:

Reddit's popularity: Reddit consistently ranks among the top 10 social media sites in the USA and holds a significant position in global internet traffic and engagement. This makes it an attractive platform for marketing campaigns.
Authenticity and community engagement: Redditors highly value authenticity, necessitating a genuine connection with the community. Understanding relevant topics, interests, and subreddits' lingo is crucial for effective communication and engagement.
Data mining and NLP: Reddit posts serve as a valuable resource for data mining. Natural Language Processing (NLP) techniques, such as sentiment analysis, named entity recognition, and word frequency analysis, provide insights into community sentiment, prevailing trends, and commonly used terms.
Model building and evaluation: Various classification models, including LogisticRegression, MultinomialNB, RandomForest, and SVC, were built to identify the subreddit of a post. The LogisticRegression model stood out as the best-performing model, with balanced bias-variance tradeoff and high accuracy.
Top predictor words: The LogisticRegression model's coefficients revealed the top predictor words for each subreddit, such as 'trimester', 'pregnant', and 'week' for r/BabyBumps, and 'month', 'old', and 'LO' for r/beyondthebump. These words provide valuable insights into the topics and discussions prevalent in each subreddit.

These conclusions provide a foundation for the data-driven marketing campaign, enabling the marketing team to target the right audience, create relevant content, and engage with the Reddit community effectively. By leveraging the insights gained from this project, the baby food startup can optimize their marketing strategy and increase their chances of success on Reddit.

Driving Effective Ad Campaigns: Utilizing Text Classification for Enhanced Targeting
