We try to answer the classic question of our era with topic modeling, sentiment analysis, and some good ol’ stats.
Published Jul 29, 2021 · 8 min read
It’s the dataset that’s been calling my name. I’m not quite sure what that says about me.
Ever since I heard of it, I’ve wanted to explore a dataset of Reddit posts from a well-known subreddit called “Am I the Asshole?” (AITA). There, users post stories of conflicts in which they weren’t sure if they did the right thing or were instead the, um, asshole. Other users comment and vote with their judgment: You’re the Asshole (YTA), Not the Asshole (NTA), Everyone Sucks Here (ESH), No Assholes Here (NAH), or Not Enough Info (INFO).
The dataset contains the text of over 97,000 posts, plus the voting outcome and the number of comments for each. In only about 27% of the cases, users rendered a judgment of either YTA or ESH, which means almost three-quarters of the cases were judged to contain no assholery. That’s actually reassuring about human nature and our tendency to worry about doing the right thing.
Though the assholes turned out to be the minority, we can dig further into this rich dataset of complicated human situations. It’s a lot of text, but we have the necessary tools in Alteryx Designer and the Text Mining palette from the Alteryx Intelligence Suite.
I decided to use those tools and the Data Investigation tool palette to explore interesting patterns in the AITA posts. Enjoy this slightly rude refresher on sentiment analysis, topic modeling and correlations. Maybe we’ll gain more insight into human behavior along the way.
Some AITA post titles and the judgments
AITA…
for wiping my dog’s drool back on him when he licks my arm? (NTA)
for remarking on a sriracha bottle that expired in 2013? (YTA)
for only wanting to give my Secret Santa giftee stuff for their cat? (NTA)
for putting all the moldy dirty dishes and garbage from my roommate in her bathroom? (NTA)
for getting upset at this game of Monopoly? (YTA)
for hiding candy in the store so I can buy it when it’s on sale? (YTA)
The dataset was pretty clean (in the data sense of the word, anyway), so I just tidied up some small text formatting issues and created a new variable for the length of the original post. I thought it would be interesting to see if the length of a post — the complexity of a situation and/or the degree to which someone felt they had to explain themselves — would correlate with the other variables.
Before doing any other processing on the text, I used the Sentiment Analysis Tool to assess the positive, neutral or negative valence, or emotional weight, of the title and body of each post. VADER, the algorithm behind this tool, is designed to work well even on text that contains NSFW words, emojis, exaggerated punctuation!!! and other oddities in social media content. All of those should be left intact for sentiment analysis.
However, prior to topic modeling, I prepared the text a bit more. The Text Pre-processing Tool took care of that big task. (Read all about it in parts one and two of our posts on text normalization.) This tool is based on the Python NLP library spaCy, and it will normalize and filter the text. It does one weird thing: It replaces pronouns with the notation -PRON-. If you’ve spent any time on the internet, you might suspect that spaCy is referring to something other than pronouns, but the notation is simply its placeholder for any pronoun in the text. I removed all of those notations from the titles and from the processed post text with a REGEX_Replace function in a Formula Tool.
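For readers working outside Designer, the placeholder cleanup can be sketched in plain Python. This is only a rough analogue of the REGEX_Replace step, not the workflow's actual expression; the function name `strip_pron` and the sample string are made up for illustration.

```python
import re


def strip_pron(text: str) -> str:
    """Remove -PRON- placeholders and collapse the whitespace left behind."""
    without_pron = re.sub(r"-PRON-", " ", text)
    return re.sub(r"\s+", " ", without_pron).strip()


# Hypothetical lemmatized text in which pronouns became -PRON-.
processed = "-PRON- tell -PRON- roommate to wash -PRON- dish"
print(strip_pron(processed))
```

Collapsing whitespace after the substitution keeps the tokens evenly spaced, which matters if a later step splits the text on single spaces.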
I then added the Topic Modeling Tool to the workflow and configured it to identify three topics in the posts. The resulting visualization was pretty easy to interpret; check out the GIF below to see the main topics that emerged.
Based on the lists of salient words for each topic and knowing the AITA context, the three topics could be said to represent “family issues,” “romantic/friend relationship conflicts” and “work/job problems.” The three topics are nicely separated in the Intertopic Distance Map, and the lists of words characterizing each topic make sense. The Topic Modeling Tool also adds a score for each topic to each post in the dataset, reflecting the degree to which that topic appears in the post.
It’s awesome to quickly find the major themes in more than 97,000 posts, plus analyze the sentiment within them. But did those themes and sentiment levels connect to the AITA judgments passed by users? To find out, I broke out the Data Investigation tool palette to see what we could find about patterns in these posts and the responses.
The Contingency Table Tool makes it easy to compare categorical variables and see how their values coincide. It’s a great way to look more closely at the sentiment analysis results and the AITA judgments. We can compare the positive or negative sentiment of the titles and posts with the “is_asshole” variable provided in the dataset. (The is_asshole variable is 0 if the final vote was Not the Asshole, No Assholes Here, or Not Enough Info, and 1 if the result was You’re the Asshole or Everyone Sucks Here.)
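To make the comparison concrete, here is a minimal Python sketch of the same idea: derive the 0/1 flag from the final verdict as described above, then count how often each sentiment label co-occurs with each flag value. The function names and the sample rows are invented for illustration; the dataset already ships with is_asshole, so the mapping is shown only to spell out the definition.

```python
from collections import Counter

# Verdicts that count as an "asshole" outcome, per the definition above.
ASSHOLE_VERDICTS = {"YTA", "ESH"}


def is_asshole(verdict: str) -> int:
    """Return 1 for YTA or ESH, and 0 for NTA, NAH, or INFO."""
    return 1 if verdict.upper() in ASSHOLE_VERDICTS else 0


def contingency_table(rows):
    """Count (sentiment label, is_asshole) pairs from (sentiment, verdict) rows."""
    return Counter((sentiment, is_asshole(verdict)) for sentiment, verdict in rows)


# Made-up rows: (title sentiment label, final verdict).
sample = [
    ("positive", "YTA"),
    ("negative", "NTA"),
    ("positive", "NTA"),
    ("negative", "ESH"),
    ("neutral", "NAH"),
    ("positive", "YTA"),
]

table = contingency_table(sample)
for sentiment in ("negative", "neutral", "positive"):
    # Columns: sentiment, count with is_asshole == 0, count with is_asshole == 1.
    print(sentiment, table[(sentiment, 0)], table[(sentiment, 1)])
```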
Maybe surprisingly, in terms of quantity, there wasn’t much of a difference between the emotional valence of the titles and posts that were judged to contain assholery and those that weren’t. Positive posts were actually judged YTA or ESH slightly more than negative posts.
Digging in a little deeper with the Association Analysis Tool, we can check out the correlations between our sentiment valence scores, topic scoring, and the post length variable I added. I chose the “Target a field for more detailed analysis” option to get p-values for these variables’ relationship with the “is_asshole” variable.
Here we see, somewhat surprisingly, that while negative sentiment in titles and posts didn’t have a significant correlation with assholery, positive sentiment in titles and posts did. So being positive about a situation maybe makes it more likely that YTA, or at least that you’ll be judged as such.
Of course, Pearson correlations measure linear relationships between variables; we can also try the Spearman Correlation Tool, whose calculation works on the ranks of the values and so doesn’t assume a linear relationship. As with Pearson correlations, values closer to -1 or 1 suggest a stronger negative or positive relationship, respectively.
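The difference between the two coefficients is easy to see in a small standard-library sketch. This is a textbook implementation for illustration, not the code behind the Alteryx tools: Spearman is computed here as the Pearson correlation of the ranks, with tied values sharing their average rank. The toy data are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: y rises with x, but not along a straight line.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print("Pearson:", pearson(x, y))
print("Spearman:", spearman(x, y))
```

On these toy values the rank-based coefficient is a perfect 1 because y always increases with x, while the Pearson value falls short of 1 because the relationship is curved.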
The Spearman correlation between positivity of post titles and is_asshole is 0.31. The more positive the title, the more likely the judgment of assholery. (With this dataset, we have to be a little skeptical; for example, one post title with high positive valence is “best friend party potty fiasco.” VADER might be thrown off a bit by the happy sound of “best friend” and “party,” but not pick up on the concerning last two words in that title.)
The Spearman correlation between positivity of posts and is_asshole is only 0.04, so titles may matter more in setting voters’ expectations (though we can’t assume there’s a causal relationship).
Enough about feelings; which topics seem to involve the most assholery? Do people tend to be judged as assholes more when they share family, romance/friendship or work situations? We can look at the correlations above for this comparison, but it’s also possible to look at these as categories. I identified which of the three topics scored highest for each post, and then compared how the topics were judged across the board. Another Contingency Table Tool revealed the comparison below.
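As a rough sketch of that categorical step, the Python below picks the highest-scoring topic for each post and then computes the share of posts flagged is_asshole within each topic. The topic labels follow the interpretation given earlier, but their order, the function names, and the sample scores are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumed label order; the real topic numbering in the workflow may differ.
TOPIC_NAMES = ("family", "romance/friendship", "work")


def dominant_topic(scores):
    """Return the label of the topic with the highest score for one post."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return TOPIC_NAMES[best_index]


def asshole_rate_by_topic(posts):
    """Share of posts with is_asshole == 1, grouped by dominant topic.

    posts is an iterable of (topic_scores, is_asshole) pairs.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for scores, label in posts:
        topic = dominant_topic(scores)
        totals[topic] += 1
        flagged[topic] += label
    return {topic: flagged[topic] / totals[topic] for topic in totals}


# Made-up posts: (scores for the three topics, is_asshole flag).
sample = [
    ((0.7, 0.2, 0.1), 0),
    ((0.6, 0.3, 0.1), 1),
    ((0.1, 0.8, 0.1), 1),
    ((0.2, 0.5, 0.3), 0),
    ((0.1, 0.2, 0.7), 0),
]

rates = asshole_rate_by_topic(sample)
for topic, rate in rates.items():
    print(topic, rate)
```

Reducing each post to a single dominant topic discards the mixed-topic information in the scores, so it complements the correlation view rather than replacing it.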
It turns out that bad behavior is pretty evenly distributed in our lives, at least according to these scenarios and judges. The Reddit voters were slightly more lenient toward family and work situations and judged romance/friendship issues somewhat more harshly, but the proportions aren’t all that different.
If you’re curious about whether YTA and want to submit your dilemma to the AITA voters, what will get people to upvote or comment on your post? The “score” variable in this dataset represents the net votes a post received (upvotes minus downvotes), and it’s naturally highly correlated with the number of comments (Pearson correlation of 0.83). Overall, there was only a mild correlation between judgments of YTA or ESH and the number of comments on the post, and very little correlation with the score.
Turns out, if you dish about your family (“topic 3” in the results above) in your post or at least write a lot, people may be slightly more likely to engage with it. But don’t write a positive-sounding title, as positivity in titles was slightly negatively correlated with comments and the score.
This analysis of the AITA posts shows how it’s possible to quickly distill a lot of unstructured text information into topical and emotional insights that can be analyzed in many different ways. This kind of approach could be used on your social media content, product reviews, survey responses and many other kinds of text data, and integrated into predictive models as well. Whatever your project, I hope you find that the assholes are in the minority in your data, too.
Recommended Reading
- Tokenization and Filtering Stopwords with the Text Pre-Processing Tool
- Our series of posts on topic modeling, starting with Getting to the Point with Topic Modeling | Part 1 — What is LDA?
- Ho, Ho … Ow! Identifying Holiday Hazards with Topic Modeling
- More on the Pearson Product-Moment Correlation (aka the Pearson correlation) and Spearman’s Rank-Order Correlation
Originally published on the Alteryx Community Data Science Blog.