• Home
  • About
  • Resume
  • Case studies
    • Government Chatbot
    • Kipsigis Living Dictionary
    • Inclusive Language Guide
    • User attitudes on Reddit
  • Contact

User Attitudes on Reddit

Using computational tools to understand how Reddit users view different types of publishing
Attitudinal research | Computational text analysis

Project overview

In this project, completed as part of a 6-week data science for social justice workshop at UC Berkeley, I explored user attitudes towards different types of publishing within the r/publishing community on Reddit. Through this project, I gained experience with a range of tools for computational text analysis, which can be used to understand and analyze user sentiment on a large scale.

Question

How do Reddit users in the r/publishing community view different types of publishing, such as traditional publishing with an established publishing house or self-publishing? How can computational tools help us understand user attitudes on a large scale?

Outcome

Using qualitative and quantitative research methods, I learned that users in r/publishing hold contradictory views on different forms of publishing. By some metrics, users expressed serious qualms about all types of publishing, while by other metrics, user sentiment reflected a cline, with small, independent publishing houses being favored and self-publishing being strongly dispreferred.

Methods

  • Qualitative text analysis
  • Exploratory data analysis
  • Computational text analysis

Tools

  • Python packages for data management and text analysis (pandas, NumPy, spaCy, scikit-learn, gensim, nltk)
  • MAXQDA

Team

  • 4 researchers
  • 1 project manager

My role

  • By exploring the data set qualitatively, I identified themes across texts and homed in on my specific research question.
  • I processed the data set, using a relatively standard Python pipeline to convert the raw data into a usable format.
  • I performed qualitative and quantitative analysis of the data set using various computational tools in Python and MAXQDA.
  • I reported my research findings in a digital essay that wove together narrative explanation and code.

My process

1. Explore
2. Process
3. Analyze
4. Synthesize
5. Report

Explore

This project was built on data from the r/publishing community on Reddit, which is an online community dedicated to publishing. The dataset contained text from over 7,000 posts and nearly 20,000 comments. At the start of the project, my teammates and I familiarized ourselves with the dataset by reading and annotating as many posts and comments as possible to understand who the users of r/publishing were and what they were most interested in. During this phase of the project, we used MAXQDA to organize our collaborative annotations and to code posts and comments, making them easier to sort and find at later stages of the project.

While exploring the dataset, I noticed that much of the discussion centered around different types of publishing, and three main categories emerged:


  • Traditional publishing with a large, established publishing house
  • Publishing with a small, independent publishing house
  • Self-publishing

Reddit users in r/publishing had diverse and often strongly held opinions about these different kinds of publishing. For instance, some users expressed extreme frustration with the traditional model of publishing, while others showed a similar attitude towards self-publishing. This observation led me to apply computational tools to understand user attitudes towards different types of publishing on a large scale.

"Self-published books and authors get more respect than they used to, but there's still a stigma attached to it." (r/publishing user)

"Skip traditional publishers, they are dinosaurs. You need to get with newer publishing models." (r/publishing user)

Process

With this research question in mind, I processed the raw text data to convert it into a format that was usable for computational text analysis. My processing pipeline was standard for computational text analysis and can be applied to other comparable datasets.

First, using pandas in Python, I removed posts and comments whose content had been deleted and explored the shape of the dataset in more detail, looking at how many unique users there were and how frequently they interacted with the r/publishing community. I found that there were posts from almost 2,000 unique users and comments from over 4,000 unique users, but that individuals in this community interacted with it relatively infrequently.
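
The filtering and user counts described above could be sketched roughly as follows. This is a minimal illustration with pandas, not the project's actual code: the column names (`author`, `body`), the toy rows, and the `[deleted]`/`[removed]` placeholder strings are all assumptions about how the Reddit export was structured.

```python
import pandas as pd

# Toy stand-in for the comments table; column names are assumed.
comments = pd.DataFrame(
    {
        "author": ["ann", "bob", "ann", "cy", "dee", "bob"],
        "body": [
            "Try a small press.",
            "[deleted]",
            "Query agents first.",
            "[removed]",
            None,
            "Self-pub worked for me.",
        ],
    }
)

# Placeholder strings that mark missing content (an assumption for this sketch).
PLACEHOLDERS = ["[deleted]", "[removed]"]

# Keep only rows that still have real text.
has_text = comments["body"].notna() & ~comments["body"].isin(PLACEHOLDERS)
clean = comments[has_text].reset_index(drop=True)

# How many unique users remain, and how often does each one comment?
n_users = clean["author"].nunique()
comments_per_user = clean["author"].value_counts()

print(len(clean), n_users)
print(comments_per_user.describe())
```

Summarizing `comments_per_user` (for example with `describe()`) is one way to see whether most users interact only a handful of times.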


Then, turning to the content of the posts and comments, I used spaCy in Python to lemmatize the text, reducing each word to its uninflected base form, and removed punctuation along the way. While this process led to a loss of richness in the data, the lemmatized text was more usable for computational text analysis. Finally, I created bigrams and trigrams to understand which two- and three-word combinations occurred together most frequently in the dataset. This process highlighted key combinations that I used throughout my analysis like 'self_pub', 'small_press', and 'penguin_random_house', which all reflected different types of publishing.
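
To illustrate how tokens like 'self_pub' and 'small_press' can arise, here is a simplified sketch of bigram merging in plain Python. It assumes the text has already been lemmatized into token lists, and it merges pairs by raw frequency alone. The write-up does not say which tool built the n-grams; dedicated tools such as gensim's `Phrases` score candidate pairs statistically rather than by a simple count, so treat the function name `merge_frequent_bigrams`, the `min_count` cutoff, and the toy comments as hypothetical.

```python
from collections import Counter

def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times with '_'."""
    pair_counts = Counter(
        pair for tokens in docs for pair in zip(tokens, tokens[1:])
    )
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for tokens in docs:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                merged.append(f"{tokens[i]}_{tokens[i + 1]}")
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(tokens[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs

# Hypothetical lemmatized comments (lowercased, punctuation already dropped).
docs = [
    ["small", "press", "pay", "little"],
    ["try", "small", "press", "first"],
    ["self", "pub", "be", "risky"],
    ["self", "pub", "take", "time"],
]

merged = merge_frequent_bigrams(docs)
print(merged[0])
```

Running the same pass a second time over the merged tokens is one way to obtain three-word combinations, since an already merged pair can then join with a neighboring word.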

Analyze

To analyze user attitudes towards the different types of publishing, I created word embedding models and performed sentiment analysis in Python. Word embedding models provide a quantitative metric of how closely linked different words are within a given dataset, which can be used to infer the associations that users have with certain expressions. For instance, one word embedding model revealed that the terms most closely associated with self-publishing and traditional publishing were highly negatively charged (words like "gamble", "risky", and "waste_time"), which suggested that users had a negative view of both types of publishing.

[Figure: Terms associated with self-publishing]
[Figure: Terms associated with traditional publishing]

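
To make "how closely linked" more concrete: embedding models represent each term as a vector, and closeness between terms is typically measured with cosine similarity. The sketch below uses invented three-dimensional vectors rather than a trained model, so the numbers are purely illustrative and are not results from this project. The helper names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(term, vectors, topn=2):
    """Rank every other term by cosine similarity to `term`."""
    scores = [
        (other, cosine_similarity(vectors[term], vec))
        for other, vec in vectors.items()
        if other != term
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors; a trained model would learn these from the corpus.
vectors = {
    "self_pub": [0.9, 0.1, 0.3],
    "gamble": [0.8, 0.2, 0.4],
    "risky": [0.7, 0.0, 0.5],
    "recipe": [0.0, 1.0, 0.1],
}

for word, score in most_similar("self_pub", vectors):
    print(f"{word}: {score:.2f}")
```

If the embeddings were trained with gensim's Word2Vec (an assumption, since the write-up names gensim only as one of several tools), a comparable ranking is available through `model.wv.most_similar(...)`.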
To explore this idea further, I applied sentiment analysis to the dataset. Sentiment analysis is a tool for computationally identifying the opinions expressed in a text to determine whether the user's attitude is positive, negative, or neutral. I determined which type of publishing each comment was about using the proxy terms 'self_pub' and 'traditional'. Then, I used VADER to calculate a sentiment score for each comment and classified the scores as positive, neutral, or negative. This analysis revealed that the majority of comments about both types of publishing were positive, but that there were almost twice as many negative comments about self-publishing as about traditional publishing.
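
The tagging and classification steps above might look something like the following sketch. VADER reports a compound score between -1 and 1 for each text; the ±0.05 cutoffs used here are the ones suggested in VADER's documentation, but the write-up does not state which thresholds were actually applied, so treat them as an assumption. The compound scores and tokenized comments below are invented for illustration and are not VADER output.

```python
from collections import Counter

# Proxy terms from the text; mapping them to labels is this sketch's assumption.
PROXY_TERMS = {"self_pub": "self-publishing", "traditional": "traditional publishing"}

def publishing_types(tokens):
    """Return the publishing types a tokenized comment mentions."""
    return {label for term, label in PROXY_TERMS.items() if term in tokens}

def classify(compound, threshold=0.05):
    """Bucket a compound score in [-1, 1] into positive/neutral/negative."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical (tokens, compound score) pairs; the scores are made up.
scored_comments = [
    (["self_pub", "be", "great"], 0.62),
    (["self_pub", "waste_time"], -0.42),
    (["traditional", "deal", "be", "fine"], 0.20),
    (["traditional", "contract"], 0.0),
]

# Count comments per (publishing type, sentiment class).
tally = Counter()
for tokens, compound in scored_comments:
    for pub_type in publishing_types(tokens):
        tally[(pub_type, classify(compound))] += 1

for key, count in sorted(tally.items()):
    print(key, count)
```

With NLTK installed and the VADER lexicon downloaded, real compound scores can be obtained from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` in `nltk.sentiment.vader`.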

Synthesize

Together, the results of the word embedding models and the sentiment analysis painted a contradictory picture. According to the word embedding models, all forms of publishing were quite negatively appraised, yet the sentiment analysis suggested that all forms of publishing were discussed in a generally positive light, with more negative discussion surrounding self-publishing than traditional publishing. To better understand this situation, I performed additional qualitative analysis on the original posts and comments.

In revisiting the posts and comments, it became clear that the contradictory results of the computational analysis spoke to a contradiction that many users in r/publishing face: while they know that publishing in any form can be difficult, risky, and even exploitative, many of them are still trying to break into the field and want to become published writers. In this way, the use of qualitative and quantitative research methods on this project was essential to developing a holistic understanding of the attitudes of Reddit users in r/publishing.

Report

To report my research findings, I wrote a digital essay that wove together a narrative explanation of my research question and methods with the code that I used to process the dataset and perform the computational analyses. By presenting my rationale and code together, I made my research process as transparent as possible, allowing other researchers to verify my results or use my work as a starting place for further research in this area.

Next steps

To gain a more nuanced understanding of the Reddit users in r/publishing, I have two future research steps in mind.

1. I would like to run a diary-style study of users within the r/publishing community to understand how the sentiment of specific users changes over the course of their time in the group. Does exposure to the discussion in r/publishing lead users to have a more positive or negative view of publishing? What longitudinal patterns emerge within and across users?
2. I also want to take a deeper qualitative look at the reasons behind the positive and negative sentiments identified through my research. What aspects of the publishing industry are people frustrated with? What about the industry excites them? Understanding the answers to these questions could help people already working in publishing improve the field.

Skills

On this project, I gained experience performing computational text analysis with a range of Python packages. I also became familiar with tools like MAXQDA for mixed-methods research, and I saw first-hand the importance of combining qualitative and quantitative research to develop the clearest understanding of a large user group. I collaborated with a team of researchers from biology, public policy, and computer science and learned from them, while also providing my perspective as a linguist on the texts that we analyzed.
Thank you for reading my case study!

Want to work with me? Feel free to contact me.