Project overview
In this project, completed as part of a 6-week data science for social justice workshop at UC Berkeley, I explored user attitudes towards different types of publishing within the r/publishing community on Reddit. Through this project, I gained experience with a range of tools for computational text analysis, which can be used to understand and analyze user sentiment on a large scale.
Question
How do Reddit users in the r/publishing community view different types of publishing, such as traditional publishing with an established publishing house or self-publishing? How can computational tools help us understand user attitudes on a large scale?
Outcome
Using qualitative and quantitative research methods, I learned that users in r/publishing hold contradictory views on different forms of publishing. By some metrics, users expressed serious qualms about all types of publishing, while by other metrics, user sentiment reflected a cline, with small, independent publishing houses favored and self-publishing strongly dispreferred.
Methods
Tools
Team
My role
- By exploring the data set qualitatively, I identified themes across texts and homed in on my specific research question.
- I processed the data set, using a relatively standard Python pipeline to convert the raw data into a usable format.
- I performed qualitative and quantitative analysis of the data set using various computational tools in Python and MAXQDA.
- I reported my research findings in a digital essay that wove together narrative explanation and code.
My process
This project was built on data from the r/publishing community on Reddit, which is an online community dedicated to publishing. The dataset contained text from over 7,000 posts and nearly 20,000 comments. At the start of the project, my teammates and I familiarized ourselves with the dataset by reading and annotating as many posts and comments as possible to understand who the users of r/publishing were and what they were most interested in. During this phase of the project, we used MAXQDA to organize our collaborative annotations and to code posts and comments, making them easier to sort and find at later stages of the project.
While exploring the dataset, I noticed that much of the discussion centered around different types of publishing, and three main categories emerged:
- Traditional publishing with a large, established publishing house
- Publishing with a small, independent publishing house
- Self-publishing
Reddit users in r/publishing had diverse and often strongly held opinions about these different kinds of publishing. For instance, some users expressed extreme frustration with the traditional model of publishing, while others showed a similar attitude towards self-publishing. This observation led me to apply computational tools to understand user attitudes towards different types of publishing on a large scale.
"Self-published books and authors get more respect than they used to, but there's still a stigma attached to it."
"Skip traditional publishers, they are dinosaurs. You need to get with newer publishing models."
With this research question in mind, I processed the raw text data to convert it into a format that was usable for computational text analysis. My processing pipeline was standard for computational text analysis and can be applied to other comparable datasets.
First, using Pandas in Python, I removed posts and comments whose content had been deleted and explored the shape of the dataset in more detail, looking at how many unique users there were and how frequently they interacted with the r/publishing community. I found that there were posts from almost 2,000 unique users and comments from over 4,000 unique users, but that individuals in this community interacted with it relatively infrequently.
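As a rough illustration of this cleanup step (a minimal sketch, not the project's actual code), the snippet below drops deleted content and summarizes user activity with Pandas. The column names `author` and `body`, the sample rows, and the helper `drop_deleted` are all assumptions made for the example; `[deleted]` and `[removed]` are the placeholders Reddit commonly leaves in place of missing text.

```python
import pandas as pd

# Hypothetical sample data; the column names "author" and "body" are assumptions.
comments = pd.DataFrame({
    "author": ["ana", "ben", "ana", "cy", "dee", "eli"],
    "body": [
        "Try a small press.",
        "[deleted]",
        "Query agents first.",
        "[removed]",
        None,
        "Self-pub worked for me.",
    ],
})

# Placeholders Reddit commonly leaves where text was deleted or removed.
DELETED_MARKERS = {"[deleted]", "[removed]"}


def drop_deleted(df, text_col="body"):
    """Keep only rows whose text is present and is not a deletion placeholder."""
    has_text = df[text_col].notna() & ~df[text_col].isin(DELETED_MARKERS)
    return df[has_text].reset_index(drop=True)


clean = drop_deleted(comments)

# Shape of the cleaned data: unique users and how often each one comments.
n_users = clean["author"].nunique()
per_user = clean["author"].value_counts()

print(clean.shape)
print(n_users)
print(per_user.describe())
```

Looking at the distribution of `per_user` (for example, its median) is one way to see whether most users interact with the community only occasionally.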
Then, turning to the content of the posts and comments, I used spaCy in Python to strip punctuation and lemmatize the text, reducing each word to its uninflected base form. While this process led to a loss of richness in the data, the lemmatized text was more usable for computational text analysis. Finally, I created bigrams and trigrams to understand which two- and three-word combinations occurred together most frequently in the dataset. This process highlighted key combinations that I used throughout my analysis, such as 'self_pub', 'small_press', and 'penguin_random_house', which all reflected different types of publishing.
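The n-gram step can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the project's pipeline: it assumes the text has already been lemmatized into lists of tokens, it counts every adjacent word sequence rather than applying a statistical threshold, and the sample `docs` and the helper `ngrams` are invented for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Join each run of n adjacent tokens with underscores, e.g. 'small_press'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical lemmatized documents (one token list per post or comment).
docs = [
    ["self", "pub", "be", "a", "gamble"],
    ["small", "press", "pay", "less", "than", "penguin", "random", "house"],
    ["self", "pub", "or", "small", "press"],
]

# Count how often each two- and three-word sequence occurs across all documents.
bigram_counts = Counter(g for doc in docs for g in ngrams(doc, 2))
trigram_counts = Counter(g for doc in docs for g in ngrams(doc, 3))

print(bigram_counts.most_common(3))
print(trigram_counts["penguin_random_house"])
```

Raw counts like these surface many uninformative pairs on real data, so a phrase-detection tool that scores how strongly words co-occur is usually a better fit at scale.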
To analyze user attitudes towards the different types of publishing, I created word embedding models and performed sentiment analysis in Python. Word embedding models provide a quantitative metric of how closely linked different words are within a given dataset, which can be used to infer the associations that users have with certain expressions. For instance, one word embedding model revealed that the terms most closely associated with self-publishing and traditional publishing were highly negatively charged (words like "gamble", "risky", and "waste_time"), which suggested that users had a negative view of both types of publishing.
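To make "how closely linked" concrete: embedding models represent each term as a vector, and closeness between two terms is commonly measured with cosine similarity. The sketch below shows that idea only. The three-dimensional vectors are made up for illustration (real models learn vectors with many more dimensions from the data), and `cosine` and `nearest` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity: 1.0 for vectors pointing the same way, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up vectors for illustration only; these are not values from the project.
vectors = {
    "self_pub": [0.9, 0.1, 0.2],
    "gamble": [0.8, 0.2, 0.1],
    "risky": [0.7, 0.3, 0.2],
    "editor": [0.1, 0.9, 0.3],
}


def nearest(term, vectors, k=2):
    """Rank every other term by its cosine similarity to `term`, highest first."""
    target = vectors[term]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != term]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


for other, score in nearest("self_pub", vectors):
    print(other, round(score, 3))
```

With these invented vectors, "gamble" and "risky" rank above "editor" for "self_pub", which mirrors the kind of association the model surfaced.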
To explore this idea further, I ran sentiment analysis on the dataset. Sentiment analysis computationally identifies the opinions expressed in a text to determine whether the writer's attitude is positive, negative, or neutral. I determined which type of publishing each comment was about using the proxy terms 'self_pub' and 'traditional'. Then, I used VADER to calculate a sentiment score for each comment and classified the scores as positive, neutral, or negative. This analysis revealed that the majority of comments about both types of publishing were positive, but that there were almost twice as many negative comments about self-publishing as about traditional publishing.
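The classification step can be sketched as follows. This is an illustration under assumptions rather than the project's code: it takes a compound sentiment score per comment as given (VADER reports a compound score between -1 and 1), it uses the ±0.05 cutoffs that VADER's documentation suggests as a convention, since the thresholds actually used in the project are not stated, and the sample scores are invented rather than real VADER output.

```python
from collections import Counter

# Proxy terms from the text, mapped to the type of publishing they stand for.
PROXY_TERMS = {"self-publishing": "self_pub", "traditional publishing": "traditional"}


def classify(compound, threshold=0.05):
    """Bucket a compound score; the 0.05 cutoff is a convention, assumed here."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def topics(tokens):
    """Return the publishing types whose proxy term appears in the comment."""
    return [label for label, term in PROXY_TERMS.items() if term in tokens]


def tally(scored_comments):
    """Count positive, neutral, and negative comments per publishing type."""
    counts = {label: Counter() for label in PROXY_TERMS}
    for tokens, compound in scored_comments:
        for label in topics(tokens):
            counts[label][classify(compound)] += 1
    return counts


# Hypothetical (tokens, compound score) pairs; the scores are invented.
sample = [
    (["self_pub", "be", "a", "gamble"], -0.42),
    (["traditional", "deal", "pay", "well"], 0.61),
    (["self_pub", "give", "control"], 0.30),
    (["traditional", "house", "exist"], 0.0),
]

results = tally(sample)
for label, counter in results.items():
    print(label, dict(counter))
```

Comparing the negative counts across the two labels is the kind of comparison described above; a comment that mentions both proxy terms is counted under both labels in this sketch.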
Together, the results of the word embedding models and the sentiment analysis painted a contradictory picture. According to the word embedding models, all forms of publishing were quite negatively appraised, yet the sentiment analysis suggested that all forms of publishing were discussed in a generally positive light, with more negative discussion surrounding self-publishing than traditional publishing. To better understand this situation, I performed additional qualitative analysis on the original posts and comments.
In revisiting the posts and comments, it became clear that the contradictory results of the computational analysis spoke to a contradiction that many users in r/publishing face: while they know that publishing in any form can be difficult, risky, and even exploitative, many of them are still trying to break into the field and want to become published writers. In this way, the use of qualitative and quantitative research methods on this project was essential to developing a holistic understanding of the attitudes of Reddit users in r/publishing.
To report my research findings, I wrote a digital essay that wove together a narrative explanation of my research question and methods with the code that I used to process the dataset and perform the computational analyses. By presenting my rationale and code together, I made my research process as transparent as possible, allowing other researchers to verify my results or use my work as a starting place for further research in this area.
Next steps
To gain a more nuanced understanding of the Reddit users in r/publishing, I have identified two future research steps.
01
I would like to run a sort of diary study on users within the r/publishing community to understand how the sentiment of specific users changes over the course of their time in the group. Does exposure to the discussion in r/publishing lead users to have a more positive or negative view of publishing? What longitudinal patterns emerge within and across users?
02
I also want to take a deeper qualitative look at the reasons behind the positive and negative sentiments identified through my research. What aspects of the publishing industry are people frustrated with? What about the industry excites them? Understanding the answers to these questions could help people already working in publishing improve the field.
Skills
On this project, I gained experience performing computational text analysis with a range of programs in Python. I also became familiar with tools like MAXQDA for mixed-methods research, and I saw first-hand the importance of combining qualitative and quantitative research to develop the clearest understanding of a large user group. I collaborated with a team of researchers from biology, public policy, and computer science and learned from them, while also providing my perspective as a linguist on the texts that we analyzed.