Over winter break, I’ve been learning to mine Reddit for data that might be used to conduct social science research. Specifically, I’m interested in the content and structure of “subreddit” communities that focus on mental and behavioral health conditions. For example, what might we be able to learn from examining the social structure and level of engagement in different types of emotional support threads?
So far, I’ve gotten the hang of retrieving data with the Python Reddit API Wrapper (PRAW). I’m now testing a few different visualization tools for social network analysis. Below are quick overviews of the three options that I’ve tested so far (NodeXL, NetworkX, and Gephi). These data come from a subreddit that focuses on a particular mental health condition. The central nodes (hubs) are the subreddit moderators, and the peripheral nodes (spokes) are the last ~50 people each of them responded to. What we can see from these visualizations is that the moderators sometimes communicate with each other publicly, but the sets of non-moderators they respond to rarely overlap.
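To make the hub-and-spoke idea concrete, here is a minimal sketch of how reply data could be turned into an edge list and checked for overlap between moderators. The account names and reply pairs are made up for illustration; real pairs would come from comments retrieved through PRAW, which is not shown here.

```python
from collections import defaultdict

# Hypothetical (moderator, recipient) reply pairs, invented for illustration.
replies = [
    ("mod_a", "user_1"),
    ("mod_a", "user_2"),
    ("mod_a", "mod_b"),
    ("mod_b", "user_3"),
    ("mod_b", "user_4"),
    ("mod_b", "mod_a"),
]
moderators = {"mod_a", "mod_b"}

# Map each hub (moderator) to the set of accounts it replied to (its spokes).
spokes = defaultdict(set)
for source, target in replies:
    spokes[source].add(target)

# For each non-moderator, record which moderators replied to them.
audience = defaultdict(set)
for mod, targets in spokes.items():
    for target in targets - moderators:
        audience[target].add(mod)

# Non-moderators who heard from more than one moderator (overlapping spokes).
shared = {user for user, mods in audience.items() if len(mods) > 1}

# Public moderator-to-moderator replies (hub-to-hub edges).
mod_links = {(s, t) for s, t in replies if s in moderators and t in moderators}

print("Shared non-moderators:", sorted(shared))
print("Moderator-to-moderator edges:", sorted(mod_links))
```

In this toy data, `shared` comes out empty while `mod_links` does not, which mirrors the pattern described above: hubs connect to each other, but their spokes stay separate.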
I really like NodeXL for its simplicity in visualizing networks. It is a Microsoft Excel plugin/template, so it will only work if you have Excel installed. The “Basic” (free) version is somewhat limited in the statistics that it provides. The big perks here are (1) it is simple to load and configure data visualizations, and (2) the displays are fairly straightforward and load quickly. This is a great starting point for network analysis.
NetworkX is a Python package for network analysis. I like it because it is convenient to write the commands into the code that I’m already using to collect and clean data. The data can also be exported to other graph formats. This one is going to take some trial and error to get the hang of, as the commands and syntax get a little dense. It should be pretty powerful for running network statistics and useful for getting a basic graphical sense of the data. However, my first graphs appear to be circle-art renditions of lava lamp bubbles.
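As a rough illustration of that workflow, the sketch below builds a small directed graph, pulls a few basic statistics, and exports the result to GEXF, a format Gephi can open. The edge list and the output filename are invented for the example; this is not the code behind the graphs in this post.

```python
import networkx as nx

# Hypothetical reply edges (who replied to whom), invented for illustration.
edges = [
    ("mod_a", "user_1"),
    ("mod_a", "user_2"),
    ("mod_a", "mod_b"),
    ("mod_b", "user_3"),
    ("mod_b", "user_4"),
    ("mod_b", "mod_a"),
]

# A directed graph keeps track of who replied to whom.
G = nx.DiGraph()
G.add_edges_from(edges)

# Basic network statistics.
degree = dict(G.degree())              # in-degree + out-degree per node
centrality = nx.degree_centrality(G)   # degree divided by (n - 1)
density = nx.density(G)                # share of possible directed edges present

print("Nodes:", G.number_of_nodes(), "Edges:", G.number_of_edges())
print("Density:", round(density, 3))
for node in sorted(degree, key=degree.get, reverse=True):
    print(node, degree[node], round(centrality[node], 2))

# Export for use in another tool; the filename is an arbitrary choice.
nx.write_gexf(G, "moderator_network.gexf")
```

Sorting by degree puts the two moderator hubs at the top of the printout, which is a quick sanity check before moving on to any plotting.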
Gephi took some extra effort to set up. I had to update my Java version to get it running. Then, I also discovered that my desktop PC doesn’t have a compatible OpenGL2 display driver. I suppose that’s what I get for using an old server box with an on-board graphics card.
I tried it on a different computer that had a compatible OpenGL2 version. I was able to easily get basic network statistics and, eventually, get the graphics to work also. Some of the graphic displays do take a long time to load (i.e., the algorithms take time to converge), but this seems like a decent program overall. Here’s a nice tutorial of the graphing features.
Each of these approaches has its pluses and minuses. In all likelihood, I’ll keep working with NetworkX, to see how deep that rabbit hole goes. It is nice to have some additional, ready-made options as well, in case I need to graph something in a pinch.
Update: Here’s one more graph that I came up with in Gephi. These data originated from the top 10 posts within each of four subreddits (color coded) about recreational drug use. I looked up the 10 original posters in each subreddit and collected two degrees of separation, following the last 10 people each account commented to. That gives 4 × 10^3 = 4,000 possible nodes. The data yielded 1,605 individual accounts. It took ~82 minutes to collect the data, and the better part of a Saturday evening to figure out what to do with it. I’ll have more details in future posts!
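For anyone curious about the arithmetic and the collection strategy, here is a minimal sketch. The first part reproduces the 4,000 upper bound and compares it with the 1,605 accounts observed. The second part is a toy two-degree "snowball" crawl over a made-up contact dictionary; the `snowball` function and the `toy_contacts` data are hypothetical stand-ins, not the actual collection code.

```python
# Upper bound on accounts reached, using the numbers from the post.
SUBREDDITS = 4
POSTERS_PER_SUBREDDIT = 10
CONTACTS_PER_ACCOUNT = 10
DEGREES = 2

upper_bound = SUBREDDITS * POSTERS_PER_SUBREDDIT * CONTACTS_PER_ACCOUNT ** DEGREES
observed = 1605
share = observed / upper_bound
print(f"{observed} of {upper_bound} possible nodes ({share:.1%})")


def snowball(seeds, recent_contacts, degrees=2, limit=10):
    """Follow each account's most recent contacts out to `degrees` steps.

    `recent_contacts` maps an account to a list of accounts it commented to
    (a hypothetical stand-in for data pulled from Reddit).
    """
    seen = set(seeds)
    frontier = list(seeds)
    edges = []
    for _ in range(degrees):
        next_frontier = []
        for account in frontier:
            for contact in recent_contacts.get(account, [])[:limit]:
                edges.append((account, contact))
                if contact not in seen:  # repeat accounts are counted once
                    seen.add(contact)
                    next_frontier.append(contact)
        frontier = next_frontier
    return seen, edges


# Toy data, invented for illustration.
toy_contacts = {
    "op_1": ["a", "b"],
    "a": ["b", "c"],
    "b": ["op_1"],
}
accounts, toy_edges = snowball(["op_1"], toy_contacts)
print(sorted(accounts), len(toy_edges))
```

Even in the toy data, several edges lead back to accounts that were already seen, so the number of unique accounts is smaller than the number of edges followed. That kind of overlap is one way a crawl can land well under its theoretical maximum.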