Use of Twitter to assess sentiment toward waterpipe tobacco smoking

Colditz J.B., Naidu M., Smith N.A., Welling J., & Primack B.A. (April 2016). Use of Twitter to Assess Sentiment toward Waterpipe Tobacco Smoking. Oral presentation in Washington, DC: 37th Annual Meeting of the Society of Behavioral Medicine.


Background: Data from Twitter have been used to track health conditions such as influenza and foodborne illness. Advances in machine learning now allow researchers to use Twitter data to investigate novel behavioral health trends such as waterpipe tobacco smoking (WTS), which carries important health risks and is gaining popularity worldwide.

Methods: Using five popular variations on the keyword “hookah,” we retrieved a live feed of all matching Twitter messages (“tweets”) over a complete weekend in November 2014. This resulted in 43,155 English-language tweets. A random subset of 2,000 of these tweets was independently double-coded for WTS relevance and sentiment. We trained a Naïve Bayes classification algorithm on 75% of the relevant coded data to detect language predictive of WTS sentiment, and we tested the algorithm against the remaining 25%. We also examined sentiment differences across Western vs. Eastern hemispheres using timecode metadata.
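The train-and-test procedure described above can be sketched in a few lines of Python. This is a minimal illustration only, not the study's actual pipeline: the whitespace tokenizer, the Laplace smoothing parameter `alpha`, the function names, and the eight toy tweets in `CODED` are all assumptions made for the example.

```python
import math
import random
from collections import Counter, defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train_naive_bayes(examples, alpha=1.0):
    """Count labels and per-label tokens from (text, label) pairs."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            token_counts[label][tok] += 1
            vocab.add(tok)
    return {"labels": label_counts, "tokens": token_counts,
            "vocab": vocab, "alpha": alpha}


def predict(model, text):
    """Return the label with the highest smoothed log-posterior."""
    total = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n in model["labels"].items():
        score = math.log(n / total)
        denom = sum(model["tokens"][label].values()) + model["alpha"] * vocab_size
        for tok in tokenize(text):
            if tok in model["vocab"]:  # ignore tokens never seen in training
                count = model["tokens"][label][tok]
                score += math.log((count + model["alpha"]) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def split_75_25(examples, seed=0):
    """Shuffle a copy of the data and split it 75% train / 25% test."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.75)
    return shuffled[:cut], shuffled[cut:]


def accuracy(model, examples):
    return sum(predict(model, t) == y for t, y in examples) / len(examples)


# Hypothetical hand-coded tweets, invented for illustration only.
CODED = [
    ("love hookah tonight", "positive"),
    ("hookah with friends love it", "positive"),
    ("great hookah lounge", "positive"),
    ("love this hookah bar", "positive"),
    ("hookah is as bad as cigarettes", "negative"),
    ("cigarettes and hookah hurt lungs", "negative"),
    ("hate hookah smoke", "negative"),
    ("hookah smoke is bad", "negative"),
]

if __name__ == "__main__":
    train, test = split_75_25(CODED)
    model = train_naive_bayes(train)
    print("held-out accuracy:", accuracy(model, test))
```

With only a handful of toy examples the held-out accuracy is not meaningful; the point is the shape of the workflow, in which human-coded tweets are split once, the classifier sees only the training portion, and accuracy is reported on the portion it never saw.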

Results: Initial inter-rater agreement was strong for both positive and negative sentiment (Cohen’s κ = 0.74 and 0.71, respectively), and all differences were easily adjudicated. Based on the human-coded data, our classification algorithm detected both positive and negative sentiment with over 70% accuracy. Several interesting elements of the classification algorithm emerged. For example, presence of a “heart” emoji (text-based image) predicted positive sentiment toward WTS at a 14:1 ratio, while presence of the word “cigarettes” predicted negative sentiment at a 23:1 ratio. Western hemisphere tweets were more likely to be positive than Eastern hemisphere tweets (56% vs. 32%, p < .001).
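For readers unfamiliar with the agreement statistic reported above, Cohen's kappa compares the observed agreement between two coders with the agreement expected by chance from each coder's label frequencies. The sketch below is a generic implementation of that standard formula; the function name and the two short lists of codes are hypothetical and are not the study's data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who coded the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must code the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes: 1 = positive sentiment, 0 = not positive.
coder_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
coder_2 = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

if __name__ == "__main__":
    # Observed agreement is 0.8 and chance agreement is 0.5,
    # so kappa comes out at approximately 0.6.
    print(round(cohens_kappa(coder_1, coder_2), 2))
```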

Conclusions: Twitter appears to be a valuable source of data related to WTS. We were able to train a supervised classification algorithm to detect sentiment with relatively little human coding. Examination of user metadata allowed us to detect broad geographic differences related to WTS sentiment. The processes developed through this study may be valuable for tracking sentiment over time and monitoring novel behavioral health trends across geographic regions.
