Twitter Depression Detection

As part of my Artificial Intelligence class project, we were expected to create something using a Machine Learning model of some sort. Working in a group of 3, I helped develop a model to detect signs of depression in a Twitter user's posting history.

This was my first introduction to NLP and Deep Learning, and while I didn't know the challenges involved at first, I was ready to take this project on. Mental health is of huge importance to me, and I was curious to find out what correlations exist between Twitter posting activity and indications of mental unrest.

Data Collection and Cleaning

This was my first foray into automatic data collection. To scrape Twitter for our data, we manually searched recently posted tweets for certain keywords that could indicate a depressed state of mind. (Sidenote: while there were readily available datasets of tweets from depressed people, this was the best we could do in terms of collecting our own profile data given the resources at the time.) After manually reviewing whether the person we had found showed stronger signs of depression, we used a Selenium bot to scrape tweet metadata, collecting around 400-700 tweets per user handle. For the control group in our dataset, we simply used the Selenium bot to scrape tweets from anyone who had posted the word "I". Using this method, we ended up with a collection of ~25,000 tweets, ~11,000 of which came from handles marked as "depressed".
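For illustration, here is a minimal sketch of what such a scraper might look like; the `scrape_handle` helper, the CSS selector, and the timings are hypothetical stand-ins rather than our exact bot, since Twitter's markup changes frequently:

```python
# Hypothetical sketch of the scraping step, assuming Selenium with Chrome.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_handle(driver, handle, max_tweets=700, max_scrolls=200):
    """Scroll through a profile page and collect raw tweet text."""
    driver.get(f"https://twitter.com/{handle}")
    tweets = set()
    for _ in range(max_scrolls):
        if len(tweets) >= max_tweets:
            break
        # Each scroll loads another batch of tweets into the DOM.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # give the page time to render new tweets
        for el in driver.find_elements(By.CSS_SELECTOR, "[data-testid='tweetText']"):
            tweets.add(el.text)
    return list(tweets)[:max_tweets]

driver = webdriver.Chrome()
collected = scrape_handle(driver, "some_handle")
driver.quit()
```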

Text cleaning code block

After collecting the tweets, we used NLTK to tokenize them and strip symbols and stopwords, then processed the cleaned text for sentiment analysis using TextBlob. TextBlob gives us a polarity score between -1 and 1, with -1 indicating an overwhelmingly negative tweet and 1 an overwhelmingly positive one. This polarity score is an important feature for the model. TextBlob also gives us a subjectivity score, which indicates how objective or subjective the library deems the tweet to be; unlike polarity, it ranges from 0 (objective) to 1 (subjective).
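Since the original code block above isn't reproduced here, this is a rough sketch of that cleaning and scoring step, assuming NLTK's standard tokenizer and stopword list; the regex and example tweet are illustrative:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob

nltk.download("punkt")
nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def clean_tweet(text):
    """Lowercase, strip URLs/mentions/symbols, then drop stopwords."""
    text = re.sub(r"http\S+|@\w+|[^a-z\s]", " ", text.lower())
    tokens = [t for t in word_tokenize(text) if t not in STOPWORDS]
    return " ".join(tokens)

def sentiment_scores(text):
    """Return TextBlob's polarity ([-1, 1]) and subjectivity ([0, 1])."""
    blob = TextBlob(clean_tweet(text))
    return blob.sentiment.polarity, blob.sentiment.subjectivity

print(sentiment_scores("I feel so hopeless and tired of everything..."))
```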

After collecting, cleaning, and anonymizing our data, we organized it all into a pandas dataframe, one tweet per row. The final dataset consisted of 9 different features:

  • the profile's label

  • the polarity and subjectivity scores

  • the year, month, and day of the tweet

  • an "in_between" count of how many days passed since the last tweet

  • A "keyword_score" of how many of the keywords used to search for a "depressed" tweet were found in the specific tweet

  • and a "1stperson_score" of how many times the person used first person pronouns in their tweet (i.e. "I", "me", "my" etc.)

Sample data from a handle deemed "not depressed"

Detection Model

With our newly acquired dataset, we were ready for some model analysis. I built a feed-forward neural network for 3-class classification (0 - not depressed, 1 - possibly depressed/unsure, 2 - depressed). The final model consisted of 4 Dense layers with 1 Dropout layer. This was then trained for 20 epochs with an 80/20 train/test split.
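A sketch of a comparable architecture in Keras; the layer widths, dropout rate, and dummy data are stand-ins rather than our exact configuration:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Hypothetical layer sizes; the real model had 4 Dense layers and 1 Dropout.
model = Sequential([
    Dense(64, activation="relu", input_shape=(8,)),  # 8 features (label excluded)
    Dense(32, activation="relu"),
    Dropout(0.5),                                    # regularization against overfitting
    Dense(16, activation="relu"),
    Dense(3, activation="softmax"),                  # 3 classes: 0 / 1 / 2
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data just so the snippet runs end to end.
X = np.random.rand(1000, 8).astype("float32")
y = np.random.randint(0, 3, size=1000)
model.fit(X, y, epochs=20, validation_split=0.2)
```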

Final model layout

Looking at graphs of the model's performance, we saw it converge very quickly, with few fluctuations after around 4 epochs and clear signs of overfitting. We were getting around 68% validation accuracy, which interestingly jumped to 73% when we switched to Google Cloud TPUs for training.

Graphs of model performance on testing set (trained locally)

At this point, we realised that NLP might be a much harder task than we had thought. This ended up being a much bigger challenge than I had first envisioned, especially for my first ever project with large-scale data. However, I believe a lot of the model's issues were due to the quality of our data. Firstly, the manner in which we collected data on depressed individuals was largely arbitrary and left out a lot of nuance, much of which is necessary for a topic as nuanced as depression. Secondly, the date information was rendered quite useless in a feed-forward network, which has no way to retain memory across a time series. To address that, I would consider using an RNN, which can model time-sensitive data.
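As a rough illustration of that idea, a per-user LSTM along these lines could consume tweets as an ordered sequence; the sequence length and layer size here are arbitrary assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Hypothetical framing: one sample per user, covering that user's last
# 50 tweets with 8 features each, fed in chronological order so the
# gaps between tweets become part of the sequence.
model = Sequential([
    LSTM(32, input_shape=(50, 8)),
    Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```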

All in all, I had a fun time collecting the data and seeing the differences in what our model deemed depressed or not. This was a challenging but entertaining introduction to NLP, and it piqued my interest in the field.