Friday, 29 September 2017

Daughters Of India

This post is not about feminist movements in India or any similar topic; rather, it is about using Python code to apply basic NLP (Natural Language Processing) techniques to tweets.

The four Twitter users whose tweets I analysed are Nidhi Razdan, Rupa Subramanya, Shubhrastha, and Rana Ayyub. They are among the very few media persons I like to hear or read, for I find them insightful and interesting in how they present their information and arguments. Most other journalists are either not insightful or boring. In terms of political inclination, Nidhi and Rana are left of centre, whereas Rupa and Shubhrastha are to the right.

The analysis has a limited purpose: it is intended to put a set of numbers around the tweets and so give a basic quantified view of what these journalists tweet.

By "basic" NLP, I mean applying the following techniques:
Word Density: This is the simplest one that calculates 'words' per tweet.
Lexical Diversity: This is an interesting statistic which, as the author Russell explains, "is defined as the number of unique words divided by the number of total words in a corpus; by definition, a lexical diversity of 1.0 would mean that all words in a corpus were unique, while a lexical diversity that approaches a 0.0 implies more duplicate words.
In the Twitter sphere, lexical diversity might be interpreted in a similar fashion if comparing two Twitter users, but it might also suggest a lot about the relative diversity of overall content being discussed, as might be the case with someone who talks only about technology versus someone who talks about a much wider range of topics."[1]
Top Words: The top words that were most frequently used. My program prints the top five.
Popularity: This is the sum of the number of retweets and likes received by a tweet. My program prints the five most popular tweets.
Sentiment: A piece of text can be classified as positive, neutral or negative. My program calculates the number and percentage of positive, neutral and negative tweets.
Clustering: Clustering is a way of dividing a bunch of entities into a fixed number of groups that are not determined in advance but form organically as the data is processed. I used the K-means clustering algorithm, an unsupervised machine learning algorithm that divides 'n' data points into 'K' clusters based on some measure of similarity. My program uses K=5, so it divides the tweets into five clusters (topics) and prints 10 words from each cluster as an indicative example.
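
As a rough illustration of the two simplest statistics in the list above, here is a minimal sketch of word density and lexical diversity on a toy corpus. It is not the code from my repository: the function names and the three sample tweets are made up for this example, and words are split on whitespace only.

```python
def word_density(tweets):
    """Mean number of whitespace-separated words per tweet."""
    if not tweets:
        return 0.0
    return sum(len(t.split()) for t in tweets) / len(tweets)


def unique_word_ratio(words):
    """Lexical diversity: unique words divided by total words."""
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# Toy corpus, invented for illustration only.
tweets = ["thank you india", "thank you all", "india wins"]
words = [w for t in tweets for w in t.split()]

density = word_density(tweets)        # 8 words over 3 tweets
diversity = unique_word_ratio(words)  # 5 unique words out of 8
```

Here the diversity is 5/8 = 0.625, because "thank", "you" and "india" each occur twice.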

I wrote the code in Python. The program has a few helper functions that are used by the more functional routines:
clean_text_and_tokenize: This function takes a line (string) as input, cleans it and returns the words as a list. The cleaning process consists of removing hyperlinks, punctuation marks and stop words, and lemmatizing the remaining words. Stop words are words like 'a' and 'an' that we don't want to be part of the analysis. Lemmatizing is the process of replacing a word by its base form.
clean_tweet: This function takes a line (string), gets the clean words by calling clean_text_and_tokenize and returns a string by joining the cleaned words.
getCleanedWords: This function takes a list of lines (strings), cleans each line and returns all the words from all the lines.
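
The three helpers could be sketched roughly as below. This is a simplified stand-in that uses only the standard library, not the actual code: the stop-word set is a tiny hand-written sample, the LEMMAS lookup table is a hypothetical substitute for a real lemmatizer, and lower-casing the text is my assumption for this sketch.

```python
import re
import string

# Tiny illustrative stop-word set; a real run would use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on"}

# Hand-written lookup standing in for a real lemmatizer (hypothetical).
LEMMAS = {"says": "say", "said": "say", "tweets": "tweet", "thanks": "thank"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text_and_tokenize(line):
    """Return the cleaned, lower-cased words of one line as a list."""
    text = URL_RE.sub(" ", line.lower())      # drop hyperlinks
    text = text.translate(PUNCT_TABLE)        # drop punctuation marks
    words = [w for w in text.split() if w not in STOP_WORDS]
    return [LEMMAS.get(w, w) for w in words]  # map to base form if known


def clean_tweet(line):
    """Return the cleaned words of one line joined back into a string."""
    return " ".join(clean_text_and_tokenize(line))


def getCleanedWords(lines):
    """Return all cleaned words from all lines as one flat list."""
    return [w for line in lines for w in clean_text_and_tokenize(line)]
```

For example, clean_tweet("She says thanks to the team: https://example.com/x !") gives "she say thank team" with the sample tables above.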

The key functional routines are:
lexical_diversity: This function takes a list of words, and returns the number of unique words divided by the total number of words.
average_words: This function takes an array of strings, splits them into words and returns the average number of words per string.
top_words: This function takes in a list of words, stores the frequency of each word and returns the most frequently used words. If the 'top' number is not passed as argument, it defaults to five.
popular_tweets: This function adds the retweet count and like count of every tweet to calculate its popularity. It uses a priority queue to identify the most popular tweets. If the 'top' number is not passed as argument, it defaults to five.
sentiment_analysis_basic: This function uses the sentiment property of the TextBlob library to calculate the polarity of a tweet. The tweet is classified as positive, neutral or negative depending on whether the polarity is greater than, equal to, or less than zero.
clusterTweetsKmeans: This function uses the gensim library to create a model of vectors from the cleaned tweets. After training the model, it invokes the KMeans routine of the sklearn library. Tweets are clustered into five topics.
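
To make the counting, ranking and classification steps concrete, here is a minimal standard-library sketch. It is an approximation rather than the code from the repository: the dictionary keys 'text', 'retweets' and 'likes' are hypothetical, classify_polarity and sentiment_summary are names invented for this example, and the polarity scores are taken as plain numbers instead of being computed by TextBlob.

```python
import heapq
from collections import Counter


def top_words(words, top=5):
    """Return the `top` most frequent words as (word, count) pairs."""
    return Counter(words).most_common(top)


def popular_tweets(tweets, top=5):
    """Return the `top` tweets by retweets + likes, most popular first.

    Each tweet is assumed to be a dict with the hypothetical keys
    'text', 'retweets' and 'likes'. heapq.nlargest keeps a small heap
    instead of sorting the whole list.
    """
    return heapq.nlargest(top, tweets, key=lambda t: t["retweets"] + t["likes"])


def classify_polarity(polarity):
    """Map a polarity score to 'positive', 'neutral' or 'negative'."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def sentiment_summary(polarities):
    """Return {label: (count, percentage)} for a list of polarity scores."""
    counts = Counter(classify_polarity(p) for p in polarities)
    total = len(polarities)
    return {
        label: (counts[label], 100.0 * counts[label] / total if total else 0.0)
        for label in ("positive", "neutral", "negative")
    }
```

With TextBlob installed, the score for one tweet would typically be obtained as TextBlob(text).sentiment.polarity and then passed to classify_polarity.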

The code is available in my GitHub repository python-misc. The input to the program is a file named <twitter_user>.csv. This file has to be generated first by running the excellent program available in the GitHub repository GetOldTweets-python. The ultra-cool feature of this module is that you don't have to register an app with Twitter and put authorization tokens and passwords in the code.

For this article, I have fetched tweets from 01-Jan-2015 to 25-Sep-2017. To get the tweets csv file of @Nidhi, the command is:
$ python --Nidhi --since 2015-01-01 --until 2017-09-25
This creates a file named output_got.csv, which I renamed to Nidhi.csv. The command to run my program is:
$ python tweets_analysis Nidhi

The program opens the csv file and reads all the records into a list of strings. It skips the first line as it is the header. It then calls the functional routines one by one. The output generated for running with Nidhi.csv is:
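
The reading step might look something like the sketch below. It is only an illustration under stated assumptions: the semicolon delimiter and the column names in the sample are my guesses for this example, not a description of the real file, and each record is returned as a list of fields rather than a single string.

```python
import csv
import io


def read_tweets(handle, delimiter=";"):
    """Read tweet records from an open CSV file, skipping the header line."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header row, if there is one
    return [row for row in reader if row]


# Invented two-tweet sample; the column names are hypothetical.
sample = io.StringIO(
    "username;retweets;likes;text\n"
    "someone;3;7;hello india\n"
    "someone;0;1;thank you\n"
)
rows = read_tweets(sample)
```

For a real file, the handle would come from open("Nidhi.csv", newline="", encoding="utf-8").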
Total no. of tweets: 3120
Average Number of words per tweet = 10.4330128205
Lexical diversity = 0.252425418385
| Words | Count |
| thank |   320 |
| india |   162 |
| say   |   124 |
| yes   |   109 |
| also  |   101 |
Printing top 5 tweets
1. I don't know who killed Gauri Lankesh. But I do see who is celebrating her death and vilifying her.
Popularity = 17679
Link =
2. A message to those in the media who are still independent and do their job by fearlessly asking questions. We won't be intimidated https:// s/871593196953849856 …
Popularity = 10653
Link =
3. It's now fairly clear demonetisation was a purely political move. Brilliant actually. Economy got hit but hey, U.P. was won
Popularity = 9018
Link =
4. Hello people, Ramdev is not buying NDTV. Thank you
Popularity = 8892
Link =
5. Honoured to present my book 'Left,Right &Centre,The Idea of India' to the President @RashtrapatiBhvn @PenguinIndia
Popularity = 7233
Link =

No. of positive tweets = 1043 Percentage = 33.4294871795
No. of neutral tweets = 1616 Percentage = 51.7948717949
No. of negative tweets = 461 Percentage = 14.7756410256

Topic 1 has words: income tax department sends notice harsh mander institute via httweets
Topic 2 has words: anyone bjp condemned language today actually first one anything else
Topic 3 has words: wonder took long life short live fruit covered story well
Topic 4 has words: lol sigh never according yes saying mention press cog corner
Topic 5 has words: hiv but thank actually thank thank sephora actually french thank

I have captured the output of the runs against the four files in the following Google sheet:
tweet_analysis output
For your ready reference here is a screenshot:

Some observations
Rupa is the most prolific, averaging about 45 tweets per day, whereas the most popular tweet is from Nidhi Razdan. Shubhrastha uses the most words per tweet among the four. The highest lexical diversity is Nidhi's, which may indicate a larger vocabulary. Rupa's LD value is very low, probably because the denominator (the total number of words, which grows with the number of tweets) is very high. The highest positive sentiment is from Rupa and the highest negative sentiment is from Shubhrastha, both right-leaning. Sentiment neutrality is lowest in Rupa's tweets, which suggests that she takes a stand most of the time.

Program improvement & enhancement
For the lexical diversity calculation, we should perhaps compare an equal number of tweets from each user.
Sentiment analysis can be done with a more advanced algorithm like Naive Bayes; that would require a corpus of pre-classified tweets, the training data as it is technically called, preferably from Indian users.
Once we have a larger dataset of Twitter analyses, this program could be used to classify a Twitter user's political orientation as left, centre or right by analysing their tweets. This could be done either by comparing their tweets with a political-ideology corpus or by measuring similarity with one of the already analysed Twitter users.
Just showing the words in a cluster is not meaningful. I need to experiment with the number of clusters and analyse each cluster separately to derive some semantic meaning. Topic clustering can also be done with a probabilistic algorithm like LDA.
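
The first suggestion above, computing lexical diversity over an equal number of tweets for each user, could be sketched as follows. This is an idea for a future version rather than existing code: the function name, the random sampling and the fixed seed are all assumptions made for the example, and each tweet is expected to be already tokenized into a list of words.

```python
import random


def lexical_diversity_on_sample(tokenized_tweets, n_tweets, seed=0):
    """Lexical diversity over a random sample of `n_tweets` tweets.

    `tokenized_tweets` is a list of word lists, one per tweet. Using
    the same `n_tweets` for every user keeps the comparison fair.
    """
    if not 0 < n_tweets <= len(tokenized_tweets):
        raise ValueError("n_tweets must be between 1 and the number of tweets")
    sample = random.Random(seed).sample(tokenized_tweets, n_tweets)
    words = [w for tweet in sample for w in tweet]
    return len(set(words)) / len(words) if words else 0.0
```

Calling it with the same n_tweets for all four users would remove the effect of one user simply having tweeted far more than the others.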

[1] Mining the Social Web, 2nd Edition, by Matthew A. Russell. O'Reilly Media.