Saturday 29 June 2013

Tapp(se)ing Twitter Data

I keep running into Big Data these days -- through articles, newsletters, and even books. Big Data is the Big Buzz. During JavaOne 2013 Hyderabad, I realized that using the words “Big Data” in the title of your presentation is a sure way of drawing crowds; in fact, one of the sessions was about JVM tuning but was named Big Data and performance or some such thing, and needless to say the room was full, with folks crammed into even the standing space. Intellectual dishonesty, I abhor.

The starting point for getting into the Big Data space is of course data analysis, which is but a fancy term for applying the stuff you learn in college courses on statistics. So I wanted to get hands-on with a few things in data analysis. As is my habit when diving into a subject, I first scour for books. Online, of course.

I found a few interesting ones from O’Reilly, such as Data Analysis With Open Source Tools and Exploring Everyday Things With R and Ruby. I bought them, and as I was planning to work through them, I chanced upon Mining The Social Web by Matthew Russell and immediately purchased it. After I received my copy and skimmed through it, it was obvious that this one would have to be consumed first and the other two books would have to wait. What I mean is, writing programs to generate data using Monte Carlo simulation and then analysing that with R is a good learning experience and could solve practical problems, but mining Twitter data or LinkedIn / Facebook data is both learning and fun.

Mining The Social Web has Python programs to analyse data from Twitter, Gmail, blogs, Facebook and LinkedIn. The first choice was easy -- I started with the chapters on Twitter. One may wonder what one could do with 140-character microblogs. A lot, we find out. One could answer interesting questions such as:

How many friends / followers does X have?
Who is X following that is not following X back?
Who is following X that X is not following back?
Who are X’s “mutual friends” (people that X is following that are also following X back)?
Given all of X’s followers and all of their followers, what is X’s potential influence if X gets retweeted?
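
Most of these questions boil down to set operations on user ids. Here is a minimal sketch, assuming the friend ids (people X follows) and follower ids have already been fetched into two collections. The function name and the ids are made up for illustration and are not taken from the book.

```python
def relationship_summary(friend_ids, follower_ids):
    """Compare who X follows (friends) with who follows X (followers)."""
    friends = set(friend_ids)
    followers = set(follower_ids)
    return {
        # People X follows who do not follow X back
        "friends_not_following_back": friends - followers,
        # People who follow X but whom X does not follow
        "followers_not_followed_back": followers - friends,
        # People X follows who also follow X
        "mutual_friends": friends & followers,
    }


# Hypothetical ids, for illustration only.
summary = relationship_summary(friend_ids=[1, 2, 3, 4], follower_ids=[3, 4, 5])
for name in sorted(summary):
    print(name, sorted(summary[name]))
```

The last question (potential influence) needs the followers of each follower as well, so it means many more API calls, but the combining step would still be a union of sets.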

The last question is particularly interesting, as I think businesses look out for influencers. On this aspect, the book says on page 103, “one trivial way to measure the relative influence of two or more users is to simply compare their number of followers, since every follower will have a direct view of their tweets.” Extending this discussion, on page 142 the author writes, “if you tweet a lot and nobody retweets you, it’s safe to say that your influence is pretty weak -- at least as a Twitterer. One base metric that’s quite simple and inexpensive to calculate is the ratio of tweets to retweets. A ratio of 1 would mean that every single tweet you’ve authored was retweeted and indicate that your influence is strong -- potentially having second- and third-level effects that reach millions of unique users (literally) -- while values closer to 0 show weaker influence.”

In order to do the calculation, the example given takes advantage of the retweet_count field in a tweet and also does a map/reduce combination using CouchDB. This is what I found appealing about this book: the variety of tools you get familiar with. It’s like a gentle introduction to Big Data, without pushing you into the grease and shanks of tools like Hadoop. Apart from CouchDB, you would be playing with tools like Redis and NetworkX, just to name a couple.
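
To get a feel for the calculation without setting up a database, here is a rough sketch in plain Python. It is not the book's CouchDB map/reduce code. It assumes each tweet is a dict carrying a numeric retweet_count field, and the function name and sample tweets are invented for illustration.

```python
def influence_ratio(tweets):
    """Fraction of authored tweets that were retweeted at least once."""
    if not tweets:
        return 0.0
    # "Map" step: emit 1 for a tweet that was retweeted, 0 otherwise
    flags = [1 if t.get("retweet_count", 0) > 0 else 0 for t in tweets]
    # "Reduce" step: add up the flags and divide by the number of tweets
    return float(sum(flags)) / len(tweets)


# Made-up tweets, for illustration only.
sample = [
    {"text": "first", "retweet_count": 3},
    {"text": "second", "retweet_count": 0},
    {"text": "third", "retweet_count": 1},
    {"text": "fourth", "retweet_count": 0},
]
print(influence_ratio(sample))
```

Two of the four sample tweets were retweeted, so the ratio comes out as 0.5.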

So what would I do after reading the chapters and the programs given therein? Naturally, apply them to somebody’s tweets. Film actress Taapsee was the first Twitterer that I started following on Twitter, and she was also the first person to reply-tweet to me back in 2010, when I got myself into the Twitter world and was navigating around to understand what it was all about. So I started running the programs on Taapsee’s tweets.

There was a problem. Example 4-4 wouldn’t fetch more than 75,000 ids. The retry code in the function handleTwitterHTTPError was not matching any error code and was falling through to the last else. I thought this was a nasty and premature end to my fun journey. But I discovered that all the code in the book was being maintained in a GitHub repository by the author.
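
For readers who have not met this kind of retry logic, here is a generic sketch of the idea. It is not the book's handleTwitterHTTPError. The fetch callable, the function name, and the set of status codes treated as retryable are all assumptions made for illustration.

```python
import time


def fetch_with_retry(fetch, retryable=(429, 500, 502, 503, 504),
                     max_tries=5, sleep=time.sleep):
    """Call fetch() until it succeeds, backing off on retryable statuses.

    fetch is a hypothetical callable returning (status_code, payload).
    """
    wait = 1
    for _ in range(max_tries):
        status, payload = fetch()
        if status == 200:
            return payload
        if status not in retryable:
            # Fail loudly rather than quietly dropping into a catch-all branch
            raise RuntimeError("unrecoverable HTTP status %d" % status)
        sleep(wait)  # wait before the next try
        wait *= 2    # double the wait each time (exponential backoff)
    raise RuntimeError("gave up after %d tries" % max_tries)


# Simulated responses: two failures, then success. No network involved.
responses = iter([(503, None), (503, None), (200, [1, 2, 3])])
waits = []
ids = fetch_with_retry(lambda: next(responses), sleep=waits.append)
print(ids, waits)
```

Passing the sleep function in as a parameter keeps the sketch testable: the simulated run records the waits instead of actually pausing.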

Thankfully, GitHub allows you to post issues and I posted one (issue #56). And, more thankfully, the author Matthew Russell responded. Apparently he is busy with the second edition of the book, but he took time out to fix the issue and uploaded the changed code. After that, the program executed correctly and I could move on to the other programs. Thanks a lot, Matthew.

So, after running the programs in chapters 4 and 5, here are a few tidbits of analysis about Taapsee’s tweets. Note that Twitter allows only the latest 3200 tweets to be fetched, so the stats given are based on those only:

Whom did Taapsee retweet? Top 2 are recent film ids only (Chashme Baddoor, @cbdthefilm -- 77 and Gundello Godari, @gundello_godari -- 47). Next by count are: @thesedamnquote (10), @anupampkher (6), @lakshmimanchu (6), and @taapsee_fans (6).
Top followers : @dhanushkraja (414843), @Actorjiiva (216836), @Samanthaprabhu2 (208428), @LakshmiManchu (202766).
Mutual Friends : Out of the 69 people Taapsee follows, 29 don't follow back, so only 40 are mutual friends.
Twitpals : If you don't consider her film ids, @cbdthefilm and @gundello_godari, the people whom she tweeted to the most are -- @lakshmimanchu (64), @vrindaprasad (57), @gundello_godari (54), @sin2ja (53), @sillijo (52), and @crhemanth (49). Taapsee follows @lakshmimanchu and @vrindaprasad, but doesn't follow @sin2ja, @sillijo, or @crhemanth.
Hashtag density : Avg number of hashtags per tweet for taapsee: 0.125198098257.
Influence ratio : 1640 of 3233 authored tweets were retweeted at least once (0.507268790597 tweet/retweet ratio).
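
Numbers like the retweet counts and the hashtag density above can be approximated from the tweet text alone. The sketch below uses regular expressions, which is an assumption on my part; the Twitter API also delivers parsed entities that could be used instead. The function names, handles, and sample tweets are invented for illustration.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT @(\w+)")


def hashtag_density(texts):
    """Average number of hashtags per tweet."""
    if not texts:
        return 0.0
    total = sum(len(HASHTAG.findall(t)) for t in texts)
    return float(total) / len(texts)


def most_retweeted_users(texts, n=3):
    """Count 'RT @user' mentions and return the n most frequent users."""
    counts = Counter(
        user.lower() for t in texts for user in RETWEET.findall(t)
    )
    return counts.most_common(n)


# Made-up tweets with hypothetical handles, for illustration only.
sample_texts = [
    "RT @film_account: New poster out #NewFilm",
    "Shooting all day today",
    "RT @film_account: Trailer tomorrow",
    "RT @some_actor: Good morning #quote #life",
]
print(hashtag_density(sample_texts))
print(most_retweeted_users(sample_texts, n=2))
```

With these four sample tweets there are three hashtags in total, so the density is 0.75, and the hypothetical @film_account tops the retweet list with two retweets.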

These derived numbers give you an insight into the Twitterer’s tweeting profile. Such analysis can be enhanced if similar numbers are derived for other people and then compared with each other. So there's a lot more stuff one could do beyond the examples given in the book.

As far as the programs are concerned, if you want to analyze tweets with your own tools, and Python and CouchDB are not your cup of caffeine, you could download a Twitterer’s latest 3200 tweets into a text file using http://greptweet.com. You can then write your own tools in a favorite language like Java or Perl to parse the file and invoke Twitter’s APIs to get further data leading you to the insights you want.

After the material on Twitter data, there are two chapters in the book revolving around Google Buzz (which is now dead) and Blogs -- and they are very interesting as they get you onto the subject of natural language processing. I plan to run those programs too and that experience will be the subject of another post.

1 comment:

  1. Mahboob - I'm glad that you found some of the code from Mining the Social Web useful. Thanks so much for sharing this awesome post about what you are doing with Twitter data! Let me know if you run into anything else that I can help with, and also, do note that a revision to the first edition in 2012 included some updates to convert the Google Buzz content to Google+, so that chapter should hopefully be of good use to you as well.

    If you're at all interested in previewing the 2nd Edition (which IMHO is superior to the 1st Edition in nearly every way and is now almost to a final manuscript draft), I'd be glad to share a copy with you ahead of its release. Its code is also on GitHub (though in a different repository) if you'd like to reference it for any reason as well.
