Saturday, 27 July 2013

Fetching Blogger Posts With Python

As mentioned in a previous post, I am working through two chapters, 7 and 8, of Matthew Russell’s eminently fun-filled book Mining the Social Web (MTSW). As a first step, I had a go at Blogger posts. Example 8-1 gets blog posts from their feeds and saves them in JSON format.

Very unsurprisingly, I started with my own blog. On Blogger, when you scroll down and click on Subscribe to: Posts (Atom), you get the feed URL. In my case it is -- One thing to note is that though my blog domain name is, the feed domain is

So I started by passing the Atom feed URL as the command-line argument to the program. First hiccup: the wrong links were being copied to the output. So I changed the index in 'link': e.links[0].href to e.links[4].href to get the proper post URL.
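The fix above depends on the post URL sitting at position 4 in the links list. A slightly more defensive sketch is to pick the link by its rel value instead of its position. It assumes each link behaves like a dict with "rel" and "href" keys (as feedparser's link objects do) and that the post page is the one marked rel="alternate". The name alternate_link and the sample URLs are made up for illustration, and the sketch is written for Python 3.

```python
def alternate_link(links):
    """Return the href of the first link whose rel is 'alternate', else None."""
    for link in links:
        if link.get("rel") == "alternate":
            return link.get("href")
    return None


# Hand-written stand-in for one entry's links; the URLs are hypothetical.
sample_links = [
    {"rel": "replies", "href": "https://example.com/feeds/1/comments/default"},
    {"rel": "replies", "href": "https://example.com/2013/07/post.html#comment-form"},
    {"rel": "edit", "href": "https://example.com/feeds/posts/default/1"},
    {"rel": "self", "href": "https://example.com/feeds/posts/default/1"},
    {"rel": "alternate", "href": "https://example.com/2013/07/post.html"},
]

print(alternate_link(sample_links))
```

If a feed orders its links differently, this lookup keeps working where a fixed index would silently pick the wrong URL.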

By default, Blogger sends only 25 posts in the feed. If you want more, you need to set the max-results parameter. The hitch is that it takes 500 as its maximum value. The URL to fetch is -
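As a rough sketch of how the max-results parameter can be attached to a feed URL, here is a small Python 3 helper (the program in this post was Python 2). The helper name with_max_results and the example.com feed URL are placeholders of mine, not the real feed address.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

MAX_RESULTS_CAP = 500  # the ceiling mentioned above


def with_max_results(feed_url, max_results=MAX_RESULTS_CAP):
    """Return feed_url with max-results set, clamped to the 500 cap."""
    parts = urlsplit(feed_url)
    query = dict(parse_qsl(parts.query))  # keep any parameters already present
    query["max-results"] = str(min(max_results, MAX_RESULTS_CAP))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical feed URL, used only to show the shape of the result.
print(with_max_results("https://example.com/feeds/posts/default"))
```

Asking for more than 500 is clamped here rather than sent as-is, since the larger value would not bring back more posts anyway.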

All my blog entries, a grand total of 36, were fetched and saved in JSON format, for the pure pleasure of further analysis, reflection and thought :) But how much can you analyze with 36 posts? Yeh Dil Maange More (this heart wants more).

At this point, I switched to Annie Zaidi's blog. I had met her recently at her book reading session and had also posted a long entry on my blog about my impressions. Once you meet Annie, it’s not easy to get her out of your system.

Now wait, no journey is entirely smooth, and this one had its road bumps: her blog feed gives only short versions of the posts. I checked my own Blogger settings. They were: Settings -> Site Feed -> Allow Blog Feed? = Full. So I tweeted to her, asking her to enable the full feed.

But I got no reply. Busy bee, no time for the mahboobs of the world. I thought I would send a reminder, but then I realized there could be other bloggers whose feed setting is Short. It is better to handle this situation in the code than to keep asking every such blogger in the world to change their settings.

The steps are simple: from the feed, fetch the post URLs; go to each URL and fetch the html page; then get the blog content. As Scott Meyers wrote, pretty this ain’t, but sometimes a programmer’s just gotta do what a programmer’s gotta do.

So how do we fetch the post URLs? Already done. As I wrote above, they are obtained from e.links[4].href.

So how do we go to each URL and fetch the html page? Helpful here is the old Python warhorse urllib2. It takes a URL and fetches you the html page.

So how do we fetch the blog content? It is embedded in the div with the class "post-body entry-content". You tell BeautifulSoup to find it with: find('div', attrs={'class' : 'post-body entry-content'})
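The last two steps can be sketched with the Python 3 standard library alone. To be clear, this is a swap: the program described here uses urllib2 and BeautifulSoup, while the sketch below uses urllib.request and html.parser so that it runs without extra packages. The names fetch_html, PostBodyParser and extract_post_body are mine, the page is assumed to be UTF-8, and the class match is by containment rather than the exact string.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url):
    """Download a page and decode it as UTF-8 (an assumption about the encoding)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class PostBodyParser(HTMLParser):
    """Collect the text inside the first div carrying both target classes."""

    TARGET = {"post-body", "entry-content"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # div nesting inside the target div; 0 means outside it
        self.done = False   # set once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            self.depth += 1  # a nested div inside the post body
        else:
            classes = set((dict(attrs).get("class") or "").split())
            if self.TARGET <= classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_post_body(html):
    """Return the text of the post body div, or an empty string if none is found."""
    parser = PostBodyParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Hand-written page fragment standing in for a real post page.
sample = (
    '<div class="post"><div class="post-body entry-content">'
    "Hello <b>world</b><div> again</div></div>"
    '<div class="post-footer">ignored</div></div>'
)
print(extract_post_body(sample))
```

With a real post, the two pieces would be chained as extract_post_body(fetch_html(post_url)), where post_url is one of the URLs taken from the feed.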

Steps done, program works. But then I got ambitious. Why fetch only 500 posts? Let’s get all of Annie’s posts. To get the next 500 posts, you have to use the URL -

So how many times do we need to get 500 posts? Simple. It is (total no. of posts / 500) + 1.

So how do we get the total number of posts? Request the feed in JSON form; in the JSON returned there is a value nested as "feed" -> "openSearch$totalResults" -> "$t", accessed in the code as ["feed"]["openSearch$totalResults"]["$t"]
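A minimal sketch of that lookup, using a hand-written JSON fragment in place of a real response. The count 1234 is invented, and the int() call is there on the assumption that the value may arrive as a string.

```python
import json

# Hand-written fragment shaped like the response described above; the count is hypothetical.
sample_response = '{"feed": {"openSearch$totalResults": {"$t": "1234"}}}'


def total_posts(feed_json):
    """Pull the total post count out of the feed JSON and return it as an int."""
    data = json.loads(feed_json)
    return int(data["feed"]["openSearch$totalResults"]["$t"])


print(total_posts(sample_response))
```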

One small correction was in order. In a Python for loop, we use range, which takes startIndex and endIndex as arguments and loops from startIndex to endIndex - 1. I prefer to start from 1, so what should the endIndex value be? It should be (total number of posts / 500) + 2, using integer division.
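The arithmetic above can be checked with a short Python 3 sketch. It assumes the feed is paged with a 1-based starting position (Blogger's start-index parameter, which this post does not spell out), and the helper name page_start_indexes is mine. Note the // operator: in Python 3 a plain / would give a float.

```python
PAGE_SIZE = 500  # the max-results ceiling mentioned earlier


def page_start_indexes(total_posts, page_size=PAGE_SIZE):
    """Return the 1-based start position of each page, using (total / 500) + 1 pages."""
    pages = total_posts // page_size + 1
    # range(1, pages + 1) is the same loop as range(1, total // 500 + 2)
    return [(page - 1) * page_size + 1 for page in range(1, pages + 1)]


print(page_start_indexes(36))
print(page_start_indexes(1234))
```

One wrinkle: when the total is an exact multiple of 500, this formula asks for one extra page that comes back empty. It is harmless, but ceiling division, written as -(-total_posts // page_size), would avoid the extra request.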

The remaining part of the program is the same as what Matthew Russell wrote. I thought I was done and ran the program. As it was running, it bombed while fetching the URL ---

Why? Because this blog post has no text. Uff, the airs girls put on. Annie had posted only one image in that post.

So the following code
return BeautifulStoneSoup(clean_html(text),

raises an IndexError because there is no content to index. If there is no text (non-HTML) content, we need to catch the exception and tell the calling code that there is no non-HTML text.

For this purpose, I enclosed the code in a try block, and in the except block I returned the string “NO NON_HTML TEXT FOUND”. In the calling for loop, I skip appending to the blog_posts array with a simple continue statement.
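The pattern looks roughly like this. It is a simplified Python 3 sketch, not the actual program: first_item_or_sentinel is a made-up stand-in for the cleaning function, and the parsed_posts data is invented to include one image-only post.

```python
NO_TEXT = "NO NON_HTML TEXT FOUND"  # sentinel string from the text above


def first_item_or_sentinel(items):
    """Stand-in for the cleaning step: return the first item, or the sentinel if empty."""
    try:
        return items[0]
    except IndexError:
        return NO_TEXT


# Hypothetical data: each list stands for the content found in one post.
parsed_posts = [["first post text"], [], ["third post text"]]

blog_posts = []
for nodes in parsed_posts:
    content = first_item_or_sentinel(nodes)
    if content == NO_TEXT:
        continue  # image-only post: nothing to append
    blog_posts.append(content)

print(blog_posts)
```

Returning None instead of a sentinel string would work equally well; the string simply makes the reason visible if it ever leaks into the output.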

The last modification I made was to take the blog id as a command-line argument and build the URLs from it.

Finally, everything done. I downloaded all the posts and saved them on my computer. After having met Annie, interacted with her and written a blog post, I could not get her out of “my” system, so I got her into my “system.”

Full code for the program, which I call is available on GitHub and also given below. The next two programs in MTSW are for finding sentences and max words, and for generating summaries. I suspect there will be some adventure there too. If so, that will be the material for another post.


