Friday 7 December 2012

Strawberry Perl For Rupa's Blogs

Earlier this year I wrote a Perl program to fetch the links to Rupa Subramanya’s blog entries, along with the links inside the posts. I wrote the program because I found the material she references to be quite exhaustive and scholarly.

The Perl script, available at https://sites.google.com/site/mahboobh06/home/rupa-blogs-links-perl-program, has undergone a few changes since its first version.

But before the changes, I bumped into Strawberry Perl, "The Perl for MS Windows, free of charge", when I was just surfing the posts in comp.lang.perl.misc. Actually it was there in the signature of one of the posters. I got interested, headed to its website and downloaded it.

The first program I tried with Strawberry Perl was the rupa-blogs-links script. When I ran it, I got the error -

Can't locate Text/Unidecode.pm in @INC (@INC contains: C:/strawberry/perl/site/lib C:/strawberry/perl/vendor/lib C:/strawberry/perl/lib .) at rupa-blog-links.pl line 6.
BEGIN failed--compilation aborted at rupa-blog-links.pl line 6.

At the command prompt, I ran cpan Text::Unidecode and it set up the missing library. After that, my program ran without any further system changes. I felt Strawberry Perl was true to its statement: “It is designed to be as close as possible to perl environment on UNIX systems.”
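
A quick way to check for a missing module before running the full script is sketched below. The helper name module_available is hypothetical and not part of the original program; it only uses Perl's built-in require.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script):
# returns 1 if the named module can be loaded, 0 otherwise.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Text::Unidecode -> Text/Unidecode.pm
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(strict Text::Unidecode)) {
    printf "%-16s %s\n", $module,
        module_available($module) ? 'found' : "missing (try: cpan $module)";
}
```

If Text::Unidecode is reported as missing, running cpan Text::Unidecode at the command prompt should install it, as described above.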

Now the Perl changes. In the program, I first fetch the blog summary pages. These are pages with multiple blog entries, located at
http://blogs.wsj.com/indiarealtime/tag/Economics-Journal/page/n/ where n starts from 1. From these summary pages, I extract the links to the actual blog entries. For this, I was selecting lines that had the text ‘href’ and ‘Economics Journal:’, like this one -
<h2 class="postTitle"><a href="http://blogs.wsj.com/indiarealtime/2012/11/21/economics-journal-the-blur-between-tips-and-bribes/">Economics Journal: The Blur Between Tips and Bribes</a></h2>

But then the blog started displaying Twitter buttons, so lines like the following also started matching -
<a href="https://twitter.com/share" class="twitter-share-button" data-counturl= "http://blogs.wsj.com/indiarealtime/2012/11/21/economics-journal-the-blur-between-tips-and-bribes/" data-text="Economics Journal: The Blur Between Tips and Bribes" data-url="http://on.wsj.com/Q9vhxA" data-via="WSJ">Tweet</a>

The solution is to select lines that have the text ‘postTitle’ and ‘Economics Journal:’
So
if (/(href)(.*?)Economics Journal:/) {
became
if (/(postTitle)(.*?)Economics Journal:/) {
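
Here is a minimal sketch of the difference, using the two sample lines above as test data. The array names and the URL extraction step are illustrative assumptions, not the code from the original script.

```perl
use strict;
use warnings;

# Two sample lines from the summary page: the real post title and the
# Twitter share button that caused the false match.
my @lines = (
    '<h2 class="postTitle"><a href="http://blogs.wsj.com/indiarealtime/2012/11/21/economics-journal-the-blur-between-tips-and-bribes/">Economics Journal: The Blur Between Tips and Bribes</a></h2>',
    '<a href="https://twitter.com/share" class="twitter-share-button" data-counturl= "http://blogs.wsj.com/indiarealtime/2012/11/21/economics-journal-the-blur-between-tips-and-bribes/" data-text="Economics Journal: The Blur Between Tips and Bribes" data-url="http://on.wsj.com/Q9vhxA" data-via="WSJ">Tweet</a>',
);

# Old test: any line with 'href' followed by the title prefix.
my @old_hits = grep { /href.*?Economics Journal:/ } @lines;

# New test: only lines carrying the postTitle class.
my @new_hits = grep { /postTitle.*?Economics Journal:/ } @lines;

printf "old pattern matched %d line(s), new pattern matched %d\n",
    scalar @old_hits, scalar @new_hits;

# Pull the post URL out of the lines the new pattern accepts.
my @post_urls = map { /postTitle.*?href="([^"]+)"/ ? $1 : () } @new_hits;
print "$_\n" for @post_urls;
```

The old pattern accepts both lines because the share button also contains ‘href’ followed by the post title; the new one accepts only the heading line.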

Next: The program failed after fetching and trying to process the blog entry - “Copyright and the Case of Student Materials”.

What was the hitch this time? The entry starts as usual with ‘By Rupa Subramanya’ but in the second line there is another text ‘says Rupa Subramanya’. The program selects text between the phrases ‘By Rupa Subramanya’ and ‘Rupa Subramanya’ with the following code:

($post) = $html =~ m{By\s+Rupa\s+Subramanya(.*?)Rupa\s+Subramanya}s;

In effect, it was not finding any blog content, because the closing phrase appears in the second line itself and the non-greedy match stops right there. I started playing with better starting and ending phrases and tried a lot of combinations. Finally, or so I thought, I could get it through by selecting the text between the phrases ‘Rupa Subramanya’ and ‘Rupa Subramanya writes’.

($post) = $html =~ m{Rupa\s+Subramanya(.*?)Rupa\s+Subramanya\s+writes}s;

In effect, the match still begins at the first occurrence of Rupa Subramanya (the one in ‘By Rupa Subramanya’), since a regex matches at the leftmost possible position. But the occurrence in ‘says Rupa Subramanya’ no longer ends the selection, which now runs up to 'Rupa Subramanya writes'. The program ran fine, fetching a lot of links, and then bombed at the post “Indian Grand Prix vs Encephalitis”.
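
The behaviour of the two patterns can be reproduced with a small sketch. The text in $html is a made-up stand-in for the real post; only the placement of the name matters.

```perl
use strict;
use warnings;

# Made-up stand-in for a post's text (an assumption, not the real page).
my $html = <<'END';
By Rupa Subramanya
Copying course packs is fair use, says Rupa Subramanya
Body of the post with its links.
Rupa Subramanya writes the Economics Journal for India Real Time.
END

# First attempt: the non-greedy match stops at the name in the second line.
my ($first) = $html =~ m{By\s+Rupa\s+Subramanya(.*?)Rupa\s+Subramanya}s;

# Second attempt: the longer closing phrase lets the match run to the footer.
my ($second) = $html =~ m{Rupa\s+Subramanya(.*?)Rupa\s+Subramanya\s+writes}s;

print "first attempt captured:  [$first]\n";
print "second attempt captured: [$second]\n";
```

The first capture ends before the body of the post is reached, while the second one includes it.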

What was the hitch this time? Rupa’s name appears as Rupa Subramanya Dehejia. The entry starts with ‘By Rupa Subramanya Dehejia’ and ends with ‘Rupa Subramanya Dehejia writes’. There was no such phrase as ‘Rupa Subramanya writes’. I tried a couple of matching expressions, but then I realized that there is nothing as bad as trying to do efficiently a thing that should not be done at all in the first place. So I asked myself, why in the world am I selecting text using the phrase ‘Rupa Subramanya’ at all?

So I looked at the source more closely and realized two things: a) all blog entries start with the phrase ‘article start’ and end with the phrase ‘article end’. b) But I need to ignore the last two lines viz -
1) Rupa Subramanya Dehejia writes the Economics Journal for India Real Time. You can follow her on Twitter @RupaSubramanya.
2) Follow India Real Time on Twitter @indiarealtime.

These last two lines have the common text ‘Twitter’.

So,
($post) = $html =~ m{By\s+Rupa\s+Subramanya(.*?)Rupa\s+Subramanya}s;
finally ended up as -
($post) = $html =~ m{article\s+start(.*?)article\s+end}s;

And I introduced my classic all-time favorite idiom in text parsing, next if $_ =~ / /; as follows:
next if $_ =~ /Twitter/;
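
Put together, the final approach might look like the sketch below. The sample HTML, the comment-style markers and the @links array are assumptions made for illustration; the real page source and the original script may differ.

```perl
use strict;
use warnings;

# Made-up stand-in for a fetched post; the markers follow the description above.
my $html = <<'END';
<!-- article start -->
<p>By Rupa Subramanya Dehejia</p>
<p>See <a href="http://example.com/report">this report</a>.</p>
<p>Rupa Subramanya Dehejia writes the Economics Journal for India Real Time. You can follow her on <a href="http://twitter.com/RupaSubramanya">Twitter</a> @RupaSubramanya.</p>
<p>Follow India Real Time on <a href="http://twitter.com/indiarealtime">Twitter</a> @indiarealtime.</p>
<!-- article end -->
END

# Select everything between the two markers, independent of the author's name.
my ($post) = $html =~ m{article\s+start(.*?)article\s+end}s;
$post = '' unless defined $post;   # no markers found: nothing to scan

my @links;
for (split /\n/, $post) {
    next if $_ =~ /Twitter/;                   # skip the two footer lines
    push @links, $1 while /href="([^"]+)"/g;   # collect every link on the line
}

print "$_\n" for @links;
```

Note that skipping on /Twitter/ also drops any line in the body of a post that happens to mention Twitter, which is a trade-off of this simple filter.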

With these changes, the program ran through all the posts and fetched the links, right up to the first entry of 14-Feb-2011 - "The World Through the Eye of an Economist". A couple of evening hours (actually about four hours, running past midnight), well spent. And if you want to take a look at all the blog entries and their links, click on the rupa-subramanya-blogs-links page on my home site.

The complete source code is
By the way, Rupa is launching her book Indianomix, and even before its launch it became the Business Standard pick of the month for December. The recommendation page, where she shares space with William Dalrymple and Nassim Nicholas Taleb, says: “There’s plenty of desi-style Freakonomics here, and this book promises to be as entertaining as their journalism and other writings.”

I am looking forward to attending the Hyderabad launch (if it happens), listening to her and reading the book.
