docdist in Ruby
Surfing the net a few days ago, I bumped into CS 6.006 (Introduction to Algorithms) course lectures of of MIT taught by professors Erik Demaine and Srinivas Devadas. They illustrate the course with Python language and a program named docdist, which essentially 'computes the "distance" between two text files # as the angle between their word frequency vectors (in radians).' That evening I spent a few hours translating the program into Ruby.
Here is the base version :
I 'time'd the program. The base version was awfully slow - it took about 2 minutes 37.388 seconds.
Base version 0
----------------
$ time ./docdist-v0.rb t2.bobsey.txt t3.lewis.txt
File t2.bobsey.txt:262111 lines,49785 words,3354 distinct words
File t3.lewis.txt:1015474 lines,182355 words,8530 distinct words
The distance between the documents is: 0.574160 (radians)
real 2m37.479s
user 2m27.495s
sys 0m9.893s
Years ago, I'd switched from Python to Ruby for my glue tasks and now I felt I shouldn't have stuck with Python. Should I go back? But before giving up, I came back to the program the next day and started making a few optimizations.
Version 1
-----------
In get_words_from_line_list function, replaced
text = text.downcase.gsub(/[^[:alnum:]]/, ' ')
with
text = text.downcase.tr('^a-zA-Z0-9', ' ')
$ time ./docdist-v1.rb t2.bobsey.txt t3.lewis.txt
real 2m35.658s
user 2m26.369s
sys 0m9.217s
user + sys = 2 min 35.586 sec.
Version 2
-----------
In inner_product function, replaced
sum = sum + d1[key] * d2[key]
with
sum += d1[key] * d2[key]
$ time ./docdist-v2.rb t2.bobsey.txt t3.lewis.txt
real 2m35.404s
user 2m26.267s
sys 0m9.116s
user + sys = 2 min 35.383 sec.
Version 3
-----------
I stopped at version 2 and came back to the program the next day. I noticed that when searching or iterating through hash keys, I was getting the keys as an array and then applying include? or each on the array. I realized that Ruby offers direct key words has_key and each_key and I used them.
In count_frequency function, replaced
if d.keys.include? new_word then
with
if d.has_key? new_word then
In inner_product function, replaced
d1.keys.each do |key|
with
d1.each_key do |key|
and also replaced
if d2.keys.include? key then
with
if d2.has_key? key then
$ time ./docdist-v3.rb t2.bobsey.txt t3.lewis.txt
real 0m0.309s
user 0m0.290s
sys 0m0.018s
user + sys = 0.308 sec. This was a huge improvement in the run time and brought back my confidence in Ruby. Cool. No more harbouring thoughts of switching back to Python.
Version 4
-----------
In count_frequency function, replaced
for new_word in word_list do
with
word_list.each do |new_word|
$ time ./docdist-v4.rb t2.bobsey.txt t3.lewis.txt
real 0m0.328s
user 0m0.308s
sys 0m0.018s
user + sys = 0.326 sec. Performance degraded. So I stopped my optimization cycle, with version 3.
Moral of the story: Code written past midnight HAS to be reviewed next morning.
A couple of observations that looked weird to me:
1) When translating the Python program into Ruby, I copied the block comments with triple double-quotes as is. I thought that they (""") work as delimiters for block comments in Ruby also. I blamed myself for not knowing this basic thing in Ruby all these years. So I thought I learnt something - you could use """ for block comments.
But I could not understand why they did not work for the following comment block in inner_product function. The comments are:
Inner product between two vectors, where vectors
are repeated as dictionaries of (word, freq) pairs.
Example : inner_product({"and":3, "of":2, "the":5},
{"and":4, "in":1, "of":1, "this":2}) = 14.0
If I wrap the above block with """ delimiters, I get the following errors:
./docdist-v3.rb:71: syntax error, unexpected tIDENTIFIER, expecting kEND
Example : inner_product({"and":3, "of":2, "the":5},
^
./docdist-v3.rb:71: syntax error, unexpected tIDENTIFIER, expecting kEND
Example : inner_product({"and":3, "of":2, "the":5},
^
./docdist-v3.rb:72: syntax error, unexpected kIN, expecting kEND
... {"and":4, "in":1, "of":1, "this":2}) = 14.0
^
./docdist-v3.rb:72: syntax error, unexpected tIDENTIFIER, expecting kEND
... {"and":4, "in":1, "of":1, "this":2}) = 14.0
^
./docdist-v3.rb:72: syntax error, unexpected tIDENTIFIER, expecting kEND
..."and":4, "in":1, "of":1, "this":2}) = 14.0
But if I wrap the block with =begin and =end delimiters, Ruby takes it as a comment block.
2) If I get a profile of the run with the statement require 'profile' at the top, the program runs into a massive amount of degradation. V3 with require 'profile' at the top of the program takes the following time:
real 0m30.497s
user 0m27.765s
sys 0m2.683s
So I posted these two questions on stackoverflow.
Version 5
---------
Apparently, the triple double-quotes create a string that is never used. They were not delimiting a comment block at all. Thanks to Stefan for clarifying the mystery for me. Version 5 replaces all the """ delimited strings with =begin and =end. Run time stays the same as Version 3 --> user = 0.29 seconds and sys = 0.018 seconds for a total of 0.308 seconds.
Version 5 is my final version.
Finally, comparing with the Python run time: If I remove profiling from docdist8.py, and use the time command, the program runs on my imac in about 0.2 seconds. user = 0.183 and sys = 0.022 seconds for a total of 0.205 seconds.
$ time python docdist8.py t2.bobsey.txt t3.lewis.txt
real 0m0.207s
user 0m0.183s
sys 0m0.022s
My Ruby version is 1.8.6 and Python version is 2.5.1. As you can see, the Python version runs 50% faster. Perhaps that's a reason why the Yahoos of the world use Python.
Stefan also suggested that I use ruby-prof for profiling. ruby-prof requires 1.8.7 or higher. So I installed Ruby 1.8.7. Strangely docdist-v5.rb runs slower with Ruby 1.8.7. Also, ruby-prof is reporting zero seconds time in its output. Looks like I'll have to stick to 1.8.6 and time command for the moment. The main reason for the faster execution by Python seems to be its string.maketrans util function. Ruby provides the tr keyword, but it is not as fast as maketrans and for Ruby to catchup, perhaps it has to provide something similar. Hence, in conclusion, it seems to me that the only way to speed up the program now is to Ruby to provide something similar to Python's string.maketrans utility which will be faster than tr.
Here is the base version :
I 'time'd the program. The base version was awfully slow - it took about 2 minutes 37.388 seconds.
Base version 0
----------------
$ time ./docdist-v0.rb t2.bobsey.txt t3.lewis.txt
File t2.bobsey.txt:262111 lines,49785 words,3354 distinct words
File t3.lewis.txt:1015474 lines,182355 words,8530 distinct words
The distance between the documents is: 0.574160 (radians)
real 2m37.479s
user 2m27.495s
sys 0m9.893s
Years ago, I'd switched from Python to Ruby for my glue tasks and now I felt I shouldn't have stuck with Python. Should I go back? But before giving up, I came back to the program the next day and started making a few optimizations.
Version 1
-----------
In get_words_from_line_list function, replaced
text = text.downcase.gsub(/[^[:alnum:]]/, ' ')
with
text = text.downcase.tr('^a-zA-Z0-9', ' ')
$ time ./docdist-v1.rb t2.bobsey.txt t3.lewis.txt
real 2m35.658s
user 2m26.369s
sys 0m9.217s
user + sys = 2 min 35.586 sec.
Version 2
-----------
In inner_product function, replaced
sum = sum + d1[key] * d2[key]
with
sum += d1[key] * d2[key]
$ time ./docdist-v2.rb t2.bobsey.txt t3.lewis.txt
real 2m35.404s
user 2m26.267s
sys 0m9.116s
user + sys = 2 min 35.383 sec.
Version 3
-----------
I stopped at version 2 and came back to the program the next day. I noticed that when searching or iterating through hash keys, I was getting the keys as an array and then applying include? or each on the array. I realized that Ruby offers direct key words has_key and each_key and I used them.
In count_frequency function, replaced
if d.keys.include? new_word then
with
if d.has_key? new_word then
In inner_product function, replaced
d1.keys.each do |key|
with
d1.each_key do |key|
and also replaced
if d2.keys.include? key then
with
if d2.has_key? key then
$ time ./docdist-v3.rb t2.bobsey.txt t3.lewis.txt
real 0m0.309s
user 0m0.290s
sys 0m0.018s
user + sys = 0.308 sec. This was a huge improvement in the run time and brought back my confidence in Ruby. Cool. No more harbouring thoughts of switching back to Python.
Version 4
-----------
In count_frequency function, replaced
for new_word in word_list do
with
word_list.each do |new_word|
$ time ./docdist-v4.rb t2.bobsey.txt t3.lewis.txt
real 0m0.328s
user 0m0.308s
sys 0m0.018s
user + sys = 0.326 sec. Performance degraded. So I stopped my optimization cycle, with version 3.
Moral of the story: Code written past midnight HAS to be reviewed next morning.
A couple of observations that looked weird to me:
1) When translating the Python program into Ruby, I copied the block comments with triple double-quotes as is. I thought that they (""") work as delimiters for block comments in Ruby also. I blamed myself for not knowing this basic thing in Ruby all these years. So I thought I learnt something - you could use """ for block comments.
But I could not understand why they did not work for the following comment block in inner_product function. The comments are:
Inner product between two vectors, where vectors
are repeated as dictionaries of (word, freq) pairs.
Example : inner_product({"and":3, "of":2, "the":5},
{"and":4, "in":1, "of":1, "this":2}) = 14.0
If I wrap the above block with """ delimiters, I get the following errors:
./docdist-v3.rb:71: syntax error, unexpected tIDENTIFIER, expecting kEND
Example : inner_product({"and":3, "of":2, "the":5},
^
./docdist-v3.rb:71: syntax error, unexpected tIDENTIFIER, expecting kEND
Example : inner_product({"and":3, "of":2, "the":5},
^
./docdist-v3.rb:72: syntax error, unexpected kIN, expecting kEND
... {"and":4, "in":1, "of":1, "this":2}) = 14.0
^
./docdist-v3.rb:72: syntax error, unexpected tIDENTIFIER, expecting kEND
... {"and":4, "in":1, "of":1, "this":2}) = 14.0
^
./docdist-v3.rb:72: syntax error, unexpected tIDENTIFIER, expecting kEND
..."and":4, "in":1, "of":1, "this":2}) = 14.0
But if I wrap the block with =begin and =end delimiters, Ruby takes it as a comment block.
2) If I get a profile of the run with the statement require 'profile' at the top, the program runs into a massive amount of degradation. V3 with require 'profile' at the top of the program takes the following time:
real 0m30.497s
user 0m27.765s
sys 0m2.683s
So I posted these two questions on stackoverflow.
Version 5
---------
Apparently, the triple double-quotes create a string that is never used. They were not delimiting a comment block at all. Thanks to Stefan for clarifying the mystery for me. Version 5 replaces all the """ delimited strings with =begin and =end. Run time stays the same as Version 3 --> user = 0.29 seconds and sys = 0.018 seconds for a total of 0.308 seconds.
Version 5 is my final version.
Finally, comparing with the Python run time: If I remove profiling from docdist8.py, and use the time command, the program runs on my imac in about 0.2 seconds. user = 0.183 and sys = 0.022 seconds for a total of 0.205 seconds.
$ time python docdist8.py t2.bobsey.txt t3.lewis.txt
real 0m0.207s
user 0m0.183s
sys 0m0.022s
My Ruby version is 1.8.6 and Python version is 2.5.1. As you can see, the Python version runs 50% faster. Perhaps that's a reason why the Yahoos of the world use Python.
Stefan also suggested that I use ruby-prof for profiling. ruby-prof requires 1.8.7 or higher. So I installed Ruby 1.8.7. Strangely docdist-v5.rb runs slower with Ruby 1.8.7. Also, ruby-prof is reporting zero seconds time in its output. Looks like I'll have to stick to 1.8.6 and time command for the moment. The main reason for the faster execution by Python seems to be its string.maketrans util function. Ruby provides the tr keyword, but it is not as fast as maketrans and for Ruby to catchup, perhaps it has to provide something similar. Hence, in conclusion, it seems to me that the only way to speed up the program now is to Ruby to provide something similar to Python's string.maketrans utility which will be faster than tr.
Comments
Post a Comment