docdist in Ruby

Surfing the net a few days ago, I bumped into CS 6.006 (Introduction to Algorithms) course lectures of of MIT taught by professors Erik Demaine and Srinivas Devadas. They illustrate the course with Python language and a program named docdist, which essentially 'computes the "distance" between two text files # as the angle between their word frequency vectors (in radians).' That evening I spent a few hours translating the program into Ruby.

Here is the base version :

#!/usr/bin/ruby
def read_file(filename)
"""
Read the text file with the given filename;
return a list of the lines of text in the file.
"""
begin
f = File.open(filename, "r")
return f.read()
rescue
print "Error opening or reading input file: ", filename
exit
end
end
################################################
# Operation 2: split the text line into words ##
################################################
def get_words_from_line_list(text)
"""
Parse the given text into words.
Return list of all words found.
"""
text = text.downcase.gsub(/[^[:alnum:]]/, ' ')
word_list = text.split
return word_list
end
###############################################
# Operation 3 : count frequency of each word ##
###############################################
def count_frequency(word_list)
"""
Return a dictionary mapping words to frequency.
"""
d = {}
for new_word in word_list do
if d.keys.include? new_word then
d[new_word] = d[new_word] + 1
else
d[new_word] = 1
end
end
return d
end
############################################
# compute word frequencies for input file ##
############################################
def word_frequencies_for_file(filename)
"""
Return dictionary of (word, frequency) pairs of the given file
"""
line_list = read_file(filename)
word_list = get_words_from_line_list(line_list)
freq_mapping = count_frequency(word_list)
print "File ", filename, ":"
print line_list.length, " lines,"
print word_list.length, " words,"
print freq_mapping.length, " distinct words\n"
return freq_mapping
end
def inner_product(d1, d2)
=begin
Inner product between two vectors, where vectors
are repeated as dictionaries of (word, freq) pairs.
Example : inner_product({"and":3, "of":2, "the":5},
{"and":4, "in":1, "of":1, "this":2}) = 14.0
=end
sum = 0.0
d1.keys.each do |key|
if d2.keys.include? key then
sum = sum + d1[key] * d2[key]
end
end
return sum
end
def vector_angle(d1, d2)
"""
The input is a list of (word, freq) pairs, sorted alphabetically.
Return the angle between these two vectors.
"""
numerator = inner_product(d1, d2)
denominator = Math.sqrt(inner_product(d1, d1) * inner_product(d2, d2))
return Math.acos(numerator/denominator)
end
def main
if ARGV.length != 2 then
puts "Usage: docdist.rb filename_1 filename_2"
else
filename_1 = ARGV[0]
filename_2 = ARGV[1]
sorted_word_list_1 = word_frequencies_for_file(filename_1)
sorted_word_list_2 = word_frequencies_for_file(filename_2)
distance = vector_angle(sorted_word_list_1, sorted_word_list_2)
print "The distance between the documents is: %0.6f (radians)\n" % distance
end
end
if __FILE__ == $PROGRAM_NAME then
main
end
view raw docdist-v0.rb hosted with ❤ by GitHub
I 'time'd the program. The base version was awfully slow - it took about 2 minutes 37.388 seconds.

Base version 0
----------------
$ time ./docdist-v0.rb t2.bobsey.txt t3.lewis.txt
File t2.bobsey.txt:262111 lines,49785 words,3354 distinct words
File t3.lewis.txt:1015474 lines,182355 words,8530 distinct words
The distance between the documents is: 0.574160 (radians)
real 2m37.479s
user 2m27.495s
sys 0m9.893s

Years ago, I'd switched from Python to Ruby for my glue tasks and now I felt I shouldn't have stuck with Python. Should I go back? But before giving up, I came back to the program the next day and started making a few optimizations.

Version 1
-----------
In get_words_from_line_list function, replaced
text = text.downcase.gsub(/[^[:alnum:]]/, ' ')
with
text = text.downcase.tr('^a-zA-Z0-9', ' ')

$ time ./docdist-v1.rb t2.bobsey.txt t3.lewis.txt
real 2m35.658s
user 2m26.369s
sys 0m9.217s

user + sys = 2 min 35.586 sec.

Version 2
-----------
In inner_product function, replaced
sum = sum + d1[key] * d2[key]
with
sum += d1[key] * d2[key]

$ time ./docdist-v2.rb t2.bobsey.txt t3.lewis.txt
real 2m35.404s
user 2m26.267s
sys 0m9.116s

user + sys = 2 min 35.383 sec.

Version 3
-----------
I stopped at version 2 and came back to the program the next day. I noticed that when searching or iterating through hash keys, I was getting the keys as an array and then applying include? or each on the array. I realized that Ruby offers direct key words has_key and each_key and I used them.

In count_frequency function, replaced
if d.keys.include? new_word then
with
if d.has_key? new_word then
In inner_product function, replaced
d1.keys.each do |key|
with
d1.each_key do |key|

and also replaced
if d2.keys.include? key then
with
if d2.has_key? key then

$ time ./docdist-v3.rb t2.bobsey.txt t3.lewis.txt
real 0m0.309s
user 0m0.290s
sys 0m0.018s

user + sys = 0.308 sec. This was a huge improvement in the run time and brought back my confidence in Ruby. Cool. No more harbouring thoughts of switching back to Python.

Version 4
-----------
In count_frequency function, replaced
for new_word in word_list do
with
word_list.each do |new_word|

$ time ./docdist-v4.rb t2.bobsey.txt t3.lewis.txt
real 0m0.328s
user 0m0.308s
sys 0m0.018s

user + sys = 0.326 sec. Performance degraded. So I stopped my optimization cycle, with version 3.

Moral of the story: Code written past midnight HAS to be reviewed next morning.

A couple of observations that looked weird to me:

1) When translating the Python program into Ruby, I copied the block comments with triple double-quotes as is. I thought that they (""") work as delimiters for block comments in Ruby also. I blamed myself for not knowing this basic thing in Ruby all these years. So I thought I learnt something - you could use """ for block comments.

But I could not understand why they did not work for the following comment block in inner_product function. The comments are:
Inner product between two vectors, where vectors
are repeated as dictionaries of (word, freq) pairs.
Example : inner_product({"and":3, "of":2, "the":5},
{"and":4, "in":1, "of":1, "this":2}) = 14.0
If I wrap the above block with """ delimiters, I get the following errors:

./docdist-v3.rb:71: syntax error, unexpected tIDENTIFIER, expecting kEND
Example : inner_product({"and":3, "of":2, "the":5},
^

./docdist-v3.rb:71: syntax error, unexpected tIDENTIFIER, expecting kEND
Example : inner_product({"and":3, "of":2, "the":5},
^

./docdist-v3.rb:72: syntax error, unexpected kIN, expecting kEND
... {"and":4, "in":1, "of":1, "this":2}) = 14.0
^

./docdist-v3.rb:72: syntax error, unexpected tIDENTIFIER, expecting kEND
... {"and":4, "in":1, "of":1, "this":2}) = 14.0
^

./docdist-v3.rb:72: syntax error, unexpected tIDENTIFIER, expecting kEND
..."and":4, "in":1, "of":1, "this":2}) = 14.0

But if I wrap the block with =begin and =end delimiters, Ruby takes it as a comment block.

2) If I get a profile of the run with the statement require 'profile' at the top, the program runs into a massive amount of degradation. V3 with require 'profile' at the top of the program takes the following time:
real 0m30.497s
user 0m27.765s
sys 0m2.683s

So I posted these two questions on stackoverflow.

Version 5
---------
Apparently, the triple double-quotes create a string that is never used. They were not delimiting a comment block at all. Thanks to Stefan for clarifying the mystery for me. Version 5 replaces all the """ delimited strings with =begin and =end. Run time stays the same as Version 3 --> user = 0.29 seconds and sys = 0.018 seconds for a total of 0.308 seconds.

Version 5 is my final version.
#!/usr/bin/ruby
def read_file(filename)
=begin
Read the text file with the given filename;
return a list of the lines of text in the file.
=end
begin
f = File.open(filename, "r")
return f.read()
rescue
print "Error opening or reading input file: ", filename
exit
end
end
################################################
# Operation 2: split the text line into words ##
################################################
def get_words_from_line_list(text)
=begin
Parse the given text into words.
Return list of all words found.
=end
text = text.downcase.tr('^a-zA-Z0-9', ' ')
word_list = text.split
return word_list
end
###############################################
# Operation 3 : count frequency of each word ##
###############################################
def count_frequency(word_list)
=begin
Return a dictionary mapping words to frequency.
=end
d = {}
for new_word in word_list do
if d.has_key? new_word then
d[new_word] = d[new_word] + 1
else
d[new_word] = 1
end
end
return d
end
############################################
# compute word frequencies for input file ##
############################################
def word_frequencies_for_file(filename)
=begin
Return dictionary of (word, frequency) pairs of the given file
=end
line_list = read_file(filename)
word_list = get_words_from_line_list(line_list)
freq_mapping = count_frequency(word_list)
print "File ", filename, ":"
print line_list.length, " lines,"
print word_list.length, " words,"
print freq_mapping.length, " distinct words\n"
return freq_mapping
end
def inner_product(d1, d2)
=begin
Inner product between two vectors, where vectors
are repeated as dictionaries of (word, freq) pairs.
Example : inner_product({"and":3, "of":2, "the":5},
{"and":4, "in":1, "of":1, "this":2}) = 14.0
=end
sum = 0.0
d1.each_key do |key|
if d2.has_key? key then
sum += d1[key] * d2[key]
end
end
return sum
end
def vector_angle(d1, d2)
=begin
The input is a list of (word, freq) pairs, sorted alphabetically.
Return the angle between these two vectors.
=end
numerator = inner_product(d1, d2)
denominator = Math.sqrt(inner_product(d1, d1) * inner_product(d2, d2))
return Math.acos(numerator/denominator)
end
def main
if ARGV.length != 2 then
puts "Usage: docdist.rb filename_1 filename_2"
else
filename_1 = ARGV[0]
filename_2 = ARGV[1]
sorted_word_list_1 = word_frequencies_for_file(filename_1)
sorted_word_list_2 = word_frequencies_for_file(filename_2)
distance = vector_angle(sorted_word_list_1, sorted_word_list_2)
print "The distance between the documents is: %0.6f (radians)\n" % distance
end
end
if __FILE__ == $PROGRAM_NAME then
main
end
view raw docdist-v5.rb hosted with ❤ by GitHub
Finally, comparing with the Python run time: If I remove profiling from docdist8.py, and use the time command, the program runs on my imac in about 0.2 seconds. user = 0.183 and sys = 0.022 seconds for a total of 0.205 seconds.
$ time python docdist8.py t2.bobsey.txt t3.lewis.txt
real 0m0.207s
user 0m0.183s
sys 0m0.022s

My Ruby version is 1.8.6 and Python version is 2.5.1. As you can see, the Python version runs 50% faster. Perhaps that's a reason why the Yahoos of the world use Python.

Stefan also suggested that I use ruby-prof for profiling. ruby-prof requires 1.8.7 or higher. So I installed Ruby 1.8.7. Strangely docdist-v5.rb runs slower with Ruby 1.8.7. Also, ruby-prof is reporting zero seconds time in its output. Looks like I'll have to stick to 1.8.6 and time command for the moment. The main reason for the faster execution by Python seems to be its string.maketrans util function. Ruby provides the tr keyword, but it is not as fast as maketrans and for Ruby to catchup, perhaps it has to provide something similar. Hence, in conclusion, it seems to me that the only way to speed up the program now is to Ruby to provide something similar to Python's string.maketrans utility which will be faster than tr.

Comments

Popular posts from this blog

Mentoring Trainees In Java

27th

Fetching Blogger Posts With Python