Term Frequency with Transducers gem in Ruby
11 Oct 2014Yesterday, Cognitect announced about a Ruby implementation among others of the transducers concept that Rich Hickey talked about at StrangeLoop.
After watching the talk a number of times and reading a little the source of the gem, I gave it a shot at using it to solve the term frequency exercise from the Exercises in Programming Styles book.
Transducers usage in Ruby
The full example looks as follows: (explanation of the steps is further below).
require 'transducers'
SRC_ROOT = File.join(File.expand_path("../../.."), "src", "exercises-in-programming-style")
PRIDE_AND_PREJUDICE = File.join(SRC_ROOT, "pride-and-prejudice.txt")
STOP_WORDS = File.join(SRC_ROOT, 'stop_words.txt')
# load words to ignore
stop_words = File.read(STOP_WORDS).split(',')
stop_words << ('a'..'z').to_a # also alphabet
stop_words.flatten!.uniq!
lines = File.read(PRIDE_AND_PREJUDICE).lines
frequency_counter = Class.new do
def step(words_frequency, words)
words.each { |w| words_frequency[w.downcase] += 1 }
words_frequency
end
def complete(result)
result.sort {|a,b| a[1] <=> b[1]}.reverse
end
end.new
t = Transducers.compose(
Transducers.map do |line|
line.gsub!(/[^a-zA-Z0-9]/, " ") # remove non alphanumeric
line.split(' ') # split into words
end,
Transducers.map do |words| # remove stop words
words.delete_if {|w| stop_words.include?(w.downcase) }
end)
words_frequency = Hash.new {|h,k| h[k] = 0 }
words_frequency_list = Transducers.transduce(t, frequency_counter, words_frequency, lines)
# Print the top 25 terms
words_frequency_list.to_a[0...25].each do |k, v|
puts "#{k} - #{v}"
end
Going over the steps of the implementatoin
Dependencies and definitions
Gemfile
source "http://rubygems.org"
gem 'transducers'
Imports
require 'transducers'
Location of the files
SRC_ROOT = File.join(File.expand_path("../../.."), "src", "exercises-in-programming-style")
PRIDE_AND_PREJUDICE = File.join(SRC_ROOT, "pride-and-prejudice.txt")
STOP_WORDS = File.join(SRC_ROOT, 'stop_words.txt')
Stop words
The stop words are the list of words and single letters that appear in the test that we decide to ignore.
# load words to ignore
stop_words = File.read(STOP_WORDS).split(',')
stop_words << ('a'..'z').to_a # also alphabet
stop_words.flatten!.uniq!
Using Transducers
In order to determine the term frequency in the pride and prejudice text, we will do the following transformations:
- Create a collection of the lines from the raw body of text
- Remove non-alphanumeric characters from the line
- Split the line into words
- Remove words which are considered to be stop words
Then as part of the reducing step:
- Increment counter in
words_frequency
hash each time the word appears - Sort the words by frequency
- Reverse the list
Then from this list, we will take the top 25 terms and print them to stdout.
The Collection
The collection upon which we will be applying the transformations are each one of the lines that exist in the text.
lines = File.read(PRIDE_AND_PREJUDICE).lines
The Reducer
From the source:
A reducer is an object with a `step` method that takes a result (so far) and an input and returns a new result. This is similar to the blocks we pass to Ruby’s `reduce` (a.k.a `inject`), and serves a similar role in transducing process.
As part of its reducing step, the reducer will receive a list of words
in a line, then iterate over this list and increment the counter in
the words_frequency
hash.
Then, once the words_frequency
has been computed, we returned the
result sorted by frequency in descending order.
frequency_counter = Class.new do
def step(words_frequency, words)
words.each { |w| words_frequency[w.downcase] += 1 }
words_frequency
end
def complete(result)
result.sort {|a,b| a[1] <=> b[1]}.reverse
end
end.new
The Transducer
Again from the source:
A handler is an object with a `call` method that a reducer uses to process input. In a `map` operation, this would transform the input, and in a `filter` operation it would act as a predicate.
A transducer is an object that transforms a reducer by adding additional processing for each element in a collection of inputs.
A transducing process is invoked by calling `Transducers.transduce` with a transducer, a reducer, an optional initial value, and an input collection.
Our transducer will compose the functions which filter the non-alphanumeric characters that exists on a line, as well as words that should be ignored from the text.
This results in the reducer receiving a list of words to compute the term frequency list.
t = Transducers.compose(
Transducers.map do |line|
line.gsub!(/[^a-zA-Z0-9]/, " ") # remove non alphanumeric
line.split(' ') # split into words
end,
Transducers.map do |words| # remove stop words
words.delete_if {|w| stop_words.include?(w.downcase) }
end
)
words_frequency = Hash.new {|h,k| h[k] = 0 }
words_frequency_list = Transducers.transduce(t, frequency_counter, words_frequency, lines)
Printing the results
To verify the results, we only want to check which were the top 25 terms that have the highest frequency
# Print the top 25 terms
words_frequency_list.to_a[0...25].each do |k, v|
puts "#{k} - #{v}"
end