Term Frequency with Transducers gem in Ruby

Yesterday, Cognitect announced about a Ruby implementation among others of the transducers concept that Rich Hickey talked about at StrangeLoop.

After watching the talk a number of times and reading a little the source of the gem, I gave it a shot at using it to solve the term frequency exercise from the Exercises in Programming Styles book.

Transducers usage in Ruby

The full example looks as follows: (explanation of the steps is further below).

require 'transducers'

SRC_ROOT = File.join(File.expand_path("../../.."), "src", "exercises-in-programming-style")
PRIDE_AND_PREJUDICE   = File.join(SRC_ROOT, "pride-and-prejudice.txt")
STOP_WORDS            = File.join(SRC_ROOT, 'stop_words.txt')

# load words to ignore
stop_words = File.read(STOP_WORDS).split(',')
stop_words << ('a'..'z').to_a # also alphabet
stop_words.flatten!.uniq!

lines = File.read(PRIDE_AND_PREJUDICE).lines
frequency_counter = Class.new do
  def step(words_frequency, words)
    words.each { |w| words_frequency[w.downcase] += 1 }
    words_frequency
  end

  def complete(result)
    result.sort {|a,b| a[1] <=> b[1]}.reverse
  end
end.new

t = Transducers.compose(
      Transducers.map do |line|
        line.gsub!(/[^a-zA-Z0-9]/, " ") # remove non alphanumeric
        line.split(' ')                 # split into words
      end,
      Transducers.map do |words|        # remove stop words
        words.delete_if {|w| stop_words.include?(w.downcase) }
      end)

words_frequency  = Hash.new {|h,k| h[k] = 0 }
words_frequency_list = Transducers.transduce(t, frequency_counter, words_frequency, lines)

# Print the top 25 terms
words_frequency_list.to_a[0...25].each do |k, v|
  puts "#{k}  -  #{v}"
end

Going over the steps of the implementatoin

Dependencies and definitions

Gemfile

source "http://rubygems.org"

gem 'transducers'

Imports

require 'transducers'

Location of the files

SRC_ROOT = File.join(File.expand_path("../../.."), "src", "exercises-in-programming-style")
PRIDE_AND_PREJUDICE   = File.join(SRC_ROOT, "pride-and-prejudice.txt")
STOP_WORDS            = File.join(SRC_ROOT, 'stop_words.txt')

Stop words

The stop words are the list of words and single letters that appear in the test that we decide to ignore.

# load words to ignore
stop_words = File.read(STOP_WORDS).split(',')
stop_words << ('a'..'z').to_a # also alphabet
stop_words.flatten!.uniq!

Using Transducers

In order to determine the term frequency in the pride and prejudice text, we will do the following transformations:

  1. Create a collection of the lines from the raw body of text
  2. Remove non-alphanumeric characters from the line
  3. Split the line into words
  4. Remove words which are considered to be stop words

Then as part of the reducing step:

  1. Increment counter in words_frequency hash each time the word appears
  2. Sort the words by frequency
  3. Reverse the list

Then from this list, we will take the top 25 terms and print them to stdout.

The Collection

The collection upon which we will be applying the transformations are each one of the lines that exist in the text.

lines = File.read(PRIDE_AND_PREJUDICE).lines

The Reducer

From the source:

A reducer is an object with a `step` method that takes a result (so far) and an input and returns a new result. This is similar to the blocks we pass to Ruby’s `reduce` (a.k.a `inject`), and serves a similar role in transducing process.

As part of its reducing step, the reducer will receive a list of words in a line, then iterate over this list and increment the counter in the words_frequency hash.

Then, once the words_frequency has been computed, we returned the result sorted by frequency in descending order.

frequency_counter = Class.new do
  def step(words_frequency, words)
    words.each { |w| words_frequency[w.downcase] += 1 }
    words_frequency
  end

  def complete(result)
     result.sort {|a,b| a[1] <=> b[1]}.reverse
  end
end.new

The Transducer

Again from the source:

A handler is an object with a `call` method that a reducer uses to process input. In a `map` operation, this would transform the input, and in a `filter` operation it would act as a predicate.

A transducer is an object that transforms a reducer by adding additional processing for each element in a collection of inputs.

A transducing process is invoked by calling `Transducers.transduce` with a transducer, a reducer, an optional initial value, and an input collection.

Our transducer will compose the functions which filter the non-alphanumeric characters that exists on a line, as well as words that should be ignored from the text.

This results in the reducer receiving a list of words to compute the term frequency list.

t = Transducers.compose(
  Transducers.map do |line|
    line.gsub!(/[^a-zA-Z0-9]/, " ") # remove non alphanumeric
    line.split(' ')                 # split into words
  end,
  Transducers.map do |words|        # remove stop words
    words.delete_if {|w| stop_words.include?(w.downcase) }
  end
)

words_frequency  = Hash.new {|h,k| h[k] = 0 }
words_frequency_list = Transducers.transduce(t, frequency_counter, words_frequency, lines)

Printing the results

To verify the results, we only want to check which were the top 25 terms that have the highest frequency

# Print the top 25 terms
words_frequency_list.to_a[0...25].each do |k, v|
  puts "#{k}  -  #{v}"
end