Basic NATS

There aren’t many docs out there on NATS even though it is so awesome (and fast) so I thought I would share some notes on what it is needed to implement a basic client for NATS, (and maybe one day write a client for it…)

Connecting and receiving INFO

Let’s try a simple connecting test:

export msgcounter=0
nc 127.0.0.1 4222 | while read msg; do
  msgcounter=`expr $msgcounter + 1`
  echo "$msgcounter :: " $msg
done

We can see that as soon as we establish the connection, the server will respond with an INFO message, and then after 2 minutes, it will send its first PING message to check the connection with the client.

The server will try to send 3 more PING messages to the client, and on the 4th message, it will close the connection.

[2014-11-16T14:17:12 -0500] run-nats-server -- 2014/11/16 14:17:12 [INFO] Starting gnatsd version 0.5.6
[2014-11-16T14:17:12 -0500] run-nats-server -- 2014/11/16 14:17:12 [INFO] Listening for client connections on 0.0.0.0:4222
[2014-11-16T14:17:12 -0500] run-nats-server -- 2014/11/16 14:17:12 [INFO] gnatsd is ready
[2014-11-16T14:17:12 -0500] client          -- 1 ::  INFO {"server_id":"da9955fe007233fa724ec91a8978c44d","version":"0.5.6","host":"0.0.0.0","port":4222,"auth_required":false,"ssl_required":false,"max_payload":1048576} 
[2014-11-16T14:19:12 -0500] client          -- 2 ::  PING
[2014-11-16T14:21:12 -0500] client          -- 3 ::  PING
[2014-11-16T14:23:12 -0500] client          -- 4 ::  -ERR 'Stale Connection'

PING <-> PONG

After we have received the first PING, it is needed to reply with another PONG:

To keep on using netcat for this examples, we will be relying on named pipes and a multiprocess run. Then, we will be using the named pipe to feed the PONG when we receive a PING from NATS, otherwise the connection would be closed.

[[ -e nats-in ]]  || mkfifo nats-in
[[ -e nats-out ]] || mkfifo nats-out

echo "Start client..."
nc 127.0.0.1 4222 <nats-in >nats-out &

# open nats-in for writing  on fd 3
# open nats-out for reading on fd 4
exec 3> nats-in 4<nats-out

export msgcounter=0
while read msg <&4; do
  echo "GOT ::" $msg

  export msgcounter=`expr $msgcounter + 1`
  echo "$msgcounter messages received"

  # process ping
  echo $msg | grep -q PING && echo "PONG" >&3 ;

done
Sample output
gnatsd		-- [INFO] Starting gnatsd version 0.5.7
gnatsd		-- [INFO] Listening for client connections on 0.0.0.0:4222
gnatsd		-- [INFO] gnatsd is ready
nats-client	-- Start client..
gnatsd		-- [DEBUG] Client connection created%!(EXTRA *server.client=127.0.0.1:58790 - cid:1)
nats-client	-- 1 messages received
nats-client	-- GOT :: INFO {"server_id":"93f5d02d3b613ec2f6135e4ffcee199a","version":"0.5.7","host":"0.0.0.0","port":4222,"auth_required":false,"ssl_required":false,"max_payload":1048576} 
gnatsd		-- [DEBUG] Client Ping Timer%!(EXTRA *server.client=127.0.0.1:58790 - cid:1)
nats-client    -- 2 messages received
nats-client    -- GOT :: PING
gnatsd		-- [DEBUG] Client Ping Timer%!(EXTRA *server.client=127.0.0.1:58790 - cid:1)
nats-client    -- 3 messages received
nats-client    -- GOT :: PING
gnatsd		-- [DEBUG] Client Ping Timer%!(EXTRA *server.client=127.0.0.1:58790 - cid:1)
nats-client    -- 4 messages received
nats-client    -- GOT :: PING
gnatsd		-- [DEBUG] Client Ping Timer%!(EXTRA *server.client=127.0.0.1:58790 - cid:1)
nats-client    -- 5 messages received
nats-client    -- GOT :: PING
gnatsd		-- [DEBUG] Client Ping Timer%!(EXTRA *server.client=127.0.0.1:58790 - cid:1)
nats-client    -- 6 messages received
nats-client    -- GOT :: PING
gnatsd		-- [DEBUG] Client Ping Timer%!(EXTRA *server.client=127.0.0.1:58790 - cid:1)
nats-client    -- 7 messages received
nats-client    -- GOT :: PING
gnatsd		-- [DEBUG] Client Ping Timer%!(EXTRA *server.client=127.0.0.1:58790 - cid:1)
nats-client    -- 8 messages received
nats-client    -- GOT :: PING
gnatsd		-- [DEBUG] Client Ping Timer%!(EXTRA *server.client=127.0.0.1:58790 - cid:1)
nats-client    -- 9 messages received
nats-client    -- GOT :: PING

PUB <-> SUB

As we have seen, we need to constantly be sending PING otherwise the server will close the connection.

Now let’s try creating a subscription!

Another step that we need to handle is CONNECT, so we do that before starting a subscription. After that, we subscribe to the hello.world channel:

echo 'CONNECT {"verbose":false,"pedantic":false}' > nats-in
echo "SUB hello.world  2" > nats-in

Now if we try to send a message to this channel by nats-pub

nats-pub hello.world -s nats://127.0.0.1:4222 "hoge hoge"

…we get the following displayed on our netcat console:

[2014-11-16T23:14:20 -0500] nats-client    -- GOT :: MSG hello.world 2 9
[2014-11-16T23:14:20 -0500] nats-client    -- GOT :: hoge hoge

The 9 here represents the number of letters that were in the MSG. Now let’s make them cooperate and send a PUB that says fuga fuga soon after it gets the first message:

[[ -e nats-in ]]  || mkfifo nats-in
[[ -e nats-out ]] || mkfifo nats-out

# cat > nats-in
echo "Start client.."
nc 127.0.0.1 4222 <nats-in >nats-out &

# open nats-in for writing  on fd 3
# open nats-out for reading on fd 4
exec 3> nats-in 4<nats-out

export pingcounter=0
while read msg <&4; do

  export pingcounter=`expr $pingcounter + 1`
  echo "$pingcounter messages received"
  echo "GOT :: $msg"

  # As soon as we get INFO, send CONNECT and subscriptions
  echo $msg | grep -q INFO && {
    echo 'CONNECT {"verbose":false,"pedantic":false}' > nats-in
    echo "SUB hello.world  2" > nats-in
  }

  # respond to PING
  echo $msg | grep -q PING && echo "PONG" > nats-in

  # respond to MSG hello.world
  echo $msg | grep -q "MSG hello.world" && {
    echo "PONG" > nats-in
    echo -ne 'PUB hello.world  9\r\nfuga fuga\r\n' > nats-in
    sleep 1
  }

done
Using the special channel ’>

Here we are subscribing to all channels by using >, which is kind of a special channel to match all:

echo 'CONNECT {"verbose":false,"pedantic":false}' > nats-in
echo "SUB >  2" > nats-in

Replying to an INBOX

Once having covered subscriptions and publishing, we can try creating an _INBOX to support requests.

Agents subscribed to the channel will get this message and respond to to it, but from the client side it is possible to stop receiving once we got enough.

$ nats-request hello.world -s nats://127.0.0.1:4222 -n 1

SUB _INBOX.46bdca94dd22f452e836b5e1f6  2\r\n"
PUB hello.world _INBOX.46bdca94dd22f452e836b5e1f6 11\r\nHello World\r\n"

[#1] Replied with : 'trying to help'

And in our client we handle it as

MSG hello.world 2 _INBOX.e16d52026b72575471357cc17d 11

The client for this would look like this:

[[ -e nats-in ]]  || mkfifo nats-in
[[ -e nats-out ]] || mkfifo nats-out

echo "Start client.."
nc 127.0.0.1 4222 <nats-in >nats-out &

# open nats-in for writing  on fd 3
# open nats-out for reading on fd 4
exec 3> nats-in 4<nats-out

export pingcounter=0
cat nats-out | while read msg; do

  export pingcounter=`expr $pingcounter + 1`
  echo "$pingcounter messages received"
  echo "GOT :: $msg"

  # As soon as we get INFO, send CONNECT and subscriptions
  echo $msg | grep -q INFO && {
    echo 'CONNECT {"verbose":false,"pedantic":false}' > nats-in
    echo "SUB hello.world  2" > nats-in
  }

  # respond to PING
  echo $msg | grep -q PING && echo "PONG" > nats-in

  # respond to MSG hello.world
  echo $msg | grep -q "MSG hello.world" && {
    echo "PONG" > nats-in
    echo -ne "PUB hello.world  9\r\nfuga fuga\r\n" > nats-in
    sleep 1
  }

  # process MSG hello.world requests
  echo $msg | grep -q "^MSG hello.world..._INBOX" && {
    echo "PONG" > nats-in
    inbox=`echo $msg | awk '{print $4}'`
    echo -ne "PUB $inbox 14\r\ntrying to help\r\n" > nats-in
  }

done

…which makes up for an extremely basic version of a NATS client :)

gnatsd		-- [DEBUG] Client Ping Timer%!(EXTRA *server.client=127.0.0.1:58802 - cid:7)
nats-client	-- 134 messages received
nats-client	-- GOT :: fuga fuga
nats-client	-- 135 messages received
nats-client	-- GOT :: MSG hello.world 2 9
gnatsd		-- [DEBUG] Client Ping Timer%!(EXTRA *server.client=127.0.0.1:58790 - cid:1)
nats-client	-- 136 messages received
nats-client	-- GOT :: fuga fuga
nats-client	-- 137 messages received

Counting words with NATS

Still going through the Exercises in Programming Styles, after diving into Haskell and Transducers for a bit, and since I’ve mentioned to other HackerSchoolers about NATS, I decided to use it for the map reduce programming example. It is somewhat loosely based on the dist-adder example from Derek.

NATS server

There is readily available gnatsd Docker container that we can use to start the server that will be used by the clients to communicate:

sudo docker run -p 4222:4222 apcera/gnatsd

Aggregator

I call it the aggregator, but basically this process will be in charge of dispatching a range of the lines in the text and dispatch the task to another node so that it computes the word frequency computation.

require 'nats/client'
require 'json'

NATS.start {

  # Run options
  $stdout.sync = true
  ["TERM", "INT"].each { |sig| trap(sig) { NATS.stop } }
  SRC_ROOT   = File.join(File.expand_path("."), "src", "exercises-in-programming-style")
  PRIDE_AND_PREJUDICE = File.join(SRC_ROOT, "pride-and-prejudice.txt")
  STOP_WORDS = File.join(SRC_ROOT, 'stop_words.txt')

  # Compute the stop words once.
  # This payload information is small enough that it will be transmitted
  # to the frequency counters via the channel
  @stop_words = File.read(STOP_WORDS).split(',')
  @stop_words << ('a'..'z').to_a # also the alphabet
  @stop_words.flatten!.uniq!

  # Initialize
  @words = Hash.new {|h,k| h[k] = 0 }
  @available_computing_nodes = []

  # Discovery Channel
  NATS.subscribe('pride-prejudice.discovery') do |msg, reply, sub|
    computing_node = JSON.parse(msg)
    unless @available_computing_nodes.include?(computing_node)
      puts "[DISCOVERED]      :: #{computing_node}"
      @available_computing_nodes << computing_node
      puts "[AVAILABLE NODES] :: #{@available_computing_nodes.count}"
    end
  end

  # {"id"=>3, "results"=>{"words"=>[{"test"=>1}]}}
  NATS.subscribe('pride-prejudice.responses') do |msg, reply, sub|
    results = JSON.parse(msg)
    puts "[DONE]      :: Job #{results['id']} is done."

    # Mark the job as done
    @chunks[results['id']][:done]    = true

    begin
      # Use the partial results and start to count the words
      counted_words = results['results']['words']
      counted_words.each_pair do |w, c|
        @words[w] += c
      end
    rescue => e
      puts "Error while trying to count the words..."
      puts e
      puts e.backtrace
    end

    puts "TOP counted words so far"
    @words.sort {|a,b| a[1] <=> b[1]}.reverse[0...25].each do |k, v|
      puts "#{k}  -  #{v}"
    end
  end

  puts "Waiting 5 seconds to get resources for the job..."
  EM.add_timer(5) do
    pride_and_prejudice_text  = File.read(PRIDE_AND_PREJUDICE)
    total_lines = pride_and_prejudice_text.lines.count
    puts "Total lines to split: #{total_lines}"

    # Most likely cannot split the computation perfectly into the number of nodes,
    # so we take the remaining lines and add them to the first batch
    chunk_size = total_lines / (@available_computing_nodes.count)
    out_of_chunk = total_lines % @available_computing_nodes.count
    puts "Chunk size per node: #{chunk_size}"

    # Read the file, count the number of lines, and divide in chunks
    # according to the number of available nodes
    @chunks = {} # {index => {:start, :end, :done, :stop_words }}
    chunk_start = 0
    chunk_end   = 0
    1.upto(@available_computing_nodes.count) do |n|
      chunk_end += chunk_size
      if out_of_chunk > 0
        chunk_size += out_of_chunk
        out_of_chunk = 0
      end
      chunk_end  = [chunk_end, total_lines].min
      @chunks[n]  = {:start => chunk_start, :end => chunk_end, :done => false, :stop_words => @stop_words }
      chunk_start = chunk_end + 1
    end

    @chunks.each do |job|
      job_id, range = job

      # Only want one checker to respond to this
      NATS.request('pride-prejudice.requests', nil, :max => 1) do |response|
        node = JSON.parse(response)
        puts "[REQUEST]   :: Job ##{job_id} needs to be done. Anyone can help? Range is (#{range[:start]}:#{range[:end]})"
        NATS.publish("pride-prejudice.#{node['id']}.compute", job.to_json) do
          puts "[HOPING]    :: #{range[:start]} -- #{range[:end]} to be done by #{node['id']}."
        end
      end
    end
  end
}

Word Frequency Counter

Then each one of the counters, will receive a chunk of words to process and ignore, then reply with the partial computed frequency when done.

While receiving tasks, we counter taints itself a little according to the number of tasks that it is in charge of processing.

require 'nats/client'
require 'securerandom'
require 'json'

$stdout.sync = true
["TERM", "INT"].each { |sig| trap(sig) { NATS.stop } }
SRC_ROOT = File.join(File.expand_path("."), "src", "exercises-in-programming-style")
PRIDE_AND_PREJUDICE = File.join(SRC_ROOT, "pride-and-prejudice.txt")

ID   = SecureRandom.uuid
INFO = {'id' => ID }

def compute(range)
  range_start     = range['start'].to_i
  range_end       = range['end'].to_i
  stop_words      = range['stop_words']
  words_frequency = Hash.new {|h,k| h[k] = 0 }

  # Read local copy of the document and fetch that range of lines
  lines = File.read(PRIDE_AND_PREJUDICE).lines[range_start..range_end]
  lines.each do |line|
    line.gsub!(/[^a-zA-Z0-9]/, " ") # remove non alphanumeric
    words = line.split(" ")
    words.each do |w|
      next if stop_words.include?(w.downcase)
      words_frequency[w.downcase] += 1
    end
  end

  results = {'words' => words_frequency }

  results
end

NATS.start do

  @offerings = 0

  EM.add_periodic_timer(1) do
    NATS.publish('pride-prejudice.discovery', INFO.to_json)
  end

  NATS.subscribe('pride-prejudice.requests') do |msg, reply, sub|
    EM.add_timer(@offerings) { NATS.publish(reply, INFO.to_json) }
    @offerings += 1 # decrease taint delay
  end

  NATS.subscribe("pride-prejudice.#{ID}.compute") do |msg, reply, sub|
    job = JSON.parse(msg)

    job_id, range = job
    puts "[OK]        :: Start to work on (#{range['start']}:#{range['end']})"
    results = compute(range)
    @offerings -= 1 # delay ourselves according to the number of task being done

    job_done = {
     :id      => job_id,
     :results => results
    }
    NATS.publish("pride-prejudice.responses", job_done.to_json)
  end
end

Output

Once the run is done the output would look something similar to this:

nats-server         -- started with pid 15660
aggregator          -- Waiting 5 seconds to get resources for the job...
aggregator          -- [DISCOVERED]      :: {"id"=>"251adacb-0605-4e96-8904-9fd5f239fce2"}
aggregator          -- [AVAILABLE NODES] :: 1
aggregator          -- [DISCOVERED]      :: {"id"=>"a97518f9-2c10-4362-9800-5ec8e57651b9"}
aggregator          -- [AVAILABLE NODES] :: 2
aggregator          -- [DISCOVERED]      :: {"id"=>"4ce45026-3cba-4209-a17b-e27e425366c0"}
aggregator          -- [AVAILABLE NODES] :: 3
aggregator          -- Total lines to split: 13426
aggregator          -- Chunk size per node: 4475
aggregator          -- [REQUEST]   :: Job #1 needs to be done. Anyone can help? Range is (0:4475)
aggregator          -- [HOPING]    :: 0 -- 4475 to be done by 4ce45026-3cba-4209-a17b-e27e425366c0.
frequency-counter:1 -- [OK]        :: Start to work on (0:4475)
aggregator          -- [DONE]      :: Job 1 is done.
aggregator          -- TOP counted words so far
aggregator          -- mr  -  395
aggregator          -- elizabeth  -  202
aggregator          -- very  -  198
aggregator          -- bingley  -  181
aggregator          -- darcy  -  162
aggregator          -- bennet  -  160
aggregator          -- miss  -  142
aggregator          -- such  -  131
aggregator          -- much  -  130
aggregator          -- mrs  -  121
aggregator          -- jane  -  105
aggregator          -- more  -  102
aggregator          -- collins  -  95
aggregator          -- one  -  95
aggregator          -- though  -  77
aggregator          -- think  -  74
aggregator          -- being  -  73
aggregator          -- know  -  72
aggregator          -- never  -  71
aggregator          -- lady  -  70
aggregator          -- well  -  69
aggregator          -- good  -  67
aggregator          -- man  -  65
aggregator          -- soon  -  64
aggregator          -- before  -  63
aggregator          -- [REQUEST]   :: Job #2 needs to be done. Anyone can help? Range is (4476:8951)
aggregator          -- [HOPING]    :: 4476 -- 8951 to be done by 251adacb-0605-4e96-8904-9fd5f239fce2.
frequency-counter:3 -- [OK]        :: Start to work on (4476:8951)
aggregator          -- [DONE]      :: Job 2 is done.
aggregator          -- TOP counted words so far
aggregator          -- mr  -  630
aggregator          -- elizabeth  -  437
aggregator          -- very  -  378
aggregator          -- darcy  -  327
aggregator          -- such  -  251
aggregator          -- bingley  -  245
aggregator          -- much  -  243
aggregator          -- miss  -  240
aggregator          -- mrs  -  239
aggregator          -- more  -  218
aggregator          -- bennet  -  206
aggregator          -- one  -  187
aggregator          -- jane  -  185
aggregator          -- herself  -  173
aggregator          -- collins  -  167
aggregator          -- think  -  152
aggregator          -- lady  -  149
aggregator          -- before  -  146
aggregator          -- though  -  143
aggregator          -- well  -  141
aggregator          -- never  -  140
aggregator          -- sister  -  139
aggregator          -- little  -  136
aggregator          -- soon  -  133
aggregator          -- know  -  133
aggregator          -- [REQUEST]   :: Job #3 needs to be done. Anyone can help? Range is (8952:13426)
aggregator          -- [HOPING]    :: 8952 -- 13426 to be done by 251adacb-0605-4e96-8904-9fd5f239fce2.
frequency-counter:3 -- [OK]        :: Start to work on (8952:13426)
aggregator          -- [DONE]      :: Job 3 is done.
aggregator          -- TOP counted words so far
aggregator          -- mr  -  786
aggregator          -- elizabeth  -  635
aggregator          -- very  -  488
aggregator          -- darcy  -  418
aggregator          -- such  -  395
aggregator          -- mrs  -  343
aggregator          -- much  -  329
aggregator          -- more  -  327
aggregator          -- bennet  -  323
aggregator          -- bingley  -  306
aggregator          -- jane  -  295
aggregator          -- miss  -  283
aggregator          -- one  -  275
aggregator          -- know  -  239
aggregator          -- before  -  229
aggregator          -- herself  -  227
aggregator          -- though  -  226
aggregator          -- well  -  224
aggregator          -- never  -  220
aggregator          -- sister  -  218
aggregator          -- soon  -  216
aggregator          -- think  -  211
aggregator          -- now  -  209
aggregator          -- time  -  203
aggregator          -- good  -  201

Further work

  • Extend this example so that it becomes resilient to fault tolerance, and see how does defending against this impact the style.
  • Dispatch to other nodes with using a different client.

Term Frequency with Transducers gem in Ruby

Yesterday, Cognitect announced about a Ruby implementation among others of the transducers concept that Rich Hickey talked about at StrangeLoop.

After watching the talk a number of times and reading a little the source of the gem, I gave it a shot at using it to solve the term frequency exercise from the Exercises in Programming Styles book.

Transducers usage in Ruby

The full example looks as follows: (explanation of the steps is further below).

require 'transducers'

SRC_ROOT = File.join(File.expand_path("../../.."), "src", "exercises-in-programming-style")
PRIDE_AND_PREJUDICE   = File.join(SRC_ROOT, "pride-and-prejudice.txt")
STOP_WORDS            = File.join(SRC_ROOT, 'stop_words.txt')

# load words to ignore
stop_words = File.read(STOP_WORDS).split(',')
stop_words << ('a'..'z').to_a # also alphabet
stop_words.flatten!.uniq!

lines = File.read(PRIDE_AND_PREJUDICE).lines
frequency_counter = Class.new do
  def step(words_frequency, words)
    words.each { |w| words_frequency[w.downcase] += 1 }
    words_frequency
  end

  def complete(result)
    result.sort {|a,b| a[1] <=> b[1]}.reverse
  end
end.new

t = Transducers.compose(
      Transducers.map do |line|
        line.gsub!(/[^a-zA-Z0-9]/, " ") # remove non alphanumeric
        line.split(' ')                 # split into words
      end,
      Transducers.map do |words|        # remove stop words
        words.delete_if {|w| stop_words.include?(w.downcase) }
      end)

words_frequency  = Hash.new {|h,k| h[k] = 0 }
words_frequency_list = Transducers.transduce(t, frequency_counter, words_frequency, lines)

# Print the top 25 terms
words_frequency_list.to_a[0...25].each do |k, v|
  puts "#{k}  -  #{v}"
end

Going over the steps of the implementatoin

Dependencies and definitions

Gemfile

source "http://rubygems.org"

gem 'transducers'

Imports

require 'transducers'

Location of the files

SRC_ROOT = File.join(File.expand_path("../../.."), "src", "exercises-in-programming-style")
PRIDE_AND_PREJUDICE   = File.join(SRC_ROOT, "pride-and-prejudice.txt")
STOP_WORDS            = File.join(SRC_ROOT, 'stop_words.txt')

Stop words

The stop words are the list of words and single letters that appear in the test that we decide to ignore.

# load words to ignore
stop_words = File.read(STOP_WORDS).split(',')
stop_words << ('a'..'z').to_a # also alphabet
stop_words.flatten!.uniq!

Using Transducers

In order to determine the term frequency in the pride and prejudice text, we will do the following transformations:

  1. Create a collection of the lines from the raw body of text
  2. Remove non-alphanumeric characters from the line
  3. Split the line into words
  4. Remove words which are considered to be stop words

Then as part of the reducing step:

  1. Increment counter in words_frequency hash each time the word appears
  2. Sort the words by frequency
  3. Reverse the list

Then from this list, we will take the top 25 terms and print them to stdout.

The Collection

The collection upon which we will be applying the transformations are each one of the lines that exist in the text.

lines = File.read(PRIDE_AND_PREJUDICE).lines

The Reducer

From the source:

A reducer is an object with a `step` method that takes a result (so far) and an input and returns a new result. This is similar to the blocks we pass to Ruby’s `reduce` (a.k.a `inject`), and serves a similar role in transducing process.

As part of its reducing step, the reducer will receive a list of words in a line, then iterate over this list and increment the counter in the words_frequency hash.

Then, once the words_frequency has been computed, we returned the result sorted by frequency in descending order.

frequency_counter = Class.new do
  def step(words_frequency, words)
    words.each { |w| words_frequency[w.downcase] += 1 }
    words_frequency
  end

  def complete(result)
     result.sort {|a,b| a[1] <=> b[1]}.reverse
  end
end.new

The Transducer

Again from the source:

A handler is an object with a `call` method that a reducer uses to process input. In a `map` operation, this would transform the input, and in a `filter` operation it would act as a predicate.

A transducer is an object that transforms a reducer by adding additional processing for each element in a collection of inputs.

A transducing process is invoked by calling `Transducers.transduce` with a transducer, a reducer, an optional initial value, and an input collection.

Our transducer will compose the functions which filter the non-alphanumeric characters that exists on a line, as well as words that should be ignored from the text.

This results in the reducer receiving a list of words to compute the term frequency list.

t = Transducers.compose(
  Transducers.map do |line|
    line.gsub!(/[^a-zA-Z0-9]/, " ") # remove non alphanumeric
    line.split(' ')                 # split into words
  end,
  Transducers.map do |words|        # remove stop words
    words.delete_if {|w| stop_words.include?(w.downcase) }
  end
)

words_frequency  = Hash.new {|h,k| h[k] = 0 }
words_frequency_list = Transducers.transduce(t, frequency_counter, words_frequency, lines)

Printing the results

To verify the results, we only want to check which were the top 25 terms that have the highest frequency

# Print the top 25 terms
words_frequency_list.to_a[0...25].each do |k, v|
  puts "#{k}  -  #{v}"
end

Hacker School - week #1 recap

It’s been almost a week since I came to New York to join HackerSchool, a quick recap:

Week of [2014-10-06]–[2014-10-10]

  • Crista Lopes was the resident during this week and she talked about her Exercises in Programming Styles book during the first days. I’m not even halfway through the book yet, though I have to say that the book has already left a mark.

    Crista then went on to concisely explain theoretical concepts around types, data structures, and even monads in Haskell.

    I also shared a bit about Org Babel, which may have to do something to do with the following tweet:

  • Right now I’m working through the exercises here (of course, in Org mode). I named it Exercises in Org to keep the scope open to other exercises, though focusing on the programming styles book at this moment.
  • I took a bit of time to revamp my blog yet again. This time using Hyde as the the theme and the org/, src/ folder structure that I have been practicing lately. Repo at wallyqs/org-hyde.
  • Another great thing of being in NYC is the amount of meetups that there are. I assisted to the Mesos NYC meetup at the Twitter offices to hear about the practical usage of Aurora and Mesos, as presented by Bill Farner. In the following weeks I will be diving more into Mesos, so this was a good opportunity to ask directly to those who operate regarding its practical usage.
  • Living in NYC related:
    • Got my Airbnbs setup until the beginnings of December.
    • Discovered Gasoline Alley Coffee which is just around the corner of HackerSchool.
  • Overall, having a great time. The social rules being applied (no feigning, no subtleisms, …) make up for an comfortable environment of collaboration which I had not experienced before, at least definitely not on this level.

Switching to Hyde

Here is a modified version of the original Hyde fit for my flow to continue doing things with mostly Org mode.

The repo can be found here: https://github.com/wallyqs/org-hyde