Yuletide Logs and MongoDB Capped Collections

Nick Quaranto

This Christmas, maybe you’re thinking of the long trip back home through snowy highways and bustling airports, spending time with the family next to a warm fireplace, or perhaps being serenaded by The Big Maestro himself. Sadly, I didn’t get to see Shaq conduct the Pops; instead, I’m thinking about how MongoDB’s Capped Collections have made my job a bit easier.

Warmup

Here’s the problem: lots of data coming in extremely frequently, on the order of several thousand discrete chunks of data every 5 seconds. Since it’s coming in far too fast to comprehend, we really just want to show the last 15 chunks to the user. This lets the user know the service is accepting data, and it’s kind of neat to see data updating that fast in real time. My first implementation? ActiveRecord, of course.

create_table "raws" do |t|
  t.text     "line"
  t.datetime "ran_at"
end

Seems simple enough. Since the data had to be parsed, I tossed the work into a background job, which looked something like this:

class Parser
  attr_accessor :log  # the gzipped payload posted by the client

  def perform
    # Decompress the payload and treat each line as one chunk of data
    lines = ActiveSupport::Gzip.decompress(log).split("\n")
    lines.each do |raw_line|
      create_raw(DateTime.now, raw_line.strip)
    end
  end

  def create_raw(ran_at, line)
    Raw.create :ran_at => ran_at, :line => line
  end
end

Yes, I was pretty stupid and my first implementation just gzipped the data and shot it to the server. We’ve since moved to a proper JSON API hooked up with yajl-ruby (there’s a sketch of that right after this list), but the real problem here had two steps:

  1. Parse input
  2. Create row for each chunk of data
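
To give you an idea of the JSON version, here’s a rough sketch of what the parsing step became. It assumes the client posts a JSON array of objects with "at" and "command" keys (the same shape as the Mongo documents later in this post); the names here are my own illustration, not the exact production code.

require 'yajl'

class Parser
  attr_accessor :payload  # a JSON string posted by the client

  def perform
    # Yajl::Parser.parse accepts a JSON string (or an IO) and replaces
    # the gzip-decompress-and-split dance from the first version.
    Yajl::Parser.parse(payload).each do |line|
      create_raw(Time.at(line['at']), line['command'].to_s.strip)
    end
  end

  def create_raw(ran_at, line)
    Raw.create :ran_at => ran_at, :line => line
  end
end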

Moving to JSON/YAJL made parsing faster, but what about writing the data? This seemed great until I left it running overnight, and big surprise: there were over a million rows. Ok, that’s fine, we don’t need to keep it all anyway…

def perform
  Raw.delete_all(["ran_at < ?", 1.minute.ago])
  # parsing/inserting code here
end

That still wasn’t enough as I increased traffic to the endpoint. Before I refactored how it accepted data, the table held around 50,000 rows at any given moment (and that’s just the last minute of data), and we had dished out around 140 million primary keys. We were simply storing too much input, and given that we only ever needed to show the last few pieces of it, there had to be a better way to model the data.

Flurries

I had to sit back and consider my options here. After poring over the PostgreSQL docs and mailing lists, I couldn’t figure out a good way to implement this kind of write-heavy behavior in SQL.

I still consider most NoSQL solutions to be awesome utility belts to use alongside relational data stores. I turned to Redis first; the problem could have been solved with a simple Redis list like so:

# in config/initializers/redis.rb
$redis = Redis.new

# when writing data
$redis.lpush("latest-data", JSON.dump(line))
$redis.ltrim("latest-data", 0, 14) # keep only the newest 15 entries (indices 0..14)

# when reading data
raws = $redis.lrange("latest-data", 0, -1)
raws.map { |raw| JSON.parse(raw) }

This definitely would have been faster, since Redis keeps everything in memory, and LTRIM trims the list back down to a fixed number of entries after every push.

The reads are kind of awkward, though, since we’re storing more than one tidbit of data per entry (when the data was parsed, plus the data itself) and we plan on adding more later. Redis only understands strings, so every entry has to be serialized on the way in and parsed again on the way out. YAJL would make short work of that, but when we also have to maintain the size of the list ourselves, it feels like there should be a more natural, built-in way of modeling this problem.
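
Concretely, each Redis entry would end up looking something like the snippet below. The field names ("at" and "command") are my assumption, mirroring the Mongo documents later in this post.

# Every tidbit has to be packed into a single string per entry...
$redis.lpush("latest-data", JSON.dump("at" => Time.now.to_i, "command" => line))

# ...and every read has to unpack each entry again.
latest = $redis.lrange("latest-data", 0, -1).map { |raw| JSON.parse(raw) }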

Blizzard

I next looked at MongoDB. I got a great introduction at MongoBoston, and I had heard of Capped Collections, but they didn’t click until I read this:

Capped collections are fixed sized collections that have a very high performance auto-FIFO age-out feature (age out is based on insertion order). […] In addition, capped collections automatically, with high performance, maintain insertion order for the objects in the collection; this is very powerful for certain use cases such as logging.

Booyah! I immediately hooked up a free MongoHQ plan and got started. The Mongo Ruby driver is pretty easy to use, and after reading some tutorials I ended up with a decent solution.

First off, figuring out how to connect to MongoHQ on Heroku was a bit irksome, since it isn’t documented well, so here’s a sample. It hooks up the Rails.logger and uses the proper environment-specific database on your local machine.

# in config/initializers/mongo.rb
if ENV['MONGOHQ_URL']
  # On Heroku, the MongoHQ add-on exposes a full connection URL;
  # the database name is the path portion of that URL.
  uri    = URI.parse(ENV['MONGOHQ_URL'])
  conn   = Mongo::Connection.from_uri(ENV['MONGOHQ_URL'], :logger => Rails.logger)
  $mongo = conn.db(uri.path.gsub(/^\//, ''))
else
  # Locally, fall back to a per-environment database on the default host/port.
  $mongo = Mongo::Connection.new(nil, nil, :logger => Rails.logger).db("app_#{Rails.env.downcase}")
end

And our capped collection write implementation:

class Tail
  def self.insert(line)
    collection.insert(
      :at => Time.at(line['at']),
      :command => line['command'])
  end

  def self.collection
    @collection ||= $mongo.create_collection("tail",
      :capped => true, :max => 15)
  end
end
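
Wiring that into the background job is then just a matter of swapping the Raw.create call for Tail.insert. A minimal sketch, assuming the same JSON payload shape as the parsing example earlier:

class Parser
  attr_accessor :payload

  def perform
    # Each parsed line goes straight into the capped collection; Mongo
    # discards everything but the newest 15 documents on its own.
    Yajl::Parser.parse(payload).each do |line|
      Tail.insert(line)
    end
  end
end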

Creating a capped collection is pretty simple: you can cap it at a maximum number of documents, a total size in bytes, or both. Mongo::DB#create_collection won’t blow away an existing collection, so it’s ok to keep calling it in this class.
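
If you’d rather cap by size than by document count, the same call takes a :size option in bytes. This collection isn’t part of the app; it’s just a quick sketch to show the option:

# A hypothetical collection capped at roughly 1 MB of documents
$mongo.create_collection("recent_events",
  :capped => true, :size => 1024 * 1024)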

Reading from the capped collection is simple too, except that Mongo insists on returning a BSON::ObjectId with every document. The user doesn’t need to see that, so I filter it out in Ruby:

class Tail
  def self.last
    collection.find.to_a.tap do |tails|
      tails.each do |tail|
        tail.delete('_id')
      end
    end
  end
end

In practice, each Tail collection is segmented out by user, and MongoHQ even provides a nice interface to browse what’s in your database.

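Here’s a minimal sketch of what that per-user segmentation might look like; the "tail-<id>" naming scheme is my assumption for illustration, not the exact code.

class Tail
  # Hypothetical per-user variant: one capped collection per user
  def self.collection_for(user)
    $mongo.create_collection("tail-#{user.id}",
      :capped => true, :max => 15)
  end
end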

Clear Skies

Overall, I think a few MongoDB capped collections served this use case far better than Postgres or Redis would have. Use the best tool for the job, and it’s even better if someone else runs the tool for you. If you have similar use cases for bringing NoSQL databases in alongside relational ones, let us know in the comments!