Optimizing Full Text Search with Postgres tsvector Columns and Triggers

Dan Croak

Postgres full-text search is awesome, but without tuning, searching large columns can be slow. Introducing a tsvector column to cache lexemes, and using a trigger to keep those lexemes up-to-date, can improve the speed of full-text searches.

This article shows how to accomplish that in Rails.

Using pg_search with Rails

We want our users to search for products. Let’s add the pg_search gem to our Rails app:

gem "pg_search"

Then, configure it for our Product model:

class Product < ActiveRecord::Base
  include PgSearch

  pg_search_scope(
    :search,
    against: %i(
      description
      manufacturer_name
      name
    ),
    using: {
      tsearch: {
        dictionary: "english",
      }
    }
  )
end

This example shows configuration for full-text search of the products table’s description, manufacturer_name, and name columns. See Implementing Multi-Table Full Text Search with Postgres in Rails for an example of full-text search of multiple tables.

Our pg_search_scope is named :search, so we can invoke it with:

Product.search("wool")

We’re explicitly specifying the :tsearch option (pg_search’s default, which uses Postgres’ built-in full-text search) in order to use the english dictionary instead of the default simple dictionary.
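We can see what the english dictionary buys us directly in psql. The simple dictionary only lowercases and splits text, while english also stems words and removes stop words:

SELECT to_tsvector('simple', 'Searching jumped');
-- 'jumped':2 'searching':1

SELECT to_tsvector('english', 'Searching jumped');
-- 'jump':2 'search':1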

The resulting SQL query

Wonderful. We have full-text searching set up in minutes.

Now, what does our SQL query look like?

SELECT products.*
FROM products
INNER JOIN (
  SELECT products.id AS pg_search_id,
  (
    ts_rank(
      (
       to_tsvector('english', coalesce(products.description::text, '')) ||
       to_tsvector('english', coalesce(products.manufacturer_name::text, '')) ||
       to_tsvector('english', coalesce(products.name::text, ''))
      ),
      (
        to_tsquery('english', ''' ' || 'wool' || ' ''')
      ),
      ?
    )
  ) AS rank
  FROM products
  WHERE (
   (
     (
       to_tsvector('english', coalesce(products.description::text, '')) ||
       to_tsvector('english', coalesce(products.manufacturer_name::text, '')) ||
       to_tsvector('english', coalesce(products.name::text, ''))
     ) @@
     (
       to_tsquery('english', ''' ' || 'wool' || ' ''')
     )
   )
  )
) pg_search ON products.id = pg_search.pg_search_id
ORDER BY pg_search.rank DESC
LIMIT 24
OFFSET 0

This is all pretty standard SQL plus a few cool functions: ts_rank, to_tsvector, and to_tsquery. The to_tsvector function is worth a closer look. It generates tsvector data types, which are “a sorted list of distinct lexemes.” Lexemes, in turn, are “words that have been normalized to make different variants of the same word look alike”.

For example, given the following product:

Product.create(
  description: "Michael Kors",
  name: "Sunglasses",
  manufacturer_name: "Michael Kors"
)

The tsvector looks like:

'kor':2,4,6 'michael':1,3,5 'sunglass':7

The resulting lexemes were “normalized to make different variants of the same word look alike” by lowercasing, removing suffixes, etc. The lexemes were sorted into a list, and the numbers record each lexeme’s position in the concatenated text of the three columns.
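to_tsquery normalizes the search term in the same way, which is why a search for “sunglasses” matches the stored lexeme “sunglass”:

SELECT to_tsvector('english', 'Michael Kors Sunglasses') @@
       to_tsquery('english', 'sunglasses');
-- t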

For tons of awesome examples and details on these three functions, see Postgres full-text search is Good Enough!

Caching tsvector lexemes

On a large products table, our searches may be slow. If so, we have some tuning options.

One option would be to cache the tsvectors using a materialized view. Read Caching with Postgres materialized views or Postgres full-text search is Good Enough! (again) for more information on using materialized views with Postgres and Ruby.

Materialized views may be a good option for your data. One downside is that the entire view must be refreshed with:

REFRESH MATERIALIZED VIEW view_name;
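For reference, a minimal sketch of that approach might look like the following (the product_searches view name and its single tsv column are hypothetical, not part of our app):

CREATE MATERIALIZED VIEW product_searches AS
SELECT products.id AS product_id,
       to_tsvector('english', coalesce(products.name::text, '')) AS tsv
FROM products;

CREATE INDEX index_product_searches_on_tsv
ON product_searches USING gin (tsv);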

That may be a good fit in some scenarios, perhaps run daily as a cron or Heroku Scheduler job. In our case, we want a cache to be updated when a Product is created or updated.

Let’s edit our pg_search_scope:

      using: {
        tsearch: {
          dictionary: "english",
+         tsvector_column: "tsv",
        }
      }

Since we can’t dump a tsvector column to schema.rb, we need to switch to the SQL schema format in our config/application.rb:

config.active_record.schema_format = :sql

Remove the now-unnecessary db/schema.rb (Rails will dump db/structure.sql from now on):

rm db/schema.rb

And generate a migration:

class AddTsvectorColumns < ActiveRecord::Migration
  def up
    add_column :products, :tsv, :tsvector
    add_index :products, :tsv, using: "gin"

    execute <<-SQL
      CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
      ON products FOR EACH ROW EXECUTE PROCEDURE
      tsvector_update_trigger(
        tsv, 'pg_catalog.english', description, manufacturer_name, name
      );
    SQL

    # Touch every row so the new trigger fires and backfills the tsv column
    now = Time.current.to_s(:db)
    update("UPDATE products SET updated_at = '#{now}'")
  end

  def down
    execute <<-SQL
      DROP TRIGGER tsvectorupdate
      ON products
    SQL

    remove_index :products, :tsv
    remove_column :products, :tsv
  end
end

This change introduces a tsv column of type tsvector to search against, a GIN index on the new column, a TRIGGER on the products table that fires BEFORE INSERT OR UPDATE to rebuild tsv from the three source columns, and a backfill UPDATE that touches every existing product so its lexemes are generated.

Postgres has a built-in tsvector_update_trigger function to make this easier.
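Once the migration’s backfill UPDATE has run, we can spot-check the cached lexemes in psql (using the Michael Kors product from earlier):

SELECT tsv FROM products WHERE name = 'Sunglasses';
-- 'kor':2,4,6 'michael':1,3,5 'sunglass':7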

The GIN index could alternatively be a GiST index. See the GIN vs. GiST tradeoffs.
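Broadly, GIN indexes are slower to build and update but faster to search; GiST is the reverse. If GiST fits a write-heavy workload better, only the index definition changes:

-- Equivalent of add_index :products, :tsv, using: "gist"
CREATE INDEX index_products_on_tsv ON products USING gist (tsv);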

Here’s the resulting query with the new tsvector-type column:

SELECT products.*
FROM products
INNER JOIN (
  SELECT products.id AS pg_search_id,
  (
    ts_rank(
     (products.tsv),
     (to_tsquery('english', ''' ' || 'wool' || ' ''')), 0
    )
  ) AS rank
  FROM products
  WHERE (
    ((products.tsv) @@ (to_tsquery('english', ''' ' || 'wool' || ' ''')))
  )
) pg_search ON products.id = pg_search.pg_search_id
ORDER BY pg_search.rank DESC
LIMIT 24
OFFSET 0

We can see that the run-time to_tsvector calls are gone; the query now reads the cached lexemes from the GIN-indexed tsv column instead.
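One way to verify is with EXPLAIN. On a sufficiently large table, the plan should show a bitmap index scan on the tsv index rather than a sequential scan (exact output will vary with table size and statistics):

EXPLAIN SELECT id FROM products
WHERE tsv @@ to_tsquery('english', 'wool');

-- Bitmap Heap Scan on products
--   Recheck Cond: (tsv @@ '''wool'''::tsquery)
--   ->  Bitmap Index Scan on index_products_on_tsv
--         Index Cond: (tsv @@ '''wool'''::tsquery)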

We’ve now improved the speed of our queries by introducing a tsvector column to cache lexemes. The trigger will keep the lexemes up-to-date as products are created and updated, without any daily cron job to run.

Happy searching.