There are a number of variables that allow a DBA to tune a PostgreSQL database
server for specific loads, disk types and hardware. These are fondly called the
GUCS (Global Unified Configuration Settings) and you can take a look via the
pg_settings view. There are also a few of things that you can do in your
application to get the most out of Postgres:
Know the postgres index types
CREATE INDEX will create B-tree indexes which will serve well for
most cases where we use equality, inequality and range operators. However there
are cases where you can build different indexing strategies with GiST
(Generalized Search Tree) indexes. For example, Postgres ships with built in
GiST operator classes for geometric
— for dealing with the geometric types like point, box, polygon, circle,
and others. There are more interesting GiST index
examples in the
contrib packages for things like textual search, tree structures, and more.
Consider multicolumn indexes, when it makes sense
When your query filters the data by more than one column, be it with the
JOINs, multicolumn indexes may prove useful. If you create an index
on columns (a, b), the Postgres planner can use it for queries
WHERE a = 1 WHERE a = 1 AND b = 2
However, it will not use it for queries using:
WHERE a = 1 OR b = 2 WHERE b = 2
But Postgres also has the ability to use multiple indexes in a single query.
This may come in handy if you are using the
OR operator, but will also make
use of it for
AND queries. So it boils down to what the most common case is
according to your application’s read patterns and optimize for that, either with
an an index on (a, b) and another on (b), or two separate single column indexes.
Simply put, a partial index is an index with a
WHERE clause. It will only
index rows that match the supplied predicate. You can use them to exclude values
from an index that you hardly query against.
For example, you have an
orders table with a
completed flag. The sales
people want to know what orders over $100,000.00 haven’t been completed because
they want to collect their bonuses, so you build a view in your app to show them
just that (and negotiate a cut on the bonus). You could create the following
CREATE INDEX orders_incomplete_amount_index on orders (amount) WHERE complete is not true;
Which will be used by queries of the form:
SELECT * FROM orders where amount > 100000 AND complete is not true;
Don’t over index
Part of maintaining a healthy database is going back and making sure you don’t have any unused indexes. It’s common to add indexes to address a specific performance issue for a particular query, but in many cases indexes start to pile up becoming dead weight. Remember that the more indexes you have, the slower INSERTs will become because more writes will need to happen to keep the indexes updated.
Keep statistics updated
Make sure to run
VACUUM ANALYZE to keep data statistics up to date — as
well as recover disk space. In addition, Postgres ships with a built in
auto-vacuum daemon whose purpose is to automate the execution of
ANALYZE. You should read
considerations for setting the auto-vacuum daemon’s frequency according to your
database size and usage characteristics.
Make sure you ANALYZE when creating a new index, otherwise Postgres will not have analyzed the data and determined that the new index may help for the query.
Use more joins
Postgres is perfectly capable of joining multiple tables in a single query. In a running app, queries with five joins are completely acceptable, and will help bring in the data required by your app, reducing the number of trips to the database. In most cases, joins are also a better solution than subqueries — Postgres will even internally “rewrite” a subquery, creating a join, whenever possible, but this of course increases the time it takes to come up with the query plan. So be a pal and use joins instead of subselects.
Prefer INNER JOINs
If the cardinality of both tables in a join is guaranteed to be equal for your
result set, always prefer doing an
INNER JOIN instead of a
LEFT OUTER JOIN.
A lot of research and code has gone into optimizing outer joins in Postgres over
the years. But the reality is that especially as more joins are added to a
query, left joins limit the planner’s ability optimize the join order.
Know how to understand the explain output
Paste the output of
explain analyze [some query] into
explain.depesz.com to help identify the most
costly nodes in the query plans.
EXPLAIN output is a very extensive topic, but these are some
general guidelines when reading plans:
- Are the cost estimates vs. actuals close, or are there discrepancies? Typically a sign of not having ANALYZEd recently.
- Is an index not being used? The planner may be choosing not to use it for good reason.
- Is there query using the
some_string LIKE pattern? If so, make sure the pattern is anchored at the beginning of the string. Postgres can use an index when doing
some_string LIKE 'pattern%'but not for
some_string LIKE '%pattern%'
- Have you vacuumed recently? Have you indexed foreign keys?
- Are there table scans that should use an index instead? Not all table scans are bad — there are cases where it will perform better than an index scan
- Good database schema design yields better query plans. Read up on database normalization.
Never try to optimize queries on your development machine
The Postgres planner collects statistics about your data that help identify the best possible execution plan for your query. In fact, it will just use heuristics to determine the query plan if the table has little to no data in it. Not only do you need realistic production data in order to analyze reasonable query plans, but also the Postgres server’s configuration has a big effect. For this reason it’s required that you run your analysis on either the production box, or on a staging box that is configured just like production, and where you’ve restored production data.
Experimentation is key
There are no hard and fast rules to a perfectly optimized system. The best advice is to try out different configurations, use a tool like NewRelic to find out what the bottlenecks are, and liberally try out different combinations of indexes and queries that yield best results for your particular situation.