I’ve recently been reminded, in no uncertain terms, of the importance of running the “ANALYZE” (or “VACUUM ANALYZE”) command periodically on a PostgreSQL database installation.
Amongst other tables, a production database I administer has three tables: two moderately large tables, A(1) and B, and one quite large one (>700 000 records), C (for “Cross”), which is essentially the full cross product of A and B, with an additional piece of information for each tuple (record). So I’m associating some information with every combination of records from tables A and B.
Queries pulling the data for a single record of A or B worked quickly, as the database engine correctly used a lookup index instead of scanning the entire table. But any query that required records from C corresponding to more than one record of both A and B immediately triggered a sequential scan. So, say, for two records from A, a query that should take at most (2 x 1ms) was taking about 1.5s — nearly a thousand times longer, and very noticeable in applications.
Solution 1: Split the queries
One could, of course, handle this problem in the application code, by issuing a seperate query for each record from A, and combining the results in code. But this is really messy, and anyway surely direct data manipulation in PostgreSQL should be faster than PHP’s very flexible but not speedy number handling!
Solution 2: Set enable_seqscan to “off”
My initial research discovered the existence of a connection-wide variable “enable_seqscan”, that can be set to off using
SET enable_seqscan TO off;
Once so set, index scans are always forced. This works, and doesn’t require much additional application code. But it’s a rough tool, as it turns off ALL sequential scans, even those that would in fact be more efficient.
Really, what I wanted was the query analyzer and planner in PostgreSQL to be better — ie., to stop living in a fantasy world where sequential scans on table C were relatively fast. Turns out, however, that my problem was exactly that the planner was in such a world:
Solution 3: Run “ANALYZE”
This command takes a random sampling of rows from each table, to compile statistics about the tables, to help the planner plan. And it turned out I hadn’t run it in ages — in fact, since C had been a much, much smaller table. After running the command (it took a few seconds on over 50 tables), all my queries executed exactly as I wanted them to. Everywhere. A PHP page that took over two seconds to load is now effectively instantaneous, and much elsewhere is far improved.
Needless to say, my crontab now includes a nightly “VACUUM ANALYZE”. Yours should too.
(1)Names have been changed to protect witness confidentiality.