kosfar's Blog
A systems traveller logbook

How complex systems slow down

It’s a Saturday evening (of course it is!). And suddenly all but the lightest queries targeting your PostgreSQL database start slowing down.

Engineering team members start logging in and doing their best to recover most of the lost performance. Part of the ongoing effort includes removing database querying entirely where possible and offloading to caches built from static files. But there are still some pretty important queries that cannot be taken off the database and remain slow; access share locks are piling up high enough to manifest the slowness, and these queries, oh, these queries are very frequent, too many times per minute.
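For a rough sense of that pile-up, a quick look at pg_locks is usually enough; this is a generic sketch, not the exact query used that evening:

    -- count granted and waiting locks by mode; a flood of AccessShareLock
    -- entries on the same relations is a good hint of read contention
    SELECT mode, granted, count(*)
    FROM pg_locks
    GROUP BY mode, granted
    ORDER BY count(*) DESC;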

You look around and there are no DBAs in the room. Guess what? This is the day you are wearing their hat. So you collect the slow query statements and fire up your casual EXPLAIN (analyze, buffers) spells. At first glance there are no hints of seq scans in the plans, nor any nested loops. The queries are still using their indexes wherever they can, and yet they are so slow. And then, among the lines, it hits you:

   Buffers: shared hit=221991 read=150361 written=15

Adding those counters up, that’s almost 3GB of data touched by a single query: 372,352 buffer pages of 8kB each, with roughly 1.2GB of it read from outside shared buffers. Wait! Let’s check our shared buffers. 4GB. Boom! Let’s bump this up.
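For reference, this is roughly what that investigation looks like in psql; the query itself is a made-up stand-in, but the arithmetic on the counters is exactly the calculation above:

    -- hypothetical slow query; the BUFFERS option is what adds the shared hit/read counters
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders WHERE customer_id = 42 ORDER BY created_at DESC;

    -- how big is the pool those buffers have to fit in?
    SHOW shared_buffers;   -- 4GB at the time of the incident
    SHOW block_size;       -- 8192, i.e. each buffer page is 8kB

    -- translate the plan counters into bytes
    SELECT pg_size_pretty((221991 + 150361) * 8192::bigint) AS touched,
           pg_size_pretty(150361 * 8192::bigint)            AS read_outside_buffers;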

Shared buffers go up to 32GB, the server gets restarted, and the query now executes in 5 seconds instead of 120. Success! So it turns out that what was happening was that these queries were thrashing the shared buffer pool: the working set simply did not fit in 4GB, so pages were constantly being evicted and read back in. With a much larger pool the hot pages stay resident, and the same plans now run mostly against memory instead of disk.
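The bump itself is a one-liner, assuming superuser access; note that shared_buffers only takes effect after a full restart, not a reload:

    -- persist the new setting (written to postgresql.auto.conf)
    ALTER SYSTEM SET shared_buffers = '32GB';

    -- shared_buffers cannot be changed at runtime; restart the server, e.g.:
    --   pg_ctl restart -D /path/to/data
    -- or the service manager equivalent, e.g. systemctl restart postgresql

The PostgreSQL docs suggest around 25% of system RAM as a reasonable starting point for shared_buffers, which is the ballpark this new value lands in on a large box.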

During that session we also went down the road of debating a few other options, like tweaking the query planner or experimenting with clustering tables on their indexes, but data volatility would work against any benefit we could hope to gain back.
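For completeness, this is the shape of those two ideas; the table, index, and cost values here are illustrative, not what was actually tried:

    -- nudge the planner towards index access by telling it random I/O is cheap
    -- (only sensible when most of the data really is cached)
    SET random_page_cost = 1.1;

    -- physically reorder a table to match an index, improving locality for range scans;
    -- the ordering decays as new rows arrive, which is why volatile data undercuts the benefit
    CLUSTER orders USING orders_customer_id_idx;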

For the record, PG stats were showing a 96% buffer hit ratio during the incident, and that’s a reminder that although 90%-ish may sound like a lot for a hit ratio, in the PostgreSQL world it is often an indicator of poor performance: healthy OLTP workloads typically sit at 99% or higher, so even a few percent of misses translates into a lot of reads at these query volumes.
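The ratio itself comes straight out of the cumulative statistics views; a minimal version of the check looks like this:

    -- database-wide buffer cache hit ratio from the cumulative stats;
    -- blks_hit counts pages found in shared buffers, blks_read pages that were not
    SELECT datname,
           round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS hit_ratio_pct
    FROM pg_stat_database
    WHERE datname = current_database();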