[clug] Tuning Postgres for large amounts of incoming data

David Tulloh david at tulloh.id.au
Fri Feb 3 07:47:25 UTC 2017


Hi Stephen,

Seeing your code would help, but here is some general advice.

For bulk inserts the COPY command is far, far faster than INSERT, because it
loads many rows in a single statement instead of one at a time.
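
As a rough sketch (the table and column names here are made up for
illustration), loading a day's CSV file in one go looks like this:

    -- Server-side COPY reads a file that lives on the database host:
    COPY positions (icao, lat, lon, altitude, seen_at)
        FROM '/data/adsb/2017-01-31.csv'
        WITH (FORMAT csv, HEADER true);

    -- From psql you can use \copy instead, which streams the file from the
    -- client machine and doesn't need superuser access:
    -- \copy positions FROM '2017-01-31.csv' WITH (FORMAT csv, HEADER true)

Most client libraries also expose COPY FROM STDIN, so you can stream rows
straight from your program without writing an intermediate file.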

Remove any indexes, including primary keys. Indexes help greatly with queries
but slow down insertion, because every index has to be updated for each row.
It is faster to drop them and recreate them at the end if required.
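
For example (again with made-up names), assuming a primary key and a
timestamp index:

    -- Drop the primary key and any secondary indexes before the load...
    ALTER TABLE positions DROP CONSTRAINT positions_pkey;
    DROP INDEX IF EXISTS positions_seen_at_idx;

    -- ...then rebuild them once, after all the rows are in.
    ALTER TABLE positions ADD PRIMARY KEY (icao, seen_at);
    CREATE INDEX positions_seen_at_idx ON positions (seen_at);

Building an index once over the finished table is a single sort rather than
millions of incremental updates.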

Remove any constraints and triggers, including foreign keys. You can reapply
them at the end to ensure data integrity; for large datasets it is faster to
check everything once than to check it on every insert.
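
A sketch, assuming a hypothetical foreign key to an aircraft table:

    -- Drop the check and switch off user triggers for the duration of the load...
    ALTER TABLE positions DROP CONSTRAINT positions_icao_fkey;
    ALTER TABLE positions DISABLE TRIGGER USER;

    -- ...then put them back; the foreign key is validated once, in bulk.
    ALTER TABLE positions ENABLE TRIGGER USER;
    ALTER TABLE positions
        ADD CONSTRAINT positions_icao_fkey
        FOREIGN KEY (icao) REFERENCES aircraft (icao);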

Ensure that your own code isn't the slow point. Possible suspects are
recreating the DB connection for each command, opening a new transaction for
each command, escaping and analysing every value to prevent SQL injection
(an obvious trade-off there), and not reusing prepared statements.
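
In plain SQL terms the shape you want is one connection, one big transaction
and a prepared statement that is reused; whatever client library you use
should have an equivalent (the table here is again hypothetical):

    BEGIN;
    PREPARE ins (text, double precision, double precision, timestamptz) AS
        INSERT INTO positions (icao, lat, lon, seen_at)
        VALUES ($1, $2, $3, $4);
    -- The statement is parsed and planned once, then executed many times:
    EXECUTE ins('7C6B2D', -35.31, 149.12, '2017-01-31 13:00:00+00');
    EXECUTE ins('7C6B2E', -35.40, 149.20, '2017-01-31 13:00:01+00');
    -- ... thousands more rows on the same connection ...
    COMMIT;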

Finally, once the low-hanging fruit has been picked, you can look at your
COPY or INSERT commands with EXPLAIN. You can also enable detailed statement
logging, which will highlight the slow points and show you where to target
(turn the logging off again later, as it also slows you down).
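
For example (note that EXPLAIN ANALYZE actually executes the statement):

    -- See how a single insert is carried out, including any trigger time:
    EXPLAIN (ANALYZE, BUFFERS)
        INSERT INTO positions (icao, lat, lon, seen_at)
        VALUES ('7C6B2D', -35.31, 149.12, '2017-01-31 13:00:00+00');

    -- Log any statement slower than 250 ms (needs superuser, or set it in
    -- postgresql.conf); remember to turn it back off afterwards:
    SET log_min_duration_statement = 250;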

The official Postgres guide below covers some of the same points in more
detail. It also talks about some of the server variables which can be
tweaked; I would suggest tackling the above before going down that path.
https://www.postgresql.org/docs/9.4/static/populate.html
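
If you do reach that point, a couple of the typical bulk-load knobs can be
tried per session first (the values below are only examples, not tuned for
your machine):

    -- Helps the CREATE INDEX / ADD CONSTRAINT steps at the end of the load:
    SET maintenance_work_mem = '1GB';
    -- Don't wait for the WAL to hit disk on every commit (a crash can lose
    -- the last few transactions, which is usually fine when reloading):
    SET synchronous_commit = off;

    -- Server-wide knobs such as shared_buffers and the checkpoint settings
    -- live in postgresql.conf and need a reload or restart.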

You should be able to get considerably better performance than you are
currently seeing. Years ago I was throwing gigabytes of data in and out of
Postgres; it took 10-15 minutes, and that was mostly because of my slow code.


David


On 1 February 2017 at 13:03, Stephen Hocking <stephen.hocking at gmail.com>
wrote:

> Hi all,
>
> As some of you know, I've created a setup with a few raspberry pis around
> the state that record plane positions, this is then logged in a Postgres DB
> on a small VM, with about 1.5-2 million records being logged each day.
>
> I've written some code that pulls down the ADS-B Exchange's daily archive
> and logs them into my own DB. This is about 120M records per day, being
> positions of aircraft all over the world. I'm doing this on a stand-alone
> machine, a 6 core AMD box with 32GB of memory. Doing it with a single
> threaded stream takes about 2.5 days to log a full day's data, keeping one
> core busy, with 60% Postgres & 40% my code. Wait on I/O is about 2%.
> Splitting the data up into 6 separate streams (one for each core) takes
> about 14.5 hours to log a day's data. A day's data seems to take up about
> 30GB in the DB. There's about 20% wait I/O and 54% idle time, which leads
> me to suspect I could tune various Postgres parameters.
>
> Has anyone else had experience with this? Googling turns up a bunch of
> interesting pages, but some of the tweaks mentioned therein seem more
> appropriate for workloads with intensive read queries, which are, as yet,
> not on my horizon.
>
> --
>
>   "I and the public know
>   what all schoolchildren learn
>   Those to whom evil is done
>   Do evil in return"            W.H. Auden, "September 1, 1939"
> --
> linux mailing list
> linux at lists.samba.org
> https://lists.samba.org/mailman/listinfo/linux
>

