The right indexes make big SQL queries fast. If you’ve been using Postgres for more than 5 minutes, you’re almost certainly familiar with the everyday B-Tree index. This can deal with data that has a one-dimensional ordering: numbers, timestamps, text, and so on.
And if you’ve had anything to do with spatial data in PostGIS, or indeed with Postgres’ own geometry types, you’ve probably also encountered GiST-based indexes. These can handle 2D and higher-dimensional data, enabling speedy queries on proximity, containment, overlap, and so on.
Both index types can work with multiple data items. For example, if you need to find people by their name and date of birth, you can create a multi-column B-Tree index that enables fast searching on the two together.
But what if you need to search by multiple columns, and the data in those columns is a mix of one- and multi-dimensional? A case in point: what if you need to search simultaneously over space and time? Well, Postgres has you covered here too.
The TL;DR is: use the btree_gist extension. Some context and a worked example follow.
Crimes
This issue cropped up while I was working with the detailed crime data published by UK police forces. The data set has one row per crime: each row has the crime’s type, a latitude and longitude (generalised slightly for anonymity), and the month and year it was reported.
I needed a big join query to count crimes occurring near millions of specific point-locations at particular times.
Get the data
The data I used was the April 2017 archive (which is the latest that goes all the way back to 2010). It’s organised into folders named by year and month (2017-04, 2017-03, etc.). In each folder there are CSV files for each police force, and for a few different sorts of data. I was interested in street-level crime data, so the files I wanted look like [year-month]/[year-month]-[police-force]-street.csv (for instance, 2017-04/2017-04-thames-valley-street.csv).
When I’m exploring CSV data and loading it into Postgres, I generally find xsv invaluable. Let’s use it to take a look at what we have in these CSVs.
How many crimes are recorded?
What do the columns represent?
What are the types of crime?
Note that I’m using pv with the row count calculated above to monitor progress, and we need the --limit argument because xsv frequency tables default to showing only the top 10.
Now let’s load just the subset of data we need into Postgres.
First, create the schema:
Then, load in the data with COPY … FROM. Again I use pv to monitor progress. I use xsv to select specific columns. And I use sed to put the year-month dates into a format Postgres understands, by appending -01 (i.e. the first day of the month) to each.
As quick check that the data have loaded OK, let’s see how crimes are distributed over years with a quick crosstab:
This looks sensible enough. It also tells us that crime classifications changed a few times between 2010 – 2017, which is why some of the categories are a bit oddly overlapping.
Make it spatial
The next step is to create a point geometry out of the latitude and longitude coordinates we’ve been given.
As is annoyingly common, no coordinate reference system is specified in the data set. The gov.uk standard is British National Grid (BNG). But that’s clearly not what we have here, since that would give us eastings and northings in metres, not latitude and longitude in degrees.
For now, I’m going to guess we’ve got WGS84, which is the GPS coordinate system (EPSG code 4326). And then I’m going to transform to BNG (EPSG code 27700), because having the coordinates in metres makes distance calculations quick and easy.
Two small caveats: (1) I’m using PostGIS’ built-in parametric transformation between the two coordinate systems, which may be off by a few metres. For really precise work, I’d use OSTN15. And (2) BNG isn’t really appropriate for the Northern Ireland data, but it should still work well enough here.
The final step in loading the data is to update table statistics, helping the query planner make good decisions in what follows.
Ask questions
Let’s start by counting crimes within 1km of Stratford station (near the Olympic Park) around the 2012 Olympics in London. Stratford is at grid point E 538566, N 184375, so:
Unsurprisingly, since there were so many more people about, it turns out that crime in these months is a bit higher than usual (the highest July/August crime count in any other year in the data set is 1,430).
It also turns out that this is a pretty slow query, taking about 8 seconds on a fast MacBook Pro. That, of course, is because we haven’t indexed anything.
If we explain analyze the query, there’s a Parallel Seq Scan on crimes in there. It’s nice that Postgres can parallelise it, but it’s still a sequential scan: every row of the database has to be retrieved and checked. Yawn.
Index on space
The first thing I tried, therefore, was a standard spatial index:
This is very much better. But the whole point of this post is that we can do better still.
Index on space and time
For this particular search we should really be indexing over both space and time. That will then allow Postgres to do a simple Index Scan encompassing all conditions simultaneously.
For this, we use the btree_gist extension, which reimplements a BTree index using GiST infrastructure.
Ordinarily, btree_gist has no advantage (and some disadvantages) over the native BTree implementation. What it’s useful for, though, is combining a BTree index with GiST index types. Which we can do like so:
When we now re-run the query, it takes about 8 milliseconds: almost 1000x quicker than the unindexed version, and about 4x quicker than the spatial-only index version. An explain analyze shows a simple Index Scan as the search method, which is what we were hoping to see.
Of course, there is a trade-off to be made, and in this case it’s primarily about disk space. Running psql’s \di+ and \dt+ commands we can see that the spatial-only index takes 1.9GB of storage, while the space-and-time index requires 2.7GB. This is a significant fraction of the size of the data itself: the crimes table is 5.5GB.
Still, this is a nice, flexible technique that extends to various sorts of one-dimensional data that you might want to combine with a GiST index. I hope you find it useful.