Some projects require working with the complete dataset of the OpenStreetMap project (OSM). When I use OSM data, I usually follow these steps:
- download the OSM data
- select my area of interest by clipping the data to my preferred extent
- import the data into a PostGIS database
The complete database dump of OSM is called the “Planet File” and currently weighs about 36GB in its compressed form.
To clip the data I usually resort to a very handy command-line tool called Osmium, which lets you define a region of interest and/or filter the data by certain tags. The import into a PostGIS database can be performed by several tools, of which I prefer one called Imposm3 because of its speed. Still, when using it on the complete planet file with limited resources (I used a dual-core CPU and 8GB of RAM), it gets terribly slow and does not complete without error. My guess is that the index created during the import to access the data becomes so large that access to it is no longer fast enough. For the extract of Europe alone, the index is more than 20GB.
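The clipping and filtering step can be sketched as follows. This is only an illustration of how the Osmium calls might be assembled, not the exact commands I ran: the file names, the bounding box and the helper function names are made up for the example, and the commands are only printed, not executed.

```python
import shlex


def osmium_extract_cmd(src, dst, bbox):
    """Build an `osmium extract` call that clips src to a bounding box.

    bbox is (left, bottom, right, top) in degrees.
    """
    left, bottom, right, top = bbox
    return ["osmium", "extract",
            "--bbox", f"{left},{bottom},{right},{top}",
            "--output", dst, src]


def osmium_tags_filter_cmd(src, dst, expressions):
    """Build an `osmium tags-filter` call keeping objects that match the expressions."""
    return ["osmium", "tags-filter", "--output", dst, src, *expressions]


# Hypothetical file names and extent, for illustration only.
clip = osmium_extract_cmd("planet.osm.pbf", "clip.osm.pbf", (5.9, 45.8, 10.5, 47.8))
places = osmium_tags_filter_cmd("clip.osm.pbf", "places.osm.pbf", ["n/place"])

for cmd in (clip, places):
    print(shlex.join(cmd))
```

To actually run a command, the list could be handed to `subprocess.run(cmd, check=True)`, assuming Osmium is installed and on the PATH.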
The logical thing to do is to split the data prior to import and use a separate database for each part of the dataset. When using the data later on, any further steps have to be applied to each database individually, but as long as there are not too many splits, this should not be much of a hassle.
What would be more reasonable than to split by continents? Luckily, the company Geofabrik already offers extracts of the planet file split by continent and, if preferred, by country.
Instead of splitting the complete dataset on my own, I wrote a small script which downloads each continent separately and then imports each file into its own PostGIS database. This procedure was quite fast for all continents except Europe.
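A script along these lines might look like the sketch below. It is not my original script. The region names and the URL pattern are assumptions based on how the Geofabrik download server is laid out, and the database names, the `osm:osm@localhost` credentials, `mapping.yml` and the binary name `imposm` are placeholders. The databases are expected to exist already, with the PostGIS extension enabled.

```python
import subprocess
import urllib.request

GEOFABRIK_BASE = "https://download.geofabrik.de"  # assumed URL layout

# Top-level regions as assumed to be offered by Geofabrik (8 splits).
CONTINENTS = [
    "africa", "antarctica", "asia", "australia-oceania",
    "central-america", "europe", "north-america", "south-america",
]


def download_url(continent):
    """URL of the latest extract for one continent."""
    return f"{GEOFABRIK_BASE}/{continent}-latest.osm.pbf"


def database_name(continent):
    """One database per continent, e.g. 'osm_north_america' (hypothetical naming)."""
    return "osm_" + continent.replace("-", "_")


def imposm_import_cmd(pbf, dbname, mapping="mapping.yml", binary="imposm"):
    """Build an Imposm import call that reads the PBF and writes into PostGIS."""
    connection = f"postgis://osm:osm@localhost/{dbname}"  # placeholder credentials
    return [binary, "import", "-connection", connection,
            "-mapping", mapping, "-read", pbf, "-write"]


def run_all():
    """Download and import every continent, one after the other."""
    for continent in CONTINENTS:
        pbf = f"{continent}-latest.osm.pbf"
        urllib.request.urlretrieve(download_url(continent), pbf)
        subprocess.run(imposm_import_cmd(pbf, database_name(continent)), check=True)
```

Calling `run_all()` starts the downloads, which are several gigabytes each, so it is deliberately not invoked automatically here.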
Still too Large
If you take a closer look at the metadata of each continent extract, you can see that Europe is by far the largest one. While the import of the other continents was done in a matter of hours, the import of Europe ran for over a day before my computer automatically logged off and on again. Sadly, I have no idea what happened in detail, since I was not present when this incident occurred.
But then I got another idea. I observed that the import of Europe was progressing at a remarkably slower rate than the other ones. What if there was some kind of internal timeout? The amount of system resources used during the import did not change during the first two hours. Assuming it did not change during the rest of the import either, running out of resources cannot be the reason for stopping.
The complete import was done on a traditional hard drive, and the PostGIS database is located on the same HDD. What I did next was to set the -cachedir parameter of Imposm3 to a directory on my internal SSD. This should speed up reads of the index, which happen a lot during the import.
And so it happened! The import was about 10 to 15 times faster when using the SSD as the storage medium for the temporary index. This is still not as fast as with the other continents, but way faster than before. The source file as well as the PostGIS database remained on the HDD. This was great news, since the index is generated only once but accessed millions of times. As far as I know, only write access wears SSDs down.
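Moving the cache to the SSD only means adding one option to the import call. The sketch below shows the idea. The paths, the connection string and the helper name `with_cachedir` are hypothetical, and the command is only printed, not executed.

```python
import shlex


def with_cachedir(import_cmd, cachedir):
    """Return a copy of an Imposm import command with the cache moved to cachedir."""
    return [*import_cmd, "-cachedir", cachedir]


# Hypothetical base command: source file and database stay on the HDD.
base = ["imposm", "import",
        "-connection", "postgis://osm:osm@localhost/osm_europe",
        "-mapping", "mapping.yml",
        "-read", "/mnt/hdd/europe-latest.osm.pbf",
        "-write"]

# Only the temporary index (the cache) goes to the SSD.
fast = with_cachedir(base, "/mnt/ssd/imposm_cache")
print(shlex.join(fast))
```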
A problematic situation might occur in the areas on the border between two continents. Since the sliced extracts overlap to a certain extent, there is redundant information. For the “places” layer shown in the image above, I merged all input datasets with the “Merge shapes layers” SAGA command in the Processing toolbox of QGIS. After that I could apply a cleaning algorithm to the dataset. This might not be practical (or even feasible) with very big layers, so another solution has to be found.
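One conceivable way to clean a merged layer is to keep only the first occurrence of each OSM object. The sketch below is not the cleaning algorithm I used in QGIS. It assumes that every feature carries its OSM id in a field called `osm_id`, and it uses tiny made-up layers as plain Python dictionaries to show the principle.

```python
def merge_unique(layers, key="osm_id"):
    """Merge several layers, keeping the first feature seen for each id."""
    seen = set()
    merged = []
    for layer in layers:
        for feature in layer:
            ident = feature[key]
            if ident in seen:
                continue  # already taken from an earlier, overlapping layer
            seen.add(ident)
            merged.append(feature)
    return merged


# Made-up sample features; the ids are not real OSM ids.
europe = [{"osm_id": 1, "name": "Istanbul"}, {"osm_id": 2, "name": "Vienna"}]
asia = [{"osm_id": 1, "name": "Istanbul"}, {"osm_id": 3, "name": "Ankara"}]

places = merge_unique([europe, asia])
print([feature["name"] for feature in places])
```

Because all ids are held in memory, this shares the limitation mentioned above for very big layers.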
To avoid this problem, one could split the planet file into regions oneself and take care to cut cleanly, without overlap.
I can recommend the “continent split planet” way of importing OpenStreetMap data when you do not have access to a big server with loads of RAM and CPU cores. The hassle of e.g. reapplying the same cartographic design to each of the continents individually is within limits, since there are only 7 (in case of the splits 8) of them.