OSM in CouchDB on Raspberry Pi
The Raspberry Pi is a cheap, credit-card-sized ARM computer with a 700 MHz CPU and 256 MB of RAM that draws only about 3.5 W of power. It costs about 30€ and runs an optimised Debian Linux named “Raspbian”. Ever since I heard about it, I have wanted to see how it performs with a CouchDB installation. CouchDB is a document-based database that should perform well with little RAM, which makes it a good fit for the Raspberry Pi. There is also a spatial extension named Geocouch, which adds a spatial index.
What could be the benefits of this kind of system? Clearly, the benefits lie in a system architecture that updates the data only seldom but queries it regularly. Moving an OSM database onto autonomous hardware makes the system independent of the main computer. Also, depending on further research into possible queries, such a system could prove to be a versatile information gateway for OSM data and take some load off the heavily used default APIs offered by the OSM project. With a setup like this, anyone could run an information-delivering API that is cheap to set up and to maintain.
So, it was time to test this configuration against an OpenStreetMap dataset! The setup of CouchDB was not 100% straightforward. After installing the package I had to modify the start-up script because there was a problem with the ownership of an automatically created directory. Geocouch I had to compile myself, and I again had to modify the start-up script, since the method proposed in the readme of the Geocouch project did not work for me. (I will not go into more detail about setting up this environment here, but maybe I will write a post about it later on.)
On the hardware side, in addition to the 8 GB SD card that held the system, I attached a USB HDD with 400 GB of space. I had to change the CouchDB configuration to relocate the storage of both the database and its views to a directory on the USB device.
I had two datasets at hand: first, all points in the area of Vienna, which add up to 445220 entries, and second, the complete OSM dataset for Austria.
To convert the OSM data to JSON and batch-upload it to CouchDB, I used the method and tools described on the OSMCouch page of the OpenStreetMap Wiki. It uses the great Osmium framework in combination with a custom description file to generate GeoJSON-compatible output. Preparing the dataset for the extent of Vienna was not that time-consuming, whereas the one for Austria took quite some time to process and resulted in a 6.1 GB JSON file. The pre-processing of these datasets was not done on the Raspberry Pi but on a much more powerful computer.
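For reference, relocating the storage comes down to two keys in the [couchdb] section of CouchDB's local.ini: database_dir and view_index_dir. The following is only a small sketch that prints such a section. The mount point /mnt/usb/couchdb is a made-up example path, not the one I used, and the function name is hypothetical.

```python
import configparser
import io


def storage_config(data_dir):
    """Return a local.ini fragment that points CouchDB's database
    files and view indexes at `data_dir`."""
    cfg = configparser.ConfigParser()
    cfg["couchdb"] = {
        "database_dir": data_dir,    # where the .couch database files live
        "view_index_dir": data_dir,  # where the view indexes are written
    }
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical mount point of the USB disk.
    print(storage_config("/mnt/usb/couchdb"))
```

The directory has to be writable by the user CouchDB runs as, and the server needs a restart after the change.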
Upload to CouchDB
Uploading was done with a little script named “chunkybulks.py”, which is also mentioned on the OSMCouch page. All this script does is take a huge JSON file and upload it to a specified CouchDB database in chunks of 10,000 entries (the chunk size can be set manually). If you tried to upload a big file in one go, the server most probably could not cope with it. On the same page there is a note saying that ImpOSM will soon come with integrated CouchDB support, but since that was not yet available I stuck to the old but still reliable method.
This method worked fine for the Viennese points, but with the complete dataset of Austria it crashed after 450,000 inserted entities. My guess is that the CouchDB server could not respond fast enough because the bulk size (10,000) was too large. After I changed it to 1,000 the upload ran through, but it was terribly slow.
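To make the idea concrete, here is a minimal sketch of such a chunked upload. It is not the actual chunkybulks.py, just an illustration of the principle. It assumes the documents are already available as an iterable of Python dicts, and it posts each chunk to CouchDB's _bulk_docs endpoint. The function names and the database URL are made up for this example.

```python
import json
from itertools import islice
from urllib import request


def chunks(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def post_bulk(db_url, batch):
    """POST one batch of documents to the database's _bulk_docs endpoint."""
    body = json.dumps({"docs": batch}).encode("utf-8")
    req = request.Request(
        db_url.rstrip("/") + "/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


def upload(docs, db_url, size=10000, send=post_bulk):
    """Upload all documents chunk by chunk; return how many were sent."""
    total = 0
    for batch in chunks(docs, size):
        send(db_url, batch)
        total += len(batch)
    return total
```

A call would look like upload(docs, "http://localhost:5984/osm", size=10000), where the database name osm is again just an example. The send parameter exists so that the chunking can be tried out without a running server.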
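Instead of fixing the chunk size by hand, a script could also start with large chunks and only fall back to smaller ones when a request fails. This is not what I did, just a sketch of the idea. The function name, the min_size parameter and the choice to treat any OSError (which includes urllib's connection and timeout errors) as "chunk too large" are assumptions.

```python
def send_with_fallback(batch, send, min_size=100):
    """Try to send `batch` in one request. If that fails, split the
    batch in half and retry each half. Returns the number of
    successful requests."""
    try:
        send(batch)
        return 1
    except OSError:
        # Give up once the batch cannot sensibly be split any further.
        if len(batch) <= min_size:
            raise
        mid = len(batch) // 2
        return (send_with_fallback(batch[:mid], send, min_size)
                + send_with_fallback(batch[mid:], send, min_size))
```

One caveat: a failed bulk request may already have stored part of its documents, so a real script would probably also have to deal with conflicts when the same documents are sent again.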
Index Generation, Queries
Now came the critical part. Simply put (maybe too simply, but it is good enough to get the general idea), queries in CouchDB are pre-defined as JavaScript functions. Such a pre-defined query is called a “View”. For faster access, CouchDB generates an index per view. Generating the index is very time-consuming, but once the index is available, any future query will be very fast. I’m especially interested in the performance of this kind of query, which relies on an already existing index.
So, I had to create a view, preferably a spatial view (for a more detailed explanation of this topic, please take a look at the Geocouch documentation), and execute it once to trigger the generation of the index. As expected, generating the index for all 445220 entries in the Vienna dataset took hours. The index generation for the dataset of Austria took days.
When querying the dataset there is a slight delay noticeable, but the response comes quite fast considering the power of the Raspberry Pi and the size of the database.
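As a small illustration of what such a view looks like: views live in a so-called design document, which holds the JavaScript map function as a string. The sketch below builds such a design document and the URL under which the view can be queried. The document layout (tags under doc.properties) and the names osm and by_amenity are assumptions for this example and may not match the output of the OSMCouch tools.

```python
import json

# JavaScript map function, stored as a string inside the design document.
MAP_FUNCTION = """
function (doc) {
  // Assumed layout: GeoJSON-like documents with tags in doc.properties
  if (doc.properties && doc.properties.amenity) {
    emit(doc.properties.amenity, null);
  }
}
"""


def build_design_doc(name, view_name, map_source):
    """Return a design document containing a single view."""
    return {
        "_id": "_design/" + name,
        "language": "javascript",
        "views": {view_name: {"map": map_source.strip()}},
    }


def view_url(db_url, name, view_name):
    """Return the URL under which the view is queried with GET."""
    return "{}/_design/{}/_view/{}".format(db_url.rstrip("/"), name, view_name)


if __name__ == "__main__":
    print(json.dumps(build_design_doc("osm", "by_amenity", MAP_FUNCTION), indent=2))
    print(view_url("http://localhost:5984/osm", "osm", "by_amenity"))
```

The design document is stored in the database like any other document. The first GET on the view URL is what triggers the index generation.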
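For a spatial view, the design document gets a spatial section instead of views, and the query takes a bounding box. The sketch below follows the conventions of the Geocouch readme for the CouchDB 1.x line as far as I remember them (a _spatial path and a bbox parameter in the order west, south, east, north), so please check it against the documentation of your Geocouch version. The names geo and points, and the rough bounding box of Vienna, are assumptions.

```python
# Spatial function: emits the GeoJSON geometry of every document that has one.
SPATIAL_FUNCTION = (
    "function (doc) { if (doc.geometry) { emit(doc.geometry, doc._id); } }"
)


def spatial_design_doc(name, index_name):
    """Return a design document with a single spatial index."""
    return {"_id": "_design/" + name, "spatial": {index_name: SPATIAL_FUNCTION}}


def bbox_query_url(db_url, name, index_name, west, south, east, north):
    """Return the URL for a bounding-box query against the spatial index."""
    bbox = ",".join(str(v) for v in (west, south, east, north))
    return "{}/_design/{}/_spatial/{}?bbox={}".format(
        db_url.rstrip("/"), name, index_name, bbox
    )


if __name__ == "__main__":
    # Very rough bounding box around Vienna (an approximation).
    print(bbox_query_url("http://localhost:5984/osm", "geo", "points",
                         16.18, 48.12, 16.58, 48.32))
```

As with normal views, the first request against this URL is the one that builds the index.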
One interesting question is: What will happen if I use the complete world-file? Given enough hard-disk space and time, the upload and generation of indices should be possible – but how fast will the query be?
Another interesting question is how the import method of the future version of ImpOSM will perform, especially since it has support for DIFFs!