Today I want to talk about my very first python module that I did publish on PyPI, the Python package index. It is called mpcouch and performs a single task, I was surprised to find no module for. But first some theory!
Uploading to CouchDB as it is
To upload a documents to a CouchDB database, it simply gets pushd to the databases HTTP REST interface. If you have more than one document to store, you repeat this step. This can become a very slow process when you have to add thousands or even millions of documents.
Therefore, CouchDB supports to upload a set of documents as a whole. This is called batch-upload an performs tremendously faster than single document uploads. The existing couchdb module for python offers this way to upload data with its API.
The tedious part
This sounds very good so far, but here comes my problem: When using CouchDB I usually deal with a very high amount of data I have to initially fill the database with. In most cases this lookes like this: Take some data, perform some preparatory steps on it, make the document entry for the database ready, and finally push this single document to the database (or collect them to store them later as a batch). It would be very comfortable to just have to push a single document to the database and the database driver would only collect these documents. When enough have been collected, they would be batch-uploaded to the database while possible further documents are collected again. Ideally, this batch-upload would work in parallel to the collection of new documents, so the actual processing does not have to hold.
The mpcouch module
This is exactly what the mpcouch module does. After importing the module with
you can create a new mpcouchPusher object that now represents your bulk-uploading interface to a CouchDB:
myCouchPusher = mpcouch.mpcouchPusher( "
http://localhost:5984/myDatabase", 30000 )
The first parameter is the URL to an existing CouchDB database. It has to be specified in this way so it is easy for the user to switch to a non-local CouchDB. The second parameter (“30000”) represents the amount of doucments that have to be collected before a batch-upload is performed. This value seems to be a reasonable value for my system – it might be different with yours.
You now can save a document to the database by using the pushData function of the pyCouchPusher object:
myCouchPusher.pushData( MyPythonJSONDocument )
The document you push has to be formatted as a Python object representing a valid JSON document in the same way as the couchdb module would require. The mpcouch module now collects all incoming documents and bulk-uploads them as soon as 30000 of them have been received. One great thing is, that to do so, the module spawns an own process to perform the upload while it continues to catch new documents. Your program does not have to wait for the database upload to finish!
When you have generated you last document and want to finish your database upload procedure, it is important to call the method
This assueres all still running batch-uploading processes are finished and the last batch of accumulated documents is uploaded before the program is allowed to continue.
Still not perfect
Of course there are still many rough edges around this package. For example, a new uploading process is started every time when enough documents have been collected – no matter how many uploading processes are already running. If the upload performs to slowly more and more upload batches are started in parallel which in turn make the overall upload even more slower. By implementing a simple process-cap this could be solved quite easily. Until then one has to find the right limit value for the size of the batch upload for any given scenario.
For certain projects it might be necessary to work on the complete dataset of the OpenStreetMap project (OSM). My preferred way of using data from OSM is usually done by performing the following steps:
- download the OSM data
- selecting my area of interest by clipping it to my preferred extent
- importing the data into a PostGIS database
The complete database-dump of OSM is called the “Planet File” and weights at the moment about 36GB in its compressd form.
To clip the data I usually resort to a very handy command-line tool called Osmium where you are able to define a region of interest and/or specify certain tags to filter the data by. The import into a PostGIS database can be performed by multiple tools, of which I prefere one called Imposm3 because of its speed. But still, when using it on the complete planet file with limited ressources (I used a dual core CPU and 8GB ram), it gets terrible slow and does not complete without error. My guess is, that the index which is created during the import procedure to access the data itselfe becomes so large that the access to it is not fast enough. For the extract of alone the continent of Europe, the index is more than 20GB.
The logical thing to do is to split the data prior to import and use different databases for each part of the dataset. When using the data later on, one will have to apply whatever further steps are taken to each database individually, but still – when not using too many splits, this should be not a lot of hassle.
What would be more reasonable than to split by continents? Luckily the company Geofabrik already offers extracts of the planet file split by continents and countries, if prefered.
Instead of splitting the complete dataset on my own, I wrote a small script which downloads each continent on its own and then goes on to import each file in a unique PostGIS database. This procedere was quite fast for all countries, except Europe.
Still too Large
If you take a closer look at the metadata of each extract by continent, you can see, that Europe is by far the largest one. While the import of the other continents was done in a matter of hours, the one of Europe took over a day before my computer automatically logged off and on again. Sadly, I have no idea what happened in detail since I was not present when this incident occured.
But then I got another idea. I observed that the import procedure of Europe was happening at an remarkably slower rate than the other ones. What if there was some kind of internal timeout? The amount of system ressources used during the import did not change during the firt two hours. Assumin it dit not change during the complete import procedure, this can not be the reason for stopping.
The complete import procedure was done on a traditional harddrive. The PostGIS database also is located on the same HDD. What I did next, was to specify the -cachedir parameter of Imposm3 to use my internal SSD drive. This should speed up sequential access to the readout of the index which happens a lot during the import.
And so it happened! The import itself was about 10 to 15 times faster when using the SSD as storage medium for the temporary index. This is still not as fast as with the other continents, but still ways faster than before. The source file itself as well as the PostGIS database remained on the HDD. This was great news, since the index is generated only once but accessed millions of times. As far as I know, only write access to SSDs wears them down.
A problematic situation migth occure when paying special attention to the areas on the border between two continents. Since the sliced data overlaps for a certain amount, there is redundand information. In case for the “places” layer shown in the image above, I merged all input datasets by the “Merge shapes layers” SAGA command in the Processing toolbox of QGIS. After that I could apply a cleaning algorithm to the dataset. This might not be practical (or even feasable) with very big layers, so there has to be found another solution.
To avoid this problem, one could split the planet file into regions by oneself and care to cut sharply.
I can recommend using the “continent split planet” way of importing OpenStreetMap data when not having access to a big server with loads of RAM and cpu-cores. The hassel of e.g. reapplying the same cartographic design to each of the continents individually is within limits, since there are only 7 (in case of the splits 8) of them.
Do you know what a “Schwarzplan” is? In English it is called “figure-ground diagram” and is a map showing all built areas as black and everything else in white colour (including the background). Here is an example of such a map for the area of Paris.
From such a map you can gain insights into many different aspects of urban structures. So it is for example possible to recognize hints of the age of parts of the city, their kind of reignship, their history of development, suspected density and many more. Furthermore, figure-ground diagrams have an aesthetically appealing effect and strongly underline the differences between different cities when comparing them.
Up until now, there was no atlas focussing on this kind of map which was sad . This might be due to the fact that one would have to get costly data about the buildings for each city individually But thanks to the OpenStreetMap Project I can present to you the atlas of figure-ground diagrams, a book called:
“Schwarzplan” is an atlas displaying figure-ground plans of 70 different cities all over the world together with some basic statistical information about their populations and extent on 152 pages made out of high quality glossy paper.
Since a week ago you can download the Upper Austria Quiz from the Google Play store: OÖ Quiz on Google Play
I always wanted to learn how to program mobile devices. During Christmas holidays the government of Upper Austria called for mobile apps using their OpenGovernmentalData. So, the chance was taken and Kathi and me started to design an Android app. Actually it is an hybrid app. This means that you develop your complete app as an HTML5 project and bundle it with a small server. The result is a valid android APK which behaves like a native app. The advantages are that the toolkit we used to generate the package (Apache Cordova) is capable of producing running binaries for a whole bunch of operating systems, including Windows 7, BlackBerry, Symbian, WebOS, Android and, if you can afford it, iOS. Sinc we can’t pay 100$ per year to be allowed to just push the Quiz to the AppStore, there is no version for iPhone.
Since we both own Android devices, the app is optimized for Android devices. Still, there were some difficulties. First, we had to decide on a framework. jQuery Mobile seemed to be a reasonable choice at that time since it is well known and I already are used to default jQuery. Next time, I will go with a toolkit supporting any kind of MVC architecture that avoides usage of HTML whenever possible. One of these would be e.g. Kendo UI. But we chose jQuery Mobile and stuck to it.
The size of the toolbar at the bottom was a perculiar problem: While it was easy to adjust the size of the font depending on different screen resolutions of different devices, the toolbar would not change its display height.
We decided that despite its fixed behaviour it will be usable in any case, even when being very small and optimized it for the lowest screen resolution we could find. This means the app runs on even very small devices.
The map background is variable in size and automatically adjusts to the display size.
The questions are generated out of OGD and OSM data. The most work was not to generate the questions which was mostly a case of combining a fixed phrase with a word describing a location, but to manually filter these questions for plausability. It just does not make any sense to let the quiz ask for a very small mountain which only a few people of Austria know by name themselves. The same goes with lakes – in this case we filtered by their area.
It was a fun project and I find myself playing the quiz which is a good sign. Kathi and me, we already have ideas to extend the OÖ Quiz and even have plans for a quite advanced second version with multiplayer support – it depends on our free time whether we will implement it or not.