High-performance CouchDB uploader for Python
Today I want to talk about the very first Python module I have published on PyPI, the Python Package Index. It is called mpcouch and performs a single task I was surprised to find no existing module for. But first, some theory!
Uploading to CouchDB as it is
To upload a document to a CouchDB database, you simply push it to the database's HTTP REST interface. If you have more than one document to store, you repeat this step for each of them. This becomes very slow when you have to add thousands or even millions of documents.
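In plain terms, a single-document upload is one HTTP POST of the JSON body to the database URL. Here is a minimal sketch using only the standard library; the helper names and the local database URL are my own stand-ins, not part of any package discussed here:

```python
import json
import urllib.request

def build_upload_request(db_url, doc):
    """Build a POST request that stores one JSON document in CouchDB."""
    return urllib.request.Request(
        db_url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def upload_doc(db_url, doc):
    """One full HTTP round trip per document -- the slow path."""
    with urllib.request.urlopen(build_upload_request(db_url, doc)) as resp:
        return json.load(resp)

# Millions of documents mean millions of round trips:
# for doc in documents:
#     upload_doc("http://localhost:5984/myDatabase", doc)
```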
Therefore, CouchDB also supports uploading a set of documents as a whole. This is called a batch upload and performs tremendously faster than single-document uploads. The existing couchdb module for Python offers this way of uploading data through its API.
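With the couchdb package, the batch path goes through Database.update(), which wraps CouchDB's _bulk_docs endpoint. A hedged sketch — the server URL and the batch size of 30000 are placeholders, and the chunked helper is my own addition:

```python
def chunked(docs, size):
    """Yield successive batches of at most `size` documents."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

# import couchdb
# db = couchdb.Server("http://localhost:5984/")["myDatabase"]
# for batch in chunked(all_docs, 30000):
#     db.update(batch)   # one HTTP request stores the whole batch
```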
The tedious part
This sounds very good so far, but here is my problem: when using CouchDB I usually deal with a very large amount of data that I have to fill the database with initially. In most cases this looks like this: take some data, perform some preparatory steps on it, prepare the document entry for the database, and finally push this single document to the database (or collect the documents to store them later as a batch). It would be much more comfortable to just push a single document to the database driver and let the driver collect these documents itself. When enough have been collected, they would be batch-uploaded to the database while further documents are collected. Ideally, this batch upload would run in parallel to the collection of new documents, so the actual processing does not have to halt.
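Without such a driver, the hand-rolled version of this collect-and-flush pattern tends to look like the following sketch. All names here are hypothetical, and flush() only records batches where a real version would upload them:

```python
class NaiveBatcher:
    """Collect documents and flush them in fixed-size batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []
        self.batches = []  # stand-in for "uploaded" batches

    def push(self, doc):
        """Collect one document; flush when the batch is full."""
        self.buffer.append(doc)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """The upload would happen here -- and it blocks the caller."""
        if self.buffer:
            self.batches.append(self.buffer)
            self.buffer = []
```

The weak spot sits in flush(): the caller waits until the batch is stored, which is exactly the stall that an upload running in a separate process avoids.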
The mpcouch module
This is exactly what the mpcouch module does. After importing the module with
import mpcouch
you can create a new mpcouchPusher object that represents your bulk-uploading interface to a CouchDB database:
myCouchPusher = mpcouch.mpcouchPusher( "http://localhost:5984/myDatabase", 30000 )
The first parameter is the URL of an existing CouchDB database. It has to be specified this way so it is easy to switch to a non-local CouchDB. The second parameter (“30000”) is the number of documents that are collected before a batch upload is performed. This value seems reasonable on my system – it might differ on yours.
You can now save a document to the database by using the pushData method of the mpcouchPusher object:
myCouchPusher.pushData( MyPythonJSONDocument )
The document you push has to be a Python object representing a valid JSON document, in the same form the couchdb module would require. The mpcouch module collects all incoming documents and bulk-uploads them as soon as 30000 of them have been received. The great thing is that, to do so, the module spawns its own process to perform the upload while it continues to collect new documents. Your program does not have to wait for the database upload to finish!
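A typical ingest loop then stays a plain one-push-per-document affair. In this sketch, make_doc and the record source are made-up stand-ins for whatever data preparation your program actually does:

```python
def make_doc(record_id, payload):
    """Prepare one CouchDB-ready document: any dict that serializes
    to valid JSON (optionally with an explicit _id) will do."""
    return {"_id": str(record_id), "value": payload}

# With the pusher from above, batching and uploading happen behind the scenes:
# for n, payload in enumerate(my_records):
#     myCouchPusher.pushData( make_doc(n, payload) )
```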
When you have generated your last document and want to finish the upload procedure, it is important to call the pusher's finishing method.
This ensures that all still running batch-upload processes finish and that the last batch of accumulated documents is uploaded before the program continues.
Still not perfect
Of course there are still many rough edges in this package. For example, a new upload process is started every time enough documents have been collected – no matter how many upload processes are already running. If the upload performs too slowly, more and more upload batches run in parallel, which in turn makes the overall upload even slower. This could be solved quite easily by implementing a simple process cap. Until then, one has to find the right batch-size value for any given scenario.
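Such a process cap could be sketched with a bounded semaphore from the standard library. The cap of 4 and all function names below are assumptions of mine, not part of mpcouch; upload_batch is only a placeholder for the real bulk upload:

```python
import multiprocessing

def upload_batch(batch, slots):
    """Placeholder for the actual batch upload; frees its slot when done."""
    try:
        pass  # ... perform the bulk upload of `batch` here ...
    finally:
        slots.release()

def start_upload(batch, slots):
    """Block the producer once all upload slots are already in flight."""
    slots.acquire()  # waits here if the cap is reached
    worker = multiprocessing.Process(target=upload_batch, args=(batch, slots))
    worker.start()
    return worker

# slots = multiprocessing.BoundedSemaphore(4)  # hypothetical cap of 4 uploads
# workers = [start_upload(batch, slots) for batch in batches]
```

Passing the semaphore as an argument (rather than relying on a global) keeps the sketch working under both the fork and spawn start methods of multiprocessing.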