Part 1 — The Database. How to build a real time data sync, multi platform app with CouchDB and PouchDB

Tudor Dumitriu
Adaptabi
Published in
8 min readDec 9, 2020

--

This is the first part of a series of posts on how to build real time data sync, multi-platform app. Here we address the database design.

Context

The application we are building is a personal assistant app, that would allow the user to manage and share his tasks in real time.
Since one of our biggest concerns was for the data to sync automatically on multiple platforms and devices, the DB selection was our top priority and had a heavy influence on the rest of our architectural choices.

Criteria

When we started investigating the database options, the requirements were clear:

  • Data must be synced automatically on all devices and in between all user databases replication should be done according to business rules
  • Data must be available in offline mode
  • The server DB should have a browser DB integration supported both on web and mobile
  • Avoid vendor and format lock-in as much as possible
  • The DB must be developed and maintained by strong organizations, with real experience, good support and active communities

Given the above, the alternatives were gradually dismissed as following:

  • We started with SQL Server Compact, since we were looking for a SQL DB initially, but it was deprecated since 2013
  • The first NoSQL option was Google Firebase Realtime Database, but at that time didn’t have any client support for offline data (was in beta actually) and even if at the current date it does have, it would force us into using the Firebase Platform and has quite a few limitations when looking to achieve filtered replication between user databases
  • There is also AWS Amplify, which wasn’t also available at that time, but looks promising, having offline and real time data sync support, but is opinionated and definitely imposes a vendor lock-in
  • We have even tried building our own offline data sync mechanisms in previous apps, but with no real time data sync. Not just that such a mechanism is limited and requires a lot of effort to implement, it’s sometimes brittle (on poor connectivity) and a small team cannot possibly cover all the scenarios as the above product teams can

Choice

All that lead us to the CouchDB / PouchDB pair, which, even at the time of the writing seems the best choice.

CouchDB is a NoSQL database with “Seamless multi-master sync, that scales from Big Data to Mobile, with an Intuitive HTTP/JSON API and designed for Reliability”, that has a large community and is used by giants like npmjs.
Developed by The Apache Software Foundation, has also strong support from IBM (Cloudant is a compatible DB developed by IBM implementing the same replication protocol).
With the browser PouchDB that “enables applications to store data locally while offline, then synchronize it with CouchDB”, the pair seemed the obvious choice for us.

Theory

Switching the paradigm from SQL to NoSQL is not easy, especially if you’ve only used SQL databases, so a little bit of theory and technical mentions are in order. But we’ll keep it short.

In the world of distributed systems we have the CAP Theorem that states “it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees: Consistency, Availability, Partition Tolerance”.
With that in mind, CouchDB places itself at the intersection between Availability and Partition Tolerance, in an area named Eventual Consistency. Meaning that the accent falls on the high Availability and NOT on strong Consistency. Two clients might have different versions up until the data gets replicated and the eventual conflicts solved, that’s why we say the data is eventually consistent.

On top of that, a NoSQL database works differently that a SQL database, and because of their particularities, we need to consider the following:

  • Instead of rows in tables there are JSON documents in databases
  • There are no tables and no joins, so when designing a DB schema one needs to bear in mind how to architect the data structure based on the frequency of reads and writes
  • There is support for views/indexes based on the MapReduce mechanism, basically each view stands on a Map function that filters the documents and has a Reduce option to aggregate the data (reduce/rereduce and underlying B+ Trees structure need to be understood)
  • There is a “No locking” concept, each document gets a new revision number on each update and the update works much like a version control system. There is also no concept of transaction
  • Since the DB is eventually consistent, the user might be updating “old” data and conflicts will arise. Even if there is an automatic conflict solving mechanism, we still need to build on top of that to make sure information is not lost (conflicts are NOT automatically merged, just a winning revision is selected)
  • CouchDB has limited authentication/authorization support, meaning that in order to achieve full separation of data at the user DB level we need to have per user databases with filtered replication between them
  • Even if the replication is the “killer” feature of CouchDB, abusing continuous replication between databases can also lead to resource abuse (ports and sockets)
  • If from the client side there isn’t much we can do to improve the replication performance, on the server level there is the powerful Spiegel engine (“Scalable replication and change listening for CouchDB”) that allows triggering replications on demand and according to specific rules
  • CouchDB doesn’t come with a full text search but in the latest version it’s much easier to configure it. There is no full text search (out of the box) in PouchDB either, but most of the time a DB SCAN will do, because it’s quite fast
  • PouchDB uses long polling to keep the data in sync with CouchDB, and under the hood it uses IndexedDB to persist data, but in Cordova we can safely use SQLite via a plugin and overcome the storage quota limitations of the web views

As you can see, if you’re coming from a SQL world, things are a bit different and you’re in for much more. But it might be worth it, because it’s all about finding the right tool for the job.

Design

Before diving into the specifics of our Database design, here are a few words about the product being built.
The app is basically a task management system, that allows organizing tasks in lists and calendars, sharing them with friends and more, all while being focused on your goals. It also needs to be able to run simultaneously on browsers and mobile devices.
That should explain pretty well why the app should be available in offline mode, and why the data needs to sync in real time.
Also, only the shared lists of tasks need to be kept in sync between friends, and since the data is entirely synced with the client, each user needs their own DB, with restricted access, but the common records (and only the common records) need to by synced automatically between per-user DBs.

Image 1: Database Architecture

Image 1 presents the high-level database architecture, but what is very important here is that the standard communication paradigm has shifted from UI-API-DB to UI-DB-SERVICES.

This approach has its challenges, because on the server side we need listener services always running to track events that trigger business logic. Also this forced us to have quite a lot of the business logic residing on the client side, which is not necessarily a bad thing but needs to be kept in mind.

In order to understand it better, a general data flow process should shed some light. Assuming that User 1 and User 2 are friends and have a shared list of tasks, when User 1 updates one of the shared tasks, the following data flow occurs:

  1. Task gets replicated from PouchDB Client User DB1 to CouchDB User DB1
  2. Spiegel listens to CouchDB User DB1 changes and triggers a CouchDB replication from User DB1 to CentralDB
  3. Spiegel listens to CentralDB changes and determines (via an API call, but let’s keep things simple) what changes go where and if the User 2 has the required permissions
  4. Spiegel triggers a CouchDB replication from CentralDB to the User DB2
  5. PouchDB long polling via HTTP mechanism is getting the changes in Couch DB User DB 2 and updates the PouchDB Client User DB 2

Quirks

The process, as always, wasn’t defined straight through, the design emerged gradually while adding more features, facing more and more challenges and limitations. We had to come up with various solutions, listeners, complementing services, even cleanup jobs for the deleted revisions and here are some of them:

  • Conflicts will happen when 2 users update the same document at the same time. Then, scheduled and constantly running jobs need to be created, to watch those conflicts and to gracefully (merge) solve them (otherwise CouchDB will select one of the revisions and that’s it)
  • Solving conflicts on the client side: in PouchDB conflicts need to be considered and solved, but it might not be enough
  • If you need more granular control when replicating a document according to specific business rules, Spiegel can call an API endpoint where you can implement specific logic
  • A CentralDB, holding ALL the important documents is needed because otherwise it would be very difficult to query documents for specific reports. But also, listening to document changes in EVERY User DB is highly impractical
  • Because the DB is responsible of Client-Server communication, this leads to a CQRS-ish way of doing things, meaning that the services are listening to changes and if there is a business-concerning event, the event is raised to the proper service that in turn will be responsible for generating a notification that will finally (via PouchDB to CouchDB replication) arrive to the client. This forced us, for immediate response requests (like adding a friend) to create an API endpoint as well
  • Because even though PouchDB is capable of using various plugins for local storage, the storage footprint needs to be monitored, so we are keeping a limited number of revisions and auto-compaction enabled. Also, since the entire client database gets replicated automatically to the CouchDB server, that too needs to be configured carefully to limit the number of revisions, to enable auto-compaction as well and even more special jobs might have to periodically purge the deleted documents, because otherwise will end up with a huge storage usage
  • We don’t recommend using CouchDB attachments for file storage, because behind the scenes each file would be converted to base64 and replicated via the standard mechanism, which is fantastic for small documents but not so great (actually bad) for larger ones, especially files, so the file, images and attachments support should be handled separately

Conclusion

Both CouchDB and PouchDB are mature technologies, developed by strong teams with active communities. They are most definitely alternatives worth considering and both of them are capable of being used independently.

There is so much more than this, CouchDB is designed with multi cluster replication and sharding in mind, PouchDB has a rich set of plugins allowing it to run from Web Workers or in-memory to transforming the DB into an Event Sourcing one with Delta Pouch. All these are topics for the advanced classes and will require a deeper understanding of the two.

Hoping that we’ve scratched the surface and made you curious, please join the discussion if there’s anything you want to share!

--

--