Secondary indexes have landed in PouchDB

By: Nolan Lawson
Published: 01 May 2014

With the release of PouchDB 2.2.0, we're happy to introduce a feature that's been cooking on the slow simmer for some time: secondary indexes, a.k.a. persistent map/reduce.

This is a powerful new tool for developers, since it allows you to index anything in your JSON documents – not just the doc IDs. Starting in 2.2.0, your data is sortable and searchable in ways that just weren't feasible before, thanks to a new cross-platform indexing engine we've built to work with every supported backend - IndexedDB, WebSQL, and LevelDB (and soon: LocalStorage!).

As usual with PouchDB, the new API is modeled after CouchDB's. So developers who are already familiar with the CouchDB map/reduce API will be up and running in no time.

And did I mention? It's fast. Our performance tests show that the new persistent map/reduce API could give you orders of magnitude improvements over the old on-the-fly query() method. In a database containing 1,000 documents, we found queries in PouchDB 2.2.0 to be between 10 and 100 times faster than in PouchDB 2.1.2.

Show me the code!

Here's the old way to query:

var pouch = new PouchDB('mydb');

// temporary view, each doc is processed in-memory (slow)
pouch.query(function (doc) {
  emit(doc.name);
}, {key: 'foo'}).then(function (result) {
  // found docs with name === 'foo'
});

The emit pattern is part of the standard CouchDB map/reduce API. What the function basically says is, "for each document, emit doc.name as a key."

And here's the new way:

// document that tells PouchDB/CouchDB
// to build up an index on doc.name
var myIndex = {
  _id: '_design/my_index',
  views: {
    'my_index': {
      map: function (doc) { emit(doc.name); }.toString()
    }
  }
};
// save it
pouch.put(myIndex).then(function () {
  // kick off an initial build, return immediately
  return pouch.query('my_index', {stale: 'update_after'});
}).then(function () {
  // query the index (much faster now!)
  return pouch.query('my_index', {key: 'foo'});
}).then(function (result) {
  // found docs with name === 'foo'
});

Ew, you put map/reduce in my JavaScript?

First, let's define what map/reduce is, so you can understand why you might want it.

Indexes in SQL databases

Quick refresher on how indexes work: in relational databases like MySQL and PostgreSQL, you can usually query whatever field you want:

SELECT * FROM pokemon WHERE name = 'Pikachu';

But if you don't want your performance to be terrible, you first add an index:

ALTER TABLE pokemon ADD INDEX myIndex ON (name);

The job of the index is to ensure the field is stored in a B-tree within the database, so your queries run in O(log(n)) time instead of O(n) time.

Indexes in NoSQL databases

All of the above is also true in document stores like CouchDB and MongoDB, but conceptually it's a little different. By default, documents are assumed to be schemaless blobs with one primary key (called _id in both Mongo and Couch), and any other keys need to be specified separately. The concepts are largely the same; it's mostly just the vocabulary that's different.

In CouchDB, queries are called map/reduce functions. This is because, like most NoSQL databases, CouchDB is designed to scale well across multiple computers, and to perform efficient query operations in parallel. Basically, the idea is that you divide your query into a map function and a reduce function, each of which may be executed in parallel in a multi-node cluster.

Map functions

It may sound daunting at first, but in the simplest (and most common) case, you only need the map function. A basic map function might look like this:

function myMapFunction(doc) {
  emit(doc.name);
}

This is functionally equivalent to the SQL index given above. What it essentially says is: "for each document in the database, emit its name as a key."

And since it's just JavaScript, you're allowed to get as fancy as you want here:

function myMapFunction(doc) {
  if (doc.type === 'pokemon') {
    if (doc.name === 'Pikachu') {
      emit('Pika pi!');
    } else {
      emit(name);
    }
  }
}

Then you can query it:

// find pokemon with name === 'Pika pi!'
pouch.query(myMapFunction, {key: 'Pika pi!', include_docs: true}).then(function (result) {
  // handle result
});

// find the first 5 pokemon whose name starts with 'P'
pouch.query(myMapFunction, {
  startkey: 'P', endkey: 'P\uffff', limit: 5, include_docs: true
}).then(function (result) {
  // handle result
});

The pagination options for query() – i.e., startkey/endkey/key/keys/skip/limit/descending – are exactly the same as with allDocs(). For a beginner's guide to pagination, read Pagination strategies with PouchDB.

Reduce functions

As for reduce functions, there are a few handy built-ins that do aggregate operations ('_sum', '_count', and '_stats'), and you can typically steer clear of trying to write your own:

// emit the first letter of each pokemon's name
var myMapReduceFun = {
  map: function (doc) {
    emit(doc.name.charAt(0));
  },
  reduce: '_count'
};
// count the pokemon whose names start with 'P'
pouch.query(myMapReduceFun, {
  key: 'P', reduce: true, group: true
}).then(function (result) {
  // handle result
});

If you're adventurous, though, you should check out the CouchDB documentation or the PouchDB documentation for details on reduce functions.

Map/reduce, reuse, recycle

PouchDB has actually had map/reduce queries since way before version 2.2.0, via the query() API. It had a big flaw, though: all queries were performed in-memory, reading in every document in the entire database just to perform a single operation.

This is fine when your data set is small, but it could be murder on large databases. On mobile devices especially, memory is a precious commodity, so you want to be as frugal as possible with the resources given to you by the OS.

The new persistent query() method is much more memory-efficient, and it won't read in the entire database unless you tell it to. It has two modes:

Temporary views (like the old system)
Persistent views (new system)

Both of these concepts exist in CouchDB, and they're faithfully emulated in PouchDB.

A room with a view

First off, some vocabulary: CouchDB calls indexes views, and these views are stored in a special document called a design document. Basically, a design document describes a view, and a view describes a map/reduce query, which tells the database that you plan to use that query later, so it better start indexing it now. In other words, creating a view is the same as saying CREATE INDEX in a SQL database.

Temporary views are exactly that – temporary. Instead of using a design document, you just call pouch.query(), so the view is created, written to disk, queried, and then poof, it's deleted. That's why, in older version of PouchDB, we could get away with doing it all in-memory.

Crucially, though, temporary views have to do a full scan of all documents every time you execute them, along with all the reading and writing to disk that that entails, only to throw it away at the end. That may be fine for quick testing during development, but in production it can be a big drag on performance.

Persistent views, on the other hand, are a much better solution if you want your queries to be fast. Persistent views need to be saved in a design document (hence "persistent"), so that the emitted fields are already indexed by the time you look them up. Subsequent lookups don't need to do any additional writes to disk, unless documents have been added or modified, in which case only the updated documents need to be processed.

Why design docs? Design documents are special meta-documents that CouchDB uses for things like indexes, security, and schema validation (think: "designing" your database). They are stored, updated, and deleted exactly like normal documents, but their _ids are prefixed with the reserved string '_design/'. They also show up in allDocs() queries, but you can filter them out by using {startkey: '_design0'}.

Writing your first view

A design document with a view looks just like a regular document. The simplest one might look like this:

var designDoc = {
  _id: '_design/my_index',
  views: {
    'my_index': {
      map: function(doc) {
        emit(doc.name);
      }.toString()
    }
  }
};

All you need to do is put() this document into your database:

pouch.put(designDoc).then(function (info) {
 // design doc created
}).catch(function (err) {
  // if err.name === 'conflict', then
  // design doc already exists
});

… and then it will be available for querying using the name 'my_index':

pouch.query('my_index').then(function(result) {
 // do something with result
});

Whatever name you give, be sure to use the same one in both the _id (after '_design/') and in the views; otherwise you will need to use the awkward format 'my_design_doc_name/my_view_name' when you query.

Technically, a design doc can contain multiple views, but there's really no advantage to this. Plus, it can even cause performance problems in CouchDB, since all the indexes are written to a single file. So we recommend that you create one view per design doc, and use the same name for both, in order to make things simpler.

Since there's a lot of boilerplate involved in creating views, you can use the following helper function to create a simple design doc based on a name and a map function:

function createDesignDoc(name, mapFunction) {
  var ddoc = {
    _id: '_design/' + name,
    views: {
    }
  };
  ddoc.views[name] = { map: mapFunction.toString() };
  return ddoc;
}

Then you just need to put() it:

var designDoc = createDesignDoc('my_index', function (doc) {
  emit(doc.name);
});
pouch.put(designDoc).then(function (doc) {
  // design doc created!
}).catch(function (err) {
  // if err.name === 'conflict', then
  // design doc already exists
});

From now on, you can call pouch.query('my_index') and it will summon this design document for querying.

Just like regular documents, design docs can always be deleted or changed later. The index will be updated automatically the next time you query() it.

If you do this a lot, though, then old indexes will build up on disk. You can call pouch.viewCleanup() to clean up any orphaned indexes.

Performance tip: Technically, the view will not be built up on disk until the first time you query() it. So a good trick is to always call query(viewName, {stale: 'update_after'}) after creating a view, to ensure that it starts building in the background. You can also use {stale: 'ok'} to avoid waiting for the most up-to-date results. In the future, we may add an API for auto-updating.

For those who get nostalgic

Of course, some of our loyal Pouchinistas may have been perfectly happy with the old way of doing things. The fact that the old query() function slurped the entire database into memory might have served them just fine, thank you.

To be sure, there are some legitimate use cases for this. If you don't store a lot of data, or if all your users are on desktop computers, or if you require closures (not supported by persistent views), then doing map/reduce in memory may have been fine for you. In fact, an upgrade to 2.2.0 might actually make your app slower, because now it has to read and write to disk.

Luckily we have a great solution for you: instead of using the query() API, you can use the much more straightforward allDocs() API. This will read all documents into memory, just like before, but best of all, you can apply whatever functions you want to the data, without having to bother with learning a new API or even what the heck "map/reduce" is.

But in general, we strongly recommend upgrading your apps to the new API instead.

When not to use map/reduce

Now that I've sold you on how awesome map/reduce is, let's talk about the situations where you might want to avoid it.

First off, you may have noticed that the CouchDB map/reduce API is pretty daunting. As a newbie Couch user who very recently struggled with the API, I can empathize: the vocabulary is new, the concepts are (probably) new, and there's no easy way to learn it, except to clack at your keyboard for awhile and try it out.

Second off, views can take awhile to build up, both in CouchDB and PouchDB. Each document has to be fetched from the database, passed through the map function, and indexed, which is a costly procedure for large databases. And if your database is constantly changing, you will also incur the penalty at query() time of running the map function over the changed documents.

Luckily, it turns out that the primary index, _id, is often good enough for a variety of querying and sorting operations. So if you're clever about how you name your document _ids, you can avoid using map/reduce altogether.

For instance, say your documents are emails that you want to sort by recency: boom, just set the _id to new Date().toJSON(). Or say they're web pages that you want to look up by URL: use the URL itself as the _id! Neither CouchDB nor PouchDB have a limit on how long your _ids can be, so you can get as fancy as you want with this.

Use and abuse your doc IDs

For a more elaborate example, let's imagine you're writing a music app. Your database might contain artists:

[ { _id: 'artist_bowie',
    type: 'artist',
    name: 'David Bowie',
    age: 67 },
  { _id: 'artist_dylan',
    type: 'artist',
    name: 'Bob Dylan',
    age: 72 },
  { _id: 'artist_joni',
    type: 'artist',
    name: 'Joni Mitchell',
    age: 70 } ]

… as well as albums:

[ { _id: 'album_bowie_1971_hunky_dory',
    artist: 'artist_bowie',
    title: 'Hunky Dory',
    type: 'album',
    year: 1971 },
  { _id: 'album_bowie_1972_ziggy_stardust',
    artist: 'artist_bowie',
    title: 'The Rise and Fall of Ziggy Stardust and the Spiders from Mars',
    type: 'album',
    year: 1972 },
  { _id: 'album_dylan_1964_times_they_are_changin',
    artist: 'artist_dylan',
    title: 'The Times They Are a-Changin\'',
    type: 'album',
    year: 1964 },
  { _id: 'album_dylan_1965_highway_61',
    artist: 'artist_dylan',
    title: 'Highway 61 Revisited',
    type: 'album',
    year: 1965 },
  { _id: 'album_dylan_1969_nashville_skyline',
    artist: 'artist_dylan',
    title: 'Nashville Skyline',
    type: 'album',
    year: 1969 },
  { _id: 'album_joni_1974_court_and_spark',
    artist: 'artist_joni',
    title: 'Court and Spark',
    type: 'album',
    year: 1974 } ]

See what I did there? Artist-type documents are prefixed with 'artist_', and album-type documents are prefixed with 'album_'. This naming scheme is clever enough that we can already do lots of complex queries using allDocs(), even though we're storing two different types of documents.

Want to find all artists? It's just:

allDocs({startkey: 'artist_', endkey: 'artist_\uffff'});

Want to list all the albums? Try:

allDocs({startkey: 'album_', endkey: 'album_\uffff'});

How about all albums by David Bowie? Wham bam, thank you ma'am:

allDocs({startkey: 'album_bowie_', endkey: 'album_bowie_\uffff'});

Let's go even fancier. Can we find all of Bob Dylan's albums released between 1964 and 1965, in reverse order? Gather 'round people, and try this:

allDocs({startkey: 'album_dylan_1965_', endkey: 'album_dylan_1964_\uffff', descending: true});

In this example, you're getting all those "indexes" for free, each time a document is added to the database. It doesn't take up any additional space on disk compared to the randomly-generated UUIDs, and you don't have to wait for a view to get built up, nor do you have to understand the map/reduce API at all.

Of course, this system starts to get shaky when you need to search by a variety of criteria: e.g. all albums sorted by year, artists sorted by age, etc. And you can only sort strings – not numbers, booleans, arrays, or arbitrary JSON objects, like the map/reduce API supports. But for a lot of simple applications, you can get by without using the query() API at all.

Performance tip: if you're just using the randomly-generated doc IDs, then you're not only missing out on an opportunity to get a free index – you're also incurring the overhead of building an index you're never going to use. So use and abuse your doc IDs!

Tips for writing views

When you do need the full map/reduce API, though, it pays to know the tips and tricks. Below is a list of advanced techniques and common pitfalls, which you may find useful when you're trying to squeeze that last bit of performance out of your views.

1. Don't index it if you don't need it

If your database has several document types, but only the "person" type has a lastName field, do this:

function (doc) {
  if (doc.lastName) {
    emit(doc.lastName);
  }
}

If you took out the if statement, you'd emit a null key for every non-person in your database, which needlessly bloats the size of your index.

If the null field is meaningful for some reason, though, you can emit it and look it up later using query(viewName, {key: null}). Any undefined fields are treated as null.

2. Move logic to the query params

Don't do:

function (doc) {
  if (doc.highScore >= 1000) {
    emit(doc.highScore);
  }
}

Instead, do:

function (doc) {
  emit(doc.highScore);
}

And then query with {startkey: 1000}.

3. Use high characters for prefix search

To find every lastName that starts with an 'L', you can use the high Unicode character '\uffff':

pouch.query(viewName, {startkey: 'L', endkey: 'L\uffff'});

Or you can set inclusive_end to false:

pouch.query(viewName, {startkey: 'L', endkey: 'M', inclusive_end: false});

Both queries will do the same thing.

4. Use {} for complex key ranges

If your keys are [lastName, firstName], and you need to find everybody with the last name 'Harvey', you can do:

pouch.query(viewName, {startkey: ['Harvey'], endkey: ['Harvey', {}]});

In CouchDB collation ordering, {} is higher than everything except other objects.

For the database geeks: implementation details

If you want to know how map/reduce works in PouchDB, it's actually exceedingly easy: just open up your browser's developer tools, head over to "Resources," and see where we've created a bonus database to store the index.

One of the neat things about how we implemented map/reduce is that we managed to do it entirely within PouchDB itself. Yep, that's right: your map/reduce view is just a regular old PouchDB database, and map/reduce itself is a plugin with no special privileges, creating databases in exactly the same way a regular user would. Since the implementation sits on top of the PouchDB API, it's exactly the same for all three backends: LevelDB, IndexedDB, and WebSQL.

How did we pull off this trick? To be clear: it's not like we set out to make PouchDB some kind of map/reduce engine. In fact, as we were debating how to implement this feature, we originally ran on the assumption that we'd need to build additional functionality on top of PouchDB. It was only after a few months of discussions and experiments that we realized the whole thing could be done in PouchDB proper.

In the end, the only additional functionality we had to add to the main PouchDB code was a hook to tell PouchDB to destroy the map/reduce database when its parent database was destroyed. That's it.

This technique also has a nice parallel in CouchDB: it turns out that they, too, reused the core _view code in order to write _all_docs. In CouchDB, _all_Docs is just a special kind of _view, whereas in PouchDB, we built map/reduce views on top of allDocs(). We did it backwards, but the spirit of the idea was the same.

As for the index itself, what it basically boils down to is clever serialization of arbitrary JSON values into CouchDB collation ordering – i.e. nulls before booleans, booleans before numbers, numbers before strings, strings before arrays, arrays before objects, and so on recursively. Every JSON object maps to an indexable string that guarantees the same ordering in every database backend, and we only deviate from CouchDB by using ASCII string ordering instead of ICU ordering (since that's what our backends use).

Then, we take this value and concatenate it with whatever else needs to be indexed (usually just the doc _id), meaning the entire lookup key is a single string. We decided on this implementation because:

LevelDB does not support complex indexes.
IndexedDB does support complex indexes, but not in IE.
WebSQL supports multi-field indexes, but it's also a dead spec, so we're not going to prioritize it above the others.

The next part of the design was to divide the data we needed to store into two kinds of documents: 1) emitted key/values with doc IDs, which are used for querying and pagination, and 2) mappings from doc IDs to those key/values, so that we can delete or update them when their parent documents are deleted or updated. Since by design, PouchDB databases do not have multi-document transaction semantics, we implemented a task queue on top of the database to serialize writes.

Aside from that, the only other neat trick was a liberal use of _local documents, which are a special class of documents that aren't counted in the total_rows count, don't show up in allDocs, but can still be accessed using the normal put() and get() methods. This is all it takes to fully reimplement CouchDB map/reduce in a tiny JavaScript database!

Anyone interested in more implementation details can read the months-long discussions in these Github issues. The map/reduce code itself is also pretty succinct.