Introduction to NoSQL query

If NoSQL is all about leaving SQL behind, then what takes its place?

NoSQL query

First we need to clear something up: NoSQL is a bad name. Really these are non-relational databases. There’s a good reason for that confusion, though.

Up until recently, data model and query method were closely tied to each other. That’s understandable: getting the most out of limited hardware has been a leading driver of database development. So query method has been, for much of the history of database development, a product of the data model.

Today we have the luxury of more compute power. Thanks to cloud computing and Moore’s Law, we can create database systems that prioritise developer productivity and begin to abstract the query method from the data model. Think of it as similar to what has happend in application development: we have taken a path from machine code, through assembly, to low level and then high level languages.

The early databases

The world’s first commercial database was SABRE, a collaboration between IBM and American Airlines for improving the efficiency of airline ticketing. Before computerisation, ticket booking was a lengthy manual process taking up to 90 minutes. SABRE launched in 1960 and reduced that time to seconds.

SABRE was a hierarchical database. Hierarchical databases have an entry point at the top with links that descend through the data, much like a family tree or the Windows Registry. To query such a database the developer works down the paths of the hierarchy until they find the element they want.

More recent hierarchical databases — such as IMS, which grew out of SABRE — have since introduced declarative query but in those early days query was very much procedural:

Is this the thing? 
No? 
Okay, go to the next one down. 
Is this the thing?

A few years after SABRE, the same group who created COBOL produced a specification for the network database. This built on the hierarchical database by adding peer-to-peer links resulting in something that worked much like a linked list. Still, query was quite manual.

It wasn’t until 1970 when IBM researcher EF Codd published his paper A Relational Model of Data for Large Shared Data Banks, in which he described the relational databases we know today. However, it would be another four years before two of Codd’s IBM colleagues would propose the language that became SQL and supplanted Codd’s own query proposals.

The hierarchical, network and relational models each dictated the nature of the query method. While the implementations might vary, the scope of the query language was bound by the data model.

Query in NoSQL today

When the current crop of NoSQL databases came on the scene from 2005 onwards, query was not the priority. With the exception of graph databases, non-relational databases have optimised for scalability, up-time, redundancy, flexibility and usually at the expense of queryability.

So, where does that leave NoSQL query? Broadly, there have been five approaches:

Not my problem
Map-reduce
DBMS-specific approaches.
Attempts at standardisation.
SQL-derivatives.

That last one is perhaps the strongest reason against using the term NoSQL: those same databases are increasingly implementing some form of SQL.

Not my problem/map-reduce

If you’re building a database system that’s designed to scale effortlessly and handle infinitely varied data, then perhaps query isn’t your priority.

When talking about those databases, their proponents often talk about moving query to the application layer. In other words, you as the application developer get to do more work. The database vendor or open source project provides a REST API and some libraries that let you put data in and pull that same data out.

Actually, that’s fine if you have a genuinely key-value shaped use case. If all you need is something that behaves like a file system, then the lack of query frees the DMBS creators to fix other hard problems.

Map-reduce

For some use cases, though, that’s not enough.

CouchDB, arguably the first of the current NoSQL databases, drew from the big data world to offer map-reduce query. That suited its document data model and its architecture where data and compute-power could be distributed across a cluster of nodes. Riak and MongoDB also added map-reduce.

However, for all its usefulness in dealing with distributed data sets, map-reduce isn’t the friendly ad-hoc declarative query that SQL led us to expect.

DBMS-specific query

MongoDB is the one non-relational database that has prioritised developer friendliness above all else.

While MongoDB offers map-reduce as an option, most MongoDB query takes the form of method chaining:

db.collection.command(data)

If we’re working in the MongoDB shell and we want to find all the fish and chip restaurants in our database, we might write something like this:

db.restaurants.find( { "cuisine": "Fish and chips" } )

Each language specific driver then implements the same query in an idiomatically correct way.

This is easy to learn but it’s still something new; and it doesn’t integrate easily with SQL tooling, for example.

Other databases, such as ArrangoDB, also have their own query languages. While that might allow them to innovate quickly, learning something entirely new is a barrier to adoption.

Attempts at standardisation

Standards are enablers of innovation: if all document databases, for example, shared a common query language then developers could move easily from one DBMS to another and everyone could focus on innovating in areas other than the query language.

XQuery is an attempt at creating such a standard for working with XML and Jsoniq is a superset tuned for JSON. The vision for both languages is to create a standard method of querying the type of hierarchical documents we see represented in XML and JSON.

MarkLogic, a document database that works with XML, implements a form of XQuery. Here’s how we’d insert a new document:

xquery version "1.0-ml";
xdmp:document-insert("restaurants.xml", 
    <restaurants xmlns="http://example.com/namespace/restaurants">
        <restaurant restaurantid="1">
            <name>Philipp's Fish and Chips</name>
            <cuisine>Fish and chips</cuisine>
            <owner>
                <first>Philipp</first>
                <last>Strube</last>
            </owner>
        </restaurant>
    </restaurants>
)

To return all the fish and chip restaurants in our restaurants.xml file, we’d do something like:

xquery version "1.0-ml";
declare namespace rest = "http://example.com/namespace/restaurants";
{
    for $restaurant in doc("restaurants.xml")/rest:restaurants/rest:restaurant
    return
    {   
        $restaurant/rest:name/text() 
    }
}

Jsoniq is somewhat different to allow for the difference in the data model and that goes to show that both XQuery and Jsoniq are deeply tied to the format of the data stored.

XQuery has seen commercial implementations but none of the major document databases have implemented Jsoniq.

SQL for NoSQL

The name NoSQL shows just how closely tied SQL and the relational model are. However, two things are making that less true: increasing compute power that makes even more complex indexing and query parsers possible and SQL’s place as the one query language every developer knows.

Even if you don’t know SQL, it’s easy to pick up. Then there’s all that tooling out there: including ODBC, JDBC, query analysers and more.

Perhaps that’s why more and more non-relational databases are adopting or adapting SQL. The first to do so was Cassandra, with CQL.

Cassandra’s columnar data model shares some superficial similarities with the relational model but it’s sufficiently far removed that Cassandra provides an intermediary representation of the data model that differs somewhat from what is actually stored on disk.

Thanks to this, CQL looks remarkably familiar. Translating our XQuery query above into CQL gives us:

SELECT name FROM restaurants WHERE cuisine='Fish and chips';

It’s readable and, importantly, it’s declarative: we’re not telling the query processor what to do but instead we’re telling it what we want to get back.

For document databases, there are two related query languages. SQL++ is a research project from the University of California San Diego that modifies SQL to work better with denormalised data. In particular, it adds keywords to handle the missing and nested data that we expect to find in a denormalised document database.

Similar to SQL++ is N1QL, Couchbase’s implementation of a SQL-like language for JSON documents. N1QL (or non-first form query language) is supremely SQL-like. If we took our restaurant example, we might have one document per restaurant. To find our fish and chip restaurants we’d write:

SELECT name FROM restaurants WHERE cuisine='Fish and chips';

For this query, it’s impossible to tell SQL, CQL and N1QL apart.

Both Couchbase and Cassandra have put the development effort into their indexing and query parsing to enable SQL, which was designed for relational data, to fit entirely different data models.

Despite the nesting and variable schema, document databases are pretty well suited to a SQL-like query language. Even though relations are not enforced, we create and store documents that rely on each other.

In summary

Data model has long influenced how we query the data we’re storing. That influence remains but it’s apparent that the ties are looser than before.

Non-relational databases offer us all kinds of query methods and now we no longer have to consider queryability a trade-off we must make in favour of the traditional NoSQL promises.

In particular, we can now continue to use SQL-like languages even when querying data that isn’t strictly relational. NoSQL was never about getting rid of SQL; it was about using non-relational data models where they were better suited. Query might not have been the priority for all of them but now we can expect rich query from non-relational databases as well as everything else they have to offer.