Article originally published on Planet Cassandra

Interview of Pierre-Yves Ritschard, CTO at Exoscale, by Planet Cassandra on January 16th, 2014:

"Cassandra definitely feels like one of the really production ready systems in the pool of new databases."

Pierre-Yves Ritschard

Exoscale

Exoscale is the leading Swiss cloud provider, and I am the CTO there. We are an IaaS (infrastructure as a service) provider offering two products: an EC2-like public cloud and a Virtual Private Cloud.

Activity streams, time-series data, and Cassandra-backed Cyanite

We are using Cassandra for several projects. We use it internally to hold activity streams of events happening on the IaaS platform, which helps us build a fast usage-metering log and compute resource consumption in real time. This gives our customers a clear idea of their resource usage and credit activity.

We also use it as a time-series database for key metrics inside the platform, and have recently created an open source project, building on our experience storing time-series data, which lets Cassandra serve as a backend for the Graphite metric storage and visualization tool. It is early work and available here: https://github.com/pyr/cyanite.
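To give an idea of how Graphite-style metrics can map onto Cassandra, here is a minimal CQL sketch; the table and column names are illustrative assumptions, not necessarily Cyanite’s actual schema:

    -- One partition per metric series and resolution; data points cluster by time.
    CREATE TABLE metric (
        path   text,     -- dotted Graphite path, e.g. 'servers.web1.cpu.user'
        rollup int,      -- resolution of the series, in seconds
        time   bigint,   -- data point timestamp
        value  double,   -- recorded value
        PRIMARY KEY ((path, rollup), time)
    );

    INSERT INTO metric (path, rollup, time, value)
    VALUES ('servers.web1.cpu.user', 60, 1389830400, 0.42);

Each series gets its own partition and data points are clustered by timestamp, so both the write path and time-range reads touch a single partition.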

Lastly, we are currently privately rolling out an S3-compatible object store built on top of Cassandra, the core of which will be open sourced in Q1 2014, once we deem the product ready for GA.

As far as versions are concerned, we have had the luxury of starting these projects when CQL was already available (unlike previous Cassandra-based projects I have led) and have thus been able to keep up with new versions. Cyanite and our object store rely on Cassandra 2.0.

Production-ready + low operational costs

The projects where we use Cassandra are similar in nature: they all have a write-heavy workload of commutative data, a pattern where Cassandra excels. We do not use Cassandra across the board, but rather reserve it for the use cases where it really shines. Our object store layer is a special case: it is a bit different, and designing its data model took more thought since it falls outside the typical use cases.
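To illustrate the pattern, here is a hypothetical CQL sketch (not our actual schema) of an append-only activity stream, where each event is written exactly once into a partition bounded by account and day:

    -- One partition per (account, day); events are only ever appended.
    CREATE TABLE account_activity (
        account_id uuid,
        day        text,       -- e.g. '2014-01-16', bounds the partition size
        event_time timeuuid,   -- unique, time-ordered event identifier
        event_type text,       -- e.g. 'instance.start', 'instance.stop'
        payload    text,       -- serialized event details
        PRIMARY KEY ((account_id, day), event_time)
    );

    -- Metering a day of usage is a single-partition scan; no updates, no deletes.
    SELECT event_time, event_type, payload
    FROM account_activity
    WHERE account_id = f47ac10b-58cc-4372-a567-0e02b2c3d479
      AND day = '2014-01-16';

Because rows are never updated or deleted, the order in which replicas receive writes does not matter, which is what makes the model commutative.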

One of the big wins we see when deploying Cassandra is the low operational overhead it incurs. Cassandra definitely feels like one of the really production-ready systems in the pool of new databases out there. Nodetool and the JMX instrumentation have been life savers on several occasions.

Exoscale’s Cassandra deployment

We operate in two data centers right now and use separate clusters with different profiles for each project. The activity stream relies on nodes with small amounts of fast storage to get optimal performance, whereas the time-series data can run on larger, slower disks since that workload is overwhelmingly write-dominated.

Words of wisdom

I think one must not underestimate the “cognitive” shift needed to design effective Cassandra data models. CQL is a great tool, but it should not trick users into thinking that SQL-like data models carry over. Wherever possible, build commutative data models, which alleviate many of Cassandra’s potential pitfalls. TTLs should also be leveraged as much as possible to avoid dealing with tombstones.
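As a generic example of the TTL advice (reusing the hypothetical metric table sketched earlier), an expiry can be attached to individual writes or set as a table-wide default, so old data ages out without explicit deletes:

    -- Expire this data point 30 days after it is written.
    INSERT INTO metric (path, rollup, time, value)
    VALUES ('servers.web1.cpu.user', 60, 1389830400, 0.42)
    USING TTL 2592000;

    -- Or give every row in the table a default expiry (Cassandra 2.0 and later).
    ALTER TABLE metric WITH default_time_to_live = 2592000;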

Help with Cyanite

We’d love input and help on Cyanite; please check it out.

I’ll also mention that we take great care in delivering cloud instances with good IOPS capacity, and we think this makes us a pretty good match for running Cassandra. If you want to verify that yourself, use the “WELOVECASSANDRA” promo code here to get started with an initial 20 CHF credit.