Monday, October 20, 2014

Introduction to Cassandra

Apache Cassandra is a NoSQL persistence solution that offers distributed data, linear scalability, and a tunable consistency level, making it well suited for highly available and high-volume applications. Developed by Facebook and open sourced in 2008, Cassandra was influenced by the Bigtable and Dynamo white papers. Dynamo inspired the partitioning and replication in Cassandra, while Cassandra's log-structured Column Family data model, which stores data associated with a key in a schemaless way, is similar to Bigtable. Cassandra is used by many big names in the online industry, including Netflix, eBay, Twitter, and reddit.


There are many benefits to choosing Cassandra as a storage solution; one of the most compelling is its speed. Cassandra is known for its exceptionally fast writes but is also extremely competitive in its read speed. It is also highly available. Its decentralized peer-to-peer architecture means there are no master components that could become single points of failure, so it is extremely fault-tolerant, especially if you spread your nodes over multiple data centers. Cassandra also offers linear scaling with no downtime: if you need x nodes for y requests, 2x nodes will cater to 2y requests. With the introduction of virtual nodes in version 1.2, the load increase normally experienced while increasing capacity is spread across all the nodes, making scaling while under load a non-issue. It is very simple to scale both up and down. Another benefit is the flexibility achieved by the ability to tune the consistency level for each request according to the needs of your application. Cassandra has an active community, with big players such as Netflix and Twitter contributing open source libraries, and it is actively under development, which means frequent updates with many improvements.


Although Cassandra is highly effective for the right use cases and has many advantages, it is not a panacea. One of the biggest challenges facing developers who come from a relational database background is the lack of ACID (atomicity, consistency, isolation, and durability) properties, especially transactions, which many developers have come to rely on. This issue is not unique to Cassandra; it is a common side effect of distributed data stores. Developers need to carefully consider not only the data that is to be stored, but also the frequency and access patterns of that data, in order to design an effective data model. To ensure that data access is efficient, data is normally written to multiple column families according to how it is read. This denormalization, which is frequently used to increase performance, puts extra responsibility on the developer to ensure that the data is updated in all relevant places. A poorly thought-out and poorly accessed data model will be extremely slow. If you are unsure how you will need to access the data (for example, in an analytics system where you might need a new view on old data), you typically will need to combine Cassandra with a MapReduce technology such as Hadoop; Cassandra itself is best suited to accessing specific data against a known key. Cassandra is a relatively young technology, and although it is very stable, this lack of maturity means that it can be difficult to find staff who have expertise. However, there is a wealth of information, and conference videos are available online, which makes it easy to develop Cassandra skills quickly, so you should not let this put you off. Also, there is a small amount of administration involved in running Cassandra, because you need to ensure that repairs are run approximately weekly. Repairs ensure that all the data is synchronized across all the appropriate nodes.
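As a sketch of this query-driven denormalization, the same data might be written to two column families, one per read pattern. All keyspace, table, and column names here are illustrative assumptions, not from the article:

```sql
-- Hypothetical sketch: the same event data denormalized into two
-- column families, one per read pattern.

-- Read pattern 1: events for a given user, newest first.
CREATE TABLE events_by_user (
    user_id    text,
    event_time timestamp,
    payload    text,
    PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Read pattern 2: events for a given day.
CREATE TABLE events_by_day (
    day        text,
    event_time timestamp,
    user_id    text,
    payload    text,
    PRIMARY KEY (day, event_time, user_id)
);

-- Every write goes to both tables; keeping them in step is the
-- application's responsibility.
INSERT INTO events_by_user (user_id, event_time, payload)
VALUES ('alice', '2014-10-20 10:00:00', '...');

INSERT INTO events_by_day (day, event_time, user_id, payload)
VALUES ('2014-10-20', '2014-10-20 10:00:00', 'alice', '...');
```

The duplication buys fast reads at the cost of write amplification and application-side bookkeeping, which is exactly the trade-off described above.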


Let's take a look at the structure of Cassandra. The data is stored in and replicated across nodes. A node used to refer to a server hosting a Cassandra instance, but with the introduction of virtual nodes, one server will now contain many (virtual) nodes. The number of nodes the data is duplicated to is referred to as the replication factor. The nodes are distributed around a token ring, and the hash of the key determines where on the token ring your data will be stored. Cassandra provides a number of replication strategies, which, combined with the snitch, determine where the data is replicated. The snitch finds out information about the nodes, which the strategy then uses to replicate the data to the correct nodes. For instance, the NetworkTopologyStrategy lets you specify a replication factor per data center, thereby ensuring that your data is replicated across data centers and providing availability even if an entire data center goes down.
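As a sketch, a keyspace using NetworkTopologyStrategy might be declared as follows in CQL. The keyspace name and the data-center names ("dc1", "dc2") are assumptions for illustration; the data-center names must match what your snitch reports:

```sql
-- Hypothetical keyspace replicated across two data centers.
CREATE KEYSPACE my_app
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,   -- replication factor within data center dc1
    'dc2': 2    -- replication factor within data center dc2
  };
```

With this definition, every row is stored on three nodes in dc1 and two nodes in dc2, so losing an entire data center still leaves a full copy of the data.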

Data Structure 

A keyspace is the namespace under which Column Families are defined, and it is the conceptual equivalent of a database (see Figure 1). The replication factor and strategy are defined at the keyspace level. A column family contains a number of rows, indexed by a row key, with each row containing a collection of columns. Each column contains a column name, value, and time stamp. One important thing to note is that each row can have a completely unique collection of columns. The ordering of the columns in a row is important; to achieve optimal performance you should read columns that are close to each other to minimize disk seeks. The ordering of columns and the datatype of each column name are specified by the comparator. The datatype of each column value is specified by the validator. It is possible to make columns automatically expire by setting the time to live (TTL) on the column. Once the TTL period has passed, the column will automatically be deleted.

[Figure 1: A keyspace contains a Column Family; each row (row key 1, row key 2) holds a collection of columns, each with a name, value, and time stamp.]

There are also a couple of special types of columns and column families. The Counter column family can contain only counter columns. A counter is a 64-bit signed integer and is valuable because you can perform an atomic increment or decrement on it. The only other operations you can perform on a counter column are read and delete. It is the only column type that has the ability to perform an atomic operation on the column data and, as such, incrementing a counter column is typically slower than updating a regular column unless data is written at consistency level one. Supercolumn families can contain only supercolumns.
A supercolumn is a column of columns and allows you to store nested data, as shown in Figure 2. For example, a "home" supercolumn might contain the columns "street," "city," and "postcode." Supercolumns are limited in that they can be nested only to one level. If you require more-dynamic nesting, you should use composite columns. Composite columns are normal columns in which the column name consists of multiple, distinct components (see Figure 3), allowing for queries over partial matches on these names. A comparator of comparators would then be used to ensure the ordering. Composite columns allow you to nest columns as deeply as you want. Due to this ability and the performance gains composite columns had over supercolumns in earlier versions of Cassandra, they, rather than supercolumns, have become the standard for nested data.
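The column features above (TTLs, counters, and composite columns) can be sketched in CQL 3, where clustering columns play the role of composite column components. All table and column names are illustrative assumptions, not from the article:

```sql
-- Illustrative sketch only.

-- An expiring column: the TTL (in seconds) is set per write, and the
-- column is deleted automatically once the period has passed.
CREATE TABLE sessions (
    session_id text PRIMARY KEY,
    user_id    text
);
INSERT INTO sessions (session_id, user_id)
VALUES ('abc123', 'alice')
USING TTL 3600;

-- A counter column family: counters support atomic increment and
-- decrement; the only other operations are read and delete.
CREATE TABLE page_views (
    page  text PRIMARY KEY,
    views counter
);
UPDATE page_views SET views = views + 1 WHERE page = '/home';

-- Composite columns: the clustering columns (address_type, field)
-- form the composite column name, mirroring the "home" supercolumn
-- example, and allow partial matches on the leading components.
CREATE TABLE user_addresses (
    user_id      text,
    address_type text,   -- e.g. 'home', 'work'
    field        text,   -- e.g. 'street', 'city', 'postcode'
    value        text,
    PRIMARY KEY (user_id, address_type, field)
);

-- Partial match on the leading component: all "home" columns.
SELECT * FROM user_addresses
WHERE user_id = 'alice' AND address_type = 'home';
```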

Distributed Deletes

Distributed deletes pose a tricky problem. If Cassandra were to simply delete the data, it is possible that the delete operation would not be successfully propagated to all nodes (for example, if one node is down). Then, when a repair was performed, the data that wasn't removed would be seen as the newest data and would be replicated back across the appropriate nodes, thereby reviving the deleted data. So instead of deleting the data, Cassandra marks it with a tombstone. When a repair is run, Cassandra can then see that the tombstoned data is the latest and, therefore, replicate the tombstone to the node that was unavailable at the time of the delete operation. Tombstones and associated data are automatically deleted after 10 days by default as part of the compaction process. Therefore, it is important that a repair (which is a manual process) be run more frequently than the compaction process (which is normally done every 7 days) to ensure that the tombstones are replicated to all appropriate nodes before they are cleaned up.
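The 10-day window before tombstones are purged corresponds to the per-column-family setting gc_grace_seconds (864,000 seconds by default), which can be tuned in CQL; repairs themselves are typically triggered with the nodetool repair command. The table name below is a placeholder:

```sql
-- Widen or narrow the tombstone grace period for one column family.
-- Repairs must complete within this window; the default of
-- 864000 seconds is the 10 days mentioned above.
ALTER TABLE events WITH gc_grace_seconds = 864000;
```

Lowering gc_grace_seconds reclaims disk space sooner but shortens the window in which repairs must run, so the two schedules need to be tuned together.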


Cassandra is frequently labeled as an "eventually consistent data store," although the consistency level is tunable. A database can be considered consistent when a read operation is guaranteed to return the most-recent data. Note: When working with Cassandra, it is important to understand the CAP theorem. Out of the three elements (consistency, availability, and partition tolerance), you can achieve a maximum of two. When writing or reading data, you can set your consistency level with each request. Consistency levels include one, two, all, and quorum, among others. Let's assume your replication factor is 3. If you write and read at consistency level one, it is possible that the node you read from has not yet been updated for the write operation that occurred, and you could read old data. It is important to note that writing at level one doesn't mean that the data will be written to only one node; Cassandra will attempt to write to all the applicable nodes. It means only that the operation will return as soon as Cassandra has successfully written to one node. The quorum consistency level is one more than half the replication factor, rounded down. For example, if the replication factor is 3, quorum would be 2. If you read and write at the quorum consistency level, data consistency is guaranteed. If data is written to nodes A and B but could not be written to node C, and data is then read from B and C, Cassandra will use the time stamps associated with the data to determine that B has the latest data, and that is what will be returned. While giving you consistency, the quorum consistency level also allows one of the nodes to go down without affecting your requests.
If your consistency level is all, your reads and writes will be consistent but will not be able to handle one node being unavailable. It is also worth remembering that the more nodes you need to hear from to achieve the specified consistency level, the slower Cassandra will be. Reading at the quorum consistency level (with a replication factor of 3), your speed will be that of the second-fastest node, whereas with a consistency level of all, Cassandra will need to wait for the slowest node to respond. It is worth considering whether stale data is acceptable for some of your use cases.
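As a sketch of how this tuning looks in practice: in cqlsh, Cassandra's command-line shell, the consistency level for subsequent requests can be set with the CONSISTENCY command. The table queried here is a hypothetical example, not from the article:

```sql
-- Read at quorum: with a replication factor of 3, two replicas must
-- respond before the read returns, guaranteeing fresh data when
-- writes also use quorum.
CONSISTENCY QUORUM;
SELECT * FROM events_by_user WHERE user_id = 'alice';

-- Read at one: returns as soon as the fastest replica answers,
-- at the risk of stale data.
CONSISTENCY ONE;
SELECT * FROM events_by_user WHERE user_id = 'alice';
```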

Accessing Cassandra 

Cassandra has a command-line interface, cqlsh, which you can use to access the data store directly. cqlsh takes commands in CQL, which is very similar to SQL. Don't let the syntactic similarities fool you into thinking that the background processing is the same. There are a number of different libraries for accessing Cassandra through Java (as well as through other languages). There is a Java driver developed by DataStax that accepts CQL. Prior to the development of this driver, Java spoke to Cassandra only through the Thrift API. There are a number of libraries that support the Thrift API, such as Hector and Astyanax, among others. Both the Thrift API and the Java driver are now supported.
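As a minimal illustration of the cqlsh workflow (the table and data are assumptions for illustration), statements entered at the prompt are plain CQL:

```sql
-- Create, populate, and query a simple column family from cqlsh.
CREATE TABLE users (
    user_id text PRIMARY KEY,
    name    text
);

INSERT INTO users (user_id, name) VALUES ('alice', 'Alice');

-- Reads in Cassandra should address a known key, as noted earlier.
SELECT name FROM users WHERE user_id = 'alice';
```

The same statements can be issued programmatically through the DataStax driver or, via the older Thrift API, through libraries such as Hector and Astyanax.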


Cassandra is a brilliant tool if you require a scalable, high-volume data store. Its linear scalability at virtually no load cost is hard to beat when you have a sudden surge in traffic. It is relatively easy to get up and running quickly, but you must pay your dues by spending time getting to know your access patterns, so you can model your data appropriately. It is worth looking under the hood of Cassandra, because doing this can provide great insights into the optimal usage patterns. Given today's growing data requirements, Cassandra is a valuable tool.

• Cassandra Data Modeling Best Practices, Part 1
• Cassandra Data Modeling Best Practices, Part 2
• Virtual Nodes in Cassandra 1.2