Big Data/Cassandra

Apache Cassandra is a distributed, scalable NoSQL database management system of the wide-column type. By 2015, it had become one of the world's most popular DBMSs.

Installation
The Java sources are available at https://github.com/apache/cassandra, and a release tarball can be downloaded from http://cassandra.apache.org/download/. See http://cassandra.apache.org/doc/latest/getting_started/installing.html for more information.
 * MacOS:

To launch the server:
 * On Linux:
 * On Windows: \cassandra\bin\cassandra.bat

Graphical user interface
Several GUIs are available for managing Cassandra. One example is Helenos: its Java sources are available at https://github.com/tomekkup/helenos, and a compiled version at http://sourceforge.net/projects/helenos-gui/.

It bundles an Apache Tomcat server, which is started with \helenos\bin\startup.bat. The web interface should then be reachable at http://localhost:8080 (login: admin / password: admin).



NB: Helenos can create column families, but it cannot display those that were created through CQL.

Data manipulation
Cassandra introduced the Cassandra Query Language (CQL) in 2011. You can interact with it through the cqlsh client: with cqlsh you can create keyspaces and tables, insert rows and query tables, among other operations. The CQL 3.0 syntax closely resembles SQL.
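
As a minimal sketch of what such statements look like, the Python snippet below collects a few CQL 3 statements and sends them through any object that exposes an `execute` method, such as a session from the DataStax Python driver (`cassandra-driver`). Using the driver is an assumption made for illustration; the same statements can be typed into the command-line client. The keyspace `demo`, the table `users`, its columns and both helper functions are hypothetical names.

```python
# CQL 3 statements; the keyspace, table and column names are hypothetical.
CQL_STATEMENTS = [
    # A keyspace groups tables and defines how their data is replicated.
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}",
    "CREATE TABLE IF NOT EXISTS demo.users ("
    "user_id uuid PRIMARY KEY, name text, city text)",
    # uuid() is a built-in CQL function that generates a random UUID.
    "INSERT INTO demo.users (user_id, name, city) "
    "VALUES (uuid(), 'Alice', 'Paris')",
    "SELECT user_id, name, city FROM demo.users",
]


def run_statements(session, statements=CQL_STATEMENTS):
    """Execute each statement in order and return the list of results."""
    return [session.execute(statement) for statement in statements]


def connect_and_run(host="127.0.0.1", port=9042):
    """Run the statements on a live node (assumes cassandra-driver is installed)."""
    from cassandra.cluster import Cluster  # imported lazily: optional dependency

    cluster = Cluster([host], port=port)
    try:
        results = run_statements(cluster.connect())
        # The last statement is the SELECT; materialise its rows.
        return list(results[-1])
    finally:
        cluster.shutdown()
```

Calling `connect_and_run()` against a local node should return the rows of `demo.users`; `run_statements` can also be exercised without a server by passing any stand-in object with an `execute` method.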

Additional Notes:
 * There is no auto-increment option.
 * Field names are not case-sensitive.
 * Inserting a record whose primary key already exists replaces the old record without any warning.
 * When inserting more than 1,000 records, cqlsh may ignore the rest; the sstableloader bulk-loading tool is recommended instead.
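
The lack of auto-increment and the silent overwrite can both be handled from client code. The sketch below is only an illustration under assumptions: it generates keys with Python's `uuid` module in place of an auto-increment counter, and it appends the CQL clause `IF NOT EXISTS` so that an insert is not applied when the primary key already exists. The table `demo.users` and the helper `build_insert` are hypothetical; the `%s` placeholders follow the convention of the DataStax Python driver.

```python
import uuid


def build_insert(table, columns, if_not_exists=False):
    """Return a parameterised CQL INSERT for the given table and columns."""
    placeholders = ", ".join(["%s"] * len(columns))
    query = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    if if_not_exists:
        # Without this clause, an INSERT on an existing key silently overwrites.
        query += " IF NOT EXISTS"
    return query


# A random UUID stands in for the missing auto-increment counter.
new_id = uuid.uuid4()
query = build_insert("demo.users", ["user_id", "name"], if_not_exists=True)
params = (new_id, "Alice")
print(query)
```

With the driver, the pair would be passed as `session.execute(query, params)`. Note that `IF NOT EXISTS` turns the insert into a lightweight transaction, which costs more than a plain insert.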

Cassandra port usage

 * 7000, cluster communication
 * 7001, cluster communication if SSL enabled
 * 7199, JMX (8080 before Cassandra 0.8.xx)
 * 9042, CQL native clients
 * 9160, Thrift client API
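
To see which of these ports a node actually accepts connections on, a small script can try to open a TCP connection to each one. This is a minimal sketch using only the standard library; the function names are hypothetical, and the port numbers are the defaults listed above, which a given installation may have changed.

```python
import socket

# Default ports from the list above.
CASSANDRA_PORTS = {
    7000: "cluster communication",
    7001: "cluster communication (SSL)",
    7199: "JMX",
    9042: "CQL native clients",
    9160: "Thrift client API",
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_node(host, ports=CASSANDRA_PORTS):
    """Map each port to True or False depending on whether it accepts connections."""
    return {port: is_port_open(host, port) for port in ports}
```

For example, `check_node("127.0.0.1")` returns a dictionary keyed by port number. A closed port does not necessarily indicate a fault: the SSL and Thrift ports are only open when those features are enabled.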

How to use several nodes
For one server to communicate with another, the following ports must be open: 7000, 7001 (SSL), 7199, 9042 and 9160.

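There is no master node, so failover is automatic. Each node must list at least one "seed node" in its configuration in order to join the distributed architecture. Seed nodes are declared in the seed_provider section of \cassandra\conf\cassandra.yaml, while \cassandra\conf\cassandra-rackdc.properties describes the data center and rack of each node.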

To let the nodes locate each other across racks and data centers, set the endpoint_snitch parameter in cassandra.yaml to RackInferringSnitch (the default is SimpleSnitch).
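
Before starting a node, the value can be checked with a few lines of script. This is a minimal sketch, not a full YAML parser: the helper `read_top_level_setting` is hypothetical, it only handles single-token values on top-level `key: value` lines, and the sample text is an invented excerpt rather than a real configuration file.

```python
import re


def read_top_level_setting(yaml_text, key):
    """Return the single-token value of a top-level `key: value` line, or None."""
    pattern = rf"^{re.escape(key)}:[ \t]*(\S+)"
    match = re.search(pattern, yaml_text, flags=re.MULTILINE)
    return match.group(1) if match else None


# Hypothetical excerpt of conf/cassandra.yaml; the commented line is ignored
# because it does not start with the key itself.
sample = """\
# endpoint_snitch: SimpleSnitch
endpoint_snitch: RackInferringSnitch
"""

print(read_top_level_setting(sample, "endpoint_snitch"))  # prints RackInferringSnitch
```

In practice the text would come from the node's own file, for instance `open("conf/cassandra.yaml").read()`, with the path adjusted to the installation.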

The list of nodes can then be displayed with:
 * On Linux: cassandra/bin/nodetool status
 * On Windows: \cassandra\bin\nodetool.bat status

NB: when a keyspace is created with a replication_factor greater than one, each row is stored on several nodes, which makes the data redundant (mirroring).
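
As an illustration of where the replication factor is set, the sketch below builds the corresponding CQL statement. The helper `create_keyspace_cql` and the keyspace name `demo` are hypothetical, and `SimpleStrategy` is assumed here because it is the simplest replication class; it is intended for a single data center.

```python
def create_keyspace_cql(name, replication_factor):
    """Build a CREATE KEYSPACE statement using SimpleStrategy."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {replication_factor}}}"
    )


# Hypothetical keyspace in which every row is stored on 3 nodes.
print(create_keyspace_cql("demo", 3))
```

With a factor of 3 on a cluster of at least three nodes, the loss of a single node leaves two copies of every row available.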

Related Technologies

 * Amazon Dynamo - uses similar concepts, such as data distribution and fault tolerance
 * BigTable - uses a similar data model (column families)
 * Redis - in-memory key-value database
 * MongoDB