How to setup cassandra on Ubuntu 14.04

What is Cassandra?

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Cassandra cluster data flow :

request_response

Cassandra architecture:

cassandra_arch

More information can be found here. Apache Cassandra, Datastax Cassandra Documantation and Users.

Cassandra cluster read/updates:

read_write

Install dependencies:-

Java8 :

#sudo apt-get install software-properties-common python-software-properties
#sudo apt-add-repository ppa:webupd8team/java
#sudo apt-get update
#sudo apt-get install oracle-java8-installer

#java -version
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)

#python3 --version
Python 3.4.3

Install Cassandra:-

Following steps will install Cassandra and other tools like cqlsh client, noodtool administrative tool.


#echo "deb http://www.apache.org/dist/cassandra/debian 36x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
#curl https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -
#sudo apt-get update
#sudo apt-key adv --keyserver pool.sks-keyservers.net --recv-key A278B781FE4B2BDA
#sudo apt-get install cassandra

#nodetool status
#cqlsh
cqlsh> show VERSION
[cqlsh 5.0.1 | Cassandra 3.6 | CQL spec 3.4.2 | Native protocol v4]
cqlsh> SHOW HOST
Connected to Test_Cluster at 127.0.0.1:9042.
cqlsh> SELECT cluster_name, listen_address FROM system.local;
cqlsh> tracing off
cqlsh> paging off
cqlsh> expand off

Setup Cassandra Multi-node Cluster:

Cassandra uses,
storage_port: 7000 for cluster communication (7001 if SSL is enabled),
native_transport_port: 9042 for native protocol clients,
JMX : 7199 for Java tools
rpc_port: 9160 for remote client communication.

Now let’s setup environment for cluster for 3 servers (192.168.0.1,192.168.0.2,192.168.0.3),
Do following steps for all cassandra nodes.

#sudo service cassandra stop

If you nhave SSD, point data dirs for SSDs.

#sudo mkdir -m=777 /mnt/ssd/cassandra
#sudo vim /etc/cassandra/cassandra.yaml

Setup clustor,

cluster_name: 'Stockopedia_Production_Cluster'

Point data dir to SSD,

data_file_directories:
- /mnt/ssd/cassandra/data
commitlog_directory: /mnt/ssd/cassandra/commitlog
saved_caches_directory: /mnt/ssd/cassandra/saved_caches
hints_directory: /mnt/ssd/cassandra/hints

Add node seeds,

seed_provider:
- seeds: "192.168.0.1,192.168.0.2,192.168.0.3"

Update Ip/ports, eth0 : 192.168.0.1

listen_interface: eth0
start_rpc: true
rpc_interface: eth0
broadcast_rpc_address: 1.2.3.4

Update snitch for grouping machines into “datacenters” and “racks.”:

endpoint_snitch: GossipingPropertyFileSnitch

#vim /etc/cassandra/cassandra-rackdc.properties
dc=dc2
rack=rack2

#sudo service cassandra start

Test cassandra environment:


# nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.0.1 7.84 GiB 256 100.0% bde625de-8af6-477f-9a69-asdedec0fc62 rack1
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.0.2 7.84 GiB 256 100.0% 162ae9dc-e572-4927-98a4-qwe371c9f071 rack2
Datacenter: dc3
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.0.3 7.84 GiB 256 100.0% 162ae9dc-e572-4927-98a4-qwe371sdf071 rack3

Monitoring cassandra servers using VisualVm and Jconsole:

Remote process connection string : service:jmx:rmi://server-ip:7299/jndi/rmi://server-ip:7199/jmxrmi

vim /etc/cassandra/cassandra-env.sh

LOCAL_JMX=no
JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=server-ip"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=false"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.ssl=false"

Cassandra Model:

cassandra_model

Import CSV using cqlsh client and upload it to cassandra table:

First export data from mysql database,

mysql>SELECT instrument,IFNULL(open,''),IFNULL(high,''),IFNULL(low,''),IFNULL(close,''),IFNULL(volume,''),IFNULL(date,''),created_at INTO OUTFILE '/tmp/price.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY ''
LINES TERMINATED BY '\n'
FROM price;
Query OK, 129865916 rows affected (7 min 5.59 sec)


#cqlsh
cqlsh> CREATE KEYSPACE price_db WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 1, 'dc2' : 1, 'dc3' : 1 };
cqlsh> USE price;
cqlsh> COPY price_db.price_table (instrument,open,high,low,close,volume,date,created_at) FROM '/tmp/price.csv';

cqlsh> CREATE TABLE price_db.price_table (
instrument varchar,
open decimal,
high decimal,
low decimal,
close decimal,
volume decimal,
date date,
created_at timestamp,
PRIMARY KEY (instrument, date)
) WITH CLUSTERING ORDER BY (date DESC) AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'} ;

cqlsh> CREATE INDEX close ON price_db.price_table (close);
cqlsh> DESCRIBE price_db.price_tabl;
cqlsh> SELECT * FROM price_db.price_tabl;
cqlsh> SELECT close FROM price_db.price_tabl where instrument = 'XYZ';
cqlsh> DROP TABLE price_db.price_tabl;
cqlsh> TRUNCATE TABLE price_db.price_tabl;

Install Cassandra php-driver :

# php --version
PHP 5.5.9-1ubuntu4.20 (cli) (built: Oct 3 2016 13:00:37)

PHP data driver can be found here, http://datastax.github.io/php-driver/

sudo apt-get install g++ clang make cmake libssl-dev libgmp-dev libpcre3-dev git

#sudo wget http://downloads.datastax.com/cpp-driver/ubuntu/14.04/dependencies/libuv/v1.8.0/libuv_1.8.0-1_amd64.deb
#sudo wget http://downloads.datastax.com/cpp-driver/ubuntu/14.04/dependencies/libuv/v1.8.0/libuv-dev_1.8.0-1_amd64.deb
#sudo wget http://downloads.datastax.com/cpp-driver/ubuntu/14.04/cassandra/v2.5.0/cassandra-cpp-driver-dev_2.5.0-1_amd64.deb
#sudo wget http://downloads.datastax.com/cpp-driver/ubuntu/14.04/cassandra/v2.5.0/cassandra-cpp-driver_2.5.0-1_amd64.deb

#sudo dpkg -i libuv_1.8.0-1_amd64.deb
#sudo dpkg -i libuv-dev_1.8.0-1_amd64.deb
#sudo dpkg -i cassandra-cpp-driver_2.5.0-1_amd64.deb
#sudo dpkg -i cassandra-cpp-driver-dev_2.5.0-1_amd64.deb

#sudo pecl install cassandra

#sudo vim /etc/php5/mods-available/cassandra.ini
extension=cassandra.so;
#php5enmod cassandra

Conclusion:
We found cassandra is 3x faster that relational mysql for timeseries data. You will need to design data model according to query patterns using de-normalization so that you can pull all needed data using just one query.

Thank you.

This entry was posted in Cassandra, Php, Uncategorized and tagged . Bookmark the permalink.

Leave a Reply