Wednesday, 17 July 2019

Cassandra Architecture sub-point

Calculating Tokens for a Multi-Data Center Cluster
In multi-data center deployments, replica placement is calculated per data center using the NetworkTopologyStrategy replica placement strategy. In each data center (or replication group), the first replica for a particular row is determined by the token value assigned to a node. Additional replicas in the same data center are placed by walking the ring clockwise until reaching the first node in another rack.
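The clockwise walk described above can be sketched in Python. This is a simplified illustration of the idea, not Cassandra's actual implementation; the tokens, node names, and rack assignments below are made up.

```python
# Sketch of per-data-center replica placement: walk the ring clockwise
# from the row's token, preferring nodes in racks that do not already
# hold a replica. Illustrative only -- not Cassandra's real code.
from bisect import bisect_right

def place_replicas(ring, racks, row_token, replica_count):
    """ring: sorted list of (token, node); racks: node -> rack name."""
    tokens = [t for t, _ in ring]
    # first node whose token follows the row's token (wrapping around)
    start = bisect_right(tokens, row_token) % len(ring)
    replicas, seen_racks, skipped = [], set(), []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if racks[node] not in seen_racks:
            replicas.append(node)
            seen_racks.add(racks[node])
        else:
            skipped.append(node)  # same rack as an earlier replica
        if len(replicas) == replica_count:
            return replicas
    # fewer distinct racks than replicas: fill from skipped nodes in ring order
    return replicas + skipped[:replica_count - len(replicas)]

ring = [(0, "a"), (100, "b"), (200, "c"), (300, "d")]
racks = {"a": "r1", "b": "r1", "c": "r2", "d": "r2"}
```

With this hypothetical four-node ring, a row with token 50 places its first replica on node b (the next token clockwise), and the second replica skips ahead to node c because it sits in a different rack.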
If you do not calculate partitioner tokens so that the data ranges are evenly distributed for each data center, you can end up with uneven data distribution within a data center. The goal is to ensure that the nodes in each data center are evenly dispersed around the ring, or to calculate tokens for each replication group individually (without conflicting token assignments).

One way to avoid uneven distribution is to calculate tokens for all nodes in the cluster, and then alternate the token assignments so that the nodes for each data center are evenly dispersed around the ring.
Another way to assign tokens in a multi-data center cluster is to generate tokens for the nodes in one data center, and then offset those token numbers by 1 for all nodes in the next data center, by 2 for the nodes in the data center after that, and so on. This approach works well if you are adding a data center to an established cluster, or if your data centers do not have the same number of nodes.
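The offset approach can be sketched in Python for the RandomPartitioner's token range (0 to 2**127). This is a hypothetical helper for illustration; the data center names and node counts are made up.

```python
# Token calculation for a multi-data center cluster: generate evenly
# spaced tokens for one data center, then shift each subsequent data
# center's tokens by 1, 2, ... so no two nodes share a token.
RING_SIZE = 2 ** 127  # RandomPartitioner token range

def tokens_for_datacenter(node_count, dc_offset):
    """Evenly spaced tokens for one data center, shifted by dc_offset."""
    return [(i * RING_SIZE // node_count + dc_offset) % RING_SIZE
            for i in range(node_count)]

def tokens_for_cluster(nodes_per_dc):
    """Map each data center name to its token list, offsetting DC n by n."""
    return {dc: tokens_for_datacenter(count, offset)
            for offset, (dc, count) in enumerate(nodes_per_dc.items())}

cluster = tokens_for_cluster({"DC1": 4, "DC2": 4})
```

Each data center gets its own evenly spaced ring, and the per-data-center offset guarantees no conflicting token assignments even when the data centers have different node counts.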
strategy_options
Specifies configuration options for the chosen replication strategy.
For SimpleStrategy, it specifies replication_factor in the format of replication_factor:number_of_replicas.
For NetworkTopologyStrategy, it specifies the number of replicas per data center in a comma-separated list of datacenter_name:number_of_replicas. Note that what you specify for datacenter_name depends on the snitch configured for the cluster: the data center names defined in the keyspace strategy_options must match the data center names as recognized by the snitch you are using. If you are not sure what they are, the nodetool ring command prints out the data center names and rack locations of your nodes.
See Choosing Keyspace Replication Options for guidance on how to best configure replication strategy and strategy options for your cluster.
Setting and updating strategy options with the Cassandra CLI requires a slightly different command syntax than other attributes; note the brackets and curly braces in this example:

[default@unknown] CREATE KEYSPACE test WITH placement_strategy = 'NetworkTopologyStrategy' AND strategy_options=[{us-east:6,us-west:3}];

Choosing Keyspace Replication Options
When you create a keyspace, you must define the replica placement strategy and the number of replicas you want.
DataStax recommends always choosing NetworkTopologyStrategy for both single and multi-data center clusters. It is as easy to use as SimpleStrategy and allows for expansion to multiple data centers in the future, should that become useful. It is much easier to configure the most flexible replication strategy up front than to reconfigure replication after you have already loaded data into your cluster.
NetworkTopologyStrategy takes as options the number of replicas you want per data center. Even for single data center (or single node) clusters, you can use this replica placement strategy and just define the number of replicas for one data center. For example (using cassandra-cli):
[default@unknown] CREATE KEYSPACE test WITH placement_strategy = 'NetworkTopologyStrategy' AND strategy_options=[{us-east:6}];
Or for a multi-data center cluster:
[default@unknown] CREATE KEYSPACE test WITH placement_strategy = 'NetworkTopologyStrategy' AND strategy_options=[{DC1:6,DC2:6,DC3:3}];
When declaring the keyspace strategy_options, what you name your data centers depends on the snitch you have chosen for your cluster. The data center names must correlate to the snitch you are using in order for replicas to be placed in the correct location.
As a general rule, the number of replicas should not exceed the number of nodes in a replication group. However, it is possible to increase the number of replicas, and then add the desired number of nodes afterwards. When the replication factor exceeds the number of nodes, writes will be rejected, but reads will still be served as long as the desired consistency level can be met.

listen_address

The IP address or hostname that other Cassandra nodes will use to connect to this node. If left blank, you must have hostname resolution correctly configured on all nodes in your cluster so that the hostname resolves to the correct IP address for this node (using /etc/hostname, /etc/hosts or DNS).
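In cassandra.yaml this is a single setting; the address below is only an example value.

```yaml
# cassandra.yaml -- address other Cassandra nodes use to connect to this node
listen_address: 192.168.1.101
# leave the value blank to fall back to the machine's configured hostname:
# listen_address:
```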

Configuring the PropertyFileSnitch
The PropertyFileSnitch requires you to define network details for each node in the cluster in a cassandra-topology.properties configuration file. A sample of this file is located in /etc/cassandra/conf/cassandra-topology.properties in packaged installations or $CASSANDRA_HOME/conf/cassandra-topology.properties in binary installations.
Every node in the cluster should be described in this file, and the file should be exactly the same on every node in the cluster when you are using the PropertyFileSnitch.
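A minimal cassandra-topology.properties fragment might look like the following; the IP addresses, data center names, and rack names are illustrative.

```properties
# node IP address = data center : rack
175.56.12.105=DC1:RAC1
175.50.13.200=DC1:RAC2
120.53.24.101=DC2:RAC1
120.55.16.200=DC2:RAC2

# default assignment for any node not listed above
default=DC1:RAC1
```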
