For a long time, we ran a fairly old version of Aerospike (v3.12). It had never really been considered for an upgrade, mostly because it was stable. Not too long ago, however, our Aerospike cluster saw a prolonged throughput surge of approximately 2X traffic (over 16K reads/sec and over 1K writes/sec per box). The surge led to an OOM kill of the Aerospike daemon on one of the nodes and, within a relatively short amount of time, brought down the entire cluster.
Starting the Aerospike daemons in an n-node cluster essentially involves loading the indices of persistent namespaces into main memory, followed by a migration of data once new nodes are discovered. After successfully starting the daemons on all the nodes, we ran into a new issue: fetches from Aerospike that used secondary indices were returning partial data. Our initial hunch was data loss, but a full scan of the data without the secondary index returned everything. After some investigation, we realized that this was a bug in Aerospike where records created via migrations were not added to secondary indices. The bug had been fixed in a later version of Aerospike (188.8.131.52 to be precise). Given the gravity of the situation, we had to upgrade at the earliest opportunity.
After studying scores of official Aerospike articles, blogs and white papers, we decided to go with Aerospike 184.108.40.206, which also happened to be the second most recent version at the time. This version brought major feature enhancements and bug fixes that would give us substantial mileage.
Note: We run Aerospike Enterprise, and most features discussed in this article are supported by the Enterprise edition.
Features that stand out
- Strong Consistency
- Quiescing a node
- Delaying fill migrations
Feature validation involved verifying the relevant bug fixes, any Aerospike-specific SLAs that indirectly contribute to our own SLAs, and the new features that Aerospike 220.127.116.11 promised.
Aerospike has its own set of recommendations for deploying on cloud providers. For AWS, Aerospike recommends i3 instances [1], which are optimized for IO-intensive workloads. However, i3s come at a considerable cost, and spinning up VMs for testing purposes takes a substantial amount of time as well.
Docker to the rescue
This is not to say that Docker is a suitable way to validate Aerospike features against something that mimics a near-production cluster. However, the feature validations we were concerned with could be tested effectively and rather efficiently with Docker containers.
We used Docker for two purposes: feature validation and testing our migration strategies. At this point, it's fair to say that we cannot proceed any further without discussing Aerospike's behaviour during a network partition, as this dictates a lot of Aerospike's core features and functionality.
Eric Brewer's CAP theorem states that a distributed system cannot guarantee consistency, availability and partition tolerance all at the same time. Prior to version 4.0, Aerospike was an AP system, meaning that it provided high availability and partition tolerance by compromising on consistency. In case of a network partition within the cluster, Aerospike would prioritize availability over consistency.
To illustrate this idea, let's consider the following scenario:
We have a 3-node Aerospike cluster with a replication factor of 2. If a network partition isolates any one of the nodes, the other 2 nodes will no longer be able to find the 3rd node. The most important consequence is that the isolated 3rd node now considers itself to be a cluster of its own (popularly called an island). In the 2-node cluster, migrations start in order to rebalance the partitions (and hence the data) across the 2 nodes. Similarly, in the 1-node cluster, partitions that were previously replicas assume the master role; with only one node in the island, there is no second node left to hold a replica.
Aerospike clients are thick, which is to say that they hold a partition map. In other words, clients know which partition a record belongs to and also where (on which node of the cluster) the partitions (master and replicas) live. Each client runs what is called a tend process that monitors the cluster topology at a regular interval, typically 1 second.
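
To make the idea of a thick client concrete, here is a minimal sketch of a partition map lookup. It relies on two things we are sure of (a namespace has 4096 partitions, and a record's partition is derived from a hash of its key) and on two labelled assumptions: SHA-1 stands in for the RIPEMD-160 digest Aerospike actually uses, and the round-robin layout in `build_partition_map` is purely hypothetical. All function and node names are ours.

```python
import hashlib

N_PARTITIONS = 4096  # Aerospike splits every namespace into 4096 partitions


def partition_id(set_name: str, user_key: str) -> int:
    """Hash a record key to a partition ID in [0, 4096).

    Assumption: SHA-1 stands in for Aerospike's RIPEMD-160 digest, which is
    not available in every Python build. Only the idea (hash the key, keep
    12 bits) carries over, not the exact partition IDs.
    """
    digest = hashlib.sha1(f"{set_name}:{user_key}".encode()).digest()
    return int.from_bytes(digest[:2], "little") & (N_PARTITIONS - 1)


def build_partition_map(nodes, replication_factor=2):
    """Hypothetical round-robin layout: partition -> [master, replica, ...]."""
    return {
        pid: [nodes[(pid + i) % len(nodes)] for i in range(replication_factor)]
        for pid in range(N_PARTITIONS)
    }


def master_for(partition_map, set_name, user_key):
    """The node a thick client sends a write to, without asking anyone first."""
    return partition_map[partition_id(set_name, user_key)][0]


partition_map = build_partition_map(["node-a", "node-b", "node-c"])
print(master_for(partition_map, "users", "user-42"))
```

The point of the sketch is that the lookup is entirely local to the client, which is why a client holding an outdated map keeps talking to the wrong node until its next tend.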
Suppose 20 clients are talking to the 3-node Aerospike cluster at the time of the network partition. Some clients are handed the topology of the 2-node cluster and others the topology of the 1-node cluster. At this point, you may already see the data inconsistency we are heading towards: data written and read by some clients differs from that written and read by other clients. This is what we mean by compromising consistency and prioritizing availability.
What we discussed here applies not only to Aerospike but also to most other distributed databases.
Aerospike 4.0 introduced Strong Consistency (SC). Unlike AP mode, SC, as the name suggests, is a CP (consistency and partition tolerance) system that prioritizes consistency over availability.
In this mode, we have to define what is called a roster, which is simply the set of nodes that will be part of the Aerospike cluster. For a replication factor of 2, roster-master refers to the node holding the master partition and roster-replica to the node holding the replica partition. If the network partition scenario above occurs, then for SC namespaces Aerospike disallows reads and writes in one of the islands based on a set of rules. The visibility of a partition from the client's perspective is determined by the following rules, as documented by Aerospike:
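
The divergence can be shown with a toy model. This is only a sketch under our own assumptions: each island is reduced to a plain dictionary, and the record is assumed to have had copies on node A and node C before the split, so both islands start with the same value.

```python
def simulate_split():
    """Toy model of an AP split: both islands keep accepting writes."""
    # State before the partition; assumed to be replicated on nodes A and C.
    before = {"balance": 100}

    # After the split, each island serves traffic from its own copy.
    island_ab = dict(before)  # 2-node island (A, B)
    island_c = dict(before)   # 1-node island (C)

    # Clients that tended against different islands write to different masters.
    island_ab["balance"] = 150  # client holding the A/B topology
    island_c["balance"] = 70    # client holding the C topology

    return island_ab["balance"], island_c["balance"]


print(simulate_split())  # (150, 70): same key, two different "truths"
```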
- ‘If a sub cluster (a.k.a. split-brain) has both the roster-master and roster-replica for a partition, then the partition is active for both reads and writes in that sub cluster’ [2]
- ‘If a sub cluster has a majority of nodes and has either the roster-master or roster-replica for the partition within its component nodes, the partition is active for both reads and writes in that sub cluster’ [2]
- ‘If a sub cluster has exactly half of the nodes in the full cluster (roster) and it has the roster-master within its component nodes, the partition is active for both reads and writes’ [2]
Although the above rules sound confusing, the gist is that during a network partition Aerospike makes sure that copies of a partition in 2 different islands can never both assume the master role, i.e. a split brain never happens for any given partition. The official Aerospike documentation on SC provides an excellent explanation of what the above 3 rules imply, so we won't repeat it here.
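
The three rules translate almost word for word into a small predicate. The sketch below is our own reading of the quoted rules for a replication factor of 2, not Aerospike's implementation; the function name and the node labels are made up for illustration.

```python
def partition_is_active(sub_cluster, roster, roster_master, roster_replica):
    """Decide whether one partition accepts reads and writes in a sub cluster."""
    members = set(sub_cluster) & set(roster)
    has_master = roster_master in members
    has_replica = roster_replica in members

    # Rule 1: the sub cluster holds both the roster-master and roster-replica.
    if has_master and has_replica:
        return True
    # Rule 2: a majority of the roster plus either of the two copies.
    if 2 * len(members) > len(roster) and (has_master or has_replica):
        return True
    # Rule 3: exactly half of the roster, and it includes the roster-master.
    if 2 * len(members) == len(roster) and has_master:
        return True
    return False


roster = ["A", "B", "C"]
# Node C is isolated; the partition's roster-master is C, its roster-replica is A.
print(partition_is_active(["A", "B"], roster, "C", "A"))  # True
print(partition_is_active(["C"], roster, "C", "A"))       # False
```

In this example the 2-node island keeps the partition available through its replica, while the isolated roster-master refuses traffic, so the partition is never active on both sides of the split.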
Quiescing a node
Whenever Aerospike detects a change in the number of nodes in the cluster, the partitions start rebalancing such that if n bytes of data are present in a cluster of m nodes, each node ends up with roughly n/m of the data. For a replication factor of 2, each partition has 2 copies (a master and a replica, or a leader and a follower). Whenever a node is taken down, or goes down due to a daemon crash or the like, it takes a tangible amount of time for a master handoff to occur. Aerospike clients monitor the topology of the cluster constantly. However, for as long as a new leader has not been elected for a partition, the client keeps using its old partition map, which is now stale. This is popularly called a master gap.
Here is where quiescing comes in handy. To quiesce a node is to inform Aerospike that the node is going to be taken down for some sort of maintenance or removed from the cluster. When a quiesce is issued, the node continues to serve traffic while a master handoff is initiated in the background. All writes and reads are eventually proxied to a different node, one that has the master partitions. Once no writes are happening on the node, it can be taken down.
In fairness, the word quiesce is actually misleading, as it may imply that it can only be used when the node is going to be brought back after maintenance, when in fact it can just as well be used before removing a node from the cluster for good.
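
A simplified model shows why quiescing closes the master gap. It assumes, as we understand the mechanism, that each partition has an ordered succession list of nodes, that the first live nodes in that list hold the master and replica, and that a quiesced node is treated as if it sat at the end of the list at the next rebalance. The `owners` function and node names are hypothetical.

```python
def owners(succession, alive, quiesced=(), replication_factor=2):
    """Return [master, replica, ...] for one partition in this simplified model."""
    live = [node for node in succession if node in alive]
    # Quiesced nodes are pushed to the back, so they give up their roles
    # while still being up and able to proxy requests.
    ordered = [n for n in live if n not in quiesced] + [n for n in live if n in quiesced]
    return ordered[:replication_factor]


succession = ["A", "B", "C"]

print(owners(succession, alive={"A", "B", "C"}))                  # ['A', 'B']
print(owners(succession, alive={"A", "B", "C"}, quiesced={"A"}))  # ['B', 'C']
print(owners(succession, alive={"B", "C"}))                       # ['B', 'C']
```

The second and third calls return the same owners: by the time node A is actually stopped, B is already the master, so clients never have to wait for a handoff.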
Quiescing therefore helps with patch version upgrades, or really any operation that involves restarting the Aerospike daemon. Bear in mind that taking down a node without quiescing it first will almost definitely lead to lost writes at high throughput. Recent versions of the Aerospike clients can handle daemons going down. However, a write that has already been sent to the node that is going down cannot be unsent, and the client will never receive an acknowledgement from that node, so the client times out. Applications are expected to handle this scenario with a retry or the like [3].
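
As an illustration of the application-side handling, here is a minimal retry wrapper with exponential backoff. It is deliberately client-agnostic: `WriteTimeout` and `write_fn` are hypothetical stand-ins for whatever timeout error and write call your client library exposes, and the attempt count and delays are arbitrary.

```python
import time


class WriteTimeout(Exception):
    """Hypothetical stand-in for the client's timeout error."""


def write_with_retry(write_fn, key, value, max_attempts=3, base_delay=0.05,
                     sleep=time.sleep):
    """Call write_fn(key, value), retrying on timeout with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn(key, value)
        except WriteTimeout:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** (attempt - 1))


# Usage with a fake write that times out twice before succeeding.
calls = {"count": 0}

def flaky_write(key, value):
    calls["count"] += 1
    if calls["count"] < 3:
        raise WriteTimeout(f"no ack for {key}")
    return "ok"

print(write_with_retry(flaky_write, "user-42", {"balance": 100}))  # ok
```

A timed-out write may still have been applied on the server, so a blind retry like this is only safe for idempotent writes; increments and appends need extra care.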
Delaying fill migrations
Migrations are IO intensive, and running them while the cluster is handling user traffic and high-volume writes hurts the cluster. Before quiescing a node, it is therefore recommended to delay fill migrations by a specific amount of time. This way, fill migrations stay paused while nodes are taken down for maintenance or the like.
Check out the second part of our blog to see how we actually performed the migration to the new Aerospike cluster.
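
Put together, the preparation before stopping a node might look like the sequence below. This is a sketch, not our exact runbook: the info commands (`set-config` with `migrate-fill-delay` in the `service` context, `quiesce:` and `recluster:`) are the ones we know from the Aerospike tooling, but the scopes next to them reflect our understanding, the helper is hypothetical, and the 3600 second delay is an arbitrary example. Verify all of it against the documentation for your version.

```python
def maintenance_info_commands(fill_delay_seconds: int):
    """Return (scope, info command) pairs in the order they would be issued."""
    if fill_delay_seconds < 0:
        raise ValueError("fill_delay_seconds must be non-negative")
    return [
        # Hold back fill migrations for the duration of the maintenance window.
        ("all nodes", f"set-config:context=service;migrate-fill-delay={fill_delay_seconds}"),
        # Mark the node that is about to be stopped.
        ("node to take down", "quiesce:"),
        # Trigger a rebalance so that the quiesce takes effect.
        ("all nodes", "recluster:"),
    ]


for scope, command in maintenance_info_commands(3600):
    print(f'[{scope}] asinfo -v "{command}"')
```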