The consistent hashing algorithm was proposed in 1997 by Karger et al. at the Massachusetts Institute of Technology as a solution for distributed caching. Its design goal was to address hot spot problems on the Internet, and its original intent was very similar to that of CARP. Consistent hashing corrects the problems caused by the simple hash algorithm used by CARP, making DHTs truly practical in P2P environments.


Today the consistent hashing algorithm is also widely used in distributed systems. Anyone who has studied the memcached cache database knows that the memcached server itself does not provide distributed cache consistency; it is provided by the client. Specifically, the client computes the consistent hash in the following steps:


  1. First, compute the hash value of each memcached server (node) and place it on a circle (continuum) ranging from 0 to 2^32.

  2. Then, using the same method, compute the hash value of the key under which the data is stored and map it onto the same circle.

  3. Finally, starting from the position the data maps to, search clockwise and save the data to the first server found. If no server is found after exceeding 2^32, the data is saved to the first memcached server. A short sketch of these three steps follows.
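
The three steps above can be sketched in a few lines of Java, using a sorted map to stand in for the circle. This is only a minimal illustration of the idea, not memcached's actual client code; the server addresses, the key, and the placeholder hash function are all assumptions made for the example:

import java.util.SortedMap;
import java.util.TreeMap;

public class RingLookupSketch {

    public static void main(String[] args) {
        // Step 1: hash each server (node) onto the circle of 0 to 2^32-1.
        TreeMap<Long, String> circle = new TreeMap<Long, String>();
        for (String server : new String[] { "10.0.0.1", "10.0.0.2", "10.0.0.3" }) {
            circle.put(hash(server), server);
        }

        // Step 2: hash the data's key onto the same circle.
        long keyHash = hash("some-cache-key");

        // Step 3: search clockwise from the key's position; the first server
        // found stores the data, wrapping around to the first server if the
        // search passes 2^32.
        SortedMap<Long, String> tail = circle.tailMap(keyHash);
        String server = tail.isEmpty() ? circle.get(circle.firstKey())
                                       : tail.get(tail.firstKey());
        System.out.println("key maps to " + server);
    }

    // Placeholder hash confined to 0..2^32-1; a real client would use a
    // better-distributed function such as MD5 or MurmurHash.
    static long hash(String s) {
        return s.hashCode() & 0xFFFFFFFFL;
    }
}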


Now consider adding a memcached server to the state above. With the remainder-based (modulo) distribution algorithm, cache hit rates suffer because the servers holding the keys change drastically. With consistent hashing, however, only the keys on the first server counterclockwise from the newly added server's position on the circle (continuum) are affected, as shown in the figure below:

 Consistent Hash Properties


Considering that any node of a distributed system may fail, and that new nodes may be added dynamically, it is worth asking how the system can keep providing good service when the number of nodes changes. This matters especially when designing a distributed caching system: if a server fails and no suitable algorithm is used to ensure consistency, all the data cached in the system may effectively be invalidated. Because the number of nodes has changed, a client requesting an object must recalculate its hash value (which usually depends on the number of nodes in the system), and it is then likely unable to find the server holding the object, since the hash value has changed. Consistent hashing therefore becomes crucial. The consistent hashing algorithm of a good distributed cache system should satisfy the following aspects:

  •  Balance


Balance means that the results of the hash should be distributed across all buffers as evenly as possible, so that all of the buffer space is utilized. Many hashing algorithms satisfy this condition.

  •  Monotonicity


Monotonicity means that if some content has already been assigned to buffers by the hash, and new buffers are then added to the system, the hash should guarantee that previously assigned content is remapped either to its original buffer or to one of the new buffers, but never to a different buffer in the old set. Simple hash algorithms often fail this requirement, for example the simplest linear hash: x = (ax + b) mod P, where P represents the total number of buffers. It is easy to see that when the buffer count changes (from P1 to P2), all the original hash results change, violating monotonicity. A change in hash results means that whenever the buffer space changes, all mapping relationships in the system must be updated. In a P2P system, a change in buffers corresponds to a peer joining or leaving the system, which happens frequently and would therefore impose a heavy computational and transmission load. Monotonicity is the requirement that the hash algorithm be able to cope with this situation; the sketch after this list makes the problem concrete.

  •  Spread


In a distributed environment, a terminal may not see all of the buffers, but only a subset of them. When the terminal maps content to buffers through the hashing process, different terminals may see different ranges of buffers, producing inconsistent hash results: the same content ends up mapped to different buffers by different terminals. This should clearly be avoided, because it causes the same content to be stored in multiple buffers and reduces the system's storage efficiency. Spread is defined as the severity with which this situation occurs. A good hashing algorithm should avoid such inconsistencies as much as possible, i.e., minimize spread.

  •  Load


The load problem is really the spread problem viewed from another angle. Since different terminals may map the same content to different buffers, it is likewise possible for a particular buffer to be mapped to different content by different users. As with spread, this situation should be avoided, so a good hashing algorithm should minimize the load on each buffer.

  •  Smoothness


Smoothness means that a smooth change in the number of cache servers should produce a correspondingly smooth change in which cached objects move.
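
To make the monotonicity problem above concrete, the following sketch counts how many keys change buckets when a simple modulo hash grows from P1 = 4 to P2 = 5 buffers. The key range and bucket counts are arbitrary choices for the demonstration:

public class ModuloRemapDemo {

    public static void main(String[] args) {
        int total = 10000;
        int moved = 0;
        for (int key = 0; key < total; key++) {
            int before = key % 4; // bucket with P1 = 4 buffers
            int after = key % 5;  // bucket with P2 = 5 buffers
            if (before != after) {
                moved++;
            }
        }
        // Exactly 80% of the keys are remapped after adding one buffer,
        // so almost every cached object would miss: monotonicity fails.
        System.out.println(moved + " of " + total + " keys moved");
    }
}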

 Basic Concept


The consistent hashing algorithm was first proposed in the paper Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. Simply put, consistent hashing organizes the entire hash value space into a virtual circle. For example, suppose the value space of a hash function H is 0 to 2^32-1 (i.e., the hash value is a 32-bit unsigned integer); the hash space ring is then as follows:


The entire space is organized in a clockwise direction, with 0 and 2^32-1 coinciding at the zero point.


The next step is to hash each server using the function H; specifically, you can use the server's IP address or hostname as the key. Each machine thereby gets a fixed position on the hash ring. Here, assume the four servers above land at the following positions on the ring after hashing their IP addresses:


Next, the following algorithm locates the server for a piece of data: compute the hash of the data's key using the same function H, determine the data's position on the ring, and then "walk" clockwise along the ring from that position; the first server encountered is the server the data should be located on.


For example, suppose we have four data objects: Object A, Object B, Object C, and Object D. After hashing, their positions on the ring are as follows:


According to the consistent hashing algorithm, Object A will be assigned to Node A, Object B to Node B, Object C to Node C, and Object D to Node D.


Now let us analyze the fault tolerance and scalability of the consistent hashing algorithm. Suppose Node C unfortunately goes down: objects A, B, and D are unaffected, and only Object C is relocated to Node D. In general, with consistent hashing, if a server becomes unavailable, the only data affected is the data between that server and the previous server on the ring (i.e., the first server encountered walking counterclockwise); everything else is unaffected.


Now consider another scenario: adding a server, Node X, to the system, as shown below:


Here, objects A, B, and D are unaffected; only Object C needs to be relocated to the new Node X. In general, with consistent hashing, if a server is added, the only data affected is the data between the new server and the previous server on the ring (i.e., the first server encountered walking counterclockwise); all other data is unaffected.


In summary, when nodes are added or removed, the consistent hashing algorithm only needs to relocate a small portion of the data on the ring, giving it good fault tolerance and scalability, as the simulation below demonstrates.
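
The sketch below reuses the placeholder hash and illustrative node names from the earlier example: it maps 10,000 keys onto a ring of four nodes, adds a Node X, and counts how many keys are relocated. Only the keys on the arc claimed by Node X move; the exact fraction depends on where the hash happens to place it, but it is a single arc rather than the roughly 80% a modulo scheme remaps:

import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentRemapDemo {

    // Clockwise lookup: first node at or after the hash, wrapping around.
    static String locate(TreeMap<Long, String> circle, long keyHash) {
        SortedMap<Long, String> tail = circle.tailMap(keyHash);
        return tail.isEmpty() ? circle.get(circle.firstKey())
                              : tail.get(tail.firstKey());
    }

    // Placeholder hash confined to 0..2^32-1.
    static long hash(String s) {
        return s.hashCode() & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> circle = new TreeMap<Long, String>();
        for (String node : new String[] { "Node A", "Node B", "Node C", "Node D" }) {
            circle.put(hash(node), node);
        }

        // Record where each key lives before the new node arrives.
        int total = 10000;
        String[] before = new String[total];
        for (int i = 0; i < total; i++) {
            before[i] = locate(circle, hash("key-" + i));
        }

        // Add Node X and count how many keys are relocated.
        circle.put(hash("Node X"), "Node X");
        int moved = 0;
        for (int i = 0; i < total; i++) {
            if (!before[i].equals(locate(circle, hash("key-" + i)))) {
                moved++;
            }
        }
        System.out.println(moved + " of " + total + " keys moved");
    }
}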


However, when there are too few service nodes, the consistent hashing algorithm is prone to data skew caused by uneven division of the ring. For example, suppose the system has only two servers, distributed on the ring as follows:


This will inevitably concentrate a large amount of data on Node A, while only a very small amount lands on Node B. To solve this data skew problem, the consistent hashing algorithm introduces a virtual node mechanism: multiple hashes are computed for each service node, and a copy of that service node, called a virtual node, is placed at each resulting position. In practice this can be done by appending a number to the server's IP address or hostname. In the case above, for example, three virtual nodes can be computed for each server by hashing "Node A#1", "Node A#2", "Node A#3", "Node B#1", "Node B#2", and "Node B#3", forming six virtual nodes:


Meanwhile, the data location algorithm remains unchanged; there is simply one extra step of mapping virtual nodes back to actual nodes. For example, data located at the virtual nodes "Node A#1", "Node A#2", and "Node A#3" is all stored on Node A. This solves the data skew problem when there are few service nodes. In practice, the number of virtual nodes is usually set to 32 or even more, so that even a small number of service nodes can achieve a relatively uniform data distribution. A minimal sketch of the mechanism follows.
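
The virtual node mechanism only changes how the ring is populated: each physical node is hashed several times under suffixed names, and every ring entry points back to the physical node that owns it, so lookups need no extra logic. In the sketch below, the "#n" suffix format and the replica count of three are taken from the example above; the node names and hash are illustrative:

import java.util.Map;
import java.util.TreeMap;

public class VirtualNodeSketch {

    public static void main(String[] args) {
        // Each ring entry maps a virtual node's position to its physical node.
        TreeMap<Long, String> circle = new TreeMap<Long, String>();
        int replicas = 3; // three virtual nodes per server, as in the text
        for (String node : new String[] { "Node A", "Node B" }) {
            for (int i = 1; i <= replicas; i++) {
                String virtual = node + "#" + i; // e.g. "Node A#1"
                long position = virtual.hashCode() & 0xFFFFFFFFL;
                circle.put(position, node); // store the physical node
            }
        }
        // Six virtual positions now interleave on the ring, but every
        // lookup still resolves to one of the two physical nodes.
        for (Map.Entry<Long, String> entry : circle.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}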

 Java Code Implementation

package org.java.base.hash;

import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class ConsistentHash<T> {

    // Number of virtual nodes placed on the ring for each physical node.
    private final int numberOfReplicas;

    // The hash ring: maps each virtual node's hash to its physical node.
    private final SortedMap<Integer, T> circle = new TreeMap<Integer, T>();

    public ConsistentHash(int numberOfReplicas, Collection<T> nodes) {
        this.numberOfReplicas = numberOfReplicas;
        for (T node : nodes) {
            add(node);
        }
    }

    // Places numberOfReplicas virtual nodes on the ring for the given node,
    // hashing the node's string form with a replica index appended.
    public void add(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            String nodestr = node.toString() + i;
            int hashcode = nodestr.hashCode();
            System.out.println("hashcode:" + hashcode);
            circle.put(hashcode, node);
        }
    }

    // Removes all of the node's virtual nodes from the ring.
    public void remove(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.remove((node.toString() + i).hashCode());
        }
    }

    // Locates the node responsible for the given key: hash the key, then
    // walk clockwise (tailMap) to the first virtual node at or after that
    // position, wrapping around to the first entry if none is found.
    public T get(Object key) {
        if (circle.isEmpty()) {
            return null;
        }
        int hash = key.hashCode();
        System.out.println("hashcode----->:" + hash);
        if (!circle.containsKey(hash)) {
            SortedMap<Integer, T> tailMap = circle.tailMap(hash);
            hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
        }
        return circle.get(hash);
    }

    public long getSize() {
        return circle.size();
    }

    // Prints every position on the ring in order, followed by the gap
    // between each pair of adjacent positions, to eyeball the balance.
    public void testBalance() {
        Set<Integer> sets = circle.keySet();
        SortedSet<Integer> sortedSets = new TreeSet<Integer>(sets);
        for (Integer hashCode : sortedSets) {
            System.out.println(hashCode);
        }

        System.out.println("----each location's distance to the next is as follows:----");

        // Two iterators, offset by one, yield adjacent pairs of positions.
        Iterator<Integer> it = sortedSets.iterator();
        Iterator<Integer> it2 = sortedSets.iterator();
        if (it2.hasNext()) {
            it2.next();
        }
        long keyPre, keyAfter;
        while (it.hasNext() && it2.hasNext()) {
            keyPre = it.next();
            keyAfter = it2.next();
            System.out.println(keyAfter - keyPre);
        }
    }

    public static void main(String[] args) {
        Set<String> nodes = new HashSet<String>();
        nodes.add("A");
        nodes.add("B");
        nodes.add("C");

        // Two virtual nodes per physical node, then add a fourth node "D".
        ConsistentHash<String> consistentHash = new ConsistentHash<String>(2, nodes);
        consistentHash.add("D");

        System.out.println("hash circle size: " + consistentHash.getSize());
        System.out.println("locations of each node are as follows:");
        consistentHash.testBalance();

        String node = consistentHash.get("apple");
        System.out.println("node----------->:" + node);
    }
}
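
One caveat about the implementation above: it places nodes and keys on the ring with String.hashCode(), whose distribution is relatively poor, so balance suffers in practice. Production implementations typically substitute a better-mixed function. Below is one common option as a sketch, deriving a 32-bit ring position from the first four bytes of an MD5 digest; any well-distributed hash (e.g. MurmurHash) would serve equally well:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class RingHash {

    // Derive a 32-bit ring position (0..2^32-1) from the first four bytes
    // of an MD5 digest; far better spread than String.hashCode().
    public static long hash(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return ((digest[0] & 0xFFL) << 24)
                 | ((digest[1] & 0xFFL) << 16)
                 | ((digest[2] & 0xFFL) << 8)
                 |  (digest[3] & 0xFFL);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}

To adopt it, the circle would be keyed by Long instead of Integer, and add(), remove(), and get() would call RingHash.hash(...) in place of hashCode().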

By lzz
