1. Confirm whether caching is required


Before using caching, make sure your project really needs it. Caching introduces certain technical complexities, which are described later. Generally speaking, there are two signals that tell you whether caching is needed.


  1. CPU usage: if your application repeatedly performs CPU-intensive computations — for example, evaluating the same regular expressions over and over — then caching the results of those computations can save a significant amount of CPU.

  2. Database I/O usage: if your database connection pool is fairly idle, you should not use caching. But if the pool is busy, or even regularly fires not-enough-connections alarms, it is time to consider caching. I once had a service that was called by many other services. Most of the day it was fine, but every morning at 10:00 it would fire a connection-pool-exhausted alarm. Investigation showed that several callers had all scheduled timed tasks at 10:00, and the flood of requests exhausted the DB connection pool, triggering the alarm. We could have solved it by adding machines or enlarging the connection pool, but there was no need to pay those costs for a problem that only occurred at 10 o'clock. Caching was later introduced, which not only solved the problem but also improved read performance.


If neither of these problems exists, there is no need to add caching for its own sake.

 2. Choose the right cache


Caching is divided into in-process caching and distributed caching. Many people, the author included, are confused when first choosing a caching framework: there are so many caches out there, and every one of them claims to be impressive — how do I choose?

 2.1 Choosing the right in-process cache


First, let’s look at a comparison of some of the more commonly used caches. For more information on how they work, see What you should know about the evolution of caching.

|                        | ConcurrentHashMap | LRUMap | Ehcache | Guava Cache | Caffeine |
|------------------------|-------------------|--------|---------|-------------|----------|
| Read/write performance | Very good (segmented locking) | Fair (global lock) | Good | Good | Excellent |
| Eviction algorithm     | None | LRU (basic) | Multiple: LRU, LFU, FIFO | LRU (basic) | W-TinyLFU (excellent) |
| Feature richness       | Relatively simple | Relatively limited | Very rich | Very rich: supports refresh, weak/soft references, etc. | Similar to Guava Cache |
| Size                   | JDK built-in, very small | Based on LinkedHashMap, small | Very large (~1.4 MB latest) | A small part of the Guava jar | Moderate (~644 KB latest) |
| Persistence            | No | No | Yes | No | No |
| Clustering             | No | No | Yes | No | No |

  • ConcurrentHashMap is best suited to caching a small number of fixed, rarely-changing elements. Although it looks inferior in the table above, it ships with the JDK and is still used heavily in all kinds of frameworks — for example, to cache reflection results such as Method and Field objects, or to cache connections so they are not repeatedly re-established. Caffeine itself uses a ConcurrentHashMap internally to store its entries.

  • LRUMap is a good choice if you want an eviction algorithm but don't want to introduce a third-party dependency.

  • Ehcache is rather heavyweight because of its large jar. Choose it if you need features such as persistence or clustering. The author has not used this cache; if you need those features, a distributed cache is usually a better replacement for Ehcache.

  • Guava Cache: the Guava jar is already on the classpath of many Java applications, so much of the time it is simply convenient to use; it is lightweight and fairly feature-rich. If you are not familiar with Caffeine, Guava Cache is a fine choice.

  • Caffeine is highly recommended by the author. Its hit rate and read/write performance are far better than Guava Cache's, and its API is basically the same as Guava Cache's — even slightly richer. Caffeine has been used in real environments with good results.


To summarize: if you don't need an eviction algorithm, choose ConcurrentHashMap; if you need an eviction algorithm and a rich API, we recommend Caffeine.
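To make the LRUMap row of the table concrete, here is a minimal LRU cache built on the JDK's LinkedHashMap (the class LRUMap is based on) — a dependency-free sketch, with an arbitrary demo capacity of 2:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache on top of LinkedHashMap: access-order mode plus
// removeEldestEntry gives eviction without any third-party dependency.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);          // true = iterate in access order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;        // evict the least recently used entry
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");                  // touch "a" so "b" becomes eldest
        cache.put("c", "3");             // evicts "b"
        System.out.println(cache.keySet()); // [a, c]
    }
}
```

This is the whole trick LRUMap-style caches use; Caffeine's W-TinyLFU is far more sophisticated but serves the same role.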

 2.2 Choosing the Right Distributed Cache


Here we compare three well-known distributed caches: MemCache (which the author has not used in practice), Redis (known internally at Meituan as Squirrel), and Tair (known internally at Meituan as Cellar). Their features and implementation principles differ considerably, so they suit different scenarios.

|                        | MemCache | Squirrel/Redis | Cellar/Tair |
|------------------------|----------|----------------|-------------|
| Data structures        | Simple key-value only | String, Hash, List, Set, Sorted Set | String, HashMap, List, Set |
| Persistence            | Not supported | Supported | Supported |
| Capacity               | Purely in-memory; not suited to storing much data | All in-memory; for resource-cost reasons should not exceed ~100 GB | Memory-only or memory + disk engines; capacity can be expanded without limit |
| Read/write performance | High | Very high (RT ~0.5 ms) | Higher for String (RT ~1 ms); slower for complex types (RT ~5 ms) |

  • MemCache: the author has little experience with it, so no strong recommendation. Its throughput is high, but it supports few data structures and no persistence.

  • Redis: supports rich data structures and very high read/write performance, but the data lives entirely in memory, so resource cost must be considered; persistence is supported.

  • Tair: supports rich data structures and high read/write performance, though some types are slower; capacity can in theory be expanded without limit.


Summary: if the service is latency-sensitive and stores a lot of Map/Set data, Redis is the better fit. If the service needs to put a very large amount of data into the cache and is not particularly latency-sensitive, you can choose Tair. Many applications at Meituan use Tair; in the author's project it stores the payment tokens and payment codes we generate, replacing database storage. In most scenarios, either can substitute for the other.

 3. Multi-level caching

 When many people think of caching, the following image immediately comes to mind.


Redis is used to store hot data, and data not available in Redis goes directly to the database for access.


When I introduced local caching earlier, many people asked me: I already have Redis, why do I need to understand Guava or Caffeine in-process caches? I basically give the following two answers.


  1. If Redis goes down, or an old version of Redis performs a full resynchronization, Redis becomes unavailable; at that point every request can only go to the database, which easily causes an avalanche.

  2. Accessing Redis involves network I/O and serialization/deserialization. Although its performance is very high, it can never be as fast as a local method call, so the hottest data can be stored locally to speed up access further. This idea is not unique to the Internet architectures we build: computer systems use L1/L2/L3 multi-level caches to reduce direct memory access and thereby speed things up.


So Redis alone can meet most of our needs, but when you must pursue higher performance and higher availability, you have to understand multi-level caching.

 3.1 Using process caching


An in-process cache is limited by the size of the heap, and other machines cannot know when a given process's cache has been updated. In-process caching is therefore generally suitable when:


  1. The amount of data is not large and updates are infrequent. We once had a merchant-name lookup service that is called when sending SMS. Merchant names change rarely, and even if a change does not reach the cache in time, an SMS carrying the old merchant name is still acceptable to the customer. Using Caffeine as the local cache, with the size set to 10,000 and the expiration time set to 1 hour, basically solved the problem during peak periods.

  2. If the data is updated frequently and you still want an in-process cache, set a shorter expiration time, or a shorter auto-refresh interval. Both are readily available APIs in Caffeine and Guava Cache.
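Caffeine provides expire-after-write directly via `maximumSize` and `expireAfterWrite`. As a dependency-free sketch of that semantic, each entry can carry its write timestamp (the 50 ms TTL is only for the demo; the SMS service above used 1 hour):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of expire-after-write: each value records its write time and get()
// treats entries older than ttlMillis as absent. Caffeine's expireAfterWrite /
// refreshAfterWrite give this (plus real eviction) out of the box.
public class ExpiringCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long writtenAt;
        Entry(V value, long writtenAt) { this.value = value; this.writtenAt = writtenAt; }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public ExpiringCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public void put(K key, V value) {
        map.put(key, new Entry<>(value, System.currentTimeMillis()));
    }

    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() - e.writtenAt > ttlMillis) {
            map.remove(key);             // lazily drop expired entries
            return null;
        }
        return e.value;
    }

    public static void main(String[] args) throws InterruptedException {
        ExpiringCache<String, String> c = new ExpiringCache<>(50);
        c.put("merchant:1", "Acme");
        System.out.println(c.get("merchant:1"));
        Thread.sleep(80);
        System.out.println(c.get("merchant:1"));
    }
}
```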

 3.2 Using multi-level caching


As the saying goes, there’s nothing in the world that can’t be solved with one cache, and if there is, then two.


Generally we pair one in-process cache with one distributed cache for multi-level caching; two levels are usually enough. With three or four levels, the maintenance cost grows so high that it may well not pay off. A typical setup follows.


Caffeine is utilized as the first level cache and Redis as the second level cache.


  1. First query Caffeine; if the data is present, return it directly. Otherwise proceed to step 2.

  2. Next query Redis; if the data is found, return it and backfill it into Caffeine. Otherwise proceed to step 3.

  3. Finally query MySQL; if data comes back, backfill it into Redis and then Caffeine.
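The three steps above can be sketched as follows, with plain maps standing in for Caffeine, Redis, and MySQL (real clients would replace them; all names here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Read path for a two-level cache: L1 (in-process, e.g. Caffeine),
// L2 (distributed, e.g. Redis), then the database. Maps stand in for all three.
public class TwoLevelRead {
    static Map<String, String> l1 = new HashMap<>();   // stands in for Caffeine
    static Map<String, String> l2 = new HashMap<>();   // stands in for Redis
    static Map<String, String> db = new HashMap<>();   // stands in for MySQL

    static String get(String key) {
        String v = l1.get(key);                // step 1: L1
        if (v != null) return v;
        v = l2.get(key);                       // step 2: L2
        if (v != null) {
            l1.put(key, v);                    // backfill L1
            return v;
        }
        v = db.get(key);                       // step 3: database
        if (v != null) {
            l2.put(key, v);                    // backfill L2, then L1
            l1.put(key, v);
        }
        return v;
    }

    public static void main(String[] args) {
        db.put("user:1", "Alice");
        System.out.println(get("user:1"));            // loaded from db, backfilled
        System.out.println(l1.containsKey("user:1")); // now served from L1
    }
}
```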


With Caffeine, when data is updated the cache entry can only be deleted on the machine that performed the update; other machines can only expire their copies via a timeout. Two timeout strategies are available:

  •  Expire entries a fixed time after they are written.
  •  Refresh entries a fixed time after they are written.


Redis updates, by contrast, are immediately visible to other machines, but a timeout must still be set — make it longer than Caffeine's expiration.

To solve the stale in-process cache problem, the design can be optimized further:


Through Redis pub/sub, you can notify other nodes to delete an entry from their process caches. If Redis is down or the subscription mechanism is unreliable, the timeout setting still acts as a safety net.

 4. Cache update

 Generally the cache is updated in two ways.

  •  Delete the cache first, then update the database.

  • Update the database first, then delete the cache. Both approaches are used in industry, everyone has their own opinion about them, and which to use depends on your own trade-offs. Of course, some will ask: why delete the cache rather than update it? Consider multiple concurrent requests updating the same data: you cannot guarantee that the order of database updates matches the order of cache updates, so the database and the cache would end up inconsistent. That is why the cache is generally deleted instead.

 4.1 Delete the cache before updating the database


The simple version of this update is: delete the entry from every cache level, then update the database. It has a serious problem: after the cache is deleted but before the database is updated, a read request arrives; finding no cache, it reads the database directly, gets the old value, and loads it into the cache — and every subsequent read then sees stale data.


Whether a cache operation succeeds or fails must not block our database operation, so the cache delete is often performed asynchronously — but delete-first cannot accommodate that pattern well.


Deleting first does have one advantage: if the database operation then fails, the worst outcome is a cache miss caused by the already-deleted entry.

 4.2 Update database first, then delete cache (recommended)


Updating the database first and then deleting the cache avoids the problem above, but it introduces a new one. Imagine data that is not currently cached: a read request falls through to the database; meanwhile an update runs, and its cache delete completes before the read backfills the cache with the (now stale) value it read. The cache and the database are then inconsistent.


Why do many companies, including Facebook, still choose this approach despite the problem? Because the conditions for triggering it are demanding.

  1.  First it is necessary that the data is not in the cache.
  2.  Secondly the query operation needs to reach the database before the update operation.

  3. Finally, the query's cache backfill must land after the update's cache delete. This rarely happens: the delete occurs after the database write, and reads are generally somewhat faster than writes, so a read that reached the database before the update would normally have backfilled the cache before the delete.


Compared with the problem in 4.1, the probability here is very low, and with the timeout mechanism as a backstop it basically meets our needs. If you really must pursue perfection, you can use two-phase commit, but the cost is generally not proportional to the benefit.


Of course, there is another problem: if the delete fails, the cached data stays inconsistent with the database until the expiration timeout saves us. We can optimize for this: a failed delete must not affect the main flow, so push it onto a queue for subsequent asynchronous deletion.
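A sketch of that fallback — the database write succeeds, a failed cache delete is queued and retried asynchronously instead of failing the main flow (all class and field names are illustrative, and a real retry worker would run in the background):

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Update the DB first, then delete the cache; if the delete fails,
// enqueue the key for asynchronous retry rather than failing the request.
public class WriteWithAsyncDelete {
    static Map<String, String> db = new ConcurrentHashMap<>();
    static Map<String, String> cache = new ConcurrentHashMap<>();
    static BlockingQueue<String> retryQueue = new LinkedBlockingQueue<>();
    static boolean cacheHealthy = true;        // toggled to simulate a failure

    static void update(String key, String value) {
        db.put(key, value);                    // 1. update the database
        try {
            deleteFromCache(key);              // 2. delete the cache entry
        } catch (RuntimeException e) {
            retryQueue.offer(key);             // 3. on failure, retry later
        }
    }

    static void deleteFromCache(String key) {
        if (!cacheHealthy) throw new RuntimeException("cache unavailable");
        cache.remove(key);
    }

    // Would run in a background worker: drains the queue and retries deletes.
    static void drainRetries() {
        String key;
        while ((key = retryQueue.poll()) != null) {
            deleteFromCache(key);
        }
    }

    public static void main(String[] args) {
        cache.put("k", "old");
        cacheHealthy = false;
        update("k", "new");                    // delete fails, key is queued
        cacheHealthy = true;
        drainRetries();                        // retry succeeds
        System.out.println(cache.containsKey("k")); // stale entry is gone
    }
}
```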

 5. The Three Pitfalls of Caching


When people ask about the caveats of caching, the first things that come to mind are certainly cache penetration, cache breakdown, and cache avalanche — the three classic pits. Here is a brief introduction to what each of them is and how to cope with it.

 5.1 Cache Penetration


Cache penetration means querying for data that does not exist in the database. Naturally it is not in the cache either, so every miss goes on to query the database; with enough such requests, the pressure on our database naturally grows.

To avoid this problem, the following two approaches can be used.


  1. Convention: cache NULL results as well, but do not cache exceptions — be careful not to write an exception into the cache. This approach increases the maintenance cost of the cache: you must delete the empty entry when the real value is inserted; alternatively, give empty entries a short timeout.


  2. Filtering: devise rules to filter out impossible keys — a BitMap for small key spaces, a Bloom filter for large ones. For example, if your order IDs are known to fall in the range 1–1000, anything outside that range can be rejected immediately.
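Both guards together can be sketched as follows, using the 1–1000 order-ID range from the example; the sentinel string and map-backed "database" are illustrative (a real system would use Redis/Tair and a proper Bloom filter for large key spaces):

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Two penetration guards: (1) reject keys outside the known valid range;
// (2) cache "not found" as a sentinel so repeated misses stop hitting the DB.
public class PenetrationGuard {
    static final String NULL_SENTINEL = "__NULL__";   // illustrative marker
    static Map<Long, String> cache = new ConcurrentHashMap<>();
    static Map<Long, String> db = new ConcurrentHashMap<>();
    static int dbQueries = 0;

    static Optional<String> getOrder(long id) {
        if (id < 1 || id > 1000) return Optional.empty();  // rule filter
        String v = cache.get(id);
        if (v != null) {
            return NULL_SENTINEL.equals(v) ? Optional.empty() : Optional.of(v);
        }
        dbQueries++;                                        // one real DB query
        String fromDb = db.get(id);
        cache.put(id, fromDb == null ? NULL_SENTINEL : fromDb); // cache NULL too
        return Optional.ofNullable(fromDb);
    }

    public static void main(String[] args) {
        System.out.println(getOrder(5000).isPresent()); // filtered out, no DB hit
        System.out.println(getOrder(7).isPresent());    // miss, queries DB once
        System.out.println(dbQueries);                  // subsequent misses are cached
    }
}
```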

 5.2 Cache Breakdown


Cache breakdown: a key has an expiration time but holds hot data. The moment that key expires, a large number of requests miss the cache simultaneously and all go to the database, whose load spikes.

To avoid this problem, we can use the following two approaches.


  1. Add a distributed lock: when loading the data, lock its key — in Redis a simple setNX will do. The thread that obtains the lock queries the database and updates the cache, while the other threads retry; this way the database is not hit by many threads for the same data at the same time.

  2. Asynchronous loading: since cache breakdown only happens to hot data, you can adopt auto-refresh-on-expiration for that data instead of evict-on-expiration. Eviction exists for data freshness, so auto-refresh is acceptable here.
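The in-process analogue of the setNX pattern is ConcurrentHashMap.computeIfAbsent, which already guarantees that only one thread runs the loader per key while the others wait for its result (the loader here is a stand-in for the database query):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Per-key single loading: computeIfAbsent ensures that for a given key the
// loader runs exactly once, even when many threads request it concurrently.
public class SingleLoader {
    static ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    static AtomicInteger dbHits = new AtomicInteger();

    static String get(String key) {
        return cache.computeIfAbsent(key, k -> {
            dbHits.incrementAndGet();       // stands in for the database query
            return "value-for-" + k;
        });
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> get("hot-key");
        Thread t1 = new Thread(task), t2 = new Thread(task), t3 = new Thread(task);
        t1.start(); t2.start(); t3.start();
        t1.join(); t2.join(); t3.join();
        System.out.println(dbHits.get());   // the loader ran only once
    }
}
```

In the distributed case the same shape is achieved with setNX as the lock and a retry loop for the threads that lose the race.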

 5.3 Cache Avalanche


Cache avalanche means the cache becomes unavailable, or a large number of entries expire at the same moment because they share the same timeout; a flood of requests then hits the database directly, the database is overwhelmed, and the system collapses.

To avoid this problem, we adopt the following measures.


  1. Increase the availability of the caching system, monitor the health of the cache, and expand the cache appropriately according to the business volume.

  2. Use multi-level caching with different timeouts at different levels, so that if one level's entries all expire, another level still backs them up.

  3. Randomize expiration times: if the timeout used to be a fixed 10 minutes, let each key instead expire at a random point between 8 and 13 minutes, so that different keys expire at different times.

 6. Cache Pollution


Cache pollution generally occurs with local caches: imagine fetching an object from the local cache and then modifying it in place without writing the change to the database — the cached copy is now corrupted for everyone. This is cache pollution.


For example, suppose a Customer is looked up by id and the requirement is to change the customer's name. If the developer modifies the object taken straight out of the cache, that shared Customer object is polluted, and every other thread that fetches it gets the wrong data.
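A minimal sketch of that scenario (class and field names are illustrative), contrasting the polluting pattern with a defensive copy:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The pollution pattern: a mutable object is handed out of a local cache and
// modified in place, so every other reader now sees the changed (wrong) data.
public class CachePollution {
    static class Customer {
        long id;
        String name;
        Customer(long id, String name) { this.id = id; this.name = name; }
        Customer copy() { return new Customer(id, name); }   // defensive copy
    }

    static Map<Long, Customer> cache = new ConcurrentHashMap<>();

    static Customer getPolluting(long id) {
        return cache.get(id);               // BAD: hands out the shared instance
    }

    static Customer getSafe(long id) {
        Customer c = cache.get(id);
        return c == null ? null : c.copy(); // GOOD: callers mutate their own copy
    }

    public static void main(String[] args) {
        cache.put(1L, new Customer(1L, "Alice"));
        getPolluting(1L).name = "Bob";       // pollutes the cached object
        System.out.println(getPolluting(1L).name); // every reader now sees "Bob"
    }
}
```

Immutable value objects achieve the same protection without the copy.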


Avoiding this problem requires care from developers at coding time, rigorous code review, and thorough regression testing; together these mitigate it to some extent.

 7. Serialization


Serialization is a problem many people pay no attention to — until, right after a release, a strange exception causes avoidable losses, and the investigation in the end traces back to serialization. Here are a few common serialization problems.


  1. Key/value objects too complex for the serializer: I once hit this at Meituan, where Tair uses protostuff for serialization by default and the communication framework is Thrift. Thrift's generated TOs contain many complex data structures, yet we stored them in Tair. Deserialization raised no error and the unit tests passed, but QA found the feature broken: a boolean field that defaults to false, after being set to true, serialized to Tair, and deserialized, came back as false. The cause was that protostuff handles objects with complex structures (arrays, Lists, and so on) poorly. Converting the TO to a plain Java object fixed serialization and deserialization.

  2. Fields were added or removed, so that after going live, fetching an old cache entry caused deserialization errors or shifted data.

  3. Different JVMs may serialize differently. If your cache is shared by different services (not recommended), be aware that different JVMs may order a class's fields differently, which affects serialization. For example, in the following code, the fields of class A are laid out in a different order under JDK 7 and JDK 8, which ultimately breaks deserialization:
//jdk 7
class A{
    int a;
    int b;
}
//jdk 8
class A{
    int b;
    int a;
}


The problem of serialization must be taken seriously, and the solutions are as follows.


  1. Testing: serialization needs comprehensive tests, and if different services run different JVMs, that combination needs testing too. In the problem above, the author's unit test passed only because it used the default value false and never exercised true; fortunately QA caught it.

  2. Every serialization framework has its own rules. If adding a field breaks compatibility with old data under your current framework, you can change frameworks. Protostuff deserializes fields in declaration order, so new fields must be appended at the end — never inserted in the middle, or errors will result. To delete a field, mark it @Deprecated instead; removing it outright (unless it is the last field) will certainly cause a serialization exception.

  3. Double-writing with versioned keys: add a version number to each cache key and bump it on every release. Suppose the current release uses Key_1 and the upcoming one uses Key_2. After the new release goes live, write both versions (Key_1 and Key_2) of each key-value pair, but keep reading the old Key_1 data. Assuming the previous cache expiration was half an hour, then half an hour after the release all old-format entries written before it have been evicted, and the old and new caches hold essentially the same data; at that point switch reads to the new cache and stop double-writing. This gives a smooth transition between the old and new models, at the cost of briefly maintaining both and having to remove the old one in the next release, which adds maintenance overhead.
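The versioned-key transition can be sketched as follows (the key naming and the map standing in for the distributed cache are illustrative; a real rollout would flip readVersion via configuration after the old entries have aged out):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Versioned-key double-write: during the transition, writes go to both the
// old (v1) and new (v2) key versions while reads stay on v1; once the old
// entries have expired, reads switch to v2 and the double-write stops.
public class VersionedKeys {
    static Map<String, byte[]> cache = new ConcurrentHashMap<>();
    static int readVersion = 1;                    // flipped after the transition

    static String versionedKey(String key, int version) {
        return key + "_" + version;                // e.g. "token_1", "token_2"
    }

    static void put(String key, byte[] oldFormat, byte[] newFormat) {
        cache.put(versionedKey(key, 1), oldFormat); // old serialization format
        cache.put(versionedKey(key, 2), newFormat); // new serialization format
    }

    static byte[] get(String key) {
        return cache.get(versionedKey(key, readVersion));
    }

    public static void main(String[] args) {
        put("token", new byte[]{1}, new byte[]{2});
        System.out.println(get("token")[0]);        // still reading the old version
        readVersion = 2;                            // switch once old data is gone
        System.out.println(get("token")[0]);        // now reading the new version
    }
}
```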

 8. GC tuning


For applications that make heavy use of local caching, cache eviction means GC problems will be a common occurrence. If GCs are frequent and STW pauses long, service availability will suffer. The following recommendations apply.


  1. Check GC monitoring regularly; if you find anything abnormal, work out how to optimize it.

  2. With the CMS collector, a long remark phase can be normal for an application with a large local cache: during the concurrent phase many new objects enter the cache, so the remark scan is very time-consuming, and remark pauses the application again. You can enable -XX:+CMSScavengeBeforeRemark to run a young GC before the remark phase, reducing the cost of scanning GC roots during it.

  3. You can use the G1 garbage collector to improve service availability by setting the maximum pause time with -XX:MaxGCPauseMillis.
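The two tunings above translate to command lines like the following (flag values are illustrative and should be tuned per workload):

```shell
# CMS: run a young GC before the remark phase to shrink its pause
java -XX:+UseConcMarkSweepGC -XX:+CMSScavengeBeforeRemark -jar app.jar

# G1: cap the target pause time instead (100 ms is an example value)
java -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -jar app.jar
```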

 9. Cache monitoring


Many people also neglect cache monitoring: if nothing errors after release, they assume the cache simply works. But without experience it is easy to set an inappropriate expiration time or cache size, leaving the hit rate so low that the cache becomes mere decoration in the code. Monitoring the cache's various metrics therefore matters: from their values we can tune the cache's parameters toward an optimum.


Every get operation should be recorded — distinguishing a cache hit, a cache miss (entry absent), a cache expiry, and a cache failure (an exception thrown during the get) — through a monitoring system such as Cat. From these metrics we can compute the hit rate and use it as a reference when adjusting expiration times and cache sizes.
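The recording just described can be sketched with plain counters (a real system would report these to Cat or another monitoring framework; the wrapper and names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Counts the get outcomes (hit / miss / expired / failure) so a hit rate can
// be derived; a real system would report these counters to Cat or similar.
public class CacheMetrics {
    final LongAdder hit = new LongAdder();
    final LongAdder miss = new LongAdder();
    final LongAdder expired = new LongAdder();
    final LongAdder failure = new LongAdder();

    // Wraps a raw cache lookup and records the outcome.
    String get(Map<String, String> cache, String key) {
        try {
            String v = cache.get(key);
            if (v == null) { miss.increment(); } else { hit.increment(); }
            return v;
        } catch (RuntimeException e) {
            failure.increment();
            return null;
        }
    }

    double hitRate() {
        long total = hit.sum() + miss.sum() + expired.sum() + failure.sum();
        return total == 0 ? 0.0 : (double) hit.sum() / total;
    }

    public static void main(String[] args) {
        CacheMetrics m = new CacheMetrics();
        Map<String, String> c = new ConcurrentHashMap<>();
        c.put("a", "1");
        m.get(c, "a");
        m.get(c, "b");
        System.out.println(m.hitRate()); // one hit out of two gets
    }
}
```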

 10. A good framework


What is a good swordsman without a good sword? If you want to use caching well, a good framework is also essential. In the very beginning, everyone operated the cache through util classes, with the cache logic written inline in the business logic.
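A hedged reconstruction of that util style (maps stand in for the cache util and the DAO; all names are illustrative) — note how every business method must hand-roll the same get-check-load-put sequence:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The util style described above: cache plumbing is interleaved with the
// business query, so changing the caching strategy means editing business code.
public class BusinessService {
    static Map<Long, String> cacheUtil = new ConcurrentHashMap<>(); // stands in for a cache util
    static Map<Long, String> db = new ConcurrentHashMap<>();        // stands in for the DAO

    static String getMerchantName(long id) {
        String name = cacheUtil.get(id);               // cache plumbing...
        if (name != null) return name;
        name = db.get(id);                             // ...the actual business query...
        if (name != null) cacheUtil.put(id, name);     // ...more plumbing
        return name;
    }

    public static void main(String[] args) {
        db.put(1L, "Acme");
        System.out.println(getMerchantName(1L));       // loaded and cached
    }
}
```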


Code like that couples the cache logic into the business logic: if we later want to introduce a multi-level cache, we have to modify the business logic, which violates the open-closed principle. Introducing a good framework is therefore a better choice.


We recommend JetCache, an open-source framework that implements the Java caching specification JSR-107 and supports advanced features such as automatic refresh. Drawing on JetCache, the author combined Spring Cache, the Cat monitoring framework, and Meituan's circuit-breaking and rate-limiting framework Rhino into an in-house caching framework, so that cache operations, hit-rate monitoring, and circuit-breaking degradation require no attention from business developers. With such a framework, the hand-rolled cache code reduces to a few annotations.

The monitoring data is then also easy to view on dashboards.

By lzz
