Performance optimization has always been the focus of back-end service optimization, but online performance failure problems are not frequent, or limited by business products, there is no way to performance problems, including the author’s own performance problems encountered are not many, so in order to reserve knowledge in advance, when the problem will not be overwhelmed, we this post to simulate the next common several Java performance failures, to learn how to analyze and locate.

 background knowledge

Since locating the problem definitely requires the use of tools, let’s first understand what tools are needed to help locate the problem.

 top command

top command makes us one of the most commonly used Linux commands, it can real-time display the CPU utilization of the currently executing process, memory utilization and other system information. top -Hp pid You can check the system resource usage of a thread.

 vmstat command

vmstat is a virtual memory inspection tool that specifies the period and the number of times it is collected, it can count the usage of memory, CPU, and swap, it also has an important and common function for observing the context switching of processes. The field descriptions are as follows.

  • r: number of processes in the run queue (when the number is greater than the number of CPU cores, it means there are blocking threads)
  •  b: Number of processes waiting for IO
  •  swpd: use virtual memory size
  •  free: free physical memory size

  • buff: size of memory used as a buffer (memory and hard disk buffers)

  • cache: size of memory used as cache (buffer between CPU and memory)

  • si: the size of writes per second from the swap area to memory, transferred from disk to memory

  • so: the amount of memory written to the swap area per second, transferred from memory to disk
  •  bi: number of blocks read per second
  •  bo: blocks written per second
  •  in: number of interrupts per second, including clock interrupts.
  •  cs: number of context switches per second.

  • us: Percentage of user process execution time (user time)

  • sy: Percentage of kernel system process execution time (system time)
  •  wa: Percentage of IO wait time
  •  id: Percentage of free time

 The pidstat command

pidstat is a component of Sysstat and a powerful performance monitoring tool. top and vmstat commands monitor the memory, CPU and I/O usage of a process, while the pidstat command can detect it at the thread level. pidstat The command thread switching fields are described below:

  •  UID : The real user ID of the monitored task.

  •  TGID : Thread Group ID.

  •  TID: Thread ID.

  • cswch/s: the number of active context switches, here the thread is switched because of resource blocking, such as lock waiting and other cases.

  • nvcswch/s: number of passive context switches, in this case CPU scheduling switched threads.

 jstack commands

jstack is a JDK tool command, it is a thread stack analysis tool, the most commonly used function is to use the jstack pid command to view the thread’s stack information, but also often used to rule out deadlock situations.

 jstat command

It can detect the real-time situation of Java program operation, including heap memory information and garbage collection information, we often use to check the program garbage collection situation. The commonly used command is jstat -gc pid . The information fields are described below:

  • S0C: Capacity of To Survivor in young generation (in KB);

  • S1C: Capacity (in KB) of From Survivor in young generation;

  • S0U: Space (in KB) currently used by To Survivor in the younger generation;

  • S1U: Space currently used by From Survivor in the younger generation (in KB);

  • EC: the capacity of Eden in the young generation (in KB);

  • EU: space currently used by Eden in the younger generation (in KB);

  •  OC: Capacity of the old age (in KB);

  •  OU: Old age currently used space (in KB);

  •  MC: Capacity of the metaspace (in KB);

  •  MU: Metaspace currently utilized space (in KB);

  • YGC: number of gc times in the young generation from application startup to sampling;

  • YGCT: the time (s) taken by gc in the young generation from application startup to sampling;

  • FGC: Number of Full Gc gc times from application startup to sampling;

  • FGCT: Time (s) taken from application startup to sampling Full Gc gc;

  • GCT: Total time (s) used by gc from application startup to sampling.

 jmap command

jmap is also a JDK tool command that allows you to view heap memory initialization information as well as heap memory usage, and also generates a dump file for detailed analysis. View heap memory situation command jmap -heap pid .

 mat memory tool

MAT (Memory Analyzer Tool) tool is a plug-in eclipse (MAT can also be used separately), it analyzes the large memory dump file, you can very intuitively see the various objects in the heap space occupied by the size of the memory, the number of class instances, the object reference relationship, the use of OQL object query, as well as can be very easy to find out the object GC Roots information. Click here to download

 Simulation environment preparation

Basic environment jdk1.8, using SpringBoot framework to write a few interfaces to trigger the simulation scenarios, first of all, the simulation of the CPU occupied situation

 CPU utilization

Simulating a full CPU is still relatively simple, just write a dead loop to calculate the CPU consumption.

    public void testCPULoop() throws InterruptedException {

        int num = 0;
        while (true) {
            if (num == Integer.MAX_VALUE) {
            num = 0;


Request interface address test curl localhost:8080/cpu/loop ,found that the CPU immediately spiked to 100%!

View Java threads by executing top -Hp 32805

Execute printf '%x' 32826 to get the thread id in hexadecimal for the dump info query and the result is 803a . Finally, we execute jstack 32805 |grep -A 20 803a to see the detailed dump information.

Here dump information directly locates the problematic method as well as the line of code, which locates the CPU hogging problem.

 memory leak

The simulation of memory leaks is done with the help of the ThreadLocal object, which is a thread private variable that can be bound to the thread and will exist throughout the thread’s lifecycle, but due to the specificity of ThreadLocal, which is implemented based on the ThreadLocalMap. The Entry of ThreadLocalMap inherits WeakReference, and the Key of Entry is the encapsulation of WeakReference, in other words Key is a weak reference, weak reference will be recycled after the next GC, if ThreadLocal does not carry out the subsequent operations after the set, because the GC will be Key If ThreadLocal does not perform subsequent operations after set, because GC will clear the Key, but the Value will never be reclaimed because the thread is still alive, and eventually a memory leak will occur.

    @GetMapping(value = "/memory/leak")
    public String leak() {
        ThreadLocal<Byte[]> localVariable = new ThreadLocal<Byte[]>();
        localVariable.set(new Byte[4096 * 1024]);
        return "ok";

We put a heap memory size limit on startup and also set it to output a snapshot of the stack and output a log when memory overflows.

java -jar -Xms500m -Xmx500m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Xloggc:/tmp/heaplog.log analysis-demo-0.0.1-SNAPSHOT.jar

After successful startup, we looped through 100 times, for i in {1..500}; do curl localhost:8080/memory/leak;done ,and before the execution was complete, the system has returned a 500 error. The following exception appears in the system log:

java.lang.OutOfMemoryError: Java heap space

Let’s take a look at the GC of the program using the jstat -gc pid command.

Obviously, the memory is overflowing. The heap memory has not been freed after 45 Full Gc, which means that the objects in the current heap memory are alive, have GC Roots references, and can’t be reclaimed. What is the cause of the memory overflow? Do I just need to increase the memory? If it’s a normal memory overflow, maybe it’s enough to expand the memory, but if it’s a memory leak, the expanded memory will be filled up in no time, so we still need to determine whether it’s a memory leak or not. We have saved a heap dump file, so we can analyze it with the help of our MAT tool. To import the tool, select Leak Suspects Report and the tool will give you a report of the problem directly.

The suspected 4 memory leaks have been listed here, let’s click on one of them to see the details.

It has been pointed out that the memory is occupied by the thread close to 50M memory, the occupied object is ThreadLocal. If you want to analyze it manually in detail, you can click on Histogram ,to see who the largest object occupation is, and then analyze its reference relationship, you can determine who caused the memory overflow.

The above figure shows that the object that occupies the most memory is a Byte array, let’s see what GC Root it is referenced by that has not been reclaimed. Following the guidelines in the red box above, the result is shown below:

We find that the Byte array is referenced by the Thread object. The diagram also indicates that the GC Root of the Byte array counterpart is a thread, so it is not reclaimed. Expanding the details to view the details, we find that the final memory occupation object is occupied by the ThreadLocal object. This is also consistent with the results that the MAT tool automatically analyzes for us.

Deadlock will lead to exhaustion of thread resources, memory occupation, the performance of memory occupation is elevated, the CPU does not necessarily soar (depending on the scenario to decide), if it is a direct new thread, it will lead to the JVM memory is exhausted, reported that the inability to create thread error, which also reflects the benefits of using the thread pool.

 ExecutorService service = new ThreadPoolExecutor(4, 10,
            0, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>(1024),
            new ThreadPoolExecutor.AbortPolicy());

    public String testCPU() throws InterruptedException {
        Object lock1 = new Object();
        Object lock2 = new Object();
        service.submit(new DeadLockThread(lock1, lock2), "deadLookThread-" + new Random().nextInt());
        service.submit(new DeadLockThread(lock2, lock1), "deadLookThread-" + new Random().nextInt());
        return "ok";

public class DeadLockThread implements Runnable {
    private Object lock1;
    private Object lock2;

    public DeadLockThread(Object lock1, Object lock2) {
        this.lock1 = lock1;
        this.lock2 = lock2;

    public void run() {
        synchronized (lock2) {
            System.out.println(Thread.currentThread().getName()+"get lock2 and wait lock1");
            try {
            } catch (InterruptedException e) {
            synchronized (lock1) {
                System.out.println(Thread.currentThread().getName()+"get lock1 and lock2 ");

We loop request interface 2000 times, found that not long after the system log error, the thread pool and queue are full, because I chose when the queue is full to reject the strategy, so the system directly throws an exception.

java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@2760298 rejected from java.util.concurrent.ThreadPoolExecutor@7ea7cd51[Running, pool size = 10, active threads = 10, queued tasks = 1024, completed tasks = 846]

Through the ps -ef|grep java command to find out the Java process pid, the implementation of jstack pid can appear java thread stack information, here found five deadlocks, we only list one of them, it is clear that the thread pool-1-thread-2 locked 0x00000000f8387d88 waiting for the 0x00000000f8387d98 lock, thread pool-1-thread-1 locked 0x00000000f8387d98 waiting for the lock 0x00000000f8387d88 ,which produces a deadlock.

Java stack information for the threads listed above:
        - waiting to lock <0x00000000f8387d98> (a java.lang.Object)
        - locked <0x00000000f8387d88> (a java.lang.Object)
        at java.util.concurrent.Executors$
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
        at java.util.concurrent.ThreadPoolExecutor$
        - waiting to lock <0x00000000f8387d88> (a java.lang.Object)
        - locked <0x00000000f8387d98> (a java.lang.Object)
        at java.util.concurrent.Executors$
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
        at java.util.concurrent.ThreadPoolExecutor$
 Found 5 deadlocks.

  Frequent thread switching

Context switching can result in a significant amount of CPU time being wasted on saving and restoring registers, kernel stacks, and virtual memory, leading to a decrease in overall system performance. When you notice a significant drop in system performance, you need to consider whether a large number of thread context switches are occurring.

 @GetMapping(value = "/thread/swap")
    public String theadSwap(int num) {
        for (int i = 0; i < num; i++) {
            new Thread(new ThreadSwap1(new AtomicInteger(0)),"thread-swap"+i).start();
        return "ok";
public class ThreadSwap1 implements Runnable {
    private AtomicInteger integer;

    public ThreadSwap1(AtomicInteger integer) {
        this.integer = integer;

    public void run() {
        while (true) {

Here I create multiple threads to perform the basic atomic +1 operation, and then give up CPU resources, the theory is that the CPU will go to schedule other threads, we request the interface to create 100 threads to see how the effect curl localhost:8080/thread/swap?num=100 . After the interface request is successful, we execute `vmstat 1 10, which means to print every 1 second, print 10 times, thread switching collection results are as follows:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
101  0 128000 878384    908 468684    0    0     0     0 4071 8110498 14 86  0  0  0
100  0 128000 878384    908 468684    0    0     0     0 4065 8312463 15 85  0  0  0
100  0 128000 878384    908 468684    0    0     0     0 4107 8207718 14 87  0  0  0
100  0 128000 878384    908 468684    0    0     0     0 4083 8410174 14 86  0  0  0
100  0 128000 878384    908 468684    0    0     0     0 4083 8264377 14 86  0  0  0
100  0 128000 878384    908 468688    0    0     0   108 4182 8346826 14 86  0  0  0

Here we focus on 4 indicators, r , cs , us , sy .

r=100,indicating that the number of waiting processes is 100 and the thread is blocking.

cs=8+ million, indicating that the context switched over 8 million times per second, which is quite a large number.

us=14, indicating that the user state is using 14% of the CPU time slice to process the logic.

sy=86, indicating that the kernel state is taking up 86% of the CPU, which is obviously doing the context switching work here.

We viewed the process and thread CPU via the top command as well as top -Hp pid , and found that the Java thread CPU was full, but the thread CPU usage was very even, and there was no situation where a particular thread ate up all the CPU.

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                            
 87093 root      20   0 4194788 299056  13252 S 399.7 16.1  65:34.67 java 
 PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                             
 87189 root      20   0 4194788 299056  13252 R  4.7 16.1   0:41.11 java                                                                                
 87129 root      20   0 4194788 299056  13252 R  4.3 16.1   0:41.14 java                                                                                
 87130 root      20   0 4194788 299056  13252 R  4.3 16.1   0:40.51 java                                                                                
 87133 root      20   0 4194788 299056  13252 R  4.3 16.1   0:40.59 java                                                                                
 87134 root      20   0 4194788 299056  13252 R  4.3 16.1   0:40.95 java 

Combined with the above user-state CPU used only 14%, kernel-state CPU occupied 86%, can basically determine that the Java program thread context switching caused performance problems.

We use the pidstat command to look at the thread switching data inside the Java process. Execute pidstat -p 87093 -w 1 10 ,and collect the data as follows:

11:04:30 PM   UID       TGID       TID   cswch/s nvcswch/s  Command
11:04:30 PM     0         -     87128      0.00     16.07  |__java
11:04:30 PM     0         -     87129      0.00     15.60  |__java
11:04:30 PM     0         -     87130      0.00     15.54  |__java
11:04:30 PM     0         -     87131      0.00     15.60  |__java
11:04:30 PM     0         -     87132      0.00     15.43  |__java
11:04:30 PM     0         -     87133      0.00     16.02  |__java
11:04:30 PM     0         -     87134      0.00     15.66  |__java
11:04:30 PM     0         -     87135      0.00     15.23  |__java
11:04:30 PM     0         -     87136      0.00     15.33  |__java
11:04:30 PM     0         -     87137      0.00     16.04  |__java

Based on the information collected above, we know that Java threads switch about 15 times per second, which should be a single digit or a decimal under normal circumstances. Combining this information we can conclude that too many Java threads are turned on, leading to frequent context switching, which affects the overall performance.

Why is it that the system’s context switches are more than 8 million per second, while a particular thread in a Java process switches only about 15 times?

 System context switching is categorized into three cases: the

1, multitasking: in a multitasking environment, a process is switched out of the CPU to run another process, where a context switch occurs.

2, interrupt processing: when an interrupt occurs, the hardware switches the context. In the vmstat command is in

3, user and kernel mode switching: when the operating system needs to be in the user mode and kernel mode conversion between the need for context switching, such as system function calls.

Linux maintains a ready queue for each CPU, which sorts the active processes by priority and wait time for the CPU, and then selects the process that needs the CPU the most, i.e., the one with the highest priority and the longest wait time for the CPU, to run. This is also known as r in the vmstat command.

So, at what point does the process get scheduled to run on the CPU?

  • When the process is terminated, the CPU it was using is released, and a new process is taken from the ready queue to run.

  • In order to ensure that all processes can be scheduled fairly, CPU time is divided into time slices, which are allocated to each process in turn. When a process runs out of time slice, it will be hung by the system and switched to other processes waiting for CPU.

  • Processes wait for resources to be met before they can run when system resources are insufficient, at which point the process also hangs and is scheduled by the system for other processes to run.

  • Processes are also rescheduled when they are actively hung via the sleep function sleep.

  • When a process with a higher priority is running, the current process is hung and run by the higher priority process in order to keep the higher priority process running.

  • When a hardware interrupt occurs, the process on the CPU hangs by the interrupt and executes the interrupt service program in the kernel instead.

Combined with our previous content analysis, the blocking ready queue is about 100, and our CPU is only 4 cores, this part of the reason for the context switching can be quite high, plus the number of interrupts is about 4000 and the system’s function calls and so on, the whole system’s context switching to the 8 million is not surprising.Java’s internal thread switching is only 15 times, is because of the thread to use Thread.yield() to give up CPU resources, but it is possible for the CPU to continue scheduling the thread, and there is no switching between threads at this time, which is why the number of times a particular thread switches internally is not very large.

By lzz

Leave a Reply

Your email address will not be published. Required fields are marked *