A change was recently submitted to the ART mainline to improve the performance of JNI calls. It can benefit the vast majority (85% to 90%) of an app’s native methods.

The whole development and submission took several months, and the process was quite bumpy. The main reason for writing this article is to record the mistakes made and experience gained along the way, and to leave some references for myself and others.

The change has now been merged and will ship in the official Android 15 release. It touches 20 files and changes 1,155 lines of code.

In the middle of last year I wrote an article called ART Virtual Machine | A Brief History of JNI Optimization. To write it, I read quite a bit of the machine code generated for native methods, and that was when I noticed that some of it was identical (in fact, the generated JNI springboard function depends only on the parameter types and flags). Identical code can be shared, so later, while studying the JIT code, one question kept coming back to me: if two native methods can share the same JNI stub (the springboard function is also called a stub or trampoline), then when the hotness_count of one of them drops to 0 and triggers JIT compilation, could the other method use the same compiled machine code?

The existing mechanism cannot do this, so I asked the Google ART engineers what they thought. One of them said the idea was feasible but would require some new data structures and add some overhead. Another pointed out that the boot images contain a lot of native methods that are already compiled, so we could think about how to reuse those. He then added: I’ve had this idea for a long time, but never had the time to work on it (for Android 15 they put most of their effort into RISC-V support). If anyone wants to take it on, they’re more than welcome to.

This is indeed a great idea, but should I take the job? After all, I had only submitted some bug fixes and small changes to the mainline before, and I had no experience with this kind of systematic development inside a virtual machine. After thinking it over, I decided to give it a try: even if I failed, I would still deepen my understanding of ART. Of course, it would be better if I succeeded.

 Finalizing the design

For ART, turning any idea into shipped code is a big deal, so care is essential. At a high level, this feature raised three groups of questions:

  1. Boot images need to be loaded when zygote starts up, so should the boot JNI stub information live in the .oat file or the .art file? Where exactly in the file, and with what structure? And how are cross-references between files fixed up after loading?

  2. What data structure lets us quickly determine whether two native methods can share a JNI stub? In addition, the criteria for sharing differ between architectures. Requiring identical parameter types is only the strictest criterion; can we relax it so that more native methods benefit? That requires understanding the machine code generation for each architecture and then tuning the rules accordingly.

  3. How are an app’s native methods loaded, and when do their entrypoints change? When should this optimization kick in so that it does not conflict with existing mechanisms (e.g. JIT, AOT, deoptimization, intrinsics)?
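
For point 2, the basic idea can be sketched as a lookup key built from the two inputs mentioned earlier: the method's flags and its parameter types. This is only a minimal sketch under stated assumptions. `StubKey`, `SameStub`, `HashStub` and `kStaticFlag` are hypothetical names chosen for illustration, not ART's actual types, and the parameter types are written as a dex shorty (return type first, every reference type collapsed to `L`). It shows only the strictest rule, exact equality, and none of the relaxed per-architecture rules.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical key: the two inputs that decide what a JNI stub looks like.
struct StubKey {
  std::uint32_t flags;      // method flags relevant to stub generation (sample values only)
  std::string_view shorty;  // dex shorty, return type first, e.g. "ILI"
};

// Strictest sharing rule: same flags and exactly the same shorty.
constexpr bool SameStub(const StubKey& a, const StubKey& b) {
  return a.flags == b.flags && a.shorty == b.shorty;
}

// FNV-1a hash over the flags and the shorty. Keys that compare equal under
// SameStub always produce the same hash, which any hash map requires.
constexpr std::size_t HashStub(const StubKey& key) {
  std::uint64_t h = 14695981039346656037ull;          // FNV offset basis
  constexpr std::uint64_t kPrime = 1099511628211ull;  // FNV prime
  for (int shift = 0; shift < 32; shift += 8) {
    h = (h ^ ((key.flags >> shift) & 0xffu)) * kPrime;
  }
  for (char c : key.shorty) {
    h = (h ^ static_cast<unsigned char>(c)) * kPrime;
  }
  return static_cast<std::size_t>(h);
}

// Sample flag value (0x0008 is ACC_STATIC in the dex format).
inline constexpr std::uint32_t kStaticFlag = 0x0008;

// Two different static methods, e.g. int f(Object, int) and int g(String, int),
// both have the shorty "ILI" and therefore map to the same key.
static_assert(SameStub({kStaticFlag, "ILI"}, {kStaticFlag, "ILI"}));
static_assert(!SameStub({kStaticFlag, "ILI"}, {kStaticFlag, "ILJ"}));
```

Relaxing the rule would mean replacing `SameStub` with a looser comparison, in which case `HashStub` has to be loosened in exactly the same way so that equal keys still hash identically.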

Once the general direction was settled, we moved on to discussing specific solutions. The Google engineers gave thorough guidance throughout the discussion phase; without their help this work would not have been completed (special thanks to Vladimir Marko, Santiago Aboy Solanes, Mythri Alle and Nicolas Geoffray). Looking back afterwards, there were more than 160 discussions about the design and related issues over the whole development process, not counting review comments.

With the design settled, the next step was the coding phase.

 Downloading the code

Before coding, you need a copy of the code. I don’t mean downloading the full AOSP source tree in the usual sense, but downloading the code for the standalone ART module.

Back in Android 10, Android introduced the APEX mechanism in an attempt to allow system modules to be installed and updated like apps, and ART was one of them. This mechanism greatly simplified the development process of ART.

Traditionally, we had to download the entire AOSP source tree, keep it up to date, and find a hardware device it supports for development and testing. With the APEX mechanism that is no longer necessary: we can follow art/build/README.md to download the code for just this module. When the development work is done, we can verify and test it on any Android 10 to 15 device (userdebug and eng builds both work). For example, the following two commands make ART changes take effect without flashing the device, just like installing an app.

adb install out/dist/com.android.art.apex
adb reboot

Having said that, I took a wrong turn during the initial development. Because of APEX compatibility problems on domestic devices, I initially developed only against an Android 14 release, whose ART code was not up to date. It was not until I had some spare time that I tried downloading, building and installing the standalone APEX module and found the process extremely smooth. Only then did I realize how much effort Google has put into making development convenient.

 Configuring the environment

Since I was developing remotely on Linux, I didn’t try graphical IDEs such as VSCode or Android Studio and used Vim instead. Without any configuration it was so hard to use that I only found spelling mistakes at compile time. After getting the first version out, I realized this was far too inefficient.

So I spent some time configuring Vim (Neovim). The process is fun and gives you the feeling that everything is under your control, and it is easy to modify a plugin’s code when its behavior doesn’t match your habits. I also set up tmux, lazygit and fzf, which made development much more efficient. The last piece was the terminal: I tried quite a few, but most of them don’t support Nerd Fonts well. In the end the default Windows terminal met my needs perfectly, so I gave up PuTTY and other conventional SSH tools. The whole exercise also drove home Occam’s razor for me: entities should not be multiplied beyond necessity.

Over a period of time, I have found this environment to be easy to use and basically indistinguishable from a modern IDE.

I hit far too many pitfalls during development: I would often write code full of confidence, only to be proven wrong as soon as it ran. In short, my understanding of ART was not comprehensive enough. Because this feature touches many mechanisms, a change to A often involves B, and sometimes there is a C that I don’t even understand, so most of the development time went into fixing bugs. Even more troublesome, virtual machine bugs tend to propagate: the logs you see may have nothing to do with the root cause, because the failure has been passed along through several mechanisms before it surfaces. In such cases, you have to keep tweaking the code to probe the possible directions.

The biggest lesson from the whole process was that I wrote the test code too late. Out of habit, I did the development first and filled in the tests once development was nearly done. In retrospect, this was a complete mistake.

At that point I needed to optimize the hash and equal strategies for different architectures (mainly arm64 and x86_64; I’ll add riscv later) so that more methods would be covered by the feature. Adjusting the strategies was painful: any small change or careless mistake could make methods that must not share a JNI stub share one anyway, causing all kinds of weird, hard-to-debug problems.

It was only when I started writing the tests that it dawned on me: had I prepared them earlier, such problems would have been exposed far more easily, and the carefully printed machine code would have let me see the differences in the assembly at a glance instead of agonizing over strange logs.

ART’s testing framework is very complete and comes in two kinds. One dissects ART at the native (C++) level using the GTest framework. The other treats ART as a whole and runs Java or Smali code on it to exercise various features of the virtual machine; these are called run-tests.

The tests for this feature are mainly GTests, which check that the sharing policy and the generated machine code agree. Specifically, if the policy says two methods can share a stub, the machine code generated for them must be identical; if it says they cannot, the generated machine code must differ.
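
The invariant described above can be sketched as a pairwise check. This is a toy illustration, not the actual ART GTest: `PolicyMatchesCodegen`, `ToyCompile`, `ExactShortyMatch` and `SameLengthOnly` are hypothetical names, a "method" is reduced to its shorty string, and the "machine code" is just a string derived from it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns true when the sharing policy and the code generator agree for every
// pair of methods: "may share" must coincide with "identical machine code".
template <typename Method, typename MayShare, typename Compile>
bool PolicyMatchesCodegen(const std::vector<Method>& methods,
                          MayShare may_share,
                          Compile compile) {
  for (std::size_t i = 0; i < methods.size(); ++i) {
    for (std::size_t j = i + 1; j < methods.size(); ++j) {
      const bool shared = may_share(methods[i], methods[j]);
      const bool identical = compile(methods[i]) == compile(methods[j]);
      if (shared != identical) {
        return false;  // Policy too loose (wrong sharing) or too strict (missed sharing).
      }
    }
  }
  return true;
}

// Toy stand-in for the stub compiler: the output depends only on the shorty.
inline std::string ToyCompile(const std::string& shorty) {
  return "stub<" + shorty + ">";
}

// Strict policy: share only when the shorties are identical.
inline bool ExactShortyMatch(const std::string& a, const std::string& b) {
  return a == b;
}

// Deliberately sloppy policy: compares only the shorty length.
inline bool SameLengthOnly(const std::string& a, const std::string& b) {
  return a.size() == b.size();
}

// Usage: the strict policy agrees with the toy compiler, the sloppy one does not
// ("VJ" and "DF" have the same length but compile to different code).
inline void CheckToyPolicies() {
  const std::vector<std::string> methods = {"ILI", "ILI", "VJ", "DF"};
  assert(PolicyMatchesCodegen(methods, ExactShortyMatch, ToyCompile));
  assert(!PolicyMatchesCodegen(methods, SameLengthOnly, ToyCompile));
}
```

A check of this shape catches both directions of error: methods that wrongly share a stub, and methods that could have shared one but were kept apart.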

Although I didn’t write any new run-tests, the thousands of run-tests that already exist in ART can still be affected by this change. The deoptimize and jit test cases, for example, were affected.

So which machines do the tests actually run on? Since this change is architecture-dependent, I needed to test on as many different machines as possible. First is host testing: my Linux host is x86_64, which can of course also run x86. Host testing is generally the most convenient and produces the most complete logs, so it is the first choice. Next is on-target testing: find an arm64 phone that can also run arm code and follow “ART Chroot-Based On-Device Testing” to test on the phone without affecting the existing system. Specifically, it uses chroot to install the ART module under test in the data directory, so it does not interfere with the system already running on the phone.

This setup covers four architectures: x86, x86_64, arm and arm64. The remaining one, riscv, had to be skipped locally for lack of actual hardware and was left to Google’s test servers after submission.

Of course, in addition to the automated testing framework, installing new ART modules and testing them manually is a necessary part of the process.

 Measuring the effect

Once functional validation passed, the next stage was to measure the effect, which needs data to back it up. Ideally there would have been a ready-made benchmark to use, but there wasn’t one. After talking with Google’s engineers, I decided to measure four aspects:

  1. How broad is the coverage, i.e. what percentage of native methods can benefit?
  2. How much can the call time be reduced?
  3. Does it affect subsequent AOT compilation time?
  4. Does it affect the size of the odex files generated afterwards?

First, coverage. Testing the top 10 domestic apps (rankings vary) showed coverage basically stable between 85% and 90%, which shows that the vast majority of native methods can find a usable JNI stub in the boot images.

Next, call time, which can be looked at at the micro or the macro level. The micro level times JNI calls alone to see how much this optimization improves them. The macro level picks an everyday scenario and looks at the overall change in time. Since JNI methods have a variety of parameter types, we chose a simple addOne(Object, int) method for the basic measurement (we also measured a few methods with more complex parameters, and the difference from the simple method was not large). The time for 50,000 calls dropped from 3919.2μs to 1065.3μs, a 72.8% reduction. Expressed the other way round, calls became about 3.68 times as fast, an improvement of roughly 267%.

For the macro level I chose an app’s first launch, and the measured improvement was not significant, probably because JNI calls already account for a very small share of overall startup time. In addition, the startup time reported by am start fluctuates from run to run, so even a slight gain would be buried in the noise. In short, the gains were minimal.

Let me mention a mistake I made here. Originally I wanted to total the time of all JNI stub calls during startup, so I instrumented both the compiled JNI stubs and the generic JNI stub, logged every stub call, and then aggregated the time by thread and process. This turned out not to be feasible at all: the system call that reads the time is itself expensive, even more so than a single stub call. The act of observing significantly distorted what was being observed, making the results unreliable.
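
One common way around this problem is to read the clock only twice, around a whole batch of calls, which is also how a figure like "time for 50,000 calls" is naturally obtained. The following is a minimal sketch under that assumption, not the measurement code actually used here. `TimeBatchNanos` and `AverageNanosPerCall` are hypothetical helpers built on the standard `std::chrono::steady_clock`.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Times `iterations` calls of `fn` with only two clock reads in total, so the
// cost of reading the clock is spread over the whole batch instead of being
// added to every single call.
template <typename Fn>
std::int64_t TimeBatchNanos(Fn&& fn, int iterations) {
  assert(iterations > 0);
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    fn();
  }
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}

// Average cost of one call, derived from the batch total.
inline double AverageNanosPerCall(std::int64_t total_nanos, int iterations) {
  return static_cast<double>(total_nanos) / iterations;
}
```

Applied to the micro measurement above, 3919.2μs for 50,000 calls works out to about 78.4 ns per call and 1065.3μs to about 21.3 ns per call, which gives a sense of how easily a per-call clock read can dominate the thing being measured.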

Third, the impact on AOT compilation time. Measured on two apps, compilation time improved by roughly 1% to 2%.

Finally, odex file size. File size is stable and easy to measure, but the improvement was very small. The reason is that the odex file already deduplicates: when different methods produce the same machine code, only one copy is kept in the file. Compilation time, however, still grows with the number of methods.

Overall, this optimization is a significant improvement at the micro level but a weak one at the macro level. Scenarios that are heavy on JNI calls may see more noticeable gains.

 Tidying the code and improving performance

With measured data to back it up, the next step was a formal submission to AOSP. Just as I was happily waiting for the change to be reviewed and approved, Google’s engineers replied with more than 30 comments at once, along with the following:

I have previously focused only on correctness but, as I’m reviewing the code this time, I’ll have some suggestions related to performance and making the code tidy, such as not putting too much code in a header file, splitting and renaming things.

So merging was not going to be easy. Getting the functionality right is only the first step; how the code is written matters just as much. There are ART’s internal conventions, considerations of future extension and maintenance, and of course the performance implications of different ways of writing the same thing. Here is a simple example:

The code that chooses an equal strategy per architecture changed from the original switch-case to a template function. Because the isa passed as a template argument is known at compile time, the maximum number of float registers for that isa can also be declared constexpr, i.e. a compile-time constant. This removes the need to branch on isa at runtime and so makes the code more efficient.
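
The shape of this change can be sketched as follows. This is an illustration rather than the code that was merged: the `InstructionSet` enum here is a cut-down stand-in, all function names are hypothetical, and the register counts are those of the standard native calling conventions (ART's own constants may be named and organized differently). `kX86` is included only to show a differing value.

```cpp
#include <cstddef>
#include <string_view>

// Stand-in for the instruction-set enum; only the values needed here.
enum class InstructionSet { kArm64, kX86, kX86_64 };

// Before: the limit is looked up with a switch every time it is needed.
inline std::size_t MaxFloatArgRegsSwitch(InstructionSet isa) {
  switch (isa) {
    case InstructionSet::kArm64:  return 8;  // v0-v7 (AAPCS64)
    case InstructionSet::kX86_64: return 8;  // xmm0-xmm7 (System V AMD64)
    case InstructionSet::kX86:    return 0;  // i386 System V passes FP arguments on the stack
  }
  return 0;
}

// After: the ISA is a template parameter, so the limit is a compile-time constant.
template <InstructionSet kIsa>
constexpr std::size_t MaxFloatArgRegs() {
  if constexpr (kIsa == InstructionSet::kArm64 || kIsa == InstructionSet::kX86_64) {
    return 8;
  } else {
    return 0;
  }
}

// Example consumer: do all float/double parameters of a shorty fit in registers?
// shorty[0] is the return type, so counting starts at index 1.
template <InstructionSet kIsa>
constexpr bool FloatArgsFitInRegisters(std::string_view shorty) {
  constexpr std::size_t kMax = MaxFloatArgRegs<kIsa>();  // no runtime branch on the ISA
  std::size_t count = 0;
  for (std::size_t i = 1; i < shorty.size(); ++i) {
    if (shorty[i] == 'F' || shorty[i] == 'D') {
      ++count;
    }
  }
  return count <= kMax;
}

// The runtime ISA value is inspected once, at the outermost call.
constexpr bool FloatArgsFitFor(InstructionSet isa, std::string_view shorty) {
  switch (isa) {
    case InstructionSet::kArm64:
      return FloatArgsFitInRegisters<InstructionSet::kArm64>(shorty);
    case InstructionSet::kX86_64:
      return FloatArgsFitInRegisters<InstructionSet::kX86_64>(shorty);
    case InstructionSet::kX86:
      return FloatArgsFitInRegisters<InstructionSet::kX86>(shorty);
  }
  return false;
}
```

The design point is that the single remaining switch sits at the boundary where the ISA is still a runtime value; everything below it is instantiated per ISA and sees the register limit as a constant.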

In short, the feedback from Google’s engineers was careful and specific, which I think is an important factor in keeping the quality of ART’s code high.

Once everything was ready, I submitted again. The submission process is worth describing here, because some of its details are not publicly documented.

According to the official docs, we have to repo sync before uploading, because many new changes land on the mainline during development and some of them may conflict with ours. So before uploading we must pull the latest changes and rebase to make sure there are no conflicts.

Changes are uploaded with the repo upload command, and a successful upload returns a link on android-review.googlesource.com. That page is where we interact with reviewers to collect the various +1s and +2s. Besides these manual votes, the page also has two bots.

One is called Lint and the other Treehugger.

The dictionary definition of lint is:

small loose pieces of cotton, wool, etc. that stick on the surface of a fabric, etc.

Its main role is to check the code for spelling and formatting errors at the text level, including the license text in the file header. Lint runs whenever a new patch set is uploaded. If it finds no errors, both the Lint entry and the Open-Source-Licensing entry are set to +1.

The dictionary definition of treehugger is:

an environmental campaigner (used in reference to the practice of embracing a tree in an attempt to prevent it from being felled).

Its main purpose is to run automated tests to check that the code works correctly. If a problem is detected, the Presubmit-Verified entry is set to -1 or -2 to keep problematic code out of the main repository. If there is no problem, Presubmit-Verified is set to +2. Treehugger is usually started by a Google engineer after they give +2, because a Presubmit-Verified +2 expires after two working days, so it is normally the last step before merging (we can also trigger it ourselves via auto-submit, but as noted, the result expires).

Recently Google added a Performance bot that detects performance regressions. It is usually started together with Treehugger, and when it finds no problems, the Performance entry is set to +1 or +2.

Once all the submit requirements are satisfied, the code is ready to be merged.

But being merged does not mean everything is fine, because Google also has a continuous integration system called LUCI.

LUCI: the Layered Universal Continuous Integration system

For ART, we can check this website to see the test results for each change. After each change is merged, it runs richer tests across multiple architectures and platforms, and Google’s engineers are notified if a test fails. If they confirm that our change is the cause, they submit a revert that undoes our commit. Unfortunately, this feature was also reverted once, because the default local test configuration did not cover the debuggable option.

Of course, there is no need to be discouraged by a revert; it is indirect evidence of how robust the ART mainline is kept. After carefully reading the error messages from the LUCI bots, I finally realized that this feature conflicted with the existing debugging mechanism. Resolving the conflict took some more time spent learning how debugging and deoptimization work. So rather than development work, this was an opportunity to learn more.

Relatively elegant code can only be designed once the principles are fully understood. Otherwise, if you only treat the symptoms, the code will run into more problems as it evolves. Once the conflict was resolved, the change could be submitted again, which is known as a reland.

At this point, the optimization had finally been merged. But the story doesn’t end here; larger-scale testing and real-world use still await it.

By hbb
