Binstock on Software: 2021

Friday, December 17, 2021

How the Jacobin JVM Accesses Methods

Executing methods is the principal activity of the JVM. There are many steps involved in finding, loading, and executing methods correctly. The Jacobin JVM uses a variety of techniques to accelerate this process, as described here. (To follow, you need to know just a little Java.)

Methods in class files

Java methods are stored in class files in a section where various kinds of class attributes are located. Each method contains instructions in the form of Java bytecodes. It also contains a series of attributes that provide additional execution information (such as data for handling exceptions, debugging info, etc.). Methods are stored by name and type, which are represented by indexes into an area of the class file called the constant pool. Those indexes ultimately point to strings in UTF-8 format (actually, a Java-specific variant of UTF-8). A typical example looks like this:

java/io/PrintStream.println:(I)V

This shows the usual println() method that prints an integer to the console. Note that the name of the class precedes the method name. In the class name, the usual dots are replaced by forward slashes. The single dot marks the start of the method name, which is followed by a colon and the method signature. The part in parentheses indicates the parameter type (I=integer) and the V after the closing parenthesis indicates the return type, which here is void (V=void). Note to Java nerds: the signature of a method typically does not include the return type. It's specified here so that the JVM knows what to expect as a return value.
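To make the format concrete, here is a minimal Go sketch that splits such a string into its three parts. The MethodRef type and the splitMethodRef function are hypothetical names invented for this illustration; they are not taken from the Jacobin source.

```go
package main

import (
	"fmt"
	"strings"
)

// MethodRef holds the three parts of a fully qualified method reference.
// The type and its field names are hypothetical, for illustration only.
type MethodRef struct {
	Class      string // e.g. "java/io/PrintStream"
	Name       string // e.g. "println"
	Descriptor string // e.g. "(I)V"
}

// splitMethodRef splits a string of the form class.method:descriptor.
// Class names use slashes rather than dots, so the last dot before the
// colon marks the start of the method name.
func splitMethodRef(s string) (MethodRef, error) {
	colon := strings.Index(s, ":")
	if colon < 0 {
		return MethodRef{}, fmt.Errorf("missing ':' in %q", s)
	}
	dot := strings.LastIndex(s[:colon], ".")
	if dot < 0 {
		return MethodRef{}, fmt.Errorf("missing '.' in %q", s)
	}
	return MethodRef{
		Class:      s[:dot],
		Name:       s[dot+1 : colon],
		Descriptor: s[colon+1:],
	}, nil
}

func main() {
	ref, err := splitMethodRef("java/io/PrintStream.println:(I)V")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("class:     ", ref.Class)
	fmt.Println("method:    ", ref.Name)
	fmt.Println("descriptor:", ref.Descriptor)
}
```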

Extracting methods for use by the JVM

The classloader is a JVM subsystem that locates classes needed by the application, parses them, and places (or loads) the parsed data into an important area of the JVM called the method area. The method area, despite its name, holds entire classes. When an app requires a method, the JVM looks into the method area and determines whether the method's class has been loaded. If not, it asks the classloader subsystem to locate and load the class into the method area. Once the class is there, the JVM looks through all the methods, resolves the name and signature strings for each of them, and checks whether they match the method being looked for. When a match is found, the bytecodes are loaded and executed. (If the method is not found, a runtime error results.)
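The search just described can be sketched in Go as follows. This is a simplified illustration, not Jacobin's code: the Class and Method types and the findMethod function are hypothetical, and the constant pool is reduced to a plain slice of strings, whereas a real constant pool holds many kinds of entries.

```go
package main

import "fmt"

// Method is a simplified parsed method. As in a class file, the name and
// type are stored as indexes into the constant pool, not as strings.
type Method struct {
	NameIndex int
	DescIndex int
	Code      []byte
}

// Class is a simplified loaded class (hypothetical layout).
type Class struct {
	Name         string
	ConstantPool []string
	Methods      []Method
}

// findMethod walks every method in the class, resolves its name and
// descriptor through the constant pool, and compares them to the target.
func findMethod(c *Class, name, desc string) (*Method, bool) {
	for i := range c.Methods {
		m := &c.Methods[i]
		if c.ConstantPool[m.NameIndex] == name && c.ConstantPool[m.DescIndex] == desc {
			return m, true
		}
	}
	return nil, false
}

func main() {
	c := &Class{
		Name:         "Hello",
		ConstantPool: []string{"<init>", "()V", "main", "([Ljava/lang/String;)V"},
		Methods: []Method{
			{NameIndex: 0, DescIndex: 1, Code: []byte{0xb1}}, // 0xb1 is the return bytecode
			{NameIndex: 2, DescIndex: 3, Code: []byte{0xb1}},
		},
	}
	if m, ok := findMethod(c, "main", "([Ljava/lang/String;)V"); ok {
		fmt.Printf("found main with %d byte(s) of bytecode\n", len(m.Code))
	}
	if _, ok := findMethod(c, "missing", "()V"); !ok {
		fmt.Println("method not found: a real JVM would raise a runtime error here")
	}
}
```

Every call repeats the string comparisons for each method in the class, which is why the cost grows with the number of methods.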

This search can be extremely expensive. For example, the Java standard Class.class in Java 11 has 139 methods--that's potentially a lot of look-ups! To save time, most JVMs, including Jacobin, cache the method data once it's been looked up, so that the search is performed only once.

The Method Table

In Jacobin, the caching is done using a method table (see file MTable.go). When a method is invoked, Jacobin (like many JVMs) first checks the method table to see whether the method has previously been located and loaded. If not, the search described previously is performed. In Jacobin, the method is located and stored in the method table; the look-up in the table is then performed a second time and the result is passed to the calling method.
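The look-up flow can be sketched like this. It is a minimal illustration under assumptions: the MethodTable and MTEntry types, the Fetch method, and the search callback are hypothetical names and do not reproduce the contents of MTable.go. The Searches counter exists only to show how often the slow path runs.

```go
package main

import "fmt"

// MTEntry is a hypothetical cached method; here it holds only bytecode.
type MTEntry struct {
	Code []byte
}

// MethodTable caches methods by their fully qualified name and type.
type MethodTable struct {
	entries  map[string]MTEntry
	search   func(key string) (MTEntry, bool) // the slow search through the class
	Searches int                              // how many slow searches were run
}

func NewMethodTable(search func(string) (MTEntry, bool)) *MethodTable {
	return &MethodTable{entries: make(map[string]MTEntry), search: search}
}

// Fetch checks the table first. On a miss it runs the slow search, stores
// the result, and then performs the table look-up a second time.
func (mt *MethodTable) Fetch(key string) (MTEntry, error) {
	if e, ok := mt.entries[key]; ok {
		return e, nil
	}
	mt.Searches++
	e, ok := mt.search(key)
	if !ok {
		return MTEntry{}, fmt.Errorf("method not found: %s", key)
	}
	mt.entries[key] = e
	return mt.Fetch(key) // second look-up, now guaranteed to hit
}

func main() {
	// Stand-in for the method area: one class with one method.
	classes := map[string]MTEntry{
		"Hello.main:([Ljava/lang/String;)V": {Code: []byte{0xb1}},
	}
	mt := NewMethodTable(func(key string) (MTEntry, bool) {
		e, ok := classes[key]
		return e, ok
	})
	for i := 0; i < 3; i++ {
		if _, err := mt.Fetch("Hello.main:([Ljava/lang/String;)V"); err != nil {
			fmt.Println(err)
		}
	}
	fmt.Println("slow searches performed:", mt.Searches)
}
```

In this sketch, three fetches of the same method trigger only one slow search; the other two are answered from the table.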

Additional Considerations

Thread safety: The method table, like the method area, is a JVM-wide data structure. That is, all executing threads in the JVM can access it. As a result, it's conceivable that two threads would be updating the method table simultaneously. To avoid this problem, the table uses a mutex lock on every update.
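A minimal Go sketch of a lock-protected table follows. The SafeTable type and its methods are hypothetical. Note one assumption that goes beyond the text above: this sketch uses a read-write mutex and also takes a read lock on look-ups, because Go maps are not safe to read while another goroutine is writing to them. Jacobin's actual locking may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// SafeTable is a hypothetical method table shared by all threads.
type SafeTable struct {
	mu      sync.RWMutex
	entries map[string][]byte // key: class.method:descriptor, value: bytecode
}

func NewSafeTable() *SafeTable {
	return &SafeTable{entries: make(map[string][]byte)}
}

// Add takes the write lock, so only one goroutine updates the table at a time.
func (s *SafeTable) Add(key string, code []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.entries[key] = code
}

// Get takes the read lock; many readers may hold it at once.
func (s *SafeTable) Get(key string) ([]byte, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	code, ok := s.entries[key]
	return code, ok
}

// Len reports the number of cached methods.
func (s *SafeTable) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.entries)
}

func main() {
	table := NewSafeTable()
	var wg sync.WaitGroup
	// Eight goroutines update the table concurrently.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			table.Add(fmt.Sprintf("Demo.m%d:()V", n), []byte{0xb1})
		}(i)
	}
	wg.Wait()
	fmt.Println("entries in table:", table.Len())
}
```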

Performance: While developing the many capabilities of a JVM, Jacobin is aiming for acceptable performance. Eventually, though, we'll be working very hard to maximize performance. Some of the techniques we have in our notebooks for future enhancements (some of which are used in other JVMs):

For the main() class and other classes that might appear in the same JAR, loading the methods directly into the method table, rather than waiting for the initial method search to load them.
When a method is loaded into the method table, deleting it from the class entry in the method area. There is no need to have the same data in memory twice. Doing this reduces the memory footprint of the JVM.
When a class's methods are searched for a match (that is, prior to an entry in the method table), if no match is found in the class, then the superclass must be checked. If that fails, then that superclass's superclass is checked and so on up the chain until java.lang.Object is reached at the top of the object hierarchy. A simple optimization is to give every loaded class a complete list of all the superclass methods with pointers to them, so that the JVM does not have to climb the hierarchy in its search, but can tell quickly whether the method exists or not.
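
The last of these ideas can be sketched in Go. This is one possible shape for it, not a planned Jacobin design: the Class layout, the All map, and the findInHierarchy and flatten functions are all hypothetical. The sketch contrasts climbing the superclass chain on every search with a flattened list built once per class.

```go
package main

import "fmt"

// Class is a hypothetical loaded class.
type Class struct {
	Name    string
	Super   *Class            // nil for java/lang/Object
	Methods map[string][]byte // own methods, keyed by name:descriptor
	All     map[string]*Class // flattened list: method key -> defining class
}

// findInHierarchy climbs the superclass chain until the method is found
// or the top of the hierarchy is passed.
func findInHierarchy(c *Class, key string) (*Class, bool) {
	for cur := c; cur != nil; cur = cur.Super {
		if _, ok := cur.Methods[key]; ok {
			return cur, true
		}
	}
	return nil, false
}

// flatten walks the chain once and records, for every visible method, the
// nearest class that defines it, so overrides win over inherited versions.
func flatten(c *Class) {
	c.All = make(map[string]*Class)
	for cur := c; cur != nil; cur = cur.Super {
		for key := range cur.Methods {
			if _, seen := c.All[key]; !seen {
				c.All[key] = cur
			}
		}
	}
}

func main() {
	object := &Class{Name: "java/lang/Object",
		Methods: map[string][]byte{"toString:()Ljava/lang/String;": {0xb0}}}
	animal := &Class{Name: "Animal", Super: object,
		Methods: map[string][]byte{"speak:()V": {0xb1}}}
	dog := &Class{Name: "Dog", Super: animal,
		Methods: map[string][]byte{"speak:()V": {0xb1}}}

	owner, _ := findInHierarchy(dog, "toString:()Ljava/lang/String;")
	fmt.Println("chain walk finds toString in:", owner.Name)

	flatten(dog)
	fmt.Println("flattened list finds speak in:", dog.All["speak:()V"].Name)
	_, exists := dog.All["fly:()V"]
	fmt.Println("fly exists:", exists)
}
```

The trade-off in this sketch is memory for speed: each class carries one extra map entry per visible method, in exchange for a single map look-up instead of a walk up the chain.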

There are surely other optimizations and refinements, which we hope to explore and to include if they lead to better execution.

Tuesday, November 02, 2021

Jacobin JVM project after three months

Development on Jacobin, the JVM written in Go that supports Java 11, has been proceeding rapidly. In the 100 days since the beginning of the project, there have been 314 pushed commits. I'll give more stats below. Here's where we stand:

Jacobin can read, parse, format check, and load class files. This process happens very quickly. For example, running all these steps on one of the largest classes in the JDK distribution, BigDecimal.class, takes just 2ms. When parsed, BigDecimal has 1567 entries in its constant pool, 37 fields, and 167 methods. That's a huge class!

When a class is loaded by Jacobin or any other JVM, it necessarily pulls in other classes to be loaded. For example, all classes run from the command-line have a superclass. Often, that superclass is java.lang.Object, which depends on other classes. Among these are java.lang.Class and java.lang.String; various I/O classes are needed as well. The OpenJDK-based JVMs (essentially, all JVMs except IBM's J9 and some embeddable VMs) address this need by preloading hundreds of widely used classes at JVM start-up. For a look at the list of all the classes loaded just to display the JVM version info, run this from the command line:

java -verbose:class -version

On my Java 11 test system, this command preloads 381 classes (in 347ms!). While Jacobin does not need as many classes loaded to run the specified class, it needs a subset of them. The next step in the project is to identify the required classes and load them quickly. To this end, loading operations (parsing and format checking) will need to be done in parallel. Fortunately, one of the Go language's strengths is a rich set of easy-to-use resources for precisely this kind of concurrent operation.
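
As a rough idea of what parallel loading could look like in Go, here is a sketch using goroutines and a sync.WaitGroup. It is an illustration under assumptions, not Jacobin's loader: the ParsedClass type and the parseAndCheck and loadAll functions are hypothetical, and the only format check performed is the 0xCAFEBABE magic number that begins every class file.

```go
package main

import (
	"fmt"
	"sync"
)

// ParsedClass is a hypothetical stand-in for a fully parsed class.
type ParsedClass struct {
	Name string
	Size int
}

// parseAndCheck is a placeholder for real parsing and format checking.
// It verifies only the class-file magic number, 0xCAFEBABE.
func parseAndCheck(name string, raw []byte) (ParsedClass, error) {
	if len(raw) < 4 || raw[0] != 0xCA || raw[1] != 0xFE || raw[2] != 0xBA || raw[3] != 0xBE {
		return ParsedClass{}, fmt.Errorf("%s: bad magic number", name)
	}
	return ParsedClass{Name: name, Size: len(raw)}, nil
}

// loadAll parses each class in its own goroutine. A mutex guards the
// shared result map and error slice; the WaitGroup waits for all workers.
func loadAll(raw map[string][]byte) (map[string]ParsedClass, []error) {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		loaded = make(map[string]ParsedClass)
		errs   []error
	)
	for name, data := range raw {
		wg.Add(1)
		go func(name string, data []byte) {
			defer wg.Done()
			pc, err := parseAndCheck(name, data) // runs outside the lock
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				errs = append(errs, err)
				return
			}
			loaded[name] = pc
		}(name, data)
	}
	wg.Wait()
	return loaded, errs
}

func main() {
	magic := []byte{0xCA, 0xFE, 0xBA, 0xBE}
	loaded, errs := loadAll(map[string][]byte{
		"java/lang/Object": magic,
		"java/lang/String": magic,
		"Broken":           {0x00, 0x01},
	})
	fmt.Println("classes loaded:", len(loaded))
	fmt.Println("classes rejected:", len(errs))
}
```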

After this task is completed, work will begin on execution.

Testing Thoroughly

One of the principal goals of Jacobin is to be a reliable JVM. This requires disciplined work in planning, development, and testing. Development is based entirely on tasks, which are logged in a cloud instance of JetBrains' excellent tool, YouTrack (graciously provided for free). You can see the presence of this tracking in that every commit on GitHub starts with the corresponding task name. (Presently, the most recent task is JACOBIN-89.) Quality of the code is reviewed by automatic linters on GitHub. Currently, the code merits an A+. The goreport badge on the Jacobin GitHub project takes you to the most recent report.

Testing is done on a near-fanatical basis. Let me explain:

In 2005, I was a contractor with Agitar, a now-shuttered company that made a tool which would read a Java codebase and generate unit tests for missing areas of coverage. It worked great. In conversations with their sales engineers, they told me they used a back-of-the-envelope calculation to assess a company's commitment to testing. They compared the size of the test codebase to the production code. If the test codebase was 50% the size, the company had some commitment to testing. Over 80% was a clear and strong commitment to testing, and over 100% meant a deeply engrained testing culture.

The current code base of Jacobin consists of 8,342 lines (includes: code, comments, blank lines). Of those, 4,718 lines are in tests, which leaves 3,624 lines of production code. That is, the testing codebase is 130.2% the size of the production code (4,718 / 3,624). The goal is to get that ratio even higher. Future quarterly updates will reveal our success in this effort.

Want to help?

It's always great to know a project is interesting to others. If Jacobin interests you and you want to encourage its progress, a GitHub star is our preference. If you want to participate more directly, let me know in the comments, which are kept private. We also love code reviews and suggestions, and later on we'll surely need folks to do testing. Whatever your interest, thanks for your time!

Thursday, August 05, 2021

A Whole New Project: A JVM

Ever since I started out in programming, I've wanted to undertake a programming project that was developed with the rigorous approach used in mission-critical software: write out the requirements; enforce traceability between requirements, code, and tests; and, of course, do rigorous testing.

The main problem has been finding the time to dedicate to such a project. There is a reason that the agile movement eschews this approach: it is the opposite of agility--it relies on an unchanging product definition, relies on extensive documentation, and does not accept the concepts of failing fast and releasing often. It's a whole different mindset to "fail never and release when ready."

In the light of these constraints, the ideal project is one with a well-defined set of specifications. I've decided to meet that need by writing a simplified version of one of my favorite pieces of software: the Java Virtual Machine (JVM).

The specs for much of the JVM are published in detail and updated by the Java team at Oracle with every new release. You can find them here. On the basis of these docs alone, the JVM is the best documented virtual machine in commercial use. There are many additional resources available, such as the excellent articles by Ben Evans and Aleksey Shipilev (both of Red Hat) on how the innards of the JVM work. And, I should add, the source code to the JVM is publicly available.

My project is entitled Jacobin and can be accessed at jacobin.org, which for the time being (and possibly permanently) points to the Jacobin project page on GitHub. There you'll find a detailed write-up of the project status.

Choosing a Language

I have spent the last eight months researching the JVM--reading the docs and articles and doing exploratory coding in various languages with which to write the Jacobin JVM. My requirements for the implementation language are simple enough: it must have decent tools and a viable ecosystem, it must compile to native code on the three major platforms (Windows, Mac, and Linux), and it must have built-in garbage collection (GC). The last requirement is important. The JVM performs garbage collection, but I don't want to write a garbage collector. They are exceedingly difficult tools to write and, especially, to debug. By using a language that does its own GC, a huge amount of work has been removed from the project.

Three languages meet my requirements: Dart, Swift, and Go. I've written several thousand lines of code in the first two and have eliminated them from consideration. Here is why. Dart is a lovely language, but it's slow (even when compiled to binaries), its ecosystem is wanting, and the kind of threading it does is a poor match to the JVM. The problem with the ecosystem is exemplified by the nearly complete absence of books on the language since Dart 2.0 came out a few years ago. Almost all written tutorials are way out of date. Those that are current focus, without exception, on Flutter--the UI toolkit that dominates the use cases for Dart. As a result, it's not easy to learn Dart in depth unless you want to focus primarily on Flutter. The Dart team should really address this. As to the threading model, it is based entirely on single-channel message passing: there is no shared memory. The JVM must perforce share memory between threads and so even if Dart were faster and the docs were up-to-date, it would not meet my needs.

Swift is a truly beautiful language. It's rich in features and has a lot of the type-checking and code safety rules of Rust, but without the endless head-banging that Rust entails. I would have loved to write the JVM in Swift, but it has two drawbacks: it doesn't run on Windows and its libraries are intimately tied to the Mac. Let me clarify. There is an official version of Swift for Windows, but it's maintained entirely by a single engineer at Google. There are effectively no docs for this version and the installation instructions don't work no matter how much tweaking and configuration I have done. The second problem is that while Swift is trying to become a language that works beyond just Apple platforms (for example, it runs fine on Linux), this worthy goal is far from realized, especially when it comes to libraries. Consider that the equivalent of libzip (which is a core library in most languages--it is used to compress/decompress data using the zip format) is maintained by a third party on GitHub in a project that has at present 22 stars. The collections library has at most a handful of basic data structures, etc. Unless I want to write many of these libraries myself--which I have no desire to do--I am forced down the same road as Node developers: grabbing bits of functionality here and there from different contributors, many of which have unknown code quality. The alternative is to use Apple's Cocoa frameworks on the Mac, which would make my project Mac-only. In sum, until Swift grows its non-Mac ecosystem, it's not a viable option for this project--much to my chagrin.

This leaves Go, which is an easy-to-learn language that runs well on the major platforms and has a flourishing set of libraries, many of which are maintained by core Go developers. While it checks all the boxes, it presents its own challenges. For example, it's the only one of the languages that is not object-oriented and the transition from thinking in objects (after all, Java is my home language, so to speak) to using an imperative style of coding requires some rewiring of how I approach problems. In addition, the standard Go tools have weaknesses. For example, the testing framework is minimal--there is nothing like JUnit in terms of range of features. In the language itself, return values for errors and the lack of generics both feel a little crude, especially to someone coming from Java. Nonetheless, it looks like the best option for my project.

There was one other language candidate: Java. That is, write a JVM that runs on the JVM. I don't find this interesting at all. The code for the JVM is currently mostly written in Java and I'll be consulting it frequently--so what would I do then? Cut and paste? Rewrite the code in my preferred style? It's hard to see how that's an advantage.

What's Next?

In the next few months, I'll continue writing requirements and traceability docs and work through various Go books to transition from beginner Gopher to advanced, so that coding can proceed apace, rather than through constant searches. By that time, I should be in a good position to rewrite the 2500 lines of Java-bytecode parsing routines I wrote in Swift, finish that parser, and then begin working on building the execution environment.

In my next blog post, I'll write about the benefits of such a project and how personal projects like this deliver unexpected rewards.

In the meantime, if you want to show your interest or support, follow the project on GitHub or give it a star, so that I know I'm not working alone in a dark alley.
