Our columnist, Kirk Pepperdine, tells us about the performance session highspots at JavaOne: corestat, fork-join & how to load balance using dequeues, a lockless fully concurrent HashMap that scales to hundreds of cores, and the G1 garbage collector.Published May 2008, Author Kirk Pepperdine
This years JavaONE was littered with a number of performance focused sessions. So many in fact that it was impossible for me to get to a large number of them. That said, the sessions that I was able to attend were all very well done. I'll summarize the ones that I was able to attend here but first I'd like to take some time to exercise some bragging rights.
This year marks a very special event for both myself and JavaONE. I presented a Hands-on-Lab. While I don't think it is necessary to explain why it was special for me, please bear with me for a second while I explain why this was special for JavaONE. To understand why you first need to know that up until this year, JavaONE Hands-on-Labs were the exclusive domain of Sun Microsystems employees. I am the first non-Sun employee to present a Hands-on-Lab at JavaONE.
I don't want to get into all of the gory details so I'll just leave it at this: the two hour lab took more preparation time than any other presentation that I've ever put together. That said, I'd do it again in a heartbeat. But more importantly, the experience has opened the possibility for others to follow. I have a debt of gratitude to Kim LiChong from Sun performance labs. Though we got thrown together more by chance than anything, Kim was instrumental in helping to navigate the Hands-on-Lab process. Also joining me was Victor Grazi. Unfortunately the process didn't allow me to include him until much later on in the process. Even so, with 140 souls signed up, it was great to have an extra set of hands to pitch in.
Presentations of note include; Brian Goetz's talk on what new in concurrency (TS-5515), Cameron Purdy's poetic talk on 10 patterns for scaling out Java (TS-6339) , Cliff Click on his wait free fully concurrent HashMap (TS-6256) and Tony Printezis on his Garbage First GC (TS-5419).
The first performance related session that I attended wasn't actually a JavaONE talk. It was presented at CommunityONE, a JavaONE pre-conference. The information rich talk, presented by Denis Sheahan, was about hardware or more specifically, Chip multi-threading (CMT) processors. Why include a hardware track at what is ostensibly a software conference? Well it is no secret that changes in the CPU landscape are really making a difference in how we need to think about how we build our software systems. More importantly, how these types of processors may break our ability to measure or report on hardware utilization using the tools that we have available today.
The Niagara processor offered the ability to have 4 threads share the same pipeline. Each thread, in succession, contributes a single instruction into the pipeline. It then waits three cycles before injecting the next instruction. Of course it gets a little bit more complicated than this when you start to consider the insertion of memory fences and barriers used for synchronization. The main idea is that a thread once in the CPU will be managed by the CPU.
The next generation of Sparc, code named Victoria Falls, has 64 threads per CPU. This chip contains a more sophisticated scheduler used to control when threads should be running and when they need to wait.
By now you are probably getting the idea that thread scheduling, traditionally the domain of the operating system, has just been moved into the CPU itself. While it is true that the CPU now has scheduling capabilities, the OS still has primary responsibility for scheduling. However, today's operating systems are reporting CPU utilization using the same techniques that have been used since the beginning of time. Clearly these new CPUs break those implementations.
Consider the case when a thread is in the CPU but is idle due to some hold condition. In this case the OS based tools will report the thread as being busy. Clearly this is a problem which fortunately does come with a solution called corestat.
Corestat works very much like vmstat. However corestat has CPU level awareness as to what is really running and what has been parked. There is simply no other way to get this information yet this is the information that we desperately need when we are in the process of tuning.
The other problem is latency on fetches and writes to memory. Simply put, increases in CPU speed have outstripped memory's ability to keep pace. The answer is try to keep the CPU from having to hit main memory as much as possible. Read this to mean, lets put more L1 and L2 cache on the CPU (Intel has an L3 level cache). This means that the JVM becomes NUMA aware. NUMA meaning, non-uniform memory access.
The advantage of NUMA is that many optimizations can be applied so that applications don't have to feel the latency of a trip to main memory.
As you can imagine, all of these changes in hardware have put a lot of pressure on the operating systems. Solaris is now undergoing a number of changes in an attempt to catch up the OS. One last point. Don't believe for a second that AMD and Intel are immune from forcing these changes onto other operating systems. They too are working away on the same problems that Sun is with Sparc. Bottom line, if your profiler isn't using hardware assist, it may be giving you false information.
Brian's talk focused on the new fork-join framework that is being added into Java 7 (JSR 166). Brian's talks are all pretty much how we need to prepare ourselves to be able to deal with new hardware technology such as Niagara, Azul and so on. Fork-Join is a very fine grained way to address how to keep all of your cores busy. Brian used merge sort, a divide and conquer sorting strategy, to demonstrate how it works.
In the world of divide and conquer, we take a large problem and split it into many smaller problems. The idea being that the smaller problems should be easier to solve. Thus in merge sort, we take the list and split it in half and then take each half and split it again. The splitting continues until the problem is trivially to solve. For example, it is trivial to sort a list containing 1 item. This is the fork in fork-join.
At the end of the fork, we will need to do some calculations and this will produce some result. The complete set of results will need to be processed in order to produce the final result. In the case of the merge sort, each of the smaller lists are successively merged together to create the final sorted list. This is the join in fork-join. So far nothing spectacular here until we add in a Dequeue and the idea of work stealing.
A Dequeue is a double ended queue where puts and gets can occur on either end of the queue. In fork-join, every thread gets it's own dequeue or work list. If it's work list is empty, it will look to steal work from the tail of another threads work list.
To seed this, one thread will get the initial list and it will split it in half. It will push both halves onto the head of the work list. At this point other threads can "steal" the work off of the tail of the work list. Each thread will now split and push the work until the unit of work is sufficiently small enough that it's time to do the work. Merge will happen in a similar manner.
What is interesting to note is that access to the head of the work list is single threaded, read no lock contention. The tail of the work list may be contended but the largest jobs are at the tail of the list. Because of this, when a thread steals work and goes away, it will go away for a long time. This means that the tail of the work list will only be lightly contended.
So the advantages of being able to split work up using fork-join will be a smoother load across all of your CPUs. Second advantage is that fork-join minimizes contention resulting in fewer processor stalls due to memory fences. All in all a great talk by Brian.
Continuing on the theme of how can we improve performance by reducing contention was Azul's Cliff Click. Cliff was on Sun's first HotSpot development team and is a leading expert on how multi-threading works at the hardware level. He also has a wealth of knowledge on how JITs work to optimize our code. Over the last couple of years Cliff has been tinkering with a lockless fully concurrent HashMap.
In a multi-core world there is really no practical strategy to avoid data races which means we are forced to be synchronized. But since synchronization comes with a performance penalty, people have tried devise clever tricks to avoid the synchronization. Almost all have failed including the spectacular failure of double checked locking. The clever idea with the HashMap is the recognition that you can't avoid the data races. So, if you can't avoid them, join them or in this case, let them happen and then do some careful checking afterwards to see who's won and who's lost.
How this works is that threads are allows to "claim" a slot using a Compare and Set (CAS) operation. If the CAS succeeds, you've won and the slot is yours. If the CAS fails, you get to try again. Controlling all of this is a simple but elegant state machine. It knows if you've failed, succeeded or if you need to support a resize. I guess I should mention that resizing is also lockless. When a resize needs to happen the old table gets marked as being old and new writes need to happen in the new one. There is quite a bit of logic to manage late writes and all kinds of other concurrency issues. But in the end, passing the CAS is key to forward progress. The question is, does it all work? Cliff does run through a case study where he shows that the regular HashMap tops out pretty quickly. The Concurrent HashMap tops out either at the number of CPUs or the level of striping (32 for the current implementation) while Cliff's implementation is able to manage Azul's larger 768 core machine without much difficulty.
Finally I'd like to summarize Tony Printezis talk on the Garbage First Garbage Collector. Again the theme of the talk is how do we move to a more fully concurrent collector. The Concurrent Mark and Sweep (CMS) collector includes two brief "stop the world" phases. It also doesn't compact so every once in a while you will get a much longer pause when compaction becomes necessary. So while CMS is better for those applications that are sensitive to GC pause times, it still isn't quite good enough. The goals of the G1 collector are to add compacting and improve predictability.
Currently Sun's Java heap space is split into Eden, 2 survivor spaces, tenured space, and perm space. With the G1 collector this changes to a young and old space. However, each of these spaces are divided into 1mb buckets called regions.
Objects that survive a young generational GC will be moved into a survivor region or an old region. Young GC uses evacuation pauses that move the objects from one set of regions to another. In old space GC, objects are deallocated in place. The collector then calculates "liveness" which can result in live objects being evacuated to another region. What is interesting is that using this technique, you don't need to sweep old space as space is reclaimed when regions are empty.
G1 also uses remember-sets to help redirect references to objects that have been moved. It also includes a pause prediction model that works to meet pause time goals.
In June I will be speaking at a couple of JUG event, one in Athens and the other in London. In addition I will be chairing the performance track at TSSJS in Prague. I've managed to corral a number of interesting speakers for that event including John Davies on grid computing and Holly Cummins from IBM. She will be talking trash.