Fasterj

Finding the root cause of a rarely occurring race condition


Tracking down a rarely occurring non-reproducible bug is one of the hardest challenges for a developer. In this article I walk through the methodology you can apply to tracking down rarely occurring race conditions, using one example I had in the OpenTelemetry Java agent's inferred-spans project. The failure only showed up occasionally in CI, never in local development, and could not be reproduced in debugging - exactly the kind of bug that is hardest to diagnose. The code is open source so you can investigate with me if you like a challenge.

The debugging detective journey is a good illustration of how to debug a hidden race condition (concurrency) bug, even when all the code involved is technically thread-safe.
Published December 2025, Author Jack Shirazi

The problem

We had an intermittent CI failure (https://github.com/open-telemetry/opentelemetry-java-contrib/issues/2429) caused by a NullPointerException coming from a call to WeakConcurrentMap.get():

CallTreeSpanifyTest > testSpanification() FAILED
    java.lang.NullPointerException: Cannot invoke "java.lang.Long.longValue()" because the return value of "com.blogspot.mydailyjava.weaklockfree.WeakConcurrentMap.get(Object)" is null
        at io.opentelemetry.contrib.inferredspans.internal.SpanAnchoredClock.getAnchor(SpanAnchoredClock.java:43)
        at io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler$ActivationEvent.set(SamplingProfiler.java:877)
        at io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler$ActivationEvent.activation(SamplingProfiler.java:858)
        at io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler.lambda$new$1(SamplingProfiler.java:191)
        at com.lmax.disruptor.RingBuffer.translateAndPublish(RingBuffer.java:986)
        at com.lmax.disruptor.RingBuffer.tryPublishEvent(RingBuffer.java:538)
        at io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler.onActivation(SamplingProfiler.java:321)
        at io.opentelemetry.contrib.inferredspans.internal.ProfilingActivationListener.beforeActivate(ProfilingActivationListener.java:121)
        at io.opentelemetry.contrib.inferredspans.internal.ProfilingActivationListener$ContextStorageWrapper.attach(ProfilingActivationListener.java:61)
        at io.opentelemetry.context.Context.makeCurrent(Context.java:231)
        at io.opentelemetry.context.ImplicitContextKeyed.makeCurrent(ImplicitContextKeyed.java:33)
        at io.opentelemetry.contrib.inferredspans.internal.CallTreeTest.getCallTree(CallTreeTest.java:1005)
        at io.opentelemetry.contrib.inferredspans.internal.CallTreeSpanifyTest.testSpanification(CallTreeSpanifyTest.java:50)

Without the ability to reproduce the bug, you only have the failure log and the codebase to go on. That means finding the root cause becomes a hunt through code workflows, trying hypotheses and looking for clues that will support or rule out each hypothesis. An intermittent failure is most likely a concurrency bug (or an underlying system resource exhaustion, but the latter would typically show up as a timeout or IO failure).

What could cause a missing Map entry?

Checking the codebase, it's clear that the entry is always inserted into the map with a non-null value, so the failure means that a key which should always be present in the map was sometimes not there. There is a simple workaround that stops the failure: check for null and return a default value, which I applied. But while that workaround eliminates the failure, it only masks the underlying race condition, and I needed to make sure we weren't missing a true production bug.
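
The workaround is a one-line defensive null check. Here is a minimal sketch of the before-and-after shape (illustrative only: the real method in SpanAnchoredClock differs, the fallback value is a placeholder, and ConcurrentHashMap stands in for WeakConcurrentMap):

```java
import java.util.concurrent.ConcurrentHashMap;

public class AnchorLookup {
    // ConcurrentHashMap stands in for com.blogspot.mydailyjava.weaklockfree.WeakConcurrentMap
    static final ConcurrentHashMap<Object, Long> nanoTimeOffsetMap = new ConcurrentHashMap<>();

    // Before: auto-unboxing a null return throws NullPointerException
    static long getAnchorUnsafe(Object key) {
        return nanoTimeOffsetMap.get(key); // NPE if the key is absent
    }

    // After: check for null and fall back to a default value (the choice of default is illustrative)
    static long getAnchorSafe(Object key) {
        Long anchor = nanoTimeOffsetMap.get(key);
        return anchor == null ? 0L : anchor;
    }
}
```

The unsafe version fails only when the key is missing, which is exactly why the failure was intermittent rather than constant.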

When a map lookup returns null unexpectedly, there are only a handful of general causes:

  1. The map "read" saw a stale or inconsistent view (map not thread-safe)
  2. The key was removed
  3. The value was removed or set to null
  4. The key was never added
  5. The key changed in a way that breaks the lookup (e.g., mutated hashcode or identity)

These five possibilities cover all "map miss" failures in concurrent Java code.

I walked through each possibility, as described in the following sections.

Thread-safety check

The class holding the map, SpanAnchoredClock, is pretty simple: it has one updater and one accessor for the (private) map, so it is easy to reason about. Furthermore the map is constructed at object creation and never replaced (the field is final). Finally, com.blogspot.mydailyjava.weaklockfree.WeakConcurrentMap is a thread-safe map. I can't entirely rule out a bug in the WeakConcurrentMap implementation, but it's a widely used, mature map, so a bug in a simple update and access is quite unlikely. I can essentially rule out cause #1.
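
The shape of the class that makes it easy to reason about is roughly the following (a simplified sketch, not the real implementation: ConcurrentHashMap stands in for WeakConcurrentMap, the anchor computation is elided, and the method names are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;

public class ClockSketch {
    // Final field: the map is created once at construction and never replaced
    private final ConcurrentHashMap<Object, Long> map = new ConcurrentHashMap<>();

    // The single updater: always inserts a non-null value
    public void onSpanStart(Object spanContext, long anchor) {
        map.put(spanContext, anchor);
    }

    // The single accessor: this is the shape of the line that intermittently threw NPE
    public long getAnchor(Object spanContext) {
        return map.get(spanContext); // unboxing NPE if the entry is missing
    }
}
```

With one writer path, one reader path, a final field, and a thread-safe map, there is nothing in this class alone that can produce an inconsistent view.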

Parallel execution check

Could access occur before insertion (to consider cause #4)? The inferred-spans engine is parallel, so this is worth checking.

Tracing through the code flow from the test entry point (CallTreeSpanifyTest.testSpanification) to the failing lookup shows that the map entry is always inserted before any code path can reach the lookup. This rules out cause #4.

Mutability check

One of the most common map usage bugs comes from keys whose hashcodes change after insertion. But here the key is not mutated after insertion, so its hashcode and identity cannot change between the put and the get. So now we've ruled out cause #5.
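
For contrast, here is what cause #5 looks like when it does happen: a key whose hashCode depends on mutable state is inserted, then mutated, and the entry "disappears" because the lookup now probes the wrong bucket (a self-contained illustration, not code from the project):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class MutableKeyDemo {
    static class Key {
        String name;
        Key(String name) { this.name = name; }
        // hashCode depends on a mutable field: a classic map-key mistake
        @Override public int hashCode() { return Objects.hashCode(name); }
        @Override public boolean equals(Object o) {
            return o instanceof Key && Objects.equals(name, ((Key) o).name);
        }
    }

    static boolean lostAfterMutation() {
        Map<Key, Long> map = new HashMap<>();
        Key key = new Key("a");
        map.put(key, 1L);
        key.name = "b";              // hashCode changes after insertion...
        return map.get(key) == null; // ...so the lookup misses, even with the same key object
    }
}
```

The entry is still physically in the map (it shows up when iterating); only hash-based lookup is broken, which is what makes this class of bug so confusing.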

Garbage collection check

A WeakConcurrentMap removes keys when they become weakly reachable. This means if nothing is strongly referencing the key, GC can clear it.

But in this case the key remains strongly reachable for the entire duration of the test.

So GC cannot explain the missing entry, ruling out both causes #2 and #3 (keeping in mind that we've already established the key is always inserted with a non-null value and never removed by code - only GC can remove the entry, as there is no code that does removal).
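
The mechanics of a weak-keyed map are easy to demonstrate with the JDK's WeakHashMap (standing in here for WeakConcurrentMap): an entry whose key is only weakly reachable may vanish after GC, while a strongly referenced key must survive. A sketch:

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

public class WeakKeyDemo {
    // Returns true if the strongly referenced key's entry survives a GC pass
    static boolean strongKeySurvivesGc() {
        Map<Object, Long> map = new WeakHashMap<>();
        Object strongKey = new Object(); // strongly reachable for the whole method
        map.put(strongKey, 1L);

        Object weakKey = new Object();
        map.put(weakKey, 2L);
        WeakReference<Object> ref = new WeakReference<>(weakKey);
        weakKey = null; // now only weakly reachable; GC *may* clear it (and ref.get() may become null)

        System.gc(); // a hint only; clearing of the weak entry is not guaranteed

        // The strongly referenced entry must still be present
        return map.get(strongKey) != null && map.get(strongKey) == 1L;
    }
}
```

Only the strongly-referenced behavior is deterministic, which is why the article's reasoning (the key is strongly reachable throughout the test) is enough to rule GC out.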

Wait, what?

All five generic root causes were ruled out, yet we definitely have a failure showing a missing map entry! When all options are ruled out, it's almost always because one of the assumptions is wrong.

Let's review what we definitely know: the entry is always inserted, with a non-null value, into a thread-safe map that is never replaced, and nothing removes it. Yet the lookup returned null.

Could the insertion happen on one map, and access happen on a different map?

Actually I sort of considered this right near the beginning, when I noted that the map is constructed in SpanAnchoredClock and never replaced. But I didn't consider whether the SpanAnchoredClock itself is ever replaced; that would indeed cause this issue. Another trawl through the workflow rules this out too: the test constructs exactly one SpanAnchoredClock.

In fact, extending this idea to all the objects in the test, everything is constructed specifically for this test and discarded at the end of it, even the ring buffer in the stack trace.

Shared mutable state

Having ruled out all possible root causes from generic Map management considerations, I'll go back to basics. Fundamentally, concurrency bugs come from shared mutable state. All three words need to be active: the state must be shared (reachable from more than one thread), mutable (something changes it), and state (data that outlives a single operation).

Looking through the test setup, almost everything is newly constructed for each test run, except for one thing: ProfilingActivationListener.activeListeners. This is a static list shared across tests. But every access to that list is itself thread-safe. In other words, it is completely thread-safe.

But "thread-safe" does not mean "free of races."
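
A classic illustration of that distinction: every individual operation below is thread-safe, yet the check-then-act composition races. Here one bad interleaving is simulated deterministically, with both "threads" running their check before either runs its add (an illustrative demo, not project code):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class CheckThenActDemo {
    // Each contains() and add() call is individually thread-safe,
    // but the if-then composition is not atomic.
    static int addIfAbsentRace() {
        List<String> list = new CopyOnWriteArrayList<>();

        // Interleaving: thread 1 checks, thread 2 checks, thread 1 adds, thread 2 adds
        boolean t1Absent = !list.contains("x"); // thread 1: true
        boolean t2Absent = !list.contains("x"); // thread 2: also true, but stale by the time it acts
        if (t1Absent) list.add("x");
        if (t2Absent) list.add("x"); // duplicate: the race's visible effect

        return list.size(); // 2, not 1
    }
}
```

CopyOnWriteArrayList even provides addIfAbsent() precisely because callers cannot safely compose this check themselves. The point: races live in compositions and interleavings, not necessarily in any single operation.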

The real root cause

I need to go back to an outline of the sequence of activity, focusing specifically on how that one shared structure is used:

  1. Test A starts, creates a SamplingProfiler, and registers listener Listener_A into the activeListeners.
  2. Test A finishes and shuts down the profiler.
  3. Profiler shutdown happens asynchronously, on a background thread: it removes Listener_A from activeListeners, but because it runs in the background, the removal is not guaranteed to have completed when the test returns.
  4. Test B can start while that background removal is still pending.
  5. Test B registers listener Listener_B, but it is possible that Listener_A is still present in the list for a brief moment.
  6. When a profiling activation occurs in Test B, the listener list is iterated, and the stale Listener_A is invoked too, routing the activation to Test A's profiler and so to Test A's SpanAnchoredClock map.
  7. Test B's key is not in Test A's map, hence the NPE.
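
The interleaving in the steps above can be reproduced deterministically by forcing it with latches (a self-contained sketch; the listener names and the static list mirror the description above, not the actual project code):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

public class ListenerRaceDemo {
    // Shared static state across "tests", like ProfilingActivationListener.activeListeners
    static final List<String> activeListeners = new CopyOnWriteArrayList<>();

    static int listenersSeenByTestB() {
        try {
            CountDownLatch removalMayProceed = new CountDownLatch(1);
            CountDownLatch removalDone = new CountDownLatch(1);

            // Test A starts and registers its listener
            activeListeners.add("Listener_A");

            // Test A "shuts down": the removal runs asynchronously on a background thread
            Thread asyncShutdown = new Thread(() -> {
                try {
                    removalMayProceed.await(); // deliberately delayed, as in the real race
                    activeListeners.remove("Listener_A");
                    removalDone.countDown();
                } catch (InterruptedException ignored) {
                }
            });
            asyncShutdown.start();

            // Test B starts while the removal is still pending
            activeListeners.add("Listener_B");
            int seen = activeListeners.size(); // Test B iterates: it sees the stale Listener_A too

            removalMayProceed.countDown(); // now let the background removal finish
            removalDone.await();
            asyncShutdown.join();
            return seen;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```

In the real failure the window was opened by scheduling luck rather than latches, which is why it only ever appeared occasionally in CI.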

This explains everything: the failure needs the asynchronous shutdown of one test to overlap the start of the next, a timing window that CI occasionally hit but that local runs and debugging sessions never did; and every individual component really is thread-safe, so no amount of inspecting any single class could have found it.

A concurrency bug from fully thread-safe code

The concurrency bug turned out not to be a bug in the profiler at all, but a race created by the test harness. Even perfectly thread-safe components can interact badly if tests share static or global state.

The null check in the map accessor is a safe workaround because the root cause, overlapping test setup and asynchronous teardown of shared static state, doesn't exist outside the test environment.

Some useful lessons:

  1. An intermittent, non-reproducible failure is most likely a concurrency bug.
  2. Enumerate the generic causes of the observed symptom and methodically rule each one out.
  3. When every hypothesis has been ruled out, one of your assumptions is wrong: go back and re-examine the assumptions.
  4. "Thread-safe" components can still combine into a race; the bug lives in the shared mutable state and the interleavings, not in any single class.
  5. Static or global state shared across tests is a classic source of test-only races; prefer per-test construction and synchronous teardown.


Last Updated: 2025-12-25
Copyright © 2007-2025 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on Fasterj.com are the property of their respective owners.
URL: http://www.fasterj.com/articles/concurrencybug.shtml