|
|
|
Tracking down a rarely occurring, non-reproducible bug is one of the hardest challenges a developer faces. In this article I walk through a methodology you can apply to tracking down rarely occurring race conditions, using an example from the OpenTelemetry Java agent's inferred-spans project. The failure only showed up occasionally in CI, never in local development, and could not be reproduced under a debugger - exactly the kind of bug that is hardest to diagnose. The code is open source, so you can investigate along with me if you like a challenge.
Published December 2025, Author Jack Shirazi
The debugging detective journey is a good illustration of how to debug a hidden race condition (concurrency) bug, even when all the code involved is technically thread-safe.
We had an intermittent CI failure https://github.com/open-telemetry/opentelemetry-java-contrib/issues/2429 caused by a NullPointerException coming from a call to WeakConcurrentMap.get():
CallTreeSpanifyTest > testSpanification() FAILED
java.lang.NullPointerException: Cannot invoke "java.lang.Long.longValue()" because the return value of "com.blogspot.mydailyjava.weaklockfree.WeakConcurrentMap.get(Object)" is null
at io.opentelemetry.contrib.inferredspans.internal.SpanAnchoredClock.getAnchor(SpanAnchoredClock.java:43)
at io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler$ActivationEvent.set(SamplingProfiler.java:877)
at io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler$ActivationEvent.activation(SamplingProfiler.java:858)
at io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler.lambda$new$1(SamplingProfiler.java:191)
at com.lmax.disruptor.RingBuffer.translateAndPublish(RingBuffer.java:986)
at com.lmax.disruptor.RingBuffer.tryPublishEvent(RingBuffer.java:538)
at io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler.onActivation(SamplingProfiler.java:321)
at io.opentelemetry.contrib.inferredspans.internal.ProfilingActivationListener.beforeActivate(ProfilingActivationListener.java:121)
at io.opentelemetry.contrib.inferredspans.internal.ProfilingActivationListener$ContextStorageWrapper.attach(ProfilingActivationListener.java:61)
at io.opentelemetry.context.Context.makeCurrent(Context.java:231)
at io.opentelemetry.context.ImplicitContextKeyed.makeCurrent(ImplicitContextKeyed.java:33)
at io.opentelemetry.contrib.inferredspans.internal.CallTreeTest.getCallTree(CallTreeTest.java:1005)
at io.opentelemetry.contrib.inferredspans.internal.CallTreeSpanifyTest.testSpanification(CallTreeSpanifyTest.java:50)
Without the ability to reproduce the bug, you only have the failure log and the codebase to go on. That means finding the root cause becomes a hunt through code workflows, trying hypotheses and looking for clues that support or rule out each one. An intermittent failure is most likely a concurrency bug (or underlying system resource exhaustion, but the latter would typically show up as a timeout or an IO failure).
Checking the codebase, it's clear that the entry is always inserted into the map with a non-null value. The failure message means that a key which should always be present in the map was sometimes not there. There is a simple workaround that stops the failure - check for null and return a default value - which I applied. But while that workaround eliminates the failure, it just masks the underlying race condition, so I needed to make sure we weren't missing a true production bug.
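To make the workaround concrete, here is a minimal sketch of the shape of such an accessor. It is not the real SpanAnchoredClock code; the field name, key type, and fallback value are assumptions for illustration only.

import com.blogspot.mydailyjava.weaklockfree.WeakConcurrentMap;
import io.opentelemetry.api.trace.Span;

// Illustrative sketch only - not the actual SpanAnchoredClock implementation.
class AnchorClockSketch {
  // One final map, created once and never replaced.
  private final WeakConcurrentMap<Span, Long> nanoTimeOffsetMap =
      new WeakConcurrentMap.WithInlinedExpunction<>();

  // The single updater: record an anchor when a span starts.
  void onSpanStart(Span span, long nanoTimeOffset) {
    nanoTimeOffsetMap.put(span, nanoTimeOffset);
  }

  // Before the workaround: auto-unboxing a missing (null) entry throws exactly the
  // NullPointerException seen in the CI failure.
  long getAnchorBefore(Span parent) {
    return nanoTimeOffsetMap.get(parent);
  }

  // After the workaround: a missing entry falls back to a default value instead of failing.
  long getAnchorAfter(Span parent) {
    Long offset = nanoTimeOffsetMap.get(parent);
    return offset == null ? 0L : offset;
  }
}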
When a map lookup returns null unexpectedly, there are only a handful of general causes:
1. The map implementation is broken or is not thread-safe for the way it is being used, so a concurrent update corrupts the lookup.
2. The entry was explicitly removed by some code before the lookup.
3. The entry was removed by the garbage collector (possible for maps with weak keys, like this one).
4. The lookup happened before the insertion.
5. The key used for the lookup does not match the key that was inserted - a different instance, or a key whose hashcode or equality changed after insertion.
These five possibilities cover all "map miss" failures in concurrent Java code.
I walked through each possibility, as described in the following sections.
The class holding the map, SpanAnchoredClock, is pretty simple. It has one updater and one accessor for the (private) map, so it is easy to reason about. Furthermore, the map is constructed at object creation and never replaced (the field is final). Finally, com.blogspot.mydailyjava.weaklockfree.WeakConcurrentMap is a thread-safe map. I can't entirely rule out a bug in the WeakConcurrentMap implementation, but it is a widely used, mature map, so it is quite unlikely to have such a bug in a simple update and access of the map. I can essentially rule out cause #1.
Could access occur before insertion (to consider cause #4)? The inferred-spans engine is parallel, so this is worth checking.
Tracing through the code flow from the test entry point (CallTreeSpanifyTest.testSpanification) to the failing lookup shows that the anchor entry is inserted when the span is started, and the lookup only happens later, when that span's context is activated (the makeCurrent call visible in the stack trace); in this test both steps run on the same thread, so the insertion necessarily precedes the access. This rules out cause #4.
One of the most common map usage bugs comes from keys whose hashcode changes after insertion. But:
WeakConcurrentMap is implemented to use object identity (==) rather than equals(), so a key whose hashcode changes after insertion is irrelevant to it. A different object instance, even one equal to the original, would not match the one in the map - but tracing the flow shows the lookup is made with the very same instance that was inserted.
So now we've ruled out cause #5.
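The identity-based lookup is easy to demonstrate in isolation. This small standalone example (not project code) shows that even an equal key misses if it is a different instance:

import com.blogspot.mydailyjava.weaklockfree.WeakConcurrentMap;

class IdentityLookupDemo {
  public static void main(String[] args) {
    WeakConcurrentMap<String, String> map = new WeakConcurrentMap.WithInlinedExpunction<>();

    String original = new String("anchor");          // the instance we insert
    String equalButDifferent = new String("anchor"); // equals() is true, but a different object

    map.put(original, "value");

    System.out.println(map.get(original));           // value - same instance matches
    System.out.println(map.get(equalButDifferent));  // null - equal value, different identity
  }
}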
A WeakConcurrentMap removes keys when they become weakly reachable. This means if nothing is strongly referencing the key, GC can clear it.
But in this case:
- the failing lookup passes the key to get(), which means the caller still holds a strong reference to that key instance at the moment of the lookup, so GC cannot have cleared it
- WeakConcurrentMap identity (==) is used for comparison checks, so that's not a possible cause of failure here: the map's weak reference points at the very instance we are still holding
So GC cannot explain the missing entry, ruling out both causes #2 and #3 (keeping in mind that we've already established the key is always inserted with a non-null value and never removed - only GC can remove the entry, there is no code that does removal).
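For completeness, here is a small standalone demo (again not project code) of the weak-key behaviour: once the only strong reference to a key is dropped, the entry can disappear after a GC. System.gc() is only a hint, so the final size is usually, but not guaranteed to be, zero.

import com.blogspot.mydailyjava.weaklockfree.WeakConcurrentMap;

class WeakKeyDemo {
  public static void main(String[] args) throws InterruptedException {
    WeakConcurrentMap<Object, String> map = new WeakConcurrentMap.WithInlinedExpunction<>();

    Object key = new Object();
    map.put(key, "value");
    System.out.println(map.approximateSize()); // 1 - the key is still strongly reachable

    key = null;          // drop the only strong reference to the key
    System.gc();         // only a hint; collection timing is not guaranteed
    Thread.sleep(100);   // give the collector a moment (best effort)
    map.expungeStaleEntries();

    System.out.println(map.approximateSize()); // usually 0 - the entry goes once the key is collected
  }
}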
All five generic root causes were ruled out, yet we definitely have a failure showing a missing map entry! When all options are ruled out, it's almost always because one of the assumptions is wrong.
Let's review what we definitely know:
- the entry is always inserted with a non-null value, and the insertion happens before the lookup
- no code removes the entry, and GC cannot have cleared the key
- the lookup uses the same key instance that was inserted
- the map is thread-safe, held in a final field, and never replaced
Could the insertion happen on one map, and access happen on a different map?
Actually, I sort of considered this right near the beginning, when I noted that the map is constructed in SpanAnchoredClock and never replaced. But I didn't consider whether the SpanAnchoredClock itself is ever replaced - that would indeed cause this issue. Another trawl through the workflow rules this out too: the test constructs exactly one SpanAnchoredClock.
In fact, extending this idea and looking at all the objects in the tests, everything seems to be constructed specifically for this test and discarded at the end of it - even the ring buffer visible in the stack trace.
Having ruled out all possible root causes from generic map management considerations, I'll go back to basics. Fundamentally, concurrency bugs come from shared mutable state. All three words need to be active:
- shared: more than one thread can reach it
- mutable: it can be changed after creation
- state: it holds data that the code's behaviour depends on
Looking through the test setup, almost everything is newly constructed for each test run - except for one thing: ProfilingActivationListener.activeListeners. This is a static list shared across tests. But the list is a thread-safe implementation, and listeners are only ever added to it and removed from it; in other words, it is completely thread-safe.
But "thread-safe" does not mean "free of races."
I need to go back to an outline of the sequence of activity, focusing specifically on how that one shared structure is used:
1. Test A creates a SamplingProfiler, and registers listener Listener_A into the activeListeners.
2. When Test A shuts down, Listener_A is due to be removed from activeListeners, but as the deregistration runs asynchronously, it is not immediately removed on test shutdown.
3. Test B (the test that fails) starts, constructs its own SamplingProfiler and SpanAnchoredClock, and activates a span. The activation is delivered to every listener still in activeListeners - including the stale Listener_A, which performs its lookup using the SpanAnchoredClock from Test A with the Test A map.
This explains everything:
- the anchor was inserted into Test B's map, but the failing lookup went to Test A's map via the stale Listener_A, so the key was never there and get() returned null
- the window only exists while the asynchronous deregistration of Listener_A is still pending, which is why the failure is intermittent
- and because the window depends on thread scheduling and shutdown timing, it showed up only occasionally and only in CI
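To show the mechanism in miniature, here is a runnable sketch of the same shape of race. The Listener and registry here are illustrative stand-ins, not the real ProfilingActivationListener or SamplingProfiler classes, and the delayed removal simulates the asynchronous deregistration.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StaleListenerRaceSketch {
  // The one piece of shared mutable state: a static, thread-safe listener registry.
  static final List<Listener> activeListeners = new CopyOnWriteArrayList<>();

  // Each "test" gets its own listener with its own private map
  // (the stand-in for SpanAnchoredClock's anchor map).
  static final class Listener {
    final Map<Object, Long> anchors = new ConcurrentHashMap<>();

    Long onActivation(Object span) {
      return anchors.get(span); // null if the span was anchored in a different listener's map
    }
  }

  public static void main(String[] args) {
    ExecutorService background = Executors.newSingleThreadExecutor();

    // "Test A": registers its listener, then shuts down; deregistration happens asynchronously.
    Listener listenerA = new Listener();
    activeListeners.add(listenerA);
    background.submit(() -> {
      sleep(50); // simulated asynchronous shutdown delay
      activeListeners.remove(listenerA);
    });

    // "Test B": registers its own listener and anchors its span in its own map only.
    Listener listenerB = new Listener();
    activeListeners.add(listenerB);
    Object span = new Object();
    listenerB.anchors.put(span, 42L);

    // Test B's activation is delivered to every registered listener, including the stale
    // listenerA, whose map has never seen this span - that lookup returns null.
    for (Listener listener : activeListeners) {
      System.out.println(listener.onActivation(span)); // null for listenerA, 42 for listenerB
    }

    background.shutdown();
  }

  static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}

In the real test suite the overlap only happens when shutdown of the previous test's profiler is slow enough to still be pending when the next test activates a span, which is why the failure is rare and timing dependent.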
The concurrency bug turned out not to be a bug in the profiler at all, but a race created by the test harness. Even perfectly thread-safe components can interact badly if tests share static or global state.
The null check in the map accessor is a safe workaround because the root cause doesn't exist outside the test environment.