|
|
|
Tracking down a rarely occurring, non-reproducible bug is one of the hardest challenges a developer faces. In this article I walk through a methodology you can apply to tracking down rarely occurring race conditions, using an example from the OpenTelemetry Java agent's inferred-spans project. The failure only showed up occasionally in CI, never in local development, and could not be reproduced under a debugger - exactly the kind of bug that is hardest to diagnose. The code is open source, so you can investigate along with me if you like a challenge.
Published December 2025, Author Jack Shirazi
The debugging detective journey is a good illustration of how to debug a hidden race condition (concurrency) bug, even when all the code involved is technically thread-safe.
We had an intermittent CI failure https://github.com/open-telemetry/opentelemetry-java-contrib/issues/2429 caused by a NullPointerException coming from a call to WeakConcurrentMap.get():
CallTreeSpanifyTest > testSpanification() FAILED
java.lang.NullPointerException: Cannot invoke "java.lang.Long.longValue()" because the return value of "com.blogspot.mydailyjava.weaklockfree.WeakConcurrentMap.get(Object)" is null
at io.opentelemetry.contrib.inferredspans.internal.SpanAnchoredClock.getAnchor(SpanAnchoredClock.java:43)
at io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler$ActivationEvent.set(SamplingProfiler.java:877)
at io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler$ActivationEvent.activation(SamplingProfiler.java:858)
at io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler.lambda$new$1(SamplingProfiler.java:191)
at com.lmax.disruptor.RingBuffer.translateAndPublish(RingBuffer.java:986)
at com.lmax.disruptor.RingBuffer.tryPublishEvent(RingBuffer.java:538)
at io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler.onActivation(SamplingProfiler.java:321)
at io.opentelemetry.contrib.inferredspans.internal.ProfilingActivationListener.beforeActivate(ProfilingActivationListener.java:121)
at io.opentelemetry.contrib.inferredspans.internal.ProfilingActivationListener$ContextStorageWrapper.attach(ProfilingActivationListener.java:61)
at io.opentelemetry.context.Context.makeCurrent(Context.java:231)
at io.opentelemetry.context.ImplicitContextKeyed.makeCurrent(ImplicitContextKeyed.java:33)
at io.opentelemetry.contrib.inferredspans.internal.CallTreeTest.getCallTree(CallTreeTest.java:1005)
at io.opentelemetry.contrib.inferredspans.internal.CallTreeSpanifyTest.testSpanification(CallTreeSpanifyTest.java:50)
Without the ability to reproduce the bug, you only have the failure log and the codebase to go on. That means finding the root cause becomes a hunt through code workflows, trying hypotheses and looking for clues that support or rule out each one. An intermittent failure is most likely a concurrency bug (or underlying system resource exhaustion, but the latter would typically show up as a timeout or an IO failure).
Checking the codebase, it's clear that the entry is always inserted into the map with a non-null value. The failure message means that a key which should always be present in the map was sometimes not there. There is a simple workaround that stops the failure - check for null and return a default value - which I applied. But while that workaround eliminates the failure, it just masks the underlying race condition, so I needed to make sure we weren't missing a true production bug.
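To make the workaround concrete, here is a minimal sketch of the shape of such an accessor. It is not the real SpanAnchoredClock code; the field name, key type, and fallback value are assumptions for illustration only.

import com.blogspot.mydailyjava.weaklockfree.WeakConcurrentMap;
import io.opentelemetry.api.trace.Span;

// Illustrative sketch only - not the actual SpanAnchoredClock implementation.
class AnchorClockSketch {
  // One final map, created once and never replaced.
  private final WeakConcurrentMap<Span, Long> nanoTimeOffsetMap =
      new WeakConcurrentMap.WithInlinedExpunction<>();

  // The single updater: record an anchor when a span starts.
  void onSpanStart(Span span, long nanoTimeOffset) {
    nanoTimeOffsetMap.put(span, nanoTimeOffset);
  }

  // Before the workaround: auto-unboxing a missing (null) entry throws exactly the
  // NullPointerException seen in the CI failure.
  long getAnchorBefore(Span parent) {
    return nanoTimeOffsetMap.get(parent);
  }

  // After the workaround: a missing entry falls back to a default value instead of failing.
  long getAnchorAfter(Span parent) {
    Long offset = nanoTimeOffsetMap.get(parent);
    return offset == null ? 0L : offset;
  }
}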
When a map lookup returns null unexpectedly, there are only a handful of general causes:
1. The map implementation is broken or is not thread-safe for the way it is being used, so a concurrent update corrupts the lookup.
2. The entry was explicitly removed by some code before the lookup.
3. The entry was removed by the garbage collector (possible for maps with weak keys, like this one).
4. The lookup happened before the insertion.
5. The key used for the lookup does not match the key that was inserted - a different instance, or a key whose hashcode or equality changed after insertion.
These five possibilities cover all "map miss" failures in concurrent Java code.
I walked through each possibility, as described in the following sections.
The class holding the map, SpanAnchoredClock, is pretty simple. It has one updater and one accessor for the (private) map, so it is easy to reason about. Furthermore, the map is constructed at object creation and never replaced (the field is final). Finally, com.blogspot.mydailyjava.weaklockfree.WeakConcurrentMap is a thread-safe map. I can't entirely rule out a bug in the WeakConcurrentMap implementation, but it is a widely used, mature map, so it is quite unlikely to have such a bug in a simple update and access of the map. I can essentially rule out cause #1.
Could access occur before insertion (to consider cause #4)? The inferred-spans engine is parallel, so this is worth checking.
Tracing through the code flow from the test entry point (CallTreeSpanifyTest.testSpanification) to the failing lookup shows that the anchor entry is inserted when the span is started, and the lookup only happens later, when that span's context is activated (the makeCurrent call visible in the stack trace); in this test both steps run on the same thread, so the insertion necessarily precedes the access. This rules out cause #4.
One of the most common map usage bugs comes from keys whose hashcode changes after insertion. But:
WeakConcurrentMap is implemented to use object identity (==) rather than equals(), so a key whose hashcode changes after insertion is irrelevant to it. A different object instance, even one equal to the original, would not match the one in the map - but tracing the flow shows the lookup is made with the very same instance that was inserted.
So now we've ruled out cause #5.
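The identity-based lookup is easy to demonstrate in isolation. This small standalone example (not project code) shows that even an equal key misses if it is a different instance:

import com.blogspot.mydailyjava.weaklockfree.WeakConcurrentMap;

class IdentityLookupDemo {
  public static void main(String[] args) {
    WeakConcurrentMap<String, String> map = new WeakConcurrentMap.WithInlinedExpunction<>();

    String original = new String("anchor");          // the instance we insert
    String equalButDifferent = new String("anchor"); // equals() is true, but a different object

    map.put(original, "value");

    System.out.println(map.get(original));           // value - same instance matches
    System.out.println(map.get(equalButDifferent));  // null - equal value, different identity
  }
}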
A WeakConcurrentMap removes keys when they become weakly reachable. This means if nothing is strongly referencing the key, GC can clear it.
But in this case:
- the failing lookup passes the key to get(), which means the caller still holds a strong reference to that key instance at the moment of the lookup, so GC cannot have cleared it
- WeakConcurrentMap identity (==) is used for comparison checks, so that's not a possible cause of failure here: the map's weak reference points at the very instance we are still holding
So GC cannot explain the missing entry, ruling out both causes #2 and #3 (keeping in mind that we've already established the key is always inserted with a non-null value and never removed - only GC can remove the entry, there is no code that does removal).
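For completeness, here is a small standalone demo (again not project code) of the weak-key behaviour: once the only strong reference to a key is dropped, the entry can disappear after a GC. System.gc() is only a hint, so the final size is usually, but not guaranteed to be, zero.

import com.blogspot.mydailyjava.weaklockfree.WeakConcurrentMap;

class WeakKeyDemo {
  public static void main(String[] args) throws InterruptedException {
    WeakConcurrentMap<Object, String> map = new WeakConcurrentMap.WithInlinedExpunction<>();

    Object key = new Object();
    map.put(key, "value");
    System.out.println(map.approximateSize()); // 1 - the key is still strongly reachable

    key = null;          // drop the only strong reference to the key
    System.gc();         // only a hint; collection timing is not guaranteed
    Thread.sleep(100);   // give the collector a moment (best effort)
    map.expungeStaleEntries();

    System.out.println(map.approximateSize()); // usually 0 - the entry goes once the key is collected
  }
}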
All five generic root causes were ruled out, yet we definitely have a failure showing a missing map entry! When all options are ruled out, it's almost always because one of the assumptions is wrong.
Let's review what we definitely know:
- the entry is always inserted with a non-null value, and the insertion happens before the lookup
- no code removes the entry, and GC cannot have cleared the key
- the lookup uses the same key instance that was inserted
- the map is thread-safe, held in a final field, and never replaced
Could the insertion happen on one map, and access happen on a different map?
Actually, I sort of considered this right near the beginning, when I noted that the map is constructed in SpanAnchoredClock and never replaced. But I didn't consider whether the SpanAnchoredClock itself is ever replaced - that would indeed cause this issue. Another trawl through the workflow rules this out too: the test constructs exactly one SpanAnchoredClock.
In fact, extending this idea and looking at all the objects in the tests, everything seems to be constructed specifically for this test and discarded at the end of it - even the ring buffer visible in the stack trace.
Having ruled out all possible root causes from generic map management considerations, I'll go back to basics. Fundamentally, concurrency bugs come from shared mutable state. All three words need to be active:
- shared: more than one thread can reach it
- mutable: it can be changed after creation
- state: it holds data that the code's behaviour depends on
Looking through the test setup, almost everything is newly constructed for each test run - except for one thing: ProfilingActivationListener.activeListeners. This is a static list shared across tests. But the list is a thread-safe implementation, and listeners are only ever added to it and removed from it; in other words, it is completely thread-safe.
But "thread-safe" does not mean "free of races."
I need to go back to an outline of the sequence of activity, focusing specifically on how that one shared structure is used:
1. Test A creates a SamplingProfiler, and registers listener Listener_A into the activeListeners.
2. When Test A shuts down, Listener_A is due to be removed from activeListeners, but as the deregistration runs asynchronously, it is not immediately removed on test shutdown.
3. Test B (the test that fails) starts, constructs its own SamplingProfiler and SpanAnchoredClock, and activates a span. The activation is delivered to every listener still in activeListeners - including the stale Listener_A, which performs its lookup using the SpanAnchoredClock from Test A with the Test A map.
This explains everything:
- the anchor was inserted into Test B's map, but the failing lookup went to Test A's map via the stale Listener_A, so the key was never there and get() returned null
- the window only exists while the asynchronous deregistration of Listener_A is still pending, which is why the failure is intermittent
- and because the window depends on thread scheduling and shutdown timing, it showed up only occasionally and only in CI
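To show the mechanism in miniature, here is a runnable sketch of the same shape of race. The Listener and registry here are illustrative stand-ins, not the real ProfilingActivationListener or SamplingProfiler classes, and the delayed removal simulates the asynchronous deregistration.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StaleListenerRaceSketch {
  // The one piece of shared mutable state: a static, thread-safe listener registry.
  static final List<Listener> activeListeners = new CopyOnWriteArrayList<>();

  // Each "test" gets its own listener with its own private map
  // (the stand-in for SpanAnchoredClock's anchor map).
  static final class Listener {
    final Map<Object, Long> anchors = new ConcurrentHashMap<>();

    Long onActivation(Object span) {
      return anchors.get(span); // null if the span was anchored in a different listener's map
    }
  }

  public static void main(String[] args) {
    ExecutorService background = Executors.newSingleThreadExecutor();

    // "Test A": registers its listener, then shuts down; deregistration happens asynchronously.
    Listener listenerA = new Listener();
    activeListeners.add(listenerA);
    background.submit(() -> {
      sleep(50); // simulated asynchronous shutdown delay
      activeListeners.remove(listenerA);
    });

    // "Test B": registers its own listener and anchors its span in its own map only.
    Listener listenerB = new Listener();
    activeListeners.add(listenerB);
    Object span = new Object();
    listenerB.anchors.put(span, 42L);

    // Test B's activation is delivered to every registered listener, including the stale
    // listenerA, whose map has never seen this span - that lookup returns null.
    for (Listener listener : activeListeners) {
      System.out.println(listener.onActivation(span)); // null for listenerA, 42 for listenerB
    }

    background.shutdown();
  }

  static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}

In the real test suite the overlap only happens when shutdown of the previous test's profiler is slow enough to still be pending when the next test activates a span, which is why the failure is rare and timing dependent.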
The concurrency bug turned out not to be a bug in the profiler at all, but a race created by the test harness. Even perfectly thread-safe components can interact badly if tests share static or global state.
The null check in the map accessor is a safe workaround because the root cause doesn't exist outside the test environment.