Test Coverage?
In Distributed Systems??!

Rohan Padhye
Carnegie Mellon University

Knock, knock
Distributed systems
Who's there?
(cue laughter)
Knock, knock
Who's there?
Distributed systems, who?
(cue undefined behavior)

Test Coverage?
In Distributed Systems??!

Rohan Padhye
Carnegie Mellon University

(and also: )

A Real Testing Campaign

Your team is testing a Raft-based distributed store. You throw everything at it:

  • Randomized client workloads — mixed reads, writes, CAS ops
  • Fault injection — node crashes, network partitions, clock skew
  • Schedule perturbation — random message delays, reordered RPCs
  • Invariant checking — linearizability, no data loss, leader uniqueness

The Results

Test iterations Cumulative bugs found 0 10k 20k 30k 40k 50k 0 4 8 12 12 bugs found & fixed

The Results

Test iterations Cumulative bugs found 0 10k 20k 30k 40k 50k 0 4 8 12 12 bugs found & fixed 50,000 more runs → 0 new bugs

Looks fine. Ship it?

Were those 50,000 "quiet" runs exploring new territory…
or retreading the same ground?

"How thorough was this testing?"

In unit testing and fuzzing, we have ready tools to answer this question.
In distributed systems testing, it's complicated.

Code Coverage: What It Gives Us

1def transfer(src, dst, amt): 2 if src.balance >= amt: 3 src.balance -= amt 4 dst.balance += amt 5 return True 6 elif src.overdraft_ok: 7 src.balance -= amt 8 dst.balance += amt 9 return True10 return False
  • Progress metric — 6/10 lines covered
  • Actionable gap — lines 6–9 untested: need a test with overdraft_ok
  • Stopping criterion — "90% branch coverage" is at least a legible statement

Coverage ≠ Correctness, But…

100% coverage doesn't mean bug-free.

But it's a necessary-condition check:

"If this branch was never executed,
it was never tested."

AFL: Coverage Guidance for Fuzzing

Generate Execute Measure Prioritize mutate inputs run program branch coverage new coverage → keep FEEDBACK LOOP

Coverage as equivalence classes:
same coverage => redundant behavior
new coverage => new behavior worth exploring

Can we develop a notion of coverage
for distributed systems?

Complex inter-component interactions => code coverage is rarely good enough.

What Are We Even Covering?

In unit testing or fuzzing, the space is paths through code.
In distributed systems, the space is multi-dimensional:

  • States — global snapshots across all nodes
  • Faults — severity, scope, and timing
  • Schedules — message delivery orderings
  • Workloads — concurrent client operations

We need something that measures progress, identifies gaps,
and can provide actionable feedback for further testing.

State-Space Coverage

Reachable state space Visited (7) Unvisited (8)

Coverage = distinct states visited as a fraction of all states.
But what is a state? Is the universe of states finite?

State-Space Coverage: What is a State?

Node A (Leader) role: "leader" term: 5 commitIndex: 42 sendBuf: [msg3,msg7] heapPtr: 0x7fa3... timerState: 1482ms fdTable: [3,7,12,19] Node B (Follower) role: "follower" term: 5 commitIndex: 40 recvBuf: [msg3] heapPtr: 0x3bc1... timerState: 873ms cacheEntries: 1,247 α(s) abstraction function Abstract State Tuple roles: ⟨L, F, F⟩ term: 5 queueSizes: ⟨2,1,0⟩ repLag: ⟨0,2,3⟩ phase: replicating cacheLoad: medium heap pointers, timers, fd tables… discarded by abstraction

An abstraction function extracts only protocol-relevant dimensions.
Choosing the right abstraction is domain-specific.

State-Space Coverage: Exploiting Symmetry

Naive: 3 distinct states Leader: N1 1 2 3 Leader: N2 1 2 3 Leader: N3 1 2 3 Reduced: 1 equivalence class 1 leader & 2 followers (any permutation) SAMC / FlyMC — Semantic-Aware Model Checking Uses protocol-level knowledge to identify semantically equivalent states

Collapsing equivalent behaviors speeds up exploration and gives a more honest picture of coverage.

Leesatapornwongsa et al. — OSDI 2014
Lukman et al. — EuroSys 2019

Fault-State Coverage

Protocol phase → Election Replication Commit Compaction Steady state Fault type: Node crash Partition Clock skew Slow disk Tested Not tested — timing gap

Coverage over fault type × protocol phase (^ combinations).
E.g., From Jepsen's nemesis logs.

Happens-Before (Lamport Timeline) Coverage

Write Read Sync Node A Node B Node C time e₁ e₃ e₆ e₂ e₄ e₇ e₅ e₈ e₃ ∥ e₅ — concurrent!

In model checking, DPOR treats two executions with the same happens-before relation as equivalent.
Lamport timelines are analogous to code paths in sequential code.

Flanagan & Godefroid — "Dynamic Partial-Order Reduction for Model Checking Software" (POPL 2005)

Happens-Before Pair Coverage

Zoomed-out timeline (abstract events) A B C W R W S W S S R Write Read Sync extract pairs per node Pair Coverage Matrix (before → after) W R S W R S Empty cells = gaps to target in future runs
  • Abstracts concrete events to types
    (e.g., "[12:41] Read /foo in 3ms" → "Read")
  • For every pair of abstract event types, have we observed both possible orderings?
  • Much smaller space than full interleavings; reduction analogous to branch coverage for code paths
  • Extends to k-tuple coverage for deeper interactions

Mallory: Grey-Box Fuzzing of Distributed Systems

Run system inject faults Observe Lamport timelines Abstract Happens-before pairs Q-learn policy maximize diversity Reward if new "happens-before pair" observed

Key insight: new "happens-before pairs" are the distributed analogue of "new code branches" in AFL.

Meng, Pîrlea, Roychoudhury & Sergey — CCS 2023

PCT: Probabilistic Concurrency Testing

Serialized event stream (k = 10 events) A₁ A₂ A₃ B₁ B₂ B₃ C₁ D₁ D₂ switch A → B switch C → D A B C D n = 4 nodes P(depth-d bug) ≥ 1 / n·kd-1 n = nodes, k = events, d = bug depth
  • Small model hypothesis: most bugs need only d specific ordering constraints (e.g., crashcommitrollback for d=3)
  • Assumes simulation testing where all node operations can be serialized into a single event stream.
  • Key idea: Aways simulate events from highest-priority node only, and randomly place d−1 priority-change points in the schedule.
  • Each run finds a depth-d bug with probability ≥ 1/nkd−1
  • Doesn't track coverage directly — but gives a probabilistic completeness guarantee as a function of number of runs

Burckhardt, Kothari, Musuvathi & Nagarakatte — ASPLOS 2010

Predicate / Scenario Coverage

Scenario Coverage Checklist "A read occurred during a leader election" "Messages were reordered between replicas A and B" "Two clients observed different leaders simultaneously" "A write was in-flight when a partition healed" "Leader crashed between prepare and commit" "Follower rejoined during compaction with stale log"

Coverage = fraction of predicates witnessed at least once.
Uncovered predicates → actionable: "these scenarios haven't been tested."

Fest: Feedback-Guided Adaptive Testing

1. Bound the search space with PCT (Probablistic Concurency Testing) A₁ A₂ A₃ B₁ B₂ B₃ C₁ C₂ D₁ D₂ D₃ switch switch switch d−1 random priority change points 2. Use coverage as feedback Happens-Before Pair Coverage (before → after) W R S W R S Scenario Coverage "Write in-flight during leader election + follower crash" "Partition healed while two clients see different leaders" Steer schedule generation

Key idea: Use PCT to bound the search space,
then use happens-before pairs + scenario coverage as feedback to guide exploration.

Li, Desai & Padhye — NSDI 2026

Lineage-Driven Fault Injection (LDFI)

1. Observe successful execution A B C OK 2. Trace data lineage A B C critical! OK 3. Cut the critical path A B Still correct? If not → bug found If yes → try pairs, triples…

Alvaro, Rosen, Hellerstein — SIGMOD 2015

Opiniated Summary

ApproachWhat it measuresAssumesGranularityTractabilityActionability
State-space coverageGlobal statesState abstractionMediumMediumMedium
Fault-state coverageFault combo × timingLogging (states)CoarseHighHigh
Happens-before / DPORCausally distinct schedulesLogging (events)FineLowHigh
Happens-before pairsEvent orderingsEvent abstractionFineHighHigh
PCTSmall-depth bug probabilityControlled SimulationCoarseHighLow
Scenario coverageUser-defined predicatesDomain expertiseMediumHighHigh
LDFICritical causal pathsLineage InfoFineMediumVery high

Closing

We started with: "50,000 runs, no new bugs — are we done?"

We now have a vocabulary for answering that more rigorously.

(It was probably the same knock-knock joke)

References

Alvaro et al. — "Lineage-Driven Fault Injection" (SIGMOD 2015)

Burckhardt et al. — "Randomized Testing of Distributed Systems with Probabilistic Guarantees" (ASPLOS 2010)

Flanagan & Godefroid — "Dynamic Partial-Order Reduction for Model Checking Software" (POPL 2005)

Kingsbury — Jepsen (jepsen.io)

Lamport — "Specifying Systems: TLA⁺"

Leesatapornwongsa et al. — "SAMC: Semantic-Aware Model Checking" (OSDI 2014)

Li, Desai & Padhye — "Feedback-guided Adaptive Testing of Distributed Systems Designs" (NSDI 2026)

Lukman et al. — "FlyMC: Highly Scalable Testing of Complex Interleavings in Distributed Systems" (EuroSys 2019)

Meng, Pîrlea et al. — "Greybox Fuzzing of Distributed Systems" (CCS 2023)