▲ Overview

Hard-link TLA+ Verification — Overview

5 topologies + a non-atomic test pattern + per-action fairness

Background: "Hard-link" = a single child node referenced from two distinct parents. Previously called out informally in CLAUDE.md / README.md as a "transactions may always fail and retry" liveness caveat. Production reports of low-frequency dyn_node aborts ("30/30 abort on another env") prompted formal modelling of the protocol under hard-link topologies.

The 5 topologies + non-atomic variant

SpecTopologyBug surface
_hardlink_dynamicParent1 ‖ Parent2 → DynChild1 (sibling parents)Migration race when one parent's bundle pulls the child from the other parent's sub[]
_hardlink_self_collisionR → A → C and R → C (root is also a direct parent)bundle(R) walks both R → C and R → A → C; the is_bundle_root Phase 4 m_missing=false override interferes with the doubled path
_hardlink_4nodeRoot → A, Root → B, A → C, B → C (C shared between A and B)bundle(Root) Phase 4 reachability gating + outer DISTURBED retry; reproduces a production race
_hardlink_externalGN1 → P2 (hardlink), P1 → P2 (P2's packet lives in P1's tree)bundle(GN1) finalises GN1 ~missing while GN2.sub[P2] = Null is unreachable from GN1 → SnapshotConsistency violation without the Phase 4 gate
_hardlink_external_migrationAs above, but the bundle migrates P2 from P1's tree into GN2bundle(GN1) must atomically pull P2 out of P1's sub[] and into GN2's sub[] while a peer races on P1
_hardlink_nonatomicNon-transactional p1->insert(p2) / release(p2) interleaved with transactional opsFinalising in a "limbo" state where bundledBy = A but A no longer references C. Compares master's chain walk vs b23fa954's self-promote

Where this layer sits: Standard (non-hard-link) topology verification is completed by slides_layer2_LLfree.html + slides_layer2_LLfree_3level_en.html / _dynamic_en.html. Hard-link is a separate verification layer covered by this deck and the 6 specs.

Self-collision (R-A-C) — bug repro + fix

BundleUnbundle_hardlink_self_collision.tla

Topology

    R
    ├── A ── C    (R → A → C)
    └── C         (R → C is the hard-link; C is both R's direct child and A's child)

Bug surface (reproduced from a real production trace)

bundle(R) Phase 2 starts: R is missing=TRUE, sub-packets being collected
② Peer thread's insert(R, C) (hard-link registration) CAS overwrites R's wrapper with missing=FALSE
③ Bundling thread's Phase 4 CAS fails → retry
④ Phase 1 re-collects C via a now-stale bundledBy=A branch
⑤ Writes R.sub[A].sub[C] = Nullpacket loss
SnapshotConsistency violated (mirrors C++ Packet::checkConsistensy at transaction_impl.h:870-871)

Proposed fix (confirmed via TLA+)

// bundle() Phase 3 — pseudo-diff
for (child in subnodes) {
    if (local.subpackets[child] == nullptr) continue;  // ← NEW
    // existing CAS to BundledRef(this)
}

Hard-link references (i.e. "parent.sub[c] = Null indicates the packet lives elsewhere") must not be CAS-tagged in Phase 3. After the fix, bundle completes without losing packets.

Rejected alternative: Gating insert(R, C) on ~R.missing (wait for in-flight bundle to clear) — breaks lock-freedom and risks deadlock at 3+ threads. The Phase 3 skip lives entirely inside the bundle protocol and needs no external coordination.

Results

cfgStatesResult
2-thread (before fix)222❌ FAIL — SnapshotConsistency violated
2-thread (after Phase 3 skip-Null fix)114✅ Pass + liveness
3-thread (after fix)760✅ Pass + liveness

4-node — production race reproduction

BundleUnbundle_hardlink_4node.tla (gold standard)

Topology

    Root
    ├── A ──┐
    └── B ──┴── C  (C shared between A and B)

Race pattern

StepAction
T1bundle(Root) Phase 4 attempts m_missing=false finalise
T2.❶Non-atomic release: writes B.sub[C] = Null (ReleaseBCNoMigrate)
T2.❷MigrateCToA: transfers C's parent info to A; re-establishes A.sub[C]

If T2 pauses between ❶ and ❷ while Root's Phase 4 runs, C is unreachable from either parent's sub[] (the production race).

Fix = C++ Phase 4 reachability gate

// transaction_impl.h Phase 4 (commits 87892b35, 92b15f62, 404fa137, b12e1895)
newpacket->m_missing = false;             // clear FIRST (avoid mid-bundle short-circuit)
if (! newpacket->allSubReachable(...)) {
    newpacket->m_missing = true;          // restore
    return BundledStatus::DISTURBED;     // outer snapshot() retries
}

Results

cfgStatesResult
2-thread (before fix)97❌ FAIL — SnapshotConsistency violated
2-thread (Phase 4 gate + DISTURBED retry + WF(MigrateCToA))531✅ Pass + liveness

4-node state-space bounding policy: No retryCount, no CONSTRAINT, no MOD wrap-around. The "in_release" pc state binds the 2-step release into a single atomic-from-caller API call (mirrors the C++ release(B, C) shape). All other variables have finite domains, so TLC exhausts naturally.

External / External-migration — cross-tree races

External (_hardlink_external.tla)

    GN1 ── P2  (hardlink)
    P1  ── P2  (P2's packet lives in P1's tree)

The moment bundle(GN1) finalises GN1 as ~missing, the slot GN2.sub[P2] = Null is no longer reverseLookup-able from GN1 → SnapshotConsistency violation. The Phase 4 reachability gate directly prevents this.

cfgStatesResult
1-thread (with gate)15✅ Pass + liveness
2-thread (with gate)136✅ Pass + liveness

External migration (_hardlink_external_migration.tla)

Same topology, but the bundle atomically migrates P2 from P1's tree into GN2.sub[P2]. Models every step of the bundle pipeline competing with a peer thread racing on P1.

cfgStatesResult
1-thread8✅ Pass + liveness
2-thread~hundreds✅ Pass + liveness
3-thread1,202✅ Pass + liveness (per-action WF on every progress step)

Why external migration needs per-action fairness: This spec attaches WF_vars(...) to each step of the bundle pipeline (BundlePhase1 → BundlePullP1 → BundleCASP2 → BundleUpdateGN1 → BundlePhase4 → BundlePhase5). See next slide.

Per-action fairness — a first in the KAME corpus

_hardlink_4node / _hardlink_external_migration

Typical spec pattern

\* 4-node:
Spec == Init /\ [][Next]_vars
        /\ WF_vars(NextStep)
        /\ \A t \in Threads : WF_vars(MigrateCToA(t))

\* external_migration: per-action WF on every progress step
Spec == Init /\ [][Next]_vars
        /\ WF_vars(NextStep)
        /\ \A t \in Threads :
            /\ WF_vars(BundlePhase1(t))
            /\ WF_vars(BundlePullP1(t))
            /\ WF_vars(BundleCASP2(t))
            /\ WF_vars(BundleUpdateGN1(t))
            /\ WF_vars(BundlePhase4(t))
            /\ WF_vars(BundlePhase5(t))

Why blanket WF_vars(NextStep) is too weak

With the production-faithful retry semantics (Phase 4 reachability failure → return to bundle_phase1, matching the C++ BundledStatus::DISTURBED outer snapshot() retry loop), executions are admitted where the bundling thread retries arbitrarily many times. The retry terminates only after the peer's MigrateCToA fires.

WF on a disjunction only requires that "some enabled action fires." A trace that fires only BundlePhase1/2/4 infinitely still satisfies the property, but this trace violates the production fairness assumption. Per-action WF forces MigrateCToA to fire as long as it stays continuously enabled.

Doesn't "older-wins" subsume this fairness?

LL-free negotiate arbitrates CAS contention on the same linkage. The race here spans three different linkages (T1: Root, T2.❶: B, T2.❷: A); T1 and T2 never CAS the same linkage, so older-wins has nothing to arbitrate. T2's progress depends on OS-level thread scheduling. In TLA+, WF_vars(MigrateCToA) abstracts that OS guarantee.

Why the other hardlink specs don't need per-action WF: the race surface of _hardlink_2level_LLfree, _3level_LLfree, _dynamic variants, atomic_shared_ptr.tla, and the other hardlink specs sits on a single linkage chain, where LL-free priority gating structurally bounds retries and WF_vars(NextStep) suffices.

Non-atomic test pattern

BundleUnbundle_hardlink_nonatomic.tla

Motivation

Decide whether the test pattern restored by remote branch claude/refactor-negotiate-scoped-f7de2 commit b23fa954 — non-Tx p1->insert(p2) / release(p2) interleaved with transactional ops — exposes a real liveness gap on master, or is merely a scheduler artefact.

Modeling

StepAction
NonTxInsertACNon-tx insert of A → C hard-link
TxInsertHardlinkTransactional hard-link registration
NonTxReleaseACNon-tx release of A → C. Leaves C in limbo: bundledBy = A but A no longer references C
TxReleaseHardlinkTransactional hard-link release
dtorDestructor finalises A and C out of limbo

Two finalize variants (CONSTANT UseFixVariant)

VariantMeaning
FALSE (master)Bundle fall-through; walks bundledBy chain → requires scope on Root (heavily contended by peer Tx)
TRUE (b23fa954)Direct self-promote CAS on C's linkage only — no chain walk, no Root contention

Results (all four configurations)

Spec / VariantStatesResult
Spec (per-action WF) + master308✅ Pass + liveness
Spec + fix308✅ Pass + liveness
WeakSpec (blanket WF) + master308✅ Pass + liveness
WeakSpec + fix308✅ Pass + liveness

Conclusion: at this modeling abstraction both paths are theoretically live. The b23fa954 self-promote is a CAS-count optimization — it reduces the chain walk's O(chain × retry) CASes to O(1), but the master fall-through is not livelocked in principle. The ~10% residual hang reported on the remote branch is consistent with real-OS scheduling artefacts (real schedulers do not strictly meet WF in finite time), not a logic gap.

The optimization is gated by KAME_STM_OPTIONAL_OPTIMIZATION (default 1 in kame/transaction_definitions.h). Build with -DKAME_STM_OPTIONAL_OPTIMIZATION=0 to disable the self-promote shortcut and obtain a build that matches the strictly TLA+-modeled master path.

Dynamic (sibling parents) + State-space policy

BundleUnbundle_hardlink_dynamic.tla

Topology

    Parent1   Parent2
       \   /
     DynChild1  (child has both parents)

Coverage

The most basic hard-link structure: two sibling parents sharing one child. Exhausts the migration race where one parent's bundle should pull DynChild1 from the other parent's sub[] while a peer thread races.

Results

cfgStatesResult
1-thread (MaxCommits=1)7✅ Pass + liveness
2-thread (MaxCommits=1)62✅ Pass + liveness
2-thread (MaxCommits=2)703✅ Pass

State-space bounding policy (uniform across all hardlink specs)

Per user feedback (2026-05-20), all hardlink models follow a uniform policy: no retryCount, no CONSTRAINT, no MOD wrap-around. Artificial cutoffs can false-negative liveness verdicts. Every spec terminates naturally on bounded variable domains.

ModelRetry boundBounding mechanism
_hardlink_4noderemoved (gold standard)"in_release" pc state binds the 2-step release into one atomic-from-caller call
_hardlink_self_collisionremovedFinite by bounded domains
_hardlink_externalremovedFinite by bounded domains
_hardlink_external_migrationremovedPer-action WF on every progress step
_hardlink_dynamicnever had oneBounded domains + MaxCommits=1

Shared invariants + impact on other models

Verified invariants / property

Invariant / PropertyMeaning
SnapshotConsistencyMirrors C++ Packet::checkConsistensy — a Null sub-slot is consistent only if reachable via reverseLookup (hard-link path)
HardlinkExclusiveAt most one parent's sub[] holds the child's packet (no duplicate homing)
BundleRefConsistencyThe bundledBy parent has priority and either holds the packet or the child wrapper carries an InsertedRef
NoPriorityLoss / NoMissingHole / MissingPropagationStructural sanity
EventuallyAllDone (PROPERTY)All threads eventually complete — uniform safety+liveness coverage across all hardlink cfgs as of 2026-05-20

C++-side commit history for the Phase 4 reachability gate

CommitContent
87892b35Introduces reachability gate (Packet::allSubReachable)
92b15f62Also adds checkConsistensy handling; mid-bundle short-circuit handling
404fa137Order-of-operations: clear m_missing first
b12e1895Restores order-of-operations subtlety from the global-root parameter introduction
1ffd8dceAdds optional globalroot parameter to allSubReachable / checkConsistensy for hard-link Case B

Impact on non-hard-link models: none

The 2-level / 3-level LL-free (static + dynamic) TLA+ models do not need updating. In non-hard-link topology each child has exactly one parent. When bundle(P) Phase 4 encounters a Null sub-slot for an active child, reverseLookup always succeeds (it is structurally impossible for the packet to live outside P's subtree).
The existing SnapshotConsistency invariant (2level_LLfree_dynamic.tla:1093-1096, 3level_LLfree_dynamic.tla:1831-1835) captures exactly the gate's precondition. The 413 M-state dynamic release superfine live run found no violations, so the C++ else { return DISTURBED } branch is provably unreachable in non-hard-link topologies.

Effect on the C++ test suite

TestEffect
dyn_node_test / 3level_mixed_test / payload_integrity_*Every inserted child's packet lives in exactly one parent's sub[] — no Null sub-slot in steady state — so the fix is a no-op

Summary + references

SpecStatesResult
_hardlink_dynamic (sibling parents, superfine)7 / 62 / 703✅ Pass + liveness
_hardlink_self_collision (R-A-C, after fix)114 / 760✅ Pass + liveness
_hardlink_4node (R+A+B+shared C, Phase 4 gate + DISTURBED retry)531✅ Pass + liveness
_hardlink_external15 / 136✅ Pass + liveness
_hardlink_external_migration (3-thread, per-action WF)1,202✅ Pass + liveness
_hardlink_nonatomic (master/fix × Spec/WeakSpec)308 (each of 4)✅ Pass + liveness

Significance

This layer achieves several KAME-corpus firsts:

References