Background: "Hard-link" = a single child node referenced from two distinct parents.
Previously called out informally in CLAUDE.md / README.md as a
"transactions may always fail and retry" liveness caveat. Production reports of
low-frequency dyn_node aborts ("30/30 abort on another env") prompted formal
modelling of the protocol under hard-link topologies.
| Spec | Topology | Bug surface |
|---|---|---|
_hardlink_dynamic | Parent1 ‖ Parent2 → DynChild1 (sibling parents) | Migration race when one parent's bundle pulls the child from the other parent's sub[] |
_hardlink_self_collision | R → A → C and R → C (root is also a direct parent) | bundle(R) walks both R → C and R → A → C; the is_bundle_root Phase 4 m_missing=false override interferes with the doubled path |
_hardlink_4node | Root → A, Root → B, A → C, B → C (C shared between A and B) | bundle(Root) Phase 4 reachability gating + outer DISTURBED retry; reproduces a production race |
_hardlink_external | GN1 → P2 (hardlink), P1 → P2 (P2's packet lives in P1's tree) | bundle(GN1) finalises GN1 ~missing while GN2.sub[P2] = Null is unreachable from GN1 → SnapshotConsistency violation without the Phase 4 gate |
_hardlink_external_migration | As above, but the bundle migrates P2 from P1's tree into GN2 | bundle(GN1) must atomically pull P2 out of P1's sub[] and into GN2's sub[] while a peer races on P1 |
_hardlink_nonatomic | Non-transactional p1->insert(p2) / release(p2) interleaved with transactional ops | Finalising in a "limbo" state where bundledBy = A but A no longer references C. Compares master's chain walk vs b23fa954's self-promote |
Where this layer sits: Standard (non-hard-link) topology verification is completed by
slides_layer2_LLfree.html + slides_layer2_LLfree_3level_en.html / _dynamic_en.html.
Hard-link is a separate verification layer covered by this deck and the 6 specs.
BundleUnbundle_hardlink_self_collision.tla① bundle(R) Phase 2 starts: R is missing=TRUE, sub-packets being collected
② Peer thread's insert(R, C) (hard-link registration) CAS overwrites R's wrapper with missing=FALSE
③ Bundling thread's Phase 4 CAS fails → retry
④ Phase 1 re-collects C via a now-stale bundledBy=A branch
⑤ Writes R.sub[A].sub[C] = Null → packet loss
⑥ SnapshotConsistency violated (mirrors C++ Packet::checkConsistensy at
transaction_impl.h:870-871)
// bundle() Phase 3 — pseudo-diff
for (child in subnodes) {
if (local.subpackets[child] == nullptr) continue; // ← NEW
// existing CAS to BundledRef(this)
}
Hard-link references (i.e. "parent.sub[c] = Null indicates the packet lives elsewhere")
must not be CAS-tagged in Phase 3. After the fix, bundle completes without losing packets.
Rejected alternative: Gating insert(R, C) on ~R.missing
(wait for in-flight bundle to clear) — breaks lock-freedom and risks deadlock at 3+ threads.
The Phase 3 skip lives entirely inside the bundle protocol and needs no external coordination.
| cfg | States | Result |
|---|---|---|
| 2-thread (before fix) | 222 | ❌ FAIL — SnapshotConsistency violated |
| 2-thread (after Phase 3 skip-Null fix) | 114 | ✅ Pass + liveness |
| 3-thread (after fix) | 760 | ✅ Pass + liveness |
BundleUnbundle_hardlink_4node.tla (gold standard)| Step | Action |
|---|---|
| T1 | bundle(Root) Phase 4 attempts m_missing=false finalise |
| T2.❶ | Non-atomic release: writes B.sub[C] = Null (ReleaseBCNoMigrate) |
| T2.❷ | MigrateCToA: transfers C's parent info to A; re-establishes A.sub[C] |
If T2 pauses between ❶ and ❷ while Root's Phase 4 runs, C is
unreachable from either parent's sub[] (the production race).
// transaction_impl.h Phase 4 (commits 87892b35, 92b15f62, 404fa137, b12e1895)
newpacket->m_missing = false; // clear FIRST (avoid mid-bundle short-circuit)
if (! newpacket->allSubReachable(...)) {
newpacket->m_missing = true; // restore
return BundledStatus::DISTURBED; // outer snapshot() retries
}
| cfg | States | Result |
|---|---|---|
| 2-thread (before fix) | 97 | ❌ FAIL — SnapshotConsistency violated |
2-thread (Phase 4 gate + DISTURBED retry + WF(MigrateCToA)) | 531 | ✅ Pass + liveness |
4-node state-space bounding policy:
No retryCount, no CONSTRAINT, no MOD wrap-around. The "in_release" pc state
binds the 2-step release into a single atomic-from-caller API call (mirrors the C++
release(B, C) shape). All other variables have finite domains, so TLC
exhausts naturally.
_hardlink_external.tla)The moment bundle(GN1) finalises GN1 as ~missing, the slot
GN2.sub[P2] = Null is no longer reverseLookup-able from GN1
→ SnapshotConsistency violation. The Phase 4 reachability gate directly prevents this.
| cfg | States | Result |
|---|---|---|
| 1-thread (with gate) | 15 | ✅ Pass + liveness |
| 2-thread (with gate) | 136 | ✅ Pass + liveness |
_hardlink_external_migration.tla)Same topology, but the bundle atomically migrates P2 from P1's tree into
GN2.sub[P2]. Models every step of the bundle pipeline competing with a peer thread
racing on P1.
| cfg | States | Result |
|---|---|---|
| 1-thread | 8 | ✅ Pass + liveness |
| 2-thread | ~hundreds | ✅ Pass + liveness |
| 3-thread | 1,202 | ✅ Pass + liveness (per-action WF on every progress step) |
Why external migration needs per-action fairness:
This spec attaches WF_vars(...) to each step of the bundle pipeline
(BundlePhase1 → BundlePullP1 → BundleCASP2 → BundleUpdateGN1 → BundlePhase4 → BundlePhase5).
See next slide.
_hardlink_4node / _hardlink_external_migration\* 4-node:
Spec == Init /\ [][Next]_vars
/\ WF_vars(NextStep)
/\ \A t \in Threads : WF_vars(MigrateCToA(t))
\* external_migration: per-action WF on every progress step
Spec == Init /\ [][Next]_vars
/\ WF_vars(NextStep)
/\ \A t \in Threads :
/\ WF_vars(BundlePhase1(t))
/\ WF_vars(BundlePullP1(t))
/\ WF_vars(BundleCASP2(t))
/\ WF_vars(BundleUpdateGN1(t))
/\ WF_vars(BundlePhase4(t))
/\ WF_vars(BundlePhase5(t))
WF_vars(NextStep) is too weakWith the production-faithful retry semantics (Phase 4 reachability failure → return to
bundle_phase1, matching the C++ BundledStatus::DISTURBED outer
snapshot() retry loop), executions are admitted where the bundling thread retries
arbitrarily many times. The retry terminates only after the peer's MigrateCToA fires.
WF on a disjunction only requires that "some enabled action fires." A trace that
fires only BundlePhase1/2/4 infinitely still satisfies the property, but this trace
violates the production fairness assumption. Per-action WF forces
MigrateCToA to fire as long as it stays continuously enabled.
LL-free negotiate arbitrates CAS contention on the same linkage.
The race here spans three different linkages (T1: Root, T2.❶: B, T2.❷: A);
T1 and T2 never CAS the same linkage, so older-wins has nothing to arbitrate.
T2's progress depends on OS-level thread scheduling.
In TLA+, WF_vars(MigrateCToA) abstracts that OS guarantee.
Why the other hardlink specs don't need per-action WF:
the race surface of _hardlink_2level_LLfree, _3level_LLfree,
_dynamic variants, atomic_shared_ptr.tla, and the other hardlink specs
sits on a single linkage chain, where LL-free priority gating structurally bounds
retries and WF_vars(NextStep) suffices.
BundleUnbundle_hardlink_nonatomic.tlaDecide whether the test pattern restored by remote branch
claude/refactor-negotiate-scoped-f7de2 commit b23fa954 — non-Tx
p1->insert(p2) / release(p2) interleaved with transactional ops —
exposes a real liveness gap on master, or is merely a scheduler artefact.
| Step | Action |
|---|---|
❶ NonTxInsertAC | Non-tx insert of A → C hard-link |
❷ TxInsertHardlink | Transactional hard-link registration |
❸ NonTxReleaseAC | Non-tx release of A → C. Leaves C in limbo: bundledBy = A but A no longer references C |
❹ TxReleaseHardlink | Transactional hard-link release |
| dtor | Destructor finalises A and C out of limbo |
UseFixVariant)| Variant | Meaning |
|---|---|
FALSE (master) | Bundle fall-through; walks bundledBy chain → requires scope on Root (heavily contended by peer Tx) |
TRUE (b23fa954) | Direct self-promote CAS on C's linkage only — no chain walk, no Root contention |
| Spec / Variant | States | Result |
|---|---|---|
Spec (per-action WF) + master | 308 | ✅ Pass + liveness |
Spec + fix | 308 | ✅ Pass + liveness |
WeakSpec (blanket WF) + master | 308 | ✅ Pass + liveness |
WeakSpec + fix | 308 | ✅ Pass + liveness |
Conclusion: at this modeling abstraction both paths are theoretically live.
The b23fa954 self-promote is a CAS-count optimization — it reduces the chain
walk's O(chain × retry) CASes to O(1), but the master fall-through
is not livelocked in principle. The ~10% residual hang reported on the remote branch is
consistent with real-OS scheduling artefacts (real schedulers do not strictly meet WF in
finite time), not a logic gap.
The optimization is gated by KAME_STM_OPTIONAL_OPTIMIZATION
(default 1 in kame/transaction_definitions.h). Build with
-DKAME_STM_OPTIONAL_OPTIMIZATION=0 to disable the self-promote shortcut and obtain
a build that matches the strictly TLA+-modeled master path.
BundleUnbundle_hardlink_dynamic.tlaThe most basic hard-link structure: two sibling parents sharing one child.
Exhausts the migration race where one parent's bundle should pull DynChild1 from the other
parent's sub[] while a peer thread races.
| cfg | States | Result |
|---|---|---|
| 1-thread (MaxCommits=1) | 7 | ✅ Pass + liveness |
| 2-thread (MaxCommits=1) | 62 | ✅ Pass + liveness |
| 2-thread (MaxCommits=2) | 703 | ✅ Pass |
Per user feedback (2026-05-20), all hardlink models follow a uniform policy: no retryCount, no CONSTRAINT, no MOD wrap-around. Artificial cutoffs can false-negative liveness verdicts. Every spec terminates naturally on bounded variable domains.
| Model | Retry bound | Bounding mechanism |
|---|---|---|
_hardlink_4node | removed (gold standard) | "in_release" pc state binds the 2-step release into one atomic-from-caller call |
_hardlink_self_collision | removed | Finite by bounded domains |
_hardlink_external | removed | Finite by bounded domains |
_hardlink_external_migration | removed | Per-action WF on every progress step |
_hardlink_dynamic | never had one | Bounded domains + MaxCommits=1 |
| Invariant / Property | Meaning |
|---|---|
SnapshotConsistency | Mirrors C++ Packet::checkConsistensy — a Null sub-slot is consistent only if reachable via reverseLookup (hard-link path) |
HardlinkExclusive | At most one parent's sub[] holds the child's packet (no duplicate homing) |
BundleRefConsistency | The bundledBy parent has priority and either holds the packet or the child wrapper carries an InsertedRef |
NoPriorityLoss / NoMissingHole / MissingPropagation | Structural sanity |
EventuallyAllDone (PROPERTY) | All threads eventually complete — uniform safety+liveness coverage across all hardlink cfgs as of 2026-05-20 |
| Commit | Content |
|---|---|
87892b35 | Introduces reachability gate (Packet::allSubReachable) |
92b15f62 | Also adds checkConsistensy handling; mid-bundle short-circuit handling |
404fa137 | Order-of-operations: clear m_missing first |
b12e1895 | Restores order-of-operations subtlety from the global-root parameter introduction |
1ffd8dce | Adds optional globalroot parameter to allSubReachable / checkConsistensy for hard-link Case B |
The 2-level / 3-level LL-free (static + dynamic) TLA+ models do not need updating.
In non-hard-link topology each child has exactly one parent. When bundle(P) Phase 4
encounters a Null sub-slot for an active child, reverseLookup always succeeds
(it is structurally impossible for the packet to live outside P's subtree).
The existing SnapshotConsistency invariant
(2level_LLfree_dynamic.tla:1093-1096, 3level_LLfree_dynamic.tla:1831-1835)
captures exactly the gate's precondition. The 413 M-state dynamic release superfine live
run found no violations, so the C++ else { return DISTURBED } branch is
provably unreachable in non-hard-link topologies.
| Test | Effect |
|---|---|
dyn_node_test / 3level_mixed_test / payload_integrity_* | Every inserted child's packet lives in exactly one parent's sub[] — no Null sub-slot in steady state — so the fix is a no-op |
| Spec | States | Result |
|---|---|---|
_hardlink_dynamic (sibling parents, superfine) | 7 / 62 / 703 | ✅ Pass + liveness |
_hardlink_self_collision (R-A-C, after fix) | 114 / 760 | ✅ Pass + liveness |
_hardlink_4node (R+A+B+shared C, Phase 4 gate + DISTURBED retry) | 531 | ✅ Pass + liveness |
_hardlink_external | 15 / 136 | ✅ Pass + liveness |
_hardlink_external_migration (3-thread, per-action WF) | 1,202 | ✅ Pass + liveness |
_hardlink_nonatomic (master/fix × Spec/WeakSpec) | 308 (each of 4) | ✅ Pass + liveness |
This layer achieves several KAME-corpus firsts:
kamestm/tests/VERIFICATION.md §5 — full results + per-cfg runskamestm/tests/tlaplus/doc/verification_log.md — run logkamestm/tests/tlaplus/BundleUnbundle_hardlink_*.tla — spec files (6 files)slides_layer2_LLfree.html — base spec (LL-free Gen 3)