
Serial Bus Reliability: Lock-Up Recovery, Watchdogs, Redundancy


Reliability means two things: the bus should not lock up, and when it does, it must recover on its own, quickly, with measurable evidence that it did.

This page turns “random lock-ups” into a closed-loop system: detect → isolate → recover → verify → monitor, with timeouts, watchdogs, bypass/redundancy, and production-ready metrics.

H2-1 · Definition & Scope Guard (Reliability for Serial Peripheral Buses)

Intent

Reliability is constrained to four outcomes: no lock-ups, self-recovery when lock-ups occur, verifiable recovery, and production/field observability.

Protocol details (I²C rise-time math, SPI mode sampling windows, UART baud derivations) are intentionally kept out of this chapter to prevent sibling-page overlap.

A) Engineering definition (operational)
  • Lock-up: progress stops (no forward movement in bus or controller state) beyond a defined timeout window.
  • Self-recovery: automated, bounded-time actions restore a known-good state without manual intervention.
  • Verifiable: recovery is accepted only after a post-recovery health check passes (not just “traffic resumes”).
  • Observable: every lock-up and recovery leaves a minimal, consistent evidence trail for production and field analytics.
B) Four SLOs (define success in numbers)
1) Lock-up rate
  • Definition: lock-up events / denominator
  • Denominator options: per 1k transactions · per hour · per boot
  • Window: rolling W minutes (e.g., 10/60/1440) + aggregation by build version
  • Pass criteria: < X / 1k (placeholder)
2) Recovery success rate
  • Definition: accepted recoveries / recovery attempts
  • Accepted recovery: traffic resumes and post-check passes (health probe / sanity read / loopback)
  • Bucket by: recovery level (soft-unwedge → block reset → domain reset → isolate/bypass)
  • Pass criteria: > Y% (placeholder)
3) MTTR (Mean Time To Recover)
  • Start timestamp: first detection trigger (timeout / stuck-line / watchdog pre-timeout)
  • Stop timestamp: recovery verification pass (not merely “bus active”)
  • Report: median + p95 (tail behavior matters for field stability)
  • Pass criteria: < Z ms (placeholder)
4) Data safety (write consistency)
  • Goal: no partial commit across lock-up/recovery boundaries
  • Hook: idempotent transactions / journaled writes / write-protect on brown-out
  • Evidence: commit markers + recovery replay logs + error injection results
  • Pass criteria: “no partial writes” under fault injection (placeholder)
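The first and third SLOs can be computed directly from raw counters. A minimal C sketch, assuming the nearest-rank percentile method (function names and the percentile choice are illustrative, not defined by this page):

```c
#include <assert.h>
#include <stddef.h>

/* SLO 1: lock-up events per 1k transactions. */
static double lockup_rate_per_1k(unsigned lockups, unsigned transactions)
{
    return transactions ? (double)lockups * 1000.0 / transactions : 0.0;
}

/* SLO 3: nearest-rank percentile over ascending-sorted MTTR samples (ms),
 * so dashboards can report median (p50) and tail (p95). */
static unsigned mttr_percentile_ms(const unsigned *sorted_ms, size_t n,
                                   unsigned pct)
{
    size_t rank;
    if (n == 0) return 0;
    rank = (pct * n + 99) / 100;   /* ceil(pct * n / 100) */
    if (rank == 0) rank = 1;
    return sorted_ms[rank - 1];
}
```

Reporting p95 next to the median matters because the tail, not the average, drives field stability.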
C) Typical failure domains (where lock-ups originate)
Host controller

State-machine stalls, FIFO/DMA starvation, missed IRQs, clock/reset domain races, or bus logic stuck in BUSY/WAIT.

PHY / bridges / extenders

Enable/UVLO behaviors, direction-control mistakes (especially with open-drain), delay skew, or failure-domain coupling across segments.

Cabling / connectors

Intermittent contact, common-mode rise with long returns, hot-plug transients, and ESD/surge aftermath that shifts thresholds.

Peer devices

Busy cycles, incomplete resets, brown-out “half-alive” states, or internal write/commit phases that stop responding.

Power / grounding / hot-plug

Brown-out and ghost-powering create “logic anomalies”: lines driven unexpectedly, pins clamped, or reset not fully asserted.

D) Scope Guard (prevent overlap across sibling pages)
In-scope (deep)
  • SLO definitions and measurement windows (what “good” means).
  • Lock-up taxonomy + minimal evidence points (fast classification).
  • Detection hooks, recovery escalation logic, watchdog layering, and redundancy/bypass concepts.
  • Verification acceptance rules and production/field logging requirements.
Out-of-scope (link only)
  • I²C rise-time/pull-up math and mode-specific timing tables.
  • SPI CPOL/CPHA mode-by-mode sampling window derivations.
  • UART baud-error budgeting derivations and clock-selection deep dive.
  • Full EMC/edge-control theory and topology planning deep detail.
Minimal deliverables (this chapter sets the contract for the rest of the page)
  • Unified definitions for “lock-up”, “recovery success”, and “MTTR”.
  • Event evidence fields (snapshot) required at detection time.
  • Escalation ladder concept (soft → hard → isolate/bypass), with placeholder thresholds.
Diagram · Reliability closed-loop map (Failure → Detect → Recover → Monitor)
[Diagram text: failure sources (electrical/SI, timing/protocol edges, firmware/state machine, power/reset/hot-plug, environment/cable/ESD) feed the closed-loop pipeline Detect → Isolate → Recover → Verify → Monitor; outputs are SLOs (lock-up · success · MTTR), evidence-snapshot logs, and field update hooks, which feed back into detection.]

H2-2 · Failure Mode Taxonomy (What “Lock-up” really means)

Intent

A lock-up must become a classifiable event. Classification is driven by the fastest observable evidence (pin/register/counter), not by guesswork or symptom narratives.

Each class below includes: symptoms, minimal 1-second evidence, a first safe action (non-destructive), and an escalation trigger (placeholder thresholds).

How to use this taxonomy (fast path)
  1. Freeze traffic: prevent additional damage and preserve evidence.
  2. Capture minimal evidence: line levels + controller state + error counters + reset cause.
  3. Classify: map evidence to a class (line/controller/peer/power/software).
  4. Escalate safely: start with non-destructive actions; upgrade only when triggers are met.
Class A · Line stuck / electrical wedge
PIN-FIRST
  • Typical symptoms: sustained low level, missing edges, repeated start attempts without progress, “bus busy” never clears.
  • Minimal evidence (1 second): line level held low > T (placeholder) · edge count stalls · glitch density spikes.
  • First safe action: freeze traffic + snapshot line levels + identify which segment/device shares the wedge domain.
  • Escalate when: wedge persists after N non-destructive attempts (placeholder) → move to structured recovery ladder.
Class B · Controller hung / state-machine stall
REG-FIRST
  • Typical symptoms: controller BUSY stuck, FIFO never drains/fills, DMA completion missing, IRQ pending with no handler progress.
  • Minimal evidence (1 second): BUSY flag + FIFO level + underrun/overrun counters + DMA status snapshot.
  • First safe action: stop the transfer engine (DMA/interrupt-driven loop) and preserve a controller snapshot before any reset.
  • Escalate when: transfer engine restarts still show “no forward movement” → reset only the affected controller block before system-wide resets.
Class C · Peer hung / device half-alive
BUCKET
  • Typical symptoms: failures concentrate on one peripheral/segment; failures correlate with write/commit phases; short recovery after power cycle.
  • Minimal evidence (1 second): error counters bucketed by address/device/segment · “busy” readback (if available) · repeated timeouts to the same endpoint.
  • First safe action: isolate the suspected segment/device (bypass/mux or traffic gating) to keep the rest of the system alive.
  • Escalate when: data safety risk appears (partial writes possible) → switch to fail-safe write policy and stronger recovery level.
Class D · Power glitch / brown-out / hot-plug
CAUSE
  • Typical symptoms: temperature/load/hot-plug correlation; intermittent “logic anomalies” after ESD; devices appear powered but behave inconsistently.
  • Minimal evidence (1 second): reset-cause flags (WDT/BOR/POR) + brown-out indicator + rail dip record (if present).
  • First safe action: record cause and avoid partial reinitialization; use a clean reset boundary for the affected domain.
  • Escalate when: repeated causal flags → add supervisor/sequence hooks and tighten post-recovery verification gates.
Class E · Software deadlock / scheduling artifact
CORRELATE
  • Typical symptoms: failures appear only at high throughput; errors look like SI/EMI but track CPU load, queue depth, or ISR latency.
  • Minimal evidence (1 second): underrun/overrun counters + task/ISR latency + ring-buffer drops + throughput discontinuity.
  • First safe action: correlate events to scheduler metrics before hardware changes; verify whether “no forward progress” is a software gate.
  • Escalate when: correlation is strong → treat as budgeting/priority issue (timeouts/retries/queueing) rather than electrical integrity first.
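The evidence-first rule can be sketched as a priority dispatch: the fastest, highest-certainty signal wins. All field and enum names below are illustrative assumptions, not from this page; power flags are checked before peer bucketing because a brown-out can mimic a half-alive peer:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical 1-second evidence snapshot. */
struct evidence {
    bool line_stuck_low;    /* level held low > T, edge count stalled     */
    bool ctrl_busy_stuck;   /* BUSY set, FIFO/DMA making no progress      */
    bool single_endpoint;   /* errors bucket to one address/segment       */
    bool power_cause_flag;  /* WDT/BOR/POR or brown-out indicator set     */
    bool load_correlated;   /* failures track CPU load / ISR latency      */
};

enum lockup_class { CLASS_A_LINE, CLASS_B_CTRL, CLASS_C_PEER,
                    CLASS_D_POWER, CLASS_E_SW, CLASS_UNKNOWN };

static enum lockup_class classify(const struct evidence *e)
{
    if (e->line_stuck_low)   return CLASS_A_LINE;   /* pin-first   */
    if (e->ctrl_busy_stuck)  return CLASS_B_CTRL;   /* reg-first   */
    if (e->power_cause_flag) return CLASS_D_POWER;  /* cause-first */
    if (e->single_endpoint)  return CLASS_C_PEER;   /* bucket      */
    if (e->load_correlated)  return CLASS_E_SW;     /* correlate   */
    return CLASS_UNKNOWN;
}
```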
Diagram · Lock-up taxonomy tree + minimal evidence tags
[Diagram text: a lock-up event branches by fastest evidence — A line stuck (SDA low, edges stop) · B controller hung (BUSY stuck, DMA no-done) · C peer hung (address bucket, busy read) · D power glitch (BOR flag, reset cause) · E SW deadlock (underrun, ISR latency). Rule: evidence-first classification, then the recovery ladder with bounded escalation thresholds (X/Y/Z placeholders).]

H2-3 · Observability & Metrics (Make reliability measurable)

Intent

Reliability cannot be judged by “feel”. It must be measurable, regression-testable, and comparable across builds, environments, and endpoints.

This chapter defines event semantics, a minimal evidence schema, slicing dimensions, and a health score that points to the worst buckets quickly.

A) Event semantics (what gets counted)
Lock-up event
  • Trigger: no forward progress beyond a timeout window.
  • Boundary: from first detection to entering the recovery state machine.
  • De-dup rule: repeated triggers inside one continuous stall are merged into a single event (same bus + same class + same window).
Recovery attempt
  • Unit: one execution of a recovery level (soft → block reset → domain reset → isolate/bypass).
  • Must record: level, step-id, and per-step duration.
  • Purpose: measures escalation pressure and pinpoints the first effective step.
Recovery acceptance
  • Definition: traffic resumes and a post-recovery health check passes.
  • MTTR stop: the verification pass timestamp, not “first activity”.
  • Reason: avoids counting fragile recoveries as “success”.
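The de-dup rule above can be sketched as a small merge filter. The window length and field names are placeholder assumptions; a trigger with the same bus and class inside the window is folded into the ongoing event:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEDUP_WINDOW_MS 1000u   /* placeholder window */

struct lockup_event {
    uint8_t  bus_id;
    uint8_t  class_id;
    uint32_t last_trigger_ms;
    uint32_t trigger_count;     /* triggers merged into this event */
};

/* Returns true if this trigger starts a NEW event; false if it was
 * merged into the ongoing one (same bus + same class + same window). */
static bool lockup_on_trigger(struct lockup_event *ev, uint8_t bus,
                              uint8_t cls, uint32_t now_ms)
{
    bool same = ev->trigger_count > 0 && ev->bus_id == bus &&
                ev->class_id == cls &&
                (now_ms - ev->last_trigger_ms) <= DEDUP_WINDOW_MS;
    if (same) {
        ev->trigger_count++;
        ev->last_trigger_ms = now_ms;
        return false;
    }
    ev->bus_id = bus;
    ev->class_id = cls;
    ev->last_trigger_ms = now_ms;
    ev->trigger_count = 1;
    return true;
}
```

Only the "new event" transitions are counted against the lock-up rate denominator.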
B) Denominator & window (make comparisons fair)
Denominator selection
  • Per 1k transactions: best when load varies widely (throughput changes).
  • Per hour: best for steady-duty systems (trend and alarms).
  • Per boot: best when boot/enum phases dominate risk.
Windowing
  • Rolling window: W minutes for real-time detection and regression (e.g., 10/60).
  • Aggregate window: 24h/7d for build-to-build comparisons.
  • Rule: all dashboards must show the denominator and window explicitly.
C) Minimal log schema (evidence snapshot)

A minimal schema prioritizes stable summaries over raw waveforms. It enables fast triage, production analytics, and consistent cross-version comparisons.

Identity

bus-id · segment-id · role · endpoint (addr/device/port) · speed/mode

Snapshot

line snapshot (level/hold/edge) · controller state (BUSY/FIFO/DMA/IRQ) · error code

Cause hints

reset cause (WDT/BOR/POR) · power flags · last-success token

Recovery trace

recovery step/level · result · duration (step + total) · verification outcome

Must-have slicing dimensions (buckets)

endpoint (addr/device) · speed · temperature band · power state · cable length class · firmware version
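A minimal evidence record covering the four schema groups plus the slicing dimensions might look like the struct below. Field names mirror the groups above; widths and packings are illustrative placeholders, not a wire format defined by this page:

```c
#include <assert.h>
#include <stdint.h>

struct evidence_record {
    /* identity */
    uint8_t  bus_id, segment_id, role;
    uint16_t endpoint;           /* addr/device/port */
    uint8_t  speed_mode;
    /* snapshot */
    uint8_t  line_level;         /* packed level/hold/edge bits   */
    uint8_t  ctrl_state;         /* packed BUSY/FIFO/DMA/IRQ bits */
    uint16_t error_code;
    /* cause hints */
    uint8_t  reset_cause;        /* WDT/BOR/POR flags */
    uint8_t  power_flags;
    uint32_t last_success_token;
    /* recovery trace */
    uint8_t  recovery_level, recovery_step, recovery_result;
    uint16_t step_duration_ms;
    uint32_t total_duration_ms;
    /* slicing dimensions */
    int8_t   temp_band;
    uint8_t  power_state, cable_class;
    uint16_t fw_version;
};
```

Keeping the record small (well under a flash page) makes ring-buffer persistence and telemetry uplink cheap.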

D) Health score (find the worst bucket fast)
Core signals
  • Recovery success rate by bucket (endpoint/speed/temp/version).
  • MTTR distribution (median + p95 tail).
  • Average recovery level (soft vs reset vs isolate share).
  • Retry anomaly ratio (excess retries over threshold X).
  • Recurrence rate within rolling window (same bucket repeats).
Outcome

The health score is a triage tool: it highlights the worst buckets to prioritize fixes, verification, and production gating.

Diagram · Metrics pipeline (event → schema → buffer → storage → dashboard)
[Diagram text: event sources (driver · WDT · monitor) → normalizer (schema · semantics) → ring buffer (rate-limit · persist) → storage/uplink (flash · telemetry · file) → dashboard (SLO · trends · alerts), carrying the evidence snapshot and slicing keys (device · speed · temp · fw version).]

H2-4 · Detection Hooks (Timeouts, line monitors, heartbeat)

Intent

Early detection enables fast recovery. Detection must be layered and evidence-driven to avoid self-inflicted resets.

The detection stack runs from electrical-level certainty (line/power) up to application-level semantics (watchdog). Recovery triggers should require evidence fusion.

A) Detection stack (fast → slow)
  • Line & power: stuck-low, missing edges, brown-out flags (fastest, highest certainty).
  • Controller progress: BUSY/FIFO/DMA/IRQ snapshots (catches stalls before protocol timeouts cascade).
  • Transaction timeouts: peer response, bus idle, DMA completion (policy-driven).
  • App watchdog: service-level progress (slowest, highest semantic value).
B) Timeout layering (define “no progress” precisely)
Transaction timeout

One command/transfer fails to complete. Best default detector, but must tolerate legitimate slow endpoints.

Peer response timeout

Wait-for-response exceeds a bound. Useful when endpoints have known response envelopes and busy phases.

Bus idle / busy timeout

Detects “busy forever”. Requires progress evidence to avoid false alarms under heavy traffic.

DMA completion timeout

High-throughput detector: distinguishes electrical errors from starvation artifacts using underrun/latency counters.

Rule

Timeout thresholds must be produced by a budgeting chapter (later). Short timeouts without evidence fusion can self-trigger recoveries and reduce stability.

C) Line monitor (open-drain vs push-pull)
Open-drain (wired-AND)
  • Detect: held-low duration, release failures, edge disappearance.
  • Evidence: SDA/SCL low > T, edge counter stalled.
  • Guard: slow edges are not lock-up if progress is present.
Push-pull
  • Detect: activity disappearance, burst error patterns, framing error runs.
  • Evidence: missing toggles, repeated framing/parity bursts, persistent abnormal idle.
  • Guard: avoid turning EMI noise into false “stalls” without progress checks.
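For the open-drain case, the detect rule plus its guard can be sketched as a polled monitor: flag a wedge only when the line stays low past T and the edge counter stalls, so slow edges with progress are never reported as lock-up. The threshold and struct layout are placeholder assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define T_STUCK_MS 25u   /* placeholder held-low threshold */

struct line_monitor {
    uint32_t low_since_ms;
    uint32_t last_edge_count;
    bool     was_low;
};

/* Call periodically; returns true when a stuck-low wedge is detected. */
static bool line_monitor_poll(struct line_monitor *m, bool line_high,
                              uint32_t edge_count, uint32_t now_ms)
{
    bool edges_stalled = (edge_count == m->last_edge_count);
    m->last_edge_count = edge_count;

    if (line_high) { m->was_low = false; return false; }
    if (!m->was_low) { m->was_low = true; m->low_since_ms = now_ms; }

    /* guard: low level alone is not a wedge; edges must also stall */
    return edges_stalled && (now_ms - m->low_since_ms) >= T_STUCK_MS;
}
```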
D) Heartbeat probe (isolated from traffic)
  • Non-invasive: probe must not change device state (status/ping class reads).
  • Isolated queue: prevents starvation under bulk traffic (heartbeat must still run when most needed).
  • Evidence fusion: heartbeat miss alone does not trigger hard recovery; it contributes to a fused decision.
  • Rate control: frequency is selected by desired MTTR, bounded to avoid bus load spikes.
False-positive guardrails (avoid self-inflicted recovery)
  • Progress check: if counters advance, treat as slow-not-dead.
  • Load correlation: if failures only appear at high throughput, check starvation artifacts (DMA/latency) before electrical assumptions.
  • Power correlation: if reset-cause and brown-out flags align, classify as power-driven stall first.
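The fusion-plus-guardrail decision can be sketched as one function: no single detector triggers hard recovery, progress downgrades everything to "slow-not-dead", and power flags route to the power branch first. Input names and the trigger levels are illustrative assumptions:

```c
#include <assert.h>
#include <stdbool.h>

struct detect_inputs {
    bool txn_timeout;     /* transaction timeout fired          */
    bool heartbeat_miss;  /* isolated heartbeat probe failed    */
    bool line_abnormal;   /* stuck-low / missing edges          */
    bool progress;        /* counters still advancing           */
    bool high_load;       /* throughput/ISR-latency correlation */
    bool power_flag;      /* brown-out / reset-cause aligned    */
};

enum trigger { TRIG_NONE, TRIG_SOFT, TRIG_POWER_BRANCH, TRIG_HARD };

static enum trigger fuse(const struct detect_inputs *d)
{
    if (d->progress) return TRIG_NONE;            /* slow-not-dead guard  */
    if (d->power_flag) return TRIG_POWER_BRANCH;  /* power-driven first   */
    if (d->txn_timeout && d->heartbeat_miss && d->line_abnormal)
        return TRIG_HARD;                         /* fused: aligned evidence */
    if (d->txn_timeout && d->high_load) return TRIG_SOFT; /* starvation?  */
    if (d->txn_timeout || d->heartbeat_miss) return TRIG_SOFT;
    return TRIG_NONE;
}
```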
Diagram · Detection stack (fast → slow) with evidence fusion and guardrails
[Diagram text: detection stack from fast certainty to semantic value — line + power (stuck-low · missing edges · BOR) → controller progress (BUSY · FIFO · DMA · IRQ) → transaction timeouts (txn · peer · idle · DMA) → app watchdog (service progress). Evidence fusion (line + ctrl + txn + heartbeat) sets the trigger level, guarded by progress/load/power correlation, before entering the recovery ladder.]

H2-5 · Recovery Playbook (Unwedge → reset → re-enumerate)

Intent

Provide a repeatable recovery ladder that turns lock-up handling into an auditable state machine: isolate traffic, capture evidence, attempt minimal unwedge, escalate resets, then re-initialize and verify before resuming.

Protocol bit-level details remain in protocol-specific subpages; this chapter focuses on cross-bus recovery structure, escalation gates, and verification criteria.

A) The 5-step recovery ladder (fixed sequence)
Freeze → Snapshot → Soft unwedge → Reset → Re-init + Verify

Each step must emit a log record: step-id, result, and duration to support MTTR and recovery success measurements.

B) Step definitions (Goal · Evidence · Pass/Escalate)
1) Freeze traffic
  • Goal: stop new transactions to prevent damage amplification and queue churn.
  • Evidence: queue depth, last-success token, active endpoint identifiers.
  • Pass: isolation mode entered; critical writes are blocked during recovery.
2) Snapshot evidence
  • Goal: capture minimal evidence to classify the stall and compare across versions.
  • Evidence: line snapshot, controller state, counters, and reset-cause flags (if present).
  • Pass: schema-complete snapshot persisted to ring buffer or stable storage.
3) Attempt soft-unwedge
  • Goal: recover with minimal intrusion and without losing broader system context.
  • Evidence: progress counters resume, error bursts stop, bus transitions return.
  • Pass: health check passes within T; else escalate.
4) Escalate reset
  • Goal: clear stuck state machines and “half-alive” peripherals deterministically.
  • Evidence: record reset level (controller / peripheral / domain), and reset-cause attribution.
  • Pass: reset completes cleanly; proceed to re-init. Repeated resets indicate the need for isolation/bypass.
5) Re-init + verify + resume
  • Goal: restore stable operation, not just “first activity”.
  • Evidence: re-enumeration (if applicable), baseline transfer, and verification checks.
  • Pass: VERIFY gate meets thresholds (error rate < X, no re-stall for Y).
Escalation gates (soft → hard)
  • Count gate: soft-unwedge fails N times consecutively.
  • Time gate: SUSPECT → VERIFY exceeds T total duration.
  • Risk gate: data safety risk is detected (critical write in-flight, power anomaly); escalate immediately.
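The three gates can be sketched as one escalation function over the ladder levels (0 soft-unwedge → 1 block reset → 2 domain reset → 3 isolate/bypass). N and T_total are the placeholder thresholds from above; the struct and function names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define N_SOFT_FAILS 3u    /* placeholder count gate */
#define T_TOTAL_MS 500u    /* placeholder time gate  */

struct ladder { unsigned level; unsigned fails_at_level; uint32_t t0_ms; };

static unsigned ladder_next_level(struct ladder *l, bool step_failed,
                                  bool data_risk, uint32_t now_ms)
{
    if (data_risk) {                                  /* risk gate: immediate */
        l->level = 3;
        return l->level;
    }
    if (now_ms - l->t0_ms > T_TOTAL_MS && l->level < 3) {  /* time gate */
        l->level++;
        l->fails_at_level = 0;
        return l->level;
    }
    if (step_failed && ++l->fails_at_level >= N_SOFT_FAILS && l->level < 3) {
        l->level++;                                   /* count gate */
        l->fails_at_level = 0;
    }
    return l->level;
}
```

Each returned level change is exactly one "recovery attempt" in the metrics sense, so escalation pressure stays measurable.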
Data safety strategy (policy-level hooks)

Recovery must not turn transient stalls into persistent corruption. The following hooks provide cross-bus data safety without requiring protocol timing detail.

Idempotent writes

Repeat execution must not amplify damage. Design writes so retries are safe or detectable.

Rollback / two-phase

Either commit fully or remain at the previous state. Partial commits must be prevented or recoverable.

Write protection in SUSPECT

When the state machine enters SUSPECT/RECOVER, block critical writes until VERIFY gates pass.

Diagram · Recovery state machine (RUN → VERIFY → RUN) with escalation triggers
[Diagram text: state machine RUN → SUSPECT → SNAPSHOT → SOFT_RECOVER → HARD_RESET (controller/peripheral) → REINIT (enum) → VERIFY → RUN, with triggers (timeout · stuck-line · retry overflow) and the escalation ladder soft → reset → isolate.]

H2-6 · Watchdog Architecture (System WDT + Communication WDT)

Intent

Convert watchdogs from a single last-resort reset into a layered safety system: system health, communication progress, and clean reset gating.

Feeding conditions must be based on forward progress signals, not periodic timers, to avoid “runaway code still feeding the dog”.

A) Why layered watchdogs are required
  • Runaway still feeding: code is broken but continues to pet a standard WDT; window WDT blocks this failure mode.
  • Bus dead, CPU alive: main loop runs but the bus makes no progress; a communication WDT detects stalls via progress counters.
  • Half reset: brown-out creates partial resets and stuck peripherals; power supervisors enforce clean reset timing and sequencing.
B) System WDT (feed = main loop health)
Window WDT option

Window watchdog prevents “feed loops” by requiring feeds to occur within a valid timing window.

Feed conditions (examples)
  • critical task ticks advanced (scheduler progress)
  • event queues not saturated (no persistent backlog)
  • latency budget not exceeded for T window
Policy

When system WDT conditions fail, the preferred action is to enter the recovery ladder rather than immediate full reboot—unless risk gates demand escalation.

C) Communication WDT (feed = bus forward progress)
Progress signals
  • transaction completion counter increments
  • heartbeat probe passes (isolated queue)
  • error burst stays under threshold X
Guard

If progress exists, treat the system as slow-not-dead. Avoid hard resets driven only by slow endpoints or bursty traffic.
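The communication-WDT feed gate can be sketched as a pure predicate over the progress signals listed above. Field names and the burst threshold are placeholder assumptions; the point is that the dog is fed on forward progress, never on a periodic timer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ERR_BURST_MAX 8u   /* placeholder threshold X */

struct comm_wdt_inputs {
    uint32_t txn_done_count;  /* completion counter, must increment */
    uint32_t prev_txn_done;   /* value at the last feed             */
    bool     heartbeat_ok;    /* isolated-queue probe passed        */
    uint32_t err_burst;       /* errors in the current window       */
};

/* Feed only when the bus shows forward progress. */
static bool comm_wdt_may_feed(const struct comm_wdt_inputs *in)
{
    bool progress = in->txn_done_count != in->prev_txn_done;
    return progress && in->heartbeat_ok && in->err_burst < ERR_BURST_MAX;
}
```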

D) Reset-cause logging & clean reset gating
  • Reset cause: distinguish WDT reset, brown-out, and external reset; write cause to the event log for closed-loop analysis.
  • External supervisor: enforce reset width, voltage thresholds, and sequencing to prevent partial resets and stuck peripherals.
Diagram · Layered watchdogs and feeding conditions (System WDT vs Comm WDT + supervisor)
[Diagram text: System WDT feed gate (task tick · queue ok · latency < T) and Comm WDT feed gate (txn done · heartbeat ok · counter++) both trigger the recovery ladder (controller reset · peripheral reset); a power supervisor (UVLO · reset width · sequencing) enforces clean reset gating, and reset cause is written to the log.]

H2-7 · Timeout & Retry Budgeting (Don’t self-trigger failures)

Intent

Convert timeouts from guesswork into a budget row that prevents false triggers, and define retries as a controlled limiter rather than an error amplifier.

Thresholds are derived from worst-case queueing, transfer, peer delays, verification time, and safety margin—then validated with measurable pass criteria.

A) Timeout budget structure (fixed decomposition)
Budget row

Timeout = T_queue(worst) + T_transfer(worst) + T_peer(worst) + T_verify(worst) + T_margin

What each term captures
  • T_queue: RTOS scheduling, DMA service latency, lock contention, queue backlog.
  • T_transfer: worst-case payload and framing overhead at the selected bus speed.
  • T_peer: endpoint busy time and response delay (device internal work), plus jitter.
  • T_verify: health checks required to accept completion and resume safely.
  • T_margin: safety headroom for drift, temperature, and bursty contention.
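The budget row maps directly to arithmetic: only T_transfer is derived (payload plus framing overhead at the bus rate, rounded up); every other term is a measured worst case. A minimal sketch with placeholder names:

```c
#include <assert.h>
#include <stdint.h>

struct timeout_budget_us {
    uint32_t t_queue, t_peer, t_verify, t_margin;  /* measured worst cases */
};

/* ceil((payload + framing overhead) / bus rate), in microseconds */
static uint32_t transfer_time_us(uint32_t payload_bits,
                                 uint32_t overhead_bits, uint32_t bus_hz)
{
    uint64_t bits = (uint64_t)payload_bits + overhead_bits;
    return (uint32_t)((bits * 1000000u + bus_hz - 1) / bus_hz);
}

/* Timeout = T_queue + T_transfer + T_peer + T_verify + T_margin */
static uint32_t timeout_us(const struct timeout_budget_us *b,
                           uint32_t payload_bits, uint32_t overhead_bits,
                           uint32_t bus_hz)
{
    return b->t_queue
         + transfer_time_us(payload_bits, overhead_bits, bus_hz)
         + b->t_peer + b->t_verify + b->t_margin;
}
```

For example, 900 bits on a 100 kHz bus costs 9 ms of T_transfer alone, which is why transaction timeouts must never be guessed from "feels slow".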
B) Tiered timeouts (T1/T2/T3) and escalation mapping
T1 · Transaction timeout

Use for first-line detection. Response is soft retry with rate limiting.

T2 · No-progress / peer response

When progress counters stop, upgrade to re-init or soft-unwedge.

T3 · Systemic stall / stuck-line

Reserved for hard failures. Response is recovery ladder escalation (reset or isolate).

Rule

Escalation must be gated by evidence consistency (no progress + aligned signals), not by time alone.

C) Retry strategy (limiter, not amplifier)
Backoff

Prefer exponential or capped backoff to prevent retry storms during bursts, noise, or shared-bus contention.

Caps
  • max retries: N
  • max retry time: T_total
  • rate limit: R retries per window
Tiered actions

Escalate from soft retry → re-init → recovery ladder when caps are exceeded or risk gates are raised.
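A limiter with capped exponential backoff and hard caps can be sketched as below. The constants are placeholders; returning -1 signals that caps are exhausted and the caller must escalate instead of retrying again:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RETRIES     5u   /* placeholder: max retries N        */
#define MAX_TOTAL_MS  200u   /* placeholder: max retry time       */
#define BACKOFF_CAP_MS 64u   /* placeholder: backoff ceiling      */

struct retry_ctx { unsigned attempt; uint32_t spent_ms; };

/* Returns the next backoff delay in ms (1, 2, 4, ... capped), or -1
 * when caps are exceeded and the caller must escalate. */
static int32_t retry_next_delay_ms(struct retry_ctx *r)
{
    uint32_t delay = 1u << (r->attempt > 6 ? 6 : r->attempt);
    if (delay > BACKOFF_CAP_MS) delay = BACKOFF_CAP_MS;
    if (r->attempt >= MAX_RETRIES || r->spent_ms + delay > MAX_TOTAL_MS)
        return -1;                       /* caps exceeded: escalate */
    r->attempt++;
    r->spent_ms += delay;
    return (int32_t)delay;
}
```

The growing delay is what prevents retry storms on a shared bus; the caps are what turn retries into a limiter rather than an amplifier.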

D) DMA / RTOS coupling (avoid “SI-lookalike” timeouts)
  • Symptom: CRC errors and timeouts rise only at high throughput.
  • Evidence: underrun/overrun counters correlate with CPU load, ISR latency, or queue depth.
  • Control: budget T_queue explicitly; gate escalation on forward progress and evidence alignment.
Pass criteria template
  • false timeout trigger rate < X / 1k
  • recovery success rate > Y%
  • MTTR (median / p95) < Z ms
Diagram · Budget row timeline (queue → transfer → peer → verify → margin) with T1/T2/T3 set points
[Diagram text: worst-case timeline queue (DMA · sched) → transfer (len · rate) → peer (busy · delay) → verify (health gate) → margin (guard), with T1 (soft retry), T2 (re-init), and T3 (ladder) set-points. Set-points must align with evidence — no progress + counters + line signals — to avoid time-only escalation.]

H2-8 · Redundancy Patterns (Dual bus / fallback mode / safe degrade)

Intent

Keep the system usable under partial failures by applying redundancy patterns: A/B paths, controller failover, safe degrade modes, and alternate routes—driven by health monitoring and hysteresis.

Switching must be gated, rate-limited, and transaction-safe to avoid thrashing and duplicate writes.

A) Pattern library (what it protects)
Dual bus (A/B)

Protects against wiring, connector, transceiver, and segment faults by maintaining an alternate physical path.

Dual controller (hot/cold standby)

Protects against controller hang and software deadlocks by transferring ownership under controlled gating.

Safe degrade mode

Keeps core service available by reducing speed, write rate, or optional features when health degrades.

Alternate route (bridge / bypass)

Routes around bad endpoints or segments using a selector, mux, or bridge for containment and continued operation.

B) Switching policy (evidence-gated, anti-thrash)
  • Trigger: N consecutive failures or no-progress time > T, aligned with health score drop.
  • Hold: minimum dwell time T_hold after switching.
  • Return: require a stable pass window (not a single success) to switch back.
  • Gray verify: return via limited traffic or non-critical operations before full resume.
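The anti-thrash policy can be sketched as a small selector state machine: switch after N consecutive failures, hold for T_hold, and return only after a stable pass window on the primary path. All thresholds and names are placeholder assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define N_FAIL        3u     /* placeholder trigger gate */
#define T_HOLD_MS  1000u     /* placeholder dwell time   */
#define N_RETURN_PASS 5u     /* placeholder pass window  */

struct selector {
    bool     on_backup;
    unsigned consec_fail, consec_pass;
    uint32_t switched_ms;
};

static void selector_update(struct selector *s, bool ok, uint32_t now_ms)
{
    if (!s->on_backup) {
        s->consec_fail = ok ? 0 : s->consec_fail + 1;
        if (s->consec_fail >= N_FAIL) {        /* trigger gate */
            s->on_backup = true;
            s->switched_ms = now_ms;
            s->consec_pass = 0;
        }
    } else {
        /* ok here = primary path passed gray verification */
        s->consec_pass = ok ? s->consec_pass + 1 : 0;
        if (now_ms - s->switched_ms >= T_HOLD_MS &&   /* hold gate   */
            s->consec_pass >= N_RETURN_PASS) {        /* return gate */
            s->on_backup = false;
            s->consec_fail = 0;
        }
    }
}
```

A single good transaction never flips the selector back; only the dwell time plus a stable pass window does, which is what prevents thrashing.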
C) Transaction consistency (avoid duplicate writes)
Transaction token

Use sequence IDs or tokens to de-duplicate retries and switchovers across A/B paths.

Two-phase / rollback hooks

Ensure writes either commit fully or remain unchanged; switching during critical writes must be gated.

Write gate during switching

Block critical writes until the selected path passes VERIFY gates and the hold window is satisfied.

Diagram · A/B paths with health monitor, selector gating, and return hysteresis
[Diagram text: a health monitor (progress · errors · power) scores the A and B paths (bus/segment + devices each) and drives a MUX selector, with hysteresis (hold time, stable-pass return window), in front of the service.]

H2-9 · Bypass / Isolation Channels (Mux, switches, isolators, power gating)

Intent

Translate recovery actions into hardware-controlled fault domains: isolate bad segments/endpoints, preserve service on healthy domains, and keep a reachable debug path.

The goal is containment and repeatability: every isolation/bypass action is controllable, observable, and safe under partial power states.

A) Fault-domain partition (design boundary before device choice)
Host domain

Controller/software faults must not take down all endpoints; recovery control must remain alive.

Bus fabric domain

Switch/mux elements must support safe disconnect/reconnect and report their selected state.

Segment domain

Long traces/cables and connectors are frequent fault sources; segment isolation prevents “one bad branch kills all”.

Endpoint domain

Misbehaving peripherals must be quarantinable; critical resets and straps must be reachable for forced recovery.

Rule

Every domain boundary must support two properties: disconnect control and debug observability.

B) Bypass & quarantine patterns (disconnect the bad actor)
  • Segment isolate: cut a noisy/shorted branch while keeping the trunk running; verify error-rate drops after isolation.
  • Endpoint quarantine: detach one peripheral when its health bucket degrades; keep service for the remaining endpoints.
  • Bypass route: switch around a suspect node via an alternate path; apply hysteresis and VERIFY gates before resuming writes.
C) Isolation and extenders (reliability-only requirements)
Containment

Prevent external faults (cable, ground shift, ESD aftermath) from propagating into the host domain.

Delay budgeting

Isolation/extension latency must be included in timeout budgets and verified under worst-case conditions.

Observability

Enable status and fault flags must be loggable to reconstruct isolation state during recovery.

D) Forced recovery access (pins, straps, power gating)
Control points
  • controller reset
  • endpoint reset / boot strap
  • mux/switch enable (disconnect/reconnect)
  • domain power gate (off → on)
Production reachability

Each control point must be reachable via test pads, GPIO, or debug interfaces, and must produce an action receipt in logs.

Reliability traps to block (bypass must not create new lock-up modes)
  • Unpowered-side clamping: disconnected or unpowered domains must not clamp line levels.
  • Unknown selector state on reset: define safe default path and deterministic reset behavior.
  • Direction control mismatch: preserve open-drain semantics and prevent accidental push-pull contention.
  • Lost debug after isolation: keep a read-only debug/log path even when endpoints are quarantined.
Diagram · Fault-domain isolation map (host → switch/mux → segments → devices), with control plane and debug survivability
[Diagram text: host domain (controller · log) → bus fabric (switch/mux with EN/SEL/STAT) → segments A/B (EN · power gate) → endpoint domains (RST · EN, quarantine-able), with a recovery-control GPIO plane and a debug path that survives isolation.]

H2-10 · Reliability Traps (Brown-out, hot-plug, ghost-powering)

Intent

Explain “hard-to-reproduce” lock-ups using a causal chain and attach hard hooks: power anomaly evidence, line-state checks, controlled re-initialization, and post-event self-tests.

Traps are treated as system faults: captured, gated, and recovered using deterministic sequences.

A) Brown-out (half-reset behaviors)
  • Mechanism: voltage dip can reset only part of logic, leaving drivers and state machines inconsistent.
  • Symptom: lines are held in unexpected states and progress counters stop even when software is alive.
  • Hook: capture reset-cause and power-good state; route brown-out into a dedicated recovery branch.
B) Ghost-powering (backfeed through IO/ESD paths)
  • Mechanism: an unpowered domain is partially energized through IO paths and clamps.
  • Symptom: “not powered, but still pulling the line” (unexpected line clamp).
  • Hook: require isolation/power-gating rules: lines must release when a domain is off.
C) Hot-plug (transients + state loss)
  • Mechanism: insertion/removal causes transient IO conditions and breaks configuration state.
  • Hard rule: recovery must include re-enumeration / re-initialization and VERIFY gates before writes resume.
  • Hook: record a hot-plug event marker and run post-event self-test (loopback/BIST-lite).
Minimal countermeasure kit (attachable to any bus)
Sequencing gate

Gate bus enable using power-good/reset-cause; avoid half-reset domains joining the bus.

Pre-power line release

Check that lines are not clamped before enabling traffic; isolate suspicious domains early.

Post-power self-test

Run lightweight loopback/BIST and health probes; only then lift write gates.

Hot-plug re-init path

Use re-enumeration + re-initialization + VERIFY gates as a mandatory post hot-plug recovery sequence.
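The four countermeasures above reduce to two gates: when a domain may join the bus, and when critical writes may resume. A minimal sketch; the flag names are illustrative, not a specific supervisor's API:

```c
#include <assert.h>
#include <stdbool.h>

struct join_status {
    bool power_good;
    bool clean_reset;     /* full POR, not half-reset/brown-out residue */
    bool lines_released;  /* pre-power check: nothing clamping the bus  */
    bool self_test_pass;  /* loopback / BIST-lite after power           */
};

/* Sequencing gate: may this domain join the bus (traffic enabled)? */
static bool may_enable_bus(const struct join_status *s)
{
    return s->power_good && s->clean_reset && s->lines_released;
}

/* Write gate: critical writes resume only after the post-power
 * self-test passes on top of the sequencing gate. */
static bool may_lift_write_gate(const struct join_status *s)
{
    return may_enable_bus(s) && s->self_test_pass;
}
```

After a hot-plug event, both gates are re-evaluated from scratch as part of the mandatory re-enumeration path.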

Diagram · Power anomaly → lock-up causal chain with minimal hooks
[Diagram text: brown-out (reset cause), hot-plug (event marker), and IO backfeed (clamp gate) lead to abnormal lines (stuck · missed edges), stuck controllers, and no progress; recovery runs re-init → verify → self-test. Hook evidence early: power flags + line-release checks + event markers.]

Engineering Checklist (Design → Bring-up → Production)

Intent: make reliability executable. Output: checklist + acceptance gates + production hooks.

This chapter compresses the entire reliability stack into a phase-by-phase checklist that can be audited, tested, and carried into production. The checklist is written to avoid “hero debugging”: every recovery action must be triggerable, observable, and measurable.

Design — build-in detection, isolation, and clean reset

  • Detection points: expose BUSY/state bits, retry counters, timeout causes, and line snapshots (SCL/SDA/UART RX/TX/SPI SCLK/MISO/MOSI) in a single “evidence record.”
  • Isolation & segmentation: ensure each fault domain can be disconnected (mux/switch) without taking down the whole bus.
  • Clean reset path: guarantee a deterministic reset cause and reset timing (power supervisor + reset pin reachability).
  • Power gating option: reserve a controlled power-cut path to hard-reset a misbehaving peripheral segment.
  • Data safety hooks: write operations must be idempotent or guarded (write-protect, journaling, commit markers, or rollback flags).
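The detection bullets above all feed one "evidence record." A minimal sketch of such a record, assuming an I²C-style two-wire line snapshot; every field name and width is an illustrative placeholder, not a real driver layout:

```c
#include <stdint.h>

/* One evidence record per suspect event, sized for a small ring buffer.
 * All names/widths are hypothetical placeholders. */
typedef struct {
    uint8_t  bus_id;
    uint8_t  role;           /* 0 = controller, 1 = peripheral        */
    uint32_t speed_hz;
    uint8_t  busy_irq_bits;  /* BUSY flag, pending IRQ, DMA active    */
    uint32_t dma_progress;   /* bytes moved by DMA so far             */
    uint32_t last_ok_age_ms; /* age of last successful transaction    */
    uint8_t  line_snapshot;  /* packed line levels at capture time    */
    uint8_t  timeout_cause;
    uint16_t retry_count;
} evidence_record_t;

/* Pack an I²C line snapshot: bit1 = SCL, bit0 = SDA. */
uint8_t pack_line_snapshot(int scl, int sda)
{
    return (uint8_t)(((scl & 1) << 1) | (sda & 1));
}
```

Keeping the record fixed-size and packed makes it cheap to snapshot from an ISR before a reset clears the evidence.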
Example building blocks (material numbers; verify package/suffix/value/availability)
  • I²C mux / fault-domain split: TI TCA9548A, NXP PCA9548A
  • Hot-swap / stuck-bus containment: TI TCA4311A, NXP PCA9511A
  • Bidirectional I²C level shifting: TI PCA9306 (or NXP PCA9306 variants)
  • Long cable / noisy reach (differential or buffered): NXP PCA9615, NXP P82B96
  • I²C isolation (fault-domain separation): TI ISO1540, ADI ADuM1250
  • Bypass / selector (multi-signal): TI TMUX1574 (SPDT multi-channel switch)
  • Supervisor / deterministic reset: TI TPS3808
  • Watchdog timer: TI TPS3431
  • Load switch for hard power-cycle: TI TPS22918
  • Low-C ESD arrays (port robustness): TI TPD4E05U06

Bring-up — prove recovery works under fault injection

  • Fault injection matrix: cable unplug/replug, short-to-GND, slow peripheral response, brown-out pulse, hot-plug spike, and “half-reset” scenarios.
  • Recovery state machine: run Freeze → Snapshot → Soft-unwedge → Hard-reset → Re-init → Verify → Resume, and log each transition.
  • Quantify MTTR: record recovery duration distribution (P50/P95/P99), not just averages.
  • False-trigger control: verify timeouts/retries do not self-trigger lock-ups during high throughput / heavy RTOS load.
  • Acceptance gates: Recovery success rate ≥ X%; MTTR ≤ Y ms; false recovery triggers ≤ Z / 1k (placeholders).
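The escalation ladder exercised during bring-up can be sketched as a table of actions tried in order, where a level is accepted only when its VERIFY post-check passes. The action type and demo stubs below are assumptions for illustration:

```c
#include <stdbool.h>

typedef enum {
    REC_SOFT_UNWEDGE = 0, /* clock out stuck bits / abort transfer */
    REC_BLOCK_RESET,      /* reset the bus controller block        */
    REC_DOMAIN_RESET,     /* reset or power-cycle the segment      */
    REC_ISOLATE           /* disconnect the branch, keep system up */
} rec_level_t;

/* Each action performs its recovery step AND its post-check; it
 * returns true only when the VERIFY gate passed. */
typedef bool (*rec_action_t)(void);

/* Try each level in order; return the level whose post-check passed,
 * or -1 if even isolation failed (escalate to system policy). */
int run_recovery_ladder(const rec_action_t actions[4])
{
    for (int lvl = REC_SOFT_UNWEDGE; lvl <= REC_ISOLATE; ++lvl)
        if (actions[lvl] && actions[lvl]())
            return lvl;
    return -1;
}

/* Demo stubs: soft unwedge and block reset fail, domain reset works. */
static bool fail_action(void) { return false; }
static bool pass_action(void) { return true; }
static const rec_action_t demo_ladder[4] =
    { fail_action, fail_action, pass_action, pass_action };
static const rec_action_t all_fail[4] =
    { fail_action, fail_action, fail_action, fail_action };
```

Because the accepted level is returned (not just "recovered"), each event can be bucketed by recovery level exactly as the SLO section requires.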
Debug access & bridges (examples; pick per lab/production constraints)
  • I²C/SPI ↔ UART bridge for remote console: NXP SC16IS750
  • USB ↔ UART (host debug): FTDI FT232R, Silicon Labs CP2102N
  • USB ↔ I²C/UART (mixed debug): Microchip MCP2221A
  • UART PHY layering examples: TI TRS3232E (RS-232), TI THVD1550 (RS-485)

Production — make reliability testable and diagnosable at scale

  • Fixture-triggerable recovery: test fixtures must be able to force reset, power-cycle a segment, and verify the bus is released afterward.
  • Reset cause pipeline: store WDT vs supervisor vs brown-out cause, plus a short evidence snapshot.
  • Log schema stability: freeze a minimal schema so field logs remain comparable across firmware revisions.
  • RMA bucketing rules: bucket by failure class (line stuck / controller hung / peer hung / power anomaly / software deadlock) and by MTTR tier.
  • Yield gates: lock-up rate ≤ X / 1k transactions; recovery success ≥ Y%; evidence coverage ≥ Z% (placeholders).
SVG 11 · Checklist pipeline (Design → Bring-up → Production)
[Diagram: reserve hooks in Design (detect points, isolation/mux, clean reset, log schema), inject faults and validate recovery in Bring-up (fault injection, recovery SM, MTTR stats, false triggers), ship with logs in Production (fixture hooks, reset cause, field logs, RMA buckets); each gate uses pass-criteria placeholders (X/Y/Z) to keep the flow measurable.]

Applications & IC Selection Logic (Reliability-first)

No product recommendations — only selection rules · example material numbers included as anchors

Reliability-driven selection starts from failure cost and maintainability, then chooses isolation/segmentation, clean reset, and observability. Material numbers below are reference anchors to help structure a BOM; final choice must match voltage, timing, temperature, EMC, and qualification needs.

Bucket A — long cable / noisy environment / ground potential risk

  • Primary goal: keep line anomalies from becoming controller lock-ups.
  • Pattern: segment the bus + use an extender or differential physical layer + add robust port protection.
  • Reference parts: NXP PCA9615 (differential I²C), NXP P82B96 (buffer/extender), TI TPD4E05U06 (ESD array).
Pass criteria placeholders: lock-up rate ≤ X / 1k, recovery success ≥ Y%, MTTR ≤ Z ms.

Bucket B — hot-plug / brown-out / ghost-powering suspected

  • Primary goal: guarantee clean reset and controllable hard-recovery.
  • Pattern: hot-swap I²C buffering + supervisor reset gating + optional segment power-cut.
  • Reference parts: TI TCA4311A / NXP PCA9511A (hot-swap buffers), TI TPS3808 (supervisor), TI TPS22918 (load switch for power-cycle).

Bucket C — mixed-voltage domains / need controlled enable and isolation

  • Primary goal: avoid “powered-off side clamps the bus” and prevent lock-ups caused by domain leakage.
  • Pattern: level translator with enable + isolation barrier when fault-domain separation is required.
  • Reference parts: TI PCA9306 (level shifting with EN), TI ISO1540 or ADI ADuM1250 (I²C isolation).

Bucket D — many peripherals / address conflicts / need fast isolation of a bad branch

  • Primary goal: isolate one bad branch without stopping the entire system.
  • Pattern: mux per branch + health monitor-driven disable + optional bypass selector.
  • Reference parts: TI TCA9548A / NXP PCA9548A (I²C mux), TI TMUX1574 (selector/bypass for multi-signal paths).

Bucket E — UART service/debug links with rugged cabling

  • Primary goal: keep debug connectivity reliable while avoiding lock-up cascades into the main system.
  • Pattern: PHY with robust ESD + flow control + watchdog on the debug task + log reset causes.
  • Reference parts: TI TRS3232E (RS-232), TI THVD1550 (RS-485), TI TPS3431 (watchdog), FTDI FT232R / Silicon Labs CP2102N (USB-UART).
Reliability-first selection steps (portable rule set)
  1. Set targets: lock-up rate, recovery success rate, MTTR, evidence coverage (placeholders X/Y/Z).
  2. Pick fault domains: decide what can fail without stopping the system (branch, peripheral, cable segment).
  3. Choose controls: mux/switch/isolation + supervisor/reset + watchdog + optional power-cut.
  4. Define evidence: fixed log fields and counters so failures are classifiable within seconds.
  5. Validate by injection: prove recovery state machine under realistic stress and temperature.
SVG 12 · Selection flow (reliability-driven)
[Diagram: inputs (cable/noise, failure cost, maintainability) feed isolate/bypass decisions, producing segmentation, reset & WDT, and evidence logs; example anchor parts: TCA9548A/TCA4311A (mux/hot-swap), ISO1540/PCA9615 (isolation/reach), TPS3808/TPS3431 (reset/power).]


FAQs (Lock-up, Recovery, Watchdog, Bypass, Redundancy, Production)

Fixed four-line answers · data-driven pass criteria (X/Y/Z placeholders)

These FAQs close out long-tail debugging strictly within this page's boundary: lock-up definition, evidence capture, timeout/retry budgeting, recovery ladder, watchdog layering, bypass/isolation, redundancy hysteresis, and production/field monitoring.

System occasionally locks up but the scope looks “normal” — what state evidence should be checked first?

Likely cause: controller state machine “no-progress” (BUSY/IRQ/DMA) or a peer that stalls without a clear line-level symptom.

Quick check: capture one snapshot record: bus-id/role/speed + controller BUSY/IRQ pending + DMA progress counter + “age of last successful transaction” + line snapshot (stuck-low detector state).

Fix: trigger Snapshot-on-Suspect, add progress-based watchdog (counter increments required), and classify lock-ups by the first missing progress signal.

Pass criteria: evidence coverage ≥ Z%; lock-up rate ≤ X / 1k transactions; MTTR p95 ≤ Y ms.

Making the timeout shorter makes the system less stable — how to tell “self-trigger” from a real fault?

Likely cause: timeout budget below worst-case queue/peer delay, causing false positives and retry storms that amplify load.

Quick check: compare timeout to measured latency percentiles (p95/p99) and correlate timeouts with CPU load/queue depth/ISR latency.

Fix: compute Timeout = T_queue + T_transfer + T_peer + T_verify + margin; add exponential backoff + rate limiting + tiered escalation (soft retry → re-init → ladder).

Pass criteria: false-trigger rate ≤ X / 1k; recovery success ≥ Y%; MTTR p95 ≤ Z ms.
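The timeout budget and backoff from the fix above can be sketched directly; a minimal version, assuming microsecond units and p99-measured worst-case inputs (names are illustrative):

```c
#include <stdint.h>

/* Timeout = T_queue + T_transfer + T_peer + T_verify + margin.
 * Inputs must be measured worst cases (p99), not typical values,
 * or the timeout will self-trigger under load. */
uint32_t timeout_us(uint32_t t_queue, uint32_t t_transfer,
                    uint32_t t_peer, uint32_t t_verify, uint32_t margin)
{
    return t_queue + t_transfer + t_peer + t_verify + margin;
}

/* Exponential backoff with a hard cap to prevent retry storms:
 * delay = base << attempt, clamped to cap_us. */
uint32_t backoff_us(uint32_t base_us, unsigned attempt, uint32_t cap_us)
{
    uint64_t d = (uint64_t)base_us << (attempt > 16 ? 16 : attempt);
    return d > cap_us ? cap_us : (uint32_t)d;
}
```

The 64-bit intermediate and shift clamp keep the backoff well-defined even for pathological attempt counts.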

The watchdog resets the system but the lock-up returns — first check reset cause or power sequencing?

Likely cause: “half reset” (brown-out / weak reset gating) or a peripheral/segment not covered by the reset tree.

Quick check: log reset-cause + power-good history, and verify reset width/sequence across all relevant rails and bus segments.

Fix: enforce supervisor-gated reset, ensure every segment/peripheral has a controllable reset or power-cycle path, and block traffic until line-release + verify pass.

Pass criteria: bus released within X ms after reset; repeated-reset loop rate ≤ Y / 24h; recovery success ≥ Z%.

Recovery succeeds but data is occasionally wrong — how to locate an idempotency / write-protect gap?

Likely cause: write transactions retried without de-duplication, or a partial commit during a recovery window.

Quick check: search logs for duplicate transaction tokens / sequence IDs, missing “commit marker,” or writes executed while state = SUSPECT/RECOVER.

Fix: add transaction token + commit marker (or verify-after-write), and gate critical writes until VERIFY passes.

Pass criteria: data mismatch ≤ X / 10k writes; zero duplicate commits; recovery success ≥ Y%.
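The transaction-token + commit-marker fix can be sketched as a tiny journal slot; `journal_slot_t` and `journal_apply` are hypothetical names for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Journal slot: a write becomes visible only once committed, and a
 * replay of the same token is de-duplicated. Illustrative layout. */
typedef struct {
    uint32_t token;     /* unique per logical write                 */
    uint32_t value;
    bool     committed; /* commit marker, set after verify passes   */
} journal_slot_t;

/* Apply a write exactly once: a retried (duplicate) token is a no-op,
 * so a recovery-window retry cannot double-commit. */
bool journal_apply(journal_slot_t *slot, uint32_t token, uint32_t value)
{
    if (slot->committed && slot->token == token)
        return false;          /* duplicate retry: drop it */
    slot->token = token;
    slot->value = value;
    slot->committed = true;    /* in a real system: only after verify-after-write */
    return true;
}
```

In firmware the commit flag would be set only after the verify-after-write read-back, so a reset between write and commit leaves the slot safely replayable.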

Lock-up only appears at high throughput — how to disprove FIFO underrun / scheduling jitter?

Likely cause: DMA underrun/overrun or ISR latency spikes masquerading as protocol/line errors.

Quick check: correlate lock-ups with underrun counters, queue depth, CPU load, and “progress counter freeze” timestamps.

Fix: increase buffering, enforce backpressure, prioritize comm ISR/DMA, and use progress-based timeouts (not pure wall-clock).

Pass criteria: underrun rate ≤ X / 1M bytes; lock-up rate ≤ Y / 1k; MTTR p95 ≤ Z ms.

A new cable reduced lock-up rate but MTTR got longer — did the monitoring definition change?

Likely cause: MTTR start/stop markers changed, denominators/windows changed, or recovery escalates to a heavier step more often.

Quick check: verify metric definitions (per hour / per 1k txn / per boot) and compare recovery-step distribution (SOFT vs HARD vs REINIT).

Fix: freeze schema + MTTR markers, log step transitions with timestamps, and standardize dashboards by the same denominators.

Pass criteria: metric definition stable across releases; MTTR p95 ≤ X ms; recovery-step mix variation ≤ Y%.

Adding bypass/isolation made failures worse — what is the most common power-domain/direction mistake?

Likely cause: unpowered-side clamping (ghost-power) or direction/enable defaults that violate open-drain semantics.

Quick check: measure line levels with the segment powered off, verify EN/SEL reset defaults, and confirm “line release” before enabling traffic.

Fix: add enable sequencing + safe default route, enforce domain-aware gating (PG required), and log isolation state in every event record.

Pass criteria: zero stuck-low while any segment is off; isolation-enabled error rate ≤ X / 1k; recovery success ≥ Y%.

Redundancy switching flaps back and forth — how to set hysteresis to avoid mis-switching?

Likely cause: no minimum hold time, return condition too loose, or a noisy health score.

Quick check: plot health score vs time and count path toggles per hour; check whether hold timer and “consecutive good windows” exist.

Fix: enforce Thold (min residence), require N consecutive pass windows before return, and canary traffic before full cutover.

Pass criteria: switch rate ≤ X / day; availability ≥ Y%; no oscillation bursts (> Z toggles in 10 min).
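The anti-flap fix combines a minimum hold time with N consecutive good windows. A minimal sketch, with thresholds and struct layout as placeholders to tune per product:

```c
#include <stdbool.h>
#include <stdint.h>

/* Failback hysteresis state: return to the primary path only after
 * Thold has elapsed AND need_good consecutive health windows passed.
 * Names/thresholds are illustrative placeholders. */
typedef struct {
    uint32_t hold_ms;     /* Thold: minimum residence on backup path */
    uint32_t need_good;   /* consecutive good windows required       */
    uint32_t elapsed_ms;
    uint32_t good_streak;
} failback_t;

/* Feed one health window; a single bad window resets the streak.
 * Returns true when failback to the primary path is allowed. */
bool failback_step(failback_t *f, uint32_t window_ms, bool window_good)
{
    f->elapsed_ms += window_ms;
    f->good_streak = window_good ? f->good_streak + 1 : 0;
    return f->elapsed_ms >= f->hold_ms && f->good_streak >= f->need_good;
}
```

Resetting the streak on any bad window is what suppresses oscillation from a noisy health score: a marginal path must stay clean for the full N windows before it wins traffic back.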

Lock-up increases at cold/hot temperatures — which fields should be recorded first for fastest diagnosis?

Likely cause: margin shrink (timing, thresholds, leakage) or power droop that changes state-machine behavior.

Quick check: slice metrics by temperature + Vrail + clock source + cable/segment ID + speed + retry counters; compare p95 latency and lock-up rate per bucket.

Fix: apply temperature-aware margins (budget + hysteresis), tighten reset/power gating, and verify recovery ladder under thermal sweep.

Pass criteria: lock-up rate ≤ X / 1k across temperature range; recovery success ≥ Y%; MTTR p95 ≤ Z ms.

Hot-plug always causes one lock-up event — which recovery step is most commonly missing?

Likely cause: missing re-enumeration/re-initialization or missing VERIFY gate before resuming traffic.

Quick check: confirm hot-plug event triggers Freeze → Snapshot → Re-init → Verify, and that traffic remains blocked until line-release check passes.

Fix: enforce a dedicated hot-plug path: isolate segment → power-stabilize → reinit/re-enumerate → verify → resume; gate critical writes during the window.

Pass criteria: resume time ≤ X ms; first-transaction failure rate ≤ Y / 1k; no repeated lock-up bursts after hot-plug.
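One way to enforce the dedicated hot-plug path is to validate the recovery event log against the mandated order. The step enum and checker below are an assumed sketch: retries of earlier steps are allowed, but any skip-ahead (e.g. resuming before VERIFY) is rejected:

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    HP_ISOLATE = 0,   /* isolate segment                 */
    HP_POWER_STABLE,  /* wait for power to stabilize     */
    HP_REINIT,        /* re-enumerate / re-initialize    */
    HP_VERIFY,        /* VERIFY gate (health/sanity read)*/
    HP_RESUME,        /* lift traffic gates              */
    HP_STEP_COUNT
} hp_step_t;

/* Accept a hot-plug recovery only if every mandated step appears in
 * order; repeating an earlier step (a retry) is fine, but jumping
 * ahead of the expected step is a sequencing violation. */
bool hotplug_sequence_ok(const hp_step_t *log, size_t n)
{
    size_t expect = 0;
    for (size_t i = 0; i < n && expect < HP_STEP_COUNT; ++i) {
        if (log[i] == (hp_step_t)expect)
            ++expect;                         /* next mandated step ran */
        else if (log[i] > (hp_step_t)expect)
            return false;                     /* skipped ahead */
    }
    return expect == HP_STEP_COUNT;
}
```

Running this check in a test fixture (or on uploaded field logs) turns "the VERIFY gate is mandatory" from a review comment into a pass/fail signal.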

Field reports only “watchdog reset” — how to distinguish true deadlock vs a watchdog-fed logic bug?

Likely cause: watchdog feed is not tied to real progress; the main loop feeds WDT while comm progress is stalled.

Quick check: compare “last progress timestamp” vs “last feed timestamp”; verify feed requires counter increments (transactions completed / heartbeat ack).

Fix: implement communication watchdog (feed condition = progress), use window watchdog to prevent blind feeding, and log feed reason + reset cause.

Pass criteria: stall detection time ≤ X ms; watchdog cause correctly classified ≥ Y%; false-feed events = 0.
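The progress-conditioned feed can be as small as one comparison; what counts as "progress" (completed transactions, heartbeat acks) is an assumption left to the application:

```c
#include <stdbool.h>
#include <stdint.h>

/* Feed the hardware WDT only if the communication progress counter
 * advanced since the last feed: a busy main loop with a stalled bus
 * must NOT keep the watchdog alive. */
typedef struct {
    uint32_t last_progress;
} comm_wdt_t;

bool comm_wdt_may_feed(comm_wdt_t *w, uint32_t progress_now)
{
    if (progress_now == w->last_progress)
        return false;              /* no forward progress: let the WDT bite */
    w->last_progress = progress_now;
    return true;
}
```

Pairing this gate with a window watchdog (which also rejects too-early feeds) closes both failure modes: blind feeding and feed loops.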

An RMA unit cannot reproduce the issue — which missing production/field log field is most fatal?

Likely cause: evidence gap: reset cause, line snapshot, recovery step/duration, environment bucket, or firmware build ID missing.

Quick check: audit schema completeness and verify a ring-buffer snapshot is captured at “SUSPECT” moment (before reset clears evidence).

Fix: freeze a minimal schema (reset cause + controller state + line snapshot + recovery step + duration + temp/Vrail + fw hash) and enforce upload/retention rules.

Pass criteria: evidence coverage ≥ Z%; time-to-classify ≤ X minutes; “unknown bucket” rate ≤ Y%.