
Serial Bus Reliability: Lock-Up Recovery, Watchdogs, Redundancy


Reliability means two things: the bus should not lock up, and when it does, it must recover on its own, quickly, with measurable evidence that it did.

This page turns “random lock-ups” into a closed-loop system: detect → isolate → recover → verify → monitor, with timeouts, watchdogs, bypass/redundancy, and production-ready metrics.

H2-1 · Definition & Scope Guard (Reliability for Serial Peripheral Buses)

Intent

Reliability is constrained to four outcomes: no lock-ups, self-recovery when lock-ups occur, verifiable recovery, and production/field observability.

Protocol details (I²C rise-time math, SPI mode sampling windows, UART baud derivations) are intentionally kept out of this chapter to prevent sibling-page overlap.

A) Engineering definition (operational)
  • Lock-up: progress stops (no forward movement in bus or controller state) beyond a defined timeout window.
  • Self-recovery: automated, bounded-time actions restore a known-good state without manual intervention.
  • Verifiable: recovery is accepted only after a post-recovery health check passes (not just “traffic resumes”).
  • Observable: every lock-up and recovery leaves a minimal, consistent evidence trail for production and field analytics.
B) Four SLOs (define success in numbers)
1) Lock-up rate
  • Definition: lock-up events / denominator
  • Denominator options: per 1k transactions · per hour · per boot
  • Window: rolling W minutes (e.g., 10/60/1440) + aggregation by build version
  • Pass criteria: < X / 1k (placeholder)
2) Recovery success rate
  • Definition: accepted recoveries / recovery attempts
  • Accepted recovery: traffic resumes and post-check passes (health probe / sanity read / loopback)
  • Bucket by: recovery level (soft-unwedge → block reset → domain reset → isolate/bypass)
  • Pass criteria: > Y% (placeholder)
3) MTTR (Mean Time To Recover)
  • Start timestamp: first detection trigger (timeout / stuck-line / watchdog pre-timeout)
  • Stop timestamp: recovery verification pass (not merely “bus active”)
  • Report: median + p95 (tail behavior matters for field stability)
  • Pass criteria: < Z ms (placeholder)
4) Data safety (write consistency)
  • Goal: no partial commit across lock-up/recovery boundaries
  • Hook: idempotent transactions / journaled writes / write-protect on brown-out
  • Evidence: commit markers + recovery replay logs + error injection results
  • Pass criteria: “no partial writes” under fault injection (placeholder)
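The first and third SLOs can be computed directly from raw counters. A minimal C sketch, assuming the nearest-rank percentile method (function names and the percentile choice are illustrative, not defined by this page):

```c
#include <assert.h>
#include <stddef.h>

/* SLO 1: lock-up events per 1k transactions. */
static double lockup_rate_per_1k(unsigned lockups, unsigned transactions)
{
    return transactions ? (double)lockups * 1000.0 / transactions : 0.0;
}

/* SLO 3: nearest-rank percentile over ascending-sorted MTTR samples (ms),
 * so dashboards can report median (p50) and tail (p95). */
static unsigned mttr_percentile_ms(const unsigned *sorted_ms, size_t n,
                                   unsigned pct)
{
    size_t rank;
    if (n == 0) return 0;
    rank = (pct * n + 99) / 100;   /* ceil(pct * n / 100) */
    if (rank == 0) rank = 1;
    return sorted_ms[rank - 1];
}
```

Reporting p95 next to the median matters because the tail, not the average, drives field stability.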
C) Typical failure domains (where lock-ups originate)
Host controller

State-machine stalls, FIFO/DMA starvation, missed IRQs, clock/reset domain races, or bus logic stuck in BUSY/WAIT.

PHY / bridges / extenders

Enable/UVLO behaviors, direction-control mistakes (especially with open-drain), delay skew, or failure-domain coupling across segments.

Cabling / connectors

Intermittent contact, common-mode rise with long returns, hot-plug transients, and ESD/surge aftermath that shifts thresholds.

Peer devices

Busy cycles, incomplete resets, brown-out “half-alive” states, or internal write/commit phases that stop responding.

Power / grounding / hot-plug

Brown-out and ghost-powering create “logic anomalies”: lines driven unexpectedly, pins clamped, or reset not fully asserted.

D) Scope Guard (prevent overlap across sibling pages)
In-scope (deep)
  • SLO definitions and measurement windows (what “good” means).
  • Lock-up taxonomy + minimal evidence points (fast classification).
  • Detection hooks, recovery escalation logic, watchdog layering, and redundancy/bypass concepts.
  • Verification acceptance rules and production/field logging requirements.
Out-of-scope (link only)
  • I²C rise-time/pull-up math and mode-specific timing tables.
  • SPI CPOL/CPHA mode-by-mode sampling window derivations.
  • UART baud-error budgeting derivations and clock-selection deep dive.
  • Full EMC/edge-control theory and topology planning deep detail.
Minimal deliverables (this chapter sets the contract for the rest of the page)
  • Unified definitions for “lock-up”, “recovery success”, and “MTTR”.
  • Event evidence fields (snapshot) required at detection time.
  • Escalation ladder concept (soft → hard → isolate/bypass), with placeholder thresholds.
Diagram · Reliability closed-loop map (Failure → Detect → Recover → Monitor)
[Diagram text: failure sources (electrical/SI, timing/protocol edges, firmware/state machine, power/reset/hot-plug, environment/cable/ESD) feed the closed-loop pipeline Detect → Isolate → Recover → Verify → Monitor; outputs are SLOs (lock-up · success · MTTR), evidence-snapshot logs, and field update hooks, which feed back into detection.]

H2-2 · Failure Mode Taxonomy (What “Lock-up” really means)

Intent

A lock-up must become a classifiable event. Classification is driven by the fastest observable evidence (pin/register/counter), not by guesswork or symptom narratives.

Each class below includes: symptoms, minimal 1-second evidence, a first safe action (non-destructive), and an escalation trigger (placeholder thresholds).

How to use this taxonomy (fast path)
  1. Freeze traffic: prevent additional damage and preserve evidence.
  2. Capture minimal evidence: line levels + controller state + error counters + reset cause.
  3. Classify: map evidence to a class (line/controller/peer/power/software).
  4. Escalate safely: start with non-destructive actions; upgrade only when triggers are met.
Class A · Line stuck / electrical wedge
PIN-FIRST
  • Typical symptoms: sustained low level, missing edges, repeated start attempts without progress, “bus busy” never clears.
  • Minimal evidence (1 second): line level held low > T (placeholder) · edge count stalls · glitch density spikes.
  • First safe action: freeze traffic + snapshot line levels + identify which segment/device shares the wedge domain.
  • Escalate when: wedge persists after N non-destructive attempts (placeholder) → move to structured recovery ladder.
Class B · Controller hung / state-machine stall
REG-FIRST
  • Typical symptoms: controller BUSY stuck, FIFO never drains/fills, DMA completion missing, IRQ pending with no handler progress.
  • Minimal evidence (1 second): BUSY flag + FIFO level + underrun/overrun counters + DMA status snapshot.
  • First safe action: stop the transfer engine (DMA/interrupt-driven loop) and preserve a controller snapshot before any reset.
  • Escalate when: transfer engine restarts still show “no forward movement” → reset only the affected controller block before system-wide resets.
Class C · Peer hung / device half-alive
BUCKET
  • Typical symptoms: failures concentrate on one peripheral/segment; failures correlate with write/commit phases; short recovery after power cycle.
  • Minimal evidence (1 second): error counters bucketed by address/device/segment · “busy” readback (if available) · repeated timeouts to the same endpoint.
  • First safe action: isolate the suspected segment/device (bypass/mux or traffic gating) to keep the rest of the system alive.
  • Escalate when: data safety risk appears (partial writes possible) → switch to fail-safe write policy and stronger recovery level.
Class D · Power glitch / brown-out / hot-plug
CAUSE
  • Typical symptoms: temperature/load/hot-plug correlation; intermittent “logic anomalies” after ESD; devices appear powered but behave inconsistently.
  • Minimal evidence (1 second): reset-cause flags (WDT/BOR/POR) + brown-out indicator + rail dip record (if present).
  • First safe action: record cause and avoid partial reinitialization; use a clean reset boundary for the affected domain.
  • Escalate when: repeated causal flags → add supervisor/sequence hooks and tighten post-recovery verification gates.
Class E · Software deadlock / scheduling artifact
CORRELATE
  • Typical symptoms: failures appear only at high throughput; errors look like SI/EMI but track CPU load, queue depth, or ISR latency.
  • Minimal evidence (1 second): underrun/overrun counters + task/ISR latency + ring-buffer drops + throughput discontinuity.
  • First safe action: correlate events to scheduler metrics before hardware changes; verify whether “no forward progress” is a software gate.
  • Escalate when: correlation is strong → treat as budgeting/priority issue (timeouts/retries/queueing) rather than electrical integrity first.
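The evidence-first rule can be sketched as a priority dispatch: the fastest, highest-certainty signal wins. All field and enum names below are illustrative assumptions, not from this page; power flags are checked before peer bucketing because a brown-out can mimic a half-alive peer:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical 1-second evidence snapshot. */
struct evidence {
    bool line_stuck_low;    /* level held low > T, edge count stalled     */
    bool ctrl_busy_stuck;   /* BUSY set, FIFO/DMA making no progress      */
    bool single_endpoint;   /* errors bucket to one address/segment       */
    bool power_cause_flag;  /* WDT/BOR/POR or brown-out indicator set     */
    bool load_correlated;   /* failures track CPU load / ISR latency      */
};

enum lockup_class { CLASS_A_LINE, CLASS_B_CTRL, CLASS_C_PEER,
                    CLASS_D_POWER, CLASS_E_SW, CLASS_UNKNOWN };

static enum lockup_class classify(const struct evidence *e)
{
    if (e->line_stuck_low)   return CLASS_A_LINE;   /* pin-first   */
    if (e->ctrl_busy_stuck)  return CLASS_B_CTRL;   /* reg-first   */
    if (e->power_cause_flag) return CLASS_D_POWER;  /* cause-first */
    if (e->single_endpoint)  return CLASS_C_PEER;   /* bucket      */
    if (e->load_correlated)  return CLASS_E_SW;     /* correlate   */
    return CLASS_UNKNOWN;
}
```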
Diagram · Lock-up taxonomy tree + minimal evidence tags
[Diagram text: a lock-up event branches by fastest evidence — A line stuck (SDA low, edges stop) · B controller hung (BUSY stuck, DMA no-done) · C peer hung (address bucket, busy read) · D power glitch (BOR flag, reset cause) · E SW deadlock (underrun, ISR latency). Rule: evidence-first classification, then the recovery ladder with bounded escalation thresholds (X/Y/Z placeholders).]

H2-3 · Observability & Metrics (Make reliability measurable)

Intent

Reliability cannot be judged by “feel”. It must be measurable, regression-testable, and comparable across builds, environments, and endpoints.

This chapter defines event semantics, a minimal evidence schema, slicing dimensions, and a health score that points to the worst buckets quickly.

A) Event semantics (what gets counted)
Lock-up event
  • Trigger: no forward progress beyond a timeout window.
  • Boundary: from first detection to entering the recovery state machine.
  • De-dup rule: repeated triggers inside one continuous stall are merged into a single event (same bus + same class + same window).
Recovery attempt
  • Unit: one execution of a recovery level (soft → block reset → domain reset → isolate/bypass).
  • Must record: level, step-id, and per-step duration.
  • Purpose: measures escalation pressure and pinpoints the first effective step.
Recovery acceptance
  • Definition: traffic resumes and a post-recovery health check passes.
  • MTTR stop: the verification pass timestamp, not “first activity”.
  • Reason: avoids counting fragile recoveries as “success”.
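The de-dup rule above can be sketched as a small merge filter. The window length and field names are placeholder assumptions; a trigger with the same bus and class inside the window is folded into the ongoing event:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEDUP_WINDOW_MS 1000u   /* placeholder window */

struct lockup_event {
    uint8_t  bus_id;
    uint8_t  class_id;
    uint32_t last_trigger_ms;
    uint32_t trigger_count;     /* triggers merged into this event */
};

/* Returns true if this trigger starts a NEW event; false if it was
 * merged into the ongoing one (same bus + same class + same window). */
static bool lockup_on_trigger(struct lockup_event *ev, uint8_t bus,
                              uint8_t cls, uint32_t now_ms)
{
    bool same = ev->trigger_count > 0 && ev->bus_id == bus &&
                ev->class_id == cls &&
                (now_ms - ev->last_trigger_ms) <= DEDUP_WINDOW_MS;
    if (same) {
        ev->trigger_count++;
        ev->last_trigger_ms = now_ms;
        return false;
    }
    ev->bus_id = bus;
    ev->class_id = cls;
    ev->last_trigger_ms = now_ms;
    ev->trigger_count = 1;
    return true;
}
```

Only the "new event" transitions are counted against the lock-up rate denominator.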
B) Denominator & window (make comparisons fair)
Denominator selection
  • Per 1k transactions: best when load varies widely (throughput changes).
  • Per hour: best for steady-duty systems (trend and alarms).
  • Per boot: best when boot/enum phases dominate risk.
Windowing
  • Rolling window: W minutes for real-time detection and regression (e.g., 10/60).
  • Aggregate window: 24h/7d for build-to-build comparisons.
  • Rule: all dashboards must show the denominator and window explicitly.
C) Minimal log schema (evidence snapshot)

A minimal schema prioritizes stable summaries over raw waveforms. It enables fast triage, production analytics, and consistent cross-version comparisons.

Identity

bus-id · segment-id · role · endpoint (addr/device/port) · speed/mode

Snapshot

line snapshot (level/hold/edge) · controller state (BUSY/FIFO/DMA/IRQ) · error code

Cause hints

reset cause (WDT/BOR/POR) · power flags · last-success token

Recovery trace

recovery step/level · result · duration (step + total) · verification outcome

Must-have slicing dimensions (buckets)

endpoint (addr/device) · speed · temperature band · power state · cable length class · firmware version
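A minimal evidence record covering the four schema groups plus the slicing dimensions might look like the struct below. Field names mirror the groups above; widths and packings are illustrative placeholders, not a wire format defined by this page:

```c
#include <assert.h>
#include <stdint.h>

struct evidence_record {
    /* identity */
    uint8_t  bus_id, segment_id, role;
    uint16_t endpoint;           /* addr/device/port */
    uint8_t  speed_mode;
    /* snapshot */
    uint8_t  line_level;         /* packed level/hold/edge bits   */
    uint8_t  ctrl_state;         /* packed BUSY/FIFO/DMA/IRQ bits */
    uint16_t error_code;
    /* cause hints */
    uint8_t  reset_cause;        /* WDT/BOR/POR flags */
    uint8_t  power_flags;
    uint32_t last_success_token;
    /* recovery trace */
    uint8_t  recovery_level, recovery_step, recovery_result;
    uint16_t step_duration_ms;
    uint32_t total_duration_ms;
    /* slicing dimensions */
    int8_t   temp_band;
    uint8_t  power_state, cable_class;
    uint16_t fw_version;
};
```

Keeping the record small (well under a flash page) makes ring-buffer persistence and telemetry uplink cheap.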

D) Health score (find the worst bucket fast)
Core signals
  • Recovery success rate by bucket (endpoint/speed/temp/version).
  • MTTR distribution (median + p95 tail).
  • Average recovery level (soft vs reset vs isolate share).
  • Retry anomaly ratio (excess retries over threshold X).
  • Recurrence rate within rolling window (same bucket repeats).
Outcome

The health score is a triage tool: it highlights the worst buckets to prioritize fixes, verification, and production gating.

Diagram · Metrics pipeline (event → schema → buffer → storage → dashboard)
[Diagram text: event sources (driver · WDT · monitor) → normalizer (schema · semantics) → ring buffer (rate-limit · persist) → storage/uplink (flash · telemetry · file) → dashboard (SLO · trends · alerts), carrying the evidence snapshot and slicing keys (device · speed · temp · fw version).]

H2-4 · Detection Hooks (Timeouts, line monitors, heartbeat)

Intent

Early detection enables fast recovery. Detection must be layered and evidence-driven to avoid self-inflicted resets.

The detection stack runs from electrical-level certainty (line/power) up to application-level semantics (watchdog). Recovery triggers should require evidence fusion.

A) Detection stack (fast → slow)
  • Line & power: stuck-low, missing edges, brown-out flags (fastest, highest certainty).
  • Controller progress: BUSY/FIFO/DMA/IRQ snapshots (catches stalls before protocol timeouts cascade).
  • Transaction timeouts: peer response, bus idle, DMA completion (policy-driven).
  • App watchdog: service-level progress (slowest, highest semantic value).
B) Timeout layering (define “no progress” precisely)
Transaction timeout

One command/transfer fails to complete. Best default detector, but must tolerate legitimate slow endpoints.

Peer response timeout

Wait-for-response exceeds a bound. Useful when endpoints have known response envelopes and busy phases.

Bus idle / busy timeout

Detects “busy forever”. Requires progress evidence to avoid false alarms under heavy traffic.

DMA completion timeout

High-throughput detector: distinguishes electrical errors from starvation artifacts using underrun/latency counters.

Rule

Timeout thresholds must be produced by a budgeting chapter (later). Short timeouts without evidence fusion can self-trigger recoveries and reduce stability.

C) Line monitor (open-drain vs push-pull)
Open-drain (wired-AND)
  • Detect: held-low duration, release failures, edge disappearance.
  • Evidence: SDA/SCL low > T, edge counter stalled.
  • Guard: slow edges are not lock-up if progress is present.
Push-pull
  • Detect: activity disappearance, burst error patterns, framing error runs.
  • Evidence: missing toggles, repeated framing/parity bursts, persistent abnormal idle.
  • Guard: avoid turning EMI noise into false “stalls” without progress checks.
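For the open-drain case, the detect rule plus its guard can be sketched as a polled monitor: flag a wedge only when the line stays low past T and the edge counter stalls, so slow edges with progress are never reported as lock-up. The threshold and struct layout are placeholder assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define T_STUCK_MS 25u   /* placeholder held-low threshold */

struct line_monitor {
    uint32_t low_since_ms;
    uint32_t last_edge_count;
    bool     was_low;
};

/* Call periodically; returns true when a stuck-low wedge is detected. */
static bool line_monitor_poll(struct line_monitor *m, bool line_high,
                              uint32_t edge_count, uint32_t now_ms)
{
    bool edges_stalled = (edge_count == m->last_edge_count);
    m->last_edge_count = edge_count;

    if (line_high) { m->was_low = false; return false; }
    if (!m->was_low) { m->was_low = true; m->low_since_ms = now_ms; }

    /* guard: low level alone is not a wedge; edges must also stall */
    return edges_stalled && (now_ms - m->low_since_ms) >= T_STUCK_MS;
}
```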
D) Heartbeat probe (isolated from traffic)
  • Non-invasive: probe must not change device state (status/ping class reads).
  • Isolated queue: prevents starvation under bulk traffic (heartbeat must still run when most needed).
  • Evidence fusion: heartbeat miss alone does not trigger hard recovery; it contributes to a fused decision.
  • Rate control: frequency is selected by desired MTTR, bounded to avoid bus load spikes.
False-positive guardrails (avoid self-inflicted recovery)
  • Progress check: if counters advance, treat as slow-not-dead.
  • Load correlation: if failures only appear at high throughput, check starvation artifacts (DMA/latency) before electrical assumptions.
  • Power correlation: if reset-cause and brown-out flags align, classify as power-driven stall first.
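The fusion-plus-guardrail decision can be sketched as one function: no single detector triggers hard recovery, progress downgrades everything to "slow-not-dead", and power flags route to the power branch first. Input names and the trigger levels are illustrative assumptions:

```c
#include <assert.h>
#include <stdbool.h>

struct detect_inputs {
    bool txn_timeout;     /* transaction timeout fired          */
    bool heartbeat_miss;  /* isolated heartbeat probe failed    */
    bool line_abnormal;   /* stuck-low / missing edges          */
    bool progress;        /* counters still advancing           */
    bool high_load;       /* throughput/ISR-latency correlation */
    bool power_flag;      /* brown-out / reset-cause aligned    */
};

enum trigger { TRIG_NONE, TRIG_SOFT, TRIG_POWER_BRANCH, TRIG_HARD };

static enum trigger fuse(const struct detect_inputs *d)
{
    if (d->progress) return TRIG_NONE;            /* slow-not-dead guard  */
    if (d->power_flag) return TRIG_POWER_BRANCH;  /* power-driven first   */
    if (d->txn_timeout && d->heartbeat_miss && d->line_abnormal)
        return TRIG_HARD;                         /* fused: aligned evidence */
    if (d->txn_timeout && d->high_load) return TRIG_SOFT; /* starvation?  */
    if (d->txn_timeout || d->heartbeat_miss) return TRIG_SOFT;
    return TRIG_NONE;
}
```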
Diagram · Detection stack (fast → slow) with evidence fusion and guardrails
[Diagram text: detection stack from fast certainty to semantic value — line + power (stuck-low · missing edges · BOR) → controller progress (BUSY · FIFO · DMA · IRQ) → transaction timeouts (txn · peer · idle · DMA) → app watchdog (service progress). Evidence fusion (line + ctrl + txn + heartbeat) sets the trigger level, guarded by progress/load/power correlation, before entering the recovery ladder.]

H2-5 · Recovery Playbook (Unwedge → reset → re-enumerate)

Intent

Provide a repeatable recovery ladder that turns lock-up handling into an auditable state machine: isolate traffic, capture evidence, attempt minimal unwedge, escalate resets, then re-initialize and verify before resuming.

Protocol bit-level details remain in protocol-specific subpages; this chapter focuses on cross-bus recovery structure, escalation gates, and verification criteria.

A) The 5-step recovery ladder (fixed sequence)
Freeze → Snapshot → Soft unwedge → Reset → Re-init + Verify

Each step must emit a log record: step-id, result, and duration to support MTTR and recovery success measurements.

B) Step definitions (Goal · Evidence · Pass/Escalate)
1) Freeze traffic
  • Goal: stop new transactions to prevent damage amplification and queue churn.
  • Evidence: queue depth, last-success token, active endpoint identifiers.
  • Pass: isolation mode entered; critical writes are blocked during recovery.
2) Snapshot evidence
  • Goal: capture minimal evidence to classify the stall and compare across versions.
  • Evidence: line snapshot, controller state, counters, and reset-cause flags (if present).
  • Pass: schema-complete snapshot persisted to ring buffer or stable storage.
3) Attempt soft-unwedge
  • Goal: recover with minimal intrusion and without losing broader system context.
  • Evidence: progress counters resume, error bursts stop, bus transitions return.
  • Pass: health check passes within T; else escalate.
4) Escalate reset
  • Goal: clear stuck state machines and “half-alive” peripherals deterministically.
  • Evidence: record reset level (controller / peripheral / domain), and reset-cause attribution.
  • Pass: reset completes cleanly; proceed to re-init. Repeated resets indicate the need for isolation/bypass.
5) Re-init + verify + resume
  • Goal: restore stable operation, not just “first activity”.
  • Evidence: re-enumeration (if applicable), baseline transfer, and verification checks.
  • Pass: VERIFY gate meets thresholds (error rate < X, no re-stall for Y).
Escalation gates (soft → hard)
  • Count gate: soft-unwedge fails N times consecutively.
  • Time gate: SUSPECT → VERIFY exceeds T total duration.
  • Risk gate: data safety risk is detected (critical write in-flight, power anomaly); escalate immediately.
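The three gates can be sketched as one escalation function over the ladder levels (0 soft-unwedge → 1 block reset → 2 domain reset → 3 isolate/bypass). N and T_total are the placeholder thresholds from above; the struct and function names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define N_SOFT_FAILS 3u    /* placeholder count gate */
#define T_TOTAL_MS 500u    /* placeholder time gate  */

struct ladder { unsigned level; unsigned fails_at_level; uint32_t t0_ms; };

static unsigned ladder_next_level(struct ladder *l, bool step_failed,
                                  bool data_risk, uint32_t now_ms)
{
    if (data_risk) {                                  /* risk gate: immediate */
        l->level = 3;
        return l->level;
    }
    if (now_ms - l->t0_ms > T_TOTAL_MS && l->level < 3) {  /* time gate */
        l->level++;
        l->fails_at_level = 0;
        return l->level;
    }
    if (step_failed && ++l->fails_at_level >= N_SOFT_FAILS && l->level < 3) {
        l->level++;                                   /* count gate */
        l->fails_at_level = 0;
    }
    return l->level;
}
```

Each returned level change is exactly one "recovery attempt" in the metrics sense, so escalation pressure stays measurable.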
Data safety strategy (policy-level hooks)

Recovery must not turn transient stalls into persistent corruption. The following hooks provide cross-bus data safety without requiring protocol timing detail.

Idempotent writes

Repeat execution must not amplify damage. Design writes so retries are safe or detectable.

Rollback / two-phase

Either commit fully or remain at the previous state. Partial commits must be prevented or recoverable.

Write protection in SUSPECT

When the state machine enters SUSPECT/RECOVER, block critical writes until VERIFY gates pass.

Diagram · Recovery state machine (RUN → VERIFY → RUN) with escalation triggers
[Diagram text: state machine RUN → SUSPECT → SNAPSHOT → SOFT_RECOVER → HARD_RESET (controller/peripheral) → REINIT (enum) → VERIFY → RUN, with triggers (timeout · stuck-line · retry overflow) and the escalation ladder soft → reset → isolate.]

H2-6 · Watchdog Architecture (System WDT + Communication WDT)

Intent

Convert watchdogs from a single last-resort reset into a layered safety system: system health, communication progress, and clean reset gating.

Feeding conditions must be based on forward progress signals, not periodic timers, to avoid “runaway code still feeding the dog”.

A) Why layered watchdogs are required
  • Runaway still feeding: code is broken but continues to pet a standard WDT; window WDT blocks this failure mode.
  • Bus dead, CPU alive: main loop runs but the bus makes no progress; a communication WDT detects stalls via progress counters.
  • Half reset: brown-out creates partial resets and stuck peripherals; power supervisors enforce clean reset timing and sequencing.
B) System WDT (feed = main loop health)
Window WDT option

Window watchdog prevents “feed loops” by requiring feeds to occur within a valid timing window.

Feed conditions (examples)
  • critical task ticks advanced (scheduler progress)
  • event queues not saturated (no persistent backlog)
  • latency budget not exceeded for T window
Policy

When system WDT conditions fail, the preferred action is to enter the recovery ladder rather than immediate full reboot—unless risk gates demand escalation.

C) Communication WDT (feed = bus forward progress)
Progress signals
  • transaction completion counter increments
  • heartbeat probe passes (isolated queue)
  • error burst stays under threshold X
Guard

If progress exists, treat the system as slow-not-dead. Avoid hard resets driven only by slow endpoints or bursty traffic.
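The communication-WDT feed gate can be sketched as a pure predicate over the progress signals listed above. Field names and the burst threshold are placeholder assumptions; the point is that the dog is fed on forward progress, never on a periodic timer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ERR_BURST_MAX 8u   /* placeholder threshold X */

struct comm_wdt_inputs {
    uint32_t txn_done_count;  /* completion counter, must increment */
    uint32_t prev_txn_done;   /* value at the last feed             */
    bool     heartbeat_ok;    /* isolated-queue probe passed        */
    uint32_t err_burst;       /* errors in the current window       */
};

/* Feed only when the bus shows forward progress. */
static bool comm_wdt_may_feed(const struct comm_wdt_inputs *in)
{
    bool progress = in->txn_done_count != in->prev_txn_done;
    return progress && in->heartbeat_ok && in->err_burst < ERR_BURST_MAX;
}
```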

D) Reset-cause logging & clean reset gating
  • Reset cause: distinguish WDT reset, brown-out, and external reset; write cause to the event log for closed-loop analysis.
  • External supervisor: enforce reset width, voltage thresholds, and sequencing to prevent partial resets and stuck peripherals.
Diagram · Layered watchdogs and feeding conditions (System WDT vs Comm WDT + supervisor)
[Diagram text: System WDT feed gate (task tick · queue ok · latency < T) and Comm WDT feed gate (txn done · heartbeat ok · counter++) both trigger the recovery ladder (controller reset · peripheral reset); a power supervisor (UVLO · reset width · sequencing) enforces clean reset gating, and reset cause is written to the log.]

H2-7 · Timeout & Retry Budgeting (Don’t self-trigger failures)

Intent

Convert timeouts from guesswork into a budget row that prevents false triggers, and define retries as a controlled limiter rather than an error amplifier.

Thresholds are derived from worst-case queueing, transfer, peer delays, verification time, and safety margin—then validated with measurable pass criteria.

A) Timeout budget structure (fixed decomposition)
Budget row

Timeout = T_queue(worst) + T_transfer(worst) + T_peer(worst) + T_verify(worst) + T_margin

What each term captures
  • T_queue: RTOS scheduling, DMA service latency, lock contention, queue backlog.
  • T_transfer: worst-case payload and framing overhead at the selected bus speed.
  • T_peer: endpoint busy time and response delay (device internal work), plus jitter.
  • T_verify: health checks required to accept completion and resume safely.
  • T_margin: safety headroom for drift, temperature, and bursty contention.
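The budget row maps directly to arithmetic: only T_transfer is derived (payload plus framing overhead at the bus rate, rounded up); every other term is a measured worst case. A minimal sketch with placeholder names:

```c
#include <assert.h>
#include <stdint.h>

struct timeout_budget_us {
    uint32_t t_queue, t_peer, t_verify, t_margin;  /* measured worst cases */
};

/* ceil((payload + framing overhead) / bus rate), in microseconds */
static uint32_t transfer_time_us(uint32_t payload_bits,
                                 uint32_t overhead_bits, uint32_t bus_hz)
{
    uint64_t bits = (uint64_t)payload_bits + overhead_bits;
    return (uint32_t)((bits * 1000000u + bus_hz - 1) / bus_hz);
}

/* Timeout = T_queue + T_transfer + T_peer + T_verify + T_margin */
static uint32_t timeout_us(const struct timeout_budget_us *b,
                           uint32_t payload_bits, uint32_t overhead_bits,
                           uint32_t bus_hz)
{
    return b->t_queue
         + transfer_time_us(payload_bits, overhead_bits, bus_hz)
         + b->t_peer + b->t_verify + b->t_margin;
}
```

For example, 900 bits on a 100 kHz bus costs 9 ms of T_transfer alone, which is why transaction timeouts must never be guessed from "feels slow".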
B) Tiered timeouts (T1/T2/T3) and escalation mapping
T1 · Transaction timeout

Use for first-line detection. Response is soft retry with rate limiting.

T2 · No-progress / peer response

When progress counters stop, upgrade to re-init or soft-unwedge.

T3 · Systemic stall / stuck-line

Reserved for hard failures. Response is recovery ladder escalation (reset or isolate).

Rule

Escalation must be gated by evidence consistency (no progress + aligned signals), not by time alone.

C) Retry strategy (limiter, not amplifier)
Backoff

Prefer exponential or capped backoff to prevent retry storms during bursts, noise, or shared-bus contention.

Caps
  • max retries: N
  • max retry time: T_total
  • rate limit: R retries per window
Tiered actions

Escalate from soft retry → re-init → recovery ladder when caps are exceeded or risk gates are raised.
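A limiter with capped exponential backoff and hard caps can be sketched as below. The constants are placeholders; returning -1 signals that caps are exhausted and the caller must escalate instead of retrying again:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RETRIES     5u   /* placeholder: max retries N        */
#define MAX_TOTAL_MS  200u   /* placeholder: max retry time       */
#define BACKOFF_CAP_MS 64u   /* placeholder: backoff ceiling      */

struct retry_ctx { unsigned attempt; uint32_t spent_ms; };

/* Returns the next backoff delay in ms (1, 2, 4, ... capped), or -1
 * when caps are exceeded and the caller must escalate. */
static int32_t retry_next_delay_ms(struct retry_ctx *r)
{
    uint32_t delay = 1u << (r->attempt > 6 ? 6 : r->attempt);
    if (delay > BACKOFF_CAP_MS) delay = BACKOFF_CAP_MS;
    if (r->attempt >= MAX_RETRIES || r->spent_ms + delay > MAX_TOTAL_MS)
        return -1;                       /* caps exceeded: escalate */
    r->attempt++;
    r->spent_ms += delay;
    return (int32_t)delay;
}
```

The growing delay is what prevents retry storms on a shared bus; the caps are what turn retries into a limiter rather than an amplifier.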

D) DMA / RTOS coupling (avoid “SI-lookalike” timeouts)
  • Symptom: CRC errors and timeouts rise only at high throughput.
  • Evidence: underrun/overrun counters correlate with CPU load, ISR latency, or queue depth.
  • Control: budget T_queue explicitly; gate escalation on forward progress and evidence alignment.
Pass criteria template
  • false timeout trigger rate < X / 1k
  • recovery success rate > Y%
  • MTTR (median / p95) < Z ms
Diagram · Budget row timeline (queue → transfer → peer → verify → margin) with T1/T2/T3 set points
[Diagram text: worst-case timeline queue (DMA · sched) → transfer (len · rate) → peer (busy · delay) → verify (health gate) → margin (guard), with T1 (soft retry), T2 (re-init), and T3 (ladder) set-points. Set-points must align with evidence — no progress + counters + line signals — to avoid time-only escalation.]

H2-8 · Redundancy Patterns (Dual bus / fallback mode / safe degrade)

Intent

Keep the system usable under partial failures by applying redundancy patterns: A/B paths, controller failover, safe degrade modes, and alternate routes—driven by health monitoring and hysteresis.

Switching must be gated, rate-limited, and transaction-safe to avoid thrashing and duplicate writes.

A) Pattern library (what it protects)
Dual bus (A/B)

Protects against wiring, connector, transceiver, and segment faults by maintaining an alternate physical path.

Dual controller (hot/cold standby)

Protects against controller hang and software deadlocks by transferring ownership under controlled gating.

Safe degrade mode

Keeps core service available by reducing speed, write rate, or optional features when health degrades.

Alternate route (bridge / bypass)

Routes around bad endpoints or segments using a selector, mux, or bridge for containment and continued operation.

B) Switching policy (evidence-gated, anti-thrash)
  • Trigger: N consecutive failures or no-progress time > T, aligned with health score drop.
  • Hold: minimum dwell time T_hold after switching.
  • Return: require a stable pass window (not a single success) to switch back.
  • Gray verify: return via limited traffic or non-critical operations before full resume.
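The anti-thrash policy can be sketched as a small selector state machine: switch after N consecutive failures, hold for T_hold, and return only after a stable pass window on the primary path. All thresholds and names are placeholder assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define N_FAIL        3u     /* placeholder trigger gate */
#define T_HOLD_MS  1000u     /* placeholder dwell time   */
#define N_RETURN_PASS 5u     /* placeholder pass window  */

struct selector {
    bool     on_backup;
    unsigned consec_fail, consec_pass;
    uint32_t switched_ms;
};

static void selector_update(struct selector *s, bool ok, uint32_t now_ms)
{
    if (!s->on_backup) {
        s->consec_fail = ok ? 0 : s->consec_fail + 1;
        if (s->consec_fail >= N_FAIL) {        /* trigger gate */
            s->on_backup = true;
            s->switched_ms = now_ms;
            s->consec_pass = 0;
        }
    } else {
        /* ok here = primary path passed gray verification */
        s->consec_pass = ok ? s->consec_pass + 1 : 0;
        if (now_ms - s->switched_ms >= T_HOLD_MS &&   /* hold gate   */
            s->consec_pass >= N_RETURN_PASS) {        /* return gate */
            s->on_backup = false;
            s->consec_fail = 0;
        }
    }
}
```

A single good transaction never flips the selector back; only the dwell time plus a stable pass window does, which is what prevents thrashing.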
C) Transaction consistency (avoid duplicate writes)
Transaction token

Use sequence IDs or tokens to de-duplicate retries and switchovers across A/B paths.

Two-phase / rollback hooks

Ensure writes either commit fully or remain unchanged; switching during critical writes must be gated.

Write gate during switching

Block critical writes until the selected path passes VERIFY gates and the hold window is satisfied.

Diagram · A/B paths with health monitor, selector gating, and return hysteresis
[Diagram text: a health monitor (progress · errors · power) scores the A and B paths (bus/segment + devices each) and drives a MUX selector, with hysteresis (hold time, stable-pass return window), in front of the service.]

H2-9 · Bypass / Isolation Channels (Mux, switches, isolators, power gating)

Intent

Translate recovery actions into hardware-controlled fault domains: isolate bad segments/endpoints, preserve service on healthy domains, and keep a reachable debug path.

The goal is containment and repeatability: every isolation/bypass action is controllable, observable, and safe under partial power states.

A) Fault-domain partition (design boundary before device choice)
Host domain

Controller/software faults must not take down all endpoints; recovery control must remain alive.

Bus fabric domain

Switch/mux elements must support safe disconnect/reconnect and report their selected state.

Segment domain

Long traces/cables and connectors are frequent fault sources; segment isolation prevents “one bad branch kills all”.

Endpoint domain

Misbehaving peripherals must be quarantinable; critical resets and straps must be reachable for forced recovery.

Rule

Every domain boundary must support two properties: disconnect control and debug observability.

B) Bypass & quarantine patterns (disconnect the bad actor)
  • Segment isolate: cut a noisy/shorted branch while keeping the trunk running; verify error-rate drops after isolation.
  • Endpoint quarantine: detach one peripheral when its health bucket degrades; keep service for the remaining endpoints.
  • Bypass route: switch around a suspect node via an alternate path; apply hysteresis and VERIFY gates before resuming writes.
C) Isolation and extenders (reliability-only requirements)
Containment

Prevent external faults (cable, ground shift, ESD aftermath) from propagating into the host domain.

Delay budgeting

Isolation/extension latency must be included in timeout budgets and verified under worst-case conditions.

Observability

Enable status and fault flags must be loggable to reconstruct isolation state during recovery.

D) Forced recovery access (pins, straps, power gating)
Control points
  • controller reset
  • endpoint reset / boot strap
  • mux/switch enable (disconnect/reconnect)
  • domain power gate (off → on)
Production reachability

Each control point must be reachable via test pads, GPIO, or debug interfaces, and must produce an action receipt in logs.

Reliability traps to block (bypass must not create new lock-up modes)
  • Unpowered-side clamping: disconnected or unpowered domains must not clamp line levels.
  • Unknown selector state on reset: define safe default path and deterministic reset behavior.
  • Direction control mismatch: preserve open-drain semantics and prevent accidental push-pull contention.
  • Lost debug after isolation: keep a read-only debug/log path even when endpoints are quarantined.
Diagram · Fault-domain isolation map (host → switch/mux → segments → devices), with control plane and debug survivability
[Diagram text: host domain (controller · log) → bus fabric (switch/mux with EN/SEL/STAT) → segments A/B (EN · power gate) → endpoint domains (RST · EN, quarantine-able), with a recovery-control GPIO plane and a debug path that survives isolation.]

H2-10 · Reliability Traps (Brown-out, hot-plug, ghost-powering)

Intent

Explain “hard-to-reproduce” lock-ups using a causal chain and attach hard hooks: power anomaly evidence, line-state checks, controlled re-initialization, and post-event self-tests.

Traps are treated as system faults: captured, gated, and recovered using deterministic sequences.

A) Brown-out (half-reset behaviors)
  • Mechanism: voltage dip can reset only part of logic, leaving drivers and state machines inconsistent.
  • Symptom: lines are held in unexpected states and progress counters stop even when software is alive.
  • Hook: capture reset-cause and power-good state; route brown-out into a dedicated recovery branch.
B) Ghost-powering (backfeed through IO/ESD paths)
  • Mechanism: an unpowered domain is partially energized through IO paths and clamps.
  • Symptom: “not powered, but still pulling the line” (unexpected line clamp).
  • Hook: require isolation/power-gating rules: lines must release when a domain is off.
C) Hot-plug (transients + state loss)
  • Mechanism: insertion/removal causes transient IO conditions and breaks configuration state.
  • Hard rule: recovery must include re-enumeration / re-initialization and VERIFY gates before writes resume.
  • Hook: record a hot-plug event marker and run post-event self-test (loopback/BIST-lite).
Minimal countermeasure kit (attachable to any bus)
Sequencing gate

Gate bus enable using power-good/reset-cause; avoid half-reset domains joining the bus.

Pre-power line release

Check that lines are not clamped before enabling traffic; isolate suspicious domains early.

Post-power self-test

Run lightweight loopback/BIST and health probes; only then lift write gates.

Hot-plug re-init path

Use re-enumeration + re-initialization + VERIFY gates as a mandatory post hot-plug recovery sequence.
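The four countermeasures above reduce to two gates: when a domain may join the bus, and when critical writes may resume. A minimal sketch; the flag names are illustrative, not a specific supervisor's API:

```c
#include <assert.h>
#include <stdbool.h>

struct join_status {
    bool power_good;
    bool clean_reset;     /* full POR, not half-reset/brown-out residue */
    bool lines_released;  /* pre-power check: nothing clamping the bus  */
    bool self_test_pass;  /* loopback / BIST-lite after power           */
};

/* Sequencing gate: may this domain join the bus (traffic enabled)? */
static bool may_enable_bus(const struct join_status *s)
{
    return s->power_good && s->clean_reset && s->lines_released;
}

/* Write gate: critical writes resume only after the post-power
 * self-test passes on top of the sequencing gate. */
static bool may_lift_write_gate(const struct join_status *s)
{
    return may_enable_bus(s) && s->self_test_pass;
}
```

After a hot-plug event, both gates are re-evaluated from scratch as part of the mandatory re-enumeration path.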

Diagram · Power anomaly → lock-up causal chain with minimal hooks
[Diagram text: brown-out (reset cause), hot-plug (event marker), and IO backfeed (clamp gate) lead to abnormal lines (stuck · missed edges), stuck controllers, and no progress; recovery runs re-init → verify → self-test. Hook evidence early: power flags + line-release checks + event markers.]

Engineering Checklist (Design → Bring-up → Production)

Intent: make reliability executable. Output: checklist + acceptance gates + production hooks.

This chapter compresses the entire reliability stack into a phase-by-phase checklist that can be audited, tested, and carried into production. The checklist is written to avoid “hero debugging”: every recovery action must be triggerable, observable, and measurable.

Design — build-in detection, isolation, and clean reset

  • Detection points: expose BUSY/state bits, retry counters, timeout causes, and line snapshots (SCL/SDA/UART RX/TX/SPI SCLK/MISO/MOSI) in a single “evidence record.”
  • Isolation & segmentation: ensure each fault domain can be disconnected (mux/switch) without taking down the whole bus.
  • Clean reset path: guarantee a deterministic reset cause and reset timing (power supervisor + reset pin reachability).
  • Power gating option: reserve a controlled power-cut path to hard-reset a misbehaving peripheral segment.
  • Data safety hooks: write operations must be idempotent or guarded (write-protect, journaling, commit markers, or rollback flags).
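The detection bullets above all feed one "evidence record." A minimal sketch of such a record, assuming an I²C-style two-wire line snapshot; every field name and width is an illustrative placeholder, not a real driver layout:

```c
#include <stdint.h>

/* One evidence record per suspect event, sized for a small ring buffer.
 * All names/widths are hypothetical placeholders. */
typedef struct {
    uint8_t  bus_id;
    uint8_t  role;           /* 0 = controller, 1 = peripheral        */
    uint32_t speed_hz;
    uint8_t  busy_irq_bits;  /* BUSY flag, pending IRQ, DMA active    */
    uint32_t dma_progress;   /* bytes moved by DMA so far             */
    uint32_t last_ok_age_ms; /* age of last successful transaction    */
    uint8_t  line_snapshot;  /* packed line levels at capture time    */
    uint8_t  timeout_cause;
    uint16_t retry_count;
} evidence_record_t;

/* Pack an I²C line snapshot: bit1 = SCL, bit0 = SDA. */
uint8_t pack_line_snapshot(int scl, int sda)
{
    return (uint8_t)(((scl & 1) << 1) | (sda & 1));
}
```

Keeping the record fixed-size and packed makes it cheap to snapshot from an ISR before a reset clears the evidence.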
Example building blocks (material numbers; verify package/suffix/value/availability)
  • I²C mux / fault-domain split: TI TCA9548A, NXP PCA9548A
  • Hot-swap / stuck-bus containment: TI TCA4311A, NXP PCA9511A
  • Bidirectional I²C level shifting: TI PCA9306 (or NXP PCA9306 variants)
  • Long cable / noisy reach (differential or buffered): NXP PCA9615, NXP P82B96
  • I²C isolation (fault-domain separation): TI ISO1540, ADI ADuM1250
  • Bypass / selector (multi-signal): TI TMUX1574 (SPDT multi-channel switch)
  • Supervisor / deterministic reset: TI TPS3808
  • Watchdog timer: TI TPS3431
  • Load switch for hard power-cycle: TI TPS22918
  • Low-C ESD arrays (port robustness): TI TPD4E05U06

Bring-up — prove recovery works under fault injection

  • Fault injection matrix: cable unplug/replug, short-to-GND, slow peripheral response, brown-out pulse, hot-plug spike, and “half-reset” scenarios.
  • Recovery state machine: run Freeze → Snapshot → Soft-unwedge → Hard-reset → Re-init → Verify → Resume, and log each transition.
  • Quantify MTTR: record recovery duration distribution (P50/P95/P99), not just averages.
  • False-trigger control: verify timeouts/retries do not self-trigger lock-ups during high throughput / heavy RTOS load.
  • Acceptance gates: Recovery success rate ≥ X%; MTTR ≤ Y ms; false recovery triggers ≤ Z / 1k (placeholders).
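The escalation ladder exercised during bring-up can be sketched as a table of actions tried in order, where a level is accepted only when its VERIFY post-check passes. The action type and demo stubs below are assumptions for illustration:

```c
#include <stdbool.h>

typedef enum {
    REC_SOFT_UNWEDGE = 0, /* clock out stuck bits / abort transfer */
    REC_BLOCK_RESET,      /* reset the bus controller block        */
    REC_DOMAIN_RESET,     /* reset or power-cycle the segment      */
    REC_ISOLATE           /* disconnect the branch, keep system up */
} rec_level_t;

/* Each action performs its recovery step AND its post-check; it
 * returns true only when the VERIFY gate passed. */
typedef bool (*rec_action_t)(void);

/* Try each level in order; return the level whose post-check passed,
 * or -1 if even isolation failed (escalate to system policy). */
int run_recovery_ladder(const rec_action_t actions[4])
{
    for (int lvl = REC_SOFT_UNWEDGE; lvl <= REC_ISOLATE; ++lvl)
        if (actions[lvl] && actions[lvl]())
            return lvl;
    return -1;
}

/* Demo stubs: soft unwedge and block reset fail, domain reset works. */
static bool fail_action(void) { return false; }
static bool pass_action(void) { return true; }
static const rec_action_t demo_ladder[4] =
    { fail_action, fail_action, pass_action, pass_action };
static const rec_action_t all_fail[4] =
    { fail_action, fail_action, fail_action, fail_action };
```

Because the accepted level is returned (not just "recovered"), each event can be bucketed by recovery level exactly as the SLO section requires.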
Debug access & bridges (examples; pick per lab/production constraints)
  • I²C/SPI ↔ UART bridge for remote console: NXP SC16IS750
  • USB ↔ UART (host debug): FTDI FT232R, Silicon Labs CP2102N
  • USB ↔ I²C/UART (mixed debug): Microchip MCP2221A
  • UART PHY layering examples: TI TRS3232E (RS-232), TI THVD1550 (RS-485)

Production — make reliability testable and diagnosable at scale

  • Fixture-triggerable recovery: test fixtures must be able to force reset, power-cycle a segment, and verify the bus is released afterward.
  • Reset cause pipeline: store WDT vs supervisor vs brown-out cause, plus a short evidence snapshot.
  • Log schema stability: freeze a minimal schema so field logs remain comparable across firmware revisions.
  • RMA bucketing rules: bucket by failure class (line stuck / controller hung / peer hung / power anomaly / software deadlock) and by MTTR tier.
  • Yield gates: lock-up rate ≤ X / 1k transactions; recovery success ≥ Y%; evidence coverage ≥ Z% (placeholders).
SVG 11 · Checklist pipeline (Design → Bring-up → Production)
[Diagram: reserve hooks in Design (detect points, isolation/mux, clean reset, log schema), inject faults and validate recovery in Bring-up (fault injection, recovery SM, MTTR stats, false triggers), ship with logs in Production (fixture hooks, reset cause, field logs, RMA buckets); each gate uses pass-criteria placeholders (X/Y/Z) to keep the flow measurable.]

Applications & IC Selection Logic (Reliability-first)

No product recommendations — only selection rules · example material numbers included as anchors

Reliability-driven selection starts from failure cost and maintainability, then chooses isolation/segmentation, clean reset, and observability. Material numbers below are reference anchors to help structure a BOM; final choice must match voltage, timing, temperature, EMC, and qualification needs.

Bucket A — long cable / noisy environment / ground potential risk

  • Primary goal: keep line anomalies from becoming controller lock-ups.
  • Pattern: segment the bus + use an extender or differential physical layer + add robust port protection.
  • Reference parts: NXP PCA9615 (differential I²C), NXP P82B96 (buffer/extender), TI TPD4E05U06 (ESD array).
Pass criteria placeholders: lock-up rate ≤ X / 1k, recovery success ≥ Y%, MTTR ≤ Z ms.

Bucket B — hot-plug / brown-out / ghost-powering suspected

  • Primary goal: guarantee clean reset and controllable hard-recovery.
  • Pattern: hot-swap I²C buffering + supervisor reset gating + optional segment power-cut.
  • Reference parts: TI TCA4311A / NXP PCA9511A (hot-swap buffers), TI TPS3808 (supervisor), TI TPS22918 (load switch for power-cycle).

Bucket C — mixed-voltage domains / need controlled enable and isolation

  • Primary goal: avoid “powered-off side clamps the bus” and prevent lock-ups caused by domain leakage.
  • Pattern: level translator with enable + isolation barrier when fault-domain separation is required.
  • Reference parts: TI PCA9306 (level shifting with EN), TI ISO1540 or ADI ADuM1250 (I²C isolation).

Bucket D — many peripherals / address conflicts / need fast isolation of a bad branch

  • Primary goal: isolate one bad branch without stopping the entire system.
  • Pattern: mux per branch + health monitor-driven disable + optional bypass selector.
  • Reference parts: TI TCA9548A / NXP PCA9548A (I²C mux), TI TMUX1574 (selector/bypass for multi-signal paths).

Bucket E — UART service/debug links with rugged cabling

  • Primary goal: keep debug connectivity reliable while avoiding lock-up cascades into the main system.
  • Pattern: PHY with robust ESD + flow control + watchdog on the debug task + log reset causes.
  • Reference parts: TI TRS3232E (RS-232), TI THVD1550 (RS-485), TI TPS3431 (watchdog), FTDI FT232R / Silicon Labs CP2102N (USB-UART).
Reliability-first selection steps (portable rule set)
  1. Set targets: lock-up rate, recovery success rate, MTTR, evidence coverage (placeholders X/Y/Z).
  2. Pick fault domains: decide what can fail without stopping the system (branch, peripheral, cable segment).
  3. Choose controls: mux/switch/isolation + supervisor/reset + watchdog + optional power-cut.
  4. Define evidence: fixed log fields and counters so failures are classifiable within seconds.
  5. Validate by injection: prove recovery state machine under realistic stress and temperature.
SVG 12 · Selection flow (reliability-driven)
[Diagram: inputs (cable/noise, failure cost, maintainability) feed isolate/bypass decisions, producing segmentation, reset & WDT, and evidence logs; example anchor parts: TCA9548A/TCA4311A (mux/hot-swap), ISO1540/PCA9615 (isolation/reach), TPS3808/TPS3431 (reset/power).]


FAQs (Lock-up, Recovery, Watchdog, Bypass, Redundancy, Production)

Fixed four-line answers · data-driven pass criteria (X/Y/Z placeholders)

These FAQs close out long-tail debugging strictly within this page's boundary: lock-up definition, evidence capture, timeout/retry budgeting, recovery ladder, watchdog layering, bypass/isolation, redundancy hysteresis, and production/field monitoring.

System occasionally locks up but the scope looks “normal” — what state evidence should be checked first?

Likely cause: controller state machine “no-progress” (BUSY/IRQ/DMA) or a peer that stalls without a clear line-level symptom.

Quick check: capture one snapshot record: bus-id/role/speed + controller BUSY/IRQ pending + DMA progress counter + “age of last successful transaction” + line snapshot (stuck-low detector state).

Fix: trigger Snapshot-on-Suspect, add progress-based watchdog (counter increments required), and classify lock-ups by the first missing progress signal.

Pass criteria: evidence coverage ≥ Z%; lock-up rate ≤ X / 1k transactions; MTTR p95 ≤ Y ms.

Making the timeout shorter makes the system less stable — how to tell “self-trigger” from a real fault?

Likely cause: timeout budget below worst-case queue/peer delay, causing false positives and retry storms that amplify load.

Quick check: compare timeout to measured latency percentiles (p95/p99) and correlate timeouts with CPU load/queue depth/ISR latency.

Fix: compute Timeout = T_queue + T_transfer + T_peer + T_verify + margin; add exponential backoff + rate limiting + tiered escalation (soft retry → re-init → ladder).

Pass criteria: false-trigger rate ≤ X / 1k; recovery success ≥ Y%; MTTR p95 ≤ Z ms.
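The timeout budget and backoff from the fix above can be sketched directly; a minimal version, assuming microsecond units and p99-measured worst-case inputs (names are illustrative):

```c
#include <stdint.h>

/* Timeout = T_queue + T_transfer + T_peer + T_verify + margin.
 * Inputs must be measured worst cases (p99), not typical values,
 * or the timeout will self-trigger under load. */
uint32_t timeout_us(uint32_t t_queue, uint32_t t_transfer,
                    uint32_t t_peer, uint32_t t_verify, uint32_t margin)
{
    return t_queue + t_transfer + t_peer + t_verify + margin;
}

/* Exponential backoff with a hard cap to prevent retry storms:
 * delay = base << attempt, clamped to cap_us. */
uint32_t backoff_us(uint32_t base_us, unsigned attempt, uint32_t cap_us)
{
    uint64_t d = (uint64_t)base_us << (attempt > 16 ? 16 : attempt);
    return d > cap_us ? cap_us : (uint32_t)d;
}
```

The 64-bit intermediate and shift clamp keep the backoff well-defined even for pathological attempt counts.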

The watchdog resets the system but the lock-up returns — first check reset cause or power sequencing?

Likely cause: “half reset” (brown-out / weak reset gating) or a peripheral/segment not covered by the reset tree.

Quick check: log reset-cause + power-good history, and verify reset width/sequence across all relevant rails and bus segments.

Fix: enforce supervisor-gated reset, ensure every segment/peripheral has a controllable reset or power-cycle path, and block traffic until line-release + verify pass.

Pass criteria: bus released within X ms after reset; repeated-reset loop rate ≤ Y / 24h; recovery success ≥ Z%.

Recovery succeeds but data is occasionally wrong — how to locate an idempotency / write-protect gap?

Likely cause: write transactions retried without de-duplication, or a partial commit during a recovery window.

Quick check: search logs for duplicate transaction tokens / sequence IDs, missing “commit marker,” or writes executed while state = SUSPECT/RECOVER.

Fix: add transaction token + commit marker (or verify-after-write), and gate critical writes until VERIFY passes.

Pass criteria: data mismatch ≤ X / 10k writes; zero duplicate commits; recovery success ≥ Y%.
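The transaction-token + commit-marker fix can be sketched as a tiny journal slot; `journal_slot_t` and `journal_apply` are hypothetical names for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Journal slot: a write becomes visible only once committed, and a
 * replay of the same token is de-duplicated. Illustrative layout. */
typedef struct {
    uint32_t token;     /* unique per logical write                 */
    uint32_t value;
    bool     committed; /* commit marker, set after verify passes   */
} journal_slot_t;

/* Apply a write exactly once: a retried (duplicate) token is a no-op,
 * so a recovery-window retry cannot double-commit. */
bool journal_apply(journal_slot_t *slot, uint32_t token, uint32_t value)
{
    if (slot->committed && slot->token == token)
        return false;          /* duplicate retry: drop it */
    slot->token = token;
    slot->value = value;
    slot->committed = true;    /* in a real system: only after verify-after-write */
    return true;
}
```

In firmware the commit flag would be set only after the verify-after-write read-back, so a reset between write and commit leaves the slot safely replayable.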

Lock-up only appears at high throughput — how to disprove FIFO underrun / scheduling jitter?

Likely cause: DMA underrun/overrun or ISR latency spikes masquerading as protocol/line errors.

Quick check: correlate lock-ups with underrun counters, queue depth, CPU load, and “progress counter freeze” timestamps.

Fix: increase buffering, enforce backpressure, prioritize comm ISR/DMA, and use progress-based timeouts (not pure wall-clock).

Pass criteria: underrun rate ≤ X / 1M bytes; lock-up rate ≤ Y / 1k; MTTR p95 ≤ Z ms.

A new cable reduced lock-up rate but MTTR got longer — did the monitoring definition change?

Likely cause: MTTR start/stop markers changed, denominators/windows changed, or recovery escalates to a heavier step more often.

Quick check: verify metric definitions (per hour / per 1k txn / per boot) and compare recovery-step distribution (SOFT vs HARD vs REINIT).

Fix: freeze schema + MTTR markers, log step transitions with timestamps, and standardize dashboards by the same denominators.

Pass criteria: metric definition stable across releases; MTTR p95 ≤ X ms; recovery-step mix variation ≤ Y%.

Adding bypass/isolation made failures worse — what is the most common power-domain/direction mistake?

Likely cause: unpowered-side clamping (ghost-power) or direction/enable defaults that violate open-drain semantics.

Quick check: measure line levels with the segment powered off, verify EN/SEL reset defaults, and confirm “line release” before enabling traffic.

Fix: add enable sequencing + safe default route, enforce domain-aware gating (PG required), and log isolation state in every event record.

Pass criteria: zero stuck-low while any segment is off; isolation-enabled error rate ≤ X / 1k; recovery success ≥ Y%.

Redundancy switching flaps back and forth — how to set hysteresis to avoid mis-switching?

Likely cause: no minimum hold time, return condition too loose, or a noisy health score.

Quick check: plot health score vs time and count path toggles per hour; check whether hold timer and “consecutive good windows” exist.

Fix: enforce Thold (min residence), require N consecutive pass windows before return, and canary traffic before full cutover.

Pass criteria: switch rate ≤ X / day; availability ≥ Y%; no oscillation bursts (> Z toggles in 10 min).
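The anti-flap fix combines a minimum hold time with N consecutive good windows. A minimal sketch, with thresholds and struct layout as placeholders to tune per product:

```c
#include <stdbool.h>
#include <stdint.h>

/* Failback hysteresis state: return to the primary path only after
 * Thold has elapsed AND need_good consecutive health windows passed.
 * Names/thresholds are illustrative placeholders. */
typedef struct {
    uint32_t hold_ms;     /* Thold: minimum residence on backup path */
    uint32_t need_good;   /* consecutive good windows required       */
    uint32_t elapsed_ms;
    uint32_t good_streak;
} failback_t;

/* Feed one health window; a single bad window resets the streak.
 * Returns true when failback to the primary path is allowed. */
bool failback_step(failback_t *f, uint32_t window_ms, bool window_good)
{
    f->elapsed_ms += window_ms;
    f->good_streak = window_good ? f->good_streak + 1 : 0;
    return f->elapsed_ms >= f->hold_ms && f->good_streak >= f->need_good;
}
```

Resetting the streak on any bad window is what suppresses oscillation from a noisy health score: a marginal path must stay clean for the full N windows before it wins traffic back.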

Lock-up increases at cold/hot temperatures — which fields should be recorded first for fastest diagnosis?

Likely cause: margin shrink (timing, thresholds, leakage) or power droop that changes state-machine behavior.

Quick check: slice metrics by temperature + Vrail + clock source + cable/segment ID + speed + retry counters; compare p95 latency and lock-up rate per bucket.

Fix: apply temperature-aware margins (budget + hysteresis), tighten reset/power gating, and verify recovery ladder under thermal sweep.

Pass criteria: lock-up rate ≤ X / 1k across temperature range; recovery success ≥ Y%; MTTR p95 ≤ Z ms.

Hot-plug always causes one lock-up event — which recovery step is most commonly missing?

Likely cause: missing re-enumeration/re-initialization or missing VERIFY gate before resuming traffic.

Quick check: confirm hot-plug event triggers Freeze → Snapshot → Re-init → Verify, and that traffic remains blocked until line-release check passes.

Fix: enforce a dedicated hot-plug path: isolate segment → power-stabilize → reinit/re-enumerate → verify → resume; gate critical writes during the window.

Pass criteria: resume time ≤ X ms; first-transaction failure rate ≤ Y / 1k; no repeated lock-up bursts after hot-plug.
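One way to enforce the dedicated hot-plug path is to validate the recovery event log against the mandated order. The step enum and checker below are an assumed sketch: retries of earlier steps are allowed, but any skip-ahead (e.g. resuming before VERIFY) is rejected:

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    HP_ISOLATE = 0,   /* isolate segment                 */
    HP_POWER_STABLE,  /* wait for power to stabilize     */
    HP_REINIT,        /* re-enumerate / re-initialize    */
    HP_VERIFY,        /* VERIFY gate (health/sanity read)*/
    HP_RESUME,        /* lift traffic gates              */
    HP_STEP_COUNT
} hp_step_t;

/* Accept a hot-plug recovery only if every mandated step appears in
 * order; repeating an earlier step (a retry) is fine, but jumping
 * ahead of the expected step is a sequencing violation. */
bool hotplug_sequence_ok(const hp_step_t *log, size_t n)
{
    size_t expect = 0;
    for (size_t i = 0; i < n && expect < HP_STEP_COUNT; ++i) {
        if (log[i] == (hp_step_t)expect)
            ++expect;                         /* next mandated step ran */
        else if (log[i] > (hp_step_t)expect)
            return false;                     /* skipped ahead */
    }
    return expect == HP_STEP_COUNT;
}
```

Running this check in a test fixture (or on uploaded field logs) turns "the VERIFY gate is mandatory" from a review comment into a pass/fail signal.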

Field reports only “watchdog reset” — how to distinguish true deadlock vs a watchdog-fed logic bug?

Likely cause: watchdog feed is not tied to real progress; the main loop feeds WDT while comm progress is stalled.

Quick check: compare “last progress timestamp” vs “last feed timestamp”; verify feed requires counter increments (transactions completed / heartbeat ack).

Fix: implement communication watchdog (feed condition = progress), use window watchdog to prevent blind feeding, and log feed reason + reset cause.

Pass criteria: stall detection time ≤ X ms; watchdog cause correctly classified ≥ Y%; false-feed events = 0.
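The progress-conditioned feed can be as small as one comparison; what counts as "progress" (completed transactions, heartbeat acks) is an assumption left to the application:

```c
#include <stdbool.h>
#include <stdint.h>

/* Feed the hardware WDT only if the communication progress counter
 * advanced since the last feed: a busy main loop with a stalled bus
 * must NOT keep the watchdog alive. */
typedef struct {
    uint32_t last_progress;
} comm_wdt_t;

bool comm_wdt_may_feed(comm_wdt_t *w, uint32_t progress_now)
{
    if (progress_now == w->last_progress)
        return false;              /* no forward progress: let the WDT bite */
    w->last_progress = progress_now;
    return true;
}
```

Pairing this gate with a window watchdog (which also rejects too-early feeds) closes both failure modes: blind feeding and feed loops.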

An RMA unit cannot reproduce the issue — which missing production/field log field is most fatal?

Likely cause: evidence gap: reset cause, line snapshot, recovery step/duration, environment bucket, or firmware build ID missing.

Quick check: audit schema completeness and verify a ring-buffer snapshot is captured at “SUSPECT” moment (before reset clears evidence).

Fix: freeze a minimal schema (reset cause + controller state + line snapshot + recovery step + duration + temp/Vrail + fw hash) and enforce upload/retention rules.

Pass criteria: evidence coverage ≥ Z%; time-to-classify ≤ X minutes; “unknown bucket” rate ≤ Y%.