Error Handling & Recovery for I²C, SPI, and UART
Reliable recovery is not “reset and hope”—it is a measurable ladder: detect no-progress, resync safely, escalate with bounded timeouts, and verify configuration consistency after recovery.
This page turns I²C/SPI/UART failures into an auditable playbook with telemetry, pass/fail criteria, and soft-to-hard controls that prevent duplicate writes and false resets.
H2-1 · Definition & Recovery Goals
Intent
Establish a strict vocabulary for transient, desync, and hung/wedged states, then define what “recovery success” means in measurable, production-ready terms.
Scope guardrail (to avoid overlap)
- Covers: recovery definitions, “no forward progress” criteria, measurable KPIs (TTR/loss/consistency/false-reset).
- Does not cover: topology/capacitance budgeting, SI/termination, or protocol deep-dives (those belong to sibling pages).
Three failure classes (strict definitions)
1) Transient errors
Errors occur, but forward progress continues: retries succeed, counters increment, transactions complete intermittently.
- Signal: error counters rise, but completion events still appear.
- Risk: throughput/latency degradation; occasional data loss.
2) Protocol desync
The bus is active, but endpoints disagree on frame boundary / phase / session state, producing repeated invalid frames until re-aligned.
- Signal: persistent CRC/header/parity/framing failures with toggling activity.
- Risk: repeated writes/reads may target wrong state unless fenced.
3) Bus wedged / hung
No forward progress beyond a defined timeout window: a line may be held, a controller may be stuck, or a queue may never complete.
- Signal: “time since last success” exceeds threshold; completion interrupts stop.
- Risk: system-level cascading failures; recovery must be staged and safe.
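The three definitions above can be expressed as a small classifier. This is a minimal sketch, not a reference implementation: the function name, the `desync_streak` parameter, and the millisecond timestamps are assumptions made for illustration, and every threshold is a placeholder.

```python
from enum import Enum


class FailureClass(Enum):
    HEALTHY = "healthy"
    TRANSIENT = "transient"
    DESYNC = "desync"
    HUNG = "hung"


def classify(now_ms, last_progress_ms, t_idle_ms,
             recent_errors, consecutive_invalid, desync_streak=3):
    """Map telemetry to one of the three failure classes (thresholds are placeholders)."""
    # Hung: no forward progress inside the window, regardless of error count.
    if now_ms - last_progress_ms > t_idle_ms:
        return FailureClass.HUNG
    # Desync: the bus is active, but frames stay invalid until re-aligned.
    if consecutive_invalid >= desync_streak:
        return FailureClass.DESYNC
    # Transient: errors occur, but completions still appear.
    if recent_errors > 0:
        return FailureClass.TRANSIENT
    return FailureClass.HEALTHY
```

The hung check comes first on purpose: a frozen progress timestamp outranks any error statistic.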
Recovery success = three parallel pass criteria
- Data consistency: no unintended duplicate writes; configuration remains coherent (read-back match where applicable).
- Device availability: critical transactions complete repeatedly after recovery (not a one-off “lucky” success).
- System safety state: recovery does not trigger unsafe actions (avoid over-resetting, avoid “half-applied” control changes).
Outputs to report (KPIs with threshold placeholders)
Time to recover (TTR)
Define TTR_p95 and TTR_max per application. Example placeholder: TTR_p95 < X ms.
Loss / retry burden
Track loss and retry density (burstiness matters): loss_per_hour < Y, retry_burst_p99 < N.
Duplicate-write risk
For non-idempotent writes, require a fence (read-back, sequence number, or safe “commit” step). Placeholder: dup_write_rate < Z.
False resets
Stage recovery to minimize unnecessary resets. Placeholder: false_reset_per_day < W.
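As an illustration of how these KPIs might be computed from logged recovery durations, the sketch below uses a nearest-rank percentile. The function names and the `limits` dictionary keys are hypothetical; the limits themselves stand in for the X and W placeholders.

```python
def percentile(samples, p):
    """Nearest-rank percentile for integer p in 1..100 over a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def kpi_report(ttr_ms, false_resets, days, limits):
    """Summarize recovery KPIs and compare them with placeholder limits."""
    ttr_p95 = percentile(ttr_ms, 95)
    false_per_day = false_resets / days
    return {
        "ttr_p95_ms": ttr_p95,
        "ttr_max_ms": max(ttr_ms),
        "false_reset_per_day": false_per_day,
        "pass": (ttr_p95 < limits["ttr_p95_ms"]
                 and false_per_day < limits["false_reset_per_day"]),
    }
```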
Reading tip: “Hung” is defined by no forward progress (not just “many errors”). Escalation should be staged to protect consistency and minimize false resets.
H2-2 · Failure Taxonomy Across I²C / SPI / UART
Intent
Replace guessing with a symptom → cause class → first check map. Each symptom includes one discriminator to quickly separate line-level wedges from state-machine wedges.
Two primary trunks (fast classification)
A) Physical-level wedge
A line is held (stuck-low, forced drive, or electrical contention). Progress stops because the signal cannot return to a valid idle or edge sequence.
Discriminator: pin level remains invalid even when the controller is idle.
B) State-machine wedge
The physical signals may toggle, but a controller/driver/device state machine (or DMA/queue) never completes. Progress stops because the system is “stuck waiting”.
Discriminator: pin levels look plausible, but completion events and “last-good” timestamps freeze.
Symptom cards
I²C — common “hung” signatures
Symptom: SDA stuck low
Likely cause class: physical wedge (slave holds SDA, contention, or pin latch).
First check: read SDA level with controller disabled; log “last-good” timestamp freeze.
Discriminator: SDA remains low even after STOP attempt or peripheral disable.
Symptom: SCL stuck low
Likely cause class: physical wedge or clock source/driver latch.
First check: verify SCL can be driven high in GPIO mode (safe window); confirm pull-up rail present.
Discriminator: SCL low persists across controller reset.
Symptom: clock stretching never releases
Likely cause class: state-machine wedge (slow or stuck slave, timeout policy missing).
First check: measure stretch duration vs master timeout; confirm “no-progress” criteria trips.
Discriminator: pins toggle up to a point, then wait forever without completion.
Symptom: arbitration loss → no recovery
Likely cause class: state-machine wedge (controller state not cleared, retry storm).
First check: confirm driver clears arbitration flag; track retry burst density and backoff presence.
Discriminator: line is idle/high, but transactions never complete.
SPI — common “hung / desync” signatures
Symptom: CS stuck asserted
Likely cause class: state-machine wedge (session boundary not fenced; driver waiting).
First check: confirm CS deassert occurs on timeout; verify “end-of-transaction” interrupt path.
Discriminator: SCLK toggles may stop, but CS remains low beyond the budget window.
Symptom: DMA never completes
Likely cause class: state-machine wedge (interrupt masked, queue stuck, descriptor issue).
First check: verify DMA progress counter; check if last descriptor completed timestamp stops updating.
Discriminator: pins may be quiet, but software shows “busy” indefinitely.
Symptom: MISO driven when it should be Hi-Z
Likely cause class: physical wedge (contention, miswired CS, faulty device).
First check: sample MISO level with CS high; look for forced-level behavior across resets.
Discriminator: line level stays strong even when the bus is idle.
Symptom: header/CRC wrong after one glitch
Likely cause class: protocol desync (bit-slip or phase/session mismatch).
First check: verify a resync primitive exists (CS fence + known header probe); count consecutive invalid frames.
Discriminator: activity continues, but validity never returns without a resync step.
UART — common “noise lock / stuck receive” signatures
Symptom: framing/parity errors in bursts
Likely cause class: protocol desync or noise-driven false start bits.
First check: correlate FE/PE bursts with idle gaps; verify “wait-for-idle” resync window exists.
Discriminator: errors cluster; stable periods exist between bursts.
Symptom: FIFO overrun repeats
Likely cause class: state-machine wedge (ISR starvation, flow control missing, buffer sizing).
First check: log service latency vs FIFO depth; confirm error flags are cleared and counters advance.
Discriminator: overrun rate tracks CPU load/ISR latency, not cable events.
Symptom: autobaud drifts over time
Likely cause class: protocol desync (clock mismatch accumulation; wrong reacquire policy).
First check: compare measured bit time vs expected; validate reacquire trigger and guard window.
Discriminator: errors increase gradually, not abruptly, and improve after reacquire.
Usage rule: each symptom card intentionally stops at the first check. Deeper electrical design and protocol timing details belong to sibling pages to avoid content overlap.
H2-3 · Observability & Telemetry Hooks
Intent
Without observability, recovery becomes blind resets. Use a minimal, low-overhead telemetry set to prove forward progress, tune timeouts, and control escalation.
Minimal telemetry set (must-have)
Counters
- Error counters: NAK / CRC / FE / PE (map by bus type; keep a unified field name).
- Timeout counters: byte/step, frame, transaction timeouts (do not merge).
- Retry counters: retry_total, retry_burst_max, backoff_applied_count.
Time signals
- last_good_ts: last completed successful transaction.
- last_progress_ts: last observable forward progress (DMA advance, FIFO drain, state step).
- Duration distribution: p50/p95/p99 per operation type (not just averages).
Recovery action counts
- Soft: retry/backoff, resync, flush.
- Hard: driver re-init, device reset.
- Power: segment power-cycle (should be rare; track false resets).
Rule: “hung” should be triggered by no forward progress (last_progress_ts frozen), not by “many errors” alone.
Event log schema (minimal fields)
Identity
- bus_id (controller instance / channel)
- device_addr (address/CS index/port id)
Operation
- op_type (read/write/config/stream)
- duration_us (end-to-end elapsed)
Outcome
- error_code (NAK/CRC/FE/PE/timeout)
- phase (byte/step, frame, txn)
Recovery
- recovery_action (retry/resync/reinit/reset/power)
- attempt_id (groups repeated attempts for the same request)
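One way to hold the schema in code is a frozen record with exactly the fields listed above. The class name and the `key=value` line format are assumptions for illustration, not a required encoding.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BusEvent:
    bus_id: int           # controller instance / channel
    device_addr: int      # address / CS index / port id
    op_type: str          # read / write / config / stream
    duration_us: int      # end-to-end elapsed time
    error_code: str       # NAK / CRC / FE / PE / timeout
    phase: str            # byte / frame / txn
    recovery_action: str  # retry / resync / reinit / reset / power
    attempt_id: int       # groups repeated attempts for one request

    def to_line(self):
        """Render the record as one key=value log line, in field order."""
        return " ".join(f"{k}={v}" for k, v in asdict(self).items())
```

Keeping the record immutable means a logged event cannot be altered after the fact, which helps auditability.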
Alert tiers (bind alerts to action)
Warning
Errors rise but progress continues. Action: increase logging granularity; track burstiness; avoid escalation.
Degraded
p95/p99 latency or error bursts exceed budget. Action: apply backoff, reduce load, and prepare resync primitives.
Fail-safe
No forward progress meets the criteria. Action: enter staged recovery; quarantine the device; protect consistency.
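The tiers can be bound to actions with a small lookup. The sketch below is a hedged illustration: the parameter names and the `ACTIONS` table are hypothetical, and the budgets are placeholders.

```python
ACTIONS = {
    "ok": "none",
    "warning": "increase logging granularity; do not escalate",
    "degraded": "apply backoff, reduce load, prepare resync",
    "fail-safe": "enter staged recovery; quarantine device",
}


def alert_tier(no_progress, p95_us, p95_budget_us,
               burst_max, burst_budget, errors_rising):
    """Return the highest applicable tier; no-progress always wins."""
    if no_progress:
        return "fail-safe"
    if p95_us > p95_budget_us or burst_max > burst_budget:
        return "degraded"
    if errors_rising:
        return "warning"
    return "ok"
```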
Pass criteria (placeholders)
- Telemetry coverage: 100% of recovery actions logged with schema fields populated.
- Progress detection: hung classification uses last_progress_ts (not error count only).
- False reset: false_reset_per_day < W (target placeholder).
Design note: keep ISR work minimal (increment counters only). Use the health task for classification and escalation to reduce false resets.
H2-4 · Timeout Budgeting (Transaction / Byte / Idle)
Intent
Timeouts should be budgeted, not guessed. Use layered timeouts to detect “stuck steps” early while avoiding false resets and uncontrolled TTR.
Timeout layers (strict meanings)
Byte / step timeout (T_byte)
The smallest wait budget for a step to advance (byte, phase, descriptor progress). Triggers early when “progress stalls” inside a transaction.
Frame timeout (T_frame)
A complete frame/burst window (UART frame, SPI burst block, I²C phase group). Useful for “resync boundary” decisions.
Transaction timeout (T_txn)
End-to-end budget for a full transaction. Limits total waiting and bounds TTR when combined with retries/backoff.
Idle / no-progress timeout (T_idle)
A “hung” classifier based on last_progress_ts. Not the same as T_txn; it detects zero progress even if signals toggle.
How to set thresholds (distribution-based)
- Use p95/p99 from telemetry (H2-3), not averages.
- Set: T_byte = p99(step) + margin, T_frame = p99(frame) + margin, T_txn = p99(txn) + margin.
- Set: T_idle using last_progress_ts stalls (progress-free window), not “transaction time”.
- Define placeholders: T_byte, T_frame, T_txn, N_retry, Backoff (tuned by burst statistics).
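A sketch of the distribution-based rule, assuming duration samples are available in microseconds. The 25% margin is an arbitrary illustrative value, not a recommendation, and the function names are hypothetical.

```python
def percentile(samples, p):
    """Nearest-rank percentile for integer p in 1..100 over a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def derive_timeouts(step_us, frame_us, txn_us, margin_pct=25):
    """Compute T_byte, T_frame, T_txn as p99 plus a percentage margin."""
    def with_margin(samples):
        p99 = percentile(samples, 99)
        return p99 + p99 * margin_pct // 100

    return {
        "T_byte": with_margin(step_us),
        "T_frame": with_margin(frame_us),
        "T_txn": with_margin(txn_us),
    }
```

T_idle is left out deliberately: it is set from progress-free windows, not from transaction durations.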
Key knobs by bus type (timeout vocabulary)
I²C
- T_stretch_max: maximum tolerated clock-stretching window.
- Master timeout policy: define when to stop waiting and classify no-progress.
SPI
- DMA block timeout: bound waiting for descriptor completion.
- CS assert + frame gap: fence sessions; detect stalls between bursts.
UART
- Inter-byte timeout: detect stalled streams and rebuild boundaries.
- Frame timeout + idle window: define resync windows after bursts.
Integrate retries/backoff into the TTR budget
- Total waiting should be bounded: TTR ≈ (T_txn × N_retry) + Σ(backoff) (placeholder form).
- Use burst telemetry to pick Backoff (avoid retry storms).
- Escalation should require timeout + no-progress before hard resets.
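The placeholder form above can be checked numerically before committing to a retry count. This sketch assumes one capped exponential backoff delay after each attempt, which is a conservative reading of the formula; the names are hypothetical.

```python
def worst_case_ttr_ms(t_txn_ms, n_retry, backoff_base_ms, backoff_max_ms):
    """Upper bound: every attempt runs to T_txn, followed by a capped exponential backoff."""
    backoffs = [min(backoff_max_ms, backoff_base_ms * 2 ** i) for i in range(n_retry)]
    return t_txn_ms * n_retry + sum(backoffs)


def fits_budget(ttr_budget_ms, t_txn_ms, n_retry, backoff_base_ms, backoff_max_ms):
    """True if the worst case stays within the TTR budget."""
    worst = worst_case_ttr_ms(t_txn_ms, n_retry, backoff_base_ms, backoff_max_ms)
    return worst <= ttr_budget_ms
```

For example, with T_txn = 50 ms, four attempts, and backoff 10 ms doubling up to 40 ms, the bound is 4 × 50 + (10 + 20 + 40 + 40) = 310 ms.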
Pass criteria (placeholders)
- False timeouts: false_timeout_rate < X (target placeholder).
- Recovery speed: TTR_p95 < Y ms (target placeholder).
- Safety: no duplicate-write incidents under fault injection (policy verified).
Implementation note: classify “hung” only when timeouts coincide with no-progress (last_progress_ts frozen). This reduces false resets while bounding recovery time.
H2-5 · Recovery Ladder (Soft → Hard) & When to Escalate
Intent
Make recovery auditable with a staged ladder. Escalation should be driven by timeouts + no-forward-progress, not by a single error spike.
Recovery ladder (6 levels, unified vocabulary)
Level 1 · Retry same op (with backoff)
- Trigger: transient errors while progress continues.
- Action: retry + exponential backoff + jitter (avoid retry storms).
- Exit: one success + stability window (N_ok placeholder).
Level 2 · Re-sync (flush / idle wait / boundary rebuild)
- Trigger: suspected protocol desync (boundary mismatch).
- Action: flush queues, wait idle window, rebuild session boundary.
- Exit: read-only health probe passes.
Level 3 · Re-init peripheral block (driver reset)
- Trigger: progress stalls inside controller/driver (DMA/ISR/state).
- Action: controller reset + context rebuild + config replay/verify.
- Exit: controller self-check + health probe passes.
Level 4 · Reset target device (GPIO reset / command reset)
- Trigger: failures localize to one device; safe to isolate.
- Action: device reset, then read-only probe and configuration reconciliation.
- Exit: device online + probe stable for N_ok.
Level 5 · Bus-level clear (I²C bus-clear / SPI dummy clocks)
- Trigger: bus wedge indicators (timeouts + no-progress + line/session stuck).
- Action: bus clear primitive after freezing traffic and acquiring bus lock.
- Exit: bus returns to idle + read-only probe passes.
Level 6 · Power-cycle segment / system reset (last resort)
- Trigger: fail-safe requirements or repeated ladder failure.
- Action: segment power-cycle first; full system reset only when required.
- Exit: boot self-test + probes confirm stable operation.
Escalation policy (combine gates, avoid false resets)
- Gate A — Timeout budget hit: T_byte / T_frame / T_txn exceeded (placeholders from H2-4).
- Gate B — No forward progress: now − last_progress_ts > T_idle (placeholder).
- Gate C — Persistence: fail_streak ≥ K or err_rate ≥ E within a window W (placeholders).
- Gate D — Operation class: non-idempotent writes require a read-only probe before retrying; critical ops may skip to a safer level.
- Gate E — Safety: enter safe-state before hard actions (device reset / bus clear / power-cycle).
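A minimal sketch of how Gates A, B, C, and E might be combined before a hard action. Gate D (operation class) is assumed to be handled where writes are retried, so it is omitted here. The field names are hypothetical, and the persistence gate uses only the fail-streak form.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    timeout_hit: bool         # Gate A: T_byte / T_frame / T_txn exceeded
    now_ms: int
    last_progress_ms: int
    t_idle_ms: int            # Gate B threshold (T_idle)
    fail_streak: int
    k_streak: int             # Gate C threshold (K)
    safe_state_entered: bool  # Gate E


def evaluate_gates(g):
    """Return (allowed, per-gate results); the dict is meant for the audit log."""
    gates = {
        "A_timeout": g.timeout_hit,
        "B_no_progress": g.now_ms - g.last_progress_ms > g.t_idle_ms,
        "C_persistence": g.fail_streak >= g.k_streak,
        "E_safe_state": g.safe_state_entered,
    }
    return all(gates.values()), gates
```

Returning the per-gate results alongside the decision lets the action record show which gate blocked or permitted escalation.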
Degraded operation (contain impact while recovering)
Read-only mode
Allow health probes and status reads only; block side-effect writes until stability is proven.
Quarantine device
Isolate one address/CS while keeping the bus usable for other devices; log quarantine entry/exit.
Bypass path
Switch to a redundant channel when available; keep telemetry continuity to compare both paths.
Auditability & pass criteria (placeholders)
- Action record: level, trigger gates, duration, result, attempt_id always logged.
- Recovery speed: TTR_p95 < Y ms (placeholder).
- False resets: false_reset_per_day < W (placeholder).
- Consistency: zero duplicate-write incidents in fault injection for non-idempotent ops.
Rule: prefer the lowest level that restores forward progress. Hard actions require the combined gates and an audit trail.
H2-6 · I²C Recovery Playbook (Hung Bus)
Intent
Most I²C hangs are either lines held (SDA/SCL stuck low) or a stalled state machine. Recovery should be repeatable, verifiable, and safe for non-idempotent writes.
Detect & classify (line-level vs state-level)
Hung classifier
- No-progress: now − last_progress_ts > T_idle (placeholder).
- Timeout evidence: repeated T_byte/T_txn hits (from H2-4).
Line state
- SDA stuck low: common “bus wedged” signature.
- SCL stuck low: treat cautiously; may require reinit or segment power control.
State-level wedge
- Stretch timeout: stretch_duration > T_stretch_max (placeholder).
- Arbitration loss stall: after arbitration, progress does not resume.
Safe preconditions (protect consistency)
- Freeze traffic: stop new I²C transactions; acquire a bus lock/mutex.
- Block side-effect writes: during recovery, allow read-only probes only.
- Snapshot: record SDA/SCL levels, controller status, and counters before action.
Recovery actions (repeatable sequence)
1) Bus clear (SCL pulses)
- Use when: SDA stuck low (or strong suspicion).
- Action: drive N_pulses SCL clocks (placeholder; the bus-clear procedure in the I²C specification uses nine) to release SDA.
- Observe: SDA returns high; bus returns to idle.
2) Generate STOP (if supported)
- Use when: SDA/SCL are high and a clean boundary is needed.
- Goal: return bus to a known idle state before probes.
3) Re-init controller (driver/peripheral)
- Use when: both lines high but no-progress persists.
- Action: reset controller, rebuild context, replay config, verify status.
4) Optional: re-enumerate targets
- Use when: devices may have reset/power-cycled.
- Action: presence checks via read-only probes (avoid side effects).
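The bus-clear step can be modeled on a host before touching hardware. In this sketch `FakeI2cLines` is a hypothetical stand-in for GPIO access that simulates a target releasing SDA after a number of clocks. The default of nine pulses follows the bus-clear procedure in the I²C specification; the source leaves N_pulses as a placeholder. Real firmware would also need half-period delays and a STOP afterwards.

```python
class FakeI2cLines:
    """Simulated lines: the stuck target releases SDA after `bits_left` clocks."""

    def __init__(self, bits_left):
        self.bits_left = bits_left
        self.scl = 1

    def read_sda(self):
        return 1 if self.bits_left == 0 else 0

    def set_scl(self, level):
        # On each falling edge the simulated target shifts out one more bit.
        if self.scl == 1 and level == 0 and self.bits_left > 0:
            self.bits_left -= 1
        self.scl = level


def bus_clear(lines, n_pulses=9):
    """Pulse SCL until SDA reads high. Returns pulses used, or -1 if SDA stays low."""
    pulses = 0
    while lines.read_sda() == 0:
        if pulses == n_pulses:
            return -1  # give up; the caller escalates to the next ladder level
        lines.set_scl(0)
        lines.set_scl(1)
        pulses += 1
    return pulses
```

Stopping as soon as SDA is released, instead of always sending every pulse, keeps the sequence bounded and makes the pulse count itself a useful audit field.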
Risk control & health probe (read-only, verifiable)
- Probe: read a non-side-effect status/ID register to confirm ACK + STOP behavior.
- Consistency: after recovery, config writes must be reconciled by read-back (policy).
- Pass criteria: probe passes N_ok times; error counters remain stable in window W (placeholders).
Safety note: the recovery path should freeze traffic and use read-only probes to prevent duplicate writes during uncertainty.
H2-7 · SPI Recovery Playbook (Desync / Stuck Session)
Intent
Most SPI “hangs” are session-boundary failures (CS) or flow-control stalls (DMA/queues). Recovery should enforce session fencing, rebuild framing, and restore forward progress with minimal side effects.
Detection (symptoms → first check)
CS asserted too long
- Likely causes: missing end-of-transfer, device busy, CS control stuck.
- First check: CS level vs transfer_done flag and guard time policy.
MISO strong-driven unexpectedly
- Likely causes: multi-slave contention, wrong CS decode, device stuck in data phase.
- First check: only one CS active; confirm idle Hi-Z expectation.
DMA never completes / queue stalls
- Likely causes: lost interrupt, descriptor stuck, driver state mismatch.
- First check: last_progress_ts updates; DMA state and IRQ pending flags.
CRC / header continuously wrong
- Likely causes: desync (bit-slip), partial frame, CS boundary violation.
- First check: enforce CS fence; then header probe after re-sync.
Session fencing (do not start a new transaction until the previous one is closed)
- Fence conditions (placeholders): transfer_done = true, CS deasserted, guard time ≥ T_guard.
- Queue discipline: flush/abort must fully terminate the active descriptor before re-arming DMA.
- Audit fields: cs_low_time, dma_state, attempt_id, last_progress_ts always captured.
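The fence conditions reduce to a single predicate that a driver could evaluate before arming the next transfer. The function and parameter names are hypothetical; timestamps are assumed to be in microseconds.

```python
def fence_closed(transfer_done, cs_asserted, now_us, cs_release_us, t_guard_us):
    """True only when the previous SPI session is fully closed.

    Requires: transfer complete, CS deasserted, and at least T_guard elapsed
    since CS was released.
    """
    guard_elapsed = (now_us - cs_release_us) >= t_guard_us
    return transfer_done and not cs_asserted and guard_elapsed
```

A caller would refuse to start a new transaction while this returns False, and log `cs_low_time` and `dma_state` if the fence stays open past the budget.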
Recovery actions (repeatable sequence)
1) CS deassert + guard + reassert
- Goal: close the prior session boundary.
- Exit: CS confirmed deasserted for at least T_guard; framing is then validated by the sync header probe (step 3).
2) Dummy clocks (re-align shift boundary)
- Use when: header/CRC persistently wrong (suspected bit-slip).
- Parameter: N_dummy_bits placeholder; ensure MOSI fill is side-effect safe.
3) Sync header probe (read-only)
- Goal: confirm known header/ID framing before resuming normal traffic.
- Exit: probe passes N_ok times (placeholder).
4) Flush DMA / driver queues
- Use when: descriptors do not advance, or queued frames are polluted.
- Exit: DMA state returns to idle and last_progress_ts updates.
5) Device reset (only if required)
- Use when: re-sync cannot restore a stable header.
- Exit: probe stable; error density below threshold in window W.
Pass criteria (placeholders)
- Framing: sync header probe passes N_ok times.
- Stability: error rate < X within window W.
- Speed: TTR_p95 < Y ms; avoid false hard resets.
Emphasis: close the previous session and re-validate framing before normal traffic resumes. Queue flush must fully terminate the active DMA state.
H2-8 · UART Recovery Playbook (Noise Burst / Framing Lock)
Intent
UART recovery focuses on rebuilding frame boundaries and clearing error states, especially under noise bursts. The goal is a verifiable return to valid frames, not blind resets.
Detection (signals that indicate boundary loss)
- FE/PE burst: continuous framing/parity errors; burst_max exceeds baseline.
- Idle missing: no reliable idle detect window within W (placeholder).
- FIFO overrun: overrun_count rises; RX backlog does not drain.
- Autobaud mismatch (optional): autobaud lock/estimate deviates from expected range.
Recovery actions (boundary rebuild)
1) Flush RX FIFO + clear error flags
- Goal: remove polluted bytes and exit sticky error state.
- Audit: dropped_bytes_count recorded for traceability.
2) Wait-for-idle window (rebuild boundary)
- Rule: accept a new frame only after idle ≥ T_idle_uart (placeholder).
- Benefit: filters noise-induced false start bits.
3) Break / idle resync (if allowed)
- Use when: protocol supports a resync marker (break/idle sequence).
- Exit: first valid frame passes checks (parity/length/CRC per stack).
4) Autobaud reacquire (optional)
- Use when: autobaud lock is suspected to be wrong.
- Exit: estimated baud within tolerance window (placeholder).
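The wait-for-idle rule in step 2 can be sketched over a list of timestamped bytes. This is a simplification under stated assumptions: the idle gap is measured between byte arrival timestamps (ignoring byte duration), and the function name and event format are hypothetical.

```python
def accept_after_idle(events, t_idle_uart_us):
    """Drop bytes until an idle gap >= t_idle_uart_us is observed.

    events: list of (timestamp_us, byte) in arrival order.
    Returns (accepted_bytes, dropped_bytes_count).
    """
    accepted = []
    dropped = 0
    synced = False
    prev_ts = None
    for ts, byte in events:
        if prev_ts is not None and ts - prev_ts >= t_idle_uart_us:
            synced = True  # a reliable idle window marks a fresh frame boundary
        if synced:
            accepted.append(byte)
        else:
            dropped += 1   # polluted bytes are counted for the audit trail
        prev_ts = ts
    return accepted, dropped
```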
Pass criteria (placeholders)
- N-frame rule: within next N frames, error frames < X.
- Error density: FE/PE rate < R in window W.
- Recovery time: TTR_p95 < Y ms; avoid unnecessary hard resets.
Rule: flush and wait for a reliable idle window before accepting the first valid frame; then verify stability over an N-frame window.
H2-9 · Firmware Architecture Patterns (State Machine, Idempotency, Fencing)
Intent
Recovery must be a structure, not a pile of actions. Firmware architecture should prevent duplicate writes, half-transactions, and infinite recovery loops while remaining auditable.
Recovery state machine (core contract)
- States: INIT → IDLE → TXN → ERROR → RECOVER → DEGRADED.
- Progress definition: last_progress_ts advances (DMA descriptors move, FIFO drains, or a transaction completes).
- Exit criteria: every state has a pass condition; RECOVER cannot loop without an escalation gate.
Idempotency rules (safe retry vs protected write)
Retry-safe operations
- Reads and status probes.
- Write-same-value (explicitly bounded) with readback optional.
- Versioned writes (seq/version token; compare-and-set semantics).
Non-idempotent writes (protected)
- Increment / append semantics (duplicate write is harmful).
- Trigger writes (start/arm/commit operations).
- Multi-step writes (page/program sequences). Policy: no blind retry after timeout; require readback/confirmation gates.
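A hedged sketch of the "no blind retry" policy: after a timeout on a protected write, the outcome is unknown, so the code reads back before deciding. The callables `write` and `readback` and the result strings are hypothetical, and the read-back is assumed to be side-effect free.

```python
def write_with_policy(write, readback, value, idempotent, max_attempts=3):
    """write(value) returns True on acknowledged completion, False on timeout.

    Retry-safe writes are simply retried. Protected writes are confirmed by
    read-back after a timeout and retried only if the value did not land.
    Returns (result, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        if write(value):
            return "ok", attempt
        if not idempotent and readback() == value:
            # The write took effect but its completion was lost: do not repeat it.
            return "confirmed-by-readback", attempt
    return "failed", max_attempts
```

For increment or append semantics, a plain value comparison is not enough; a sequence number or version token would play the role of `value` in the read-back check.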
Fencing & context cleanup (invariants)
- Transaction identity: txn_id and attempt_id must be unique and logged for every retry.
- Completion token: do not start a new TXN until the previous one provides a completion/ack token.
- Timeout cleanup: queue_depth → 0, dma_state → IDLE, session boundary closed (CS deasserted / idle window observed), error flags cleared.
- No infinite loops: RECOVER has a bounded iteration counter; escalation occurs after threshold K (placeholder).
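The bounded-loop invariant can be illustrated with a small driver for the RECOVER state. This is a sketch under assumptions: `actions` is an ordered soft-to-hard list of hypothetical callables, `probe` is a read-only health check, and `k_max` stands in for the threshold K.

```python
def run_recovery(actions, probe, k_max):
    """Try each ladder level at most k_max times, then escalate.

    actions: ordered list of (name, callable), soft to hard.
    Returns (next_state, log); each log entry is (level, name, attempt, probe_ok).
    """
    log = []
    for level, (name, action) in enumerate(actions, start=1):
        for attempt in range(1, k_max + 1):
            action()
            ok = probe()
            log.append((level, name, attempt, ok))
            if ok:
                return "IDLE", log
    # Every level was exhausted: stop looping and contain the fault.
    return "DEGRADED", log
```

The total number of actions is bounded by `len(actions) * k_max`, so RECOVER cannot spin indefinitely.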
Backoff + error code layering (actionable telemetry)
- Backoff: exponential + jitter; bound by overall TTR budget (placeholders: base, max, jitter).
- Driver-level codes: dma_stuck, irq_missing, timeout_phase (byte/frame/txn).
- Device-level codes: NAK/CRC/header_mismatch, FE/PE burst, busy_stuck.
- System-level codes: no_progress, degraded_entered, safety_state_entered.
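For the backoff item above, a minimal sketch of exponential backoff with symmetric jitter, clamped to a maximum. The parameter names mirror the placeholders (base, max, jitter); the symmetric jitter shape is an assumption, and other jitter schemes are equally valid.

```python
import random


def backoff_ms(attempt, base_ms, max_ms, jitter_frac, rng=random):
    """Delay before retry number `attempt` (0-based), in milliseconds.

    nominal = min(max_ms, base_ms * 2**attempt), then +/- jitter_frac of nominal,
    clamped to the range [0, max_ms].
    """
    nominal = min(max_ms, base_ms * (2 ** attempt))
    jitter = nominal * jitter_frac * (2 * rng.random() - 1)
    return max(0.0, min(max_ms, nominal + jitter))
```

Passing a seeded `random.Random` instance as `rng` makes the sequence reproducible in fault-injection tests.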
Key properties: bounded recovery loops, explicit fencing invariants, and separation of retry-safe operations from protected writes.
H2-10 · Hardware Assist: Watchdogs, Reset Lines, Bus Guardians
Intent
Hardware assists turn recovery into a controlled soft+hard collaboration: layered watchdogs detect no-progress and enforce bounded resets; reset lines and power segmentation isolate faults before system-wide reset.
Watchdog layering (who monitors what)
- Task watchdog: health-task heartbeat; catches soft deadlocks early.
- Comm watchdog: forward progress (last_progress_ts) and error density; triggers peripheral re-init or bus clear.
- System watchdog: last resort; only after fail-safe conditions and audit logging.
Reset & power resources (prefer segment isolation)
- Target reset pin: isolate a single device without disturbing the bus segment.
- Segment load switch: power-cycle only the affected bus segment; keeps unrelated domains alive.
- Reset supervisor: enforces reset pulse width and ordering; exports reset_cause for audit.
Bus guardian concept + minimum anti-ghost-powering rule
- Bus guardian: latches a fault and isolates/disconnects the dragging node or segment when no-progress persists.
- Firmware contract: guardian trip forces DEGRADED mode and probe-only traffic until stability returns.
- Anti-ghost-powering (minimum): disable/Hi-Z bus drivers before removing segment power; re-enable I/O only after rail is stable (ordering placeholders).
Principle: isolate and recover at the smallest scope first (device → segment → system), with watchdog decisions always logged and rate-limited.
H2-11 · Engineering Checklist (Design → Bring-up → Production)
Intent
Convert recovery ideas into an auditable, executable checklist and acceptance criteria across design, bring-up, and production.
Design gate (define budgets + rules before code)
- Timeout budgeting: placeholders fixed for T_byte, T_frame, T_txn, T_idle; retry/backoff bounded by overall TTR.
- Recovery ladder: retry → resync → reinit → device reset → bus-clear/dummy clocks → segment power-cycle/system reset; escalation gates defined (K tries, error density window W).
- Idempotency rules: retry-safe ops listed; non-idempotent writes require “probe + readback/confirm” gates (no blind retry after timeout).
- Minimum event log schema: bus_id, device, op_type, duration, error_code, action_level, txn_id, attempt_id.
Bring-up gate (fault injection validates the ladder)
Inject & observe (minimum set)
- I²C: force SDA low / SCL low; stretch timeout; arbitration-loss → no-progress.
- SPI: CS stuck asserted; persistent header/CRC mismatch (desync); DMA never completes; MISO contention symptom.
- UART: FE/PE burst; FIFO overrun; idle-detect missing; optional autobaud mismatch.
Bring-up pass checks (close the loop)
- Detect: no-progress triggers within T_idle and classification is stable.
- Recover: ladder steps execute in order; escalation only after gates (K, density in W).
- Verify: post-recovery probe passes (N_ok times) and key configuration remains consistent (readback/compare).
Production gate (BIST + counters + field log collection)
- BIST / loopback: acceptance routines prove the RX/TX path and basic framing without relying on a “perfect” environment.
- Counter policy: define reset/rollover rules for error counters; preserve key “lifetime” counters for root-cause trend.
- Reset rate limiting: cap device reset / segment power-cycle attempts; enter DEGRADED mode if exceeded.
- Field log readiness: minimal schema is always available and exportable (bus/device/op/error/action/TTR).
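Reset rate limiting can be sketched as a sliding-window counter. The class name, the window in seconds, and the caller's fallback to DEGRADED mode are assumptions for illustration; the cap and window are placeholders.

```python
from collections import deque


class ResetLimiter:
    """Allow at most max_resets hard actions within any window_s-second window."""

    def __init__(self, max_resets, window_s):
        self.max_resets = max_resets
        self.window_s = window_s
        self.history = deque()  # timestamps of granted resets

    def allow(self, now_s):
        # Forget resets that have aged out of the window.
        while self.history and now_s - self.history[0] >= self.window_s:
            self.history.popleft()
        if len(self.history) < self.max_resets:
            self.history.append(now_s)
            return True
        return False  # cap exceeded: the caller should enter DEGRADED mode
```

Denied requests are not recorded, so a burst of refused attempts does not extend the lockout.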
Pass criteria (placeholders)
- TTR_p95 < X ms and recovery_success_rate > Y%.
- false_reset_rate < Z (per hour/day; define window W).
- After N injections, recovery remains repeatable and configuration consistency is preserved (readback/compare OK).
H2-12 · Applications & IC Selection Notes
Intent
Map recovery requirements to application buckets and selection checkpoints without becoming a product page. Example material numbers are provided for design reference—always verify package, suffix, and availability.
Application buckets (recovery-driven requirements)
Industrial chassis / long harness
- Common failures: no-progress, stuck session, burst errors.
- Recovery focus: bounded escalation + segment isolation + probe-only degraded mode.
- Verification: repeated fault injection across temperature and cable changes; audit TTR and false resets.
High-noise environment
- Common failures: CRC/header mismatch (SPI), FE/PE bursts (UART), NAK storms (I²C).
- Recovery focus: resync + backoff+jitter; avoid retry storms that keep the bus unstable.
- Verification: error density windows and rate limiting demonstrate stability recovery.
Hot-plug boundary
- Common failures: session boundary corruption, ghost-power symptoms, partial power domains.
- Recovery focus: fencing + controlled reset/power sequencing; probe before enabling writes.
- Verification: repeated plug cycles; ensure no duplicate write and no false system reset.
Low-power wake / intermittent duty
- Common failures: first-frame failure after wake, idle-detect ambiguity, timeout mismatches.
- Recovery focus: idle windows, safe resync, and predictable watchdog behavior.
- Verification: wake loops with telemetry; ensure TTR stays within budget without power-cycling storms.
Mass production & field service
- Common failures: rare corner cases that require attribution, not guesswork.
- Recovery focus: consistent error codes + event logs + bounded reset counters.
- Verification: pass/fail gates and field log exportability are mandatory.
Selection notes + example material numbers (verify package/suffix/availability)
Bus helpers (timeouts, buffering, isolation, stuck-bus handling)
- I²C hot-swap / stuck-bus assist: TCA4307 (TI), PCA9511A (NXP).
- I²C mux / hub (channel isolation): TCA9548A (TI), PCA9548A (NXP).
- I²C buffer / rise-time accelerator: TCA4311A (TI), PCA9515A (NXP).
- I²C isolator: ISO1540/ISO1541 (TI), ADuM1250/ADuM1251 (Analog Devices).
- isoSPI / long-chain SPI-style link: LTC6820 (Analog Devices).
Watchdogs, supervisors, reset lines (hard recovery stack)
- Reset supervisor: TPS3808 (TI), ADM809/ADM810 (Analog Devices), MCP130 (Microchip).
- External watchdog timer: TPS3430 (TI), MAX6369 (Analog Devices/Maxim legacy), MCP1316 (Microchip).
- Reset/power sequencing aid (platform-dependent): combine supervisor + load switch to isolate segments (see below).
Segment power & isolation primitives (keep resets local)
- Load switch (segment power-cycle): TPS22910A, TPS22965 (TI), FPF2123 (onsemi/Fairchild legacy family).
- High-side switch (higher current, platform-dependent): TPS1H100 (TI) example class; verify current/diagnostics.
- Digital isolator for SPI/UART-class signals: ISO7741 (TI), ADuM140x family (Analog Devices) as example classes; verify channel count/direction.
Bridges & service ports (field observability and controlled access)
- USB ↔ UART bridge: CP2102N (Silicon Labs), FT232R (FTDI), CH340 (WCH) as common classes; verify driver strategy.
- RS-485 transceiver (UART layering): SN65HVD72 (TI), MAX3485 (Analog Devices/Maxim) as common classes.
- I²C expander for recovery GPIOs: PCA9555 (NXP), TCA9555 (TI) as common classes.
Capability matrix (presented as card list, not a table)
- Controller block: byte/txn/idle timeouts + error flags + abort/flush path + “no-progress” telemetry.
- Target device: reset pin and safe-state behavior + read-only probe register + write-readback capability.
- Bridge/expander: stuck-bus timeout and counters + channel isolation/disable + predictable reset behavior.
Example ICs above are reference points. Selection should start from recovery requirements and verification plan, then back-propagate to hardware hooks and firmware structure.
H2-13 · FAQs (Error Handling & Recovery)
Intent
Close long-tail troubleshooting without expanding the main body. Each answer uses a fixed 4-line structure with measurable pass criteria placeholders.
I²C bus stuck low after a brown-out — first check line vs state-machine?
Likely cause: SDA/SCL is physically held low by a slave or clamp, or the controller/slave FSM is wedged mid-transaction.
Quick check: Sample SDA/SCL as GPIOs (inputs) and compare to controller status (BUSY/START seen/STOP seen). If lines are low even with controller disabled → line-held; if lines float high but BUSY persists → FSM wedge.
Fix: Soft: disable I²C block → GPIO bus-clear (SCL pulses) → generate STOP if possible → re-init controller; then run a read-only health probe before any writes.
Pass criteria: SDA/SCL high within T_idle; health probe passes N_ok times; no duplicate write events after recovery (0 occurrences in window W).
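The quick check and the bus-clear step can be modeled on the host before touching hardware. In this sketch `read_sda` and `pulse_scl` are hypothetical callbacks standing in for GPIO access, and the limit of nine pulses follows the common bus-clear convention; STOP generation and controller re-init are left out.

```python
def classify_i2c_stuck(sda_high: bool, scl_high: bool, controller_busy: bool) -> str:
    """Line levels are sampled as GPIO inputs with the controller disabled."""
    if not (sda_high and scl_high):
        return "line-held"
    if controller_busy:
        return "fsm-wedge"
    return "idle"


def bus_clear(read_sda, pulse_scl, max_pulses: int = 9):
    """Pulse SCL until SDA releases.

    Returns the number of pulses used, or None if SDA is still low
    after max_pulses (escalate to the next ladder rung in that case).
    """
    for pulses in range(max_pulses + 1):
        if read_sda():
            return pulses
        if pulses < max_pulses:
            pulse_scl()
    return None
```

A `None` result is the signal to move to a harder rung rather than to keep clocking the bus.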
Clock stretching happens occasionally — where to set master timeout without breaking slow devices?
Likely cause: A legitimate slow slave stretches occasionally, or a fault makes SCL “never release” (hung stretch).
Quick check: Log stretch duration distribution (max/p95) per device address; compare to “normal” device behavior. If duration is unbounded or always hits the cap → treat as hung stretch.
Fix: Use a two-tier policy: T_byte cap for individual steps + T_txn cap for whole transaction; allow longer caps only for specific known-slow addresses, and escalate to bus-clear after K consecutive cap hits.
Pass criteria: No transaction exceeds T_txn; stretch p95 stays below T_stretch_p95; false timeouts on known-slow devices < X per W.
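A rough model of the two-tier policy follows. It is a sketch, not a driver: `StretchPolicy` is a hypothetical name, the sum of per-byte stretch durations is used as a stand-in for transaction time, and the per-address override table represents the "known-slow" allowance.

```python
class StretchPolicy:
    """Per-byte cap plus whole-transaction cap, with per-address overrides."""

    def __init__(self, t_byte_ms, t_txn_ms, k_cap_hits, slow_devices=None):
        self.t_byte_ms = t_byte_ms
        self.t_txn_ms = t_txn_ms
        self.k = k_cap_hits
        self.slow = dict(slow_devices or {})   # addr -> longer per-byte cap
        self.cap_hits = {}                     # addr -> consecutive cap hits

    def byte_cap(self, addr):
        return self.slow.get(addr, self.t_byte_ms)

    def record(self, addr, byte_stretches_ms):
        """Classify one transaction as 'ok', 'timeout', or 'bus-clear'."""
        over = (any(s > self.byte_cap(addr) for s in byte_stretches_ms)
                or sum(byte_stretches_ms) > self.t_txn_ms)
        if not over:
            self.cap_hits[addr] = 0            # a clean transaction resets the streak
            return "ok"
        self.cap_hits[addr] = self.cap_hits.get(addr, 0) + 1
        return "bus-clear" if self.cap_hits[addr] >= self.k else "timeout"
```

Because the streak is tracked per address, one slow device hitting its cap does not push unrelated devices toward bus-clear.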
SPI CRC bursts after one glitch — how to resync without resetting the whole system?
Likely cause: A single edge glitch caused bit-slip so host/device frame boundaries no longer align (desync), even though the bus still toggles.
Quick check: After forcing CS deassert, send a short “sync header probe” (read-only pattern). If header stays wrong across attempts → boundary is shifted.
Fix: Apply a session fence: CS ↑ guard time → CS ↓ → transmit N_dummy_bits of dummy clocks with safe MOSI fill → read-only header probe → flush driver/DMA queue if descriptors are polluted; reset only the target device if probe never locks.
Pass criteria: Header probe locks within M attempts; CRC error rate < R_crc in window W; TTR_p95 < X ms without system reset.
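The "locks within M attempts" criterion can be written as a small gate. This sketch assumes a hypothetical `read_header` callback that performs one read-only header probe and returns the value seen; lock is defined here as `n_ok` consecutive matches.

```python
def header_probe_locks(read_header, expected, m_attempts: int, n_ok: int) -> bool:
    """True once n_ok consecutive probes match, within m_attempts total reads."""
    streak = 0
    for _ in range(m_attempts):
        streak = streak + 1 if read_header() == expected else 0
        if streak >= n_ok:
            return True
    return False
```

Normal traffic should resume only after this gate returns True; a False result is the trigger for resetting the target device.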
CS toggling doesn’t recover — what’s the quickest “dummy clocks” sanity check?
Likely cause: CS toggling ends the session, but the target remains inside a partial command/stream state and requires boundary re-alignment.
Quick check: Use a minimal sequence: CS ↑ (guard) → CS ↓ → dummy clocks for exactly one frame-length (or one byte multiple) → issue a read-only “known header” read. If the header moves closer to expected values with different dummy lengths → desync confirmed.
Fix: Standardize N_dummy_bits (safe MOSI fill) and require a header probe gate before resuming traffic; if multiple slaves share MISO, add a “Hi-Z check” phase (verify no contention) before reasserting CS.
Pass criteria: Header probe passes N_ok consecutive reads; no MISO contention events in W; no-progress counter remains 0 during steady state.
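To confirm a shifted boundary from a captured probe response, one option is to search for the known header at small bit offsets. The helper below is a hypothetical host-side sketch operating on lists of 0/1 bits; it only reports the slip and makes no assumption about how many dummy clocks a given device needs to realign.

```python
def find_bit_slip(rx_bits, header_bits, max_slip: int = 7):
    """Return the smallest offset (0..max_slip) at which the known header
    appears in the received bit stream, or None if it never does."""
    n = len(header_bits)
    for slip in range(max_slip + 1):
        if rx_bits[slip:slip + n] == header_bits:
            return slip
    return None
```

A non-zero result means the header is present but displaced, which points to desync; `None` points away from a simple bit-slip and toward a target that is not responding as expected.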
UART shows framing errors only in bursts — noise coupling or baud drift?
Likely cause: Burst-coupled noise corrupts edges (FE/PE spikes), or baud mismatch accumulates until sampling slips (drift).
Quick check: Compare FE/PE burst timing to system events (motors, relays, radio TX) and to temperature/clock changes. Noise bursts correlate with discrete events; drift correlates with sustained operating conditions and persists until re-sync/autobaud.
Fix: For bursts: flush RX FIFO + clear flags → wait for an idle window T_idle_uart → accept first valid frame gate. For drift: re-acquire baud (or switch to a calibrated divisor) and re-validate with an idle window gate.
Pass criteria: After recovery, FE/PE rate < R_fe over window W; first-valid-frame locks within TTR; no FIFO overruns in N frames.
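The idle-window gate for the burst case can be sketched as a pass over timestamped frames. The assumptions here are loud ones: frames arrive as `(ts_s, ok)` pairs in time order, `recover_ts_s` marks the moment the FIFO was flushed, and a bad frame closes the gate so that a fresh idle window is required.

```python
def first_valid_after_idle(frames, t_idle_uart_s: float, recover_ts_s: float):
    """Return the timestamp of the first error-free frame that follows a
    quiet gap of at least t_idle_uart_s, or None if no frame qualifies."""
    last_activity = recover_ts_s
    gate_open = False
    for ts, ok in frames:
        if ts - last_activity >= t_idle_uart_s:
            gate_open = True          # line was quiet long enough
        if gate_open and ok:
            return ts
        if not ok:
            gate_open = False         # error: require a new idle window
        last_activity = ts
    return None
```

The difference between the returned timestamp and `recover_ts_s` is the measured TTR for that recovery.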
Recovery works on bench but fails in chassis — what telemetry field is usually missing?
Likely cause: The failure is not just “an error,” but “no forward progress” under real coupling; missing telemetry hides the escalation triggers, so the wrong action is chosen.
Quick check: Confirm logs include last_progress_ts, error density (errors per W), burst length, action level, and reset cause. If only raw error codes exist, the ladder cannot be audited.
Fix: Add “progress” counters (transaction complete increments) + ring-buffer events (bus/device/op/duration/error/action/attempt) and classify warning/degraded/fail-safe. Use these to drive escalation deterministically.
Pass criteria: ≥ P% of field failures have an attributable root bucket; escalation steps are reproducible; TTR_p95 and false reset rate are measurable from logs.
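Error density per window W is the field most often missing, and it is cheap to keep. A minimal sliding-window sketch, with `ErrorDensity` as a hypothetical name and timestamps in seconds:

```python
from collections import deque


class ErrorDensity:
    """Counts error events inside a sliding window of length window_s."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self.events = deque()

    def record(self, ts_s: float) -> None:
        """Log one error event at time ts_s (timestamps must not go backwards)."""
        self.events.append(ts_s)

    def density(self, now_s: float) -> int:
        """Number of errors in the window (now - W, now]."""
        while self.events and self.events[0] <= now_s - self.window_s:
            self.events.popleft()
        return len(self.events)
```

Logging this value alongside last_progress_ts at each escalation makes the decision reproducible from the field log alone.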
Retries make it worse — how to detect retry storms and apply backoff?
Likely cause: Fast retries amplify error density and prevent the bus/device from returning to a stable state (self-sustaining storm).
Quick check: Track retries per second and consecutive failures per device. If retry rate rises while progress stays flat (no completions), a storm is present.
Fix: Enforce exponential backoff with jitter and a hard cap; rate-limit per device and globally; if density exceeds threshold, enter DEGRADED (probe-only) until stability returns.
Pass criteria: Retry rate bounded below R_retry; progress resumes within T_recover; error density falls below D_ok in window W.
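A sketch of the detector and the backoff schedule follows. The "full jitter" variant shown (each delay drawn uniformly between zero and the capped exponential bound) is one common choice, not something the playbook mandates; `retry_storm`, `backoff_delays`, and their parameters are hypothetical names.

```python
import random


def retry_storm(retries_in_window: int, completions_in_window: int,
                r_retry: int) -> bool:
    """Storm: retry rate above the bound while nothing completes."""
    return retries_in_window > r_retry and completions_in_window == 0


def backoff_delays(base_ms: float, cap_ms: float, attempts: int, rng=None):
    """Full-jitter exponential backoff.

    Delay i is uniform in [0, min(cap_ms, base_ms * 2**i)].
    """
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap_ms, base_ms * 2 ** i))
            for i in range(attempts)]
```

Passing a seeded `random.Random` makes the schedule repeatable in tests while leaving production runs randomized.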
After recovery, device config is wrong — how to design idempotent writes?
Likely cause: A timeout occurred during a non-idempotent write (increment/append/trigger), and an automatic retry re-applied a side effect or wrote partial state.
Quick check: Identify which writes are non-idempotent and whether a timeout can happen mid-write (page write, multi-register sequence). Check if the code retries these without a “confirm/readback” gate.
Fix: Make writes idempotent via version/sequence tokens or “write-then-readback compare”; after timeout, block blind retry and switch to read-only probe + reconcile state before re-applying changes.
Pass criteria: Duplicate side effects = 0 in window W; readback/compare passes after recovery; config hash matches expected within T_reconcile.
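The "no blind retry after timeout" rule can be sketched as a wrapper around a single register write. `write` and `read` are hypothetical callbacks, a timeout is modeled with Python's built-in `TimeoutError`, and the returned status strings are illustrative.

```python
def safe_write(write, read, value, idempotent: bool) -> str:
    """Write, then confirm by readback.

    After a timeout the outcome is unknown, so the state is read back
    first. A non-idempotent write is never re-applied automatically.
    """
    try:
        write(value)
    except TimeoutError:
        if read() == value:
            return "confirmed-after-timeout"   # the write did land
        if not idempotent:
            return "needs-reconcile"           # caller must reconcile state
        write(value)                           # retry-safe op: one more attempt
    return "ok" if read() == value else "mismatch"
```

A second timeout on the retry path propagates to the caller, which keeps the retry count bounded at one inside this helper.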
Watchdog resets too often — how to separate comm watchdog vs system watchdog?
Likely cause: A single watchdog is covering both “temporary comm faults” and “system deadlock,” causing over-resets and masking root cause.
Quick check: Log reset cause and the last comm progress timestamp. If resets happen while other tasks are healthy and only comm stalls, the system watchdog is triggered too aggressively.
Fix: Split into task watchdog (task heartbeat), comm watchdog (no-progress/escalation), and system watchdog (fail-safe only). Require comm ladder attempts before system reset, with rate limits.
Pass criteria: System resets per day < Z; comm watchdog resolves stalls in < TTR in > Y% of cases; reset causes are attributable in ≥ P% of logs.
Need fast TTR but no false resets — what escalation thresholds are most effective?
Likely cause: Escalation gates are based on raw error count rather than progress and error density, so transient bursts trigger hard resets.
Quick check: Evaluate triggers: do they use “no-progress for T_idle” and “density in W” or just “N errors”? Raw-N-only is prone to false resets.
Fix: Use layered gates: (1) retry/backoff until K1 failures, (2) resync/reinit if no-progress persists, (3) device reset only if probe fails, (4) segment power-cycle only after K2 repeats with rate limit.
Pass criteria: TTR_p95 < X ms while false_reset_rate < Z; escalation level distribution stable across runs; hard resets occur only after gate conditions are met.
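One way to read the layered gates is as a pure decision function over logged state. The sketch below is an interpretation under assumptions: `LadderState` and its fields are hypothetical, errors with continued progress stay on retry/backoff regardless of count, and exceeding the power-cycle limit falls back to a degraded mode.

```python
from dataclasses import dataclass


@dataclass
class LadderState:
    failures: int = 0                 # consecutive failed attempts
    no_progress: bool = False         # no completion for longer than T_idle
    resync_done: bool = False         # resync/reinit already attempted
    probe_ok: bool = False            # read-only probe result after resync
    device_resets: int = 0            # device resets tried so far
    power_cycles_in_window: int = 0   # segment power-cycles inside window W


def choose_action(s: LadderState, k1: int, k2: int, max_power_cycles: int) -> str:
    """Map current state to the next ladder action."""
    if s.failures < k1 or not s.no_progress:
        return "retry-backoff"        # gate 1: transient, progress continues
    if not s.resync_done:
        return "resync-reinit"        # gate 2: no-progress persists
    if s.probe_ok:
        return "resume"
    if s.device_resets < k2:
        return "device-reset"         # gate 3: only when the probe fails
    if s.power_cycles_in_window < max_power_cycles:
        return "segment-power-cycle"  # gate 4: after K2 repeats, rate limited
    return "degraded"
```

Because the function has no side effects, the same logged state always yields the same action, which is what makes the escalation distribution comparable across runs.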
One bad device wedges the bus — how to quarantine without losing the whole segment?
Likely cause: A single target holds a shared resource (line, MISO, or power domain), causing repeated no-progress and forcing global recovery.
Quick check: Identify the “dominant offender” by per-device no-progress/timeout counters and which address/CS correlates with stalls; confirm that the rest of the segment is healthy via probe-only traffic.
Fix: Quarantine by disabling that channel/device (mux/expander channel disable, CS inhibit) or power-cycling only that sub-branch; keep the system in DEGRADED mode with read-only probes for the remaining devices.
Pass criteria: Non-offender devices maintain progress (no-progress=0) for window W; quarantined branch stays isolated; system-wide resets drop below Z per day.
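Identifying the dominant offender from per-device counters can be as simple as a share test. The thresholds `min_share` and `min_events` below are hypothetical tuning knobs, and the function name is illustrative.

```python
def pick_offender(no_progress_counts: dict, min_share: float = 0.8,
                  min_events: int = 5):
    """Return the device that accounts for at least min_share of all
    no-progress events, or None if the evidence is too thin or too spread."""
    total = sum(no_progress_counts.values())
    if total < min_events:
        return None
    device, count = max(no_progress_counts.items(), key=lambda kv: kv[1])
    return device if count / total >= min_share else None
```

Returning `None` when stalls are spread evenly avoids quarantining a healthy device for a fault that belongs to the shared segment.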
Production wants a pass/fail — what’s a minimal fault-injection test set?
Likely cause: Production tests validate normal traffic but do not exercise recovery paths, so latent “hung” corner cases escape.
Quick check: Confirm whether the test includes at least one injected no-progress condition and one desync/burst condition, and whether it records TTR + action levels.
Fix: Minimal set: (1) I²C SDA low (bus-clear path), (2) SPI forced desync (dummy clocks + header probe), (3) UART FE burst (flush + idle window), plus reset-rate limit verification; require logs for each step.
Pass criteria: All injections recover within TTR; success rate > Y%; false reset rate < Z; post-recovery probes pass N_ok times.