Error Handling & Recovery for I²C, SPI, and UART

Reliable recovery is not “reset and hope”—it is a measurable ladder: detect no-progress, resync safely, escalate with bounded timeouts, and verify configuration consistency after recovery.

This page turns I²C/SPI/UART failures into an auditable playbook with telemetry, pass/fail criteria, and soft-to-hard controls that prevent duplicate writes and false resets.

H2-1 · Definition & Recovery Goals

Intent

Establish a strict vocabulary for transient, desync, and hung/wedged states, then define what “recovery success” means in measurable, production-ready terms.

Scope guardrail (to avoid overlap)

  • Covers: recovery definitions, “no forward progress” criteria, measurable KPIs (TTR/loss/consistency/false-reset).
  • Does not cover: topology/capacitance budgeting, SI/termination, or protocol deep-dives (those belong to sibling pages).

Three failure classes (strict definitions)

1) Transient errors

Errors occur, but forward progress continues: retries succeed, counters increment, transactions complete intermittently.

  • Signal: error counters rise, but completion events still appear.
  • Risk: throughput/latency degradation; occasional data loss.

2) Protocol desync

The bus is active, but endpoints disagree on frame boundary / phase / session state, producing repeated invalid frames until re-aligned.

  • Signal: persistent CRC/header/parity/framing failures with toggling activity.
  • Risk: repeated writes/reads may target wrong state unless fenced.

3) Bus wedged / hung

No forward progress beyond a defined timeout window: a line may be held, a controller may be stuck, or a queue may never complete.

  • Signal: “time since last success” exceeds threshold; completion interrupts stop.
  • Risk: system-level cascading failures; recovery must be staged and safe.

Recovery success = three parallel pass criteria

  1. Data consistency: no unintended duplicate writes; configuration remains coherent (read-back match where applicable).
  2. Device availability: critical transactions complete repeatedly after recovery (not a one-off “lucky” success).
  3. System safety state: recovery does not trigger unsafe actions (avoid over-resetting, avoid “half-applied” control changes).

Outputs to report (KPIs with threshold placeholders)

Time to recover (TTR)

Define TTR_p95 and TTR_max per application. Example placeholder: TTR_p95 < X ms.

Loss / retry burden

Track loss and retry density (burstiness matters): loss_per_hour < Y, retry_burst_p99 < N.

Duplicate-write risk

For non-idempotent writes, require a fence (read-back, sequence number, or safe “commit” step). Placeholder: dup_write_rate < Z.

False resets

Stage recovery to minimize unnecessary resets. Placeholder: false_reset_per_day < W.

Diagram · Fault-state funnel and action escalation
[Figure: Fault-state funnel with escalation. Four stages show how transient errors can escalate to desync, then hung, then system-level reset, each with detect, cost, and action cues.]

Reading tip: “Hung” is defined by no forward progress (not just “many errors”). Escalation should be staged to protect consistency and minimize false resets.

H2-2 · Failure Taxonomy Across I²C / SPI / UART

Intent

Replace guessing with a symptom → cause class → first check map. Each symptom includes one discriminator to quickly separate line-level wedges from state-machine wedges.

Two primary trunks (fast classification)

A) Physical-level wedge

A line is held (stuck-low, forced drive, or electrical contention). Progress stops because the signal cannot return to a valid idle or edge sequence.

Discriminator: pin level remains invalid even when the controller is idle.

B) State-machine wedge

The physical signals may toggle, but a controller/driver/device state machine (or DMA/queue) never completes. Progress stops because the system is “stuck waiting”.

Discriminator: pin levels look plausible, but completion events and “last-good” timestamps freeze.
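The two discriminators above can be captured in a small classifier. This is a sketch, assuming the caller can sample pin validity with the controller idle and knows whether the progress timestamp has frozen; the type and function names are illustrative, not a specific driver API.

```c
#include <stdbool.h>

typedef enum { WEDGE_NONE, WEDGE_PHYSICAL, WEDGE_STATE_MACHINE } wedge_class_t;

/* pin_valid_when_idle: lines return to a valid idle level with the
 * controller disabled.  progress_frozen: last_progress_ts has stopped
 * advancing beyond the no-progress window. */
static wedge_class_t classify_wedge(bool pin_valid_when_idle,
                                    bool progress_frozen)
{
    if (!pin_valid_when_idle)
        return WEDGE_PHYSICAL;       /* line held: stuck-low or contention */
    if (progress_frozen)
        return WEDGE_STATE_MACHINE;  /* signals plausible, completion frozen */
    return WEDGE_NONE;
}
```

The order matters: a physical wedge is checked first because a held line also freezes progress, and the two classes call for different first checks.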

Symptom cards (no large tables; mobile-safe)

I²C — common “hung” signatures

Symptom: SDA stuck low

Likely cause class: physical wedge (slave holds SDA, contention, or pin latch).

First check: read SDA level with controller disabled; log “last-good” timestamp freeze.

Discriminator: SDA remains low even after STOP attempt or peripheral disable.

Symptom: SCL stuck low

Likely cause class: physical wedge or clock source/driver latch.

First check: verify SCL can be driven high in GPIO mode (safe window); confirm pull-up rail present.

Discriminator: SCL low persists across controller reset.

Symptom: clock stretching never releases

Likely cause class: state-machine wedge (slow/slave stuck, timeout policy missing).

First check: measure stretch duration vs master timeout; confirm “no-progress” criteria trips.

Discriminator: pins toggle up to a point, then wait forever without completion.

Symptom: arbitration loss → no recovery

Likely cause class: state-machine wedge (controller state not cleared, retry storm).

First check: confirm driver clears arbitration flag; track retry burst density and backoff presence.

Discriminator: line is idle/high, but transactions never complete.

SPI — common “hung / desync” signatures

Symptom: CS stuck asserted

Likely cause class: state-machine wedge (session boundary not fenced; driver waiting).

First check: confirm CS deassert occurs on timeout; verify “end-of-transaction” interrupt path.

Discriminator: SCLK toggles may stop, but CS remains low beyond the budget window.

Symptom: DMA never completes

Likely cause class: state-machine wedge (interrupt masked, queue stuck, descriptor issue).

First check: verify DMA progress counter; check if last descriptor completed timestamp stops updating.

Discriminator: pins may be quiet, but software shows “busy” indefinitely.

Symptom: MISO driven when it should be Hi-Z

Likely cause class: physical wedge (contention, miswired CS, faulty device).

First check: sample MISO level with CS high; look for forced-level behavior across resets.

Discriminator: the line stays strongly driven even when the bus is idle.

Symptom: header/CRC wrong after one glitch

Likely cause class: protocol desync (bit-slip or phase/session mismatch).

First check: verify a resync primitive exists (CS fence + known header probe); count consecutive invalid frames.

Discriminator: activity continues, but validity never returns without a resync step.

UART — common “noise lock / stuck receive” signatures

Symptom: framing/parity errors in bursts

Likely cause class: protocol desync or noise-driven false start bits.

First check: correlate FE/PE bursts with idle gaps; verify “wait-for-idle” resync window exists.

Discriminator: errors cluster; stable periods exist between bursts.

Symptom: FIFO overrun repeats

Likely cause class: state-machine wedge (ISR starvation, flow control missing, buffer sizing).

First check: log service latency vs FIFO depth; confirm error flags are cleared and counters advance.

Discriminator: overrun rate tracks CPU load/ISR latency, not cable events.

Symptom: autobaud drifts over time

Likely cause class: protocol desync (clock mismatch accumulation; wrong reacquire policy).

First check: compare measured bit time vs expected; validate reacquire trigger and guard window.

Discriminator: errors increase gradually, not abruptly, and improve after reacquire.

Diagram · 3-bus symptom tree (fast triage map)
[Figure: Three-bus symptom tree. The root hung state splits into physical-level wedges and state-machine wedges, then maps to I²C, SPI, and UART symptom leaves.]

Usage rule: each symptom card intentionally stops at the first check. Deeper electrical design and protocol timing details belong to sibling pages to avoid content overlap.

H2-3 · Observability & Telemetry Hooks

Intent

Without observability, recovery becomes blind resets. Use a minimal, low-overhead telemetry set to prove forward progress, tune timeouts, and control escalation.

Minimal telemetry set (must-have)

Counters

  • Error counters: NAK / CRC / FE / PE (map by bus type; keep a unified field name).
  • Timeout counters: byte/step, frame, transaction timeouts (do not merge).
  • Retry counters: retry_total, retry_burst_max, backoff_applied_count.

Time signals

  • last_good_ts: last completed successful transaction.
  • last_progress_ts: last observable forward progress (DMA advance, FIFO drain, state step).
  • Duration distribution: p50/p95/p99 per operation type (not just averages).

Recovery action counts

  • Soft: retry/backoff, resync, flush.
  • Hard: driver re-init, device reset.
  • Power: segment power-cycle (should be rare; track false resets).

Rule: “hung” should be triggered by no forward progress (last_progress_ts frozen), not by “many errors” alone.
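That rule reduces to a small progress tracker. A minimal sketch, assuming a monotonic millisecond tick; `progress_tracker_t` and its helpers are illustrative names, not a specific platform API.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t last_progress_ms;  /* updated on any forward progress */
    uint32_t t_idle_ms;         /* no-progress window (placeholder T_idle) */
} progress_tracker_t;

/* Call on DMA advance, FIFO drain, or transaction completion. */
static void note_progress(progress_tracker_t *pt, uint32_t now_ms)
{
    pt->last_progress_ms = now_ms;
}

/* Hung = no forward progress for T_idle, regardless of error counters.
 * Unsigned subtraction keeps the check safe across tick wraparound. */
static bool is_hung(const progress_tracker_t *pt, uint32_t now_ms)
{
    return (uint32_t)(now_ms - pt->last_progress_ms) > pt->t_idle_ms;
}
```

Error counters can rise arbitrarily without tripping this classifier, which is exactly the property the rule asks for.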

Event log schema (minimal fields)

Identity

  • bus_id (controller instance / channel)
  • device_addr (address/CS index/port id)

Operation

  • op_type (read/write/config/stream)
  • duration_us (end-to-end elapsed)

Outcome

  • error_code (NAK/CRC/FE/PE/timeout)
  • phase (byte/step, frame, txn)

Recovery

  • recovery_action (retry/resync/reinit/reset/power)
  • attempt_id (groups repeated attempts for the same request)
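The schema above maps naturally to a fixed-size record. This is a sketch; the enum encodings are assumptions local to this page, not a standard layout.

```c
#include <stdint.h>

typedef enum { OP_READ, OP_WRITE, OP_CONFIG, OP_STREAM } op_type_t;
typedef enum { ERR_NONE, ERR_NAK, ERR_CRC, ERR_FE, ERR_PE,
               ERR_TIMEOUT } err_code_t;
typedef enum { PHASE_BYTE, PHASE_FRAME, PHASE_TXN } phase_t;
typedef enum { REC_NONE, REC_RETRY, REC_RESYNC, REC_REINIT,
               REC_RESET, REC_POWER } rec_action_t;

typedef struct {
    uint8_t      bus_id;           /* controller instance / channel */
    uint16_t     device_addr;      /* address / CS index / port id */
    op_type_t    op_type;
    uint32_t     duration_us;      /* end-to-end elapsed */
    err_code_t   error_code;
    phase_t      phase;            /* byte/step, frame, or txn */
    rec_action_t recovery_action;
    uint32_t     attempt_id;       /* groups retries of one request */
} bus_event_t;
```

Keeping the record small and enum-coded lets the ISR-side ring buffer stay cheap while the health task does the string formatting.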

Alert tiers (bind alerts to action)

Warning

Errors rise but progress continues. Action: increase logging granularity; track burstiness; avoid escalation.

Degraded

p95/p99 latency or error bursts exceed budget. Action: apply backoff, reduce load, and prepare resync primitives.

Fail-safe

The no-forward-progress criteria are met. Action: enter staged recovery; quarantine the device; protect consistency.

Pass criteria (placeholders)

  • Telemetry coverage: 100% of recovery actions logged with schema fields populated.
  • Progress detection: hung classification uses last_progress_ts (not error count only).
  • False reset: false_reset_per_day < W (target placeholder).
Diagram · Telemetry pipeline (low overhead → actionable decision)
[Figure: Telemetry pipeline. Telemetry flows from ISR and driver counters into a ring buffer, then health evaluation, metrics and event logs, alert tiers, and a watchdog decision for staged recovery.]

Design note: keep ISR work minimal (increment counters only). Use the health task for classification and escalation to reduce false resets.

H2-4 · Timeout Budgeting (Transaction / Byte / Idle)

Intent

Timeouts should be budgeted, not guessed. Use layered timeouts to detect “stuck steps” early while avoiding false resets and uncontrolled TTR.

Timeout layers (strict meanings)

Byte / step timeout (T_byte)

The smallest wait budget for a step to advance (byte, phase, descriptor progress). Triggers early when “progress stalls” inside a transaction.

Frame timeout (T_frame)

A complete frame/burst window (UART frame, SPI burst block, I²C phase group). Useful for “resync boundary” decisions.

Transaction timeout (T_txn)

End-to-end budget for a full transaction. Limits total waiting and bounds TTR when combined with retries/backoff.

Idle / no-progress timeout (T_idle)

A “hung” classifier based on last_progress_ts. Not the same as T_txn; it detects zero progress even if signals toggle.

How to set thresholds (distribution-based)

  • Use p95/p99 from telemetry (H2-3), not averages.
  • Set: T_byte = p99(step) + margin, T_frame = p99(frame) + margin, T_txn = p99(txn) + margin.
  • Set: T_idle using last_progress_ts stalls (progress-free window), not “transaction time”.
  • Define placeholders: T_byte, T_frame, T_txn, N_retry, Backoff (tuned by burst statistics).
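The "p99 + margin" rule can be computed directly from logged durations. A sketch using a simple nearest-rank percentile over a sample buffer; the function names are illustrative.

```c
#include <stdint.h>
#include <stdlib.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile: p in [1,100], n > 0.  Sorts in place. */
static uint32_t percentile_u32(uint32_t *samples, size_t n, unsigned p)
{
    qsort(samples, n, sizeof *samples, cmp_u32);
    size_t rank = (p * n + 99) / 100;   /* ceil(p * n / 100) */
    return samples[rank - 1];
}

/* T = p99(observed) + margin, per the budgeting rule above. */
static uint32_t timeout_from_p99(uint32_t *samples, size_t n,
                                 uint32_t margin_us)
{
    return percentile_u32(samples, n, 99) + margin_us;
}
```

The same helper sets T_byte, T_frame, and T_txn from their respective duration distributions; only the sample stream differs.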

Key knobs by bus type (timeout vocabulary)

I²C

  • T_stretch_max: maximum tolerated clock-stretching window.
  • Master timeout policy: define when to stop waiting and classify no-progress.

SPI

  • DMA block timeout: bound waiting for descriptor completion.
  • CS assert + frame gap: fence sessions; detect stalls between bursts.

UART

  • Inter-byte timeout: detect stalled streams and rebuild boundaries.
  • Frame timeout + idle window: define resync windows after bursts.

Integrate retries/backoff into the TTR budget

  • Total waiting should be bounded: TTR ≈ (T_txn × N_retry) + Σ(backoff) (placeholder form).
  • Use burst telemetry to pick Backoff (avoid retry storms).
  • Escalation should require timeout + no-progress before hard resets.
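The placeholder TTR bound above can be evaluated numerically. A sketch assuming exponential backoff that doubles from a base up to a cap; all parameters are placeholders to tune from burst telemetry.

```c
#include <stdint.h>

/* Upper bound on time-to-recover, per the placeholder form:
 * TTR ≈ T_txn × N_retry + Σ backoff_i, with backoff doubling
 * from base_ms up to cap_ms between attempts. */
static uint32_t ttr_budget_ms(uint32_t t_txn_ms, uint32_t n_retry,
                              uint32_t base_ms, uint32_t cap_ms)
{
    uint32_t total = t_txn_ms * n_retry;
    uint32_t b = base_ms;
    for (uint32_t i = 1; i < n_retry; i++) {  /* backoffs between attempts */
        total += b;
        b = (b * 2 > cap_ms) ? cap_ms : b * 2;
    }
    return total;
}
```

Running this at design time shows whether a chosen (T_txn, N_retry, Backoff) tuple fits inside the application's TTR_p95 target before any hardware is involved.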

Pass criteria (placeholders)

  • False timeouts: false_timeout_rate < X (target placeholder).
  • Recovery speed: TTR_p95 < Y ms (target placeholder).
  • Safety: no duplicate-write incidents under fault injection (policy verified).
Diagram · Timeout budgeting timelines (I²C / SPI / UART)
[Figure: Timeout budgeting timelines. Three parallel timelines show I²C, SPI, and UART transaction segmentation with step, frame, transaction, and idle/no-progress timeout windows.]

Implementation note: classify “hung” only when timeouts coincide with no-progress (last_progress_ts frozen). This reduces false resets while bounding recovery time.

H2-5 · Recovery Ladder (Soft → Hard) & When to Escalate

Intent

Make recovery auditable with a staged ladder. Escalation should be driven by timeouts + no-forward-progress, not by a single error spike.

Recovery ladder (6 levels, unified vocabulary)

Level 1 · Retry same op (with backoff)

  • Trigger: transient errors while progress continues.
  • Action: retry + exponential backoff + jitter (avoid retry storms).
  • Exit: one success + stability window (N_ok placeholder).

Level 2 · Re-sync (flush / idle wait / boundary rebuild)

  • Trigger: suspected protocol desync (boundary mismatch).
  • Action: flush queues, wait idle window, rebuild session boundary.
  • Exit: read-only health probe passes.

Level 3 · Re-init peripheral block (driver reset)

  • Trigger: progress stalls inside controller/driver (DMA/ISR/state).
  • Action: controller reset + context rebuild + config replay/verify.
  • Exit: controller self-check + health probe passes.

Level 4 · Reset target device (GPIO reset / command reset)

  • Trigger: failures localize to one device; safe to isolate.
  • Action: device reset, then read-only probe and configuration reconciliation.
  • Exit: device online + probe stable for N_ok.

Level 5 · Bus-level clear (I²C bus-clear / SPI dummy clocks)

  • Trigger: bus wedge indicators (timeouts + no-progress + line/session stuck).
  • Action: bus clear primitive after freezing traffic and acquiring bus lock.
  • Exit: bus returns to idle + read-only probe passes.

Level 6 · Power-cycle segment / system reset (last resort)

  • Trigger: fail-safe requirements or repeated ladder failure.
  • Action: segment power-cycle first; full system reset only when required.
  • Exit: boot self-test + probes confirm stable operation.

Escalation policy (combine gates, avoid false resets)

  • Gate A — Timeout budget hit: T_byte / T_frame / T_txn exceeded (placeholders from H2-4).
  • Gate B — No forward progress: now − last_progress_ts > T_idle (placeholder).
  • Gate C — Persistence: fail_streak ≥ K or err_rate ≥ E within a window W (placeholders).
  • Gate D — Operation class: non-idempotent writes require a read-only probe before retrying; critical ops may skip to a safer level.
  • Gate E — Safety: enter safe-state before hard actions (device reset / bus clear / power-cycle).
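Gates A through C combine into a single escalation predicate. A sketch with illustrative field names; thresholds are the placeholders defined above, and Gates D/E remain policy checks outside this function.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     timeout_hit;       /* Gate A: T_byte/T_frame/T_txn exceeded */
    uint32_t now_ms;            /* Gate B inputs */
    uint32_t last_progress_ms;
    uint32_t t_idle_ms;
    uint32_t fail_streak;       /* Gate C inputs */
    uint32_t fail_streak_k;
} gate_inputs_t;

/* Escalate only when timeout evidence AND no forward progress AND
 * persistence all hold — a single error spike never escalates. */
static bool should_escalate(const gate_inputs_t *g)
{
    bool no_progress =
        (uint32_t)(g->now_ms - g->last_progress_ms) > g->t_idle_ms;
    bool persistent = g->fail_streak >= g->fail_streak_k;
    return g->timeout_hit && no_progress && persistent;
}
```

Requiring all three gates is what keeps false_reset_per_day bounded: each gate alone has common benign causes.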

Degraded operation (contain impact while recovering)

Read-only mode

Allow health probes and status reads only; block side-effect writes until stability is proven.

Quarantine device

Isolate one address/CS while keeping the bus usable for other devices; log quarantine entry/exit.

Bypass path

Switch to a redundant channel when available; keep telemetry continuity to compare both paths.

Auditability & pass criteria (placeholders)

  • Action record: level, trigger gates, duration, result, attempt_id always logged.
  • Recovery speed: TTR_p95 < Y ms (placeholder).
  • False resets: false_reset_per_day < W (placeholder).
  • Consistency: zero duplicate-write incidents in fault injection for non-idempotent ops.
Diagram · Staged recovery ladder (cost/time + escalation gates)
[Figure: Recovery ladder. A six-level recovery ladder from retry to power-cycle, annotated with cost and time markers and escalation gate tags.]

Rule: prefer the lowest level that restores forward progress. Hard actions require the combined gates and an audit trail.

H2-6 · I²C Recovery Playbook (Hung Bus)

Intent

Most I²C hangs are either lines held (SDA/SCL stuck low) or a stalled state machine. Recovery should be repeatable, verifiable, and safe for non-idempotent writes.

Detect & classify (line-level vs state-level)

Hung classifier

  • No-progress: now − last_progress_ts > T_idle (placeholder).
  • Timeout evidence: repeated T_step/T_txn hits (from H2-4).

Line state

  • SDA stuck low: common “bus wedged” signature.
  • SCL stuck low: treat cautiously; may require reinit or segment power control.

State-level wedge

  • Stretch timeout: stretch_duration > T_stretch_max (placeholder).
  • Arbitration loss stall: after arbitration, progress does not resume.

Safe preconditions (protect consistency)

  • Freeze traffic: stop new I²C transactions; acquire a bus lock/mutex.
  • Block side-effect writes: during recovery, allow read-only probes only.
  • Snapshot: record SDA/SCL levels, controller status, and counters before action.

Recovery actions (repeatable sequence)

1) Bus clear (SCL pulses)

  • Use when: SDA stuck low (or strong suspicion).
  • Action: drive N_pulses clocks (placeholder) to release SDA.
  • Observe: SDA returns high; bus returns to idle.
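The bus-clear primitive above, in GPIO bit-bang form: pulse SCL until the stuck device shifts out its remaining bit and releases SDA. This is a sketch; the pin-access callbacks are hypothetical, standing in for whatever HAL the project uses, and the simulated device at the bottom exists only so the sequence can be exercised on a host.

```c
#include <stdbool.h>

#define BUS_CLEAR_MAX_PULSES 9  /* placeholder N_pulses; 9 covers a byte + ACK */

typedef struct {
    bool (*read_sda)(void);        /* sample SDA level (hypothetical HAL) */
    void (*set_scl)(bool level);   /* drive/release SCL, open-drain assumed */
    void (*delay_half_bit)(void);  /* half an SCL period */
} i2c_pins_t;

/* Pulse SCL until SDA reads high again.  Returns true when released;
 * the caller should then generate a STOP and run a read-only probe. */
static bool i2c_bus_clear(const i2c_pins_t *p)
{
    for (int i = 0; i < BUS_CLEAR_MAX_PULSES; i++) {
        if (p->read_sda())
            return true;            /* SDA released: bus can idle again */
        p->set_scl(false);
        p->delay_half_bit();
        p->set_scl(true);           /* release SCL; device shifts one bit */
        p->delay_half_bit();
    }
    return p->read_sda();           /* still low: escalate (reinit / power) */
}

/* Host-side simulation only: a device that releases SDA after 3 pulses. */
static int sim_pulses_remaining = 3;
static bool sim_read_sda(void) { return sim_pulses_remaining <= 0; }
static void sim_set_scl(bool level)
{
    if (!level && sim_pulses_remaining > 0)
        sim_pulses_remaining--;
}
static void sim_delay(void) {}
```

If all pulses are exhausted with SDA still low, the wedge is not a simple shift-register hang, and the ladder should move to controller reinit or segment power control.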

2) Generate STOP (if supported)

  • Use when: SDA/SCL are high and a clean boundary is needed.
  • Goal: return bus to a known idle state before probes.

3) Re-init controller (driver/peripheral)

  • Use when: both lines high but no-progress persists.
  • Action: reset controller, rebuild context, replay config, verify status.

4) Optional: re-enumerate targets

  • Use when: devices may have reset/power-cycled.
  • Action: presence checks via read-only probes (avoid side effects).

Risk control & health probe (read-only, verifiable)

  • Probe: read a non-side-effect status/ID register to confirm ACK + STOP behavior.
  • Consistency: after recovery, config writes must be reconciled by read-back (policy).
  • Pass criteria: probe passes N_ok times; error counters remain stable in window W (placeholders).
Diagram · I²C hung recovery (bus-clear → STOP → probe → escalate)
[Figure: I²C hung recovery flow. The flow starts from detection (timeout + no-progress) and branches by line state: SDA low uses SCL pulses to release SDA; SCL low escalates cautiously; both high uses STOP or controller reinit. All paths end with a read-only health probe.]

Safety note: the recovery path should freeze traffic and use read-only probes to prevent duplicate writes during uncertainty.

H2-7 · SPI Recovery Playbook (Desync / Stuck Session)

Intent

Most SPI “hangs” are session-boundary failures (CS) or flow-control stalls (DMA/queues). Recovery should enforce session fencing, rebuild framing, and restore forward progress with minimal side effects.

Detection (symptoms → first check)

CS asserted too long

  • Likely causes: missing end-of-transfer, device busy, CS control stuck.
  • First check: CS level vs transfer_done flag and guard time policy.

MISO strong-driven unexpectedly

  • Likely causes: multi-slave contention, wrong CS decode, device stuck in data phase.
  • First check: only one CS active; confirm idle Hi-Z expectation.

DMA never completes / queue stalls

  • Likely causes: lost interrupt, descriptor stuck, driver state mismatch.
  • First check: last_progress_ts updates; DMA state and IRQ pending flags.

CRC / header continuously wrong

  • Likely causes: desync (bit-slip), partial frame, CS boundary violation.
  • First check: enforce CS fence; then header probe after re-sync.

Session fencing (do not start a new transaction until the previous one is closed)

  • Fence conditions (placeholders): transfer_done = true, CS deasserted, guard time ≥ T_guard.
  • Queue discipline: flush/abort must fully terminate the active descriptor before re-arming DMA.
  • Audit fields: cs_low_time, dma_state, attempt_id, last_progress_ts always captured.
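The fence conditions reduce to one predicate that gates every new transaction. A sketch with illustrative names; the field names mirror the audit fields above and T_guard is the placeholder guard time.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     transfer_done;    /* previous transfer fully completed */
    bool     cs_asserted;      /* CS currently low (session open) */
    uint32_t cs_deassert_ms;   /* timestamp of last CS rising edge */
    uint32_t t_guard_ms;       /* required gap before the next session */
} spi_fence_t;

/* A new transaction may start only when the previous session is closed:
 * done flag set, CS high, and the guard time elapsed. */
static bool spi_fence_open(const spi_fence_t *f, uint32_t now_ms)
{
    if (!f->transfer_done || f->cs_asserted)
        return false;
    return (uint32_t)(now_ms - f->cs_deassert_ms) >= f->t_guard_ms;
}
```

Checking the fence before re-arming DMA is what prevents a polluted descriptor from spilling into the next frame.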

Recovery actions (repeatable sequence)

1) CS deassert + guard + reassert

  • Goal: close the prior session boundary.
  • Exit: header probe becomes stable after the next step.

2) Dummy clocks (re-align shift boundary)

  • Use when: header/CRC persistently wrong (suspected bit-slip).
  • Parameter: N_dummy_bits placeholder; ensure MOSI fill is side-effect safe.

3) Sync header probe (read-only)

  • Goal: confirm known header/ID framing before resuming normal traffic.
  • Exit: probe passes N_ok times (placeholder).

4) Flush DMA / driver queues

  • Use when: descriptors do not advance, or queued frames are polluted.
  • Exit: DMA state returns to idle and last_progress_ts updates.

5) Device reset (only if required)

  • Use when: re-sync cannot restore a stable header.
  • Exit: probe stable; error density below threshold in window W.

Pass criteria (placeholders)

  • Framing: sync header probe passes N_ok times.
  • Stability: error rate < X within window W.
  • Speed: TTR_p95 < Y ms; avoid false hard resets.
Diagram · SPI session fence (CS toggle → dummy clocks → header probe → resume)
[Figure: SPI session fence flow. The flow starts from detection, then enforces CS fencing, applies dummy clocks to realign framing, runs a sync header probe, and resumes if stable. If the probe fails, flush queues and optionally reset the device.]

Emphasis: close the previous session and re-validate framing before normal traffic resumes. Queue flush must fully terminate the active DMA state.

H2-8 · UART Recovery Playbook (Noise Burst / Framing Lock)

Intent

UART recovery focuses on rebuilding frame boundaries and clearing error states, especially under noise bursts. The goal is a verifiable return to valid frames, not blind resets.

Detection (signals that indicate boundary loss)

  • FE/PE burst: continuous framing/parity errors; burst_max exceeds baseline.
  • Idle missing: no reliable idle detect window within W (placeholder).
  • FIFO overrun: overrun_count rises; RX backlog does not drain.
  • Autobaud mismatch (optional): autobaud lock/estimate deviates from expected range.

Recovery actions (boundary rebuild)

1) Flush RX FIFO + clear error flags

  • Goal: remove polluted bytes and exit sticky error state.
  • Audit: dropped_bytes_count recorded for traceability.

2) Wait-for-idle window (rebuild boundary)

  • Rule: accept a new frame only after idle ≥ T_idle_uart (placeholder).
  • Benefit: filters noise-induced false start bits.
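The wait-for-idle rule can be expressed as a tiny byte-acceptance state machine: after a flush, bytes are dropped until one arrives after a gap of at least T_idle_uart. A sketch with illustrative names, assuming millisecond timestamps per received byte.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     synced;         /* frame boundary trusted again? */
    uint32_t last_byte_ms;   /* arrival time of the previous byte */
    uint32_t t_idle_ms;      /* required idle gap (placeholder T_idle_uart) */
} uart_resync_t;

/* Call once per received byte after a flush.  Returns true when the byte
 * should be accepted; the first accepted byte starts the rebuilt frame. */
static bool uart_accept_byte(uart_resync_t *u, uint32_t now_ms)
{
    if (!u->synced &&
        (uint32_t)(now_ms - u->last_byte_ms) >= u->t_idle_ms)
        u->synced = true;    /* idle window observed: boundary rebuilt */
    u->last_byte_ms = now_ms;
    return u->synced;
}
```

Noise-tail bytes that arrive back-to-back keep resetting the gap, so only a genuine idle window re-opens acceptance.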

3) Break / idle resync (if allowed)

  • Use when: protocol supports a resync marker (break/idle sequence).
  • Exit: first valid frame passes checks (parity/length/CRC per stack).

4) Autobaud reacquire (optional)

  • Use when: autobaud lock is suspected to be wrong.
  • Exit: estimated baud within tolerance window (placeholder).

Pass criteria (placeholders)

  • N-frame rule: within next N frames, error frames < X.
  • Error density: FE/PE rate < R in window W.
  • Recovery time: TTR_p95 < Y ms; avoid unnecessary hard resets.
Diagram · UART resync timeline (noise → flush → idle window → first valid frame → N-frame check)
[Figure: UART resync timeline. A noise burst produces FE/PE errors, then the RX FIFO is flushed and error flags cleared, then an idle window rebuilds the boundary, then the first valid frame is accepted, and finally an N-frame pass-criteria window runs.]

Rule: flush and wait for a reliable idle window before accepting the first valid frame; then verify stability over an N-frame window.

H2-9 · Firmware Architecture Patterns (State Machine, Idempotency, Fencing)

Intent

Recovery must be a structure, not a pile of actions. Firmware architecture should prevent duplicate writes, half-transactions, and infinite recovery loops while remaining auditable.

Recovery state machine (core contract)

  • States: INIT → IDLE → TXN → ERROR → RECOVER → DEGRADED.
  • Progress definition: last_progress_ts advances (DMA descriptors move, FIFO drains, or a transaction completes).
  • Exit criteria: every state has a pass condition; RECOVER cannot loop without an escalation gate.

Idempotency rules (safe retry vs protected write)

Retry-safe operations

  • Reads and status probes.
  • Write-same-value (explicitly bounded) with readback optional.
  • Versioned writes (seq/version token; compare-and-set semantics).

Non-idempotent writes (protected)

  • Increment / append semantics (duplicate write is harmful).
  • Trigger writes (start/arm/commit operations).
  • Multi-step writes (page/program sequences). Policy: no blind retry after timeout; require readback/confirmation gates.
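
One way to express the readback/confirmation gate is a write helper in which success is only ever reported after a readback compare. This is a sketch under assumptions: the function-pointer hooks and the demo register model are hypothetical stand-ins for the real bus driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register I/O hooks -- bind to the real bus driver. */
typedef bool (*reg_write_fn)(uint8_t reg, uint8_t val);
typedef bool (*reg_read_fn)(uint8_t reg, uint8_t *val);

/* Protected write: every attempt is gated by a readback compare; on
 * exhaustion the caller drops to a read-only probe + reconcile path --
 * never a blind retry after a timeout. */
static bool write_with_readback(reg_write_fn wr, reg_read_fn rd,
                                uint8_t reg, uint8_t val, unsigned max_tries)
{
    for (unsigned i = 0; i < max_tries; i++) {
        uint8_t back = (uint8_t)~val;  /* force a real readback match */
        if (wr(reg, val) && rd(reg, &back) && back == val)
            return true;
    }
    return false;
}

/* Demo device model for the usage example below (hypothetical). */
static uint8_t demo_regs[256];
static bool demo_wr(uint8_t r, uint8_t v)  { demo_regs[r] = v; return true; }
static bool demo_rd(uint8_t r, uint8_t *v) { *v = demo_regs[r]; return true; }
```

Note this helper is only appropriate for retry-safe, write-same-value operations; increment/append/trigger writes need the version-token or compare-and-set scheme instead.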

Fencing & context cleanup (invariants)

  • Transaction identity: txn_id and attempt_id must be unique and logged for every retry.
  • Completion token: do not start a new TXN until the previous one provides a completion/ack token.
  • Timeout cleanup: queue_depth → 0, dma_state → IDLE, session boundary closed (CS deasserted / idle window observed), error flags cleared.
  • No infinite loops: RECOVER has a bounded iteration counter; escalation occurs after threshold K (placeholder).

Backoff + error code layering (actionable telemetry)

  • Backoff: exponential + jitter; bound by overall TTR budget (placeholders: base, max, jitter).
  • Driver-level codes: dma_stuck, irq_missing, timeout_phase (byte/frame/txn).
  • Device-level codes: NAK/CRC/header_mismatch, FE/PE burst, busy_stuck.
  • System-level codes: no_progress, degraded_entered, safety_state_entered.
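
The bounded exponential-backoff-with-jitter policy can be sketched as below. The base/max constants are the placeholders from the text, and the xorshift generator is a stand-in assumption; a platform RNG would normally supply the jitter.

```c
#include <stdint.h>

/* Placeholder backoff parameters -- bound the ladder by the TTR budget. */
#define BACKOFF_BASE_MS   2u
#define BACKOFF_MAX_MS  128u

/* Deterministic xorshift for jitter; swap in the platform RNG if available. */
static uint32_t jitter_rng(uint32_t *s)
{
    *s ^= *s << 13; *s ^= *s >> 17; *s ^= *s << 5;
    return *s;
}

/* Delay (ms) before attempt n (0-based): min(base << n, max), plus up to
 * 50% jitter so simultaneous retriers de-correlate instead of storming. */
static uint32_t backoff_ms(uint32_t attempt, uint32_t *seed)
{
    uint32_t d = (attempt < 16u) ? (BACKOFF_BASE_MS << attempt)
                                 : BACKOFF_MAX_MS;
    if (d > BACKOFF_MAX_MS)
        d = BACKOFF_MAX_MS;
    return d + jitter_rng(seed) % (d / 2u + 1u);
}
```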
Diagram · Recovery state machine (bounded escalation, no infinite loops)
The state machine shows the normal flow INIT → IDLE → TXN → IDLE. Failures go to ERROR, then RECOVER (soft→hard, bounded by K tries, with txn_id/fencing invariants), escalating to DEGRADED (probe-only) if recovery does not pass probes.

Key properties: bounded recovery loops, explicit fencing invariants, and separation of retry-safe operations from protected writes.

H2-10 · Hardware Assist: Watchdogs, Reset Lines, Bus Guardians

Intent

Hardware assists turn recovery into a controlled soft+hard collaboration: layered watchdogs detect no-progress and enforce bounded resets; reset lines and power segmentation isolate faults before system-wide reset.

Watchdog layering (who monitors what)

  • Task watchdog: health-task heartbeat; catches soft deadlocks early.
  • Comm watchdog: forward progress (last_progress_ts) and error density; triggers peripheral re-init or bus clear.
  • System watchdog: last resort; only after fail-safe conditions and audit logging.
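
The comm watchdog's core decision reduces to a forward-progress timestamp check. A minimal sketch, assuming a monotonic millisecond tick; `NO_PROGRESS_TIMEOUT_MS` is a placeholder threshold.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_PROGRESS_TIMEOUT_MS 250u  /* placeholder no-progress threshold */

typedef struct {
    uint32_t last_progress_ts;  /* updated on every completed transaction */
} comm_wdt_t;

/* True when the comm watchdog should fire (no forward progress).
 * Unsigned subtraction keeps the check correct across tick wraparound. */
static bool comm_wdt_expired(const comm_wdt_t *w, uint32_t now_ms)
{
    return (uint32_t)(now_ms - w->last_progress_ts) > NO_PROGRESS_TIMEOUT_MS;
}
```

On expiry this watchdog triggers peripheral re-init or bus clear; only repeated, logged failures should propagate toward the system watchdog.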

Reset & power resources (prefer segment isolation)

  • Target reset pin: isolate a single device without disturbing the bus segment.
  • Segment load switch: power-cycle only the affected bus segment; keeps unrelated domains alive.
  • Reset supervisor: enforces reset pulse width and ordering; exports reset_cause for audit.

Bus guardian concept + minimum anti-ghost-powering rule

  • Bus guardian: latches a fault and isolates/disconnects the dragging node or segment when no-progress persists.
  • Firmware contract: guardian trip forces DEGRADED mode and probe-only traffic until stability returns.
  • Anti-ghost-powering (minimum): disable/Hi-Z bus drivers before removing segment power; re-enable I/O only after rail is stable (ordering placeholders).
Diagram · Soft + hard recovery stack (watchdog → supervisor → load switch → bus segment)
Firmware feeds heartbeat and progress signals into the layered watchdogs (task WDT, comm WDT); the watchdogs drive the reset supervisor and the segment load switch; the bus guardian isolates/latches the bus segment on persistent faults; audit signals capture reset cause, count, and interval.

Principle: isolate and recover at the smallest scope first (device → segment → system), with watchdog decisions always logged and rate-limited.

H2-11 · Engineering Checklist (Design → Bring-up → Production)

Intent

Convert recovery ideas into an auditable, executable checklist and acceptance criteria across design, bring-up, and production.

Design gate (define budgets + rules before code)

  • Timeout budgeting: placeholders fixed for T_byte, T_frame, T_txn, T_idle; retry/backoff bounded by overall TTR.
  • Recovery ladder: retry → resync → reinit → device reset → bus-clear/dummy clocks → segment power-cycle/system reset; escalation gates defined (K tries, error density window W).
  • Idempotency rules: retry-safe ops listed; non-idempotent writes require “probe + readback/confirm” gates (no blind retry after timeout).
  • Minimum event log schema: bus_id, device, op_type, duration, error_code, action_level, txn_id, attempt_id.
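
The minimum event log schema above can be frozen as a fixed-width record so the ring buffer and field exporter agree on layout. Field widths here are assumptions; size them for the platform.

```c
#include <stdint.h>

/* Minimal recovery event record matching the schema above.
 * Ordered widest-alignment-friendly so the struct packs to 16 bytes
 * on common ABIs (field widths are assumptions). */
typedef struct {
    uint8_t  bus_id;
    uint8_t  device;       /* address or CS index */
    uint8_t  op_type;      /* read / write / probe */
    uint8_t  action_level; /* retry / resync / reset / power-cycle */
    uint16_t error_code;   /* layered driver/device/system code */
    uint16_t attempt_id;
    uint32_t txn_id;
    uint32_t duration_us;
} recovery_event_t;
```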

Bring-up gate (fault injection validates the ladder)

Inject & observe (minimum set)

  • I²C: force SDA low / SCL low; stretch timeout; arbitration-loss → no-progress.
  • SPI: CS stuck asserted; persistent header/CRC mismatch (desync); DMA never completes; MISO contention symptom.
  • UART: FE/PE burst; FIFO overrun; idle-detect missing; optional autobaud mismatch.

Bring-up pass checks (close the loop)

  • Detect: no-progress triggers within T_idle and classification is stable.
  • Recover: ladder steps execute in order; escalation only after gates (K, density in W).
  • Verify: post-recovery probe passes (N_ok times) and key configuration remains consistent (readback/compare).

Production gate (BIST + counters + field log collection)

  • BIST / loopback: acceptance routines prove the RX/TX path and basic framing without relying on a “perfect” environment.
  • Counter policy: define reset/rollover rules for error counters; preserve key “lifetime” counters for root-cause trend.
  • Reset rate limiting: cap device reset / segment power-cycle attempts; enter DEGRADED mode if exceeded.
  • Field log readiness: minimal schema is always available and exportable (bus/device/op/error/action/TTR).

Pass criteria (placeholders)

  • TTR_p95 < X ms and recovery_success_rate > Y%.
  • false_reset_rate < Z (per hour/day; define window W).
  • After N injections, recovery remains repeatable and configuration consistency is preserved (readback/compare OK).
Diagram · 3-gate flow (Design gate → Bring-up gate → Production gate)
Three gates mark checkpoints for design (timeouts, ladder + gates, idempotency, log schema), bring-up fault injection (inject, observe, probe pass, consistency), and production validation (BIST, counters, rate limits, field logs, TTR/safety), ending in release readiness.

H2-12 · Applications & IC Selection Notes (placed before FAQ)

Intent

Map recovery requirements to application buckets and selection checkpoints without becoming a product page. Example material numbers are provided for design reference—always verify package, suffix, and availability.

Application buckets (recovery-driven requirements)

Industrial chassis / long harness

  • Common failures: no-progress, stuck session, burst errors.
  • Recovery focus: bounded escalation + segment isolation + probe-only degraded mode.
  • Verification: repeated fault injection across temperature and cable changes; audit TTR and false resets.

High-noise environment

  • Common failures: CRC/header mismatch (SPI), FE/PE bursts (UART), NAK storms (I²C).
  • Recovery focus: resync + backoff+jitter; avoid retry storms that keep the bus unstable.
  • Verification: error density windows and rate limiting demonstrate stability recovery.

Hot-plug boundary

  • Common failures: session boundary corruption, ghost-power symptoms, partial power domains.
  • Recovery focus: fencing + controlled reset/power sequencing; probe before enabling writes.
  • Verification: repeated plug cycles; ensure no duplicate write and no false system reset.

Low-power wake / intermittent duty

  • Common failures: first-frame failure after wake, idle-detect ambiguity, timeout mismatches.
  • Recovery focus: idle windows, safe resync, and predictable watchdog behavior.
  • Verification: wake loops with telemetry; ensure TTR stays within budget without power-cycling storms.

Mass production & field service

  • Common failures: rare corner cases that require attribution, not guesswork.
  • Recovery focus: consistent error codes + event logs + bounded reset counters.
  • Verification: pass/fail gates and field log exportability are mandatory.

Selection notes + example material numbers (verify package/suffix/availability)

Bus helpers (timeouts, buffering, isolation, stuck-bus handling)

  • I²C hot-swap / stuck-bus assist: TCA4307 (TI), PCA9511A (NXP).
  • I²C mux / hub (channel isolation): TCA9548A (TI), PCA9548A (NXP).
  • I²C buffer / rise-time accelerator: TCA4311A (TI), PCA9515A (NXP).
  • I²C isolator: ISO1540/ISO1541 (TI), ADuM1250/ADuM1251 (Analog Devices).
  • isoSPI / long-chain SPI-style link: LTC6820 (Analog Devices).

Watchdogs, supervisors, reset lines (hard recovery stack)

  • Reset supervisor: TPS3808 (TI), ADM809/ADM810 (Analog Devices), MCP130 (Microchip).
  • External watchdog timer: TPS3430 (TI), MAX6369 (Analog Devices/Maxim legacy), MCP1316 (Microchip).
  • Reset/power sequencing aid (platform-dependent): combine supervisor + load switch to isolate segments (see below).

Segment power & isolation primitives (keep resets local)

  • Load switch (segment power-cycle): TPS22910A, TPS22965 (TI), FPF2123 (onsemi/Fairchild legacy family).
  • High-side switch (higher current, platform-dependent): TPS1H100 (TI) example class; verify current/diagnostics.
  • Digital isolator for SPI/UART-class signals: ISO7741 (TI), ADuM140x family (Analog Devices) as example classes; verify channel count/direction.

Bridges & service ports (field observability and controlled access)

  • USB ↔ UART bridge: CP2102N (Silicon Labs), FT232R (FTDI), CH340 (WCH) as common classes; verify driver strategy.
  • RS-485 transceiver (UART layering): SN65HVD72 (TI), MAX3485 (Analog Devices/Maxim) as common classes.
  • I²C expander for recovery GPIOs: PCA9555 (NXP), TCA9555 (TI) as common classes.

Capability matrix (presented as card list, not a table)

  • Controller block: byte/txn/idle timeouts + error flags + abort/flush path + “no-progress” telemetry.
  • Target device: reset pin and safe-state behavior + read-only probe register + write-readback capability.
  • Bridge/expander: stuck-bus timeout and counters + channel isolation/disable + predictable reset behavior.
Diagram · Selection flow (requirements → hardware hooks → firmware structure → verification plan) + capability stacks
The left side shows the four-step selection flow (requirements: TTR/success/safety → hardware hooks: reset/WDT/segment → firmware structure: FSM/fencing/backoff → verification plan: fault injection/probe); the right side maps it to three capability stacks presented as card lists (controller: timeouts/flags/flush; target device: reset/safe-state/probe; bridge/expander: counters/timeout/isolate).

Example ICs above are reference points. Selection should start from recovery requirements and verification plan, then back-propagate to hardware hooks and firmware structure.


H2-13 · FAQs (Error Handling & Recovery)

Intent

Close long-tail troubleshooting without expanding the main body. Each answer uses a fixed 4-line structure with measurable pass criteria placeholders.

I²C bus stuck low after a brown-out — first check line vs state-machine?

Likely cause: SDA/SCL is physically held low by a slave or clamp, or the controller/slave FSM is wedged mid-transaction.

Quick check: Sample SDA/SCL as GPIOs (inputs) and compare to controller status (BUSY/START seen/STOP seen). If lines are low even with controller disabled → line-held; if lines float high but BUSY persists → FSM wedge.

Fix: Soft: disable I²C block → GPIO bus-clear (SCL pulses) → generate STOP if possible → re-init controller; then run a read-only health probe before any writes.

Pass criteria: SDA/SCL high within T_idle; health probe passes N_ok times; no duplicate write events after recovery (0 occurrences in window W).
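
The soft bus-clear step can be sketched as the standard "up to nine SCL pulses until SDA releases, then STOP" sequence. The GPIO hooks and the demo slave model are hypothetical; bind them to the real pins after switching the controller's SDA/SCL to GPIO mode.

```c
#include <stdbool.h>

/* Hypothetical GPIO hooks for the bus-clear sequence. */
typedef struct {
    bool (*sda_read)(void);
    void (*scl_pulse)(void);   /* one low->high SCL cycle, <= 100 kHz */
    void (*send_stop)(void);   /* SDA low->high while SCL is high */
} i2c_gpio_t;

/* Bus-clear: up to 9 clock pulses until SDA releases, then a STOP.
 * Returns true when SDA reads high (bus released); false means the
 * line is physically held -- escalate to reset / segment power-cycle. */
static bool i2c_bus_clear(const i2c_gpio_t *io)
{
    for (int i = 0; i < 9; i++) {
        if (io->sda_read())
            break;
        io->scl_pulse();
    }
    if (!io->sda_read())
        return false;
    io->send_stop();
    return true;
}

/* Demo model (hypothetical): a stuck slave releasing SDA after 3 pulses. */
static int  demo_pulses;
static bool demo_sda(void)   { return demo_pulses >= 3; }
static void demo_pulse(void) { demo_pulses++; }
static void demo_stop(void)  { }
```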

Clock stretching happens occasionally — where to set master timeout without breaking slow devices?

Likely cause: A legitimate slow slave stretches occasionally, or a fault makes SCL “never release” (hung stretch).

Quick check: Log stretch duration distribution (max/p95) per device address; compare to “normal” device behavior. If duration is unbounded or always hits the cap → treat as hung stretch.

Fix: Use a two-tier policy: T_byte cap for individual steps + T_txn cap for whole transaction; allow longer caps only for specific known-slow addresses, and escalate to bus-clear after K consecutive cap hits.

Pass criteria: No transaction exceeds T_txn; stretch p95 stays below T_stretch_p95; false timeouts on known-slow devices < X per W.

SPI CRC bursts after one glitch — how to resync without resetting the whole system?

Likely cause: A single edge glitch caused bit-slip so host/device frame boundaries no longer align (desync), even though the bus still toggles.

Quick check: After forcing CS deassert, send a short “sync header probe” (read-only pattern). If header stays wrong across attempts → boundary is shifted.

Fix: Apply a session fence: CS ↑ guard time → CS ↓ → transmit N_dummy_bits of dummy clocks with safe MOSI fill → read-only header probe → flush driver/DMA queue if descriptors are polluted; reset only the target device if probe never locks.

Pass criteria: Header probe locks within M attempts; CRC error rate < R_crc in window W; TTR_p95 < X ms without system reset.
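
The session fence and bounded header-probe gate can be sketched as below. The driver hooks, `SYNC_HEADER` value, and demo device model are hypothetical assumptions; the probe attempt cap stands in for the M placeholder.

```c
#include <stdbool.h>
#include <stdint.h>

#define SYNC_HEADER     0xA5u  /* hypothetical known read-only header */
#define MAX_PROBE_TRIES 4u     /* placeholder for M attempts */

/* Hypothetical driver hooks. */
typedef struct {
    void    (*cs_set)(bool asserted);
    void    (*clock_dummy)(unsigned bits);  /* safe MOSI fill, e.g. 0xFF */
    uint8_t (*read_header)(void);           /* read-only header probe */
} spi_fence_t;

/* Session fence: CS guard, one frame of dummy clocks, then a bounded
 * header-probe gate. Returns true once the header locks; false means
 * escalate to a target-device reset only (never a system reset). */
static bool spi_session_fence(const spi_fence_t *io, unsigned dummy_bits)
{
    for (unsigned i = 0; i < MAX_PROBE_TRIES; i++) {
        io->cs_set(false);      /* guard time assumed inside the driver */
        io->cs_set(true);
        io->clock_dummy(dummy_bits);
        if (io->read_header() == SYNC_HEADER)
            return true;        /* boundary re-aligned */
    }
    return false;
}

/* Demo model (hypothetical): realigns on the second fence attempt. */
static unsigned demo_attempts;
static void    demo_cs(bool a)        { (void)a; }
static void    demo_dummy(unsigned b) { (void)b; demo_attempts++; }
static uint8_t demo_header(void)      { return demo_attempts >= 2 ? SYNC_HEADER : 0x00; }
```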

CS toggling doesn’t recover — what’s the quickest “dummy clocks” sanity check?

Likely cause: CS toggling ends the session, but the target remains inside a partial command/stream state and requires boundary re-alignment.

Quick check: Use a minimal sequence: CS ↑ (guard) → CS ↓ → dummy clocks for exactly one frame-length (or one byte multiple) → issue a read-only “known header” read. If the header moves closer to expected values with different dummy lengths → desync confirmed.

Fix: Standardize N_dummy_bits (safe MOSI fill) and require a header probe gate before resuming traffic; if multiple slaves share MISO, add a “Hi-Z check” phase (verify no contention) before reasserting CS.

Pass criteria: Header probe passes N_ok consecutive reads; no MISO contention events in W; no-progress counter remains 0 during steady state.

UART shows framing errors only in bursts — noise coupling or baud drift?

Likely cause: Burst-coupled noise corrupts edges (FE/PE spikes), or baud mismatch accumulates until sampling slips (drift).

Quick check: Compare FE/PE burst timing to system events (motors, relays, radio TX) and to temperature/clock changes. Noise bursts correlate to events; drift correlates to sustained operating conditions and persists until re-sync/autobaud.

Fix: For bursts: flush RX FIFO + clear flags → wait for an idle window T_idle_uart → accept first valid frame gate. For drift: re-acquire baud (or switch to a calibrated divisor) and re-validate with an idle window gate.

Pass criteria: After recovery, FE/PE rate < R_fe over window W; first-valid-frame locks within TTR; no FIFO overruns in N frames.

Recovery works on bench but fails in chassis — what telemetry field is usually missing?

Likely cause: The failure is not just “an error” but “no forward progress” under real coupling; missing telemetry hides the escalation triggers, so the wrong action is chosen.

Quick check: Confirm logs include last_progress_ts, error density (errors per W), burst length, action level, and reset cause. If only raw error codes exist, the ladder cannot be audited.

Fix: Add “progress” counters (transaction complete increments) + ring-buffer events (bus/device/op/duration/error/action/attempt) and classify warning/degraded/fail-safe. Use these to drive escalation deterministically.

Pass criteria: ≥ P% of field failures have an attributable root bucket; escalation steps are reproducible; TTR_p95 and false reset rate are measurable from logs.

Retries make it worse — how to detect retry storms and apply backoff?

Likely cause: Fast retries amplify error density and prevent the bus/device from returning to a stable state (self-sustaining storm).

Quick check: Track retries per second and consecutive failures per device. If retry rate rises while progress stays flat (no completions), a storm is present.

Fix: Enforce exponential backoff with jitter and a hard cap; rate-limit per device and globally; if density exceeds threshold, enter DEGRADED (probe-only) until stability returns.

Pass criteria: Retry rate bounded below R_retry; progress resumes within T_recover; error density falls below D_ok in window W.
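
The storm signature ("retry rate rises while progress stays flat") is easy to encode as a per-window check. A minimal sketch; the threshold is a placeholder, and the window counters are assumed to be reset by the telemetry tick.

```c
#include <stdbool.h>
#include <stdint.h>

#define STORM_RETRY_THRESHOLD 10u  /* placeholder: retries per window */

typedef struct {
    uint32_t retries_in_window;      /* incremented per retry */
    uint32_t completions_in_window;  /* incremented per completed txn */
} storm_stats_t;

/* A retry storm = high retry rate with zero forward progress. */
static bool retry_storm_detected(const storm_stats_t *s)
{
    return s->retries_in_window >= STORM_RETRY_THRESHOLD &&
           s->completions_in_window == 0;
}
```

On detection, the backoff cap and DEGRADED (probe-only) mode from the fix above take over.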

After recovery, device config is wrong — how to design idempotent writes?

Likely cause: A timeout occurred during a non-idempotent write (increment/append/trigger), and an automatic retry re-applied a side effect or wrote partial state.

Quick check: Identify which writes are non-idempotent and whether a timeout can happen mid-write (page write, multi-register sequence). Check if the code retries these without a “confirm/readback” gate.

Fix: Make writes idempotent via version/sequence tokens or “write-then-readback compare”; after timeout, block blind retry and switch to read-only probe + reconcile state before re-applying changes.

Pass criteria: Duplicate side effects = 0 in window W; readback/compare passes after recovery; config hash matches expected within T_reconcile.

Watchdog resets too often — how to separate comm watchdog vs system watchdog?

Likely cause: A single watchdog is covering both “temporary comm faults” and “system deadlock,” causing over-resets and masking root cause.

Quick check: Log reset cause and the last comm progress timestamp. If resets happen while other tasks are healthy and only comm stalls, the system watchdog is triggered too aggressively.

Fix: Split into task watchdog (task heartbeat), comm watchdog (no-progress/escalation), and system watchdog (fail-safe only). Require comm ladder attempts before system reset, with rate limits.

Pass criteria: System resets per day < Z; comm watchdog resolves stalls in < TTR for > Y% cases; reset causes are attributable in ≥ P% logs.

Need fast TTR but no false resets — what escalation thresholds are most effective?

Likely cause: Escalation gates are based on raw error count rather than progress and error density, so transient bursts trigger hard resets.

Quick check: Evaluate triggers: do they use “no-progress for T_idle” and “density in W” or just “N errors”? Raw-N-only is prone to false resets.

Fix: Use layered gates: (1) retry/backoff until K1 failures, (2) resync/reinit if no-progress persists, (3) device reset only if probe fails, (4) segment power-cycle only after K2 repeats with rate limit.

Pass criteria: TTR_p95 < X ms while false_reset_rate < Z; escalation level distribution stable across runs; hard resets occur only after gate conditions are met.
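
The layered-gate selection in the fix above can be sketched as a pure decision function. `K1_RETRY_GATE` and `K2_RESET_GATE` are the K1/K2 placeholders; the inputs are assumed to come from the telemetry counters and a read-only probe.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    ACT_RETRY, ACT_RESYNC, ACT_DEVICE_RESET, ACT_SEGMENT_CYCLE
} action_t;

#define K1_RETRY_GATE 3u  /* placeholder gate K1 */
#define K2_RESET_GATE 2u  /* placeholder gate K2 (rate limit) */

/* Layered escalation: retry until K1, then resync while the probe still
 * passes, device reset only once the probe fails, segment power-cycle
 * only after K2 device resets. */
static action_t next_action(uint32_t consecutive_failures, bool probe_ok,
                            uint32_t device_resets)
{
    if (consecutive_failures < K1_RETRY_GATE)
        return ACT_RETRY;           /* gate 1: bounded retry/backoff */
    if (probe_ok)
        return ACT_RESYNC;          /* gate 2: no-progress but probe OK */
    if (device_resets < K2_RESET_GATE)
        return ACT_DEVICE_RESET;    /* gate 3: probe failed */
    return ACT_SEGMENT_CYCLE;       /* gate 4: rate-limited last resort */
}
```

Because the function is pure, the escalation-level distribution across runs can be replayed from logs, which is what makes the pass criteria auditable.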

One bad device wedges the bus — how to quarantine without losing the whole segment?

Likely cause: A single target holds a shared resource (line, MISO, or power domain), causing repeated no-progress and forcing global recovery.

Quick check: Identify the “dominant offender” by per-device no-progress/timeout counters and which address/CS correlates with stalls; confirm that the rest of the segment is healthy via probe-only traffic.

Fix: Quarantine by disabling that channel/device (mux/expander channel disable, CS inhibit) or power-cycling only that sub-branch; keep the system in DEGRADED mode with read-only probes for the remaining devices.

Pass criteria: Non-offender devices maintain progress (no-progress=0) for window W; quarantined branch stays isolated; system-wide resets drop below Z per day.

Production wants a pass/fail — what’s a minimal fault-injection test set?

Likely cause: Production tests validate normal traffic but do not exercise recovery paths, so latent “hung” corner cases escape.

Quick check: Confirm whether the test includes at least one injected no-progress condition and one desync/burst condition, and whether it records TTR + action levels.

Fix: Minimal set: (1) I²C SDA low (bus-clear path), (2) SPI forced desync (dummy clocks + header probe), (3) UART FE burst (flush + idle window), plus reset-rate limit verification; require logs for each step.

Pass criteria: All injections recover within TTR; success rate > Y%; false reset rate < Z; post-recovery probes pass N_ok times.