Firmware Robustness: Timeouts, Retries, CRC & Recovery
Firmware robustness means making serial-bus failures predictable, measurable, and self-healing.
This page delivers practical rules for timeouts, retries, CRC/framing, power-fail-safe writes, and recovery state machines—each with clear pass/fail criteria for bring-up and production.
Definition & Scope Guard (Firmware Robustness for Serial Buses)
Robustness is not “zero errors.” It is predictable failures, deterministic self-healing, and measurable production acceptance.
- Predictable failures: consistent counters + consistent denominators + consistent time windows.
- Deterministic recovery: bounded steps, bounded time, and explicit escalation rules (no infinite loops).
- Production acceptance: pass/fail thresholds defined per metric (e.g., errors per 1k, p99 latency).
What this page provides:
- Strategy: staged timeouts, retry ladder with backoff, integrity framing.
- Templates: counters/log fields, recovery state machine skeleton, atomic-write skeleton.
- Checklist: bring-up → production verification steps with fault injection.
- Pass criteria: measurable thresholds and acceptance gates (placeholders for X/Y).
- In-scope: timeout/retry/CRC policies, power-fail write safety, recovery state machines, observability, and acceptance definitions.
- Out-of-scope: electrical/timing/SI/layout tutorials for I²C/SPI/UART (handled in protocol subpages). Only minimal firmware-relevant conclusions appear here.
Failure Model & Symptom Taxonomy (What actually goes wrong)
Robustness begins with classification: the same symptom can come from different root causes. Each bucket below must be tied to observable evidence (counters, timing phase, correlations), not assumptions.
Each symptom must map to: root-cause bucket → first check → action type. This prevents random timeout/retry tuning.
Symptom: scattered NAK/address errors on specific endpoints.
- First check: split by address + op + time window; compare NAK per 1k across endpoints.
- Action type: retry with backoff; escalate to reinit if repeated at the same phase.
Symptom: transfers stall with no forward progress.
- First check: phase tagging in logs (queue/transfer/peer); verify the stall repeats at one phase.
- Action type: reinit bus controller → bus-clear / peer reset → reset domain (bounded ladder).
Symptom: CRC/framing failures that track load.
- First check: correlate CRC fails with queue depth, under-run counters, and CPU load spikes.
- Action type: throttle bursts / add backpressure; only then tune retries.
Symptom: latency spikes and retry storms.
- First check: p95/p99 latency during error windows + retry_count; confirm whether latency spikes follow retry bursts.
- Action type: add backoff + global rate limiting + circuit-breaker isolation.
Symptom: configuration corruption after power events.
- First check: integrity markers (version/CRC/commit flag) and last-write timestamps on boot.
- Action type: fallback slot + atomic commit; fail-safe mode if integrity fails.
Recoverable pattern (transient faults):
- Errors are scattered and succeed after bounded retries.
- Failure phase varies; no single “stuck point” dominates.
- Integrity markers remain valid; state can return to a known baseline.
Escalation pattern (structural faults):
- Failures repeat at the same phase with no forward progress (stall pattern).
- Integrity checks fail (version/CRC/commit flag mismatch), indicating corrupted state.
- Recovery exceeds a bounded time budget or repeats beyond a bounded step count.
The result is a consistent taxonomy that routes any symptom into the sections that follow: timeout design, retry/backoff, CRC/framing, and recovery-state-machine escalation.
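The routing above can be kept as a data table instead of scattered if-else, which makes the policy auditable. A minimal C sketch; the bucket and action names are illustrative, not from a specific codebase:

```c
#include <assert.h>
#include <string.h>

/* Root-cause buckets from the taxonomy above. */
typedef enum {
    BUCKET_ADDR_NAK, BUCKET_PHASE_STALL, BUCKET_LOAD_CRC,
    BUCKET_RETRY_STORM, BUCKET_CONFIG_CORRUPT, BUCKET_COUNT
} fault_bucket_t;

/* One row per bucket: the action type the taxonomy routes to. */
static const char *const bucket_action[BUCKET_COUNT] = {
    [BUCKET_ADDR_NAK]       = "retry_backoff",
    [BUCKET_PHASE_STALL]    = "reinit_ladder",
    [BUCKET_LOAD_CRC]       = "throttle_backpressure",
    [BUCKET_RETRY_STORM]    = "rate_limit_breaker",
    [BUCKET_CONFIG_CORRUPT] = "fallback_failsafe",
};
```

Keeping the mapping as data also lets the evidence pack export it, so logs and policy can be diffed per firmware build.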
Observability & Metrics (Counters, logs, and pass/fail definitions)
Robustness requires consistent measurement. Each counter must define: trigger point, denominator, and grouping dimensions. Without this contract, dashboards disagree and tuning becomes random.
Core counters:
- timeout_count (prefer phase tags: queue/transfer/peer)
- retry_count (include attempt index and escalation level)
- crc_fail, framing_err
- bus_reset (and optional: reinit/bus-clear/peer-reset)
- recovery_success, recovery_fail
Derived metrics:
- Error rate (/1k): (errors ÷ transactions) × 1000
- MTBF: mean time between severe events (recovery_fail / fail-safe)
- Latency p95/p99: sampled at a fixed start/end point (define once)
- Recovery duration: detect → back-to-normal (not just reset issued)
- Recovery effectiveness: success ÷ (success + fail)
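The two ratio metrics above reduce to small pure functions; making the zero-denominator rule explicit is what keeps dashboards from disagreeing on empty windows. A minimal C sketch (function names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Error rate per 1k transactions: (errors ÷ transactions) × 1000.
 * Empty windows report 0 rather than faulting on a zero denominator. */
static double error_rate_per_1k(uint32_t errors, uint32_t transactions) {
    if (transactions == 0)
        return 0.0;
    return (double)errors * 1000.0 / (double)transactions;
}

/* Recovery effectiveness: success ÷ (success + fail); 1.0 when untested. */
static double recovery_effectiveness(uint32_t success, uint32_t fail) {
    uint32_t total = success + fail;
    return (total == 0) ? 1.0 : (double)success / (double)total;
}
```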
Logging policy:
- Event-driven: log only state transitions, timeouts, escalations, integrity failures, and fail-safe entry.
- Rate-limited: keep one representative record per window and aggregate the rest into counters.
- Minimal context: addr/op/len/phase/state/power_state + attempt_index + escalation_level.
Acceptance thresholds (placeholders):
- timeout_rate < X per 1k transactions
- crc_fail_rate < X per 1k transactions
- recovery_fail = 0 (or < X per hour, if allowed)
- p99_latency < Y ms (steady-state)
Measurement window:
- Over N transactions or T minutes (define once)
- Include worst-case modes: peak throughput, cold/hot starts, power transitions (if applicable)
Bucketing:
- Per port/bus, per peer/address, per operation type (read/write), and per power state
- Report includes firmware version/build id and configuration profile
Timeout Design (Budgeting time, not guessing)
A single “giant timeout” hides where time is spent and makes recovery unpredictable. Use layered timeouts to keep progress measurable and escalation deterministic.
Segment budgets produce two outcomes: (1) phase-level diagnosis and (2) phase-specific actions. Fill X/Y/Z and keep the structure.
- T_queue_max = X ms
- T_transfer_max = Y ms
- T_peer_max = Z ms
- T_retry_window = policy-based (backoff + jitter)
- T_deadline = X + Y + Z + margin (tail latency margin is explicit)
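The budget can live in one struct so the deadline is always derived from the segments, never a second hand-tuned constant. A hedged C sketch; field names and the example values stand in for X/Y/Z:

```c
#include <assert.h>
#include <stdint.h>

/* Layered timeout budget (fill from measurement, not guesswork). */
typedef struct {
    uint32_t t_queue_max_ms;     /* X: wait in the submit queue   */
    uint32_t t_transfer_max_ms;  /* Y: bytes on the wire          */
    uint32_t t_peer_max_ms;      /* Z: peer processing window     */
    uint32_t t_margin_ms;        /* explicit tail-latency margin  */
} timeout_budget_t;

/* T_deadline = X + Y + Z + margin; never extended implicitly. */
static uint32_t deadline_ms(const timeout_budget_t *b) {
    return b->t_queue_max_ms + b->t_transfer_max_ms +
           b->t_peer_max_ms + b->t_margin_ms;
}
```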
Soft timeout (slow but progressing):
- Use when progress exists but the phase is slow (peer busy, batch DMA).
- Action: log + throttle + extend within the total deadline (no infinite extension).
- Escalate only if repeated slowdowns violate p99 latency targets.
Hard timeout (no forward progress):
- Use when no forward progress is observed (stall pattern, repeated phase failure).
- Action: abort → retry ladder (retry → reinit → bus-clear → reset domain → isolate).
- Stop conditions: bounded step count and bounded recovery time budget.
- Queue timeout: reduce contention, apply global rate limiting, verify scheduling and backpressure.
- Transfer timeout: abort safely, retry with backoff, then reinit controller if repeated.
- Peer timeout: classify busy vs stall; soft timeout for busy, hard timeout for stall; escalate to peer reset only when bounded.
- Deadline violation: enter fail-safe/isolation path and record a production-grade incident entry.
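The per-phase map above is mechanical enough to encode directly. A sketch of phase-tagged dispatch; the enum names are placeholders to wire to your own phase tags:

```c
#include <assert.h>

typedef enum { PH_QUEUE, PH_TRANSFER, PH_PEER, PH_DEADLINE } timeout_phase_t;
typedef enum {
    ACT_RATE_LIMIT,      /* queue: reduce contention, backpressure  */
    ACT_RETRY_BACKOFF,   /* transfer: abort safely, retry, reinit   */
    ACT_CLASSIFY_BUSY,   /* peer: split busy (soft) vs stall (hard) */
    ACT_FAILSAFE         /* deadline: isolate + incident record     */
} timeout_action_t;

/* First action per timeout phase, per the escalation map above. */
static timeout_action_t on_timeout(timeout_phase_t ph) {
    switch (ph) {
    case PH_QUEUE:    return ACT_RATE_LIMIT;
    case PH_TRANSFER: return ACT_RETRY_BACKOFF;
    case PH_PEER:     return ACT_CLASSIFY_BUSY;
    default:          return ACT_FAILSAFE;
    }
}
```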
Retry Strategy & Backoff (Make retries help, not harm)
Good retries reduce transient failures without creating congestion collapse. The policy must be bounded, phase-aware, and escalating. Use backoff and jitter to prevent synchronized retry storms and keep recovery deterministic.
Not all commands are safe to retry. Classify operations by side effect risk and apply explicit guards (sequence, transaction id, readback verify, and commit rules).
Safe to retry (idempotent):
- Read / query operations
- Status polling with bounded rate
- Requests with sequence number and stateless response
Retry only with verification guards:
- Register writes with readback verify
- Writes with expected-version / compare-before-write
- Updates that are safe only with transaction id + commit flag
Never blind-retry:
- EEPROM / flash / page write without atomic commit
- One-shot actions (erase, trigger, arm)
- Stateful operations without sequence or replay protection
Required guard fields:
- attempt_index + escalation_level for observability and policy enforcement
- seq / transaction_id to detect replay or duplicate processing
- readback verify to confirm the intended value, not just a completed transfer
- commit rule (two-phase where needed) to prevent partial writes from becoming permanent
Exit criteria from degrade mode:
- A single success unblocks a pipeline but does not prove health.
- N consecutive successes are required to exit degrade mode (prevents oscillation).
- The observation window must confirm error/1k and p99 latency remain within thresholds.
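A bounded exponential-backoff-with-jitter helper, sketched in C. The base, cap, and the cheap LCG jitter source are illustrative choices for the placeholders above, not a prescribed policy:

```c
#include <assert.h>
#include <stdint.h>

#define RETRY_MAX_ATTEMPTS 5   /* bounded: beyond this, escalate */
#define BACKOFF_BASE_MS    2
#define BACKOFF_CAP_MS     64

/* Delay grows as base << attempt up to the cap; jitter then spreads each
 * delay over [backoff/2, backoff] so peers desynchronize and retry storms
 * do not line up. The LCG is a deliberately cheap jitter source. */
static uint32_t retry_delay_ms(uint32_t attempt, uint32_t *jitter_seed) {
    uint32_t backoff = BACKOFF_BASE_MS << (attempt < 16 ? attempt : 16);
    if (backoff > BACKOFF_CAP_MS)
        backoff = BACKOFF_CAP_MS;
    *jitter_seed = *jitter_seed * 1103515245u + 12345u;   /* LCG step */
    return backoff / 2 + (*jitter_seed >> 16) % (backoff / 2 + 1);
}
```

The caller enforces `RETRY_MAX_ATTEMPTS` and the idempotency classification; this helper only shapes the spacing.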
Data Integrity: CRC, Framing, and Sequence Control
CRC detects corruption, but robust systems also need framing and sequence control. Use the smallest layer that can provide reliable detection and actionable recovery.
To handle I²C/SPI/UART uniformly, wrap the bus transfer with a higher-layer frame: header (type/len/seq/flags), payload, and CRC. Sequence control prevents replay and duplicate processing, not just corruption.
Frame fields:
- type: parsing safety and operation intent
- len: boundary control (prevents truncation/overrun and UART “glue” errors)
- seq: replay/duplicate detection and de-bounce for retry ladders
- flags: optional fields, capability bits, and downgrade mode
- CRC: integrity verification with clear “where to check” and “what to log”
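As a concrete sketch, here is a minimal header layout plus a CRC-16/CCITT-FALSE routine (poly 0x1021, init 0xFFFF), a common choice for this kind of framing; the field widths are illustrative, not mandated:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Wrapper-frame header per the field list above. */
typedef struct {
    uint8_t  type;   /* operation intent, parsing safety       */
    uint8_t  flags;  /* capability bits / downgrade mode       */
    uint16_t len;    /* payload boundary, checked before parse */
    uint16_t seq;    /* replay / duplicate detection           */
} frame_hdr_t;

/* CRC-16/CCITT-FALSE: poly 0x1021, init 0xFFFF, no reflection. */
static uint16_t crc16_ccitt(const uint8_t *data, size_t n) {
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < n; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}
```

The receiver checks `len` before touching the payload and the CRC before acting on it; that ordering is what turns corruption into a countable event rather than a parse fault.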
CRC failure response ladder:
- Single sparse CRC fail: immediate retry (bounded).
- Clustered CRC fails: backoff + jitter; consider throughput throttling.
- CRC fails correlated with load: verify buffer underrun/overrun counters and scheduling.
- Persistent CRC fails: escalate (reinit → bus-clear → reset domain → isolate).
Escalation thresholds:
- Consecutive CRC fails ≥ N → escalation +1
- CRC_fail_rate ≥ X / 1k → enable storm controls
- Recovery effectiveness drops below Y → quarantine peer
Capability downgrade (peers without CRC support):
- Use flags for optional CRC/seq fields and downgrade mode.
- Attempt CRC-enabled mode first; downgrade only after a capability mismatch is detected.
- In no-CRC mode, keep length/type strict and tighten timeouts and retry limits.
Power-Fail Safety (Write-protection, brown-out, and atomicity)
Power interruptions are not “rare corner cases” in production. Robust firmware treats configuration writes as transactions that remain recoverable across brown-out, dropout, and bounce (power cycling near thresholds).
The goal is not “never lose power,” but to ensure storage always contains at least one valid configuration and the newest valid one can be selected unambiguously at boot.
Two-phase commit sequence:
- Write new data into staging region.
- Verify CRC before writing commit flag.
- Commit is written only when data is complete and verifiable.
Integrity markers:
- CRC validates content integrity.
- Version/seq prevents replay and selects the newest valid slot.
- Boot selection uses commit + CRC + newest version.
Dual-slot rotation:
- Dual-slot (A/B) update rotation.
- Always keep one last-known-good slot.
- Rollback if newest slot fails validation.
Write guards:
- Inhibit writes when power state is not stable (brown-out or bouncing).
- Prefer compare-before-write to avoid unnecessary wear.
- Require readback verify for writes with side effects.
Write-retry policy:
- Retry only when the failure type is transient and side effects are guarded.
- Keep retry count lower than read operations and log attempt index + power state.
- After repeated failures, stop writes and enter a safe mode rather than grinding storage.
Boot recovery procedure:
- Scan slot A/B and validate commit + CRC + version.
- Select newest valid slot; rollback if newest is invalid.
- If commit exists but CRC fails, enter fail-safe and preserve diagnostics.
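The selection rule (commit + CRC + newest version) is worth writing as a pure function so it can be unit-tested against injected corruption. A sketch with an illustrative slot-metadata layout:

```c
#include <assert.h>
#include <stdint.h>

/* Slot metadata; layout is illustrative. The commit flag is written last,
 * only after the staged CRC verifies. */
typedef struct {
    uint32_t version;     /* monotonically increasing write sequence */
    uint16_t stored_crc;  /* CRC recorded during the staging phase   */
    uint8_t  commit;      /* set only after CRC verification          */
} slot_meta_t;

/* Valid only when the commit flag is set AND the stored CRC matches
 * the CRC recomputed over the slot contents at boot. */
static int slot_valid(const slot_meta_t *s, uint16_t computed_crc) {
    return s->commit == 1 && s->stored_crc == computed_crc;
}

/* Returns 0 (slot A), 1 (slot B), or -1 → fail-safe, preserve diagnostics. */
static int select_boot_slot(const slot_meta_t *a, uint16_t crc_a,
                            const slot_meta_t *b, uint16_t crc_b) {
    int va = slot_valid(a, crc_a), vb = slot_valid(b, crc_b);
    if (va && vb) return (a->version >= b->version) ? 0 : 1;
    if (va) return 0;
    if (vb) return 1;
    return -1;
}
```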
Fault-injection pass criteria:
- Power-cut injection: X events across write phases; the system must boot with a valid config every time.
- Config integrity: CRC mismatches must remain < Y per N updates.
- Recovery: rollback must complete within T ms and produce a stable “active slot” decision.
Recovery State Machines (Deterministic self-healing)
Replace ad-hoc if-else recovery with a deterministic state machine. Each transition must be observable, bounded, and driven by consistent triggers (timeouts, integrity failures, and “no-progress” detection).
- Re-entrant: repeated triggers do not corrupt state.
- Interruptible: power events and watchdog constraints remain respected.
- Observable: log state, phase, attempt index, escalation level.
- Finite: maximum steps and maximum recovery time budget.
- Escalating: local → link → system, based on thresholds and effectiveness.
L1 (local) actions:
- clear FIFO
- flush DMA
- restart transaction
L2 (link) actions:
- reinit driver
- bus-clear / bus reset
- re-scan devices
- re-sync configuration
L3 (system) actions:
- toggle peer reset GPIO
- reset domain
- enter safe-mode
- limit writes and throughput
- preserve diagnostics
Escalation triggers:
- no progress in a hard-timeout window
- consecutive failures ≥ N
- recovery effectiveness below Y
- integrity failures persist across reinit/bus-clear
Stop rules:
- max attempts per peer/address
- max recovery time budget
- circuit breaker cooldown + half-open probe
- enter fail-safe when thresholds are exceeded
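A bounded escalation step can be this small; the attempt limits below are placeholders for N, and a real machine would attach the concrete L1/L2 actions listed above to each state:

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    SM_NORMAL, SM_RECOVER_L1, SM_RECOVER_L2, SM_FAILSAFE
} recovery_state_t;

#define MAX_L1_ATTEMPTS 3   /* placeholder for N */
#define MAX_L2_ATTEMPTS 2

/* One bounded escalation step: attempts 1..3 run L1 (local) actions,
 * 4..5 run L2 (link) actions, beyond that the machine parks in FAILSAFE
 * (no infinite loops). A success resets the counter. */
static recovery_state_t escalate(uint32_t *attempts, int last_succeeded) {
    if (last_succeeded) { *attempts = 0; return SM_NORMAL; }
    (*attempts)++;
    if (*attempts <= MAX_L1_ATTEMPTS)
        return SM_RECOVER_L1;
    if (*attempts <= MAX_L1_ATTEMPTS + MAX_L2_ATTEMPTS)
        return SM_RECOVER_L2;
    return SM_FAILSAFE;
}
```

A wall-clock recovery budget would sit alongside the attempt counter; whichever bound trips first forces the FAILSAFE transition.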
Bus-Specific Firmware Hooks (I²C vs SPI vs UART)
Electrical and timing details are intentionally out of scope. This section focuses on firmware levers that turn “random glitches” into measurable, bounded behaviors: detection signals, recovery actions, prevention policies, and logging fields that keep teams aligned.
Cross-bus stop rules:
- max attempts per peer / address
- max recovery time budget
- circuit breaker cooldown + half-open probe
- fail-safe for repeated integrity failures
Common log fields:
- bus_id, peer_id (addr / cs / port), op, len
- phase/state, attempt_index, escalation_level
- power_state, timestamp, latency bucket
- result_code (timeout / integrity / overflow)
I²C — detection signals:
- hung-bus: SCL-low vs SDA-low + duration window
- no-progress: phase not advancing / counters not changing
- arbitration loss: treat as recoverable event
- stretch seen: classify into peer-process budget
I²C — recovery actions:
- bus-clear: SCL pulses → STOP → controller reinit
- success rule: bus returns to idle + START is accepted
- controller reset if internal state is stuck
- escalate to quarantine if repeated stalls persist
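The bus-clear rung (up to nine SCL pulses, then STOP) can be written against injected pin operations so the policy is host-testable. The pin-op names below are hypothetical; on hardware they map to open-drain GPIO toggles in your HAL:

```c
#include <assert.h>

/* Injected pin operations (hypothetical names — adapt to your HAL). */
typedef struct {
    void (*scl_pulse)(void *ctx);  /* one SCL low→high cycle          */
    int  (*sda_high)(void *ctx);   /* 1 when SDA reads released        */
    void (*send_stop)(void *ctx);  /* SDA low→high while SCL held high */
    void *ctx;
} i2c_pins_t;

/* Up to 9 SCL pulses, stopping early once SDA releases, then a STOP.
 * Returns the pulses used on success, -1 if SDA never released
 * (next rung: controller reinit / peer reset). */
static int i2c_bus_clear(const i2c_pins_t *p) {
    for (int i = 0; i <= 9; i++) {
        if (p->sda_high(p->ctx)) {
            p->send_stop(p->ctx);
            return i;
        }
        if (i < 9)
            p->scl_pulse(p->ctx);
    }
    return -1;
}

/* Host-side fake for testing: SDA releases after three pulses. */
typedef struct { int pulses; } fake_bus_t;
static void fake_pulse(void *c) { ((fake_bus_t *)c)->pulses++; }
static int  fake_sda(void *c)   { return ((fake_bus_t *)c)->pulses >= 3; }
static void fake_stop(void *c)  { (void)c; }
```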
I²C — prevention policies:
- soft vs hard timeouts for stretching vs no-progress
- backoff + jitter after arbitration loss
- bounded retries; avoid infinite reinit loops
- write operations use stricter retry constraints
I²C — log fields:
- addr, op, phase, attempt, escalation_level
- stretch_seen, arbitration_lost, bus_clear_used
- hung_type (SCL/SDA), stall_duration
- result_code + recovery_time
SPI — detection signals:
- CS framing mismatch: expected_len vs actual_len
- mode mismatch: pattern test / known-header readback
- DMA underrun/overrun masquerading as data/CRC errors
- sync loss: header/type/seq inconsistent
SPI — recovery actions:
- flush FIFO/DMA → reinit controller → sync pattern
- success rule: N consecutive headers/seq valid
- escalate: bus-domain reset if no-progress persists
- quarantine noisy peer to protect the system
SPI — prevention policies:
- transaction framing: type/len/seq in the payload
- rate-limit retries to avoid DMA queue avalanches
- guard writes with readback verify + idempotency
- explicit resync entry point after reset
SPI — log fields:
- cs_id, mode_id, expected_len, actual_len
- dma_underrun, dma_overrun, fifo_level_peak
- sync_pattern_fail, header_fail, seq_gap
- recovery_action + recovery_time
UART — detection signals:
- framing vs parity classification (separate counters)
- break_seen as a state-machine event (wake/resync)
- overrun / watermark peaks (buffer-avalanche signals)
- flow_drop: RTS/CTS missing or mishandled
UART — recovery actions:
- flush RX → wait idle → resync header/preamble
- success rule: N frames with valid length/type/seq
- backoff on clustered errors; avoid tight loops
- fail-safe when integrity remains unstable
UART — prevention policies:
- deglitch/filters only with latency budget control
- watermark-based backpressure + drop policy
- flow-control health checks (periodic sanity)
- bounded retries to prevent buffer snowballing
UART — log fields:
- baud, frame_cfg, framing_err, parity_err
- overrun, rx_watermark_peak, flow_drop
- break_seen, resync_count, seq_gap
- recovery_action + recovery_time
Watchdog, Reset Domains, and “Don’t brick the system”
Watchdogs are not timers; they are progress proofs. Feeding should be tied to state-machine milestones, not to periodic loops that might keep running while the system is stuck.
Task-level watchdog:
- guards bus tasks and recovery loops
- fires on no-progress within the local budget
- triggers L1/L2 recovery escalation
System-level watchdog:
- guards scheduler deadlocks and global collapse
- fires when the recovery time budget is exceeded
- enters fail-safe or system reset as a last resort
Progress proofs (feed only on):
- phase advances, bytes/frames committed, queue drains
- recovery step counter increments (bounded)
- health probe passes N consecutive transactions
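Milestone-gated feeding can be as small as the sketch below. `hw_watchdog_kick()` is a hypothetical HAL call standing in for the MCU-specific feed; the milestone can be frames committed, a state-machine phase counter, or a queue-drain count:

```c
#include <assert.h>
#include <stdint.h>

/* Feed the watchdog only when a milestone moved, never from a bare loop
 * that might keep spinning while the system is stuck. */
typedef struct {
    uint32_t last_milestone;
    uint32_t kicks;            /* observability: feeds actually issued */
} progress_wd_t;

static int wd_feed_if_progress(progress_wd_t *wd, uint32_t milestone_now) {
    if (milestone_now == wd->last_milestone)
        return 0;              /* no forward progress: let the dog bite */
    wd->last_milestone = milestone_now;
    wd->kicks++;               /* hw_watchdog_kick() would go here */
    return 1;
}
```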
Reset acceptance criteria:
- after any reset action: N consecutive probes pass (N = X)
- p99 recovery time stays below T ms
- no reset storms: resets per window stay below Y
Post-reset resynchronization:
- re-enumerate devices and rebuild the live topology
- re-apply configuration using atomic-write rules
- flush stale queues, reset seq, and align state
Engineering Checklist (Bring-up → Production) + Engineering Pack
- Required counters: timeout_count, retry_count, crc_fail, framing_err, bus_reset, recovery_success, recovery_fail
- Derived metrics: error_rate (per X/1k transactions), MTBF, p95/p99 latency, recovery_time (Detect→Normal)
- Minimum log fields: bus_id, peer_id (addr/cs/port), op, len, phase/state, attempt_index, escalation_level, power_state, result_code
- Bucket rules: always slice by peer_id + op + window (no mixed denominators)
- Power events: brownout flag simulation, write-inhibit toggling, commit-interrupt cases
- Peer behavior: busy response, forced stalls, dropped ACK/frames, delayed service windows
- Integrity: CRC/framing error injection at protocol layer, seq gaps, length mismatch
- Topology: peer disappear/reappear, re-enumeration, quarantine + half-open probe
- State coverage: Normal→Detect, Detect→Recover_L1, Recover_L1→Recover_L2, Recover_L2→FailSafe, Quarantine→Half-open→Normal
- Evidence: edge_id hit_count, last_seen_time, avg_recovery_ms, p99_recovery_ms
- Stop rules: max attempts per peer, max recovery budget, circuit breaker cooldown
Production acceptance gates:
- Error rate: ≤ X / 1k transactions (per peer/op bucket)
- Latency: p99 ≤ Y ms (define window and load level)
- Recovery: avg ≤ A ms, p99 ≤ B ms
- Reset storm guard: resets/window ≤ C
Release verification:
- Known-pattern probe: header/type/len/seq checks before enabling full traffic
- Version consistency: FW/config/protocol capability alignment (no mixed versions)
- Rollback verification: dual-slot selection rule (commit + CRC + newest version)
- Write safety: write-inhibit must engage during power-unstable / fail-safe states
- Fail-safe frequency: must stay below D per hour (or per test run)
- Quarantine: bad peers isolated; the system remains operational with health probes
- Evidence pack: logs + counters + coverage report exported per unit
Fault-injection test card template:
- Name: ______________________________
- Setup: bus_id / peer_id / load profile / window length
- Injection: power event / busy stall / CRC cluster / disconnect
- Observation fields: counters + logs (phase/state, attempt, escalation, power_state)
- Pass criteria: error_rate ≤ X/1k, p99 latency ≤ Y ms, recovery p99 ≤ B ms, resets/window ≤ C
Examples only. Verify package, temperature grade, suffix, and availability for the exact BOM.
Watchdogs & supervisors:
- TI TPS3430 (external watchdog timer)
- TI TPS3890 / TPS3891 (voltage supervisor)
- Analog Devices / Maxim MAX809 / MAX810 (reset supervisor family)
- Microchip MCP1316 / MCP130 (reset supervisor family)
I²C muxes (segmentation & isolation):
- TI TCA9548A / TCA9546A (I²C mux for address conflicts & isolation)
- NXP PCA9548A (I²C mux)
Bus extenders:
- NXP PCA9615 (differential I²C-bus extender)
- Analog Devices LTC4331 / LTC4332 (I²C/SMBus and SPI bus extenders)
Isolators:
- TI ISO1540 / ISO1541 (I²C isolator)
- Analog Devices ADuM1250 / ADuM1251 (I²C isolator)
Bridges:
- NXP SC16IS740 (UART over I²C/SPI, FIFO)
- NXP SC16IS750 (UART over I²C/SPI, FIFO)
- NXP SC16IS752 (dual UART over I²C/SPI, FIFO)
- NXP SC18IS602B (I²C-to-SPI bridge)
- Silicon Labs CP2102N (USB-to-UART bridge)
- FTDI FT232R (USB-to-UART bridge)
Nonvolatile memories:
- Microchip 24LC256 (I²C EEPROM, write-control pin variants)
- ST M24C64 / M24C128 / M24C256 (I²C EEPROM family)
- Infineon FM24C256 (I²C FRAM family; fast writes, high endurance)
- Fujitsu MB85RC256V (I²C FRAM)
- Winbond W25Q64JV (SPI NOR flash family)
- Macronix MX25L series (SPI NOR flash family)
Applications & Selection Notes (when protocol/bridge features are required)
Long/noisy links (clustered errors, latency spikes):
- Feature set: CRC+seq framing, backoff+jitter, circuit breaker, quarantine + half-open probe
- Common enabling parts: NXP PCA9615 (diff I²C), TI ISO1540 (I²C isolation), ADuM1250 (I²C isolation)
Multi-node topologies (one bad node can stall the system):
- Feature set: per-node buckets, escalation ladder, isolation of bad nodes, bounded retries
- Common enabling parts: TI TCA9548A (I²C mux isolation), NXP PCA9548A (I²C mux), NXP SC16IS752 (dual UART bridge with FIFO)
Field updates under unstable power:
- Feature set: atomic config writes (dual-slot), safe-mode + diagnostics channel, write-inhibit under unstable power
- Common enabling parts: Infineon FM24C256 (FRAM), Fujitsu MB85RC256V (FRAM), TI TPS3430 (watchdog), TI TPS3890 (supervisor)
Safe configuration writes:
- Feature set: compare-before-write, readback verify, commit flag, CRC+version, rollback slot
- Common enabling parts: Microchip 24LC256 (I²C EEPROM), ST M24C256 (EEPROM), Winbond W25Q64JV (SPI NOR)
Selection rule: choose parts that reduce “retry storms” and make failures observable (FIFO depth, isolation, segmentation, and stable nonvolatile behavior).
Decision questions:
- Is the link long or noisy (clustered errors, latency spikes)?
- Is the topology multi-node (one bad node can stall the system)?
- Is field update required (remote recovery must succeed)?
- Is power-fail write risk present (brownout, bounce, sudden drop)?
- Is isolation used (added delay affects timeout segmentation)?
- Is buffering needed (deep FIFO or rate shaping to avoid avalanches)?
Firmware feature baseline:
- CRC + seq framing: enforce type/len/seq/crc at the protocol layer
- Timeout segmentation: queue / transfer / peer_process / retry_window / deadline
- Retry policy: bounded + backoff + jitter + circuit breaker
- Recovery SM: L1→L2→FailSafe with quarantine + half-open probe
- Atomic writes: dual-slot + commit flag + version + rollback
- Diagnostics path: minimum channel + counters export
I²C muxes & I/O expanders:
- TI TCA9548A / TCA9546A
- NXP PCA9548A
- TI TCA9535 (I/O expander used as controlled reset/enable lines)
- NXP PCA9555 (I/O expander used as controlled reset/enable lines)
Extenders & isolators:
- NXP PCA9615 (differential I²C extender)
- Analog Devices LTC4331 / LTC4332 (I²C/SMBus and SPI bus extenders)
- TI ISO1540 / ISO1541 (I²C isolation)
- Analog Devices ADuM1250 / ADuM1251 (I²C isolation)
- Analog Devices ADuM3151 / ADuM4151 (SPI isolation families)
UART & USB bridges:
- NXP SC16IS740 / SC16IS750 (UART bridge with FIFO)
- NXP SC16IS752 (dual UART bridge with FIFO)
- Silicon Labs CP2102N (USB-to-UART)
- FTDI FT232R (USB-to-UART)
Nonvolatile memories:
- Infineon FM24C256 (FRAM)
- Fujitsu MB85RC256V (FRAM)
- Microchip 24LC256 (EEPROM)
- ST M24C256 (EEPROM)
- Winbond W25Q64JV (SPI NOR flash)
- Macronix MX25L series (SPI NOR flash)
Supervisors & watchdogs:
- TI TPS3430 (watchdog timer)
- TI TPS3890 / TPS3891 (supervisor)
- Microchip MCP1316 / MCP130 (supervisor)
- Analog Devices / Maxim MAX809 / MAX810 (supervisor)
FAQs (Firmware Robustness)
Scope: firmware-only robustness (timeouts, retries, CRC/framing, power-fail safety, recovery state machines). Each answer is fixed to 4 lines to avoid expanding the main text. Thresholds use placeholders (X/Y/A/B/C) and must keep the same denominator/window definitions across the team.
Retries increased, but stability got worse — retry storm or idempotency issue?
Likely cause: synchronized retries (no backoff/jitter) amplify congestion, or non-idempotent commands create repeated side effects (writes/commits).
Quick check: bucket retry_count by peer_id + op; compare read vs write ops; check if failures cluster right after identical retry intervals.
Fix: add bounded retries + exponential backoff + jitter + circuit breaker; mark write/commit ops as “single-shot” or require safe idempotency token/seq.
Pass criteria: retry storms eliminated (no simultaneous spikes across peers), and error_rate ≤ X/1k transactions per peer/op over Y minutes.
CRC fails occasionally, but the scope looks “fine” — buffer/DMA underrun or the link?
Likely cause: buffer boundary bugs, DMA underrun/overrun, or framing/length mismatch masquerading as link corruption.
Quick check: correlate crc_fail with DMA underrun/overrun counters, FIFO watermark peaks, and (type,len,seq) parse errors in logs.
Fix: enforce frame header (type/len/seq) + CRC; reject partial frames; add “no-progress” detection; throttle bursts when watermark stays high.
Pass criteria: crc_fail ≤ X/1k and DMA underrun/overrun = 0 under defined load for Y minutes (per peer bucket).
Timeout too large causes latency explosion; too small causes false kills — what is the first budgeting step?
Likely cause: a single “flat timeout” mixes queueing, transfer, peer processing, and retry windows; decisions become guesswork.
Quick check: log segmented timestamps: queue_start, tx_start, peer_wait_start, retry_start, deadline_hit; verify forward progress (bytes/frames) per segment.
Fix: set soft timeout for peer_busy windows + hard timeout for no-progress; allocate a bounded retry window; keep an absolute deadline for the whole transaction/session.
Pass criteria: p99 latency ≤ Y ms with error_rate ≤ X/1k, and false-kill rate ≤ A/hour in the same workload window.
After power loss, configuration is occasionally corrupted — check commit flag or version/CRC first?
Likely cause: incomplete write accepted as valid, or boot selects the wrong slot due to weak commit/CRC/version rules.
Quick check: inspect boot decision logs: slot_id, commit_flag, stored_crc, computed_crc, version; confirm boot always rejects “commit=0” and CRC mismatch.
Fix: implement atomic two-phase commit (staging→verify→set_commit→verify) + dual-slot rollback; enable write-inhibit under unstable power states.
Pass criteria: across N power-cut injections, boot selection is 100% correct (commit+CRC+newest version), and corrupted-config incidence = 0.
Recovery logic enters a loop — how to add a circuit breaker and escalation ladder?
Likely cause: missing stop rules (max steps/time), or the state machine lacks a “fail-safe” terminal state and keeps retrying the same action.
Quick check: log (state, edge_id, attempt_index, elapsed_ms, escalation_level); verify whether transitions repeat without forward progress or cooldown.
Fix: add bounded attempts + max time budget; escalate L1→L2→FailSafe; quarantine the peer and probe with half-open checks after cooldown.
Pass criteria: infinite loops eliminated (max recovery time ≤ B ms), and recovery_fail ≤ C/hour with resets/window ≤ D.
Production passes, but field units are unstable — what counters/log context are usually missing?
Likely cause: metrics lack consistent denominators/buckets, or logs miss critical context (phase/state/power state), hiding the true failure mode.
Quick check: verify per-peer/per-op bucketing, fixed windows, and presence of (phase/state, attempt, escalation_level, power_state, result_code) in logs.
Fix: standardize counter definitions; add minimal event logs with rate limits; export an evidence pack (counters + key events + recovery edges hit) per unit/build.
Pass criteria: all KPIs computed from identical bucket/window rules, and field faults are classifiable (≥ P% mapped to a known root-cause bucket).
Peer “busy” causes many timeouts — wrong policy, or a non-reentrant state machine?
Likely cause: busy windows treated as hard timeouts, or recovery actions interrupt in-flight transactions without reentrancy protection.
Quick check: check phase/state transitions during busy periods; confirm whether “soft timeout” path exists and whether locks/guards prevent reentering recover while tx is active.
Fix: define soft timeout for busy/stretched phases + bounded backoff; make recovery SM reentrant-safe (idempotent actions, guarded critical sections, cancel/rollback semantics).
Pass criteria: timeout_count during peer_busy drops by ≥ Q%, and recovery_success rate ≥ R% without increased p99 latency beyond Y ms.
SPI high-throughput CRC spikes — DMA underrun first, or sampling window first?
Likely cause: DMA underrun/overrun or transaction framing mismatch is the most common firmware-side cause under burst load.
Quick check: correlate crc_fail with DMA underrun/overrun, FIFO watermark, and (expected_len vs actual_len) per transaction; run a known-header/pattern probe.
Fix: enforce strict framing (type/len/seq) + CRC, throttle bursts when watermark is high, and treat CRC clusters as “degrade” → backoff → resync path.
Pass criteria: DMA underrun/overrun = 0 and crc_fail ≤ X/1k at target throughput for Y minutes (per cs_id bucket).
UART occasional framing errors — baud error budget first, or noise deglitch first?
Likely cause: combined clock mismatch (baud budget) or bursty noise causing edge glitches; both appear as framing_err without proper context.
Quick check: compare error patterns: discrete vs clustered; log measured baud/clock trim, framing_err vs parity_err ratio, and whether errors drop when enabling deglitch/oversampling.
Fix: keep combined baud error within typical budget (target ≤ ±2% unless proven otherwise); add optional deglitch/oversampling with bounded latency impact and clear enable conditions.
Pass criteria: framing_err ≤ X/1k frames with p99 latency ≤ Y ms, and errors do not form clusters above K events in T seconds.
After recovery, things look “OK” but performance drops — what post-recovery health metrics should be logged?
Likely cause: the system stays in a degraded mode (extra retries/backoff, quarantine, reduced throughput) without explicit visibility.
Quick check: log a “recovery stamp” and compare pre/post metrics: p99 latency, retry rate, error_rate, quarantine_count, throughput, and resets/window over the same bucket/window.
Fix: enforce an observation window after recovery; only exit degrade/quarantine after N consecutive clean transactions; reset policy knobs to nominal in a controlled, logged step.
Pass criteria: post-recovery metrics return to baseline within W seconds (p99 ≤ Y ms; error_rate ≤ X/1k), and degrade time ratio ≤ S% per hour.