I²C Arbitration & Clock Stretching: Timeouts and Recovery
This page turns I²C arbitration and clock stretching into an executable reliability plan: phase-based timeouts, evidence-driven debugging, and a recovery state machine that prevents stalls and multi-master storms.
It helps long-cable and slow-slave systems stay stable by separating legit stretching from true hangs, then enforcing bounded retries, backoff fairness, and measurable pass criteria.
Center idea: make arbitration & stretching predictable, not mysterious
Arbitration loss is a normal multi-master path; clock stretching is a normal slow-slave path. The failures happen when timeouts and recovery are defined with the wrong measurement points.
This page provides a practical toolkit: phase-aware timeout templates, a recovery state machine (bus-clear & re-init), a telemetry schema (counters + histograms), and bring-up/production acceptance gates—so long cables and slow devices remain reliable.
Common field symptoms (measurable, not vague)
1) Intermittent bus stall / “hung bus”
Fast evidence: SCL low > X ms or SDA low > X ms; missing STOP; bus-clear counter increments.
First action: separate legit stretch from stuck-low using SCL-low duration distribution.
2) NAK / arbitration-loss rate spikes after system changes
Fast evidence: arb-lost per 1k transactions rises; retries cluster in bursts; error rate varies by address/op.
First action: enforce “release + randomized backoff” and verify fairness with per-master counters.
3) Throughput oscillation / periodic latency cliffs
Fast evidence: average throughput looks fine but P95/P99 latency jumps; stretch histogram grows a long tail.
First action: switch to phase-aware timeouts (bit/byte/transaction/bus-stuck) and log duration by phase.
Deliverables shipped by this page
- Timeout policy template: phase-aware timeouts (bit/byte/transaction/bus-stuck), with start/stop measurement points.
- Recovery sequence: retry → bus clear → peripheral re-init → controller reset, each with exit criteria.
- Telemetry schema: counters + histograms (stretch duration, arb-lost rate, bus-clear events) keyed by addr/op.
- Validation gates: bring-up and production checks that force stuck-low and long-tail stretch scenarios.
Mechanism primer: enough to design timeouts and avoid false “hangs”
Mechanism cheat sheet (engineering meaning, not protocol trivia)
- Arbitration: each master compares intended SDA vs actual SDA during SCL-high sampling. A master that tries to send ‘1’ but observes ‘0’ loses arbitration and must stop driving immediately.
- Why ‘0’ wins: open-drain wired-AND makes ‘0’ dominant; this is what enables safe multi-master operation.
- Clock synchronization: SCL is wired-AND across all participants, so the effective clock follows the slowest one. When any participant holds SCL low longer, the bus period stretches—so “ideal SCL frequency” is not a valid timeout reference.
- Engineering consequence: arbitration loss is a normal path; stability depends on release + randomized backoff, plus phase-aware timeouts that distinguish legit stretching from stuck-low faults.
Practical rule: measure timeouts by phase (ACK wait, byte transfer, transaction completion, bus-stuck) rather than by nominal SCL rate—especially with multi-master or slow devices.
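That rule can be sketched as a per-phase budget table. The phase names and microsecond values below are placeholders, not a real driver API; the point is that the decision is keyed by phase, never by nominal SCL rate.

```c
#include <stdint.h>

/* Hypothetical phase IDs and per-phase timeout budgets. Values are
 * placeholders; later sections describe tuning per slave and per phase. */
typedef enum {
    PHASE_ACK_WAIT,
    PHASE_BYTE_PROGRESS,
    PHASE_TRANSACTION,
    PHASE_BUS_STUCK,
    PHASE_COUNT
} i2c_phase_t;

static const uint32_t phase_timeout_us[PHASE_COUNT] = {
    [PHASE_ACK_WAIT]      =    500,   /* ACK sampling window          */
    [PHASE_BYTE_PROGRESS] =   2000,   /* one byte + its ACK phase     */
    [PHASE_TRANSACTION]   =  50000,   /* START..STOP + idle confirm   */
    [PHASE_BUS_STUCK]     = 100000,   /* fatal stuck-low guardrail    */
};

/* Returns 1 when the elapsed time in this phase exceeds its budget. */
int i2c_phase_timed_out(i2c_phase_t phase, uint32_t elapsed_us)
{
    return elapsed_us > phase_timeout_us[phase];
}
```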
Timing snippet: two masters start together, a single bit decides arbitration
The “loss point” is detected at the sampling window: when one master intends SDA=1 but the line is SDA=0. At that instant, the losing master must release the bus and reattempt only after bus-idle plus a randomized backoff.
Engineering meaning: arbitration loss should be counted and handled deterministically (release + backoff + retry), while timeouts must remain phase-aware so legit stretching is not misclassified as a hang.
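The loss point can be modelled in a few lines. The function below is purely illustrative: it simulates two masters each driving one 8-bit pattern (MSB first) onto a wired-AND bus and reports which master samples a ‘0’ while intending a ‘1’.

```c
#include <stdint.h>

/* Wired-AND arbitration sketch: '0' is dominant, so the bus carries the
 * bitwise AND of what both masters drive. A master loses at the first bit
 * where it intends '1' but samples '0'. Returns 0 if master A wins,
 * 1 if master B wins, -1 if the patterns are identical (no loss yet). */
int i2c_arbitrate(uint8_t a_bits, uint8_t b_bits)
{
    for (int bit = 7; bit >= 0; --bit) {             /* MSB first on the wire */
        int a   = (a_bits >> bit) & 1;
        int b   = (b_bits >> bit) & 1;
        int bus = a & b;                             /* open-drain wired-AND  */
        if (a == 1 && bus == 0) return 1;            /* A sampled 0: A loses  */
        if (b == 1 && bus == 0) return 0;            /* B sampled 0: B loses  */
    }
    return -1;                                       /* identical so far      */
}
```

Note the asymmetry this produces: the master addressing the numerically lower pattern tends to win, which is why fairness must come from backoff policy, not from arbitration itself.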
Clock stretching semantics: separate “legit wait” from “stuck bus”
Stretching is not the enemy. Misclassification is.
A robust design must distinguish a bounded, recoverable wait from an unbounded stuck-low condition. If “legit stretch” is treated as a hang, timeouts will abort valid transactions and create false instability. If “hang” is treated as stretch, the system will wait forever and appear dead.
Legit stretch (normal wait): bounded and statistically stable
- Typical sources: internal processing delay, slow peripherals, clock-domain crossings.
- Signature: stretch durations form a repeatable distribution with a practical upper bound (by slave/address/phase).
- Required handling: phase-aware timeouts + telemetry (histogram by phase) so long-tail growth is visible before it becomes a hang.
Key metric: stretch duration distribution, not a single average number. A stable distribution suggests “wait”; an expanding tail suggests “risk”.
False stretch / hang: unbounded, needs intervention
- Typical sources: brown-out leaves a device holding SCL/SDA low; IO latch-up after ESD; cable common-mode events that trap a port.
- Signature: SCL low (or SDA low) persists beyond the system’s “bus-stuck” threshold; missing STOP; repeated retries do not progress.
- Required handling: abort + bus-clear sequence + re-init; escalate to reset if bus-clear does not restore idle.
Fast discriminator: does the line return to idle before T_bus_clear? If not, treat it as a stuck-bus path and trigger recovery.
The thresholds above are placeholders. The engineering requirement is the structure: a warning threshold to increase logging, an abort threshold to bound transaction time, and a bus-clear threshold to avoid indefinite waits. Later chapters define how to set these thresholds per phase and per slave.
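The three-threshold structure can be captured directly in code. The names and numbers below are placeholders standing in for the per-phase, per-slave values defined in later chapters.

```c
#include <stdint.h>

/* Three-threshold classifier for an observed SCL-low hold, mirroring the
 * structure above: warn -> raise logging, abort -> bound the transaction,
 * bus-clear -> trigger recovery. Thresholds are placeholders. */
typedef enum {
    STRETCH_OK,
    STRETCH_WARN,
    STRETCH_ABORT,
    STRETCH_BUS_CLEAR
} stretch_class_t;

typedef struct {
    uint32_t warn_us;       /* increase logging               */
    uint32_t abort_us;      /* abort the current transaction  */
    uint32_t bus_clear_us;  /* start the bus-clear sequence   */
} stretch_policy_t;

stretch_class_t classify_scl_low(const stretch_policy_t *p, uint32_t low_us)
{
    if (low_us >= p->bus_clear_us) return STRETCH_BUS_CLEAR;
    if (low_us >= p->abort_us)     return STRETCH_ABORT;
    if (low_us >= p->warn_us)      return STRETCH_WARN;
    return STRETCH_OK;
}
```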
Failure taxonomy: symptom → mechanism → evidence → first check
This section is a field index. Each item follows a fixed structure to keep debugging deterministic: Symptom (what is observed) → Likely mechanism (why it happens) → What to log (fast evidence) → First check (10-minute action).
Arbitration loss storm (multi-master)
Symptom: arb-lost spikes in bursts; throughput drops despite bus activity.
Likely mechanism: backoff is missing or deterministic, causing synchronized retries.
What to log: arb-lost per master, retry interval distribution, bus utilization vs effective bytes.
First check: enforce release + randomized backoff; verify per-master counters converge.
Stretch tail grows with temperature
Symptom: P95/P99 latency rises; occasional long waits appear.
Likely mechanism: slow-slave processing or drift expands stretch distribution.
What to log: stretch histogram by phase and by address, plus temperature timestamp.
First check: apply phase-aware timeouts and verify the tail is bounded per slave.
SDA stuck low after reset
Symptom: missing STOP; SDA remains low; first transaction never completes.
Likely mechanism: a slave never releases SDA after reset/brown-out.
What to log: bus-idle check result, bus-clear attempts, offending address candidate list.
First check: run bus-clear; if not recovered, isolate which device holds SDA.
SCL stuck low (device or other master)
Symptom: all transactions fail; SCL low dominates; timeouts cluster in the same phase.
Likely mechanism: a participant holds SCL low indefinitely (fault or lockup).
What to log: SCL-low duration, timeout_by_phase, bus-clear success/fail.
First check: confirm whether bus-clear restores idle; if not, isolate the holder by segmentation.
“Looks OK on scope” but NAK spikes
Symptom: waveform seems clean, yet error rate jumps in logs.
Likely mechanism: wrong denominator/window; mixing endpoints, addresses, or R/W directions.
What to log: NAK by address + R/W + phase, with a fixed time window.
First check: standardize metric definition; split counters by endpoint and operation type.
Throughput periodic drops
Symptom: drops appear with a stable period; bursts of retries/long waits align to a cadence.
Likely mechanism: stretching couples to scheduling/DMA bursts or polling loops.
What to log: stretch duration + a scheduler marker (tick, load, or burst ID).
First check: test jittered scheduling/backoff; confirm correlation weakens and tail shrinks.
Long-cable ringing → false edges
Symptom: errors depend on cable/connector changes; near-end and far-end waveforms differ.
Likely mechanism: edge artifacts masquerade as timing/protocol events.
What to log: capture near vs far taps; annotate with cable variant and event time.
First check: validate edge control and immunity paths (handled in the EMC/edge-control page).
Brown-out ghost powering
Symptom: faults cluster around plug/unplug or supply dips; reset order changes behavior.
Likely mechanism: IO back-powering or partial reset traps the bus.
What to log: rail timing + fault timestamp + bus-idle state.
First check: align reset/power sequencing (detailed handling belongs to the Hot-plug/Brown-out page).
The decision tree is intentionally minimal: it focuses on the fastest discriminators (bounded vs unbounded stretch, phase counters, metric sanity, near/far taps). Later chapters define the detailed timeout templates and recovery state machine referenced here.
Timeout policy design: what to time, where to start the clock
A timeout is not a number. It is a definition.
Robust systems define what is being waited for, when the clock starts, when it stops, and how decisions escalate from progress protection to transaction bounding to bus availability. Without a clear timing contract, valid clock stretching will be misclassified as a hang, and real hangs will be allowed to persist.
Layered policy map (three rings)
- Short (progress): detect “no forward progress” at bit/byte/ACK phases.
- Long (transaction): bound end-to-end time from START to STOP + bus-idle confirmation.
- Fatal (bus availability): detect stuck-low or persistent non-idle and trigger recovery.
Required telemetry: phase_id, addr, R/W, duration, reason, and window.
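A minimal telemetry record with exactly those fields might look like this; the field names and the log-line format are assumptions, not a fixed schema.

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative timeout-event record carrying the required telemetry:
 * phase_id, addr, R/W, duration, reason, and window. */
typedef struct {
    uint8_t     phase_id;
    uint8_t     addr;          /* 7-bit slave address           */
    char        rw;            /* 'R' or 'W'                    */
    uint32_t    duration_us;
    const char *reason;        /* e.g. "ack_wait", "bus_stuck"  */
    uint32_t    window_ms;     /* measurement window            */
} timeout_event_t;

/* Renders one event as a single log line; returns bytes written (or a
 * negative value on encoding error, per snprintf semantics). */
int format_timeout_event(char *buf, size_t n, const timeout_event_t *e)
{
    return snprintf(buf, n,
                    "phase=%u addr=0x%02X rw=%c dur_us=%u reason=%s win_ms=%u",
                    e->phase_id, e->addr, e->rw,
                    (unsigned)e->duration_us, e->reason,
                    (unsigned)e->window_ms);
}
```

One line per event with fixed key names keeps the buckets (addr, R/W, phase) machine-parsable, which the validation gates below rely on.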
T_SCL_LOW_MAX — SCL low hold upper bound
Wait for: SCL returning high after being held low.
Start clock: first detection of SCL low (level holds beyond a debounce window).
Stop clock: SCL high confirmed (stable high for ≥ X).
Failure shape: bounded distribution indicates legit stretch; unbounded indicates stuck-low risk.
Tune knobs: bucket by phase and by slave; watch tail growth (P95/P99).
Log: phase_id, addr (if known), low_duration, repeats, bus_state.
T_ACK_WAIT — ACK/NACK completion window
Wait for: ACK bit sampling to complete for address/data bytes.
Start clock: entry into ACK phase (after 8 data bits).
Stop clock: ACK sampled and the next phase begins (or STOP/restart is scheduled).
Failure shape: timeouts here often indicate phase definition errors or a stretched SCL low.
Tune knobs: separate by address + R/W + byte index; do not mix endpoints.
Log: addr, R/W, byte_index, ack_value, duration, reason.
T_BYTE_PROGRESS — byte-level forward progress
Wait for: completion of one byte + its ACK phase.
Start clock: first bit of the byte is launched (or first SCL edge of that byte).
Stop clock: ACK phase completes and next byte/phase begins.
Failure shape: detects mid-transaction stalls earlier than transaction timeouts.
Tune knobs: bucket by phase + byte_index; treat reads and writes separately.
Log: phase_id, byte_index, addr, duration, scl_low_hits, retry_count.
T_TRANSACTION_MAX — end-to-end transaction bound
Wait for: a full transaction from START to STOP completion.
Start clock: START observed (or bus ownership acquired in multi-master).
Stop clock: STOP completed and bus returns to idle.
Failure shape: protects upper layers from being blocked by long stalls or repeated retries.
Tune knobs: set per operation type (register read vs bulk write); cap retries within this bound.
Log: addr, op_type, total_duration, retries, abort_reason, final_state.
T_STOP_COMPLETE / T_BUS_IDLE_CONFIRM — bus idle confirmation
Wait for: the bus to become truly idle after STOP.
Start clock: STOP is issued.
Stop clock: SCL=H and SDA=H stable for ≥ X (debounced).
Failure shape: prevents false “STOP done” when a line is still being held.
Tune knobs: keep small but explicit; use as a gate before reattempting traffic.
Log: idle_confirm_time, line_states, stuck_line (SCL/SDA), followup_action.
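A debounced idle check can be sketched as a small accumulator fed by line samples. The helper below is hypothetical and HAL-agnostic; `dt_us` is the time since the previous sample.

```c
#include <stdint.h>

/* Debounced bus-idle confirmation: both lines must read high continuously
 * for `confirm_us` before idle is declared. Any low sample restarts the
 * window, so a line still being held never yields a false "STOP done". */
typedef struct {
    uint32_t high_run_us;   /* how long both lines have been high */
    uint32_t confirm_us;    /* required stable-high window        */
} idle_confirm_t;

/* Feed one observation; returns 1 once idle is confirmed. */
int idle_confirm_sample(idle_confirm_t *s, int scl, int sda, uint32_t dt_us)
{
    if (scl && sda) {
        s->high_run_us += dt_us;
    } else {
        s->high_run_us = 0;          /* a held line restarts the clock */
    }
    return s->high_run_us >= s->confirm_us;
}
```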
T_BUS_IDLE_WAIT — wait-for-idle gate (multi-master safe)
Wait for: bus idle before issuing a new transaction or a recovery step.
Start clock: a transaction is requested but bus is not idle (or after arb-lost).
Stop clock: idle confirmed (same criteria as idle confirmation).
Failure shape: prevents “recovery injection” from interrupting another master’s active transaction.
Tune knobs: add randomized backoff windows to avoid synchronized reattempts.
Log: idle_wait_time, arb_lost_seen, backoff_applied, bus_owner_hint (if available).
T_BUS_STUCK_FATAL — bus availability guardrail
Wait for: persistent non-idle (SCL low or SDA low) beyond a fatal threshold.
Start clock: detection of persistent line hold (debounced).
Stop clock: bus returns to idle, or recovery is triggered.
Failure shape: distinguishes “bounded wait” from “unbounded hang” and forces recovery.
Tune knobs: keep separate from short/long timeouts; do not let it silently inherit them.
Log: stuck_line, stuck_duration, last_phase, bus_clear_attempts, escalation_level.
The diagram shows why timing points matter: ACK/byte progress windows start and stop at phase boundaries, while transaction bounds cover START to STOP plus idle confirmation. Fatal stuck-bus detection remains separate to avoid misclassifying bounded stretching as a hang.
Recovery state machine: bus clear, re-init, and backoff
Recovery must be deterministic. A state machine avoids “folk remedies” by defining triggers, actions, and exit criteria. This prevents interrupting another master in multi-master systems and prevents synchronized retry storms through jittered backoff.
STATE: SOFT_RETRY
Trigger: short timeout (ACK/byte progress) while bus still becomes idle.
Action: abort current attempt, retry the same transaction with a bounded retry count.
Exit criteria: success OR retry budget exhausted → escalate to BACKOFF_WAIT or BUS_CLEAR_GATE.
STATE: BACKOFF_WAIT (jitter)
Trigger: arbitration loss bursts, repeated retries, or post-recovery reattempts.
Action: wait a randomized backoff window (jittered) before re-checking bus idle.
Exit criteria: backoff complete AND bus idle confirmed → reattempt; otherwise continue waiting up to T_BUS_IDLE_WAIT.
STATE: BUS_CLEAR_GATE (multi-master safe)
Trigger: fatal stuck-bus detection (persistent non-idle) OR transaction max exceeded with no progress.
Action: only proceed if the bus is confirmed stuck (no progress) and there is no active ownership by another master.
Exit criteria: gate satisfied → BUS_CLEAR; otherwise wait for idle or re-check after backoff.
STATE: BUS_CLEAR (9 pulses + STOP/idle)
Trigger: BUS_CLEAR_GATE satisfied.
Action: toggle SCL (typically 9 pulses, enough to clock out any byte a slave is still shifting) so the stuck slave releases SDA; then attempt STOP and confirm bus idle.
Exit criteria: bus idle confirmed → RE_INIT_CTRL; if still not idle → RESET_PERIPH or FULL_RESET (escalation).
STATE: RE_INIT_CTRL
Trigger: bus idle recovered or controller state is suspected inconsistent.
Action: re-initialize controller peripheral; re-apply timing config; clear error flags.
Exit criteria: PROBE_OK passes within a stable window → EXIT_OK; otherwise BACKOFF_WAIT or escalate.
STATE: RESET_PERIPH (targeted)
Trigger: bus-clear fails to restore idle OR a specific device is strongly correlated with holds/timeouts.
Action: reset or power-cycle the suspected peripheral (targeted), then re-check bus idle.
Exit criteria: bus idle confirmed + PROBE_OK → EXIT_OK; otherwise FULL_RESET.
STATE: FULL_RESET (last resort)
Trigger: repeated recovery failure or controller lockup is suspected.
Action: full controller reset and system-level mitigation; then enforce BACKOFF_WAIT before resuming traffic.
Exit criteria: bus idle confirmed + stable window without stuck-bus events → EXIT_OK; otherwise hold in safe mode.
Multi-master safety is enforced by the BUS_CLEAR_GATE: destructive actions are only allowed when the bus is confirmed stuck with no progress. The BACKOFF_WAIT state is mandatory after arbitration loss or recovery to avoid synchronized retry storms.
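The escalation ladder above can be reduced to a pure transition function that is easy to unit-test. The two-event model (success/fail per state) is a simplification of the triggers in the text, and the state names mirror those used here.

```c
/* Recovery-ladder sketch: each state either exits toward OK on success or
 * escalates one rung on failure. EXIT_OK and SAFE_MODE are terminal. */
typedef enum {
    ST_SOFT_RETRY, ST_BACKOFF_WAIT, ST_BUS_CLEAR_GATE, ST_BUS_CLEAR,
    ST_RE_INIT_CTRL, ST_RESET_PERIPH, ST_FULL_RESET,
    ST_EXIT_OK, ST_SAFE_MODE
} rec_state_t;

typedef enum { EV_SUCCESS, EV_FAIL } rec_event_t;

rec_state_t recovery_step(rec_state_t s, rec_event_t ev)
{
    if (ev == EV_SUCCESS) {
        switch (s) {
        case ST_SOFT_RETRY:     return ST_EXIT_OK;
        case ST_BACKOFF_WAIT:   return ST_SOFT_RETRY;    /* reattempt      */
        case ST_BUS_CLEAR_GATE: return ST_BUS_CLEAR;     /* gate satisfied */
        case ST_BUS_CLEAR:      return ST_RE_INIT_CTRL;  /* idle restored  */
        case ST_RE_INIT_CTRL:   return ST_EXIT_OK;       /* PROBE_OK       */
        case ST_RESET_PERIPH:   return ST_EXIT_OK;
        case ST_FULL_RESET:     return ST_EXIT_OK;
        default:                return s;                /* terminal       */
        }
    }
    /* EV_FAIL: escalate; a failed FULL_RESET holds in safe mode. */
    switch (s) {
    case ST_SOFT_RETRY:     return ST_BACKOFF_WAIT;
    case ST_BACKOFF_WAIT:   return ST_BUS_CLEAR_GATE;
    case ST_BUS_CLEAR_GATE: return ST_BACKOFF_WAIT;      /* re-check later */
    case ST_BUS_CLEAR:      return ST_RESET_PERIPH;
    case ST_RE_INIT_CTRL:   return ST_BACKOFF_WAIT;
    case ST_RESET_PERIPH:   return ST_FULL_RESET;
    case ST_FULL_RESET:     return ST_SAFE_MODE;
    default:                return s;
    }
}
```

Keeping the transitions in one pure function makes the exit criteria auditable: a table test can assert that no failure path loops without escalating.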
Multi-master backoff & fairness: prevent arbitration storms
Problem: “smart” masters can fail together
Arbitration loss is expected in multi-master systems. The failure mode is lockstep retry: both masters time out, release, and re-attempt on the same schedule, creating bursty arbitration storms. The bus looks busy, but effective transaction completion collapses due to repeated collisions.
Storm signatures (quick recognition)
- Burstiness: arb-lost events appear in clusters, not as uniform scatter.
- Lockstep: re-attempt intervals are quantized to the same tick or fixed delay.
- Efficiency drop: retry/abort ratio grows while bus utilization stays high.
- Phase concentration: collisions concentrate at START/address boundaries.
Strategy: backoff + slicing + idle gates (three layers)
Layer A — Jittered backoff after arbitration loss
Goal: break time correlation between masters.
Rule: backoff must include random jitter; fixed delays re-align on system ticks.
Implementation cues: seed sources must differ per master; resolution must exceed OS tick quantization.
Outcome: collision probability decreases over time instead of repeating indefinitely.
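A minimal jittered-backoff sketch, assuming a per-master xorshift PRNG; the PRNG choice, the exponent cap, and the constants are illustrative. What matters is that the seed differs per master and the jitter resolution exceeds the OS tick.

```c
#include <stdint.h>

/* Per-master PRNG state; the seed MUST differ between masters and be
 * nonzero (xorshift32 requirement). */
typedef struct { uint32_t state; } backoff_rng_t;

static uint32_t xorshift32(backoff_rng_t *r)
{
    uint32_t x = r->state;
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return r->state = x;
}

/* backoff = base * 2^attempt + rand[0..jitter], with a capped exponent
 * so the window stays bounded under repeated losses. */
uint32_t backoff_us(backoff_rng_t *r, uint32_t attempt,
                    uint32_t base_us, uint32_t jitter_us)
{
    if (attempt > 8) attempt = 8;                     /* cap the exponent */
    uint32_t window = base_us << attempt;
    return window + xorshift32(r) % (jitter_us + 1);  /* base + rand[0..J] */
}
```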
Layer B — Transaction slicing (reduce continuous occupancy)
Goal: reduce “hold time” so competing masters can interleave.
Rule: split large transfers into bounded fragments with explicit completion points.
Safety: fragment boundaries must be recoverable (avoid unsafe partial writes).
Outcome: fairness improves and worst-case latency tightens under contention.
Layer C — Idle detect and claim threshold (avoid aggressive re-claim)
Goal: avoid injecting attempts into a bus that is not truly idle.
Rule: require debounced idle confirmation before re-claim; apply a fairness gate after repeated losses.
Outcome: fewer collisions at START boundaries and fewer “storms after recovery”.
Validation: quantify convergence, not just “feels better”
Metrics to report (two normalizations)
- arb-lost rate per 1k transactions: quality normalization across workloads.
- arb-lost rate per minute: stability and burst detection over time.
- burstiness index: peak/mean in a fixed window (storm indicator).
- time-to-converge: time from first collision burst to stable regime.
- efficiency: successful transactions / total attempts (retries included).
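Two of these metrics reduce to integer arithmetic over raw counters. The sketch below uses illustrative names; `counts[]` holds arb-lost events per fixed window.

```c
#include <stdint.h>

/* successful transactions / total attempts (retries included), percent */
uint32_t efficiency_pct(uint32_t successes, uint32_t attempts)
{
    return attempts ? (100u * successes) / attempts : 0;
}

/* burstiness index = peak window count / mean window count, scaled x100
 * to stay in integer math. 100 means perfectly uniform; large values
 * indicate a storm concentrated in few windows. */
uint32_t burstiness_x100(const uint32_t *counts, uint32_t n)
{
    uint32_t peak = 0;
    uint64_t sum  = 0;
    for (uint32_t i = 0; i < n; ++i) {
        if (counts[i] > peak) peak = counts[i];
        sum += counts[i];
    }
    if (sum == 0) return 0;
    return (uint32_t)((100ull * peak * n) / sum);     /* peak / (sum/n) */
}
```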
Minimum log fields (to make metrics unambiguous)
master_id, event_time, event_type (arb_lost/timeout/success), addr, R/W, op_type, retry_index, backoff_applied, idle_confirm_time.
The critical design lever is decorrelation: jittered backoff must be truly random per master and must not be quantized to the same scheduling tick. Slicing reduces continuous occupancy, while idle gates reduce collisions at START boundaries.
Long cable & slow slave compatibility: what changes
Three things that change (and what must be adjusted)
Change #1 — Stretch distribution widens (tail risk increases)
Why: slow internal processing and system scheduling amplify tail latency.
Risk: a single timeout number either kills legit stretches or tolerates real hangs.
Do: keep layered timeouts (short/long/fatal) and tune per-phase and per-slave.
Pass: tail metrics (P95/P99) remain bounded; fatal stuck-bus triggers become rare.
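The tail pass criterion needs a percentile estimator over the stretch histogram. The sketch below assumes fixed-width linear buckets for simplicity; log-spaced buckets work the same way with a different edge mapping.

```c
#include <stdint.h>

/* Percentile estimate from a bucketed stretch-duration histogram.
 * Returns the upper edge (in us) of the bucket that contains the
 * requested percentile; 0 if the histogram is empty. */
uint32_t stretch_percentile_us(const uint32_t *hist, uint32_t nbuckets,
                               uint32_t bucket_width_us, uint32_t pct)
{
    uint64_t total = 0;
    for (uint32_t i = 0; i < nbuckets; ++i) total += hist[i];
    if (total == 0) return 0;

    uint64_t target  = (total * pct + 99) / 100;      /* ceil(total*pct/100) */
    uint64_t running = 0;
    for (uint32_t i = 0; i < nbuckets; ++i) {
        running += hist[i];
        if (running >= target) return (i + 1) * bucket_width_us;
    }
    return nbuckets * bucket_width_us;
}
```

Comparing P95/P99 across firmware builds (per slave, per phase) is what makes “tail remains bounded” a checkable gate rather than an impression.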
Change #2 — Retry cost rises (fail-fast + spaced retries)
Why: longer transactions and longer recovery windows increase the cost of each failure.
Risk: repeated retries can monopolize the bus and trigger contention storms.
Do: fail fast on short timeouts, cap retries under the transaction bound, and apply jittered spacing.
Pass: average failure occupancy decreases; throughput becomes smoother under load.
Change #3 — Observability becomes mandatory (per-slave buckets)
Why: symptoms look similar on the bus, but root causes concentrate by address, operation, and phase.
Risk: tuning becomes guesswork without per-slave and per-operation visibility.
Do: log and bucket by addr, R/W, op_type, and phase_id; correlate with stretch tails and timeouts.
Pass: the worst offender can be named (which slave, which phase, which operation).
The purpose of tap points is not only waveform visibility; it is metric correctness. Without per-slave and per-operation buckets, timeouts and backoff policies cannot be tuned safely for long-reach systems.
Engineering checklist: bring-up → production
Use this chapter as an executable SOP. Every item is written as Action → Evidence → Pass criteria. Material numbers below are concrete examples to speed procurement; verify package/suffix/voltage/temperature grade and availability.
Bring-up checklist (lab)
☐ Define mode / speed / multi-master enable strategy (allowed set + safe defaults + disallowed toggles)
Action: publish an allowed operating set (speed, stretching policy, multi-master on/off) and lock down unsafe hot-switches.
Evidence: configuration maps to firmware build flags / register writes; mode changes are auditable.
Pass criteria: each allowed mode has a bounded timeout profile (short/long/fatal) and a recovery profile; no “silent” mode changes.
Material examples (for segmentation / contention control)
- I²C mux / hub: TI TCA9548A, NXP PCA9548A (resolve address conflicts; isolate channels).
- Hot-swap / stuck-bus recovery helper: TI TCA4311A, Analog Devices LTC4311 (bus stuck prevention / idle conditioning).
- I²C buffer / repeater: NXP PCA9515A, NXP PCA9617A (capacitance segmentation; level isolation between segments).
- Isolation (bidirectional I²C): Analog Devices ADuM1250, TI ISO1540 (functional/safety isolation with delay budget check).
☐ Install initial timeout gates + log schema (phase, slave addr, duration)
Action: implement layered timeouts (short/long/fatal) and bind decisions to phase_id and per-slave buckets.
Evidence: logs contain: phase_id, addr, R/W, op_type, duration, result, retry_index, backoff_applied, bus_clear_count.
Pass criteria: timeouts are unambiguous (start/stop points defined); “scope looks OK but metrics are bad” can be resolved using buckets.
Material examples (for timestamping / persistent logs)
- Non-volatile ring buffer (high write endurance): Fujitsu MB85RC256V (I²C FRAM), Cypress/Infineon FM24C256 (I²C FRAM).
- SPI NOR for snapshots: Winbond W25Q64JV, Macronix MX25L6406E (store last-N transaction bundles).
- RTC (optional, stable timestamps): Maxim/Analog Devices DS3231.
☐ Validate bus clear (force stuck scenarios; verify exit criteria)
Action: deliberately pull SDA/SCL low to create stuck-bus; run recovery state machine (soft retry → bus clear → re-init).
Evidence: capture aligned evidence (waveform/protocol/counters) around the trigger; record held_line (SCL/SDA) and bus_clear outcome.
Pass criteria: bus idle confirmed within X (placeholder) and recovery does not create arbitration storms in multi-master systems.
Material examples (for controlled “stuck” injection)
- Open-drain pull-down gate: TI SN74LVC1G07, Nexperia 74LVC1G07 (force line low under MCU control).
- N-MOSFET clamp: Nexperia BSS138, Diodes Inc. 2N7002 (simple controlled clamp fixture).
- Analog switch (route clamp to bus): TI TS5A23157, Analog Devices/Maxim MAX4619 (inject/remove faults without rework).
☐ Simulate slow slaves (controlled stretching delays; profile tail behavior)
Action: inject deterministic and randomized response delays in a controllable slave; collect stretch_histogram per phase and per addr.
Evidence: bounded vs heavy-tail vs unbounded behavior is visible in histograms; timeouts_by_phase correlates with offenders.
Pass criteria: short timeouts do not kill legit stretching; fatal timeout reliably detects unbounded hangs; retry spacing prevents storm.
Material examples (slow-slave / programmable emulation)
- Programmable slave MCU (inject stretch): ST STM32G031K8, Microchip ATSAMD21G18 (build a controllable I²C slave to stretch and fault).
- Bus extender (stress long-reach behavior without cabling physics deep-dive): NXP PCA9615 (differential I²C over twisted pair) as a controllable reach extender.
- Address conflict / fan-out stress: TI TCA9546A (4-ch mux) to create multi-branch scenarios for bucketed metrics.
☐ Prepare evidence-capture tools (trigger-first; align waveform/protocol/counters)
Action: set analyzer triggers (arb-lost, SCL low > T, missing STOP, repeated START) and verify counter alignment (timeouts_by_phase, stretch_histogram).
Evidence: a single incident can be reconstructed across waveform/protocol/firmware layers.
Pass criteria: “scope looks OK but metrics are bad” can be resolved via denominator/window and per-slave buckets.
Tool material examples (concrete SKUs)
- Protocol analyzer: Total Phase Beagle I2C/SPI Protocol Analyzer (e.g., “Beagle I2C/SPI v2”).
- Logic analyzer: Saleae Logic Pro 8 / Logic Pro 16 (trigger + decode workflows).
- Active probe (scope): Tektronix TAP1500 (example SKU) for clean edge visibility when needed.
Production checklist (field-ready)
☐ Define monitoring metrics + alert thresholds (placeholders X/Y)
Action: publish dashboards and alerts using consistent normalization (per 1k transactions and per minute).
Evidence: metrics are bucketed by addr/R-W/op/phase; alert rules point to a next-step action.
Pass criteria: alerts separate “contention storm” from “slow-slave tail growth” and from “stuck bus”.
Material examples (telemetry storage and export)
- FRAM (high-endurance counters/log): Fujitsu MB85RC256V, Infineon/Cypress FM24C256.
- SPI NOR option: Winbond W25Q32JV (compact snapshot storage for alerts).
☐ Deploy bounded field recovery policy (max retries + degrade mode + safe isolation hooks)
Action: cap retries and bound worst-case occupancy; apply jittered spacing; enter a controlled degraded mode (e.g., lower speed) after repeated failures.
Evidence: recovery actions and outcomes are logged; backoff is applied after arb-lost clusters.
Pass criteria: no infinite retry loops; contention does not escalate into storms; bus returns to idle confirmed before resuming traffic.
Material examples (redundancy / isolate-and-fallback building blocks)
- Channel isolation / fallback selection: TI TCA9548A (route around a bad branch), NXP PCA9547 (8-ch mux).
- Isolation boundary (if required): TI ISO1540, Analog Devices ADuM1250 (verify delay budget vs timeout policy).
- Hot-swap conditioning: TI TCA4311A, Analog Devices LTC4311 (reduce stuck-bus events after disturbances).
☐ Implement failure sample capture (last N transactions + freeze on trigger)
Action: store a ring buffer of the last N transactions (addr/RW/op/phase/duration/result/retry/arb-lost); freeze snapshot on fatal timeout or bus clear.
Evidence: each field incident includes a compact “evidence bundle” exportable for lab replay.
Pass criteria: incidents can be reconstructed without blind waveform hunting; offenders can be named by bucket keys.
Material examples (snapshot persistence)
- I²C FRAM (write-heavy safe): Fujitsu MB85RC256V.
- SPI NOR (bulk snapshot): Winbond W25Q64JV.
- MicroSD socket (if acceptable): TE Connectivity 2041021-1 (example socket PN; verify footprint).
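A freeze-on-trigger ring for the last N transactions can be sketched in a few lines; the record layout, N, and the persistence target (FRAM, NOR) are placeholders for the parts listed above.

```c
#include <stdint.h>

#define EVIDENCE_N 8   /* placeholder depth; size to the FRAM/NOR budget */

/* One compact transaction record: bucket keys + outcome. */
typedef struct {
    uint8_t  addr;
    char     rw;            /* 'R' or 'W'         */
    uint8_t  phase_id;
    uint32_t duration_us;
    uint8_t  result;        /* 0 = ok, else error */
} txn_record_t;

typedef struct {
    txn_record_t slot[EVIDENCE_N];
    uint32_t head;          /* next write index                 */
    uint32_t count;         /* valid records, up to EVIDENCE_N  */
    int      frozen;        /* set on fatal timeout / bus clear */
} evidence_ring_t;

/* Normal path: overwrite oldest. Once frozen, the snapshot is preserved
 * until it has been exported as an evidence bundle. */
void evidence_push(evidence_ring_t *r, txn_record_t rec)
{
    if (r->frozen) return;
    r->slot[r->head] = rec;
    r->head = (r->head + 1) % EVIDENCE_N;
    if (r->count < EVIDENCE_N) r->count++;
}

void evidence_freeze(evidence_ring_t *r) { r->frozen = 1; }
```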
Gate failures should produce a reusable evidence bundle: last-N transactions + aligned triggers + counters by phase and by slave. This keeps debugging deterministic across bring-up and field returns.
FAQs: arbitration, stretching, timeouts, recovery
Scope: only long-tail troubleshooting within this page boundary (arbitration, stretching semantics, phase-based timeouts, recovery state machine, backoff/fairness, and evidence alignment). Format per FAQ: Likely cause / Quick check / Fix / Pass criteria (threshold placeholders).
Arbitration loss suddenly spikes after adding a second master — first fairness/backoff check?
Likely cause: retry attempts are time-aligned (no jitter), or both masters share the same backoff quantization/seed, creating lockstep collisions.
Quick check: plot arb-lost timestamps per master; look for constant inter-retry spacing and near-zero phase offset between masters; verify per-master RNG seeding differs.
Fix: enable jittered backoff (base + rand[0..J]); increase backoff resolution above OS tick; enforce a fairness gate after consecutive arb-lost bursts.
Pass criteria: arb-lost rate < X/1k over Y minutes and burstiness (peak/mean) < Z.
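The lockstep failure mode is easy to demonstrate: two masters whose backoff RNGs share a seed produce identical retry schedules, so every retry collides, while distinct seeds make the schedules diverge. A sketch with a hypothetical `retry_times_us` helper and placeholder timing constants:

```python
import random

def retry_times_us(seed: int, attempts: int,
                   base_us: int = 200, jitter_us: int = 150):
    """Retry schedule for one master: fixed base spacing plus seeded jitter.
    Returns cumulative retry timestamps in microseconds."""
    rng = random.Random(seed)        # per-master RNG; seeds MUST differ
    t, schedule = 0, []
    for _ in range(attempts):
        t += base_us + rng.randrange(jitter_us + 1)
        schedule.append(t)
    return schedule

# Same seed -> identical (lockstep) schedules: every retry collides.
# Distinct seeds -> schedules diverge and collisions become transient.
master_a = retry_times_us(seed=1, attempts=8)
master_b = retry_times_us(seed=1, attempts=8)   # misconfigured: same seed
master_c = retry_times_us(seed=2, attempts=8)
```

On real hardware, also check that the jitter resolution is finer than the OS tick; jitter quantized to a shared tick collapses back into lockstep.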
Bus “stalls” but SCL isn’t stuck low — what timeout phase definition is wrong?
Likely cause: the timeout clock starts/ends on the wrong phase boundary (e.g., timing “transaction” from START but never stopping at STOP/idle-confirm), so progress is misclassified as a stall.
Quick check: inspect timeouts_by_phase buckets; confirm each timeout has a precise start point and exit point (STOP issued vs bus idle confirmed); check for timeouts firing while bytes are still advancing.
Fix: split timeouts by intent (ACK-wait, byte-progress, transaction-max, fatal-stuck); define phase transitions explicitly and stop the clock on a verifiable exit (idle-confirm window).
Pass criteria: false “stall” timeouts < X/day and timeouts correlate to a single phase_id with consistent offenders (addr/op buckets).
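The intent split above can be written down as a timeout table in which every entry carries an explicit start event and a verifiable exit event, so the clock never keeps running past a real phase boundary. Phase names, events, and limits below are illustrative placeholders:

```python
# Phase-intent timeout table (placeholder values). Each timeout names its
# start point and its verifiable exit point, so progress is never
# misclassified as a stall.
TIMEOUTS_US = {
    # intent:           (start event,       exit event,              limit_us)
    "ack_wait":         ("byte shifted out", "ACK/NAK sampled",          2_000),
    "byte_progress":    ("byte started",     "9th clock completed",     10_000),
    "transaction_max":  ("START issued",     "STOP + idle confirmed",  100_000),
    "fatal_stuck":      ("line held low",    "line released",           50_000),
}

def timeout_fired(phase: str, elapsed_us: int) -> bool:
    """True only when the elapsed time in this phase exceeds its own limit."""
    _start_evt, _exit_evt, limit_us = TIMEOUTS_US[phase]
    return elapsed_us > limit_us
```

Logging the `phase` key alongside each firing gives the `timeouts_by_phase` buckets used in the Quick check.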
Clock stretching works on bench but fails in system — master timeout too tight or false hang?
Likely cause: system workload widens stretch distribution (tail growth) and crosses a short timeout gate, or “hang” classification triggers on missing idle-confirm rather than unbounded SCL low.
Quick check: compare stretch_histogram (bench vs system) by phase_id and by addr; verify whether failures occur at P95/P99 tail or at unbounded events; confirm idle-confirm window definition.
Fix: use layered timeouts (short/long/fatal) and per-slave/per-phase profiles; treat “unbounded” as fatal-stuck, not as normal tail; ensure hang detection requires no-progress evidence.
Pass criteria: P99 stretch < X ms with < Y false fatal events per Z hours.
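The layered-gate idea can be sketched as a percentile check plus a classifier: tail growth is a warning signal, and only an effectively unbounded stretch is treated as fatal-stuck. The nearest-rank percentile and the gate thresholds below are illustrative:

```python
def stretch_percentile(samples_us, p: float) -> int:
    """Nearest-rank percentile of observed SCL-stretch durations (us)."""
    s = sorted(samples_us)
    k = max(0, min(len(s) - 1, int(round(p / 100 * len(s))) - 1))
    return s[k]

def classify_stretch(duration_us: int, short_us: int,
                     long_us: int, fatal_us: int) -> str:
    """Layered gates: only an unbounded stretch is fatal; tail growth
    is logged and alerted on, not treated as a hang."""
    if duration_us > fatal_us:
        return "fatal_stuck"     # requires no-progress evidence to act on
    if duration_us > long_us:
        return "tail_warning"    # grow the histogram, consider profiles
    if duration_us > short_us:
        return "slow"            # fail-fast candidates, retry with jitter
    return "normal"
```

Comparing `stretch_percentile(bench, 99)` against `stretch_percentile(system, 99)` per `phase_id`/`addr` bucket answers the bench-vs-system question directly.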
Only one slave causes long stretch tails — how to prove slow-slave vs topology artifact?
Likely cause: a single device has variable internal service time (processing/clock-domain) that dominates tail behavior; apparent “system” issues are actually per-address/per-operation.
Quick check: bucket stretch_histogram by addr + R/W + op_type; compare tails across slaves under the same master load; verify the tail follows that slave across time windows.
Fix: apply per-slave timeout profiles and retry spacing; reduce single-transaction occupancy (slicing) for that slave’s heavy operations; add offender-specific logging and alerting.
Pass criteria: offending slave’s P99 stretch decreases or is tolerated without fatal timeouts; system-level throughput variance < X% over Y minutes.
After brown-out, SDA stays low — what is the fastest bus-clear sequence to try?
Likely cause: a slave is holding SDA waiting for clocking to complete an internal state, or the bus never returns to a confirmed idle state after reset.
Quick check: confirm held_line = SDA via GPIO readback or capture; verify whether SCL can toggle; check bus_clear_count/outcome and whether STOP/idle-confirm is performed.
Fix: perform a controlled bus clear: pulse SCL up to 9 times, checking whether SDA releases after each pulse, then issue a STOP and require an idle-confirm window (SCL=H, SDA=H stable); escalate to peripheral reset only after no-progress is proven.
Pass criteria: bus returns to idle confirmed within X ms and stuck recurrence < Y per Z hours.
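The sequence above can be sketched as a platform-independent routine; `scl_pulse`, `sda_is_high`, `issue_stop`, and `idle_confirm` are hypothetical hooks standing in for the target's GPIO/controller primitives. The 9-clock bound follows the common I²C bus-clear convention (a slave shifting out a byte needs at most nine clocks to reach an ACK boundary and release SDA):

```python
def bus_clear(scl_pulse, sda_is_high, issue_stop, idle_confirm,
              max_pulses: int = 9) -> str:
    """Controlled bus clear: pulse SCL until SDA releases (max 9 clocks),
    then issue STOP and require a confirmed idle window.
    Returns 'idle' on success or 'escalate' (next: peripheral reset)."""
    for _ in range(max_pulses):
        if sda_is_high():
            break                 # slave released SDA; stop clocking
        scl_pulse()               # one full SCL low -> high -> low cycle
    if not sda_is_high():
        return "escalate"         # SDA still held low after 9 clocks
    issue_stop()                  # put every listener into a known state
    return "idle" if idle_confirm() else "escalate"
```

The `idle_confirm` hook should verify SCL=H and SDA=H stable for a defined window before traffic resumes, matching the Pass criteria below.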
Arbitration loss happens only during certain traffic bursts — what transaction slicing helps first?
Likely cause: long contiguous transactions monopolize bus time, increasing collision probability at START/address boundaries when both masters attempt access during bursts.
Quick check: correlate arb-lost bursts with transaction length distribution (bytes/transaction) and burst schedule; identify the top offenders by op_type and addr buckets.
Fix: split large operations into smaller bounded chunks (slicing); insert jittered idle gaps between chunks; keep each chunk idempotent and abortable under timeout gates.
Pass criteria: mean transaction length decreases while completion rate increases; arb-lost clusters drop by > X% at the same workload.
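The slicing fix can be sketched as a helper that splits a long payload into bounded chunks and precomputes jittered idle gaps between them; `slice_transfer` and its parameters are hypothetical, and each chunk is assumed to be idempotent and abortable under the timeout gates:

```python
def slice_transfer(payload: bytes, chunk_size: int,
                   idle_gap_us: int, jitter_us: int, rng):
    """Split a long write into bounded, abortable chunks and compute a
    jittered idle gap to insert between consecutive chunks. The gaps are
    the arbitration windows that let a second master win access."""
    chunks = [payload[i:i + chunk_size]
              for i in range(0, len(payload), chunk_size)]
    gaps = [idle_gap_us + rng.randrange(jitter_us + 1)
            for _ in chunks[:-1]]          # one gap between each pair
    return chunks, gaps
```

Shorter chunks trade a little protocol overhead for a much smaller per-transaction occupancy, which is what drops the collision probability at START/address boundaries.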
Logic analyzer shows NAK, but scope looks fine — what denominator/window mismatch is common?
Likely cause: NAK rate is computed with a drifting denominator (mixed endpoints, mixed op types, or inconsistent time window), making “spikes” that are statistics artifacts.
Quick check: split metrics by addr + R/W + op_type; compare “per 1k transactions” vs “per minute”; verify window length and exclude warm-up/transients consistently.
Fix: standardize metric definitions and rollups; log both numerator and denominator with bucket keys; add a guardrail alert for window/denominator changes.
Pass criteria: NAK rate is stable within X/1k over Y minutes for each top addr bucket.
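The "log both numerator and denominator with bucket keys" rule can be sketched directly; `BucketedRate` is a hypothetical name, with the bucket key being the (addr, R/W, op_type) tuple from the Quick check:

```python
class BucketedRate:
    """NAK-rate metric that keeps an explicit numerator and denominator
    per bucket key, so a rollup can never silently change denominators."""
    def __init__(self):
        self.num = {}   # NAK count per (addr, rw, op_type) bucket
        self.den = {}   # total transaction count per bucket

    def record(self, key, nak: bool):
        self.den[key] = self.den.get(key, 0) + 1
        if nak:
            self.num[key] = self.num.get(key, 0) + 1

    def per_1k(self, key):
        """NAKs per 1k transactions; None when the bucket has no data."""
        d = self.den.get(key, 0)
        return None if d == 0 else 1000 * self.num.get(key, 0) / d
```

Exporting `num` and `den` separately (not just the ratio) is what lets a guardrail alert detect a window or denominator change after the fact.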
Recovery loops make the system worse — what backoff/jitter is missing?
Likely cause: recovery actions re-trigger immediately and synchronously (no jitter), so both masters re-enter recovery/traffic at the same time, creating repeated collisions and stalls.
Quick check: inspect recovery event timestamps; detect fixed periodicity; confirm backoff is applied after arb-lost or after recovery exit; verify retry spacing is not quantized to the same tick.
Fix: add jittered spacing to recovery exit (and to retries); require no-progress proof before destructive recovery; implement escalation with explicit exit criteria to prevent infinite loops.
Pass criteria: recovery loops < X/day and post-recovery arb-lost does not cluster (burstiness < Y).
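The escalation-with-exit-criteria idea can be sketched as a small ladder function. The ladder order matches the recovery sequence this page ships (retry → bus clear → peripheral re-init → controller reset); the no-progress gate before each destructive step is the rule that prevents loops from making things worse:

```python
# Escalation ladder, least to most destructive.
LADDER = ["retry", "bus_clear", "peripheral_reinit", "controller_reset"]

def next_recovery_step(step: str, succeeded: bool,
                       no_progress_proven: bool) -> str:
    """Decide the next recovery action. Escalation to a more destructive
    step requires no-progress proof; success exits the ladder (apply a
    jittered delay before resuming traffic so masters do not re-enter
    in lockstep)."""
    if succeeded:
        return "resume"
    if not no_progress_proven:
        return step                     # hold: destructive steps need proof
    i = LADDER.index(step)
    return LADDER[i + 1] if i + 1 < len(LADDER) else "give_up"
```

"give_up" is the explicit terminal exit criterion: the ladder is bounded, so there is no infinite recovery loop by construction.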
Long cable works at 100 kHz but fails at 400 kHz — what should be relaxed besides bitrate?
Likely cause: higher rate increases sensitivity to stretch tail and retry cost; a single “one-size” timeout and tight retry spacing may break under workload variance even if the bus still toggles.
Quick check: compare stretch_histogram and timeouts_by_phase between 100 kHz and 400 kHz; check whether failures are tail-driven (P99) or unbounded hangs; validate per-slave buckets.
Fix: use phase-based layered timeouts (not a single global number), adopt fail-fast on short gates, and add jittered retry spacing; increase evidence granularity (addr/op/phase buckets) for offender isolation.
Pass criteria: at 400 kHz, fatal-stuck events < X per Y hours and P99 transaction latency < Z ms.
Stretch time increases with temperature — what to log to separate clock drift vs processing delay?
Likely cause: the offender’s internal service time or scheduling jitter grows with temperature; apparent drift may be phase-specific (e.g., ACK wait vs data phase) rather than global.
Quick check: log stretch duration histogram by phase_id and by addr across temperature bins; correlate with op_type and workload; confirm whether only one phase or one slave shows tail expansion.
Fix: apply per-phase/per-slave timeout profiles; for temperature-sensitive offenders, add guarded retries with jitter and fail-fast on short gates; tighten offender-specific monitoring and alerts.
Pass criteria: temperature-binned P99 stretch increase < X% and timeout rate stays < Y/1k across Z temperature points.
Bus clear “works” but devices still misbehave — what reset ordering is usually wrong?
Likely cause: the bus returns to idle electrically, but one or more devices remain in a partial internal state (half-transaction) because reset/re-init sequencing skipped a sanity probe or exit criteria.
Quick check: after bus clear, run a minimal “probe transaction” per critical slave (read a known-safe register/status) and log outcomes by addr; verify recovery state machine includes a re-init step before resuming traffic.
Fix: enforce a post-recovery gate: idle-confirm → re-init controller → per-slave sanity probes → resume; use bounded retries and offender bucketing for any probe failures.
Pass criteria: post-clear probes succeed for N/N critical slaves within X attempts; recurrence < Y per Z days.
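The post-recovery gate above can be sketched as a single ordered routine; `idle_confirm`, `reinit_controller`, and `probe` are hypothetical platform hooks, where `probe(addr)` is assumed to read a known-safe status register and return True on a clean ACK:

```python
def post_recovery_gate(idle_confirm, reinit_controller, probe,
                       slaves, max_attempts: int = 3):
    """Ordered gate: idle-confirm -> re-init controller -> per-slave
    sanity probes -> resume. Probe failures are bucketed by address so
    offenders can be named instead of resuming blindly."""
    if not idle_confirm():
        return "escalate"                # bus never reached confirmed idle
    reinit_controller()                  # re-init before touching slaves
    failed = [addr for addr in slaves
              if not any(probe(addr) for _ in range(max_attempts))]
    if not failed:
        return "resume"
    return ("reprobe", failed)           # offender list for logging/alerts
```

Running the probes per critical slave (rather than one global "bus is idle" check) is what catches the half-transaction internal state the Likely cause describes.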
Multi-master + stretching causes rare deadlocks — what “bus ownership” sanity rule prevents it?
Likely cause: destructive recovery or claim actions run without proving no-progress/ownership, so one master interrupts another mid-transaction while stretching hides progress, creating a deadlock pattern.
Quick check: verify a “no-progress gate” exists (phase timeout + lack of line transitions/counters) before bus clear; check whether both masters can issue recovery concurrently; inspect deadlock incidents for overlapping recovery windows.
Fix: enforce an ownership sanity rule: only one master may execute destructive recovery at a time, and only after no-progress is proven; apply jittered backoff on recovery exit to avoid simultaneous re-claims.
Pass criteria: deadlock rate < X per Y device-hours and concurrent-recovery events = 0 in logs over Z days.
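The ownership sanity rule can be sketched as a guarded token: destructive recovery is allowed only with no-progress proof and only while holding the token. A `threading.Lock` stands in here for whatever inter-master coordination the real system has (a shared flag, a mailbox, or a hardware semaphore); that substitution is an assumption of this sketch:

```python
import threading

class RecoveryOwnership:
    """Only one master may execute destructive recovery at a time, and
    only after no-progress is proven. The lock models the shared
    coordination primitive between masters."""
    def __init__(self):
        self._lock = threading.Lock()

    def try_acquire(self, no_progress_proven: bool) -> bool:
        """Non-blocking claim of the recovery token. Refused outright
        without no-progress proof, so stretching that hides progress
        can never be interrupted by a destructive action."""
        if not no_progress_proven:
            return False
        return self._lock.acquire(blocking=False)

    def release(self):
        """Release after recovery exits (apply jittered backoff before
        re-claiming the bus to avoid simultaneous re-claims)."""
        self._lock.release()
```

Logging every `try_acquire` refusal gives the "concurrent-recovery events = 0" evidence the Pass criteria asks for.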