Clock stretching is only “real” when a slave actively holds SCL low; this page shows how to separate true stretching from electrical/bridge artifacts, then enforce bounded timeouts and reliable recovery.
The goal is production-grade behavior: signature-based diagnosis, transparent topology checks, and measurable counters/pass criteria (X placeholders) from bring-up through mass production.
H2-1 · Scope & Quick Triage: Is it really clock stretching?
Goal: classify “SCL looks low for too long” into the correct bucket in under a minute.
This prevents chasing clock stretching when the root cause is actually edge-rate,
intermediate clamping, or master pacing.
Scope guard (to avoid cross-page overlap)
This page focuses on clock-stretch behavior, timeout policy, and robust error handling.
Detailed pull-up sizing math and full RC budgeting belong to the
Open-Drain & Pull-Up Network subpage.
Deep multi-master arbitration and repeated-start edge cases belong to their dedicated subpages.
Here they are mentioned only as false-positive discriminators.
Required signature (measurable, repeatable)
Treat “true clock stretching” as a measured extension of SCL-low time:
tLOW_EXT = observed SCL-low − expected SCL-low
Actor: SCL is held low by an external open-drain holder
(typically the slave on that segment), not by the master’s own duty-cycle shaping.
Context: the extension tends to align with protocol boundaries
(ACK/byte boundary or specific command response), not purely random jitter.
Consistency: the phenomenon is reproducible under the same command
and can be confirmed at two probe points (near master vs near slave).
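As a minimal sketch (the function name and nanosecond units are illustrative), the signature metric is just the positive part of the difference between observed and expected SCL-low width:

```c
#include <stdint.h>

/* Hypothetical helper: the measurable quantity from this section.
 * expected_low_ns comes from the bus timing configuration; measured_low_ns
 * from a capture. Returns 0 when the low width is within the nominal value. */
static inline uint32_t tlow_ext_ns(uint32_t measured_low_ns,
                                   uint32_t expected_low_ns)
{
    return (measured_low_ns > expected_low_ns)
         ? (measured_low_ns - expected_low_ns)
         : 0u;
}
```

In practice the value is collected per transaction and kept as a distribution (p95/p99), not a single point.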
Three common false positives (fast discrimination)
A) Slow edge / RC distortion (looks like “long low”)
Symptom: SCL “low” width changes with probe location; rising edge is a long ramp.
Quick check: compare near-master vs near-slave SCL. If “low width” shrinks near master,
it is likely threshold-crossing delay rather than a real hold-low.
Conclusion bucket: electrical edge-rate problem (RC/EMI). Resolve in pull-up/network design pages.
B) Level shifter / isolator clamp (direction-control trap)
Symptom: the bus becomes “held” only when a specific segment/device is connected.
Quick check: temporarily bypass/short the intermediate device or move the slave to the same voltage domain.
If the issue disappears instantly, the intermediate path is not transparent.
C) Master pacing (controller inserts waits; not a slave)
Symptom: similar “long low” appears across many slaves; depends on master configuration/load.
Quick check: switch driver mode (polling ↔ IRQ ↔ DMA) or change I²C peripheral timing mode.
If the pattern tracks master settings, it is pacing rather than stretching.
3-step triage checklist (produce a conclusion label)
Step 1 — Identify who holds SCL low
Use two probe points (near master vs near slave, or across the intermediate device). The holder is the segment where SCL remains low
even when the upstream side attempts to release.
Output label:
HOLDER = master / slave / intermediate segment
Step 2 — Locate the phase of the extension
Check whether the extra low time aligns to an ACK/byte boundary or appears at random points in the waveform.
Output label:
PHASE = ACK/data boundary / random
Step 3 — Test consistency across slaves and segments
Repeat the same transaction across a second slave (or with the intermediate device bypassed).
If the symptom follows a specific slave/segment, it is not a generic master-pacing artifact.
Output label:
CONSISTENCY = same across slaves / only one slave / only after shifter
Fast conclusion mapping
Likely true stretching: HOLDER = slave and PHASE = ACK/data boundary.
Likely edge-rate artifact: PHASE = random and the “low width” changes by probe point.
Likely intermediate clamp: CONSISTENCY = only after shifter.
Likely master pacing: symptom reproduces across multiple slaves and tracks controller/driver configuration.
Diagram: a practical triage tree—start from the symptom, probe at strategic points, then classify into a cause bucket.
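The fast-conclusion mapping above can be sketched as a small lookup (the enum names and the tie-break order are illustrative assumptions, not normative):

```c
/* Triage labels from the 3-step checklist. */
typedef enum { HOLDER_MASTER, HOLDER_SLAVE, HOLDER_SEGMENT } holder_t;
typedef enum { PHASE_BOUNDARY, PHASE_RANDOM } phase_t;
typedef enum { CONS_ALL_SLAVES, CONS_ONE_SLAVE, CONS_AFTER_SHIFTER } cons_t;

typedef enum {
    BUCKET_TRUE_STRETCH,    /* slave low-hold at protocol boundary   */
    BUCKET_EDGE_RATE,       /* RC/threshold artifact                 */
    BUCKET_CLAMP,           /* level shifter / isolator not transparent */
    BUCKET_MASTER_PACING,   /* controller-inserted waits             */
    BUCKET_UNKNOWN
} bucket_t;

bucket_t triage(holder_t h, phase_t p, cons_t c)
{
    if (c == CONS_AFTER_SHIFTER)                  return BUCKET_CLAMP;
    if (h == HOLDER_SLAVE && p == PHASE_BOUNDARY) return BUCKET_TRUE_STRETCH;
    if (p == PHASE_RANDOM)                        return BUCKET_EDGE_RATE;
    if (c == CONS_ALL_SLAVES)                     return BUCKET_MASTER_PACING;
    return BUCKET_UNKNOWN;
}
```

The point of encoding the mapping is that every capture ends with a label, not an opinion.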
H2-2 · What Clock Stretching Is (and is not)
This section establishes a single engineering definition used throughout the page: measurable,
repeatable, and phase-aware.
Without this, timeout policy and recovery logic become inconsistent across teams and tooling.
Engineering definition (portable across tools and teams)
Signal: stretching occurs on SCL (not SDA). SCL is held low longer than the
master’s nominal clocking pattern.
Mechanism: the holder is an open-drain device on the bus segment (commonly the slave).
Observable metric:
tLOW_EXT = measured SCL-low extension (distribution, not a single point)
Protocol context: typical stretching is command-dependent and aligns with
byte/ACK boundaries more often than it appears as random timing noise.
Typical legitimate reasons (kept high-level to avoid overlap)
Internal latency
Conversion, compute, CDC synchronization, or NVM operations that must complete before a safe response.
Power-state transitions
Wake-up latency or safety gates (UVLO/thermal) that delay readiness.
Rate mismatch protection
A slave throttles the bus to avoid buffer overruns or to guarantee coherent register reads.
“Is NOT stretching” exclusion clauses (each with a fastest check)
Not #1 — Master intentionally slows SCL
Fast check: change master timing (rate/duty mode). If “long low” scales deterministically with settings across multiple slaves, it is master pacing.
Not #2 — Duty-cycle shaping by controller
Fast check: the low-time ratio remains fixed (patterned) regardless of command content; no command-dependent “latency signature” exists.
Not #3 — RC/threshold distortion (false low width)
Fast check: compare near-master vs near-slave and inspect the SCL rising edge. A long ramp indicates threshold-crossing delay rather than a stable hold-low plateau.
Diagram: normal SCL vs stretched SCL. Clock stretching is defined by a stable hold-low segment and a measurable extension (tLOW_EXT).
H2-3 · Timing Anatomy: Where stretching can appear inside a transaction
Clock stretching becomes actionable only after it is mapped to a transaction segment. This section defines a
phase-aware signature so debug and timeout policy can target the right bucket
(internal latency, intermediate clamp, or master pacing) without guessing.
Three anchors to label every event (produce tags, not opinions)
Anchor 1 · Alignment
Determine whether tLOW_EXT aligns to a byte/ACK boundary or appears random.
Output tag:
ALIGN = boundary / random
Anchor 2 · Direction
Check whether stretching happens only on read, only on write, or both.
This often separates “response preparation” from “commit operations”.
Output tag:
DIR = read_only / write_only / both
Anchor 3 · Trigger coupling
Identify whether stretching is tied to a specific register/command. Strong coupling implies a functional latency bucket
(conversion/NVM/security/wake) rather than generic bus integrity.
Output tag:
TRIGGER = command-coupled / generic
H2-4 · Internal latency buckets: turning a phase signature into a cause
After locating the phase signature (H2-3), the next step is to assign the event to an internal latency bucket.
Each bucket below provides: expected tLOW_EXT range (placeholder),
typical trigger, and the fastest check.
The intent is to shorten root-cause time, not to expand into unrelated protocol chapters.
Six practical latency buckets (each includes a fastest discriminator)
Conversion latency (ADC/sensor conversion in progress) · NVM commit (EEPROM/flash write busy) · Wake-up / power-state transition · CDC (clock-domain) synchronization · Buffer/FIFO rate protection · Security/authentication gate.
Use the buckets above to define a system-specific timeout contract:
T_STRETCH_MAX = X ms
T_TXN_MAX = X ms
N_RETRY = X
The purpose is to cap worst-case latency while keeping legitimate device behavior functional.
Diagram: a high-level internal pipeline. Waiting at the gate can extend SCL-low until the response buffer is ready.
H2-5 · Master-side policy: timeouts, retries, and bus release rules
Clock stretching must be treated as bounded waiting. The master should enforce layered time limits,
classify failures, and convert every timeout into a recoverable, observable event instead of a deadlock.
Policy guardrails (layered limits)
T_STRETCH_MAX (single low-hold bound)
Maximum allowed tLOW_EXT for a single stretching event. Prevents one hold from stalling the system.
Placeholder:
T_STRETCH_MAX = X ms
T_TRANSACTION_MAX (transaction wall clock)
Maximum wall-clock time for a full transaction (START→end condition). Captures multi-stretch accumulation and controller stalls.
Placeholder:
T_TRANSACTION_MAX = X ms
N_RETRY + backoff (avoid retry storms)
Retry count and delay policy to absorb transient busy windows while preventing feedback loops under fault conditions.
Placeholders:
N_RETRY = X
BACKOFF = X ms (± jitter)
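A sketch of the three guardrails as a configuration record, with an exponential-backoff helper (the struct layout is illustrative, and the LCG jitter source is a stand-in for a platform RNG):

```c
#include <stdint.h>

/* Layered limits from this section; the X placeholders become fields. */
typedef struct {
    uint32_t t_stretch_max_ms;   /* single low-hold bound        */
    uint32_t t_txn_max_ms;       /* transaction wall clock       */
    uint8_t  n_retry;            /* bounded retry count          */
    uint32_t backoff_ms;         /* base backoff between retries */
    uint32_t jitter_ms;          /* +/- jitter to decorrelate    */
} i2c_policy_t;

/* Backoff for attempt k (0-based): base * 2^k plus bounded pseudo-jitter.
 * The LCG below only keeps the sketch self-contained; a real build would
 * draw jitter from a hardware RNG or a per-node seed. */
uint32_t backoff_for_attempt(const i2c_policy_t *p, uint8_t attempt,
                             uint32_t *seed)
{
    *seed = *seed * 1103515245u + 12345u;
    uint32_t jitter = (p->jitter_ms != 0u) ? (*seed % (2u * p->jitter_ms)) : 0u;
    return (p->backoff_ms << attempt) + jitter;
}
```

Jitter matters when many nodes share a supply or a reset event: without it, retries re-collide in lockstep.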
Failure classification (controls retry vs escalation)
Class R · Retryable
Boundary-aligned, command-coupled busy windows (conversion/NVM busy/security gate). Distribution stays bounded.
Action: retry with backoff, keep counters and snapshots.
Class S · Suspect
Statistical drift (p99 grows), partial reproducibility, or mixed signatures. Often correlates with environment or intermediary behavior.
Action: limited retries then escalate recovery steps; consider feature downgrade.
Class H · Hard / Non-retryable
Bus stuck low (SCL/SDA held), controller error state, or persistent BUSY across transactions. Retry risks a storm.
Action: immediate strong recovery (bus clear / controller reset / isolate / power-cycle) and raise an error event.
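The three classes can be sketched as a predicate over timeout evidence (the field names are illustrative; the ordering encodes “hard symptoms win, clean busy windows are retryable, everything else is suspect”):

```c
#include <stdbool.h>

typedef enum { CLASS_RETRYABLE, CLASS_SUSPECT, CLASS_HARD } fail_class_t;

typedef struct {
    bool bus_stuck_low;     /* SCL or SDA still held after timeout */
    bool controller_error;  /* BERR / persistent BUSY              */
    bool boundary_aligned;  /* stretch aligned to ACK/byte edge    */
    bool command_coupled;   /* tied to a specific reg/cmd          */
    bool p99_drifting;      /* rolling p99 growing over time       */
} timeout_evidence_t;

fail_class_t classify_timeout(const timeout_evidence_t *e)
{
    if (e->bus_stuck_low || e->controller_error)
        return CLASS_HARD;          /* retry risks a storm: recover first */
    if (e->boundary_aligned && e->command_coupled && !e->p99_drifting)
        return CLASS_RETRYABLE;     /* bounded, legitimate busy window */
    return CLASS_SUSPECT;           /* limited retries, then escalate */
}
```

The classifier is only as good as the evidence fields, which is why the counters and snapshots below are mandatory.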
Timeout: keep a tight T_STRETCH_MAX = X ms; enforce T_TRANSACTION_MAX = X ms
Retry: small N_RETRY = X; backoff with jitter to avoid correlated failures
Fail action: run the recovery step ladder; if the recurrence rate exceeds X%, enter degrade mode and alert
Log fields: p95/p99 drift, temp/voltage snapshot, cross-device consistency flag
Recommendation: store both “event snapshots” and “rolling statistics” so policies can be tuned using field data.
Diagram: master policy as a bounded state machine with escalation steps and mandatory observability.
H2-6 · Robust firmware driver design: non-blocking state machine
A robust I²C driver must never block indefinitely. Waiting is expressed as state,
not as a busy loop. The driver must be watchdog-friendly, observable, and idempotent under retries so that higher layers can
take actionable decisions.
Design pillars (must-haves)
Watchdog-friendly
Every wait point must yield control and resume safely. Recovery paths cannot depend on a full system reset.
Observable
Every timeout emits counters and a snapshot (addr, phase signature, tLOW_EXT, recovery step, result code).
Idempotent under retry
Retries must not amplify device state. Write-like operations require conservative retry rules and clear error reporting.
Actionable error codes (driver → upper layer)
BUS_BUSY / BUS_STUCK
Indicates the bus is not idle or lines are held. Upper layers should not queue blindly; trigger recovery or degrade mode.
STRETCH_TIMEOUT / TXN_TIMEOUT
Bounded waiting exceeded. Upper layers may retry with backoff (Class R) or escalate recovery (Class S/H).
NACK
Address/data not acknowledged. Upper layers should differentiate “device absent” vs “busy window” using phase signatures and counters.
ARB_LOST / BERR
Controller-level events. Driver should reset or re-init the controller before returning an actionable status upstream.
addr, rw, phase signature (ALIGN/DIR/TRIGGER if known)
tLOW_EXT, txn_time, retry_count, recovery_step
result_code, line_state (SCL/SDA), timestamp
Contract goal: upper layers can decide “retry vs escalate” using structured evidence, not opaque failures.
Diagram: layered responsibilities. Timeouts/watchdog yields/logging belong in the driver layer for non-blocking operation.
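A minimal sketch of the “waiting is state, not a busy loop” rule: one cooperative tick that checks progress and a deadline, then returns (state names, the hardware-ready predicate, and the millisecond clock are assumed platform inputs):

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum { I2C_IDLE, I2C_WAIT_SCL, I2C_WAIT_DMA,
               I2C_WAIT_STOP, I2C_FAULT } i2c_state_t;
typedef enum { EV_NONE, EV_DONE, EV_STRETCH_TIMEOUT } i2c_event_t;

typedef struct {
    i2c_state_t state;
    uint32_t    deadline_ms;   /* absolute deadline for the current wait */
} i2c_fsm_t;

/* One cooperative tick: returns an event and never blocks, so the
 * scheduler/watchdog always regains control between calls. */
i2c_event_t i2c_tick(i2c_fsm_t *f, bool hw_ready, uint32_t now_ms)
{
    switch (f->state) {
    case I2C_WAIT_SCL:
        if (hw_ready)              { f->state = I2C_IDLE;  return EV_DONE; }
        if (now_ms >= f->deadline_ms) {
            f->state = I2C_FAULT;  return EV_STRETCH_TIMEOUT;
        }
        return EV_NONE;            /* yield: watchdog-friendly wait */
    default:
        return EV_NONE;
    }
}
```

A full driver adds the other wait states (WAIT_DMA, WAIT_STOP) with the same shape: progress check, deadline check, yield.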
H2-7 · Hardware compatibility traps: controllers/bridges that break stretching
Clock stretching only works when the entire link is transparent to SCL low-hold.
Any node that regenerates, queues, re-times, or latches direction can turn stretching into bus errors, false timeouts,
or “invisible” timing changes.
Stretch transparency model (what the link does to SCL)
Transparent
Preserves the same SCL behavior end-to-end. A slave low-hold remains a low-hold across the node.
Conditionally transparent
Transparent only in specific modes/speeds/direction states. Configuration drift can silently break stretching.
Non-transparent
Regenerates timing (queue/re-time/protocol conversion). SCL after the node is no longer the same “wire behavior”.
Field symptoms that indicate a node is breaking stretching
Direct connection works, but inserting a node causes bus error / immediate timeout.
Stretch signatures lose byte-boundary alignment after insertion (queue/re-time artifacts).
Different controller modes treat the same low-hold as BERR/timeout rather than wait.
Near-side and far-side SCL disagree on whether the line is being held low (non-transparent behavior).
Probe A/B check
Measure SCL both before and after the inserted node on the same transaction.
A real low-hold must propagate through a transparent link.
Hold-through test
Trigger a known “busy window” transaction (device-specific command) and verify that the SCL low-hold propagates
across the chain.
Boundary preservation check
Confirm whether low-hold aligns to ACK / byte boundaries. Loss of alignment after insertion
suggests queue/re-time or internal state-machine translation.
Mode-switch check (controller)
Repeat the same test under different controller modes (filters, hardware state machine options). A true-compatibility path
does not change semantics from “wait” to “bus error”.
Transparency-matrix entry (record one per inserted node):
Transparency: how the node treats a low-hold; it must reproduce a stable low platform on the far side
Break mechanism: undefined behavior under brown-out or internal fault state
Must-test: repeatability under the same command; cross-device comparison on the same bus
Scope boundary: this section focuses on transparency and validation, not device selection or pull-up design.
Diagram: topology and insertion points that may break clock-stretch transparency (bridges and direction-latched nodes are highest risk).
H2-8 · Stretching vs electrical effects: how to avoid false positives
A true stretching event is a stable low-hold driven by a device on SCL.
Electrical effects can imitate “long low time” through delayed threshold crossing, noise spikes, or reference shifts.
The goal is to separate wire-held-low from measurement/edge artifacts.
Minimal mechanism set (only what matters for stretching vs false positives)
Delayed threshold crossing (slow rise)
The line is rising, but the logic threshold is reached later. Tools may report a longer “low” even without a device holding SCL down.
Noise / crosstalk spikes
Short negative spikes or ringing can be decoded as extra low segments. These are usually narrow and not boundary-aligned.
Ground bounce / common-mode shift
Reference movement changes comparator thresholds. The “low duration” can vary with probe reference method and load switching.
Deliverable: false-positive checklist (3 waveform points + 2 control experiments)
Waveform point 1 · low platform stability
True stretching shows a clean, stable low platform. Electrical artifacts often show ringing/spikes during “low”.
Waveform point 2 · edge shape (rise-time)
Slow rise produces a long slope through the threshold region. A long slope can mimic “low extended” in decoded timing.
Waveform point 3 · boundary alignment
True stretching is commonly correlated with ACK/byte boundaries. Random alignment suggests noise or non-transparent links.
Control experiment A · near vs far (same transaction)
Probe SCL near the controller and near the slave. A real low-hold should appear as a low-hold at both locations.
A rise-time artifact usually changes with distance.
Control experiment B · bypass / simplify the chain
Temporarily remove the suspected node or shorten the link. True stretching keeps a consistent signature; electrical artifacts often shrink or disappear.
Scope boundary: this section explains false-positive discrimination, not pull-up sizing or full SI/EMC design.
Diagram: “true stretching” produces a stable low platform; “slow rise-time” shifts threshold timing and can mimic extended low time.
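The plateau-vs-ramp distinction can be sketched as a toy discriminator over sampled SCL voltage during the suspect “low” window (millivolt units and the plateau ceiling are illustrative assumptions):

```c
#include <stdbool.h>

/* A true hold-low stays near ground for the whole window; an RC ramp climbs
 * monotonically through the threshold region. Samples are in millivolts. */
bool looks_like_true_stretch(const int *mv_samples, int n, int plateau_max_mv)
{
    for (int i = 0; i < n; i++)
        if (mv_samples[i] > plateau_max_mv)
            return false;   /* line is rising: edge-rate artifact, not a hold */
    return true;
}
```

Real tooling would also check boundary alignment (waveform point 3), but the plateau test alone already rejects most slow-rise false positives.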
H2-9 · Debug & instrumentation: what to capture and how to trigger
Debugging clock stretching is less about “seeing SCL low” and more about capturing the first failing transaction
and aligning it with command context and timeout semantics. The workflow is:
Trigger → Tag → Correlate → Decide bucket.
Deliverable: Minimum debug bundle (6 fields that classify the failure)
ts
addr (R/W)
reg/cmd
len + dir
timeout bucket
attempt#
These six fields tie together transaction context (address/command/length),
policy outcome (timeout bucket), and repeatability (attempt#),
enabling fast mapping to “true stretch vs compatibility break vs electrical false positive”.
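The six fields can be carried as one record plus a one-line formatter for log correlation (field names follow the list above; the bucket codes and format string are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { TB_NONE, TB_STRETCH, TB_TXN, TB_STUCK } timeout_bucket_t;

typedef struct {
    uint64_t ts_us;          /* ts: capture timestamp (us)       */
    uint8_t  addr7;          /* addr (7-bit)                     */
    uint8_t  rw;             /* 0 = write, 1 = read              */
    uint8_t  reg_cmd;        /* reg/cmd that triggered the path  */
    uint8_t  len;            /* len (dir is carried by rw)       */
    timeout_bucket_t bucket; /* timeout bucket                   */
    uint16_t attempt;        /* attempt#                         */
} debug_bundle_t;

/* Render one log line; returns snprintf's count (>0 on success). */
int debug_bundle_format(const debug_bundle_t *b, char *out, size_t cap)
{
    return snprintf(out, cap,
                    "ts=%llu addr=0x%02X %c reg=0x%02X len=%u bucket=%d try=%u",
                    (unsigned long long)b->ts_us, b->addr7,
                    b->rw ? 'R' : 'W', b->reg_cmd, b->len,
                    (int)b->bucket, b->attempt);
}
```

Keeping the record fixed-format lets analyzer captures and firmware logs be joined on ts + addr + attempt#.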
Logic analyzer capture recipe (focus: the first failing transaction)
Trigger
SCL low > X (threshold placeholder for “unexpected hold”).
Pre-trigger buffer enabled to preserve the lead-in bytes.
Stop after first hit to avoid later retries contaminating context.
Filter
Filter by addr first (target device).
Then filter by reg/cmd (the command that triggers the slow path).
Capture both read and write variants for comparison.
Include a monotonically increasing txn_id to match logs with analyzer captures.
Record the maximum observed SCL low-hold duration if available (optional).
Record whether bus was idle before/after (optional exit criteria hook).
Signals to capture (stretching-focused minimum set)
Must-capture
SCL, SDA
Strongly correlated (when present)
VDD (slave), RESET, VDD (bridge)
Anti-false-positive hook
Capture SCL near the controller and near the slave on the same transaction to detect non-transparent nodes and rise-time artifacts.
Diagram: use a single pipeline to align analyzer triggers and firmware logs, then map the failure into a minimal set of actionable buckets.
H2-10 · Recovery playbook: from stretch timeout to hung-bus recovery
Recovery should be an ordered escalation ladder, not an unlimited wait or blind retry loop.
Each step defines a goal, a risk, an escalation condition, and a pass criterion to exit cleanly.
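One rung of the ladder is the classic “clock out up to 9 SCL pulses until SDA releases, then issue a STOP” bus clear. The sketch below keeps the logic runnable stand-alone by simulating the pin layer with a fake stuck slave that releases after a few pulses; a real port replaces the gpio_/delay_ stubs with pin drivers:

```c
#include <stdbool.h>
#include <stdint.h>

/* --- simulated pin layer (test double, not real hardware) --- */
static int  sim_pulses_needed = 4;   /* fake slave releases after 4 pulses */
static int  sim_pulses_seen   = 0;
static bool sim_scl           = true;

static void gpio_scl_write(bool level)
{
    if (!sim_scl && level) sim_pulses_seen++;   /* count SCL rising edges */
    sim_scl = level;
}
static void gpio_sda_write(bool level) { (void)level; }
static bool gpio_sda_read(void) { return sim_pulses_seen >= sim_pulses_needed; }
static void delay_us(uint32_t us) { (void)us; }

/* Bus clear: returns pulses used, or -1 if SDA never released. */
int i2c_bus_clear(void)
{
    int used = 0;
    for (; used < 9 && !gpio_sda_read(); used++) {
        gpio_scl_write(false); delay_us(5);
        gpio_scl_write(true);  delay_us(5);
    }
    /* Manual STOP: SDA low -> high while SCL is high. */
    gpio_sda_write(false); delay_us(5);
    gpio_scl_write(true);  delay_us(5);
    gpio_sda_write(true);  delay_us(5);
    return gpio_sda_read() ? used : -1;
}
```

A -1 return is the escalation condition for the next rung (controller reset / isolate / power-cycle).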
H2-11 · Production checklist: repeated START + page write
This section turns repeated-start + page-write requirements into a practical, tick-box checklist across the full lifecycle.
Each item should have evidence (capture/log) and a bounded failure action (timeout + cleanup).
Use at least: one full-page write, one multi-page (chunked) write, and one random-read verification pass.
Required counters (minimum set)
poll_count_avg / poll_count_max (distribution matters more than a single average)
poll_wait_ms_avg / poll_wait_ms_max
nack_count_total (optionally classify: busy-NACK vs unexpected-NACK)
timeout_count
readback_error_count (compare/CRC mismatch)
Pass criteria (placeholders; tune per product)
poll_max < X ms (derived from tWR(max) + margin)
timeout_rate < X ppm (per lot / per shift)
readback_error = 0 (or < X ppm if rework policy allows)
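The pass criteria can be sketched as a single gate over the counters (the struct fields mirror the list above; the ppm math and thresholds are placeholders to tune per product):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t poll_max_ms;      /* worst observed polling wait         */
    uint32_t timeout_count;    /* timeouts over the window            */
    uint32_t txn_count;        /* transactions over the same window   */
    uint32_t readback_errors;  /* compare/CRC mismatches              */
} prod_counters_t;

/* True only when every criterion passes: poll bound, zero readback
 * errors, and timeout rate under the ppm budget. */
bool prod_pass(const prod_counters_t *c,
               uint32_t poll_budget_ms, uint32_t timeout_budget_ppm)
{
    if (c->poll_max_ms >= poll_budget_ms) return false;
    if (c->readback_errors != 0u)         return false;
    uint64_t ppm = (c->txn_count != 0u)
                 ? ((uint64_t)c->timeout_count * 1000000u) / c->txn_count
                 : 0u;
    return ppm <= timeout_budget_ppm;
}
```

Evaluating per lot / per shift keeps the trend visible even while every individual lot passes.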
Production flow: STOP triggers tWR, polling confirms completion, readback validates data, and counters capture reliability trends.
Applications & IC Selection Notes (Strictly relevant to Sr + Page Write)
Only use-cases that depend on combined transactions (write pointer → Sr → read) and EEPROM page-write behavior are listed here.
Selection notes focus on the minimum required device/controller capabilities for stable, bounded operation.
Applications (Sr + page write strongly relevant)
1) EEPROM parameter storage (config / serial / calibration)
Templates:
Random read: S → AddrW → WordAddr → Sr → AddrR → Data… → NACK(last) → P
Page write: S → AddrW → WordAddr → Payload… → P → Poll(NACK…ACK)
Material examples: Microchip 24LC256 / 24AA256 · ST M24C64 / M24M02 · onsemi CAT24C256 · ROHM BR24G256 (verify suffix/package).
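The Poll(NACK…ACK) step in the page-write template can be sketched as bounded ACK polling after STOP; the probe callback and the simulated device below are assumptions that keep the sketch stand-alone:

```c
#include <stdbool.h>
#include <stdint.h>

/* probe starts an addressing cycle and reports whether the device ACKed. */
typedef bool (*probe_fn)(uint8_t addr7);

/* Returns the number of polls used (>= 1) on success, or -1 on timeout;
 * the -1 path feeds timeout_count and the escalation policy. */
int eeprom_ack_poll(probe_fn probe, uint8_t addr7, int max_polls)
{
    for (int n = 1; n <= max_polls; n++) {
        if (probe(addr7))
            return n;       /* device ACKed: tWR window is over */
        /* a real build inserts a short delay (e.g. 100 us) between polls */
    }
    return -1;
}

/* Simulated EEPROM for stand-alone testing: NACKs twice, then ACKs. */
static int sim_calls;
static bool sim_probe(uint8_t addr7) { (void)addr7; return ++sim_calls >= 3; }
```

Bounding max_polls from tWR(max) + margin keeps the poll_max counter meaningful in production.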
2) Sensor register read (pointer write → Sr → data read)
Template:
S → AddrW → Reg → Sr → AddrR → Data… → NACK(last) → P
(Sr keeps the combined transaction semantics and avoids relying on vendor-specific “pointer retention after STOP”.)
Material examples: Bosch BME280 · TI TMP117 · Microchip MCP9808 · TI INA219 (verify address map + auto-increment behavior).
Controller/driver requirements (minimum set)
Repeated START support: must generate Sr reliably (capture must match the template).
Last-byte NACK control: hardware/driver must guarantee NACK(last) at the end of burst reads.
Timeout + stuck detection: bounded waits and forced cleanup (no infinite busy loops).
Error visibility: interrupts/counters for NACK/timeout/bus errors are strongly preferred for production telemetry.
Concrete MCU examples (controller capability must be confirmed by TRM + capture):
ST STM32G031K8 · ST STM32L052C8 · NXP LPC55S16 · Microchip ATSAMD21G18A · TI MSP430FR2355.
Prefer platforms that can log NACK bursts, timeout counts, and controller reset counts.
Treat capture evidence as truth: verify actual Sr/STOP behavior on the wire, not only via driver API assumptions.
Mini-flow: select EEPROM based on page-write behavior (page size, tWR(max), polling), and select controller based on Sr/NACK(last)/timeout behavior verified by capture.
FAQ · Long-tail troubleshooting
These FAQs only close long-tail troubleshooting gaps. Each answer is intentionally short and actionable, with measurable pass criteria (X placeholders).
SCL low holds for ~X ms, but only on one register read—why?
Signature-locked stretching usually maps to a specific internal “slow-path” bucket.
Likely cause
That specific register read triggers a slow internal path (e.g., sensor conversion, FIFO refill, NVM access, CDC sync), so the slave holds SCL low until data is ready.
Quick check
Repeat the same read X times and confirm the stretch aligns to a consistent byte boundary (address/ACK vs data/ACK). Then run a “nearby” register read: if only one register stretches, it is command-triggered.
Fix
Redesign access as “kick + later read” (two short transactions), or use INT/DRDY when available. If stretching must remain, bound it with T_STRETCH_MAX and log per-command stretch histogram.
Pass criteria
For that register: stretch_low_hold ≤ X ms (p99), timeout_rate ≤ X per 10k reads, and retries do not increase stretch_count beyond X%.
Works on bench, fails with a bridge/extender—first transparency check?
Many bridges/extenders are not timing-transparent; prove it with probe A/B.
Likely cause
The inserted bridge/extender buffers or re-times SCL/SDA so slave-held SCL low is not propagated (or is reshaped), causing the master to see bus error/timeout.
Quick check
Use a known “stretching command” and probe SCL on both sides of the node (A: master-side, B: slave-side). If slave-side shows a flat low-hold but master-side does not, transparency is broken.
Fix
Replace or reconfigure the node to a mode documented as clock-stretch transparent, or redesign to avoid stretching (INT/DRDY or kick+later-read). Keep T_TXN_MAX bounded regardless.
Pass criteria
Under the same command: A/B probes match (stretch visible on both sides), timeout_rate ≤ X per 10k, and recovery_count ≤ X per 1k transactions.
Timeout triggers, but retry makes it worse—what state-machine bug is common?
The most common failure is “retry without cleanup,” which compounds bus/device state.
Likely cause
Retry is not idempotent: controller state (BUSY/flags/FIFO/DMA) is not reset, STOP is not issued, or the “slow command” is re-fired while the slave is still busy.
Quick check
Log a “retry trace” of attempt# with bus-state snapshots (BUSY flag, error flags, FIFO level, STOP sent). If attempt#2 fails faster or with a new error bucket, cleanup is missing.
Fix
Make retry idempotent: abort DMA, clear controller flags, flush FIFOs, issue STOP (or bus reset sequence), enforce backoff, and only re-issue the command after a verified bus-idle + optional health probe.
Pass criteria
After any timeout: cleanup completes in ≤ X ms, bus-idle is true for ≥ X µs, and retry success_rate ≥ X% without increasing recovery_count.
Looks like stretching, but SCL waveform is sloped—what’s the fastest discrimination?
True stretching shows a flat low-hold; RC/threshold artifacts show a ramp.
Likely cause
Slow rise-time (RC) or threshold-crossing delay is extending “apparent low time,” not a slave holding SCL low.
Quick check
Compare SCL at the controller pin vs the far node on the same transaction. True stretching has a stable low plateau; RC artifact shows a ramp with delayed threshold crossing and often varies with probe point.
Fix
Improve edge margins (reduce effective bus C, adjust pull-up, avoid over-slow edge shaping). Then re-run the same command signature to verify timing stability.
Pass criteria
Measured tR ≤ X (per mode budget) and “low-hold” duration variance ≤ X% across X repeats; timeout_rate ≤ X per 10k.
After timeout, bus stays busy forever—what’s the correct recovery ladder?
Recovery must escalate from soft abort to forced bus release with measurable exit criteria.
Likely cause
The transaction aborted without releasing the bus (missing STOP), or a slave remains in a partial state holding SDA/SCL low (hung-bus).
Quick check
Immediately after timeout, sample SCL and SDA levels. If either is low for ≥ X µs, the bus is not idle and a forced release ladder is required.
Fix
Ladder: (1) abort + STOP + backoff; (2) switch to GPIO and clock out 9 pulses, then STOP; (3) isolate segment / reset node / power-cycle the affected branch (if available). Always finish with a health probe read/write.
Pass criteria
Bus-idle holds (SCL=H and SDA=H) for ≥ X µs, recovery completes within ≤ X attempts, and the health probe succeeds with error_rate ≤ X.
Two slaves behave differently—how to classify latency bucket quickly?
Use a signature table: position + direction + command correlation.
Likely cause
Different slaves map to different internal latency buckets (conversion vs NVM vs wake-up vs protection), even on the same bus and same master policy.
Quick check
For each slave, capture: (a) where stretching occurs (addr/ACK vs data/ACK), (b) read vs write correlation, (c) command correlation (specific reg/cmd). Build a quick histogram of low-hold durations.
Fix
Apply bucket-appropriate mitigation: replace “read-until-ready” with INT/DRDY, split slow ops (kick+read), or increase per-command budget while keeping global T_TXN_MAX bounded.
Pass criteria
Per slave: p99 stretch_low_hold ≤ X ms (or documented per-command X), and timeout_rate ≤ X per 10k with stable histogram shape across X runs.
Stretch count rises at cold/hot—what to log first?
Temperature sensitivity is often indirect; log the minimum fields to avoid guesswork.
Likely cause
Cold/hot changes internal latency (conversion time, NVM write timing, wake latency) and can also reduce edge margin (rise-time, thresholds), increasing apparent stretching and timeouts.
Quick check
For each timeout/stress window, log: temperature, bus speed, command signature (addr/reg/dir/len), stretch_low_hold (duration), timeout bucket, and recovery result. Compare p95/p99 across cold vs hot.
Fix
Separate “real latency” vs “edge artifact”: confirm waveform plateau vs slope. If real latency increases, adjust per-command budget while keeping T_TXN_MAX bounded; if edge margin is the limiter, fix rise-time and noise susceptibility.
Pass criteria
Across temperature corners: timeout_rate ≤ X per 10k, p99 stretch_low_hold ≤ X ms (or bucket-specific X), and recovery_success ≥ X% within ≤ X attempts.
DMA I²C occasionally locks—where to put watchdog hooks?
DMA needs progress-based watchdoging, not just a global timer.
Likely cause
DMA completion or I²C controller IRQ is missed, leaving the driver in a waiting state; without a non-blocking state machine, a stretch or bus glitch can freeze the pipeline.
Quick check
Add “progress markers” to log: state transitions, DMA submit, DMA done, STOP sent, and bus-idle reached. If progress stops for ≥ X ms without a state change, the hook point is missing.
Fix
Implement a non-blocking driver: watchdog each wait-state (WAIT_SCL / WAIT_DMA / WAIT_STOP), abort-and-cleanup on timeout, and always end with a verified bus-idle + optional health probe.
Pass criteria
No stuck state for ≥ X hours of stress: watchdog recovers within ≤ X ms, and recovery_count ≤ X per 10k transactions with health probe success ≥ X%.
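The “progress markers, not just a global timer” idea above can be sketched as a watchdog that trips only when the marker stops advancing for longer than the budget (struct and function names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t last_marker;     /* last observed progress marker      */
    uint32_t last_change_ms;  /* when it last advanced              */
} progress_wd_t;

/* Feed the current marker (state transitions, DMA submit/done, STOP sent)
 * each tick; returns true only after a genuine stall beyond budget_ms. */
bool progress_wd_tripped(progress_wd_t *wd, uint32_t marker,
                         uint32_t now_ms, uint32_t budget_ms)
{
    if (marker != wd->last_marker) {      /* progress observed: re-arm */
        wd->last_marker    = marker;
        wd->last_change_ms = now_ms;
        return false;
    }
    return (now_ms - wd->last_change_ms) > budget_ms;
}
```

Because it re-arms on every marker change, a legitimately long (but progressing) DMA transfer never trips it, while a missed IRQ does.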
Oscilloscope shows SCL released but analyzer still decodes “clock held”—why?
Tool thresholds, sampling, or probe point mismatch can create a decode illusion.
Likely cause
The analyzer’s threshold/sampler sees SCL below its logic-high threshold (slow rise-time/noise), or it is attached at a different node than the scope (far node still low).
Quick check
Put scope and analyzer on the same node; set analyzer threshold to match measured VIH; increase analyzer sample rate; compare the time SCL crosses the threshold vs the scope waveform.
Fix
Align measurement conditions (node, threshold, sampling). If the root is edge margin, fix rise-time/noise; if the root is topology, re-probe near/far to find the node that remains low.
Pass criteria
Analyzer decode matches scope at the same node; measured tR/tF meet budgets (≤ X), and “clock held” false positives drop to ≤ X per 10k.
If you must cap stretching, what’s a safe timeout policy?
Use layered limits: per-hold, per-transaction, and bounded retry with cleanup.
Likely cause
An unbounded wait treats stretching as “infinite busy” and causes deadlocks; an overly tight timeout causes false failures on legitimate slow-path commands.
Quick check
Measure p95/p99 stretch_low_hold for the worst-latency command at temperature corners, then set T_STRETCH_MAX with margin; separately bound total transaction time with T_TXN_MAX.
Fix
Policy template: T_STRETCH_MAX (per-hold), T_TXN_MAX (end-to-end), N_RETRY with backoff; on timeout, always cleanup (STOP/abort/flush) then optional health probe before retry.
Pass criteria
Under worst-case command: p99 stretch_low_hold ≤ T_STRETCH_MAX, p99 txn_time ≤ T_TXN_MAX, timeout_rate ≤ X per 10k, and no deadlock for ≥ X hours stress.
How to prove production readiness for stretching behavior?
Production readiness is counters + thresholds + reproducible worst-case loop.
Likely cause
Without standardized counters and a worst-case stimulus, stretching regressions appear as “random” field failures and cannot be screened in production.
Quick check
Run a production BIST loop that triggers worst-latency commands for X iterations and record stretch_count / timeout_count / recovery_count plus per-bucket classification.
Fix
Standardize: policy parameters (T_STRETCH_MAX/T_TXN_MAX/N_RETRY), a golden waveform reference for bring-up, and production thresholds with automated logging/reporting.
Pass criteria
Over X-loop BIST: timeout_rate ≤ X, recovery_count ≤ X, and post-recovery health probe success ≥ X%; transparency matrix is signed for all inserted nodes.
What’s the minimum debug bundle to send to a vendor?
A small, consistent bundle shortens vendor triage cycles and avoids back-and-forth.
Likely cause
Vendor support stalls when key context is missing (signature, bucket, topology, and recovery results), even if a waveform exists.
Quick check
Confirm the bundle contains exactly the minimum fields below, plus one capture of the “first failing transaction” (not only a late-stage failure).
Fix
Send: (1) bus speed/mode, (2) topology nodes list (buffer/mux/isolator/bridge), (3) command signature (addr+reg/cmd+dir+len),
(4) stretch_low_hold duration + where it occurs (phase/byte boundary), (5) timeout bucket + attempt#, (6) recovery action + outcome; attach one scope/LA capture with node location noted.
Pass criteria
Vendor can reproduce/triage within X iterations: the first-failing signature is repeatable (≥ X%), and bucket classification remains consistent across X runs.