
BIST, POST & Fault Injection for Server Platforms


BIST/POST and fault injection turn “it failed in the field” into a repeatable, provable workflow: target the right fault models, inject controlled errors, correlate the resulting observability data, and preserve black-box evidence. With clear pass/fail criteria (including gray-zone rules), the system can fail fast, degrade safely, and continuously close coverage gaps to reduce escapes.

H2-1 · What this page covers: BIST/POST vs Fault Injection

This chapter sets the boundary and the operational goal: turn “it booted once” into “it is provably reliable,” with repeatable self-tests, controlled fault injection, and forensic-quality evidence.

Three terms, one verification chain

  • POST (Power-On Self-Test): staged boot checkpoints that validate the minimum safe bring-up path. Output: checkpoint status, reset cause, last-known-good stage.
  • BIST (Built-In Self-Test): repeatable subsystem tests that generate objective evidence. Output: signatures, BER/margins, latched fault bits, and counters.
  • Fault Injection: deliberate error creation to verify detection → isolation → recovery → logging. Output: the monitoring and policy path is proven, not assumed.

Why data center platforms need this (practical drivers)

  • Scale turns rare into frequent: intermittent issues become fleet incidents.
  • Field failures are hard to reproduce: controlled injection provides repeatability and bounded risk.
  • MTTR pressure: evidence must survive resets and power cycles to avoid “guess-and-replace.”
  • Forensics matter: pre/post-trigger logs convert one-time events into actionable sequences.

Scope boundary (what belongs here)

  • In scope: test mechanisms, injection methods, observability (counters/trace/timestamps), black-box logging, pass/fail & gray-zone policy.
  • Out of scope: PSU topologies, cooling control, and protocol-stack deep dives (only test-relevant observables are referenced).

Expected outcome after finishing this page: a platform can be validated with staged POST checkpoints, targeted BIST suites, safe injection guardrails, and logs that preserve causality across resets.

Figure F1 — Verification flow: POST → BIST → Injection → Evidence → Verdict
A single-line flow with boxes and arrows: power-on (rails/clock) → POST checkpoints (fast-fail) → BIST suite (LBIST / MBIST / PRBS; signatures, BER, counters) → fault injection (error/glitch, guardrails, marker) → observability (counters, trace, timestamps) → black-box logs (ring buffer, pre/post trigger) → verdict (PASS / DERATE / LOCKOUT). Protocol, PSU, and cooling details are intentionally out of scope.

H2-2 · Test taxonomy: logic/memory/SerDes/IO BIST

This chapter turns “which self-test should run” into a selection method based on target domain, evidence type, trigger mode, and intrusiveness. The focus is on what each test proves and what it cannot prove.

Four common BIST families (what they prove)

  • LBIST (logic): detects certain structural logic failures; best used with stable signatures and clear version binding.
  • MBIST (memory): targets array/address/data-path faults; pairs naturally with ECC error counters for in-field trend checks.
  • SerDes/PHY BIST: PRBS/BERT and training margin checks for link integrity and stability under constraints.
  • I/O loopback: validates end-to-end path continuity at a chosen loop point; passing does not guarantee real traffic robustness.

Evidence outputs (how results should look)

  • Signature: compressed “golden comparison” (fast screening, strong for regression when version/config are pinned).
  • BER / margin: statistical quality of a link; requires adequate observation time to avoid false confidence.
  • Counters: trendable health indicators (must avoid pitfalls: clear policy, overflow handling, and timestamp correlation).
  • Latched faults / event bits: proof that a condition happened, useful for post-mortem and black-box correlation.

Trigger modes (boot vs periodic vs on-demand)

  • Boot (POST-time): tight time budget; prioritize high information density and deterministic verdicts.
  • Periodic (in-field): low intrusion; split tests into small chunks and require safe exit conditions.
  • On-demand (after anomaly): targeted reproduction; pair with black-box markers and correlation timestamps.

Practical rule: the deeper the test, the stronger the guardrails and logging requirements.

Figure F2 — BIST selection matrix: domain × evidence × trigger
A grid mapping LBIST, MBIST, SerDes PRBS/BERT, and I/O loopback across boot, periodic, and on-demand triggers, with evidence tags (signature / BER / counters / logs) in each cell. Interpretation: choose the loop point and evidence tags to match the failure model and intrusion budget.

H2-3 · POST staging: checkpoints, gating, and fast-fail philosophy

POST becomes operationally useful only when it is staged into checkpoints that (1) enforce stable prerequisites, (2) emit minimal forensic evidence, and (3) choose a failure path that prevents “booting sick” into production.

Stage map (what each stage must prove)

  • Minimum power: rails reach window and remain stable for a hold time.
  • Clock stable: reference present, PLLs lock, and lock stays stable long enough to trust downstream timing.
  • Reset release: reset is released only after power+clock are stable (avoid races).
  • Basic interconnect: minimum “link-ready” signals are consistent (no protocol deep dive).
  • Critical peripherals: key fault/status bits are readable and latched faults are captured.
  • Extended tests: short, high-information tests are optional and gated by time/impact budget.
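
The stage map above can be sketched as an ordered checkpoint sequence in which each stage gates the next. This is an illustrative Python sketch, not any vendor's firmware API: `run_post`, the stage names, and the `probe` callback are all hypothetical.

```python
# Illustrative POST stage sequencer: a stage must report a stable-OK state
# (not merely an asserted signal) before the next stage is allowed to start.
STAGES = ["power", "clock", "reset_release", "interconnect", "peripherals"]

def run_post(probe):
    """probe(stage) -> True when the stage reached a stable-OK state.
    Returns (passed_stages, failed_stage_or_None)."""
    passed = []
    for stage in STAGES:
        if not probe(stage):
            return passed, stage   # fast-fail: route to the failure policy
        passed.append(stage)       # record the last-known-good checkpoint
    return passed, None
```

For example, `run_post(lambda s: s != "clock")` stops at the clock stage and reports `["power"]` as the last-known-good prefix.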

Gating rules (turn random boot issues into deterministic outcomes)

  • Window gating: declare “OK” only inside voltage/clock windows and after a stability window.
  • Sequencing gating: each checkpoint starts only after the previous checkpoint is stable, not merely asserted.
  • State gating: if critical latched faults or abnormal counters appear, route to a controlled failure path.

Key principle: gating protects evidence. Without stable prerequisites, later failures become ambiguous.

Fast-fail vs safe-mode vs retry (policy, not preference)

  • Fail-fast (HALT): choose when prerequisites are not met or evidence cannot be preserved.
  • Safe mode (DERATE): allow controlled startup only when monitoring+logging remain trustworthy.
  • Retry with backoff: allow limited retries when a transient is plausible; every retry must log cause and count.

Verdict is not binary: the correct choice minimizes fleet risk and shortens MTTR.
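
The retry path can be sketched as a bounded loop that logs cause and count on every attempt. Names and the backoff factor are illustrative assumptions; the sketch records the intended delay rather than actually sleeping, so it stays testable.

```python
# Illustrative retry-with-backoff policy: bounded attempts, each attempt
# logged with its cause and count, exponential backoff between attempts.
def retry_with_backoff(attempt_fn, max_retries=3, base_delay=1.0):
    log = []
    delay = base_delay
    for count in range(1, max_retries + 1):
        ok, cause = attempt_fn()
        log.append({"try": count, "cause": cause, "delay": delay})
        if ok:
            return "PASS", log
        delay *= 2   # a real implementation would sleep `delay` here
    return "ROUTE_TO_FAILURE_PATH", log
```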

Common pitfalls (and how to neutralize them)

  • One pass ≠ reliable: boundary issues depend on temperature, ramp, and timing distributions.
  • Race windows: lock/reset/enable timing differs across devices; record durations, not just “OK.”
  • Cold vs warm restart: residual state changes initial conditions; tag the boot type and carry it into logs.

Minimum evidence set per checkpoint (keep it small but decisive)

  • Checkpoint ID + timestamp (monotonic when possible).
  • Reset cause (current + previous) and a “last-known-good checkpoint.”
  • Rail/clock summary (window pass + stability window duration).
  • Counter snapshot (with a consistent clear policy).
  • Log marker that survives retries/backoff.
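
A minimal per-checkpoint record might look like the following sketch; the field names are hypothetical, chosen only to mirror the evidence list above.

```python
# Illustrative minimal evidence record per POST checkpoint (field names
# are assumptions, not a standard layout).
from dataclasses import dataclass, field

@dataclass
class CheckpointEvidence:
    checkpoint_id: str
    timestamp: int          # monotonic ticks when available
    reset_cause: str        # current reset cause
    prev_reset_cause: str   # previous reset cause
    last_known_good: str    # last checkpoint that completed stably
    rails_ok: bool          # rail/clock window pass
    stability_ms: int       # measured stability-window duration
    counters: dict = field(default_factory=dict)  # snapshot, fixed clear policy
    log_marker: int = 0     # survives retries/backoff
```
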
Figure F3 — POST timeline with gating and failure routing
A horizontal timeline of staged checkpoints — rail OK (window + hold) → clock lock (stable lock) → reset release (after stable) → link-ready → fault bits — each with a gating label and an evidence label. Any stage failure routes to a controlled path (HALT, SAFE MODE, or RETRY+BACKOFF), and every path must preserve evidence.

H2-4 · Loopback engineering: where to loop and what it proves

Loopback is a fault-domain tool. The loop point defines what portion of the path is being validated and which failure causes can be ruled in or ruled out. Correct loop selection prevents false confidence.

Loopback levels (fault-domain boundaries)

  • Digital loop (MAC/logic): validates digital datapath continuity; does not validate analog link behavior.
  • SerDes loop (PCS/PMA): validates training/CTLE/DFE behavior and lane logic within the transceiver boundary.
  • Physical loop (connector/cable/fixture): validates the real channel segment and exposes connector/fixture issues.

What a pass means (and what it does not mean)

  • Pass at an inner loop: narrows faults to segments outside the loop point (channel/connector/fixture).
  • Fail at an inner loop: points to transceiver-side issues (training stability, lane logic, or margin).
  • All loops pass but instability remains: often indicates boundary conditions (temperature/jitter), too-short BER windows, or incomplete correlation.

Interpretation must be aligned with the chosen evidence metrics (LOCK / BER / MARGIN / CNT).

Evidence metrics to record (minimal but decisive)

  • LOCK: stable lock duration and re-lock events (not only “locked once”).
  • BER: adequate observation window for confidence; short windows can hide intermittent issues.
  • MARGIN: use as a trend indicator across temperature/voltage conditions (not a single absolute number).
  • CNT: counter snapshots must carry timestamps for causality and comparison.
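
The “adequate observation window” point can be made concrete with the standard rule of three: observing N bits error-free bounds the BER below roughly 3/N at 95% confidence. The sketch below applies that rule; the function names are illustrative.

```python
import math

# Rule of three: with zero observed errors over n bits, the 95% upper
# confidence bound on BER is about 3/n. To claim BER < target at 95%
# confidence, observe at least -ln(0.05)/target ~ 3/target bits error-free.
def bits_needed(target_ber, confidence=0.95):
    return math.ceil(-math.log(1.0 - confidence) / target_ber)

def observation_seconds(target_ber, line_rate_bps, confidence=0.95):
    return bits_needed(target_ber, confidence) / line_rate_bps
```

At 25 Gb/s, claiming BER below 1e-12 at 95% confidence needs roughly two minutes of error-free observation, which is why very short BER windows can hide intermittent issues.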

Practical uses (kept within page scope)

  • Baseline link health during POST and post-anomaly triage.
  • Channel consistency across lanes; validate lane swap and polarity mapping.
  • Board-level screening for routing/connector sensitivity before deeper system-level tests.
Figure F4 — Loopback hierarchy: digital vs SerDes vs physical
A transmit/receive path with three loopback arcs — digital loop (MAC/datapath, CNT), SerDes loop (PCS/PMA, LOCK), and physical loop via connector/cable/fixture (MARGIN, BER) — annotated with where each metric is observed. Quick interpretation: inner-loop pass points to faults outside the loop point; inner-loop fail points inside the transceiver boundary; all loops passing with remaining instability suggests boundary conditions or observation windows that are too short.

H2-5 · Signature analysis: CRC/MISR, golden storage, and drift control

Signatures compress long responses into a fast verdict. Reliability comes from controlling the full chain: stimulus + initial conditions + observation window + golden governance + compare policy.

Why signatures are used (compression with intent)

  • Bandwidth and storage: full response traces are too large for boot-time and fleet-scale logging.
  • Speed: a short signature enables fast screening and deterministic routing (pass / isolate / retest).
  • Portable evidence: signatures can be stored and compared across runs while keeping logs small.

Compression increases ambiguity; governance and policy restore interpretability.

Signature sources (three common compressors)

  • CRC: lightweight detection for data streams and short sequences.
  • MISR / LFSR: multi-input compaction of long response streams into a fixed-size signature.
  • Trace summary: selective sampling and aggregation (e.g., event counts and short windows) when full trace is not feasible.
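
A toy multi-input signature register (MISR) illustrates how a long response stream is compacted into a fixed-size value. The 16-bit width, seed, and polynomial tap mask below are arbitrary examples for demonstration, not a normative choice.

```python
# Illustrative 16-bit MISR: fold each input word into the LFSR state,
# then shift with a feedback polynomial. Parameters are arbitrary.
POLY = 0x8005   # example tap mask (an assumption, not a standard for MISR)

def misr_step(state, word, width=16, poly=POLY):
    state ^= word & ((1 << width) - 1)        # multi-input: fold word in
    msb = state >> (width - 1)
    state = (state << 1) & ((1 << width) - 1)  # shift
    if msb:
        state ^= poly                          # feedback
    return state

def misr_signature(words, seed=0xFFFF):
    state = seed
    for w in words:
        state = misr_step(state, w)
    return state
```

Note how the sketch makes the consistency triplet visible: the same stream with the same seed always reproduces the signature, while changing the seed (initial conditions) or any word (stimulus) changes it.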

Consistency triplet (what must match before comparing)

  • Stimulus identity: test vectors and test suite ID must be identical.
  • Initial conditions: seed, reset state, and initialization must be deterministic.
  • Observation window: start/stop points and loop counts must be consistent.

Most “mismatch” events are explained by a broken consistency triplet, not a failing device.

Golden signature governance (the real control plane)

  • Version binding: tie the golden to firmware/build ID and a configuration profile ID (not just a name).
  • Platform identity: classify by stepping / board revision / allowed BOM variants.
  • Environment tags: record temperature band and clock mode to separate drift from defects.
  • Test identity: record suite ID, vector ID, and compressor parameters (seed/polynomial ID).

Compare policy (strict vs window vs mask)

  • Strict: strongest detection, best for stable domains and manufacturing screens.
  • Window: allow controlled variation by selecting a permitted set (multi-golden per tag) and bounded retries.
  • Mask: ignore known-unstable segments only under explicit change control (highest leak risk).

Policy must be explainable: every relaxation should be audited and reversible.
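
The three compare policies can be sketched as one selector. The golden-set representation (a list of permitted signatures per metadata tag) and the mask semantics are simplified assumptions.

```python
# Illustrative compare-policy selector. `goldens` is the permitted
# signature set bound to one metadata tag (fw/cfg/platform/env).
def compare(sig, goldens, policy="strict", mask=0):
    if policy == "strict":
        return sig == goldens[0]       # exact match to the single pinned golden
    if policy == "window":
        return sig in goldens          # multi-golden per tag, bounded set
    if policy == "mask":
        # ignore known-unstable bits only (highest leak risk; audit required)
        return any((sig & ~mask) == (g & ~mask) for g in goldens)
    raise ValueError(f"unknown policy: {policy}")
```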

False positives vs false negatives (root causes to isolate)

  • False positives: seed/init mismatch, vector mismatch, window mismatch, unclassified platform variance.
  • False negatives: insufficient vector coverage, rare boundary faults, over-aggressive masking.
  • Drift drivers: jitter sensitivity, metastable edges, temperature/voltage margin erosion.
Figure F5 — Signature generation chain and compare policy
A left-to-right chain: test vectors (suite/vector ID) and initial state (seed, reset) feed a response stream with a defined observation window; CRC / MISR / trace-summary compressors produce a signature; a governed golden database (firmware/config, platform, test, and environment tags, under approve/audit/rollback change control) drives a strict/window/mask compare policy that routes to PASS / ROUTE / RETEST.

H2-6 · Programmable power gating for test: domains, isolation, and safe sequencing

Power gating for test is a controlled experiment: it narrows fault domains, enables brownout-style injections, and validates recovery paths — without expanding damage and without breaking the evidence chain.

Why tests use power gating (practical goals)

  • Fault-domain localization: gate a domain and observe whether symptoms persist.
  • Recovery validation: prove controlled restart and post-gate re-initialization.
  • Brownout-style injection: emulate supply disturbance to validate detect/isolate/restore.
  • Decoupling: prevent one failing domain from polluting observability in others.

Domain model for testing (separate “target” from “evidence”)

  • Target domains (A/B/C): independently gateable test subjects.
  • Isolation boundary: prevents back-power and unpredictable boundary states.
  • Always-on evidence island: logging + timestamps + minimal control must remain powered.

Evidence-first rule: logging must survive both the injection and the recovery.

Isolation and retention (test-view constraints)

  • Isolation gates: enforce deterministic boundary signaling during and after gating.
  • Retention policy: keep only the state required for post-gate comparisons and recovery proofs.
  • Recoverability: recovery must attribute resets to injection markers, not to “unknown causes.”

Safe sequencing (a minimal injection procedure)

  • Mark: write an injection marker into the evidence island.
  • Snapshot: capture a small counter/status snapshot (pre-injection).
  • Gate/Glitch: apply controlled gating or disturbance within a defined impact scope.
  • Stabilize: wait for stability windows to avoid transient misclassification.
  • Snapshot + verdict: capture post-injection evidence and route (halt/safe/retry).
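
The five-step procedure can be sketched as a single evidence-first routine. All of the hooks (`snapshot`, `gate`, `stabilize`), the evidence-island representation, and the verdict rule are hypothetical placeholders.

```python
# Illustrative evidence-first injection sequence for one gateable domain.
# `evidence_island` stands in for the always-on log store.
def inject_with_evidence(domain, evidence_island, snapshot, gate, stabilize):
    evidence_island.append(("MARK", domain))   # 1. marker written first
    pre = snapshot(domain)                     # 2. pre-injection snapshot
    evidence_island.append(("PRE", pre))
    gate(domain)                               # 3. controlled gating/glitch
    stabilize()                                # 4. wait out the stability window
    post = snapshot(domain)                    # 5. post-injection snapshot
    evidence_island.append(("POST", post))
    verdict = "PASS" if post["recovered"] else "ROUTE"
    evidence_island.append(("VERDICT", verdict))
    return verdict
```

The ordering enforces both red lines: the marker and pre-snapshot land in the evidence island before anything is gated, so evidence cannot be lost to the injection itself.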

Two red lines (non-negotiable test rules)

  • Do not expand the fault: gating must not create new uncontrolled side effects.
  • Do not lose evidence: never gate the evidence island before markers and snapshots are recorded.
Figure F6 — Power domains with an always-on evidence island (evidence first)
Three gateable target domains (each with a power switch and an isolation gate at its boundary) beside an always-on evidence island holding the log store and timestamp correlation. A sequencing ribbon shows the minimal injection procedure: mark → snapshot → gate/glitch → stabilize → snapshot → verdict, with the log kept powered throughout.

H2-7 · Fault injection methods: error-insert vs physical perturbation

Fault injection is only useful when the fault model, the injection mechanism, and the observation points are aligned. Two families dominate: deterministic error insertion and boundary-driven physical perturbation.

Two families, two verification targets

  • Error insertion: injects modeled errors (e.g., ECC bit-flip, parity fault, link error, timeout) to validate detect → isolate → recover paths.
  • Physical perturbation: injects boundary stress (power droop/glitch, clock glitch, reset bounce, temperature edge) to validate margin and real-world susceptibility.

Error insertion proves the safety logic; perturbation probes the physical boundary where intermittent failures live.

Error insertion: mechanisms and what must be proven

  • ECC / parity: injected flips must trigger the correct fault flag and counter delta, then route to the expected containment action.
  • Link error inject: injected error bursts must increment lane/link counters and create an event record tied to the injection marker.
  • Timeout inject: injected timeouts must exercise retry/backoff rules and produce a clean, explainable recovery outcome.

The pass condition is not “an error happened”, but “the evidence chain explains it end-to-end.”

Physical perturbation: boundary stress without turning it into randomness

  • Power droop/glitch: apply within a defined window and scope, with bounded amplitude and count.
  • Clock glitch: keep glitches time-gated and traceable; avoid uncontrolled spread across multiple domains.
  • Reset bounce: use controlled patterns to validate reset-cause capture and deterministic re-entry paths.
  • Temperature edge: treat as a boundary trigger; correlate transitions to counters and traces on the same timeline.

Perturbation should confirm a hypothesis about margin, not create a new failure mode.

Decision criteria (choose by repeatability, model coverage, risk, diagnostic value)

  • Repeatability: insertion is high; perturbation depends on window control and boundary stability.
  • Fault model: insertion covers modeled detection paths; perturbation covers timing/electrical margin faults.
  • Risk: insertion is usually lower; perturbation requires hard guardrails to avoid cascading effects.
  • Diagnostic value: insertion is best for verifying response logic; perturbation is best for revealing true intermittents.

Safety guardrails (non-negotiable)

  • Injection window: enable only in approved checkpoints/states; reject injection outside the window.
  • Amplitude & count limits: cap intensity and total attempts; enforce cool-down/backoff.
  • Rollback and exit: define stop conditions (unexpected symptoms, missing evidence, reset loops) and force safe halt.
  • Evidence first: marker + pre-snapshot must be recorded before any injection; evidence storage must remain available.
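
These guardrails reduce to a gate that rejects any out-of-policy attempt before injection happens. The state names, caps, and reason codes below are illustrative assumptions.

```python
# Illustrative guardrail gate: reject attempts that are outside the
# approved window, past the attempt budget, or over the amplitude cap.
def guardrail_check(state, attempts, amplitude, *,
                    allowed_states=("CHKPT_SAFE",),
                    max_attempts=5, max_amplitude=0.10):
    if state not in allowed_states:
        return False, "OUT_OF_WINDOW"    # reject injection outside the window
    if attempts >= max_attempts:
        return False, "COUNT_CAP"        # cap total attempts; forces cool-down
    if amplitude > max_amplitude:
        return False, "AMPLITUDE_CAP"    # bound perturbation intensity
    return True, "OK"
```
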
Figure F7 — Injection method comparison matrix (repeatability / risk / model / observability / stage)
A two-column matrix comparing the families row by row: error insertion (high repeatability, low-to-medium risk, modeled error coverage, counters + events, best in bring-up and production) versus physical perturbation (medium repeatability, medium-to-high risk, margin/timing coverage, timeline + trace, best in validation and — only under strict permissions and stop rules — in the field). Both require window, count-cap, and exit guardrails; perturbation additionally needs an amplitude cap.

H2-8 · Observability design: counters, traces, timestamps, and correlation

Observability must be correlation-first: injection markers, state snapshots, counter deltas, trace snippets, and black-box records should land on one timeline and one identity (boot ID and sequence ID).

The four-piece set (what each component proves)

  • Counters: quantify accumulation and deltas (errors, retries, resets, recoveries).
  • Events: discrete flags and reason codes (what happened, where, why).
  • Trace: short ordered snippets around the injection window (what led to what).
  • Timestamps: the anchor that makes everything correlatable across domains.

Minimal evidence bundle (a repeatable record format)

  • Injection marker (type ID + sequence ID).
  • Pre-snapshot (key counters + domain states).
  • Post-snapshot (same fields, after stabilization).
  • Trace snippet (optional but recommended for ordering).
  • Black-box record (boot ID + checkpoint + verdict) on the same time base.

Correlation fails when any of these is missing or not aligned to the same marker.

Counter design rules (avoid “looks fine but is wrong”)

  • Domain binding: counters must be scoped (domain/lane/subsystem), not only global totals.
  • Clear policy: define boot-clear vs periodic-clear vs never-clear and record the policy.
  • Overflow handling: detect wrap/rollover or provide sufficient width; mark wrap in records.
  • Snapshot-first: compute deltas from pre/post snapshots, not from asynchronous polling.
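
The snapshot-first and overflow rules combine into a small wrap-aware delta helper. This sketch assumes a free-running counter of fixed width and at most one wrap between snapshots; with more than one wrap the delta is ambiguous regardless of the arithmetic.

```python
# Illustrative wrap-aware delta for a monotonic hardware counter of a
# fixed bit width, computed from pre/post snapshots (never from polling).
def counter_delta(pre, post, width=32):
    mod = 1 << width
    delta = (post - pre) % mod   # modular subtraction absorbs one wrap
    wrapped = post < pre         # mark the wrap in the record
    return delta, wrapped
```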

Trace and event design (keep it small, make it indexable)

  • Event IDs: include domain ID, reason code, severity, and sequence ID.
  • Trace windows: capture around injection and reset edges; do not attempt full-time tracing.
  • Indexing: every trace snippet must be searchable by marker sequence ID and timestamp.

Common correlation failures and the fix

  • Unsynchronized sampling: marker and snapshots are triggered by different clocks → unify triggering.
  • Bad reset/clear rules: counters reset unexpectedly → log clear policy and boot ID.
  • Overflow/wrap: deltas look negative or jump → add wrap flags and wider counters.
  • Event loss: marker exists but event is missing → use bounded buffers with backpressure rules.
Figure F8 — Timeline correlation (marker → counter Δ → trace → black-box record)
A four-lane timeline (marker, counters, trace, black-box) aligned by a vertical trigger line: the injection marker with its sequence ID, pre/post counter snapshots joined by a delta arrow, a windowed trace snippet, and a black-box record carrying boot ID, checkpoint, and verdict — all on the same time base. Failure modes to guard against: unsynchronized sampling, counter wrap, bad clear policy, event loss.

H2-9 · Black-box logging: pre/post-trigger, ring buffer, and forensic quality

A black-box log exists to preserve the “before and after” of an intermittent failure when the field cannot reproduce it. The goal is correlation-ready evidence that survives resets and prevents silent loss or overwrite.

What the black-box must solve

  • Non-reproducible faults: capture the last-known-good context and the transition into failure.
  • Post-reset amnesia: preserve evidence across reboot or partial loss of volatile state.
  • Broken causal chains: keep marker → snapshots → deltas → trace → verdict on one identity (boot ID / sequence ID).

A long log is not a good log. A good log is provable and correlatable.

Ring buffer is for windows, not for capacity

  • RAM ring continuously covers the most recent time span or the last N events.
  • Freeze on trigger to protect the pre-trigger window from overwrite.
  • Post-trigger capture records the system response chain (retry → degrade → isolate → reset → recover).
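
The freeze semantics can be sketched as a small ring class: rolling overwrite while armed, a frozen copy of the pre-trigger window on trigger, and a bounded post-trigger capture. The class and field names are illustrative.

```python
# Illustrative black-box ring: covers the last N events while armed,
# freezes the pre-trigger window on trigger, then captures a bounded
# post-trigger response chain.
class BlackBoxRing:
    def __init__(self, capacity, post_capture=4):
        self.capacity = capacity
        self.post_capture = post_capture
        self.ring = []      # rolling pre-trigger window
        self.frozen = None  # protected copy, set on trigger
        self.post = []      # bounded post-trigger capture

    def record(self, event):
        if self.frozen is not None:
            if len(self.post) < self.post_capture:
                self.post.append(event)   # retry -> degrade -> ... chain
            return
        self.ring.append(event)
        if len(self.ring) > self.capacity:
            self.ring.pop(0)              # overwrite only while not frozen

    def trigger(self):
        self.frozen = list(self.ring)     # protect the pre-trigger window
```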

Trigger policy (threshold / pattern / composite)

  • Threshold: error-rate spikes, counter deltas, signature mismatches, repeated training failures.
  • Pattern: recurring reset causes within a short interval, oscillating “almost failing” states.
  • Composite: marker + state transition + counter delta required to trigger (reduces false positives).

Trigger sensitivity should be paired with a cool-down rule to avoid log spam.

Pre/Post trigger windows and the minimal evidence set

  • Pre-window: must include the last-good checkpoint and the drift leading into the trigger.
  • Post-window: must include the containment and recovery actions until stabilization or exit.
  • Minimal set: domain states, reset cause, last-good checkpoint, signature delta, key counter deltas, and marker ID.

Forensic quality (provable, non-silent loss)

  • Anti-overwrite: trigger records should move into a frozen/append-only segment.
  • Anti-truncation: each record carries length + sequence + CRC so partial writes are detectable.
  • Anti-silent-clear: clear operations must be explicit and auditable, never silent.

Forensics is about verifying completeness and ordering, not about storing everything forever.

Flush policy (writing order under power loss)

  • Evidence-first: write header + minimal evidence bundle before any extended payload.
  • On trigger: flush immediately; on exit conditions, flush final verdict and stop reason.
  • Atomic mindset: if a record cannot be fully written, it must be marked invalid, not “half-valid”.
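
The anti-truncation rule (length + sequence + CRC) can be sketched as a record codec in which any partial or corrupted record decodes to invalid, never to “half-valid”. The byte layout here is an arbitrary example, not a defined format.

```python
import struct
import zlib

# Illustrative record codec: header (payload length + sequence number),
# then payload, then CRC32 over header+payload. Truncated or corrupted
# records are detectable and decode to None (invalid), never half-valid.
def pack_record(seq, payload: bytes) -> bytes:
    header = struct.pack("<II", len(payload), seq)
    crc = zlib.crc32(header + payload)
    return header + payload + struct.pack("<I", crc)

def unpack_record(blob: bytes):
    """Returns (seq, payload), or None if the record is invalid."""
    if len(blob) < 12:
        return None
    length, seq = struct.unpack("<II", blob[:8])
    if len(blob) < 12 + length:
        return None                                   # truncated write
    payload = blob[8:8 + length]
    (crc,) = struct.unpack("<I", blob[8 + length:12 + length])
    if zlib.crc32(blob[:8 + length]) != crc:
        return None                                   # corrupted record
    return seq, payload
```
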
Figure F9 — Black-box structure: RAM ring → freeze & window → flush policy → nonvolatile store
A pipeline: RAM ring buffer with rolling coverage of event/state/delta/trace slots → trigger (threshold / pattern / composite) and freeze to protect the pre-trigger (last-good → trigger) and post-trigger (contain → recover) windows → evidence-first flush (header, minimal evidence bundle, then extended payload) → nonvolatile append/freeze store carrying length, CRC, and sequence/boot ID, with no silent clears.

H2-10 · Manufacturing vs in-field: test time budget and coverage strategy

Manufacturing tests are constrained by strict time budgets, while in-field self-tests are constrained by service continuity. A practical strategy uses tiered test suites and prioritizes fault models by frequency, impact, and observability.

Two environments, two primary constraints

  • Manufacturing: time equals cost; results must be fast, repeatable, and easy to bin (pass/rework/fail).
  • In-field: tests must not disrupt workloads; they must be shardable, pausable, and safe under partial failure.

Manufacturing: short BIST + high-information signature

  • Short suite: keep runtime bounded; focus on high-yield screens.
  • Information density: prefer signatures and counter deltas that separate fault classes with minimal vectors.
  • Fast verdict: output a binning result that is explainable and consistent across identical units.

Manufacturing aims to screen and route units efficiently, not to exhaust every edge case.

In-field: sharded self-tests with safe post-failure actions

  • Sharding: test one domain or one slice at a time; avoid long uninterrupted runs.
  • Low-priority windows: schedule in allowed maintenance windows or idle time.
  • Failure actions: degrade/derate/lockout/maintenance-flag with a correlatable evidence record.

Coverage strategy: prioritize by frequency, impact, observability

  • Frequency: start with fault models that occur often in the target population.
  • Impact: prioritize models that drive SLA/MTTR and large blast radius.
  • Observability: choose models that can be proven with counters, signatures, and black-box evidence.

Expand from core detection paths to boundary scenarios only after evidence quality is stable.

Tiered suites (Tier 0/1/2) to balance time vs coverage

  • Tier 0: shortest, highest-yield screen; suitable for both production and field.
  • Tier 1: targeted add-ons; used when time allows or when a symptom appears.
  • Tier 2: boundary deep-dive; primarily for validation/diagnostics with strict guardrails.
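
Tier selection against a time budget can be sketched as a cumulative fit starting from Tier 0. The tier runtimes below are made-up placeholders.

```python
# Illustrative tier selection: fit as many cumulative tiers as the time
# budget allows, always starting from Tier 0. Runtimes are invented.
TIERS = [("tier0", 2.0), ("tier1", 10.0), ("tier2", 60.0)]  # seconds

def select_tiers(budget_s):
    chosen, used = [], 0.0
    for name, cost in TIERS:
        if used + cost <= budget_s:
            chosen.append(name)
            used += cost
        else:
            break   # tiers build on each other, so stop at the first miss
    return chosen
```

A production budget of a few seconds selects only Tier 0; a field maintenance window can extend the run to Tier 1 or Tier 2.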
Figure F10 — Time budget vs fault coverage (tiered suites for production and field)
A time-versus-coverage plot with a step-like curve marking Tier 0 (quick screen), Tier 1 (targeted add-ons), and Tier 2 (boundary deep-dive), plus two vertical budget markers for the production budget and the field window. Prioritize fault models by frequency, impact, and observability.

H2-11 · Coverage closure: fault models, pass/fail policy, and escaping bugs

Coverage closure turns “a field failure” into a durable asset: a classified fault model, a reproducible injection recipe, a correlatable evidence record, and an updated test/metric/policy that reduces future escapes.

Fault models drive test design (permanent vs intermittent vs marginal)

  • Permanent: repeatable failures → short, high-yield screens; domain-level binning; clear signatures/counter deltas.
  • Intermittent: probability + window dependence → pre/post windows, composite triggers, scripted reproduction via injection.
  • Marginal (corner): temperature/voltage/clock-margin sensitivity → trend-based metrics (near-threshold), tiered deep-dive with strict guardrails.

A “passed test list” is not coverage. Coverage means fault-model coverage with proof-quality observability.

Coverage means: model + injection + observability + policy (within a budget)

  • Model: which faults are targeted (and which are explicitly out-of-scope).
  • Injection: at least one aligned method to reproduce a representative failure mechanism.
  • Observability: marker + counters/events + trace + black-box evidence share identity and time base.
  • Policy: pass/fail + gray-zone actions mapped to manufacturing and in-field constraints.
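The four closure fields can be checked mechanically, so "covered" is a property of data rather than opinion. A minimal sketch, assuming a dictionary-shaped catalog entry; the field names and the example entry are hypothetical.

```python
REQUIRED = ("model", "injection", "observability", "policy")

def coverage_gaps(entry):
    """Return which of the four closure fields are missing or empty.
    An empty result means the fault model is covered in the
    model + injection + observability + policy sense."""
    return [field for field in REQUIRED if not entry.get(field)]

entry = {
    "model": "DDR single-bit intermittent",
    "injection": "ECC poison write at checkpoint C3",
    "observability": ["marker_id", "ce_counter_delta", "blackbox_record"],
    "policy": "",  # pass/fail + gray-zone rule not yet mapped
}
```

Running `coverage_gaps(entry)` flags `"policy"` as the missing piece: the model is testable and observable, but no decision is bound to the outcome yet.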

Pass/Fail is not binary — use a gray-zone policy

Gray-zone decisions reduce escapes by trading small extra test cost for much higher evidence quality and better classification.

Gray-zone entry is triggered by a near-threshold metric or an inconclusive signature. Once entered:
  • Retest: rerun with an altered order/window; require consistent outcome before binning.
  • Expand window: increase pre/post evidence capture to preserve causal context.
  • Change comparison mode: switch from “strict” to window/mask/segmented thresholds when drift is legitimate and bounded.
  • Isolate & re-test: revalidate a single domain to reduce cross-domain interference and false attribution.
  • Action on fail: manufacturing binning (rework/fail) or in-field action (derate/lockout/maintenance flag) with a complete evidence record.
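The rules above can be made deterministic so every gray-zone entry ends in a concrete action. A sketch under assumptions: the four boolean inputs are presumed already computed by the test harness, and the action names are illustrative.

```python
def gray_zone_action(retest_consistent, near_threshold, drift_bounded, domain_isolated):
    """Deterministic gray-zone policy: every path ends in a concrete action."""
    if not retest_consistent:
        return "retest"            # rerun with altered order/window first
    if near_threshold and drift_bounded:
        return "window_threshold"  # switch from strict to windowed compare
    if not domain_isolated:
        return "isolate_retest"    # rule out cross-domain interference
    return "fail_with_evidence"    # bin/derate/flag with full evidence record
```

The design choice worth noting: the function never returns "pass quietly"; even the lenient paths produce an auditable outcome.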

Why tests pass but failures still happen (escape analysis)

The usual escape classes are a coverage gap, an observability gap, an injection (model) mismatch, and evidence loss:
  • Coverage gap: the real-world sequence is not represented in the suite → add a targeted test slice or tier mapping.
  • Observability gap: failure is detected but not explainable → add counters/events/markers and unify correlation identities.
  • Injection mismatch: injected faults do not match real faults → refine the fault model and align injection stage and observables.
  • Evidence loss: logs are overwritten/truncated/cleared after reset → enforce black-box integrity and evidence-first flush.
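Escape triage maps directly to the gap class the closure loop must patch. A minimal sketch, with hypothetical boolean inputs that would be derived from the evidence record:

```python
def classify_escape(seen_in_suite, explainable, model_matches, evidence_intact):
    """Map an escape to the gap class that the closure loop must patch."""
    if not seen_in_suite:
        return "coverage_gap"       # add a targeted test slice / tier mapping
    if not explainable:
        return "observability_gap"  # add counters/markers, unify identities
    if not model_matches:
        return "injection_mismatch" # refine model, realign injection stage
    if not evidence_intact:
        return "evidence_loss"      # enforce black-box integrity / flush
    return "unclassified"           # needs a new fault-model entry
```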

Versioned closure assets (what must be updated)

  • Fault-model catalog: model taxonomy, scope, and expected observables.
  • Tier mapping: what runs in manufacturing vs in-field, with explicit time budgets.
  • Policy: strict vs window vs gray-zone rules, and post-failure actions.
  • Baselines: golden signatures and metric thresholds bound to platform revision and configuration.

Every closure update should be traceable to a concrete escape and verified by regression.


MPN examples (for coverage-closure plumbing)

The items below are example parts commonly used to implement evidence retention, integrity checks, time correlation, and programmable power gating for test/injection. Final selection depends on voltage domain, endurance, and bus architecture.

Nonvolatile evidence store (fast writes, high endurance)

  • SPI MRAM: Everspin MR25H40
  • SPI FRAM: Fujitsu MB85RS64V
  • SPI FRAM: Infineon/Cypress FM25V10

Baseline / firmware / policy storage (common serial flash)

  • SPI NOR Flash: Winbond W25Q64JV
  • SPI NOR Flash: Macronix MX25U6435F
  • SPI NOR Flash: Micron MT25QL128ABA

Time base & event correlation (boot identity / timestamp anchor)

  • RTC (I²C): NXP PCF85063A
  • RTC (I²C): Microchip MCP7940N
  • RTC (high stability): Analog Devices/Maxim DS3231M

Supervisors / reset cause capture (clean, auditable resets)

  • Window supervisor: TI TPS3850
  • Supervisor + reset: TI TPS3890
  • Supervisor family: Analog Devices/Maxim MAX16052

Programmable domain gating for test/injection (policy-driven cut/isolate)

  • eFuse / power limiter: TI TPS25982
  • eFuse / hot-swap style: TI TPS25940
  • Hot-swap controller: Analog Devices LTC4215

Sideband control expansion (more triggers, more markers, more isolation control)

  • I²C I/O expander: TI TCA9535
  • I²C I/O expander: NXP PCA9555
  • Low-cost CPLD/FPGA glue: Lattice MachXO3LF-2100E

Practical pairing: (MRAM/FRAM) for ring/critical records + (SPI NOR) for baselines/policy + (supervisor/RTC) for reset/time anchoring + (eFuse/hot-swap) for safe domain gating.

Figure F11 — Closure loop: field failure → reproduce via injection → identify gaps → update tests/metrics/policy → regression → deploy
(Figure: an eight-stage loop: field failure/escape → black-box evidence → reproduce via injection → gap analysis (coverage, observability, injection mismatch) → patch tests/metrics/policy → update goldens and tier mapping → regression → deploy to production and field tiers.)


H2-12 · FAQs (BIST/POST, Fault Injection, Observability, Logging, Policy)

These FAQs focus on self-test, fault injection, observability/correlation, black-box logging, pass/fail policy, and coverage closure. They intentionally exclude PSU topology, cooling, and protocol-stack deep dives.

1

POST passes, but random errors appear under load or temperature—suspect test coverage first or observability gaps?

Maps to: H2-11 (closure) · H2-8 (observability)

Quick take: Start with observability if the failure cannot be correlated to a repeatable marker; suspect coverage when correlation is strong but the suite does not represent the marginal condition.

What it proves / doesn’t: A “clean POST” proves only that boot checkpoints passed at that moment. It does not prove margin under sustained stress or a corner fault model.

  • Check one time base: injection/marker time → counter deltas → trace snippet → black-box record.
  • Look for near-threshold indicators (retries, margin trends), not only pass/fail bits.
  • If evidence is present but no test covers the condition, classify as a coverage gap and add a tiered stress slice.
2

Loopback passes, yet intermittent link jitter appears in production—wrong loopback level or wrong criteria?

Maps to: H2-4 (loopback) · H2-11 (policy/closure)

Quick take: The wrong loopback level is the most common cause; the wrong pass criteria are the second. A pass at an internal loopback may bypass the real failure path.

What it proves / doesn’t: A loopback pass proves only the tested segment (MAC/PCS/PMA/connector/fixture). It does not prove end-to-end channel behavior outside that segment.

  • Re-run at a deeper level (external/physical) if production failures involve cabling/connector paths.
  • Upgrade criteria from “link up” to: BER bursts, retrain count, margin trend, and counter deltas.
  • If pass/fail flaps, apply gray-zone retest and expand evidence windows before binning.
3

Occasional signature mismatch that disappears on retest—real marginal fault or noise/initialization variance?

Maps to: H2-5 (signatures) · H2-11 (closure)

Quick take: Treat it as gray-zone until evidence distinguishes deterministic drift from non-deterministic variance (seed/config/order effects).

What it proves / doesn’t: A mismatch proves only that the compressed response differs. It does not identify whether the difference is legitimate drift, marginal timing, or test non-determinism.

  • Record and compare: test vector version, seed/init state, and platform configuration identity.
  • Correlate mismatch with counter deltas or trace anomalies; “signature-only” failures are often non-diagnostic.
  • Use a window/mask policy for legitimate bounded drift, but require repeatability before declaring a real fault.
4

Should golden signatures be bucketed by BIOS/config version? What goes wrong if bucketing is too fine?

Maps to: H2-5 (golden mgmt) · H2-10 (manufacturing vs field)

Quick take: Yes—bucket by meaningful change points. Over-bucketing increases maintenance cost, weakens statistics, and turns normal drift into false alarms.

What it proves / doesn’t: A golden only proves comparability within its defined identity. Without version binding, mismatches become ambiguous and non-actionable.

  • Bucket keys should be coarse but causal: platform revision + BIOS major + feature toggles that change test paths.
  • Too-fine buckets fragment baselines, reduce confidence, and inflate retest time in production.
  • Prefer: fewer buckets + a bounded drift window + an audit trail for baseline updates.
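A coarse-but-causal bucket identity can be sketched as a hash over only the path-changing fields, so that BIOS minor/patch churn does not fragment baselines. `bucket_key` and its parameters are illustrative assumptions, not a real schema.

```python
import hashlib

def bucket_key(platform_rev, bios_major, feature_toggles):
    """Bucket identity for golden signatures: only fields that change
    the tested path participate; patch-level BIOS churn does not."""
    toggles = ",".join(sorted(feature_toggles))  # order-independent
    raw = f"{platform_rev}|{bios_major}|{toggles}"
    return hashlib.sha256(raw.encode()).hexdigest()[:12]

# Toggle-set order does not matter; platform revision does:
same = bucket_key("P2", "7", {"sriov", "cxl"}) == bucket_key("P2", "7", {"cxl", "sriov"})
```

Because the key is deterministic, the same identity can be stamped into black-box records and golden-baseline stores, which keeps mismatch reports actionable.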
5

ECC error insertion reproduces different symptoms than real field failures—why, and what injection should be added?

Maps to: H2-7 (injection methods) · H2-11 (escape analysis)

Quick take: Error insertion validates the detection/handling path, but many field failures are margin/timing/state problems. Add injections that match the fault model.

What it proves / doesn’t: Bit-flips prove ECC/parity machinery and policy. They do not prove behavior under brownout, clock margin, or reset-sequence disturbances.

  • Classify the field escape: permanent vs intermittent vs marginal; pick injection aligned to that model.
  • Complement error insertion with controlled perturbations (power/clock/reset windows) under strict guardrails.
  • Require correlation evidence: marker → deltas → trace → black-box before concluding “same bug”.
6

Brownout/glitch injection can amplify damage—what safety guardrails prevent “making it worse”?

Maps to: H2-6 (power gating) · H2-7 (guardrails)

Quick take: Guardrails must control when injection is allowed, how strong/how often it can be, and how to exit safely while preserving evidence.

What it proves / doesn’t: Perturbation proves recovery and isolation behavior only if the system returns to a known baseline and evidence survives the event.

  • Gate injection to explicit checkpoints; forbid during uncontrolled transitions.
  • Clamp amplitude/count; require cool-down and a hard exit condition (stop, isolate, or maintenance flag).
  • Preserve an always-on evidence island: freeze the pre-window before applying aggressive perturbation.
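These guardrails can be expressed as a gate that every injection request must pass before a perturbation is applied. A sketch under assumptions: `InjectionGuard`, its limits, and the time units are hypothetical.

```python
class InjectionGuard:
    """Clamp amplitude and shot count, require cool-down, and refuse to
    fire unless the system is at a checkpoint with the pre-window frozen."""
    def __init__(self, max_amplitude, max_shots, cooldown_s):
        self.max_amplitude = max_amplitude
        self.max_shots = max_shots
        self.cooldown_s = cooldown_s
        self.shots = 0
        self.last_shot_t = None

    def allow(self, amplitude, now_s, at_checkpoint, pre_window_frozen):
        if not at_checkpoint or not pre_window_frozen:
            return False  # only at explicit checkpoints, evidence first
        if amplitude > self.max_amplitude or self.shots >= self.max_shots:
            return False  # clamp strength and total count
        if self.last_shot_t is not None and now_s - self.last_shot_t < self.cooldown_s:
            return False  # enforce cool-down between shots
        self.shots += 1
        self.last_shot_t = now_s
        return True
```

The hard exit condition (stop, isolate, or maintenance flag) would sit above this gate; the gate only guarantees that no single run can exceed its declared envelope.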
7

Counters look “normal” but black-box logs show a complete failure chain—how should time correlation be done?

Maps to: H2-8 (correlation) · H2-9 (black-box)

Quick take: “Normal counters” often mean sampling/clearing/overflow hid the burst. Trust the chain only when all records share the same identity and time base.

What it proves / doesn’t: A black-box chain proves ordering and context if it is integrity-checked; counters alone may miss short spikes or reset on read/boot.

  • Verify counter semantics: clear-on-read, wraparound, sampling cadence, and reset boundaries.
  • Anchor correlation on one marker ID: injection marker / checkpoint ID / boot-sequence ID.
  • Require alignment: marker time → counter delta time → trace snippet time → black-box record time.
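The alignment requirement can be sketched as a correlation check over one marker identity and one time base; the record shapes and tolerance value here are assumptions for illustration.

```python
def correlate(marker, counters, traces, blackbox, tolerance_s=0.5):
    """Accept a stage of the chain only when at least one of its records
    carries the same marker id and a timestamp within tolerance of the
    marker time (i.e., the same identity and the same time base)."""
    def aligned(rec):
        return (rec["marker_id"] == marker["marker_id"]
                and abs(rec["t"] - marker["t"]) <= tolerance_s)
    stages = {"counter_delta": counters, "trace": traces, "blackbox": blackbox}
    return {name: any(aligned(r) for r in recs) for name, recs in stages.items()}

marker = {"marker_id": "inj-42", "t": 100.0}
chain = correlate(
    marker,
    counters=[{"marker_id": "inj-42", "t": 100.2}],
    traces=[{"marker_id": "inj-42", "t": 100.4}],
    blackbox=[{"marker_id": "inj-41", "t": 100.1}],  # wrong identity: rejected
)
```

Any stage that fails alignment is exactly where the "normal counters vs complete black-box chain" contradiction should be investigated first.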
8

How large should the ring buffer be, and how should pre/post-trigger windows be chosen?

Maps to: H2-9 (ring + pre/post + flush)

Quick take: Size the ring for the time span or event count needed to capture “last-good → trigger → recovery,” not for raw bytes.

What it proves / doesn’t: A large buffer without freeze/flush rules still loses evidence. Windows matter more than capacity.

  • Pre-window should include the last-known-good checkpoint and the drift leading to trigger.
  • Post-window should include containment and stabilization (retry/degrade/isolate/reset/recover).
  • Use evidence-first flush: header + minimal bundle first; extended payload only if budget allows.
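The window rules can be sketched as an event-count ring with a freezable pre-window and an evidence-first flush; `EvidenceRing` is an illustrative structure, not a real API.

```python
from collections import deque

class EvidenceRing:
    """Ring sized in events, not bytes: it must span last-good → trigger
    → recovery. On trigger, the pre-window is frozen so it cannot be
    overwritten while the post-window fills."""
    def __init__(self, pre_events, post_events):
        self.pre = deque(maxlen=pre_events)  # rolling last-good context
        self.post = []
        self.post_events = post_events
        self.frozen = False

    def record(self, event):
        if not self.frozen:
            self.pre.append(event)           # oldest events fall off
        elif len(self.post) < self.post_events:
            self.post.append(event)          # bounded post-trigger capture

    def trigger(self):
        self.frozen = True                   # freeze pre-window against overwrite

    def flush(self):
        """Evidence-first flush: header (counts) plus the minimal bundle."""
        return {"header": {"pre": len(self.pre), "post": len(self.post)},
                "bundle": list(self.pre) + self.post}
```

Note that capacity alone does nothing: it is `trigger()` freezing the pre-window that preserves the causal context the chapter argues for.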
9

During POST, when should the system fail-fast vs allow degraded boot? How is “acceptable degradation” defined?

Maps to: H2-3 (POST staging) · H2-10 (field policy)

Quick take: Fail-fast for checkpoints that threaten safety, evidence integrity, or core functionality. Allow degradation only when the degraded mode is provable, bounded, and auditable.

What it proves / doesn’t: Continuing after a warning proves only that boot can proceed; it does not prove SLA viability unless policy and evidence are explicit.

  • Define hard-stop checkpoints (core self-test, corrupted evidence path, repeated reset loops).
  • Define degraded actions (derate/maintenance flag/limited features) with explicit evidence capture.
  • Use a gray-zone approach: retest once, then commit to a deterministic policy outcome.
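The fail-fast vs degraded-boot decision can be sketched as a small deterministic table; the checkpoint names, the degraded actions, and the "one retest" rule encoding are hypothetical.

```python
HARD_STOP = {"core_selftest", "evidence_path", "reset_loop"}   # never continue
DEGRADABLE = {"dimm_slot": "derate_memory", "nic_port": "disable_port"}

def post_policy(checkpoint, failed, retest_passed):
    """Fail fast on safety/evidence/core checkpoints; allow degraded boot
    only when a bounded degraded mode is defined for that checkpoint."""
    if not failed:
        return "continue"
    if checkpoint in HARD_STOP:
        return "halt"                               # fail-fast, no retest
    if retest_passed:
        return "continue_logged"                    # gray-zone: one retest, keep evidence
    if checkpoint in DEGRADABLE:
        return DEGRADABLE[checkpoint] + "+maintenance_flag"
    return "halt"                                   # undefined degradation path: stop
```

The key property is that "acceptable degradation" is enumerated up front; a checkpoint with no entry in the table cannot silently degrade.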
10

Manufacturing time is tight—what BIST should be kept first: signatures, loopbacks, or counter scans?

Maps to: H2-2 (taxonomy) · H2-10 (time budget)

Quick take: Keep the highest information-per-second set: short signatures for broad screens, targeted loopbacks for link paths, and minimal counter deltas for near-threshold trend flags.

What it proves / doesn’t: No single method covers all fault models. Priorities depend on dominant escapes and what is observable in the line budget.

  • Use Tier 0: shortest, highest-yield screens (signature + a small set of loopbacks).
  • Keep counter scans as “delta snapshots” rather than long observation runs.
  • Reserve deeper margin diagnostics for Tier 1/2 when a symptom or bin requires it.
11

How can injection tests avoid “dirtying the system”: residual state, uncleared counters, or overwritten logs?

Maps to: H2-7 (injection) · H2-8 (correlation) · H2-9 (black-box)

Quick take: Treat injection as a controlled experiment: define baseline identity, freeze evidence windows, and guarantee cleanup rules before the next run.

What it proves / doesn’t: A reproduced failure proves little if state and evidence cannot be compared run-to-run. Contamination turns a test into noise.

  • Before injection: record baseline snapshots (boot/sequence ID + key counters + checkpoint).
  • During: mark injection time; freeze the pre-window to prevent overwrite.
  • After: enforce cleanup (explicit counter policy, state reset, cooldown) and append-only evidence records with CRC/sequence.
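Append-only evidence with CRC and sequence checks can be sketched as follows; the record layout is an assumption for illustration, not a standard format.

```python
import json
import zlib

def append_record(log, seq, payload):
    """Append-only record: a sequence number plus a CRC32 over the
    serialized body lets a reader detect truncation, reordering,
    or overwrite between runs."""
    body = json.dumps({"seq": seq, "payload": payload}, sort_keys=True)
    log.append({"body": body, "crc": zlib.crc32(body.encode())})
    return seq + 1  # next sequence number

def verify(log):
    """Check every CRC and require strictly increasing sequence numbers."""
    last = -1
    for rec in log:
        if zlib.crc32(rec["body"].encode()) != rec["crc"]:
            return False
        seq = json.loads(rec["body"])["seq"]
        if seq <= last:
            return False
        last = seq
    return True
```

Run-to-run comparability follows: if `verify` fails, the run is treated as contaminated and its evidence is excluded rather than silently merged.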
12

How can in-field self-tests be low-impact yet provably effective—how should sharding and criteria be defined?

Maps to: H2-10 (in-field strategy) · H2-11 (closure)

Quick take: Shard by domain/time and verify effectiveness with correlation-ready evidence and gray-zone criteria—not long disruptive runs.

What it proves / doesn’t: A low-impact shard proves effectiveness only if it can detect a defined fault model and produce evidence that supports an action.

  • Define shard granularity (one domain / one slice / one window) with explicit time budgets.
  • Criteria should include trends and deltas (near-threshold), plus deterministic gray-zone actions.
  • Feed every in-field escape back into the closure loop: update model → test/metric → policy → regression → redeploy tiers.