Fault & Black-Box: Event Logging, Latch & Clear
H2-01 · What is “Fault & Black-Box” in an Isolated System
Black-box logging in isolation is not generic logging—it is an audit-ready evidence chain for fault events: who asserted it, when it happened, how long it lasted, what it impacted, and how recovery was cleared and verified.
Definition: Black-Box vs Debug Log vs Telemetry
Black-box is event-centric and evidence-grade: structured records, strict field requirements, power-loss survivability, and an auditable clear policy.
Debug logs are developer-centric: often free-form, frequently disabled, and rarely reliable under brownouts, resets, or event storms.
Telemetry is trend-centric: periodic sampling that may miss short-lived trips (UV/SC/DESAT) and cannot prove “first fault” or “who cleared it”.
Pass/Fail statement: A record is “black-box evidence” only if it can answer: what happened, when, duration, impact, latch state, clear authority, and post-clear verification.
Why Isolation Systems Need a Black-Box
- Fast protection, faster recovery: UVLO/OT/SC/DESAT can trip and recover before a scope is attached; evidence must survive resets and power-cycles.
- Cross-domain symptoms: primary and secondary sides can show different symptoms; correlation requires domain tags and a shared “evidence contract”.
- Common-mode stress: high dv/dt can create false triggers or masked droops; diagnostics must separate cause vs symptom.
- Field behavior is uncontrolled: repeated rebooting can erase fragile logs; black-box must preserve “last critical fault” as non-erasable evidence.
Design intent: Convert intermittent, cross-domain faults into a consistent and testable evidence chain, enabling reliable triage and minimizing “no fault found” returns.
Page Deliverables (What This Topic Produces)
- Event dictionary: stable Event IDs, severity semantics, latch classes, required fields, debounce rules, and clear conditions.
- Latch/Clear state machine: auditable transitions, clear authority, cooldown/escalation, and verification gates after clear.
- Record format: schema versioning, required/core/optional fields, and integrity checks (CRC).
- NVM storage policy: ring buffer, atomic commit, brownout-safe write, and event-storm controls.
- Export interface: deterministic readout and report fields (protocol-agnostic).
Acceptance criterion: The system can always answer “last fault + clear history + verification result” in the field without requiring reproduction.
H2-02 · Fault Taxonomy & Event Dictionary (UV/OT/SC…)
The event dictionary is the “data constitution” of the black-box: stable Event IDs, consistent severity semantics, and unambiguous latch/clear requirements—so field diagnosis never depends on free-form logs.
Fault Taxonomy (Category + Role)
Use two orthogonal tags to prevent mis-triage: Category describes where the fault lives; Role describes whether it is a cause, a symptom, or correlated context.
- Device fault: internal flags (UVLO asserted, OT trip, DESAT/SC trip).
- Power fault: rail droop, brownout window, start-up collapse, no-load anomalies.
- Barrier-related symptom: unexpected toggles, domain resets, abnormal CM coupling signatures.
- Interface fault (symptom): CRC spike, timeout, framing error counters (protocol-agnostic).
Rule: Every dictionary entry must declare Role = Cause / Symptom / Correlated to stop “symptom-as-root-cause” mistakes.
Severity Semantics (Action-Oriented, Not Color-Oriented)
Severity must map to system action, otherwise latch/clear policy drifts across teams and versions.
- S0 Info: non-impacting evidence (retained for correlation).
- S1 Degraded: performance-limited mode (derating, rate-limit, power-limit).
- S2 Protective Trip: a protection action occurred; the system may auto-recover through a verification gate.
- S3 Safety Latch: requires explicit clear authority (service / controlled power-cycle).
- S4 Evidence-Critical: must survive reset/power-cycle and remain non-erasable until readout.
Compatibility rule: After an Event ID is published, Severity + Latch class + Required fields may only evolve in a backward-compatible way.
Minimal Event Dictionary Schema (Field Contract)
A compact schema ensures the black-box remains readable across firmware versions and service tools.
| Field | Meaning (audit-focused) |
|---|---|
| Event ID | Stable identifier used for triage, reporting, and version compatibility. |
| Domain | Primary / Secondary / Both, so cross-barrier correlation is deterministic. |
| Category + Role | Where the fault lives + whether it is a cause, symptom, or correlated context. |
| Severity (S0–S4) | Action semantics: info/degrade/trip/latch/evidence-critical. |
| Latch class | Self-clear / SW-clear / Power-cycle-clear / Manual-service-clear. |
| Clear condition | Verification gate after clear (X/Y/N placeholders, testable in bring-up). |
| Debounce / blanking | Reject glitches and dv/dt injection (N µs / N samples placeholders). |
| Required fields | Minimum context that must be present for the event to be “evidence-grade”. |
Threshold placeholders: UV trip = X V, OT trip = Y °C, SC detect blanking = N µs (placeholders reserved for product-specific values).
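The field contract above can be checked mechanically before a dictionary is frozen. The following is a minimal sketch, not a reference implementation: the class names, the ordering of latch classes by strictness, and the specific checks in `validate` are illustrative assumptions layered on the table, and the threshold values remain product-specific placeholders.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Role(Enum):
    CAUSE = "cause"
    SYMPTOM = "symptom"
    CORRELATED = "correlated"


class LatchClass(IntEnum):
    # Ordered from least to most strict so "tightening" can be compared.
    SELF_CLEAR = 0
    SW_CLEAR = 1
    POWER_CYCLE_CLEAR = 2
    MANUAL_SERVICE_CLEAR = 3


@dataclass(frozen=True)
class DictEntry:
    event_id: int
    domain: str            # "P", "S" or "Both"
    category: str          # e.g. "device", "power", "barrier", "interface"
    role: Role
    severity: int          # 0..4 for S0..S4
    latch_class: LatchClass
    clear_condition: str   # human-readable verification gate
    debounce_us: int       # placeholder, product-specific
    required_fields: tuple


def validate(entry):
    """Return a list of contract violations; an empty list means the entry passes."""
    problems = []
    if entry.domain not in ("P", "S", "Both"):
        problems.append("domain must be P, S or Both")
    if not 0 <= entry.severity <= 4:
        problems.append("severity must be in S0..S4")
    if entry.severity == 3 and entry.latch_class == LatchClass.SELF_CLEAR:
        problems.append("S3 requires explicit clear authority")
    if not entry.required_fields:
        problems.append("required_fields must not be empty")
    return problems


def is_compatible_update(old, new):
    """A published entry may keep or tighten its latch class, never relax it."""
    return new.event_id == old.event_id and new.latch_class >= old.latch_class
```

Running `validate` over every entry at build time, and `is_compatible_update` against the previously released dictionary, is one way to turn the compatibility rule into a release gate.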
H2-03 · Detection Inputs: What Signals You Actually Trust
Field diagnosis depends on signal credibility. In isolation systems, dv/dt injection, ground bounce, and sampling windows can turn “measurements” into guesses. A strict trust ladder prevents symptom-as-root-cause mistakes.
Inputs Map (Evidence vs Clues)
Inputs must be grouped by how directly they prove a protection condition. “Symptoms” can guide triage, but should not be treated as evidence without correlation.
| Input class | Examples (isolation context) | Typical use |
|---|---|---|
| Hard internal flags | UVLO asserted, OT trip, OC/SC trip, DESAT trip, internal fault latch | Primary evidence for “protection fired” and “latch state” |
| Hardware pins & rails | PG/FAULT pin, nRESET, rail monitors, secondary feedback (isolated power) | Cross-check evidence; identify brownout windows and recovery gating |
| Firmware computed | ADC-based rail/temperature, state-machine timeouts, counter thresholds | Context + trend; must declare sampling window and validity |
| External symptoms | CRC spikes, timeout bursts, link flap, “sporadic reset” reports | Clues only; require correlation to a stronger cause signal |
Rule: Symptom-class inputs must not directly escalate latch class without corroboration (hard flag, pin-level evidence, or a verified correlation tag).
Trust Ladder (Credibility Ranking)
A ranking is required because not all inputs survive common-mode stress equally. Higher rungs prove “what happened”; lower rungs indicate “what looked wrong”.
- Rung 1 — Hard-latched internal flag: strongest evidence; closest to protection action.
- Rung 2 — Comparator/pin latched: strong but sensitive to layout and ground bounce.
- Rung 3 — Firmware computed: explainable but sampling-window dependent.
- Rung 4 — External symptom: triage clue; never a standalone root cause.
Minimum evidence entry (MEE): S2/S3/S4 events require an Event ID (fast path), monotonic time/sequence, and at least one high-credibility field (rung 1–2).
Anti-False-Trigger Toolbox (Strategy, Not Code)
False triggers usually originate from transient injection or sampling misalignment. Controls must be applied at the right layer and must leave audit traces.
- Blanking window: ignore detection for X µs after switching edges, start-up, or driver enable (prevents dv/dt injection spikes).
- Deglitch filter: require N samples / N µs persistence before confirming an event (prevents narrow glitches).
- Majority vote: confirm only if M-of-N samples agree inside a bounded time window (improves noise immunity).
- Correlation tags: record “dv/dt window”, “start-up phase”, or “brownout window” tags to explain false-looking events.
Audit rule: If a filter drops an event, increment a “suppressed counter” or store a summarized coalesced entry (evidence of event pressure).
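Although this section is about strategy, the interaction between blanking, M-of-N voting, and the suppressed counter is easier to review when written down. The sketch below is a host-side model under stated assumptions: the class name, the `blanking` argument, and the rule that a glitch counts as suppressed once it has aged out of the window unconfirmed are all illustrative choices, not requirements from the text.

```python
from collections import deque


class MajorityVoteFilter:
    """Confirm an event only when at least m of the last n samples are asserted."""

    def __init__(self, m, n):
        if not 0 < m <= n:
            raise ValueError("need 0 < m <= n")
        self.m = m
        self.window = deque(maxlen=n)
        self.suppressed_count = 0   # audit trace for dropped or masked events
        self._pending = False       # asserted samples seen, not yet confirmed

    def sample(self, raw, blanking=False):
        """Feed one raw sample; return True while the event is confirmed."""
        if blanking:
            if raw:
                self.suppressed_count += 1   # masked by the blanking window
            return False
        self.window.append(bool(raw))
        votes = sum(self.window)
        if votes >= self.m:
            self._pending = False            # confirmed, so not suppressed
            return True
        if votes == 0 and self._pending:
            self.suppressed_count += 1       # glitch aged out unconfirmed
            self._pending = False
        elif raw:
            self._pending = True
        return False


f = MajorityVoteFilter(m=3, n=5)
glitch = [f.sample(x) for x in (1, 0, 0, 0, 0, 0)]   # single-sample glitch
real = [f.sample(x) for x in (1, 1, 1)]              # persistent assertion
print(glitch, real, f.suppressed_count)
```

The single-sample glitch never confirms and leaves one count in `suppressed_count`, while the persistent assertion confirms on its third sample.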
H2-04 · Event Capture Pipeline (Fast Path vs Slow Path)
Split capture into two lanes so evidence survives worst-case conditions. The fast path records the fact of the event in bounded time; the slow path enriches context when the system is stable enough to do so.
Two-Lane Capture Concept (Bounded-Time Evidence)
The fast lane must succeed even during brownouts and reset storms. The slow lane improves diagnosis quality but must never block the fast lane.
- Fast path (ISR / hardware latch): Event ID + monotonic time/sequence + domain tag + minimal snapshot.
- Slow path (background): rail/temperature snapshots, mode/state ID, counters, and last reset cause (when available).
Fast-path constraint: capture is constant-time with an upper bound of X µs (placeholder) and must not perform heavy formatting or NVM writes.
Fast Path: Minimum Evidence Entry (MEE)
The minimum evidence entry must be sufficient to answer “what happened first” even if context enrichment fails.
- Required: Event ID, monotonic counter/time, domain (P/S), latch snapshot.
- Minimal snapshot: rail class tag (OK/LOW), reset window tag (YES/NO), and a compact “source” tag (flag/pin/symptom).
- Storm safety: bounded queue push + drop accounting when the queue is full.
Priority rule: retain first-occurrence and last-occurrence records for S3/S4 classes; drop low-severity repeats first during storms.
Slow Path: Context Enrichment (When Stable)
Context fields explain why a trip occurred and whether recovery is safe. Missing context is allowed, but the reason must be recorded.
- Context set: primary/secondary rails, temperature, operating mode/state ID, key counters (fault/retry/CRC), last reset cause.
- Correlation tags: dv/dt window tag, start-up phase tag, brownout window tag.
- Missing-context rule: record “context_unavailable = reset/brownout/overrun” rather than leaving silent gaps.
Schema rule: slow-path data is appended as optional TLVs (or equivalent) so the core record remains backward-compatible.
Loss Strategy (Overrun, Rate-Limit, Coalescing)
When event pressure exceeds capacity, the pipeline must preserve diagnostic value and leave auditable evidence of suppression.
- Buffer overrun handling: keep-last-critical (S3/S4), drop-low-severity first, and store dropped_count.
- Rate limit: cap repeated events to X per Y seconds while retaining first and last occurrences.
- Event coalescing: merge repeats into one entry with count + first_time + last_time (ideal for CRC/timeout storms).
Audit rule: suppression must be visible (dropped_count / coalesced_count), otherwise field triage cannot distinguish “quiet system” from “lost evidence”.
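The loss strategy can be modeled as a small bounded queue that coalesces repeats and accounts for every drop. This is a sketch of the policy only, assuming a hypothetical `EventQueue` class and dictionary-based entries; it uses linear searches for readability and is not meant to represent a constant-time fast-path implementation.

```python
class EventQueue:
    """Bounded queue with coalescing and drop accounting (policy model only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []
        self.dropped_count = 0
        self.coalesced_count = 0

    def push(self, event_id, severity, t):
        # Coalesce repeats: keep one entry with count + first_time + last_time.
        for entry in self.entries:
            if entry["event_id"] == event_id:
                entry["count"] += 1
                entry["last_time"] = t
                self.coalesced_count += 1
                return "coalesced"
        new = {"event_id": event_id, "severity": severity,
               "count": 1, "first_time": t, "last_time": t}
        if len(self.entries) < self.capacity:
            self.entries.append(new)
            return "stored"
        # Overrun: something is lost either way, so account for it.
        self.dropped_count += 1
        victim = min(self.entries, key=lambda e: e["severity"])
        if victim["severity"] < severity:
            self.entries.remove(victim)      # drop low severity first
            self.entries.append(new)
            return "stored_after_evict"
        return "dropped"


q = EventQueue(capacity=2)
print(q.push(0x10, 1, 100))   # stored
print(q.push(0x10, 1, 105))   # coalesced
print(q.push(0x20, 0, 110))   # stored
print(q.push(0x30, 3, 120))   # stored_after_evict
print(q.push(0x40, 0, 130))   # dropped
```

Because `dropped_count` and `coalesced_count` are exported with the queue, a reader can tell a quiet system from one that was shedding evidence.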
H2-05 · Latch & Clear Policy (State Machine You Can Audit)
Most field disputes come from inconsistent clear behavior. A black-box policy must define: when to latch, who can clear, how to prevent thrash, and which verification gates must pass before returning to service.
Latch Classes (Semantic Freeze)
Each Event ID must map to a latch class with stable meaning. Latch class changes across versions break field comparability and invalidate audit trails.
| Latch class | Meaning (auditable) | Clear trigger |
|---|---|---|
| Self-clear | Condition disappears; state can return automatically, but the record must retain first/last occurrence, duration, and repeat count. | Automatic exit + verify gate (optional) |
| SW-clear | Condition may disappear but recovery is not assumed. Explicit clear is required; must log clear authority, reason, and verification result. | Authorized clear request |
| Power-cycle-clear | Return to Normal requires a controlled power-cycle/reset action. Prevents silent recovery after serious protection trips. | Power-cycle + verify gate |
| Manual-service-clear | Highest strictness. Only service/factory authority can clear. Clear must be fully auditable and blocked if verification fails. | Service/factory clear + verification |
Freeze rule: Once published, an Event ID must not silently relax latch class. Only backward-compatible tightening is allowed (with explicit version notes).
Clear Authority (Who Can Clear What)
Clear actions must be treated as audit events. Authority defines permission scope without depending on any specific protocol or transport.
- Local HMI: operator-level actions for low-risk events (S0/S1, select S2). Must log an operator token (placeholder).
- Service tool: engineering/service authority for S2/S3 clear. Must log identity (placeholder), reason code, and verification result.
- Factory mode: controlled manufacturing authority. Must log factory-mode flag and time window tag.
Clear is a record: clear_time, clear_authority, clear_reason, clear_target_event, verify_gate_result (Pass/Fail) are mandatory for audited systems.
Anti-Thrash Controls (Cooldown, Retry Budget, Escalation)
Repeated recover-fail loops destroy evidence quality and can create unsafe oscillation. Controls must prevent rapid cycling and must be observable in logs.
- Cooldown: after clear, block repeated clears for X seconds/minutes (placeholder) to avoid immediate re-trigger loops.
- Retry budget: limit recovery attempts to Z times within Y minutes (placeholders) for each Event ID/class.
- Escalation: repeated trips (e.g., 3 times) upgrade the latch class to a hard latch (Power-cycle-clear or Manual-service-clear). Must log escalation_count and escalation_reason.
Audit rule: any blocked clear or escalation must write an explicit audit entry; silent blocking is not acceptable for field diagnosis.
Pass Criteria After Clear (Verification Gates: X / Y / N)
“Cleared” is not “recovered”. Returning to Normal requires gates that are measurable, repeatable, and suitable for acceptance testing.
- Margin gate: UV margin ≥ X; temperature ≤ Y; rails stable for N seconds (placeholders).
- Stability gate: no repeat of the same event for N minutes; symptom counters below X per window Y (placeholders).
- Functional gate: system returns to expected mode ID = X; key outputs/driver status = expected (placeholders).
State rule: only Verify-Pass transitions to Normal. Verify-Fail stays in RecoveryPending (or Degraded) and must be logged.
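The latch, clear, and verification rules in this section can be expressed as a small auditable state machine. The sketch below is one possible reading: the state names follow the text, but the authority ranking, the choice to escalate the required authority to "service", and the tuple format of the audit log are assumptions made for illustration.

```python
AUTHORITY_RANK = {"hmi": 0, "service": 1, "factory": 2}   # assumed ordering


class LatchStateMachine:
    def __init__(self, required_authority, cooldown_s, escalate_after):
        self.state = "Normal"
        self.required_authority = required_authority
        self.cooldown_s = cooldown_s
        self.escalate_after = escalate_after
        self.trip_count = 0
        self.last_clear_time = None
        self.audit = []   # every decision leaves a record

    def _log(self, kind, now, authority, detail):
        self.audit.append((kind, now, authority, detail))
        return kind == "clear_accepted"

    def trip(self, now):
        self.trip_count += 1
        self.state = "Latched"
        self.audit.append(("trip", now, None, self.trip_count))
        if self.trip_count == self.escalate_after:
            self.required_authority = "service"   # assumed escalation target
            self.audit.append(("escalation", now, None, "repeat_trips"))

    def request_clear(self, now, authority, reason):
        if self.state != "Latched":
            return self._log("clear_blocked", now, authority, "not_latched")
        if AUTHORITY_RANK[authority] < AUTHORITY_RANK[self.required_authority]:
            return self._log("clear_blocked", now, authority, "insufficient_authority")
        if (self.last_clear_time is not None
                and now - self.last_clear_time < self.cooldown_s):
            return self._log("clear_blocked", now, authority, "cooldown")
        self.last_clear_time = now
        self.state = "RecoveryPending"
        return self._log("clear_accepted", now, authority, reason)

    def verify(self, now, margin_ok, stability_ok, functional_ok):
        """Only Verify-Pass returns to Normal; Verify-Fail stays RecoveryPending."""
        if self.state != "RecoveryPending":
            return False
        passed = margin_ok and stability_ok and functional_ok
        self.audit.append(("verify", now, None, "Pass" if passed else "Fail"))
        if passed:
            self.state = "Normal"
        return passed
```

Blocked clears and escalations write explicit audit entries rather than failing silently, which is the property the audit rule asks for.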
H2-06 · Record Format: What to Store (Data Schema)
Standardized records enable cross-batch and cross-site analysis while controlling write cost. A schema must be versioned, backward-compatible, and integrity-checked so old records remain readable after OTA updates.
Core Schema (Minimum Evidence Fields)
Core fields must be sufficient to reconstruct “what happened, when, and under which latch/clear semantics”, even if optional context is missing.
- Header: schema_version, record_length, record_type (event/clear/escalation).
- Core evidence: Event ID, monotonic counter/sequence, timestamp (policy placeholder), domain tag (P/S/Both).
- Policy fields: latch class, power state tag, mode ID.
- Clear audit (for clear records): clear_authority, clear_reason, verify_result.
Fast-path alignment: core fields must be writable via the fast path without heavy formatting; optional context must never block core commit.
Optional Context (TLV Groups)
Optional fields provide diagnostic depth while keeping the base record compact and stable. Unknown groups must be skippable by older tools.
- Rails: primary/secondary snapshots, brownout window tag.
- Thermal: temperature snapshot and thermal state tag.
- Driver status: gate-driver flags (placeholder), UV/OT sub-flags.
- Link counters: CRC/timeout counters (placeholder), error bursts.
- Reset cause: last reset reason (if available).
- Correlation tags: dv/dt window, start-up phase.
Missing-context rule: if optional context cannot be captured, store context_unavailable_reason = reset/brownout/overrun (placeholder enumeration).
Versioning & Backward Compatibility
Versioning rules prevent “new firmware writes unreadable records”. Compatibility must be designed into the schema layout.
- Core freeze: core field order and semantics remain stable across versions.
- Append-only growth: new information is added as optional TLVs, not by reshaping the core.
- Graceful decode: unknown TLVs are skipped; unknown record_type is preserved as “unknown” rather than breaking parsing.
Compatibility gate: every OTA release must prove “old tool reads new records” and “new tool reads old records” (bidirectional decode).
Integrity & Auditability (CRC, Commit Markers, Counters)
Records must be verifiable in the presence of brownouts and partial writes. Integrity signals enable detection of corruption and missing evidence.
- CRC: computed over Header + Core + Optional; invalid CRC means “record not trusted”.
- Commit marker: explicit valid flag or equivalent to detect partial commits (placeholder).
- Monotonic counter: detects gaps (lost records) and supports ordering across resets.
- Pressure evidence: dropped_count and coalesced_count preserve auditability under event storms.
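A frozen core with append-only TLVs and a trailing CRC can be prototyped on the host before committing to a firmware layout. The byte layout, field widths, TLV type codes, and the use of CRC-32 below are all assumptions chosen for the sketch; the text does not prescribe them.

```python
import struct
import zlib

HEADER = struct.Struct("<BBH")   # schema_version, record_type, record_length
CORE = struct.Struct("<HIIBB")   # event_id, boot_id, seq, domain, latch_class
CRC = struct.Struct("<I")        # CRC-32 over header + core + optional TLVs
KNOWN_TLVS = {0x01: "rail_mv", 0x02: "temp_c"}   # hypothetical type codes


def encode(event_id, boot_id, seq, domain, latch_class, tlvs=()):
    body = CORE.pack(event_id, boot_id, seq, domain, latch_class)
    for tlv_type, value in tlvs:
        body += struct.pack("<BB", tlv_type, len(value)) + value
    length = HEADER.size + len(body) + CRC.size
    record = HEADER.pack(1, 0, length) + body
    return record + CRC.pack(zlib.crc32(record))


def decode(record):
    """Return a dict for a trusted record, or None if it cannot be trusted."""
    if len(record) < HEADER.size + CORE.size + CRC.size:
        return None
    version, record_type, length = HEADER.unpack_from(record, 0)
    if length != len(record):
        return None
    end = length - CRC.size
    (crc,) = CRC.unpack_from(record, end)
    if zlib.crc32(record[:end]) != crc:
        return None
    event_id, boot_id, seq, domain, latch = CORE.unpack_from(record, HEADER.size)
    out = {"schema_version": version, "record_type": record_type,
           "event_id": event_id, "boot_id": boot_id, "seq": seq,
           "domain": domain, "latch_class": latch,
           "tlvs": {}, "skipped_tlvs": 0}
    pos = HEADER.size + CORE.size
    while pos + 2 <= end:
        tlv_type, n = struct.unpack_from("<BB", record, pos)
        if pos + 2 + n > end:
            break                      # truncated TLV: stop, core stays valid
        value = record[pos + 2:pos + 2 + n]
        pos += 2 + n
        if tlv_type in KNOWN_TLVS:
            out["tlvs"][KNOWN_TLVS[tlv_type]] = value
        else:
            out["skipped_tlvs"] += 1   # graceful decode of unknown groups
    return out
```

An older tool built with a smaller `KNOWN_TLVS` table still decodes the core and simply reports how many groups it skipped, which is the behavior the compatibility gate is meant to prove.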
H2-07 · Storage & Power-Loss Survivability (NVM Ring Buffer)
The most damaging black-box failure is losing the last critical fault during power loss or reset. A robust design uses fixed-layout NVM slots, atomic commit, integrity checks, and storm-safe write controls.
Failure Modes & Design Goals
Survivability requires explicitly handling partial writes, brownout reset loops, event storms, and endurance limits. Storage must expose evidence quality (valid vs partial) rather than hiding uncertainty.
- Last-entry loss: partial record without commit; must be detectable (commit missing / CRC fail).
- Brownout reboot loop: repeated interrupted writes; must remain parseable and non-destructive.
- Event storms: repeated low-value events cannot overwrite first/last critical evidence; suppression must be auditable.
- Endurance exhaustion: write budget must be modeled; rate-limit and coalesce reduce wear.
Acceptance targets (placeholders): retain last S3/S4 record after power loss; distinguish empty/partial/valid; preserve first/last critical under storms.
Ring Buffer Layout (Fixed Slots, Pointers, Wrap)
A fixed-layout ring avoids filesystem complexity and ensures predictable recovery scanning. Slots spread wear naturally and keep the decoder simple.
- Slots: slot[0..K-1], each holds one record (header + payload) with fixed maximum size.
- write_ptr: advances monotonically and wraps; a generation counter distinguishes “new wrap” vs “old data”.
- read_ptr (optional): for export streaming, but recovery prefers scanning for valid commits.
- Metadata robustness: pointer metadata should be mirrored or versioned to recover the latest state.
Robustness rule: data scanning must succeed even if metadata is corrupted; metadata only accelerates boot-time indexing.
Atomicity & Integrity (Prepare/Commit, CRC, Magic)
Atomic commit ensures a record is either fully valid or clearly partial. Commit and CRC must be designed so that power loss never produces ambiguous “maybe valid” entries.
- Magic word: fast slot recognition to filter noise/erased patterns.
- Length + schema_version: bounded parsing, supports forward/backward decode.
- CRC: validates header + core + optional payload.
- Commit marker: last-step flag; only commit=1 + CRC pass is valid.
Recovery rule: boot scan treats commit=0 or CRC fail as partial; partial_count can be tracked to reveal interrupted write pressure.
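The prepare/commit sequence and the boot scan can be simulated with byte arrays standing in for NVM slots. In this sketch the slot size, magic value, commit byte, and field layout are hypothetical, a per-record sequence number plays the role of the generation counter, and 0xFF is assumed to be the erased pattern.

```python
import struct
import zlib

SLOT_SIZE = 32
MAGIC = 0xB10C                 # hypothetical magic word
COMMIT = 0xA5                  # hypothetical commit marker, written last
HDR = struct.Struct("<HIB")    # magic, seq, payload_len
ERASED = b"\xff" * SLOT_SIZE


def build_slot(seq, payload, committed=True):
    """Prepare writes header + payload + CRC; commit sets the final byte."""
    body = HDR.pack(MAGIC, seq, len(payload)) + payload
    body += struct.pack("<I", zlib.crc32(body))
    if len(body) > SLOT_SIZE - 1:
        raise ValueError("payload too large for slot")
    slot = bytearray(ERASED)
    slot[:len(body)] = body
    if committed:
        slot[-1] = COMMIT
    return bytes(slot)


def classify(slot):
    """Return ("empty" | "partial" | "valid", seq or None)."""
    if slot == ERASED:
        return "empty", None
    magic, seq, n = HDR.unpack_from(slot, 0)
    end = HDR.size + n
    if magic != MAGIC or end + 4 > SLOT_SIZE - 1 or slot[-1] != COMMIT:
        return "partial", None
    (crc,) = struct.unpack_from("<I", slot, end)
    if zlib.crc32(slot[:end]) != crc:
        return "partial", None
    return "valid", seq


def boot_scan(slots):
    """Find the newest valid record by scanning data only (no metadata)."""
    newest, partial_count = None, 0
    for index, slot in enumerate(slots):
        state, seq = classify(slot)
        if state == "partial":
            partial_count += 1
        elif state == "valid" and (newest is None or seq > newest[1]):
            newest = (index, seq)
    write_index = 0 if newest is None else (newest[0] + 1) % len(slots)
    return {"newest": newest, "write_index": write_index,
            "partial_count": partial_count}
```

Because the scan relies only on slot contents, it still works when pointer metadata is lost, and `partial_count` exposes how often writes were interrupted.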
Write Budget & Storm Behavior (Wear, Rate-Limit, Coalesce)
Write endurance is protected by controlling amplification and avoiding redundant writes. Storm behavior must preserve critical evidence and leave auditable suppression traces.
- Write amplification: WA ≈ ceil(record_bytes / page_bytes) (placeholders).
- Endurance model: total_records ≈ (slots × erase_cycles) / WA (placeholders).
- Storm controls: rate limit (X per Y), coalesce (count + first/last), and priority retention (keep first/last S3/S4).
- Audit evidence: dropped_count / coalesced_count recorded to show pressure, not silence.
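The two budget formulas above can be evaluated directly to sanity-check a sizing decision. The sketch reads WA as the number of pages programmed per record and "slots" as page-sized units, which is an interpretation rather than something the text states; every number in the usage lines is an illustrative placeholder, not a datasheet value.

```python
import math


def write_amplification(record_bytes, page_bytes):
    # WA ≈ ceil(record_bytes / page_bytes): pages touched per record.
    return math.ceil(record_bytes / page_bytes)


def total_records(slots, erase_cycles, wa):
    # total_records ≈ (slots × erase_cycles) / WA
    return (slots * erase_cycles) // wa


def lifetime_years(total, records_per_day):
    return total / (records_per_day * 365)


# Illustrative placeholders only.
wa = write_amplification(record_bytes=48, page_bytes=256)
budget = total_records(slots=1024, erase_cycles=100_000, wa=wa)
print(wa, budget, round(lifetime_years(budget, records_per_day=1000), 1))
```

Rate limiting and coalescing lower `records_per_day`, which is where most of the endurance margin comes from in practice.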
H2-08 · Timebase & Correlation (When You Have Two Grounds)
Isolation systems often have separate time domains on primary and secondary sides. Correlation must work without absolute time: preserve per-domain ordering with monotonic counters, and add minimal cross-domain markers when available.
Time Sources (Taxonomy)
Time policy must be declared so field analysis can interpret timestamps correctly. Each source has different reliability and correlation capability.
- MCU tick / monotonic counter: strongest for ordering; requires boot_id/epoch across resets (placeholders).
- RTC / wall clock: enables absolute timelines; validity must be logged (time_valid yes/no).
- External sync: improves cross-domain alignment; only source presence is declared here (no protocol details).
Minimum fields: time_policy, time_valid, boot_id, and per-domain t_local or seq_local (placeholders).
Dual-Domain Tagging (P/S)
Every record must identify which domain it belongs to. Cross-domain conclusions must not be made unless correlation quality is explicitly stated.
- domain_tag: P / S (and optional Both).
- t_local or seq_local: per-domain ordering that stays valid even without RTC.
- boot_id / epoch: differentiates ordering across power cycles.
- quality tags: brownout_window / startup_phase tags explain time anomalies (placeholders).
Ordering rule: within each domain, records must be totally ordered by (boot_id, seq_local) or equivalent.
Minimal Correlation Methods (No Protocol Dependence)
Cross-domain alignment can be achieved with lightweight markers. When markers are absent, correlation must fall back to a bounded uncertainty window.
- Shared marker event: record marker_id with tP and tS; compute offset_estimate = tS − tP (placeholders).
- Ping-pong correlation: record corr_seq and send/recv local times to estimate offset and delay window (placeholders).
- Correlation window: if exact alignment is not possible, store corr_window = W and constrain conclusions to ±W.
Field rule: if time_valid=no or no marker exists, cross-domain statements must include an uncertainty window (±W).
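For the ping-pong method, the offset and the uncertainty window can be estimated with the same four-timestamp arithmetic used by NTP. The function and argument names are hypothetical, the sign convention follows offset_estimate = tS − tP, and the sketch assumes the primary side initiates the exchange and that half the round-trip delay is an acceptable value for W.

```python
def estimate_offset(tp_send, ts_recv, ts_send, tp_recv):
    """Estimate the S-minus-P clock offset from one ping-pong exchange.

    tp_send / tp_recv are primary-side local times; ts_recv / ts_send are
    secondary-side local times for the same corr_seq.
    """
    offset = ((ts_recv - tp_send) + (ts_send - tp_recv)) / 2
    round_trip = (tp_recv - tp_send) - (ts_send - ts_recv)
    return {"offset_estimate": offset, "corr_window": round_trip / 2}


# Simulated exchange: true offset 1000 ticks, link delays of 3 and 5 ticks.
result = estimate_offset(tp_send=100, ts_recv=1103, ts_send=1110, tp_recv=115)
print(result)
```

With asymmetric link delays the estimate is off by at most half the round-trip delay, which is why that value is stored as `corr_window` and carried into any cross-domain statement.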
Field Reconstruction Workflow (From Order to Causality)
A consistent workflow turns two domain logs into a usable timeline without overstating certainty.
- Step 1: sort per-domain by (boot_id, seq_local).
- Step 2: locate markers (marker_id or corr_seq).
- Step 3: estimate offset_estimate and correlation quality (good/weak).
- Step 4: merge views; annotate uncertainty (corr_window) when alignment is weak.
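The four steps can be sketched as a single host-side function over exported records. The record keys mirror the field names used in this section, but the dictionary format, the `fallback_offset` and `fallback_window` parameters, and the use of marker spread as the correlation window are assumptions; the sketch also assumes all records come from one boot per domain so that `t_local` values are comparable.

```python
from statistics import median


def order_key(record):
    return (record["boot_id"], record["seq_local"])


def reconstruct(p_records, s_records, fallback_offset, fallback_window):
    # Step 1: sort per domain by (boot_id, seq_local).
    p = sorted(p_records, key=order_key)
    s = sorted(s_records, key=order_key)

    # Step 2: locate shared markers.
    p_markers = {r["marker_id"]: r["t_local"] for r in p if "marker_id" in r}
    offsets = [r["t_local"] - p_markers[r["marker_id"]]
               for r in s if r.get("marker_id") in p_markers]

    # Step 3: estimate offset (tS - tP) and correlation quality.
    if offsets:
        offset, quality = median(offsets), "good"
        corr_window = max(offsets) - min(offsets)   # spread between markers
    else:
        offset, quality = fallback_offset, "weak"
        corr_window = fallback_window

    # Step 4: merge into the primary timebase and annotate uncertainty.
    timeline = [dict(r, domain="P", t_aligned=r["t_local"]) for r in p]
    timeline += [dict(r, domain="S", t_aligned=r["t_local"] - offset) for r in s]
    timeline.sort(key=lambda r: (r["t_aligned"], r["domain"]))
    return {"quality": quality, "offset_estimate": offset,
            "corr_window": corr_window, "timeline": timeline}
```

When the result is tagged "weak", any ordering between P and S records closer together than `corr_window` should be reported as undetermined rather than as cause and effect.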
H2-09 · Integrity & Anti-Tamper (Make Logs Trustable)
A black-box log must remain trustworthy under field pressure. Trustability comes from explicit integrity checks, auditable access/clear control, and privacy-minimized data that avoids sensitive user content.
Integrity Primitives (CRC → Hash Chain → Signature)
Integrity should be layered so the system can start with strong basics and optionally add stronger anti-tamper evidence. Each layer must produce visible evidence quality in the exported report.
- Record CRC (mandatory): detects corruption and partial writes; CRC failure must be reported as invalid evidence, not silently dropped.
- Hash chain (optional): detects insertion/deletion/replacement by linking records; a chain break must be explicit in the report.
- Signature (optional, overview only): proves the record set is device-originated; only presence and verification result are logged here (no PKI details).
Evidence-quality fields (placeholders): record_trust(valid/invalid), crc_status, chain_break(yes/no), sig_present, sig_ver_result(pass/fail).
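The optional hash-chain layer can be illustrated with a few lines of host-side code. This sketch assumes SHA-256 and an all-zero genesis value, neither of which is specified in the text, and it reports the break position in a hypothetical `first_bad_index` field alongside `chain_break`.

```python
import hashlib

GENESIS = b"\x00" * 32   # assumed starting value for the first link


def chain_append(chain, record_bytes):
    """Append (record, digest) where digest covers the previous digest too."""
    prev = chain[-1][1] if chain else GENESIS
    digest = hashlib.sha256(prev + record_bytes).digest()
    chain.append((record_bytes, digest))


def verify_chain(chain):
    """Walk the chain and report the first link that does not verify."""
    prev = GENESIS
    for index, (record, digest) in enumerate(chain):
        if hashlib.sha256(prev + record).digest() != digest:
            return {"chain_break": "yes", "first_bad_index": index}
        prev = digest
    return {"chain_break": "no", "first_bad_index": None}


log = []
for rec in (b"UV trip", b"clear by service", b"verify pass"):
    chain_append(log, rec)
print(verify_chain(log))

del log[1]                 # simulate deletion of the clear record
print(verify_chain(log))
```

A chain on its own only makes edits visible to a reader who trusts the stored digests; someone able to rewrite the whole log could recompute it, which is the gap the signature layer is intended to close.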
Access Control Semantics (Read / Clear / Factory Gate)
Trustability is not only about data integrity. It also requires clear authority boundaries and audit trails for readout and clearing actions.
- Read permission: readout is allowed, but every export must be auditable (who, when, which session).
- Clear permission: clear actions must be gated by authority and recorded with reason and verification outcome.
- Factory-mode gate: factory-only actions must be detectable; operations outside the factory window should be flagged as policy violations.
Audit fields (placeholders): readout_time, readout_authority, session_id, clear_authority, clear_reason, clear_target, verify_result, factory_mode(on/off), factory_window_id.
Tamper Evidence in Practice (Leave Traces)
Field disputes often arise when suspicious behavior leaves no trace. The black-box should record attempts, denials, and mode transitions so the report can explain anomalies without guesswork.
- Denied attempts: denied reads/clears should increment counters with reasons (not just “no”).
- Mode transitions: entering/exiting service or factory modes must be recorded to preserve the evidence chain.
- Pressure artifacts: partial_count, dropped_count, and coalesced_count should be reported to distinguish system pressure from tampering.
Report rule: the exported report must display evidence-quality summary (CRC/chain/signature + pressure artifacts) before any fault conclusions.
Privacy Minimization (Only Engineering Evidence)
A trustworthy black-box should also be safe to export. Records should contain only engineering fields needed for diagnosis and compliance, and exclude sensitive user data.
- Allowed: event IDs, rail/temperature snapshots, counters, mode IDs, clear history, and verification outcomes.
- Disallowed: personal identifiers, content payloads, location data, or anything that can identify a user.
- Exception policy (placeholder): if identifiers are unavoidable, store minimized tokens and mark privacy_mode accordingly.
Compliance rule: schema and report templates should be reviewable for “no sensitive data fields” before production release.
H2-10 · Field Workflow: Diagnose → Decide → Recover (Closed Loop)
A black-box is useful only when it drives a repeatable field SOP. A closed-loop workflow preserves evidence, classifies severity, executes an action (continue/derate/service), and verifies stability before closing the case.
Step 1 — Readout (Preserve Evidence First)
Evidence must be preserved before any clearing or reboot actions. Readout itself should be auditable so the evidence chain remains intact.
- Read-before-clear rule: clear actions require prior readout, except safety emergencies (placeholder exception).
- Readout trace: record who exported logs and when (session_id, authority).
- Analysis window: freeze the case context for consistent triage (placeholder policy).
Minimal readout payload: event_summary, evidence_quality, last_fault_context, clear_history pointer (placeholders).
Step 2 — Triage (Classify with Rules)
Classification should be rule-driven to avoid disputes caused by subjective symptoms. Decisions should reference verification gates and evidence quality.
- Continue: low severity and stability gates pass (no repeat for N minutes, placeholder).
- Derate: trending risk without hard latch; outputs a derate_level and recheck time (placeholders).
- Return/Service: hard latch or escalation, or evidence_quality is insufficient (CRC/chain issues).
Triage rule: “looks fine” is not a criterion; decisions must cite verify gates (X/Y/N placeholders) and evidence quality.
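Rule-driven triage can be written as a pure function that returns both the decision and the reasons it cites. The thresholds here are assumptions for illustration: "low severity" is taken as S0/S1, and "hard latch" is taken to mean the Power-cycle-clear and Manual-service-clear classes.

```python
HARD_LATCH = ("power_cycle_clear", "manual_service_clear")   # assumed mapping


def triage(severity, latch_class, escalated, evidence_ok,
           verify_pass, recurrence_count):
    """Return (action, reasons); reasons cite gates and evidence quality."""
    reasons = []
    if not evidence_ok:
        reasons.append("evidence_quality insufficient")
    if escalated:
        reasons.append("escalation recorded")
    if latch_class in HARD_LATCH:
        reasons.append("hard latch class")
    if reasons:
        return "service", reasons
    if severity <= 1 and verify_pass and recurrence_count == 0:
        return "continue", ["verify gates passed", "no recurrence in window"]
    if not verify_pass:
        reasons.append("verify gate not passed")
    if recurrence_count > 0:
        reasons.append("recurrence within window")
    if severity > 1:
        reasons.append("severity above S1")
    return "derate", reasons


print(triage(1, "self_clear", False, True, True, 0))
print(triage(2, "sw_clear", False, True, True, 1))
print(triage(3, "manual_service_clear", False, True, True, 0))
```

Returning the reasons with the action gives the disposition record something concrete to store, so a later reviewer can see which gate drove the decision.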
Step 3 — Action (Continue / Derate / Service)
Actions must be traceable and consistent. Each action writes a disposition record so future analysis can distinguish operational policy from device behavior.
- Continue: requires verify pass + recurrence_check=0 within N minutes (placeholders).
- Derate: applies a bounded operating limit and schedules a recheck (placeholders).
- Service: triggers a service_required_code and prevents repeated unsafe clears (placeholders).
Disposition fields (placeholders): action_code, derate_level, service_required_code, operator_ack.
Verification & Close (Make Disputes Converge)
A case is closed only after verification and stability checks. This prevents recurring disputes about “cleared but not fixed” and makes reset storms diagnosable.
- Clear history: every clear must include authority, reason, and verification result.
- Stability gate: recurrence_check must remain 0 for N minutes (placeholder) after recovery.
- Reset storm evidence: boot_id + reset_cause + partial_count distinguishes power issues from missing logs.
- Symptom mismatch rule: event ID + latch class + correlation window (±W) overrides symptom-only narratives.
H2-11 · Engineering Checklist (Design → Bring-up → Production)
Design Gate (Definition Freeze)
- Freeze the event dictionary: stable Event ID, severity, latch class, clear condition, debounce, timestamp policy.
- Freeze the latch/clear state model: who can clear, when clear is allowed, what must be verified after clear.
- Write budget & survivability: size the ring buffer for worst-case burst + rate-limit policy for event storms.
- Authority model: separate read vs clear permissions; factory/service modes must be explicit and logged.
- Schema versioning: schema_version + backward-compatible decoding rules to survive OTA updates.
NVM for ring buffer: MB85RS64V (SPI FRAM), FM25V10-G (SPI F-RAM), MR25H256 (SPI MRAM), W25Q64JV (SPI NOR Flash).
Secure element (integrity / key storage): ATECC608B, SE050 family (e.g. SE050C2HQ1), OPTIGA Trust M SLS32AIA family, STSAFE-A110.
Timebase / supervisor: DS3231 (RTC), MCP79410 (RTC), PCF85063A (RTC), TPS3890 (voltage supervisor).
Bring-up Gate (Proof by Injection)
- Event injection: force UV/OT/SC/OC/DESAT paths and confirm “fast path” record appears within the expected response time.
- False-trigger hardening: verify deglitch/blanking/majority-vote assumptions under dv/dt and noisy return conditions.
- Power-loss test: kill power at worst timing and verify the last record is either fully committed or detectably partial (two-phase commit / CRC / magic word).
- Cross-version decode: confirm old records decode correctly after a firmware/schema change.
After clear, the system must satisfy UV margin ≥ X, temperature ≤ Y, and no repeat events for N minutes.
Power-loss survivability: “last-fault record present” in 100% of K random cut tests.
Production Gate (Tooling & Traceability)
- Fixture readout: manufacturing station can read the log and export a compact report.
- Identity binding: logs must be bound to a unit identity (serial/UID) so swaps are detectable.
- Clear policy enforcement: only approved modes can clear; every clear action generates a record (clear history is evidence).
- RMA workflow: define “do not clear” conditions for returns; evidence must survive handling.
Secure element UID / certificate anchor: ATECC608B, SE050, SLS32AIA family, STSAFE-A110.
NVM with unique ID option (when applicable): choose FRAM variants that provide serial-number capability (vendor-specific options).
H2-12 · Applications & Quick Pairings (Examples with Part Numbers)
Scenario A — Motor / Inverter (Fast protection, slow evidence)
Key faults to capture: DESAT/SC, gate UVLO, driver OT, repeated retry storms.
Latch class: hard latch for SC/DESAT; service-clear only after verification.
Must-store fields: Event ID, latch class, clear reason, driver status snapshot, power state, monotonic counter.
Isolated gate driver: UCC21750 or ADuM4135.
Isolated current sense / modulator: AMC1306 (ΔΣ modulator).
Ring-buffer NVM: MB85RS64V / FM25V10-G (FRAM) or MR25H256 (MRAM).
Supervisor/reset: TPS3890 (reset on brownout).
Integrity anchor (optional for evidence): ATECC608B.
Scenario B — BMS / HV Systems (Communication + droop + coalescing)
Key faults to capture: isolated comm drops, rail droop, repeated bus-off-like patterns, watchdog resets.
Latch class: SW-clear with cooldown; escalate to hard latch after repeated bursts (e.g., 3 bursts → hard latch).
Must-store fields: Event coalescing counters, cooldown state, last reset cause, domain tag (primary/secondary).
Isolated CAN-FD transceiver: ISO1042.
isoSPI / daisy-chain link: LTC6820 (isoSPI transceiver).
Ring-buffer NVM: FM25V10-G (high-endurance F-RAM) or MR25H256 (MRAM).
Integrity / identity: SE050 family or STSAFE-A110.
Scenario C — High-Precision Sampling (Clock/data integrity as evidence)
Key faults to capture: loss of clock integrity, sync slips, repeated re-locks, CRC/timeouts on isolated links.
Latch class: self-clear for transient slips; SW-clear for repeated slips with verification window.
Must-store fields: slip counters, last-good timestamp, record schema version, correlation window data.
Isolated LVDS (low jitter) for clock/data lanes: ADN4654 (up to ~1.1 Gbps) or ADN4650 (up to ~600 Mbps).
Ring-buffer NVM: MB85RS64V / FM25V10-G (fast writes, high endurance).
RTC/timebase option: DS3231 or PCF85063A.
Scenario D — Medical / Service Port (Evidence retention + clear authority)
Key faults to capture: port attach/detach anomalies, overcurrent events, abnormal resets during service.
Latch class: manual-service-clear for safety-related faults; clear must be logged and reproducible.
Must-store fields: clear authority, service session ID, last 3 clears, verification results after clear.
Integrity/identity: ATECC608B / STSAFE-A110 / SE050 family.
Ring-buffer NVM: FM25V10-G (endurance) or W25Q64JV (serial NOR flash: more capacity, but erase-before-write and limited erase endurance call for a careful commit design).
Supervisor/reset: TPS3890.
H2-13 · FAQs
These FAQs cover field acceptance and dispute closure for fault black-box systems. Each answer uses a fixed 4-line structure (likely cause, quick check, fix, pass criteria) with numeric placeholders (X/Y/N) that must be set per design.
UV events appear, but measured rail never dips—false trigger or sampling window?
Likely cause: UV flag is from a fast internal comparator or ground-bounce injection, while the external measurement misses a sub-µs dip.
Quick check: Compare UV flag timestamp to the highest-rate rail snapshot and record dip-duration estimate; check deglitch/blanking settings.
Fix: Require a qualified dip window (deglitch + blanking) and store min-rail + dip-duration in the record for every UV event.
Pass criteria: UV is valid only if rail < X V for > Y µs and the dip correlates within ±N ms of the UV flag.
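A minimal sketch of the dip qualification above, assuming the rail is available as evenly spaced samples. `qualify_uv`, `UvRecord`, and all threshold values are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class UvRecord:
    min_rail_v: float       # lowest sample inside the longest dip
    dip_duration_us: float  # length of the longest contiguous dip


def qualify_uv(samples_v: Sequence[float], sample_period_us: float,
               threshold_v: float, min_duration_us: float) -> Optional[UvRecord]:
    """Return a record only if the rail stayed below threshold_v for
    longer than min_duration_us; otherwise return None."""
    best_len, best_min = 0, None
    run_len, run_min = 0, None
    for v in samples_v:
        if v < threshold_v:
            run_len += 1
            run_min = v if run_min is None else min(run_min, v)
            if run_len > best_len:
                best_len, best_min = run_len, run_min
        else:
            run_len, run_min = 0, None
    duration = best_len * sample_period_us
    if best_len == 0 or duration <= min_duration_us:
        return None
    return UvRecord(min_rail_v=best_min, dip_duration_us=duration)
```

The returned record carries the min-rail and dip-duration fields that the fix asks to store with every UV event.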
OT flagged, but casing temp is normal—sensor location or internal junction estimate?
Likely cause: OT is based on internal junction estimation or a sensor location with thermal lag; casing temperature is not the trip reference.
Quick check: Log and compare T_die_est vs T_board (or NTC) and check OT persistence time (filter window).
Fix: Store both temperatures in the record and enforce persistence filtering to prevent short spikes from causing OT latches.
Pass criteria: OT triggers only if T_die_est ≥ X °C for ≥ Y ms, and Δ(T_die_est − T_board) is ≤ N °C or explicitly reported.
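The persistence rule can be sketched as a small stateful filter. The class name, the per-sample `update` interface, and the numbers are assumptions; the record simply keeps both temperatures and their difference, as the fix suggests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OtRecord:
    t_die_est_c: float
    t_board_c: float
    delta_c: float  # T_die_est - T_board at trip time


class OtPersistenceFilter:
    """Trips only after T_die_est stays at or above trip_c for persist_ms."""

    def __init__(self, trip_c: float, persist_ms: float):
        self.trip_c = trip_c
        self.persist_ms = persist_ms
        self._above_ms = 0.0

    def update(self, t_die_est_c: float, t_board_c: float,
               dt_ms: float) -> Optional[OtRecord]:
        if t_die_est_c >= self.trip_c:
            self._above_ms += dt_ms
        else:
            self._above_ms = 0.0  # any sample below the trip point restarts the window
        if self._above_ms >= self.persist_ms:
            return OtRecord(t_die_est_c, t_board_c, t_die_est_c - t_board_c)
        return None
```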
SC/OC storms only during high dv/dt—real fault or CM injection?
Likely cause: Common-mode injection or sense-threshold disturbance during switching edges creates false SC/OC detections.
Quick check: Correlate event bursts with switching phase markers and check if events cluster within N µs after edges; review blanking config.
Fix: Add edge-blanking + deglitch and require consecutive-sample qualification; keep storm counters (burst size, first/last time).
Pass criteria: At dv/dt ≥ X kV/µs, false SC/OC rate ≤ Y per minute and SC/OC requires ≥ N consecutive qualified samples.
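A sketch of edge blanking combined with consecutive-sample qualification. The function name and the choice to skip blanked samples (so they neither count nor reset the run) are assumptions; a design that resets the run on blanking is equally plausible.

```python
from typing import Iterable, Sequence, Tuple


def sc_qualified(samples: Iterable[Tuple[float, bool]],
                 edge_times_us: Sequence[float],
                 blank_us: float, n_required: int) -> bool:
    """samples: time-ordered (t_us, over_threshold) pairs.
    Returns True once n_required consecutive non-blanked samples are over threshold."""
    count = 0
    for t_us, over in samples:
        if any(0 <= t_us - edge < blank_us for edge in edge_times_us):
            continue  # inside a blanking window after a switching edge: ignore
        count = count + 1 if over else 0
        if count >= n_required:
            return True
    return False
```

Detections that appear only inside the blanking windows never qualify, while a sustained overcurrent outside them still trips after `n_required` samples.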
Fault clears itself and returns—should this be escalated to hard latch?
Likely cause: The fault is configured as self-clear or cooldown is too short, causing repetitive recover/relapse cycles.
Quick check: Count repeats per event ID within a fixed window and verify escalation rules are enabled and logged.
Fix: Escalate from self-clear → SW-clear → hard latch after repeated bursts; enforce cooldown and record escalation reason.
Pass criteria: If repeats ≥ N within Y minutes, the system escalates to hard latch; in normal operation, repeats stay ≤ X per window (with X < N).
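The escalation ladder can be sketched as a sliding-window counter. This version moves up one latch class each time the repeat threshold is reached, which is one possible reading of the rule; `Escalator`, `LatchClass`, and the numbers are hypothetical.

```python
from collections import deque
from enum import IntEnum


class LatchClass(IntEnum):
    SELF_CLEAR = 0
    SW_CLEAR = 1
    HARD_LATCH = 2


class Escalator:
    def __init__(self, max_repeats: int, window_s: float):
        self.max_repeats = max_repeats
        self.window_s = window_s
        self.level = LatchClass.SELF_CLEAR
        self.history = []      # (time_s, new_level, reason): escalation is evidence too
        self._times = deque()  # fault times inside the current window

    def on_fault(self, now_s: float) -> LatchClass:
        self._times.append(now_s)
        while now_s - self._times[0] > self.window_s:
            self._times.popleft()  # drop faults that fell out of the window
        if len(self._times) >= self.max_repeats and self.level < LatchClass.HARD_LATCH:
            self.level = LatchClass(self.level + 1)
            reason = f"{len(self._times)} repeats within {self.window_s} s"
            self.history.append((now_s, self.level, reason))
            self._times.clear()  # start counting again at the new level
        return self.level
```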
After clear, the system fails again within minutes—what verification gate is missing?
Likely cause: Clear was allowed without verifying recovery conditions (rail margin, temperature margin, stability window).
Quick check: Inspect the clear history and verify_result fields; confirm whether gates were evaluated and recorded.
Fix: Block clear until verify gates pass, then enforce a post-clear stability window with recurrence monitoring.
Pass criteria: Post-clear requires UV_margin ≥ X mV, Temp ≤ Y °C, and no recurrence for N minutes.
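A sketch of the two checks involved: gates evaluated before the clear is allowed, and a stability window evaluated afterward. Function names, dictionary keys, and limits are illustrative assumptions.

```python
def evaluate_gates(uv_margin_mv, temp_c, min_margin_mv, max_temp_c):
    """Pre-clear gates. The returned dict is meant to be stored as verify_result."""
    result = {
        "uv_margin_ok": uv_margin_mv >= min_margin_mv,
        "temp_ok": temp_c <= max_temp_c,
    }
    result["verify_pass"] = all(result.values())
    return result


def stable_after_clear(clear_time_min, fault_times_min, now_min, window_min):
    """Post-clear gate: True only once the window has elapsed with no recurrence."""
    if now_min - clear_time_min < window_min:
        return False  # window still open, verdict not available yet
    window_end = clear_time_min + window_min
    return not any(clear_time_min < t <= window_end for t in fault_times_min)
```

Recording the per-gate booleans, not only the overall verdict, shows which margin was missing when a clear is refused.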
Logs lost after power cycle—first suspect NVM commit or brownout window?
Likely cause: Power drops during write, commit flag is not atomic, or brownout reset occurs inside the commit window.
Quick check: Check partial_count/commit flags and reproduce with controlled power-cuts; confirm supervisor threshold vs hold-up.
Fix: Use two-phase commit (prepare/commit) with CRC and a commit marker; treat incomplete writes as partial records and count them (never drop them silently).
Pass criteria: Last-fault record survives N random power-cuts; partial records ≤ X% and commit window ≤ Y ms.
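The two-phase commit can be modeled on a host with a byte array standing in for one NVM slot. The slot layout, marker value, and the `cut_after` power-loss hook are assumptions made for this sketch.

```python
import struct
import zlib
from typing import Optional

ERASED = 0xFF
COMMIT_MARKER = 0xA5
PAYLOAD_LEN = 16
SLOT_LEN = PAYLOAD_LEN + 4 + 1  # payload + CRC-32 + commit marker


def prepare(slot: bytearray, payload: bytes, cut_after: Optional[int] = None) -> bool:
    """Phase 1: write payload + CRC. cut_after simulates power loss after that many bytes."""
    if len(payload) != PAYLOAD_LEN:
        raise ValueError("payload must be PAYLOAD_LEN bytes")
    data = payload + struct.pack("<I", zlib.crc32(payload))
    n = len(data) if cut_after is None else min(cut_after, len(data))
    slot[:n] = data[:n]
    return n == len(data)


def commit(slot: bytearray) -> None:
    """Phase 2: a single-byte marker write makes the record valid."""
    slot[SLOT_LEN - 1] = COMMIT_MARKER


def classify(slot: bytearray) -> str:
    """Return 'valid', 'empty', or 'partial'; partial slots are counted, never hidden."""
    payload = bytes(slot[:PAYLOAD_LEN])
    (crc,) = struct.unpack("<I", slot[PAYLOAD_LEN:PAYLOAD_LEN + 4])
    if slot[SLOT_LEN - 1] == COMMIT_MARKER and zlib.crc32(payload) == crc:
        return "valid"
    if all(b == ERASED for b in slot):
        return "empty"
    return "partial"
```

Sweeping `cut_after` from 0 to `SLOT_LEN` in a test loop gives a host-side rehearsal of the controlled power-cut check before running it on hardware.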
Two grounds, two timelines—how to correlate primary vs secondary events fastest?
Likely cause: Missing domain tags or correlation markers causes ambiguous ordering between primary and secondary logs.
Quick check: Confirm both domains record domain_tag, boot_id, and monotonic counters; verify at least one correlation marker exists per session.
Fix: Store domain_tag (P/S), boot_id, seq_local, and correlation window markers; compute and log an offset estimate between the primary and secondary timebases.
Pass criteria: Correlation window is ±X ms, offset_estimate error ≤ Y ms, and ordering is consistent for ≥ N paired events.
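One way to compute the offset estimate is the median difference over paired correlation markers, then a merge onto the primary timeline. The marker format, function names, and the use of a median are assumptions for this sketch.

```python
from statistics import median


def estimate_offset_ms(marker_pairs):
    """marker_pairs: (t_primary_ms, t_secondary_ms) for the same physical marker.
    The median tolerates an occasional badly timed marker."""
    return median([p - s for p, s in marker_pairs])


def merge_timelines(primary, secondary, offset_ms):
    """primary/secondary: lists of (local_time_ms, event_id).
    Returns (time_on_primary_axis, domain_tag, event_id), sorted by time."""
    merged = [(t, "P", e) for t, e in primary]
    merged += [(t + offset_ms, "S", e) for t, e in secondary]
    return sorted(merged, key=lambda row: row[0])
```

In the example below, a secondary-side DESAT at local time 690 ms lands at 1490 ms on the primary axis, just before a primary-side UVLO at 1500 ms:

```python
offset = estimate_offset_ms([(1000, 200), (2001, 1200), (3000, 2201)])
timeline = merge_timelines([(1500, "UVLO")], [(690, "DESAT")], offset)
```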
Field says “reset fixed it” but no evidence—how to enforce non-erasable last-fault?
Likely cause: Boot flow wipes logs, or last-fault is stored in volatile buffers without a protected persistent slot.
Quick check: Verify whether boot clears the ring or reinitializes pointers; confirm a protected last-fault slot exists and is populated.
Fix: Reserve a non-erasable last-fault record area (append-only) and log every clear/reset action with authority and reason.
Pass criteria: Last-fault persists across N resets and power cycles; unauthorized_wipe count = 0 and last-fault age ≤ X days.
Event rate exceeds buffer—merge policy or rate limit first?
Likely cause: Burst rate exceeds record bandwidth, and the system lacks coalescing or rate limiting.
Quick check: Compare event_rate to max_event_rate and inspect dropped_count/coalesced_count; verify record size and optional fields policy.
Fix: Coalesce first (keep first/last + count) for repeated events, then apply rate limiting with priority rules for severe events.
Pass criteria: At X events/s for Y s, dropped_count ≤ N and first/last records are preserved for each burst.
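The coalesce-then-limit order can be sketched as two passes. `Burst`, `coalesce`, `rate_limit`, the gap rule, and the severity table are hypothetical; a real design would run this incrementally rather than on a list.

```python
from dataclasses import dataclass


@dataclass
class Burst:
    event_id: int
    first_ms: int
    last_ms: int
    count: int = 1


def coalesce(events, gap_ms):
    """events: time-ordered (t_ms, event_id). Repeats of the same ID that arrive
    within gap_ms of the previous occurrence are merged into one burst."""
    open_bursts = {}
    out = []
    for t_ms, event_id in events:
        burst = open_bursts.get(event_id)
        if burst is not None and t_ms - burst.last_ms <= gap_ms:
            burst.last_ms = t_ms
            burst.count += 1
        else:
            burst = Burst(event_id, t_ms, t_ms)
            open_bursts[event_id] = burst
            out.append(burst)
    return out


def rate_limit(bursts, budget, severity):
    """Keep at most `budget` bursts, most severe first. Returns (kept, dropped_count)."""
    ranked = sorted(bursts, key=lambda b: severity.get(b.event_id, 0), reverse=True)
    kept = sorted(ranked[:budget], key=lambda b: b.first_ms)
    return kept, max(0, len(bursts) - budget)
```

Because coalescing runs first, each surviving burst still carries its first time, last time, and count even when the limiter has to drop lower-severity bursts.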
Factory wants a single pass/fail metric—what should be the acceptance criteria?
Likely cause: Acceptance is undefined, so different operators/labs interpret “good” differently.
Quick check: Produce a one-page report that includes evidence_quality, verify_result, recurrence_check, and top-severity summary.
Fix: Define pass/fail as a deterministic function: evidence_quality_ok AND verify_pass AND recurrence=0 within a fixed window.
Pass criteria: Pass if evidence_quality_ok=1, verify_pass=1, and recurrence=0 for N minutes; metric repeatability ≥ X% over Y runs.
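The deterministic verdict is a plain boolean function of the three inputs. The repeatability measure below (share of runs that agree with the majority verdict) is an assumed definition, since the text leaves the metric open.

```python
def unit_passes(evidence_quality_ok: bool, verify_pass: bool, recurrence_count: int) -> bool:
    """Pass only if evidence is complete, verification passed, and nothing recurred."""
    return bool(evidence_quality_ok and verify_pass and recurrence_count == 0)


def repeatability_pct(verdicts) -> float:
    """Percentage of runs that agree with the majority verdict (assumed definition)."""
    passes = sum(1 for v in verdicts if v)
    return 100.0 * max(passes, len(verdicts) - passes) / len(verdicts)
```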
Same unit, different lab behavior—what minimum context fields are usually missing?
Likely cause: Reports lack minimum context (mode, load, temperature, dv/dt class, firmware/schema versions), so results cannot be normalized.
Quick check: Validate that required fields are present in the report and that schema_version/firmware_version are consistent.
Fix: Freeze a “required fields” set and refuse “final reports” when completeness is below threshold; log missing_fields_count.
Pass criteria: Required fields completeness = 100% for N required fields; cross-lab variance ≤ X% under the same declared conditions (±Y).
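A completeness check over a frozen required-fields set might look like the sketch below. The field names are taken from the context list above but their exact spelling, and the `check_report` interface, are assumptions.

```python
REQUIRED_FIELDS = (
    "mode", "load", "temperature_c", "dvdt_class",
    "firmware_version", "schema_version",
)


def check_report(report: dict) -> dict:
    """Count missing required fields and refuse 'final' status unless all are present."""
    missing = [name for name in REQUIRED_FIELDS if report.get(name) is None]
    present = len(REQUIRED_FIELDS) - len(missing)
    return {
        "missing_fields": missing,
        "missing_fields_count": len(missing),
        "completeness_pct": 100.0 * present / len(REQUIRED_FIELDS),
        "final_allowed": not missing,
    }
```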
Audit asks who cleared the latch—how to design clear authority + traceability?
Likely cause: Clear actions are not authenticated or not logged with authority/session identifiers.
Quick check: Confirm every clear action produces a clear-history record containing clear_authority and session_id.
Fix: Enforce clear API gating (authority required) and write an auditable clear record including reason and post-clear verification result.
Pass criteria: clear_authority present in 100% of clear events; unauthorized_clear count = 0; traceability available within X minutes for ≥ N past clears.
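Clear gating with an audit trail can be sketched as a controller that logs every request, granted or not. `LatchController`, the authority strings, and the record keys are hypothetical; authentication of the caller is assumed to happen before this point.

```python
class LatchController:
    def __init__(self, allowed_authorities):
        self.allowed = frozenset(allowed_authorities)
        self.latched = True
        self.clear_history = []     # one record per request, including refusals
        self.rejected_attempts = 0  # requests from authorities not on the allow list

    def request_clear(self, authority, session_id, reason, verify_pass):
        authorized = authority in self.allowed
        granted = authorized and verify_pass
        if not authorized:
            self.rejected_attempts += 1
        self.clear_history.append({
            "seq": len(self.clear_history),
            "clear_authority": authority,
            "session_id": session_id,
            "reason": reason,
            "verify_result": verify_pass,
            "granted": granted,
        })
        if granted:
            self.latched = False
        return granted
```

Logging refusals alongside grants means an auditor can see attempted clears as well as who eventually released the latch.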