Diagnostics & Safety for CAN/LIN/FlexRay Transceivers
Core idea
Diagnostics & Safety at the vehicle bus interface means turning “the network feels unstable” into a specific, logged root cause and a deterministic safe-state reaction. The goal is a provable chain from signal → event → evidence → recovery, so faults are detectable, auditable, and serviceable in the field.
H2-1 · Definition & Scope: Diagnostics vs Safety at the Bus Interface
The bus interface becomes safety-critical when its failures can hide, amplify, or misreport system risk. This section defines diagnostics as measurable observability and localization, and safety as risk control with verifiable evidence.
Scope guard (avoid overlap)
Owns (deep coverage in this page)
- Fail-safe receive contracts: default outputs under faults, predictable safe behavior.
- Bus fault detection & reporting: fault flags, pins, registers, and evidence mapping.
- ASIL interface hooks: safety pins, safe-state control, testability-oriented design hooks.
- Fault-injection support: how detection and coverage are proven with repeatable tests.
Mentions only (1–2 sentences + internal link)
- Timing / sample-point / loop-delay budgets → link to “Data Rate & Timing”.
- EMC / protection components / layout → link to “EMC / Protection & Co-Design”.
- Selective wake / partial networking → link to “ISO 11898-6”.
- Controller scheduling / gateway / DoIP → link to “CAN Controller / Bridge”.
Diagnostics vs Safety (executable definitions)
Diagnostics answers: “What happened, where, and how often?” Safety answers: “What risk exists, what safe response is enforced, and how is it proven?”
| Dimension | Diagnostics (Observability) | Safety (Risk control + Evidence) |
|---|---|---|
| Primary goal | Detect + localize faults with measurable signals. | Enforce safe states and prove coverage and reaction time. |
| Signals | Fault flags, pins, counters, status registers, snapshots. | Safety hooks, safe-state control paths, test modes, evidence logs. |
| Pass criteria | Stable detection, low false positives, consistent labeling. | Defined safe outputs, bounded detection time, verified coverage. |
| Typical failure mode | “Fault seen once” with no root-cause evidence package. | Safe state triggered, but coverage/time/evidence cannot be demonstrated. |
Note: a “passing waveform” does not guarantee safety; safety requires explicit contracts and evidence-ready verification.
Safety boundary at the bus interface (who owns what)
The bus interface safety boundary is defined by detection location, reporting channel, and safe-state enforcement. Responsibility is split across blocks to prevent single-point blind spots.
- Transceiver/PHY: closest to harness; best for physical fault flags, fail-safe receive outputs, and immediate protection reactions.
- Controller: protocol-layer counters and bus-off state (referenced only; detailed timing/counters belong to the Controller page).
- SBC (if present): power/reset/watchdog hooks that bound unsafe behavior during undervoltage or MCU reset windows.
- MCU safety monitor: system decision and evidence packaging (event snapshots, DTC mapping, recovery policies).
Required evidence deliverables (requirements → tests → logs)
- Requirements map: fault list coverage + detection time window (X) + safe reaction + safe output contract.
- Verification map: fault injection method + expected detection signals + pass criteria (threshold X / time Y).
- Evidence log schema: event ID + timestamp + mode + V/T snapshot + pin/reg/counter snapshot + action taken.
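As a sketch, the evidence-log schema above can be expressed as a typed record with a completeness check. All names here (`EvidenceRecord`, `v_supply_mv`, etc.) are illustrative placeholders, not a device or standard API:

```python
from dataclasses import dataclass, asdict

# Hypothetical record mirroring the schema: event ID + timestamp + mode
# + V/T snapshot + pin/reg/counter snapshot + action taken.
@dataclass(frozen=True)
class EvidenceRecord:
    event_id: str      # stable fault ID from the taxonomy, e.g. "P-01"
    timestamp_us: int  # monotonic timestamp at detection
    mode: str          # transceiver mode when the event fired
    v_supply_mv: int   # supply-voltage snapshot
    temp_c: int        # temperature snapshot
    fault_bits: int    # latched status-register bits (captured pre-clear)
    counters: dict     # protocol/timeout counter snapshot
    action: str        # safe-state reaction that was taken

REQUIRED_FIELDS = {"event_id", "timestamp_us", "mode", "v_supply_mv",
                   "temp_c", "fault_bits", "counters", "action"}

def is_complete(rec: EvidenceRecord) -> bool:
    """A record is audit-ready only if every schema field is populated."""
    d = asdict(rec)
    return REQUIRED_FIELDS <= d.keys() and all(v is not None for v in d.values())
```

Freezing the record type and checking completeness at log time is what keeps field events comparable across firmware versions.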
H2-2 · Fault Taxonomy: What Can Go Wrong (and What Must Be Detected)
A complete fault taxonomy is the “mother table” for detection, safe reactions, fault injection, and audit-ready evidence. Every later chapter must map back to the same fault IDs and the same evidence fields.
Taxonomy structure (three layers)
- Physical faults: wiring shorts/opens, stuck dominant/recessive, ground shift, thermal, under/over-voltage.
- Functional/control faults: TxD stuck, dominant-timeout behavior, silent-mode latch, mode-pin faults.
- Safety classification: single-point vs latent + detection time window (X) for each fault.
Each fault must be written as a testable contract: (fault condition) → (what detects) → (safe reaction) → (evidence fields).
Physical faults (testable, evidence-ready)
Physical faults are best detected close to the harness. The objective is not only “communication failed,” but a bounded root-cause class (short/open/thermal/UV/OV/ground shift) plus a consistent evidence package.
- Short-to-VBAT / short-to-GND: detect via transceiver fault flags/pins; react by enforcing safe output + inhibit where applicable; log fault ID + mode + V/T snapshot.
- CANH↔CANL short / line short: detect by differential abnormality + fault report; react by bounded recovery policy (X); log counters + pin/reg snapshot.
- Open line / floating bus: detect via fail-safe receive behavior and/or line monitoring; react by safe default state; log fail-safe state entry and duration.
- Stuck dominant / stuck recessive: detect via dominant timeout and stuck detection; react by forcing safe mode; log timeout counter and detection time.
- Ground shift / common-mode excursion: detect by local monitoring (if available) and symptom correlation; react by safe-state + escalation; log supply/ground context (placeholder).
- Thermal / under-voltage / over-voltage: detect via built-in flags; react by entering defined safe state and controlled recovery; log temperature/supply snapshots and recovery reason.
Functional/control faults (bus-interface only)
Functional faults are constrained to the bus-interface control path. Firmware retry policies and gateway scheduling are out of scope.
- TxD stuck: detect by mismatch between commanded state and observed bus behavior; react by entering safe mode; log command context + status bits.
- Dominant-timeout behavior: detect by timeout flag and repeated triggers; react by controlled inhibit and recovery policy; log timeout count and time-to-detect (X).
- Silent-mode latch: detect by mode/status mismatch and lack of transmit effect; react by safe re-initialization sequence; log mode transitions + reset cause.
- Mode-pin fault (float/short): detect by pin readback (if available) or inconsistent mode entry; react by safe defaults with hardware biasing; log pin snapshot + boot phase.
Safety classification (single-point vs latent) + detection window
- Single-point fault: can directly create unsafe behavior → requires bounded detection time X and an explicit safe-state reaction.
- Latent fault: does not immediately cause unsafe behavior, but degrades safety mechanisms → requires periodic test or fault injection proof within interval Y.
- Detection window: expressed as time, frames, or cycles; derived from system hazard analysis and architecture assumptions (placeholders X/Y/Z).
Practical rule: classification must live inside the same fault table so verification and logs cannot drift from requirements.
Deliverable: Fault → Detection → Reaction → Evidence (starter template)
Use this table as the single source of truth. Add project-specific thresholds to pass criteria in later chapters.
| Fault ID | Fault | Detection (pin/bit/counter) | Reaction (safe-state) | Evidence fields (log) |
|---|---|---|---|---|
| P-01 | Short-to-VBAT / GND | Transceiver fault flag + ERR/INT | Enter safe mode; inhibit output if required | event_id, time, mode, V/T, fault_bits, action |
| P-02 | Open line / floating | Fail-safe receive state + status bit | Enforce safe default output contract | event_id, time, fail_safe_state, duration, V/T |
| F-01 | TxD stuck / command mismatch | Status mismatch + symptom correlation | Safe mode; controlled re-init sequence | event_id, time, cmd_state, mode, reg_snapshot |
| F-02 | Dominant-timeout triggered | Timeout flag + counter | Inhibit transmit; bounded recovery policy (X) | event_id, time, timeout_count, detect_time_X, action |
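The starter table can be kept machine-checkable so requirements, tests, and logs cannot drift apart. The sketch below mirrors the four rows above; field values are placeholders to be replaced with project thresholds:

```python
# Hypothetical single-source-of-truth fault table mirroring the starter rows.
FAULT_TABLE = {
    "P-01": {"fault": "Short-to-VBAT / GND",
             "detection": "transceiver fault flag + ERR/INT",
             "reaction": "enter safe mode; inhibit output if required",
             "evidence": ["event_id", "time", "mode", "V/T", "fault_bits", "action"]},
    "P-02": {"fault": "Open line / floating",
             "detection": "fail-safe receive state + status bit",
             "reaction": "enforce safe default output contract",
             "evidence": ["event_id", "time", "fail_safe_state", "duration", "V/T"]},
    "F-01": {"fault": "TxD stuck / command mismatch",
             "detection": "status mismatch + symptom correlation",
             "reaction": "safe mode; controlled re-init sequence",
             "evidence": ["event_id", "time", "cmd_state", "mode", "reg_snapshot"]},
    "F-02": {"fault": "Dominant-timeout triggered",
             "detection": "timeout flag + counter",
             "reaction": "inhibit transmit; bounded recovery policy",
             "evidence": ["event_id", "time", "timeout_count", "detect_time_X", "action"]},
}

def validate_fault_table(table):
    """Return the set of fault IDs missing any of the four contract fields."""
    required = {"fault", "detection", "reaction", "evidence"}
    return {fid for fid, row in table.items() if not required <= row.keys()}
```

Running the validator in CI (or at gate reviews) enforces the "testable contract" rule: no fault ID ships without detection, reaction, and evidence fields.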
H2-3 · Fail-Safe Receive: Default States, Biasing, and Safe Output Contracts
Fail-safe receive is a predictable output contract under invalid inputs. When the bus input becomes unreliable (floating, open, common-mode out of range), receiver outputs must converge to a defined safe state and remain stable.
Executable definition: invalid input → safe output
A bus input is treated as invalid when it cannot be trusted to represent correct symbols. The receiver must then enforce (1) a default safe RxD state, (2) a reporting behavior (ERR/INT where applicable), and (3) an evidence-ready log snapshot.
Invalid input set (examples)
- Floating / open: line disconnected or bias missing.
- Stuck dominant/recessive: bus forced to one level by a fault.
- Common-mode excursion: outside valid receiver window (placeholder X).
- Local supply boundary: undervoltage/overvoltage causing unreliable thresholds.
- Mode mismatch: silent/standby asserted unexpectedly during operation.
Scope guard: termination, reflections, and harness SI/EMC are out of scope here; only receiver-side contracts are defined.
Safe output contracts (CAN / LIN / FlexRay)
The contract is expressed as a small set of receiver outputs that upper layers can interpret deterministically. Use conservative defaults that prevent “valid-looking” symbols during invalid input windows.
| Bus | Invalid input example | Safe RxD default | Reporter | Notes |
|---|---|---|---|---|
| CAN | Open / floating | Recessive (or defined safe state) | ERR/INT + status bits | Avoid “valid-looking” toggles; log fail-safe entry. |
| LIN | Line open / missing bias | Defined idle-safe level | Wake/INT + status | Keep wake attribution separate from fault flags. |
| FlexRay | Channel fault / invalid receive | Defined safe output (per channel) | Status + fault pins | Ensure deterministic output under dual-channel behavior. |
Contract rules: (a) safe default must be explicit, (b) transitions must be debounced, (c) evidence must log entry/exit and snapshots.
Key parameter vocabulary (placeholders)
- Fail-safe threshold (X): receiver boundary that classifies input as invalid; expressed as a window or limit (placeholder X).
- Filtering / debounce (Y): minimum persistence for entry/exit to prevent chatter; expressed as time/frames (placeholder Y).
- Glitch immunity (Z): transient pulse width that must not flip safe-state; expressed as a max glitch duration (placeholder Z).
Pass-criteria templates (fill with project thresholds)
- Enter Fail-Safe Output within X after invalid input persists for Y.
- Do not flip state for glitches shorter than Z.
- Log a single event with snapshot fields; no event storm under boundary chatter.
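The three criteria above can be sketched as a sampled entry/exit filter; this is a minimal sketch assuming a fixed sampling tick, with Y (persistence) and Z (glitch immunity) as placeholder tick counts:

```python
class FailSafeFilter:
    """Debounced fail-safe entry/exit, evaluated once per sampling tick.

    Placeholder thresholds (project-specific):
      enter_ticks (Y): invalid input must persist this long to enter fail-safe
      glitch_ticks (Z): valid/invalid excursions shorter than this must not
                        flip the current state
    """
    def __init__(self, enter_ticks, glitch_ticks):
        self.enter_ticks = enter_ticks
        self.glitch_ticks = glitch_ticks
        self.invalid_run = 0   # consecutive invalid samples
        self.valid_run = 0     # consecutive valid samples
        self.fail_safe = False

    def sample(self, input_invalid: bool) -> bool:
        if input_invalid:
            self.invalid_run += 1
            self.valid_run = 0
            if not self.fail_safe and self.invalid_run >= self.enter_ticks:
                self.fail_safe = True   # single entry event: log snapshot here
        else:
            self.valid_run += 1
            self.invalid_run = 0
            if self.fail_safe and self.valid_run > self.glitch_ticks:
                self.fail_safe = False  # exit only after stable valid input
        return self.fail_safe
```

Because entry and exit both require persistence, boundary chatter produces at most one logged entry/exit pair instead of an event storm.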
H2-4 · Bus Fault Detection: What to Detect, Where It Lives, and How It Reports
Fault detection must separate root-cause classes (short/open/thermal/UV/OV/control faults) from “communication failed.” Use layered detectors and a consistent signal-to-evidence map.
Layered detectors: transceiver vs controller
- Transceiver (root-cause oriented): local fault flags (short, UV/OV, thermal, dominant-timeout, mode mismatches).
- Controller (symptom oriented): protocol counters and bus-off state; useful for confirming severity and persistence.
Root-cause preference: when transceiver fault flags exist, classify by those flags first; controller counters provide context and trend, not root cause.
What to detect (common item list)
- Line faults: CANH/CANL short, line-to-battery/ground, open/floating, stuck dominant/recessive.
- Power/thermal: undervoltage (UV), overvoltage (OV), thermal warning/shutdown.
- Control faults: TxD stuck, RxD stuck, dominant-timeout triggers, silent-mode latch, mode pin faults.
- Symptom context: controller error counters, bus-off entry, recovery cycles (referenced only).
How it reports (pins vs SPI vs counters)
- ERR/INT pins: fast, simple, suitable for immediate response; limited root-cause detail unless paired with snapshots.
- SPI status: rich fault bits and mode context; requires a robust interrupt/poll policy and clear/latched behavior handling.
- Controller counters: symptom trending; used to confirm persistence and classify severity; do not treat as a root-cause label.
- Diagnostic frames / gateway reporting: mention only; deep details belong to controller/bridge pages.
Evidence snapshot (minimum set)
event_id · timestamp · mode · reporter(pins) · status_bits(SPI) · counter_snapshot · V/T snapshot · action_taken · recovery_reason
Deliverable: Detection signal map (fault → pin/bit/counter)
Map each fault ID to a primary detector and a primary reporter. This prevents inconsistent root-cause labels across ECUs and test stations.
| Fault ID | Primary detector | Primary reporter | Backup signal | Log snapshot focus |
|---|---|---|---|---|
| P-01 | Transceiver | ERR/INT + status bits | Controller counters | fault_bits + mode + V/T |
| P-02 | Transceiver | Fail-safe state bit | ERR/INT (optional) | fail_safe_entry + duration |
| F-02 | Transceiver | Timeout flag + counter | Controller bus-off | timeout_count + detect_time_X |
| C-01 | Controller | Bus-off state | Transceiver flags (if any) | counter trend + recovery reason |
H2-5 · ASIL Interface Hooks: Safety Pins, Safe-State Control, and Safety Manual Mapping
ASIL hooks are the wiring-level interfaces that enable detection, controlled safe-state transitions, and evidence-ready safety claims. This section maps pins, registers, and watchdog behaviors into an auditable control loop.
Scope guard: no ISO clause reproduction; only engineering mapping templates and integration rules.
Hook inventory: pins, SPI safety regs, and watchdog interfaces
Treat every hook as a contract with three attributes: Direction (in/out), Role (Detect/Control/Report), and Failure concern (how the hook itself can fail and how it is monitored).
| Hook | Role | Typical meaning | Failure concern | Integration note |
|---|---|---|---|---|
| ERR / INT | Report | Fault event notification | Stuck-high/low; missing edge | Pair with SPI snapshot + periodic line check |
| EN / STB | Control | Mode / enable control | Stuck control line | Cross-check mode readback; define safe default |
| INH / FS | Safe-state control | Power inhibit / forced safe behavior | Single-point path risk | Provide a redundant safe-state path or monitoring |
| WAKE | Report/Control | Wake request indication | False wake attribution | Log wake source separately from fault flags |
| SPI safety regs | Detect/Report | Latched fault bits, mode readback | Read/clear policy errors | Snapshot before clear; include counter/timestamp |
| Window watchdog | Detect/Control | Timing-checked servicing | Over/under-service | Tie reset cause into safety event logs |
Integration tip: classify hooks into Report (pins), Root-cause detail (SPI), and Safe-state actuation (FS/INH/EN/STB) to prevent ambiguous safety claims.
Design rules: single-point fault avoidance and testability
Rule A · Avoid single-point safe-state control paths
Safe-state actuation (FS/INH/EN/STB) must not rely on a single unmonitored line. Use redundant actuation or cross-monitoring (readback + plausibility checks).
Pass criteria (placeholder): detect control-line faults within X, and achieve safe-state within Y.
Rule B · Cross-check report pins with SPI snapshots
ERR/INT pins provide fast notification but are insufficient for root-cause classification. Always capture a pre-clear SPI snapshot to bind the event to fault bits, mode, and counters.
Pass criteria (placeholder): each event produces exactly one snapshot record; no event storm under boundary chatter.
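The snapshot-before-clear ordering can be sketched as follows; `MockSpiStatus` is a hypothetical stand-in for a latched status register, not a real driver API:

```python
class MockSpiStatus:
    """Stand-in for a latched SPI status register (hypothetical device).
    Reading returns the latched bits; clearing acknowledges and wipes them."""
    def __init__(self, latched_bits):
        self.bits = latched_bits
    def read(self):
        return self.bits
    def clear(self):
        self.bits = 0

def service_fault_event(spi, mode, counters):
    """Ordering rule: snapshot BEFORE clear, then acknowledge.
    Clearing first would bind the event to empty fault bits."""
    snapshot = {"fault_bits": spi.read(),
                "mode": mode,
                "counters": dict(counters)}
    spi.clear()  # acknowledge only after the evidence is captured
    return snapshot
```

The same ordering must hold in the interrupt path: an ERR/INT edge schedules `service_fault_event`, and no recovery or reset runs until the snapshot is committed.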
Rule C · Periodic self-test must exercise the detection chain
Self-test should validate report → snapshot → decision → actuation, not only pin toggling. Use maintenance windows for intrusive tests and ensure controlled recovery.
Pass criteria (placeholder): self-test cycle period P; maximum service impact Q.
Safety manual mapping template (engineering-only)
Use this template to align safety claims with integration assumptions and measurable mechanisms. Replace placeholders with project-specific thresholds and coverage targets.
| Section | What to write | Evidence fields | Placeholders |
|---|---|---|---|
| Assumptions | Operating boundaries, power sequencing, monitoring task availability | mode, V/T, reset cause | Boundaries = X |
| Safety mechanisms | Hooks used for detect, control, and reporting | fault bits, IRQ edges, actuation state | Detect time = Y |
| Diagnostic coverage | Coverage rate, detection windows, false-positive controls | matrix pass/fail records | Coverage = Z |
H2-6 · Fault Injection Support: How to Prove Detection and Coverage
Diagnostic capability is only meaningful when it is provable. Prove detection, timing, and coverage by executing a controlled injection matrix tied to a consistent evidence snapshot.
Coverage vocabulary (placeholders) and proof rules
- Coverage rate (Z): share of the fault list that produces the expected detect signal under controlled injection.
- Detect time (Y): maximum allowed delay between injection start and first valid detection report (placeholder Y).
- False-positive rate (X): maximum allowed unexpected detections per time window (placeholder X).
Proof rule: every injected fault must yield (1) an expected reporter signal, (2) an optional safe-state action (when required), and (3) an evidence snapshot containing consistent fields.
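The proof rule can be encoded as a single pass/fail check per injection; the result-dictionary keys here are illustrative, not a fixed log format:

```python
def injection_result_ok(result, detect_budget_y, reaction_required):
    """Apply the proof rule to one injection result:
    (1) expected reporter signal seen within the detect budget Y,
    (2) safe-state action present whenever the fault requires one,
    (3) evidence snapshot carries the consistent core fields."""
    detected = (result.get("reporter_seen", False)
                and result.get("detect_time", float("inf")) <= detect_budget_y)
    reacted = result.get("safe_state_entered", False) or not reaction_required
    snapshot = result.get("snapshot", {})
    evidenced = all(k in snapshot for k in ("event_id", "fault_bits", "mode"))
    return bool(detected and reacted and evidenced)
```

Applying the same checker to every matrix row keeps pass criteria uniform across built-in and external injection methods.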
Injection methods: built-in test modes vs external emulation (engineering principles)
Use built-in injection when available to isolate internal detection paths; use external emulation to validate the full wiring and reporting chain. Execute injections only in controlled verification environments with defined entry/exit procedures.
| Method category | Examples (placeholders) | Best for proving | Evidence focus |
|---|---|---|---|
| Built-in | forced dominant, loopback, test mode, error-flag injection | Internal detect path and reporting | fault bits + latch/clear behavior |
| External | pin forcing, line short jig, thermal / UV / OV emulation | Full chain: wiring → detect → decision → actuation | report pins + safe-state + snapshot timing |
Safety note (scope-level)
Execute injections only under controlled verification conditions with defined current-limits/protection and a documented recovery procedure. Do not treat injection content as in-field troubleshooting instructions.
Deliverable: fault injection matrix (Fault / Method / Expected detect / Pass criteria)
Tie each row to the fault taxonomy (fault IDs). For safety-critical faults, require at least two proof paths (e.g., built-in + external) to prevent single-path blind spots.
| Fault ID | Injection method | Expected detect | Expected reaction | Pass criteria | Evidence fields |
|---|---|---|---|---|---|
| P-01 | External line jig (controlled) | ERR/INT + fault bits | Safe-state when required | Detect ≤ Y; no storm | id, bits, mode, V/T |
| F-02 | Built-in test mode (placeholder) | Timeout flag + counter | TX inhibit / mode change | Detect ≤ Y; latch ok | bits, count, time |
| PW-01 | UV/OV emulation (controlled) | UV/OV flag + ERR/INT | Safe-state + recovery record | Detect ≤ Y; recover ok | V, bits, reset cause |
| T-01 | Thermal emulation (controlled) | Thermal warn/shutdown flag | Safe-state + restart policy | Detect ≤ Y; no loop | T, bits, action |
H2-7 · Diagnostics Quality: False Positives, Debounce, and “Serviceable” Events
Diagnostics is only valuable when events are trustworthy and serviceable. Standardize false-positive control, debounce rules, and an evidence packet that can be reproduced and repaired.
Scope guard: event-level quality only (no EMC design details, no timing budgets).
False positive sources: classify before tuning
Treat false positives as category problems, not “random noise.” A stable classification prevents endless threshold chasing and makes field data comparable across programs.
| Source class | Typical trigger | Quick check | Fix rule | Pass criteria |
|---|---|---|---|---|
| Transient boundary | very short excursions | compare event width to debounce window | N-of-M + minimum duration gate | FP ≤ X per window |
| Mode transition | sleep↔active, silent↔normal | check event clustering at transitions | state-gated detection + warm-up window | no spikes beyond Y |
| Window definition error | wrong denominator | recompute counts with fixed window | freeze metric spec + version tag | metrics consistent |
| Snapshot race | clear-before-read, IRQ/poll conflict | compare pin edge vs SPI latched bits | pre-clear snapshot + ordering rule | no empty snapshots |
Debounce toolkit: time windows, voting, and multi-source agreement
Debounce is a confirmation rule that defines when an event becomes serviceable. Combine three layers to avoid “delay-only” designs.
Layer 1 · N-of-M voting (time window)
Within a window M, require at least N detections to qualify. Define a minimum event duration to suppress short spikes.
Pass criteria (placeholder): FP ≤ X per T.
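A minimal N-of-M voter with a minimum-duration gate, assuming boolean per-sample detections; `n`, `m`, and `min_run` are project placeholders:

```python
from collections import deque

class NofMVoter:
    """Qualify an event only when at least n of the last m samples detect it
    AND the current detection run meets a minimum duration (min_run samples)."""
    def __init__(self, n, m, min_run):
        self.n = n
        self.min_run = min_run
        self.window = deque(maxlen=m)  # sliding window of recent samples
        self.run = 0                   # length of the current detection run

    def sample(self, detected: bool) -> bool:
        self.window.append(detected)
        self.run = self.run + 1 if detected else 0
        return sum(self.window) >= self.n and self.run >= self.min_run
```

The duration gate is what suppresses single-sample spikes even when the voting window happens to contain enough scattered hits.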
Layer 2 · Multi-source agreement (pin + counter + network context)
Require consistency across at least two signals: report pin edge, latched fault bit/counter, and network context (e.g., utilization state).
Pass criteria (placeholder): agreement rate ≥ A; missing-source events flagged as “suspect.”
Layer 3 · State-gated detection (transition windows)
During mode transitions and recovery windows, suppress or downweight detections to prevent systematic false positives. Always tag events with mode to make the behavior reviewable.
Pass criteria (placeholder): transition-related events ≤ B per cycle.
“Serviceable” event definition: evidence packet + DTC mapping
An event is serviceable only when it carries enough context to reproduce, classify, and repair. Standardize an evidence packet and map it to a stable DTC taxonomy.
| Field | Why it matters | Notes |
|---|---|---|
| Event ID + version | Stable cross-team reference | Freeze once released |
| Start/End + Duration | Separates spikes from persistent faults | Defines dedupe rules |
| Mode + Network state | Explains transition-related events | Include utilization summary |
| Snapshot (pre-clear) | Root-cause signals | Bits + counters + ordering |
| V/T + Reset cause | Makes field events reproducible | Use compact encoding |
H2-8 · Verification Plan: Design → Bring-Up → Production (Evidence-Driven)
Convert verification into executable gates with explicit inputs, outputs, owners, and pass criteria. Evidence is the common currency across design, bring-up, and production.
Gate checklist template (inputs / checks / outputs / pass criteria)
Each gate is a contract: if inputs are incomplete, the gate must fail early. Keep pass criteria measurable and link each output to an evidence artifact (logs, reports, matrices).
| Gate | Inputs | Checks | Outputs | Pass criteria | Owner |
|---|---|---|---|---|---|
| Design gate | fault list, hook wiring, log schema | completeness + consistency checks | v1 artifacts frozen | no TBD in core rows | Design / FW |
| Bring-up gate | injection cases, thresholds, FP baseline | matrix pass/fail + snapshot validity | calibrated values + report | Detect ≤ Y; FP ≤ X | FW / Test |
| Production gate | ATE/ICT items, sampling, regression triggers | measurability + version alignment | production test spec + plan | yield stable; regressions caught | MFG / QE |
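The fail-early contract above can be sketched as a small gate runner; gate names, input keys, and check labels are placeholders:

```python
def run_gate(name, inputs, required_inputs, checks):
    """A gate is a contract: incomplete inputs fail early, before any check
    runs; otherwise every check must pass with a recorded result."""
    missing = required_inputs - inputs.keys()
    if missing:
        return {"gate": name, "passed": False,
                "reason": f"missing inputs: {sorted(missing)}", "results": []}
    # Each check is (label, predicate-on-inputs); record every outcome
    # so the gate report doubles as an evidence artifact.
    results = [(label, bool(fn(inputs))) for label, fn in checks]
    return {"gate": name, "passed": all(ok for _, ok in results),
            "reason": "", "results": results}
```

Because the returned dictionary records every check outcome, the gate report itself becomes the evidence artifact the table above asks for.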
H2-9 · Engineering Checklist: Wiring, Firmware, and Lab Setup
Convert diagnostics-and-safety requirements into an executable checklist across hardware, firmware, and lab validation.
Scope guard: wiring/contracts/logging/injection readiness only (no EMC design details, no bitrate/timing tuning).
Hardware checklist: safety pins, defaults, and power sequencing
The hardware layer must make safety/diagnostics signals unambiguous at reset and predictable during brown-out. When defaults are wrong, software diagnoses the wrong root cause.
- Safety pin strapping: define pull-up/pull-down for EN/STB/WAKE/ERR/INH/FS-class pins; forbid floating inputs unless explicitly supported.
- Default-state contract: record expected RxD/INT/ERR states for reset, standby, bus disconnected, and supply undervoltage (placeholders).
- Redundancy and cross-monitor: if two channels exist, ensure independent sensing (pin vs SPI vs counter) to avoid single-point reporting failures.
- Power-up / power-down order: define sequencing between MCU reset, SBC rails, and transceiver mode pins; prevent transition-window false events.
- Reset cause observability: capture reset reason (SBC/MCU) and keep it inside the event snapshot for service triage.
Deliverables (evidence): wiring map, default-state table, and a power-seq checklist with pass criteria (X/Y placeholders).
Firmware checklist: sampling, bus-off policy, reset behavior, and throttling
Firmware defines the diagnostic truth model. Standardize counter sampling periods, event confirmation rules, and recovery policies so field events remain comparable across versions.
| Item | What to lock | Quick check | Pass criteria |
|---|---|---|---|
| Counter sampling | period, window, denominator definition | recompute with fixed window | metrics stable within X |
| Bus-off strategy | entry, recovery, cooldown | verify recovery does not spam events | event rate ≤ Y |
| Reset policy | who resets whom, when | confirm snapshot before reset | no empty evidence |
| Event throttling | dedupe rule + rate limiter | look for bursty repeats | burst ≤ B per T |
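The dedupe + rate-limit rule can be sketched as below, assuming integer ticks; the dedupe window, burst budget B, and window T are placeholders:

```python
from collections import deque

class EventThrottle:
    """Dedupe + rate limit: suppress repeats of the same event ID inside a
    dedupe window, and cap total accepted events to max_burst per window."""
    def __init__(self, dedupe_ticks, window_ticks, max_burst):
        self.dedupe_ticks = dedupe_ticks
        self.window_ticks = window_ticks
        self.max_burst = max_burst
        self.last_seen = {}    # event_id -> tick of last accepted event
        self.recent = deque()  # ticks of all accepted events (sliding window)

    def accept(self, event_id, tick) -> bool:
        # Drop duplicates of the same event inside the dedupe window.
        last = self.last_seen.get(event_id)
        if last is not None and tick - last < self.dedupe_ticks:
            return False
        # Slide the window forward, then enforce the burst budget.
        while self.recent and tick - self.recent[0] >= self.window_ticks:
            self.recent.popleft()
        if len(self.recent) >= self.max_burst:
            return False
        self.last_seen[event_id] = tick
        self.recent.append(tick)
        return True
```

Suppressed events should still increment a per-ID counter so bursty repeats remain visible in the evidence without overflowing storage or uplink.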
Deliverables (evidence): event schema vX, sampling spec, bus-off & reset policy sheet, and throttling rules with placeholders.
Lab checklist: injection fixtures, isolation, and evidence capture templates
Lab readiness is measured by repeatability and auditability. Every injection must have a defined expected signal, a snapshot requirement, and a pass/fail record template.
- Injection fixture safety: define “safe-to-connect” states and interlocks; prevent accidental hard shorts during setup.
- Isolation discipline: keep measurement grounds consistent; tag any intentional ground offset tests (event-level only).
- Pass/fail recording: for each injection, record method, duration, expected detect channel, and captured evidence packet fields.
- Repeatability: run three cycles minimum; if results depend on the operator, the procedure is incomplete.
Deliverables (evidence): injection setup SOP, safety checklist, and a pass/fail record template with placeholders.
H2-10 · IC Selection Logic: Choosing Transceiver/SBC/Controller for Safety Diagnostics
Selection must be driven by diagnostics and safety evidence: default contracts, fault reporting, and proof capability. Keep trade-offs explicit to avoid both over-design and blind spots.
Scope guard: diagnostics/safety capability only (no EMC optimization, no bitrate/timing tuning).
Step 1 — Define targets: ASIL goal, safety concept, and fault tolerant time (placeholders)
Start with system-level targets and translate them into interface requirements: diagnostic coverage, detection latency, and safe-state entry conditions (all as placeholders for now).
- ASIL target (placeholder): required diagnostic coverage and evidence strength.
- Fault tolerant time (placeholder): how fast a detection must surface to prevent unsafe actuation.
- Safe-state definition: what outputs/pins/commands constitute a safe fallback at the ECU boundary.
Step 2 — Must-have capabilities: fail-safe receive, reporting channel, test mode, observability
Map each candidate device to a small set of must-have properties that enable detection, reporting, and proof. Missing any of these typically causes “silent failures” or non-serviceable field events.
| Capability | Why it matters | Proof hook |
|---|---|---|
| Fail-safe receive contract | predictable output under open/floating/common-mode anomalies | documented default table |
| Fault reporting channel | distinguish short/open/thermal/UV/OV from generic comm loss | pin + SPI bits + counters |
| Test/injection support | prove detection latency and coverage without destructive tests | test modes / forced flags |
| Diagnostic observability | evidence packet fields can be collected with stable ordering | pre-clear snapshot rule |
Trade-offs: pins vs SPI, standby vs always-on diagnostics, sensitivity vs false positives
Trade-offs must be explicit so the system does not lose serviceability or create false alarms. Prefer decision rules that produce a repeatable device class outcome.
| Trade-off | Benefit | Risk | Mitigation |
|---|---|---|---|
| Pins vs SPI | simple latency / easier safety monitor | pin count pressure or missing detail | hybrid: pin for alert + SPI for root-cause bits |
| Standby vs always-on | lower Iq or stronger observability | blind windows during sleep | state-tagging + wake reason capture |
| Sensitivity vs FP rate | catch latent faults early | service noise / false DTCs | debounce + evidence packet standard |
H2-11 · Applications: Where Diagnostics & Safety Dominates the Architecture
Focus on why certain vehicle domains force heavier diagnostics evidence and safety behavior. This section does not discuss protocol timing, bitrate tuning, or EMC implementation details.
Scope guard: application drivers → architecture implications → minimum evidence packets → common pitfalls. Isolation is referenced only as a diagnostic path boundary, not as a design/EMC topic.
Part numbers below are example BOM references to anchor implementation choices. Always verify the exact grade, suffix, safety documentation, and availability for the target program.
Application heatmap: where diagnostics & safety pressure is highest
Rows are application domains. Columns are the diagnostic/safety drivers that shape architecture and evidence needs.
Interpretation rule: higher intensity means the architecture must prioritize deterministic detection, serviceable evidence, and safe-state behavior above feature breadth.
Powertrain / Chassis: high consequence → heavier coverage and evidence
When a bus fault can trigger torque limiting, steering fallback, or braking degradation, diagnostics must be specific (root-cause) and provable (evidence before recovery/reset).
Architecture implications: multi-source evidence (pin + SPI + controller counters), deterministic safe-state entry, and recovery policies that preserve snapshots.
Minimum evidence packet (field service ready)
- event_id, timestamp, reset_cause (before any recovery action)
- mode_state (normal / standby / silent / fault-latched)
- fault_source (pin alert vs SPI bits vs controller counters)
- snapshot placeholders: Vbat/Vio/Tj, error counters, bus-off flag
- action_taken (safe-state applied? reset executed? cooldown?)
Example BOM anchors (part numbers)
- HS CAN / CAN FD transceiver (examples): TI TCAN1042-Q1, TI TCAN1051-Q1, NXP TJA1044, Microchip MCP2562FD
- Selective wake / partial networking (examples): TI TCAN1145-Q1, NXP TJA1145
- Controller-side observability (examples): Microchip MCP2517FD (SPI CAN FD controller), TI TCAN4550-Q1 (controller + transceiver class)
Common pitfalls
- Reset or bus recovery triggers before snapshot capture → root-cause becomes unserviceable.
- Single reporting path (only a pin, or only SPI) → latent reporting failures look like generic comm loss.
- Event storms during intermittent faults → safe-state oscillation unless throttling/dedupe is defined.
HV / Isolation contexts: cross-domain observability dominates diagnostics success
Isolation creates a diagnostic boundary. The core risk is not only the bus fault itself, but that evidence cannot cross domains and service teams only see “communication lost”.
Architecture implications: define where detection lives (HV or LV), define how reports cross domains, and tag every event with domain context.
Minimum evidence packet (cross-domain)
- domain_tag (HV / LV) + mode_state
- cross_domain_link_state (reporting available? cached? degraded?)
- fault_source (local pin/SPI/counter) + report_path (gateway/log channel)
- time alignment placeholder (how HV/LV timestamps correlate)
Example BOM anchors (part numbers)
- Isolated CAN transceiver (examples):
  TI ISO1042-Q1, Analog Devices ADM3053, Analog Devices ADM3055E
- Non-isolated CAN transceiver on domain edge (examples):
  TI TCAN1042-Q1, NXP TJA1044
- Controller-side event anchoring (examples):
  Microchip MCP2517FD, TI TCAN4550-Q1
Common pitfalls
- HV side detects a fault but LV side only records “bus down” → root-cause cannot be isolated.
- Domain resets are not sequenced → timestamps and event ordering become inconsistent.
- Single reporting channel across isolation → a reporting-path fault mimics a bus fault.
Body / Comfort: high node count → false positives and event storms dominate cost
When many nodes wake/sleep and switch modes frequently, the biggest architecture risk is service noise: false DTCs, repeated alerts, and unbounded log volume.
Architecture implications: debounce rules, dedupe + rate limiters, and explicit mode tags so transient transitions do not look like faults.
Minimum evidence packet (service & triage)
- duration and repeats_in_window (storm control)
- mode_tag (sleep / wake / standby) + wake_reason placeholder
- bus_state snapshot placeholders (utilization, counters)
- DTC mapping key fields (consistent identifiers across variants)
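The debounce + dedupe + rate-limit rules described above can be sketched as one storm filter. The window lengths below are placeholders, not recommendations:

```python
import collections

class EventThrottle:
    """Debounce + dedupe + rate-limit for transceiver fault events.

    An event passes only if (a) it outlasts the debounce window, (b) it is not
    a duplicate of a recent identical event, and (c) the per-key rate limit in
    the rolling window is not exceeded. All windows are placeholder values.
    """
    def __init__(self, debounce_s=0.05, dedupe_s=1.0, max_per_window=3, window_s=10.0):
        self.debounce_s = debounce_s
        self.dedupe_s = dedupe_s
        self.max_per_window = max_per_window
        self.window_s = window_s
        self.last_seen = {}                       # key -> last accepted timestamp
        self.history = collections.defaultdict(collections.deque)

    def accept(self, key, t, duration_s):
        if duration_s < self.debounce_s:          # transient: debounced out
            return False
        if key in self.last_seen and t - self.last_seen[key] < self.dedupe_s:
            return False                          # duplicate within dedupe window
        hist = self.history[key]
        while hist and t - hist[0] > self.window_s:
            hist.popleft()                        # expire old entries
        if len(hist) >= self.max_per_window:      # storm: rate-limited
            return False
        hist.append(t)
        self.last_seen[key] = t
        return True

th = EventThrottle()
print(th.accept("err_pin", 0.0, 0.01))   # → False (too short: debounced)
print(th.accept("err_pin", 0.0, 0.10))   # → True  (accepted)
print(th.accept("err_pin", 0.5, 0.10))   # → False (duplicate within dedupe window)
```

Applying the filter at the logging layer (not only in the UI) is what keeps storage and uplink volume bounded, per the pitfall below.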
Example BOM anchors (part numbers)
- LIN transceiver (examples):
  TI TLIN1029-Q1, TI TLIN2029-Q1, NXP TJA1021, Microchip MCP2003B
- CAN transceiver with strong reporting (examples):
  TI TCAN1042-Q1, NXP TJA1044, Microchip MCP2562FD
- SBC-style integration anchor (examples):
  NXP FS65, NXP UJA1169
Common pitfalls
- Debounce implemented only in UI/reporting → storage and uplink still overflow.
- Counter window definitions differ across firmware versions → field metrics are not comparable.
- Mode transitions are not tagged → wake/sleep transients become “faults”.
H2-12 · FAQs (12): Diagnostics & Safety Boundary Troubleshooting
Fixed 4-line answers only: Likely cause / Quick check / Fix / Pass criteria (threshold placeholders).
Scope guard: diagnostics + safety interfaces only (signal → event → evidence → safe-state). No EMC details, no timing/bitrate tuning, no PN mechanism deep-dive.
1) Bus-off happens after a short glitch — first check: counter window definition or a real fault? Counter semantics vs actual fault classification.
Likely cause: error counter window/denominator mismatch, or a latched fault condition being interpreted as a brief glitch.
Quick check: log TEC/REC + bus-off flag with timestamp; verify the sampling window (X ms) and whether counters are reset/rolled over.
Fix: standardize the counter window + denominator, and add a “fault-latched vs transient” tag before executing recovery/reset.
Pass criteria: bus-off rate ≤ X per Y minutes under the defined window; counters and flags are consistent across versions.
2) ERR pin toggles but logs show no DTC — what’s the first signal-to-event mapping check? Pin/SPI/counter → event schema alignment.
Likely cause: wrong edge polarity/level mapping, missing interrupt enable, or an event filter that drops the pin pulse.
Quick check: correlate ERR pin timestamp with MCU ISR entry + event_id creation; verify mapping table: pin → event_id → DTC code.
Fix: lock a “signal-to-event map” spec (pin/SPI/counters), then implement a minimum pulse capture (X µs) and a debounce window (Y ms).
Pass criteria: ≥ X% of ERR pulses produce an event within Y ms; DTC mapping is deterministic across firmware builds.
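The "signal-to-event map" from FAQ 2 can be pinned down as one locked table that every firmware build shares. A minimal sketch; the pin names, conditions, and DTC codes below are hypothetical placeholders:

```python
# One locked mapping table: (source, signal, condition) -> (event_id, DTC code).
# All names and codes below are hypothetical placeholders.
SIGNAL_TO_EVENT = {
    ("pin", "ERR", "falling"):  ("EVT_BUS_ERR", "U0100"),
    ("spi", "CANERR", 1):       ("EVT_BUS_ERR", "U0100"),
    ("counter", "TEC", ">=96"): ("EVT_ERR_PASSIVE", "U0101"),
}

def map_signal(source, name, condition):
    """Deterministically map a raw signal to (event_id, dtc); None if unmapped.

    An unmapped signal is itself evidence of a schema gap, so callers should
    log it rather than silently drop it.
    """
    return SIGNAL_TO_EVENT.get((source, name, condition))

print(map_signal("pin", "ERR", "falling"))   # → ('EVT_BUS_ERR', 'U0100')
print(map_signal("pin", "ERR", "rising"))    # → None: schema gap, log it
```

Because the table is data rather than scattered `if` logic, the pin → event_id → DTC mapping can be diffed across firmware builds, which is what makes the pass criterion ("deterministic across builds") checkable.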
3) Fail-safe receive keeps output recessive — how to confirm line-open vs transceiver silent mode? Output contract vs mode-latched state.
Likely cause: receiver fail-safe default drives recessive on open/floating, or the device is in silent/standby mode with RxD forced recessive.
Quick check: read mode status (pin state + SPI status if available) and log “mode_state”; compare RxD behavior across known mode transitions.
Fix: explicitly tag silent/standby transitions in logs; add a mode-guard that blocks root-cause classification unless mode_state is “normal”.
Pass criteria: for open-line tests, classification accuracy ≥ X% with mode_state recorded; no “silent-mode mislabels” in Y cycles.
4) Dominant timeout triggers in the field but never on bench — what injection case is missing? Coverage gap in fault injection matrix.
Likely cause: missing injection of TxD stuck-dominant under realistic mode transitions (sleep/wake/reset), or missing “partial latch + recovery” sequence.
Quick check: compare lab injection matrix vs field timeline; verify whether TxD forcing was tested across: reset window (X ms), standby exit, and bus-off recovery.
Fix: add an injection case set: TxD dominant during reset, during wake, and during recovery; define expected detect channel(s) and evidence fields.
Pass criteria: dominant-timeout detection latency ≤ X ms and repeatability ≥ Y% over Z runs for all defined injection states.
5) False wake vs false fault — how to separate wake-source attribution from bus-fault detection? Attribution only (no partial-networking mechanism details).
Likely cause: wake reason is not tagged (bus/local/timed), so a wake transition is misclassified as a bus fault event.
Quick check: ensure logs contain wake_reason + mode_state + timestamp; check if fault events cluster within X ms after wake/standby transitions.
Fix: enforce a two-step classification: (1) attribute wake_reason; (2) enable bus-fault detection only after a stabilization window (X ms) in normal mode.
Pass criteria: post-wake false fault rate ≤ X per Y wakes; wake_reason is present in ≥ Z% of related events.
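The two-step classification in FAQ 5 — attribute the wake first, enable bus-fault detection only after a stabilization window — can be sketched as a single guard function (the window length is the "X ms" placeholder):

```python
def classify_event(event_t, wake_t, wake_reason, mode_state, stabilization_s=0.2):
    """Two-step classification for a post-wake event.

    Step 1: suppress classification unless the device is confirmed in normal
    mode. Step 2: suppress bus-fault classification inside the stabilization
    window after wake, tagging the wake_reason instead. stabilization_s is a
    placeholder value.
    """
    if mode_state != "normal":
        return "suppressed:not-normal-mode"
    if event_t - wake_t < stabilization_s:
        return f"suppressed:post-wake({wake_reason})"
    return "bus-fault-candidate"

print(classify_event(0.05, 0.0, "bus", "normal"))    # → suppressed:post-wake(bus)
print(classify_event(0.50, 0.0, "bus", "normal"))    # → bus-fault-candidate
```

Suppressed events should still be logged with their tag; they are the data needed to verify the "false fault rate ≤ X per Y wakes" pass criterion.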
6) Two ECUs disagree on fault root-cause — first alignment check: timestamps, event IDs, or counters? Correlation hygiene across nodes.
Likely cause: timebases are not aligned, event IDs are not globally unique, or counters are sampled with different windows.
Quick check: align on a single tuple: (timestamp ± X ms, event_id, counter snapshot window); verify both ECUs record the same mode_state at the same time.
Fix: standardize event_id schema (include node ID), standardize counter windows, and add a correlation rule for multi-ECU incidents.
Pass criteria: disagreement rate ≤ X% over Y injected incidents; correlation success ≥ Z% with the standardized tuple.
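The standardized correlation tuple from FAQ 6 can be expressed as one predicate. A sketch, assuming the event_id schema is `<event>@<node>` so the schema part can be compared across nodes (a hypothetical convention):

```python
def correlate(ev_a, ev_b, tol_s=0.01):
    """Decide whether two ECUs' records describe the same incident.

    Uses the standardized tuple: timestamp within ± tol_s (the "± X ms"
    placeholder), matching event schema id (the part before '@', since the
    node suffix differs per ECU), and identical counter snapshot window.
    """
    return (abs(ev_a["timestamp"] - ev_b["timestamp"]) <= tol_s
            and ev_a["event_id"].split("@")[0] == ev_b["event_id"].split("@")[0]
            and ev_a["counter_window_ms"] == ev_b["counter_window_ms"])

a = {"timestamp": 10.000, "event_id": "EVT_BUS_ERR@node1", "counter_window_ms": 100}
b = {"timestamp": 10.004, "event_id": "EVT_BUS_ERR@node2", "counter_window_ms": 100}
print(correlate(a, b))  # → True
```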
7) Thermal shutdown recovers but CAN stays unstable — first check recovery state vs bus-off policy? Recovery state machine coordination.
Likely cause: the transceiver recovers to a restricted mode while firmware assumes normal, causing repeated errors and bus-off loops.
Quick check: log Tj/thermal flag + mode_state + bus-off state across recovery; verify if firmware re-enables Tx before cooldown (X ms).
Fix: add a recovery gate: require thermal-clear + cooldown + explicit normal-mode confirmation before Tx enable and bus-off recovery.
Pass criteria: post-thermal recovery achieves stable operation for X minutes with bus-off count ≤ Y and no repeated thermal relatch.
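The recovery gate from FAQ 7 is a small state machine: thermal-clear, then cooldown, then explicit normal-mode confirmation, and only then Tx enable. A minimal sketch with a placeholder cooldown:

```python
class RecoveryGate:
    """Gate Tx enable after thermal shutdown.

    Requires, in order: a thermal-clear event, an elapsed cooldown (placeholder
    value), and an explicit normal-mode readback taken AFTER the clear.
    """
    def __init__(self, cooldown_s=1.0):
        self.cooldown_s = cooldown_s
        self.thermal_cleared_at = None
        self.mode_confirmed_normal = False

    def on_thermal_clear(self, t):
        self.thermal_cleared_at = t
        self.mode_confirmed_normal = False   # must be re-confirmed after clear

    def on_mode_readback(self, mode_state):
        self.mode_confirmed_normal = (mode_state == "normal")

    def tx_enable_allowed(self, t):
        return (self.thermal_cleared_at is not None
                and t - self.thermal_cleared_at >= self.cooldown_s
                and self.mode_confirmed_normal)

g = RecoveryGate()
g.on_thermal_clear(0.0)
g.on_mode_readback("normal")
print(g.tx_enable_allowed(0.5))  # → False: cooldown not yet elapsed
print(g.tx_enable_allowed(1.5))  # → True
```

Resetting `mode_confirmed_normal` on every thermal-clear is the detail that prevents firmware from reusing a stale "normal" readback and re-entering the bus-off loop.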
8) SPI status bits look “stuck” — is it latch behavior or a missing clear sequence? Device register contract and driver sequencing.
Likely cause: latched status bits require a read/clear or specific unlock sequence, or the driver never executes the clear path after logging.
Quick check: compare raw SPI reads before/after a known clear attempt; check if “clear on read” or “write-1-to-clear” is required (per device manual).
Fix: implement the documented clear sequence, and log a “clear_attempted” flag + post-clear readback for auditability.
Pass criteria: status bit clears within X ms after the clear sequence and stays clear for Y cycles without masking real faults.
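FAQ 8's documented clear path plus auditability can be sketched as a generic driver routine. This assumes a write-1-to-clear register (the actual semantics must come from the device manual); `read_reg`/`write_reg` are hypothetical SPI driver callables, and the fake register dict exists only for demonstration:

```python
def clear_status_bit(read_reg, write_reg, reg, mask, retries=3):
    """Generic write-1-to-clear sequence with post-clear readback.

    Returns an audit record: whether a clear was attempted and the masked
    readback after each attempt, so "clear_attempted" + readback can be logged.
    """
    audit = {"clear_attempted": False, "readbacks": []}
    for _ in range(retries):
        if not (read_reg(reg) & mask):
            break                            # already clear: nothing to do
        write_reg(reg, mask)                 # W1C: write the bit to clear it
        audit["clear_attempted"] = True
        audit["readbacks"].append(read_reg(reg) & mask)
        if audit["readbacks"][-1] == 0:
            break                            # confirmed clear
    return audit

# Fake register backing a W1C bit, for demonstration only.
regs = {0x10: 0b0001}
audit = clear_status_bit(lambda r: regs[r],
                         lambda r, m: regs.__setitem__(r, regs[r] & ~m),
                         0x10, 0b0001)
print(audit)  # → {'clear_attempted': True, 'readbacks': [0]}
```

If the readback stays set after `retries` attempts, the bit is genuinely latched by an active fault rather than a missing clear sequence, which is exactly the distinction the FAQ asks for.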
9) Fault injection passes once but later becomes flaky — what degradation sign should be logged first? Evidence-first degradation triage.
Likely cause: marginal conditions accumulate (mode latches, repeated recovery cycles, or injection fixture variability) causing non-repeatable detection.
Quick check: log the first drift indicators: mode_state changes, repeated recoveries, time-to-detect, and event rate vs run count.
Fix: add a “pre/post injection” snapshot + cooldown rule; stabilize injection procedure and enforce a fixed run protocol (X cycles, Y spacing).
Pass criteria: detection latency distribution stays within X–Y ms across Z runs; flake rate ≤ W%.
10) ASIL audit asks for detection time — how to derive it from logs without re-testing everything? Evidence-derived detection latency.
Likely cause: logs lack a “fault start marker”, so detection time cannot be computed consistently across incidents.
Quick check: identify two timestamps: (A) earliest observable symptom (pin/SPI/counter threshold crossing) and (B) event_id creation; define Δt = B − A.
Fix: add a standardized “fault start” marker (threshold crossing) and a “reaction start” marker (safe-state entry) to the event schema.
Pass criteria: detection latency Δt ≤ X ms for Y% of incidents; audit computation is reproducible from logs only.
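The Δt = B − A computation from FAQ 10 can be reproduced from logs alone once both markers exist. A sketch, assuming events are flat dicts with `incident`, `kind`, and `timestamp` fields (a hypothetical log schema):

```python
def detection_latency(events):
    """Compute Δt = B − A per incident from logs only.

    A = earliest observable symptom (fault-start marker: pin/SPI/counter
    threshold crossing), B = event_id creation. Incidents missing either
    marker are skipped rather than guessed.
    """
    a, b = {}, {}
    for e in events:
        if e["kind"] == "fault_start":
            # keep the EARLIEST symptom timestamp per incident
            a[e["incident"]] = min(a.get(e["incident"], e["timestamp"]), e["timestamp"])
        elif e["kind"] == "event_created":
            # keep the FIRST event-creation timestamp per incident
            b.setdefault(e["incident"], e["timestamp"])
    return {i: b[i] - a[i] for i in a if i in b}

log = [
    {"incident": 1, "kind": "fault_start",   "timestamp": 10.000},
    {"incident": 1, "kind": "event_created", "timestamp": 10.007},
    {"incident": 2, "kind": "fault_start",   "timestamp": 20.000},  # no B marker
]
print(detection_latency(log))  # Δt ≈ 7 ms for incident 1; incident 2 skipped
```

Skipped incidents are the audit finding: they show exactly which logs lack a fault-start marker and therefore cannot support a latency claim.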
11) Safety pin default state seems unsafe during MCU reset — what pull strategy is the first fix? Deterministic defaults at reset boundary.
Likely cause: mode/safety pins float or conflict during MCU reset, causing unintended enable/standby transitions.
Quick check: measure pin level during reset window (X ms) and compare to required safe default; verify pull direction and strength (X kΩ placeholder).
Fix: add explicit pull-up/down to enforce safe default; if needed, gate enable with an additional monitor line (cross-check) for single-point failure avoidance.
Pass criteria: pin level stays within safe range for X% of reset events; no unintended mode transition in Y reset cycles.
12) Production ATE passes, vehicle shows intermittent fault — what station-to-station mismatch is most common? Evidence consistency across test stations and builds.
Likely cause: different station thresholds or clear/reset sequences, inconsistent firmware/config versions, or missing environment tags in production records.
Quick check: compare ATE/ICT station configs: thresholds (X), timing windows (Y), reset sequence, and software version; ensure records include temperature/supply tags.
Fix: lock a station-to-station golden config, enforce version pinning, and add a minimal “station evidence packet” aligned with field log schema.
Pass criteria: config drift = 0 across stations; correlation success ≥ X% between ATE evidence and vehicle logs over Y units.
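The "config drift = 0" pass criterion from FAQ 12 is easy to automate once station configs are serialized. A sketch comparing stations against a golden reference; the config field names are placeholders:

```python
def config_drift(stations):
    """Report per-station drift against the 'golden' reference config.

    Returns {station: [drifting keys]} for every key whose value differs from
    or is missing relative to the golden config. Empty dict means drift = 0.
    Field names below are placeholders.
    """
    golden = stations["golden"]
    drift = {}
    for name, cfg in stations.items():
        if name == "golden":
            continue
        diffs = {k for k in golden if cfg.get(k) != golden[k]}
        if diffs:
            drift[name] = sorted(diffs)
    return drift

stations = {
    "golden":    {"threshold_mV": 900, "window_ms": 10, "fw": "1.4.2"},
    "station_A": {"threshold_mV": 900, "window_ms": 10, "fw": "1.4.2"},
    "station_B": {"threshold_mV": 850, "window_ms": 10, "fw": "1.4.1"},
}
print(config_drift(stations))  # → {'station_B': ['fw', 'threshold_mV']}
```

Running this check in CI against pinned station configs turns "config drift = 0 across stations" from an audit statement into a gate that fails before units ship.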