Diagnostics & Safety for CAN/LIN/FlexRay Transceivers
Core idea
Diagnostics & Safety at the vehicle bus interface means turning “the network feels unstable” into a specific, logged root cause and a deterministic safe-state reaction. The goal is a provable chain from signal → event → evidence → recovery, so faults are detectable, auditable, and serviceable in the field.
H2-1 · Definition & Scope: Diagnostics vs Safety at the Bus Interface
The bus interface becomes safety-critical when its failures can hide, amplify, or misreport system risk. This section defines diagnostics as measurable observability and localization, and safety as risk control with verifiable evidence.
Scope guard (avoid overlap)
Owns (deep coverage in this page)
- Fail-safe receive contracts: default outputs under faults, predictable safe behavior.
- Bus fault detection & reporting: fault flags, pins, registers, and evidence mapping.
- ASIL interface hooks: safety pins, safe-state control, testability-oriented design hooks.
- Fault-injection support: how detection and coverage are proven with repeatable tests.
Mentions only (1–2 sentences + internal link)
- Timing / sample-point / loop-delay budgets → link to “Data Rate & Timing”.
- EMC / protection components / layout → link to “EMC / Protection & Co-Design”.
- Selective wake / partial networking → link to “ISO 11898-6”.
- Controller scheduling / gateway / DoIP → link to “CAN Controller / Bridge”.
Diagnostics vs Safety (executable definitions)
Diagnostics answers: “What happened, where, and how often?” Safety answers: “What risk exists, what safe response is enforced, and how is it proven?”
| Dimension | Diagnostics (Observability) | Safety (Risk control + Evidence) |
|---|---|---|
| Primary goal | Detect + localize faults with measurable signals. | Enforce safe states and prove coverage and reaction time. |
| Signals | Fault flags, pins, counters, status registers, snapshots. | Safety hooks, safe-state control paths, test modes, evidence logs. |
| Pass criteria | Stable detection, low false positives, consistent labeling. | Defined safe outputs, bounded detection time, verified coverage. |
| Typical failure mode | “Fault seen once” with no root-cause evidence package. | Safe state triggered, but coverage/time/evidence cannot be demonstrated. |
Note: a “passing waveform” does not guarantee safety; safety requires explicit contracts and evidence-ready verification.
Safety boundary at the bus interface (who owns what)
The bus interface safety boundary is defined by detection location, reporting channel, and safe-state enforcement. Responsibility is split across blocks to prevent single-point blind spots.
- Transceiver/PHY: closest to harness; best for physical fault flags, fail-safe receive outputs, and immediate protection reactions.
- Controller: protocol-layer counters and bus-off state (referenced only; detailed timing/counters belong to the Controller page).
- SBC (if present): power/reset/watchdog hooks that bound unsafe behavior during undervoltage or MCU reset windows.
- MCU safety monitor: system decision and evidence packaging (event snapshots, DTC mapping, recovery policies).
Required evidence deliverables (requirements → tests → logs)
- Requirements map: fault list coverage + detection time window (X) + safe reaction + safe output contract.
- Verification map: fault injection method + expected detection signals + pass criteria (threshold X / time Y).
- Evidence log schema: event ID + timestamp + mode + V/T snapshot + pin/reg/counter snapshot + action taken.
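As a sketch, the evidence-log schema above can be expressed as a typed record with a completeness check. All names here (`EvidenceRecord`, `v_supply_mv`, etc.) are illustrative placeholders, not a device or standard API:

```python
from dataclasses import dataclass, asdict

# Hypothetical record mirroring the schema: event ID + timestamp + mode
# + V/T snapshot + pin/reg/counter snapshot + action taken.
@dataclass(frozen=True)
class EvidenceRecord:
    event_id: str      # stable fault ID from the taxonomy, e.g. "P-01"
    timestamp_us: int  # monotonic timestamp at detection
    mode: str          # transceiver mode when the event fired
    v_supply_mv: int   # supply-voltage snapshot
    temp_c: int        # temperature snapshot
    fault_bits: int    # latched status-register bits (captured pre-clear)
    counters: dict     # protocol/timeout counter snapshot
    action: str        # safe-state reaction that was taken

REQUIRED_FIELDS = {"event_id", "timestamp_us", "mode", "v_supply_mv",
                   "temp_c", "fault_bits", "counters", "action"}

def is_complete(rec: EvidenceRecord) -> bool:
    """A record is audit-ready only if every schema field is populated."""
    d = asdict(rec)
    return REQUIRED_FIELDS <= d.keys() and all(v is not None for v in d.values())
```

Freezing the record type and checking completeness at log time is what keeps field events comparable across firmware versions.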
H2-2 · Fault Taxonomy: What Can Go Wrong (and What Must Be Detected)
A complete fault taxonomy is the “mother table” for detection, safe reactions, fault injection, and audit-ready evidence. Every later chapter must map back to the same fault IDs and the same evidence fields.
Taxonomy structure (three layers)
- Physical faults: wiring shorts/opens, stuck dominant/recessive, ground shift, thermal, under/over-voltage.
- Functional/control faults: TxD stuck, dominant-timeout behavior, silent-mode latch, mode-pin faults.
- Safety classification: single-point vs latent + detection time window (X) for each fault.
Each fault must be written as a testable contract: (fault condition) → (what detects) → (safe reaction) → (evidence fields).
Physical faults (testable, evidence-ready)
Physical faults are best detected close to the harness. The objective is not only “communication failed,” but a bounded root-cause class (short/open/thermal/UV/OV/ground shift) plus a consistent evidence package.
- Short-to-VBAT / short-to-GND: detect via transceiver fault flags/pins; react by enforcing safe output + inhibit where applicable; log fault ID + mode + V/T snapshot.
- CANH↔CANL short / line short: detect by differential abnormality + fault report; react by bounded recovery policy (X); log counters + pin/reg snapshot.
- Open line / floating bus: detect via fail-safe receive behavior and/or line monitoring; react by safe default state; log fail-safe state entry and duration.
- Stuck dominant / stuck recessive: detect via dominant timeout and stuck detection; react by forcing safe mode; log timeout counter and detection time.
- Ground shift / common-mode excursion: detect by local monitoring (if available) and symptom correlation; react by safe-state + escalation; log supply/ground context (placeholder).
- Thermal / under-voltage / over-voltage: detect via built-in flags; react by entering defined safe state and controlled recovery; log temperature/supply snapshots and recovery reason.
Functional/control faults (bus-interface only)
Functional faults are constrained to the bus-interface control path. Firmware retry policies and gateway scheduling are out of scope.
- TxD stuck: detect by mismatch between commanded state and observed bus behavior; react by entering safe mode; log command context + status bits.
- Dominant-timeout behavior: detect by timeout flag and repeated triggers; react by controlled inhibit and recovery policy; log timeout count and time-to-detect (X).
- Silent-mode latch: detect by mode/status mismatch and lack of transmit effect; react by safe re-initialization sequence; log mode transitions + reset cause.
- Mode-pin fault (float/short): detect by pin readback (if available) or inconsistent mode entry; react by safe defaults with hardware biasing; log pin snapshot + boot phase.
Safety classification (single-point vs latent) + detection window
- Single-point fault: can directly create unsafe behavior → requires bounded detection time X and an explicit safe-state reaction.
- Latent fault: does not immediately cause unsafe behavior, but degrades safety mechanisms → requires periodic test or fault injection proof within interval Y.
- Detection window: expressed as time, frames, or cycles; derived from system hazard analysis and architecture assumptions (placeholders X/Y/Z).
Practical rule: classification must live inside the same fault table so verification and logs cannot drift from requirements.
Deliverable: Fault → Detection → Reaction → Evidence (starter template)
Use this table as the single source of truth. Add project-specific thresholds to pass criteria in later chapters.
| Fault ID | Fault | Detection (pin/bit/counter) | Reaction (safe-state) | Evidence fields (log) |
|---|---|---|---|---|
| P-01 | Short-to-VBAT / GND | Transceiver fault flag + ERR/INT | Enter safe mode; inhibit output if required | event_id, time, mode, V/T, fault_bits, action |
| P-02 | Open line / floating | Fail-safe receive state + status bit | Enforce safe default output contract | event_id, time, fail_safe_state, duration, V/T |
| F-01 | TxD stuck / command mismatch | Status mismatch + symptom correlation | Safe mode; controlled re-init sequence | event_id, time, cmd_state, mode, reg_snapshot |
| F-02 | Dominant-timeout triggered | Timeout flag + counter | Inhibit transmit; bounded recovery policy (X) | event_id, time, timeout_count, detect_time_X, action |
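The starter table can be kept machine-checkable so requirements, tests, and logs cannot drift apart. The sketch below mirrors the four rows above; field values are placeholders to be replaced with project thresholds:

```python
# Hypothetical single-source-of-truth fault table mirroring the starter rows.
FAULT_TABLE = {
    "P-01": {"fault": "Short-to-VBAT / GND",
             "detection": "transceiver fault flag + ERR/INT",
             "reaction": "enter safe mode; inhibit output if required",
             "evidence": ["event_id", "time", "mode", "V/T", "fault_bits", "action"]},
    "P-02": {"fault": "Open line / floating",
             "detection": "fail-safe receive state + status bit",
             "reaction": "enforce safe default output contract",
             "evidence": ["event_id", "time", "fail_safe_state", "duration", "V/T"]},
    "F-01": {"fault": "TxD stuck / command mismatch",
             "detection": "status mismatch + symptom correlation",
             "reaction": "safe mode; controlled re-init sequence",
             "evidence": ["event_id", "time", "cmd_state", "mode", "reg_snapshot"]},
    "F-02": {"fault": "Dominant-timeout triggered",
             "detection": "timeout flag + counter",
             "reaction": "inhibit transmit; bounded recovery policy",
             "evidence": ["event_id", "time", "timeout_count", "detect_time_X", "action"]},
}

def validate_fault_table(table):
    """Return the set of fault IDs missing any of the four contract fields."""
    required = {"fault", "detection", "reaction", "evidence"}
    return {fid for fid, row in table.items() if not required <= row.keys()}
```

Running the validator in CI (or at gate reviews) enforces the "testable contract" rule: no fault ID ships without detection, reaction, and evidence fields.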
H2-3 · Fail-Safe Receive: Default States, Biasing, and Safe Output Contracts
Fail-safe receive is a predictable output contract under invalid inputs. When the bus input becomes unreliable (floating, open, common-mode out of range), receiver outputs must converge to a defined safe state and remain stable.
Executable definition: invalid input → safe output
A bus input is treated as invalid when it cannot be trusted to represent correct symbols. The receiver must then enforce (1) a default safe RxD state, (2) a reporting behavior (ERR/INT where applicable), and (3) an evidence-ready log snapshot.
Invalid input set (examples)
- Floating / open: line disconnected or bias missing.
- Stuck dominant/recessive: bus forced to one level by a fault.
- Common-mode excursion: outside valid receiver window (placeholder X).
- Local supply boundary: undervoltage/overvoltage causing unreliable thresholds.
- Mode mismatch: silent/standby asserted unexpectedly during operation.
Scope guard: termination, reflections, and harness SI/EMC are out of scope here; only receiver-side contracts are defined.
Safe output contracts (CAN / LIN / FlexRay)
The contract is expressed as a small set of receiver outputs that upper layers can interpret deterministically. Use conservative defaults that prevent “valid-looking” symbols during invalid input windows.
| Bus | Invalid input example | Safe RxD default | Reporter | Notes |
|---|---|---|---|---|
| CAN | Open / floating | Recessive (or defined safe state) | ERR/INT + status bits | Avoid “valid-looking” toggles; log fail-safe entry. |
| LIN | Line open / missing bias | Defined idle-safe level | Wake/INT + status | Keep wake attribution separate from fault flags. |
| FlexRay | Channel fault / invalid receive | Defined safe output (per channel) | Status + fault pins | Ensure deterministic output under dual-channel behavior. |
Contract rules: (a) safe default must be explicit, (b) transitions must be debounced, (c) evidence must log entry/exit and snapshots.
Key parameter vocabulary (placeholders)
- Fail-safe threshold (X): receiver boundary that classifies input as invalid; expressed as a window or limit (placeholder X).
- Filtering / debounce (Y): minimum persistence for entry/exit to prevent chatter; expressed as time/frames (placeholder Y).
- Glitch immunity (Z): transient pulse width that must not flip safe-state; expressed as a max glitch duration (placeholder Z).
Pass-criteria templates (fill with project thresholds)
- Enter Fail-Safe Output within X after invalid input persists for Y.
- Do not flip state for glitches shorter than Z.
- Log a single event with snapshot fields; no event storm under boundary chatter.
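The three criteria above can be sketched as a sampled entry/exit filter; this is a minimal sketch assuming a fixed sampling tick, with Y (persistence) and Z (glitch immunity) as placeholder tick counts:

```python
class FailSafeFilter:
    """Debounced fail-safe entry/exit, evaluated once per sampling tick.

    Placeholder thresholds (project-specific):
      enter_ticks (Y): invalid input must persist this long to enter fail-safe
      glitch_ticks (Z): valid/invalid excursions shorter than this must not
                        flip the current state
    """
    def __init__(self, enter_ticks, glitch_ticks):
        self.enter_ticks = enter_ticks
        self.glitch_ticks = glitch_ticks
        self.invalid_run = 0   # consecutive invalid samples
        self.valid_run = 0     # consecutive valid samples
        self.fail_safe = False

    def sample(self, input_invalid: bool) -> bool:
        if input_invalid:
            self.invalid_run += 1
            self.valid_run = 0
            if not self.fail_safe and self.invalid_run >= self.enter_ticks:
                self.fail_safe = True   # single entry event: log snapshot here
        else:
            self.valid_run += 1
            self.invalid_run = 0
            if self.fail_safe and self.valid_run > self.glitch_ticks:
                self.fail_safe = False  # exit only after stable valid input
        return self.fail_safe
```

Because entry and exit both require persistence, boundary chatter produces at most one logged entry/exit pair instead of an event storm.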
H2-4 · Bus Fault Detection: What to Detect, Where It Lives, and How It Reports
Fault detection must separate root-cause classes (short/open/thermal/UV/OV/control faults) from “communication failed.” Use layered detectors and a consistent signal-to-evidence map.
Layered detectors: transceiver vs controller
- Transceiver (root-cause oriented): local fault flags (short, UV/OV, thermal, dominant-timeout, mode mismatches).
- Controller (symptom oriented): protocol counters and bus-off state; useful for confirming severity and persistence.
Root-cause preference: when transceiver fault flags exist, classify by those flags first; controller counters provide context and trend, not root cause.
What to detect (common item list)
- Line faults: CANH/CANL short, line-to-battery/ground, open/floating, stuck dominant/recessive.
- Power/thermal: undervoltage (UV), overvoltage (OV), thermal warning/shutdown.
- Control faults: TxD stuck, RxD stuck, dominant-timeout triggers, silent-mode latch, mode pin faults.
- Symptom context: controller error counters, bus-off entry, recovery cycles (referenced only).
How it reports (pins vs SPI vs counters)
- ERR/INT pins: fast, simple, suitable for immediate response; limited root-cause detail unless paired with snapshots.
- SPI status: rich fault bits and mode context; requires a robust interrupt/poll policy and clear/latched behavior handling.
- Controller counters: symptom trending; used to confirm persistence and classify severity; do not treat as a root-cause label.
- Diagnostic frames / gateway reporting: mention only; deep details belong to controller/bridge pages.
Evidence snapshot (minimum set)
event_id · timestamp · mode · reporter(pins) · status_bits(SPI) · counter_snapshot · V/T snapshot · action_taken · recovery_reason
Deliverable: Detection signal map (fault → pin/bit/counter)
Map each fault ID to a primary detector and a primary reporter. This prevents inconsistent root-cause labels across ECUs and test stations.
| Fault ID | Primary detector | Primary reporter | Backup signal | Log snapshot focus |
|---|---|---|---|---|
| P-01 | Transceiver | ERR/INT + status bits | Controller counters | fault_bits + mode + V/T |
| P-02 | Transceiver | Fail-safe state bit | ERR/INT (optional) | fail_safe_entry + duration |
| F-02 | Transceiver | Timeout flag + counter | Controller bus-off | timeout_count + detect_time_X |
| C-01 | Controller | Bus-off state | Transceiver flags (if any) | counter trend + recovery reason |
H2-5 · ASIL Interface Hooks: Safety Pins, Safe-State Control, and Safety Manual Mapping
ASIL hooks are the wiring-level interfaces that enable detection, controlled safe-state transitions, and evidence-ready safety claims. This section maps pins, registers, and watchdog behaviors into an auditable control loop.
Scope guard: no ISO clause reproduction; only engineering mapping templates and integration rules.
Hook inventory: pins, SPI safety regs, and watchdog interfaces
Treat every hook as a contract with three attributes: Direction (in/out), Role (Detect/Control/Report), and Failure concern (how the hook itself can fail and how it is monitored).
| Hook | Role | Typical meaning | Failure concern | Integration note |
|---|---|---|---|---|
| ERR / INT | Report | Fault event notification | Stuck-high/low; missing edge | Pair with SPI snapshot + periodic line check |
| EN / STB | Control | Mode / enable control | Stuck control line | Cross-check mode readback; define safe default |
| INH / FS | Safe-state control | Power inhibit / forced safe behavior | Single-point path risk | Provide a redundant safe-state path or monitoring |
| WAKE | Report/Control | Wake request indication | False wake attribution | Log wake source separately from fault flags |
| SPI safety regs | Detect/Report | Latched fault bits, mode readback | Read/clear policy errors | Snapshot before clear; include counter/timestamp |
| Window watchdog | Detect/Control | Timing-checked servicing | Over/under-service | Tie reset cause into safety event logs |
Integration tip: classify hooks into Report (pins), Root-cause detail (SPI), and Safe-state actuation (FS/INH/EN/STB) to prevent ambiguous safety claims.
Design rules: single-point fault avoidance and testability
Rule A · Avoid single-point safe-state control paths
Safe-state actuation (FS/INH/EN/STB) must not rely on a single unmonitored line. Use redundant actuation or cross-monitoring (readback + plausibility checks).
Pass criteria (placeholder): detect control-line faults within X, and achieve safe-state within Y.
Rule B · Cross-check report pins with SPI snapshots
ERR/INT pins provide fast notification but are insufficient for root-cause classification. Always capture a pre-clear SPI snapshot to bind the event to fault bits, mode, and counters.
Pass criteria (placeholder): each event produces exactly one snapshot record; no event storm under boundary chatter.
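The snapshot-before-clear ordering can be sketched as follows; `MockSpiStatus` is a hypothetical stand-in for a latched status register, not a real driver API:

```python
class MockSpiStatus:
    """Stand-in for a latched SPI status register (hypothetical device).
    Reading returns the latched bits; clearing acknowledges and wipes them."""
    def __init__(self, latched_bits):
        self.bits = latched_bits
    def read(self):
        return self.bits
    def clear(self):
        self.bits = 0

def service_fault_event(spi, mode, counters):
    """Ordering rule: snapshot BEFORE clear, then acknowledge.
    Clearing first would bind the event to empty fault bits."""
    snapshot = {"fault_bits": spi.read(),
                "mode": mode,
                "counters": dict(counters)}
    spi.clear()  # acknowledge only after the evidence is captured
    return snapshot
```

The same ordering must hold in the interrupt path: an ERR/INT edge schedules `service_fault_event`, and no recovery or reset runs until the snapshot is committed.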
Rule C · Periodic self-test must exercise the detection chain
Self-test should validate report → snapshot → decision → actuation, not only pin toggling. Use maintenance windows for intrusive tests and ensure controlled recovery.
Pass criteria (placeholder): self-test cycle period P; maximum service impact Q.
Safety manual mapping template (engineering-only)
Use this template to align safety claims with integration assumptions and measurable mechanisms. Replace placeholders with project-specific thresholds and coverage targets.
| Section | What to write | Evidence fields | Placeholders |
|---|---|---|---|
| Assumptions | Operating boundaries, power sequencing, monitoring task availability | mode, V/T, reset cause | Boundaries = X |
| Safety mechanisms | Hooks used for detect, control, and reporting | fault bits, IRQ edges, actuation state | Detect time = Y |
| Diagnostic coverage | Coverage rate, detection windows, false-positive controls | matrix pass/fail records | Coverage = Z |
H2-6 · Fault Injection Support: How to Prove Detection and Coverage
Diagnostic capability is only meaningful when it is provable. Prove detection, timing, and coverage by executing a controlled injection matrix tied to a consistent evidence snapshot.
Coverage vocabulary (placeholders) and proof rules
- Coverage rate (Z): share of the fault list that produces the expected detect signal under controlled injection.
- Detect time (Y): maximum allowed delay between injection start and first valid detection report (placeholder Y).
- False-positive rate (X): maximum allowed unexpected detections per time window (placeholder X).
Proof rule: every injected fault must yield (1) an expected reporter signal, (2) an optional safe-state action (when required), and (3) an evidence snapshot containing consistent fields.
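The proof rule can be encoded as a single pass/fail check per injection; the result-dictionary keys here are illustrative, not a fixed log format:

```python
def injection_result_ok(result, detect_budget_y, reaction_required):
    """Apply the proof rule to one injection result:
    (1) expected reporter signal seen within the detect budget Y,
    (2) safe-state action present whenever the fault requires one,
    (3) evidence snapshot carries the consistent core fields."""
    detected = (result.get("reporter_seen", False)
                and result.get("detect_time", float("inf")) <= detect_budget_y)
    reacted = result.get("safe_state_entered", False) or not reaction_required
    snapshot = result.get("snapshot", {})
    evidenced = all(k in snapshot for k in ("event_id", "fault_bits", "mode"))
    return bool(detected and reacted and evidenced)
```

Applying the same checker to every matrix row keeps pass criteria uniform across built-in and external injection methods.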
Injection methods: built-in test modes vs external emulation (engineering principles)
Use built-in injection when available to isolate internal detection paths; use external emulation to validate the full wiring and reporting chain. Execute injections only in controlled verification environments with defined entry/exit procedures.
| Method category | Examples (placeholders) | Best for proving | Evidence focus |
|---|---|---|---|
| Built-in | forced dominant, loopback, test mode, error-flag injection | Internal detect path and reporting | fault bits + latch/clear behavior |
| External | pin forcing, line short jig, thermal / UV / OV emulation | Full chain: wiring → detect → decision → actuation | report pins + safe-state + snapshot timing |
Safety note (scope-level)
Execute injections only under controlled verification conditions with defined current-limits/protection and a documented recovery procedure. Do not treat injection content as in-field troubleshooting instructions.
Deliverable: fault injection matrix (Fault / Method / Expected detect / Pass criteria)
Tie each row to the fault taxonomy (fault IDs). For safety-critical faults, require at least two proof paths (e.g., built-in + external) to prevent single-path blind spots.
| Fault ID | Injection method | Expected detect | Expected reaction | Pass criteria | Evidence fields |
|---|---|---|---|---|---|
| P-01 | External line jig (controlled) | ERR/INT + fault bits | Safe-state when required | Detect ≤ Y; no storm | id, bits, mode, V/T |
| F-02 | Built-in test mode (placeholder) | Timeout flag + counter | TX inhibit / mode change | Detect ≤ Y; latch ok | bits, count, time |
| PW-01 | UV/OV emulation (controlled) | UV/OV flag + ERR/INT | Safe-state + recovery record | Detect ≤ Y; recover ok | V, bits, reset cause |
| T-01 | Thermal emulation (controlled) | Thermal warn/shutdown flag | Safe-state + restart policy | Detect ≤ Y; no loop | T, bits, action |
H2-7 · Diagnostics Quality: False Positives, Debounce, and “Serviceable” Events
Diagnostics is only valuable when events are trustworthy and serviceable. Standardize false-positive control, debounce rules, and an evidence packet that can be reproduced and repaired.
Scope guard: event-level quality only (no EMC design details, no timing budgets).
False positive sources: classify before tuning
Treat false positives as category problems, not “random noise.” A stable classification prevents endless threshold chasing and makes field data comparable across programs.
| Source class | Typical trigger | Quick check | Fix rule | Pass criteria |
|---|---|---|---|---|
| Transient boundary | very short excursions | compare event width to debounce window | N-of-M + minimum duration gate | FP ≤ X per window |
| Mode transition | sleep↔active, silent↔normal | check event clustering at transitions | state-gated detection + warm-up window | no spikes beyond Y |
| Window definition error | wrong denominator | recompute counts with fixed window | freeze metric spec + version tag | metrics consistent |
| Snapshot race | clear-before-read, IRQ/poll conflict | compare pin edge vs SPI latched bits | pre-clear snapshot + ordering rule | no empty snapshots |
Debounce toolkit: time windows, voting, and multi-source agreement
Debounce is a confirmation rule that defines when an event becomes serviceable. Combine three layers to avoid “delay-only” designs.
Layer 1 · N-of-M voting (time window)
Within a window M, require at least N detections to qualify. Define a minimum event duration to suppress short spikes.
Pass criteria (placeholder): FP ≤ X per T.
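A minimal N-of-M voter with a minimum-duration gate, assuming boolean per-sample detections; `n`, `m`, and `min_run` are project placeholders:

```python
from collections import deque

class NofMVoter:
    """Qualify an event only when at least n of the last m samples detect it
    AND the current detection run meets a minimum duration (min_run samples)."""
    def __init__(self, n, m, min_run):
        self.n = n
        self.min_run = min_run
        self.window = deque(maxlen=m)  # sliding window of recent samples
        self.run = 0                   # length of the current detection run

    def sample(self, detected: bool) -> bool:
        self.window.append(detected)
        self.run = self.run + 1 if detected else 0
        return sum(self.window) >= self.n and self.run >= self.min_run
```

The duration gate is what suppresses single-sample spikes even when the voting window happens to contain enough scattered hits.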
Layer 2 · Multi-source agreement (pin + counter + network context)
Require consistency across at least two signals: report pin edge, latched fault bit/counter, and network context (e.g., utilization state).
Pass criteria (placeholder): agreement rate ≥ A; missing-source events flagged as “suspect.”
Layer 3 · State-gated detection (transition windows)
During mode transitions and recovery windows, suppress or downweight detections to prevent systematic false positives. Always tag events with mode to make the behavior reviewable.
Pass criteria (placeholder): transition-related events ≤ B per cycle.
“Serviceable” event definition: evidence packet + DTC mapping
An event is serviceable only when it carries enough context to reproduce, classify, and repair. Standardize an evidence packet and map it to a stable DTC taxonomy.
| Field | Why it matters | Notes |
|---|---|---|
| Event ID + version | Stable cross-team reference | Freeze once released |
| Start/End + Duration | Separates spikes from persistent faults | Defines dedupe rules |
| Mode + Network state | Explains transition-related events | Include utilization summary |
| Snapshot (pre-clear) | Root-cause signals | Bits + counters + ordering |
| V/T + Reset cause | Makes field events reproducible | Use compact encoding |
H2-8 · Verification Plan: Design → Bring-Up → Production (Evidence-Driven)
Convert verification into executable gates with explicit inputs, outputs, owners, and pass criteria. Evidence is the common currency across design, bring-up, and production.
Gate checklist template (inputs / checks / outputs / pass criteria)
Each gate is a contract: if inputs are incomplete, the gate must fail early. Keep pass criteria measurable and link each output to an evidence artifact (logs, reports, matrices).
| Gate | Inputs | Checks | Outputs | Pass criteria | Owner |
|---|---|---|---|---|---|
| Design gate | fault list, hook wiring, log schema | completeness + consistency checks | v1 artifacts frozen | no TBD in core rows | Design / FW |
| Bring-up gate | injection cases, thresholds, FP baseline | matrix pass/fail + snapshot validity | calibrated values + report | Detect ≤ Y; FP ≤ X | FW / Test |
| Production gate | ATE/ICT items, sampling, regression triggers | measurability + version alignment | production test spec + plan | yield stable; regressions caught | MFG / QE |
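The fail-early contract above can be sketched as a small gate runner; gate names, input keys, and check labels are placeholders:

```python
def run_gate(name, inputs, required_inputs, checks):
    """A gate is a contract: incomplete inputs fail early, before any check
    runs; otherwise every check must pass with a recorded result."""
    missing = required_inputs - inputs.keys()
    if missing:
        return {"gate": name, "passed": False,
                "reason": f"missing inputs: {sorted(missing)}", "results": []}
    # Each check is (label, predicate-on-inputs); record every outcome
    # so the gate report doubles as an evidence artifact.
    results = [(label, bool(fn(inputs))) for label, fn in checks]
    return {"gate": name, "passed": all(ok for _, ok in results),
            "reason": "", "results": results}
```

Because the returned dictionary records every check outcome, the gate report itself becomes the evidence artifact the table above asks for.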
H2-9 · Engineering Checklist: Wiring, Firmware, and Lab Setup
Convert diagnostics-and-safety requirements into an executable checklist across hardware, firmware, and lab validation.
Scope guard: wiring/contracts/logging/injection readiness only (no EMC design details, no bitrate/timing tuning).
Hardware checklist: safety pins, defaults, and power sequencing
The hardware layer must make safety/diagnostics signals unambiguous at reset and predictable during brown-out. When defaults are wrong, software diagnoses the wrong root cause.
- Safety pin strapping: define pull-up/pull-down for EN/STB/WAKE/ERR/INH/FS-class pins; forbid floating inputs unless explicitly supported.
- Default-state contract: record expected RxD/INT/ERR states for reset, standby, bus disconnected, and supply undervoltage (placeholders).
- Redundancy and cross-monitor: if two channels exist, ensure independent sensing (pin vs SPI vs counter) to avoid single-point reporting failures.
- Power-up / power-down order: define sequencing between MCU reset, SBC rails, and transceiver mode pins; prevent transition-window false events.
- Reset cause observability: capture reset reason (SBC/MCU) and keep it inside the event snapshot for service triage.
Deliverables (evidence): wiring map, default-state table, and a power-seq checklist with pass criteria (X/Y placeholders).
Firmware checklist: sampling, bus-off policy, reset behavior, and throttling
Firmware defines the diagnostic truth model. Standardize counter sampling periods, event confirmation rules, and recovery policies so field events remain comparable across versions.
| Item | What to lock | Quick check | Pass criteria |
|---|---|---|---|
| Counter sampling | period, window, denominator definition | recompute with fixed window | metrics stable within X |
| Bus-off strategy | entry, recovery, cooldown | verify recovery does not spam events | event rate ≤ Y |
| Reset policy | who resets whom, when | confirm snapshot before reset | no empty evidence |
| Event throttling | dedupe rule + rate limiter | look for bursty repeats | burst ≤ B per T |
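The dedupe + rate-limit rule can be sketched as below, assuming integer ticks; the dedupe window, burst budget B, and window T are placeholders:

```python
from collections import deque

class EventThrottle:
    """Dedupe + rate limit: suppress repeats of the same event ID inside a
    dedupe window, and cap total accepted events to max_burst per window."""
    def __init__(self, dedupe_ticks, window_ticks, max_burst):
        self.dedupe_ticks = dedupe_ticks
        self.window_ticks = window_ticks
        self.max_burst = max_burst
        self.last_seen = {}    # event_id -> tick of last accepted event
        self.recent = deque()  # ticks of all accepted events (sliding window)

    def accept(self, event_id, tick) -> bool:
        # Drop duplicates of the same event inside the dedupe window.
        last = self.last_seen.get(event_id)
        if last is not None and tick - last < self.dedupe_ticks:
            return False
        # Slide the window forward, then enforce the burst budget.
        while self.recent and tick - self.recent[0] >= self.window_ticks:
            self.recent.popleft()
        if len(self.recent) >= self.max_burst:
            return False
        self.last_seen[event_id] = tick
        self.recent.append(tick)
        return True
```

Suppressed events should still increment a per-ID counter so bursty repeats remain visible in the evidence without overflowing storage or uplink.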
Deliverables (evidence): event schema vX, sampling spec, bus-off & reset policy sheet, and throttling rules with placeholders.
Lab checklist: injection fixtures, isolation, and evidence capture templates
Lab readiness is measured by repeatability and auditability. Every injection must have a defined expected signal, a snapshot requirement, and a pass/fail record template.
- Injection fixture safety: define “safe-to-connect” states and interlocks; prevent accidental hard shorts during setup.
- Isolation discipline: keep measurement grounds consistent; tag any intentional ground offset tests (event-level only).
- Pass/fail recording: for each injection, record method, duration, expected detect channel, and captured evidence packet fields.
- Repeatability: run three cycles minimum; if results depend on the operator, the procedure is incomplete.
Deliverables (evidence): injection setup SOP, safety checklist, and a pass/fail record template with placeholders.
H2-10 · IC Selection Logic: Choosing Transceiver/SBC/Controller for Safety Diagnostics
Selection must be driven by diagnostics and safety evidence: default contracts, fault reporting, and proof capability. Keep trade-offs explicit to avoid both over-design and blind spots.
Scope guard: diagnostics/safety capability only (no EMC optimization, no bitrate/timing tuning).
Step 1 — Define targets: ASIL goal, safety concept, and fault tolerant time (placeholders)
Start with system-level targets and translate them into interface requirements: diagnostic coverage, detection latency, and safe-state entry conditions (all as placeholders for now).
- ASIL target (placeholder): required diagnostic coverage and evidence strength.
- Fault tolerant time (placeholder): how fast a detection must surface to prevent unsafe actuation.
- Safe-state definition: what outputs/pins/commands constitute a safe fallback at the ECU boundary.
Step 2 — Must-have capabilities: fail-safe receive, reporting channel, test mode, observability
Map each candidate device to a small set of must-have properties that enable detection, reporting, and proof. Missing any of these typically causes “silent failures” or non-serviceable field events.
| Capability | Why it matters | Proof hook |
|---|---|---|
| Fail-safe receive contract | predictable output under open/floating/common-mode anomalies | documented default table |
| Fault reporting channel | distinguish short/open/thermal/UV/OV from generic comm loss | pin + SPI bits + counters |
| Test/injection support | prove detection latency and coverage without destructive tests | test modes / forced flags |
| Diagnostic observability | evidence packet fields can be collected with stable ordering | pre-clear snapshot rule |
Trade-offs: pins vs SPI, standby vs always-on diagnostics, sensitivity vs false positives
Trade-offs must be explicit so the system does not lose serviceability or create false alarms. Prefer decision rules that produce a repeatable device class outcome.
| Trade-off | Benefit | Risk | Mitigation |
|---|---|---|---|
| Pins vs SPI | simple latency / easier safety monitor | pin count pressure or missing detail | hybrid: pin for alert + SPI for root-cause bits |
| Standby vs always-on | lower Iq or stronger observability | blind windows during sleep | state-tagging + wake reason capture |
| Sensitivity vs FP rate | catch latent faults early | service noise / false DTCs | debounce + evidence packet standard |
H2-11 · Applications: Where Diagnostics & Safety Dominates the Architecture
Focus on why certain vehicle domains force heavier diagnostics evidence and safety behavior. This section does not discuss protocol timing, bitrate tuning, or EMC implementation details.
Scope guard: application drivers → architecture implications → minimum evidence packets → common pitfalls. Isolation is referenced only as a diagnostic path boundary, not as a design/EMC topic.
Part numbers below are example BOM references to anchor implementation choices. Always verify the exact grade, suffix, safety documentation, and availability for the target program.
Application heatmap: where diagnostics & safety pressure is highest
Rows are application domains. Columns are the diagnostic/safety drivers that shape architecture and evidence needs.
Interpretation rule: higher intensity means the architecture must prioritize deterministic detection, serviceable evidence, and safe-state behavior above feature breadth.
Powertrain / Chassis: high consequence → heavier coverage and evidence
When a bus fault can trigger torque limiting, steering fallback, or braking degradation, diagnostics must be specific (root-cause) and provable (evidence before recovery/reset).
Architecture implications: multi-source evidence (pin + SPI + controller counters), deterministic safe-state entry, and recovery policies that preserve snapshots.
Minimum evidence packet (field service ready)
- event_id, timestamp, reset_cause (before any recovery action)
- mode_state (normal / standby / silent / fault-latched)
- fault_source (pin alert vs SPI bits vs controller counters)
- snapshot placeholders: Vbat/Vio/Tj, error counters, bus-off flag
- action_taken (safe-state applied? reset executed? cooldown?)
Example BOM anchors (part numbers)
- HS CAN / CAN FD transceiver (examples): TI TCAN1042-Q1, TI TCAN1051-Q1, NXP TJA1044, Microchip MCP2562FD
- Selective wake / partial networking (examples): TI TCAN1145-Q1, NXP TJA1145
- Controller-side observability (examples): Microchip MCP2517FD (SPI CAN FD controller), TI TCAN4550-Q1 (controller + transceiver class)
Common pitfalls
- Reset or bus recovery triggers before snapshot capture → root-cause becomes unserviceable.
- Single reporting path (only a pin, or only SPI) → latent reporting failures look like generic comm loss.
- Event storms during intermittent faults → safe-state oscillation unless throttling/dedupe is defined.
HV / Isolation contexts: cross-domain observability dominates diagnostics success
Isolation creates a diagnostic boundary. The core risk is not only the bus fault itself, but that evidence cannot cross domains and service teams only see “communication lost”.
Architecture implications: define where detection lives (HV or LV), define how reports cross domains, and tag every event with domain context.
Minimum evidence packet (cross-domain)
- domain_tag (HV / LV) + mode_state
- cross_domain_link_state (reporting available? cached? degraded?)
- fault_source (local pin/SPI/counter) + report_path (gateway/log channel)
- time alignment placeholder (how HV/LV timestamps correlate)
Example BOM anchors (part numbers)
- Isolated CAN transceiver (examples):
  TI ISO1042-Q1, Analog Devices ADM3053, Analog Devices ADM3055E
- Non-isolated CAN transceiver on domain edge (examples):
  TI TCAN1042-Q1, NXP TJA1044
- Controller-side event anchoring (examples):
  Microchip MCP2517FD, TI TCAN4550-Q1
Common pitfalls
- HV side detects a fault but LV side only records “bus down” → root-cause cannot be isolated.
- Domain resets are not sequenced → timestamps and event ordering become inconsistent.
- Single reporting channel across isolation → a reporting-path fault mimics a bus fault.
Body / Comfort: high node count → false positives and event storms dominate cost
When many nodes wake/sleep and switch modes frequently, the biggest architecture risk is service noise: false DTCs, repeated alerts, and unbounded log volume.
Architecture implications: debounce rules, dedupe + rate limiters, and explicit mode tags so transient transitions do not look like faults.
Minimum evidence packet (service & triage)
- duration and repeats_in_window (storm control)
- mode_tag (sleep / wake / standby) + wake_reason placeholder
- bus_state snapshot placeholders (utilization, counters)
- DTC mapping key fields (consistent identifiers across variants)
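The debounce + dedupe + rate-limit rules described above can be sketched as one storm filter. The window lengths below are placeholders, not recommendations:

```python
import collections

class EventThrottle:
    """Debounce + dedupe + rate-limit for transceiver fault events.

    An event passes only if (a) it outlasts the debounce window, (b) it is not
    a duplicate of a recent identical event, and (c) the per-key rate limit in
    the rolling window is not exceeded. All windows are placeholder values.
    """
    def __init__(self, debounce_s=0.05, dedupe_s=1.0, max_per_window=3, window_s=10.0):
        self.debounce_s = debounce_s
        self.dedupe_s = dedupe_s
        self.max_per_window = max_per_window
        self.window_s = window_s
        self.last_seen = {}                       # key -> last accepted timestamp
        self.history = collections.defaultdict(collections.deque)

    def accept(self, key, t, duration_s):
        if duration_s < self.debounce_s:          # transient: debounced out
            return False
        if key in self.last_seen and t - self.last_seen[key] < self.dedupe_s:
            return False                          # duplicate within dedupe window
        hist = self.history[key]
        while hist and t - hist[0] > self.window_s:
            hist.popleft()                        # expire old entries
        if len(hist) >= self.max_per_window:      # storm: rate-limited
            return False
        hist.append(t)
        self.last_seen[key] = t
        return True

th = EventThrottle()
print(th.accept("err_pin", 0.0, 0.01))   # → False (too short: debounced)
print(th.accept("err_pin", 0.0, 0.10))   # → True  (accepted)
print(th.accept("err_pin", 0.5, 0.10))   # → False (duplicate within dedupe window)
```

Applying the filter at the logging layer (not only in the UI) is what keeps storage and uplink volume bounded, per the pitfall below.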
Example BOM anchors (part numbers)
- LIN transceiver (examples):
  TI TLIN1029-Q1, TI TLIN2029-Q1, NXP TJA1021, Microchip MCP2003B
- CAN transceiver with strong reporting (examples):
  TI TCAN1042-Q1, NXP TJA1044, Microchip MCP2562FD
- SBC-style integration anchor (examples):
  NXP FS65, NXP UJA1169
Common pitfalls
- Debounce implemented only in UI/reporting → storage and uplink still overflow.
- Counter window definitions differ across firmware versions → field metrics are not comparable.
- Mode transitions are not tagged → wake/sleep transients become “faults”.
H2-12 · FAQs (12): Diagnostics & Safety Boundary Troubleshooting
Fixed 4-line answers only: Likely cause / Quick check / Fix / Pass criteria (threshold placeholders).
Scope guard: diagnostics + safety interfaces only (signal → event → evidence → safe-state). No EMC details, no timing/bitrate tuning, no PN mechanism deep-dive.
1) Bus-off happens after a short glitch — first check: counter window definition or a real fault? Counter semantics vs actual fault classification.
Likely cause: error counter window/denominator mismatch, or a latched fault condition being interpreted as a brief glitch.
Quick check: log TEC/REC + bus-off flag with timestamp; verify the sampling window (X ms) and whether counters are reset/rolled over.
Fix: standardize the counter window + denominator, and add a “fault-latched vs transient” tag before executing recovery/reset.
Pass criteria: bus-off rate ≤ X per Y minutes under the defined window; counters and flags are consistent across versions.
2) ERR pin toggles but logs show no DTC — what’s the first signal-to-event mapping check? Pin/SPI/counter → event schema alignment.
Likely cause: wrong edge polarity/level mapping, missing interrupt enable, or an event filter that drops the pin pulse.
Quick check: correlate ERR pin timestamp with MCU ISR entry + event_id creation; verify mapping table: pin → event_id → DTC code.
Fix: lock a “signal-to-event map” spec (pin/SPI/counters), then implement a minimum pulse capture (X µs) and a debounce window (Y ms).
Pass criteria: ≥ X% of ERR pulses produce an event within Y ms; DTC mapping is deterministic across firmware builds.
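The "signal-to-event map" from FAQ 2 can be pinned down as one locked table that every firmware build shares. A minimal sketch; the pin names, conditions, and DTC codes below are hypothetical placeholders:

```python
# One locked mapping table: (source, signal, condition) -> (event_id, DTC code).
# All names and codes below are hypothetical placeholders.
SIGNAL_TO_EVENT = {
    ("pin", "ERR", "falling"):  ("EVT_BUS_ERR", "U0100"),
    ("spi", "CANERR", 1):       ("EVT_BUS_ERR", "U0100"),
    ("counter", "TEC", ">=96"): ("EVT_ERR_PASSIVE", "U0101"),
}

def map_signal(source, name, condition):
    """Deterministically map a raw signal to (event_id, dtc); None if unmapped.

    An unmapped signal is itself evidence of a schema gap, so callers should
    log it rather than silently drop it.
    """
    return SIGNAL_TO_EVENT.get((source, name, condition))

print(map_signal("pin", "ERR", "falling"))   # → ('EVT_BUS_ERR', 'U0100')
print(map_signal("pin", "ERR", "rising"))    # → None: schema gap, log it
```

Because the table is data rather than scattered `if` logic, the pin → event_id → DTC mapping can be diffed across firmware builds, which is what makes the pass criterion ("deterministic across builds") checkable.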
3) Fail-safe receive keeps output recessive — how to confirm line-open vs transceiver silent mode? Output contract vs mode-latched state.
Likely cause: receiver fail-safe default drives recessive on open/floating, or the device is in silent/standby mode with RxD forced recessive.
Quick check: read mode status (pin state + SPI status if available) and log “mode_state”; compare RxD behavior across known mode transitions.
Fix: explicitly tag silent/standby transitions in logs; add a mode-guard that blocks root-cause classification unless mode_state is “normal”.
Pass criteria: for open-line tests, classification accuracy ≥ X% with mode_state recorded; no “silent-mode mislabels” in Y cycles.
4) Dominant timeout triggers in the field but never on bench — what injection case is missing? Coverage gap in fault injection matrix.
Likely cause: missing injection of TxD stuck-dominant under realistic mode transitions (sleep/wake/reset), or missing “partial latch + recovery” sequence.
Quick check: compare lab injection matrix vs field timeline; verify whether TxD forcing was tested across: reset window (X ms), standby exit, and bus-off recovery.
Fix: add an injection case set: TxD dominant during reset, during wake, and during recovery; define expected detect channel(s) and evidence fields.
Pass criteria: dominant-timeout detection latency ≤ X ms and repeatability ≥ Y% over Z runs for all defined injection states.
5) False wake vs false fault — how to separate wake-source attribution from bus-fault detection? Attribution only (no partial-networking mechanism details).
Likely cause: wake reason is not tagged (bus/local/timed), so a wake transition is misclassified as a bus fault event.
Quick check: ensure logs contain wake_reason + mode_state + timestamp; check if fault events cluster within X ms after wake/standby transitions.
Fix: enforce a two-step classification: (1) attribute wake_reason; (2) enable bus-fault detection only after a stabilization window (X ms) in normal mode.
Pass criteria: post-wake false fault rate ≤ X per Y wakes; wake_reason is present in ≥ Z% of related events.
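The two-step classification in FAQ 5 — attribute the wake first, enable bus-fault detection only after a stabilization window — can be sketched as a single guard function (the window length is the "X ms" placeholder):

```python
def classify_event(event_t, wake_t, wake_reason, mode_state, stabilization_s=0.2):
    """Two-step classification for a post-wake event.

    Step 1: suppress classification unless the device is confirmed in normal
    mode. Step 2: suppress bus-fault classification inside the stabilization
    window after wake, tagging the wake_reason instead. stabilization_s is a
    placeholder value.
    """
    if mode_state != "normal":
        return "suppressed:not-normal-mode"
    if event_t - wake_t < stabilization_s:
        return f"suppressed:post-wake({wake_reason})"
    return "bus-fault-candidate"

print(classify_event(0.05, 0.0, "bus", "normal"))    # → suppressed:post-wake(bus)
print(classify_event(0.50, 0.0, "bus", "normal"))    # → bus-fault-candidate
```

Suppressed events should still be logged with their tag; they are the data needed to verify the "false fault rate ≤ X per Y wakes" pass criterion.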
6) Two ECUs disagree on fault root-cause — first alignment check: timestamps, event IDs, or counters? Correlation hygiene across nodes.
Likely cause: timebases are not aligned, event IDs are not globally unique, or counters are sampled with different windows.
Quick check: align on a single tuple: (timestamp ± X ms, event_id, counter snapshot window); verify both ECUs record the same mode_state at the same time.
Fix: standardize event_id schema (include node ID), standardize counter windows, and add a correlation rule for multi-ECU incidents.
Pass criteria: disagreement rate ≤ X% over Y injected incidents; correlation success ≥ Z% with the standardized tuple.
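The standardized correlation tuple from FAQ 6 can be expressed as one predicate. A sketch, assuming the event_id schema is `<event>@<node>` so the schema part can be compared across nodes (a hypothetical convention):

```python
def correlate(ev_a, ev_b, tol_s=0.01):
    """Decide whether two ECUs' records describe the same incident.

    Uses the standardized tuple: timestamp within ± tol_s (the "± X ms"
    placeholder), matching event schema id (the part before '@', since the
    node suffix differs per ECU), and identical counter snapshot window.
    """
    return (abs(ev_a["timestamp"] - ev_b["timestamp"]) <= tol_s
            and ev_a["event_id"].split("@")[0] == ev_b["event_id"].split("@")[0]
            and ev_a["counter_window_ms"] == ev_b["counter_window_ms"])

a = {"timestamp": 10.000, "event_id": "EVT_BUS_ERR@node1", "counter_window_ms": 100}
b = {"timestamp": 10.004, "event_id": "EVT_BUS_ERR@node2", "counter_window_ms": 100}
print(correlate(a, b))  # → True
```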
7) Thermal shutdown recovers but CAN stays unstable — first check recovery state vs bus-off policy? Recovery state machine coordination.
Likely cause: the transceiver recovers to a restricted mode while firmware assumes normal, causing repeated errors and bus-off loops.
Quick check: log Tj/thermal flag + mode_state + bus-off state across recovery; verify if firmware re-enables Tx before cooldown (X ms).
Fix: add a recovery gate: require thermal-clear + cooldown + explicit normal-mode confirmation before Tx enable and bus-off recovery.
Pass criteria: post-thermal recovery achieves stable operation for X minutes with bus-off count ≤ Y and no repeated thermal relatch.
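The recovery gate from FAQ 7 is a small state machine: thermal-clear, then cooldown, then explicit normal-mode confirmation, and only then Tx enable. A minimal sketch with a placeholder cooldown:

```python
class RecoveryGate:
    """Gate Tx enable after thermal shutdown.

    Requires, in order: a thermal-clear event, an elapsed cooldown (placeholder
    value), and an explicit normal-mode readback taken AFTER the clear.
    """
    def __init__(self, cooldown_s=1.0):
        self.cooldown_s = cooldown_s
        self.thermal_cleared_at = None
        self.mode_confirmed_normal = False

    def on_thermal_clear(self, t):
        self.thermal_cleared_at = t
        self.mode_confirmed_normal = False   # must be re-confirmed after clear

    def on_mode_readback(self, mode_state):
        self.mode_confirmed_normal = (mode_state == "normal")

    def tx_enable_allowed(self, t):
        return (self.thermal_cleared_at is not None
                and t - self.thermal_cleared_at >= self.cooldown_s
                and self.mode_confirmed_normal)

g = RecoveryGate()
g.on_thermal_clear(0.0)
g.on_mode_readback("normal")
print(g.tx_enable_allowed(0.5))  # → False: cooldown not yet elapsed
print(g.tx_enable_allowed(1.5))  # → True
```

Resetting `mode_confirmed_normal` on every thermal-clear is the detail that prevents firmware from reusing a stale "normal" readback and re-entering the bus-off loop.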
8) SPI status bits look “stuck” — is it latch behavior or a missing clear sequence? Device register contract and driver sequencing.
Likely cause: latched status bits require a read/clear or specific unlock sequence, or the driver never executes the clear path after logging.
Quick check: compare raw SPI reads before/after a known clear attempt; check if “clear on read” or “write-1-to-clear” is required (per device manual).
Fix: implement the documented clear sequence, and log a “clear_attempted” flag + post-clear readback for auditability.
Pass criteria: status bit clears within X ms after the clear sequence and stays clear for Y cycles without masking real faults.
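FAQ 8's documented clear path plus auditability can be sketched as a generic driver routine. This assumes a write-1-to-clear register (the actual semantics must come from the device manual); `read_reg`/`write_reg` are hypothetical SPI driver callables, and the fake register dict exists only for demonstration:

```python
def clear_status_bit(read_reg, write_reg, reg, mask, retries=3):
    """Generic write-1-to-clear sequence with post-clear readback.

    Returns an audit record: whether a clear was attempted and the masked
    readback after each attempt, so "clear_attempted" + readback can be logged.
    """
    audit = {"clear_attempted": False, "readbacks": []}
    for _ in range(retries):
        if not (read_reg(reg) & mask):
            break                            # already clear: nothing to do
        write_reg(reg, mask)                 # W1C: write the bit to clear it
        audit["clear_attempted"] = True
        audit["readbacks"].append(read_reg(reg) & mask)
        if audit["readbacks"][-1] == 0:
            break                            # confirmed clear
    return audit

# Fake register backing a W1C bit, for demonstration only.
regs = {0x10: 0b0001}
audit = clear_status_bit(lambda r: regs[r],
                         lambda r, m: regs.__setitem__(r, regs[r] & ~m),
                         0x10, 0b0001)
print(audit)  # → {'clear_attempted': True, 'readbacks': [0]}
```

If the readback stays set after `retries` attempts, the bit is genuinely latched by an active fault rather than a missing clear sequence, which is exactly the distinction the FAQ asks for.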
9) Fault injection passes once but later becomes flaky — what degradation sign should be logged first? Evidence-first degradation triage.
Likely cause: marginal conditions accumulate (mode latches, repeated recovery cycles, or injection fixture variability) causing non-repeatable detection.
Quick check: log the first drift indicators: mode_state changes, repeated recoveries, time-to-detect, and event rate vs run count.
Fix: add a “pre/post injection” snapshot + cooldown rule; stabilize injection procedure and enforce a fixed run protocol (X cycles, Y spacing).
Pass criteria: detection latency distribution stays within X–Y ms across Z runs; flake rate ≤ W%.
10) ASIL audit asks for detection time — how to derive it from logs without re-testing everything? Evidence-derived detection latency.
Likely cause: logs lack a “fault start marker”, so detection time cannot be computed consistently across incidents.
Quick check: identify two timestamps: (A) earliest observable symptom (pin/SPI/counter threshold crossing) and (B) event_id creation; define Δt = B − A.
Fix: add a standardized “fault start” marker (threshold crossing) and a “reaction start” marker (safe-state entry) to the event schema.
Pass criteria: detection latency Δt ≤ X ms for Y% of incidents; audit computation is reproducible from logs only.
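The Δt = B − A computation from FAQ 10 can be reproduced from logs alone once both markers exist. A sketch, assuming events are flat dicts with `incident`, `kind`, and `timestamp` fields (a hypothetical log schema):

```python
def detection_latency(events):
    """Compute Δt = B − A per incident from logs only.

    A = earliest observable symptom (fault-start marker: pin/SPI/counter
    threshold crossing), B = event_id creation. Incidents missing either
    marker are skipped rather than guessed.
    """
    a, b = {}, {}
    for e in events:
        if e["kind"] == "fault_start":
            # keep the EARLIEST symptom timestamp per incident
            a[e["incident"]] = min(a.get(e["incident"], e["timestamp"]), e["timestamp"])
        elif e["kind"] == "event_created":
            # keep the FIRST event-creation timestamp per incident
            b.setdefault(e["incident"], e["timestamp"])
    return {i: b[i] - a[i] for i in a if i in b}

log = [
    {"incident": 1, "kind": "fault_start",   "timestamp": 10.000},
    {"incident": 1, "kind": "event_created", "timestamp": 10.007},
    {"incident": 2, "kind": "fault_start",   "timestamp": 20.000},  # no B marker
]
print(detection_latency(log))  # Δt ≈ 7 ms for incident 1; incident 2 skipped
```

Skipped incidents are the audit finding: they show exactly which logs lack a fault-start marker and therefore cannot support a latency claim.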
11) Safety pin default state seems unsafe during MCU reset — what pull strategy is the first fix? Deterministic defaults at reset boundary.
Likely cause: mode/safety pins float or conflict during MCU reset, causing unintended enable/standby transitions.
Quick check: measure pin level during reset window (X ms) and compare to required safe default; verify pull direction and strength (X kΩ placeholder).
Fix: add explicit pull-up/down to enforce safe default; if needed, gate enable with an additional monitor line (cross-check) for single-point failure avoidance.
Pass criteria: pin level stays within safe range for X% of reset events; no unintended mode transition in Y reset cycles.
12) Production ATE passes, vehicle shows intermittent fault — what station-to-station mismatch is most common? Evidence consistency across test stations and builds.
Likely cause: different station thresholds or clear/reset sequences, inconsistent firmware/config versions, or missing environment tags in production records.
Quick check: compare ATE/ICT station configs: thresholds (X), timing windows (Y), reset sequence, and software version; ensure records include temperature/supply tags.
Fix: lock a station-to-station golden config, enforce version pinning, and add a minimal “station evidence packet” aligned with field log schema.
Pass criteria: config drift = 0 across stations; correlation success ≥ X% between ATE evidence and vehicle logs over Y units.
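The "config drift = 0" pass criterion from FAQ 12 is easy to automate once station configs are serialized. A sketch comparing stations against a golden reference; the config field names are placeholders:

```python
def config_drift(stations):
    """Report per-station drift against the 'golden' reference config.

    Returns {station: [drifting keys]} for every key whose value differs from
    or is missing relative to the golden config. Empty dict means drift = 0.
    Field names below are placeholders.
    """
    golden = stations["golden"]
    drift = {}
    for name, cfg in stations.items():
        if name == "golden":
            continue
        diffs = {k for k in golden if cfg.get(k) != golden[k]}
        if diffs:
            drift[name] = sorted(diffs)
    return drift

stations = {
    "golden":    {"threshold_mV": 900, "window_ms": 10, "fw": "1.4.2"},
    "station_A": {"threshold_mV": 900, "window_ms": 10, "fw": "1.4.2"},
    "station_B": {"threshold_mV": 850, "window_ms": 10, "fw": "1.4.1"},
}
print(config_drift(stations))  # → {'station_B': ['fw', 'threshold_mV']}
```

Running this check in CI against pinned station configs turns "config drift = 0 across stations" from an audit statement into a gate that fails before units ship.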