
Diagnostics / Gateway / TCU: Multi-bus, DoIP, OTA, Security


A Diagnostics/Gateway/TCU is the system “traffic + policy” core that bridges multi-bus vehicle networks to Ethernet/external services, while keeping diagnostics and OTA stable, recoverable, and secure.

This page provides an engineering path from architecture and bridging rules to serviceability logs, OTA state machines, security enforcement, and measurable pass criteria—so failures are contained and systems remain operable in real vehicles.

Definition & Boundary: What “Diagnostics / Gateway / TCU” Means Here

This page defines a system-level gateway/TCU as a network boundary node that bridges in-vehicle buses to Automotive Ethernet and external services, while enforcing diagnostic/OTA/security policies and producing service-grade observability.

Role separation (no PHY deep-dive)
Gateway
  • Traffic boundary: filtering, rate limiting, prioritization, and fault containment.
  • Policy execution point: access control, routing rules, and session gates.
  • Serviceability anchor: consistent logging, counters, and traceability IDs.
Domain Controller
  • Domain compute: coordinates domain functions (body/chassis/powertrain/infotainment).
  • May host partial gateway functions, but priority is domain feature execution.
  • Interfaces are treated as abstract ports; physical-layer tuning belongs to bus-specific pages.
TCU
  • External termination: cellular/Wi-Fi/VPN/TLS endpoints and cloud connectivity.
  • OTA & remote diagnostics orchestration with recoverable state machines.
  • Often doubles as the secure gateway boundary for access and update authorization.
Page I/O contract (what enters, what must exit)
  • Inputs: in-vehicle frames (CAN/LIN/FlexRay/Ethernet), diagnostic sessions (DoIP/service tool), OTA campaigns, security policies, timing base.
  • Outputs: policy-compliant forwarding/bridging, bounded latency/throughput behavior, auditable security events, traceable diagnostic logs, OTA state transitions with recovery.
  • Deliverables: architecture boundary map, rule tables (filter/limit/route), minimal observability schema (fields + counters), verification hooks.
Boundary ledger (anti-overlap guard)
This page covers
  • Multi-bus to Ethernet bridging logic (filter/limit/queue/fault containment).
  • DoIP/diagnostic path engineering (session gates, address mapping, serviceability logging).
  • OTA lifecycle reliability (state machine, rollback, dependency control, power-loss recovery).
  • Secure gateway integration (trust chain, keys, policy enforcement, auditability).
This page does NOT cover (use sibling pages)
  • Physical-layer and EMC deep dives: waveform shaping, termination, CMC/TVS component tuning, and CMTI details belong on the bus-specific and PHY/EMC pages.
Linking rule: if the topic requires waveform/termination/CMC/TVS/CMTI deep details, provide one sentence of context and link out—do not expand inside this page.
System map (Zone → Backbone → External)
Figure: Zones (Body / Chassis / Powertrain / Infotainment) → Automotive Ethernet backbone (routing & policy, logging & metrics) → External (cloud / OTA, service tool). The secure edge GW/TCU sits at the boundary with CAN / LIN / FlexRay / Eth ports, handling DoIP · OTA · policy · logs and TLS/VPN · auth · audit.

Diagram intent: show the gateway/TCU as the boundary control point between zonal buses, Ethernet backbone, and external diagnostics/OTA services.

System Architecture: Data Plane / Control Plane / Management Plane

A robust gateway/TCU architecture separates fast-path forwarding (data plane), decision-making and policy (control plane), and long-horizon operations (management plane). This prevents diagnostic/OTA/security features from destabilizing real-time traffic.

Plane responsibilities (engineering-grade split)
Data plane
  • Ingress parsing → classification → policy match → queueing → scheduling → egress shaping.
  • Hard requirements: bounded latency, bounded queue growth, and controlled loss under congestion.
  • Fault containment: isolate noisy ports/sessions before they impact the rest of the vehicle network.
Control plane
  • Diagnostic sessions: admission control, authorization levels, and timeout policies.
  • Routing & filtering rules: versioned distribution, rollback, and safe defaults on mismatch.
  • Error handling: degrade modes, reset boundaries, and safe recovery sequencing.
Management plane
  • Observability: structured logs, counters, traces, and health snapshots for field triage.
  • Configuration & versions: rule packs, certificates, OTA campaigns, and policy baselines.
  • Timebase consistency: unified timestamp source for auditability and cross-ECU correlation.
Data plane: decisions that prevent “diagnostics storms”
  • Classification keys: bus type, source ECU, service class (control / diagnostics / OTA / logging), and safety criticality.
  • Queue model: dedicate queues per service class; protect real-time control from diagnostic bursts via strict priority or minimum service.
  • Rate limiting: enforce per-session/per-source caps; apply backoff to misbehaving testers to avoid starvation and watchdog resets.
  • Congestion policy: define drop order (e.g., bulk logs before control), and record drops with reasons for field analysis.
  • Fault containment: circuit-breaker rules for repeated timeouts/resets; isolate the port rather than rebooting the entire gateway.
Artifact: maintain a “Bridge Policy Table” (Ingress → Class → Action → Queue → Limit → Exception).
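The row-level decision above can be sketched as a first-match lookup with a default-deny fallback. The table contents, field names, and reason codes below are illustrative assumptions, not a production schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PolicyRow:
    ingress: str      # ingress port or bus identifier
    svc_class: str    # control / diagnostics / ota / logs
    action: str       # forward / shape / proxy / drop
    queue: str        # target queue id
    reason_code: str  # stable code emitted with every decision

# Hypothetical Bridge Policy Table: first matching row wins.
POLICY_TABLE = [
    PolicyRow("can_a", "control",     "forward", "Q1", "RC_CTRL_FWD"),
    PolicyRow("doip",  "diagnostics", "shape",   "Q2", "RC_DIAG_SHAPE"),
    PolicyRow("cloud", "ota",         "shape",   "Q3", "RC_OTA_SHAPE"),
]
# Safe default: unknown traffic is denied (never broadcast) and audited.
DEFAULT_ROW = PolicyRow("*", "*", "drop", "-", "RC_DEFAULT_DENY")

def match_policy(ingress: str, svc_class: str) -> PolicyRow:
    """Attribute a frame to exactly one policy row."""
    for row in POLICY_TABLE:
        if row.ingress == ingress and row.svc_class == svc_class:
            return row
    return DEFAULT_ROW
```

Because every decision resolves to exactly one row, the emitted reason code is always attributable, which is the property the serviceability logging relies on.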
Control & management: minimum fields for traceability

Diagnostic/OTA incidents are rarely reproducible without a consistent schema. Define a minimal field set and enforce it across sessions.

  • Session: session_id, tester_id, auth_level, start/stop timestamp, timeout reason.
  • Routing: rule_pack_version, mapping_id, source_port, dest_port, action (forward/drop/shape).
  • Performance: queue_id, queue_depth_peak, drop_count, p99_latency (X ms), throughput (X Mbps).
  • Security: cert_version, key_id, policy_decision, violation_type, audit_id.
  • Reliability: reboot_cause, watchdog_stage, power_event_flag, recovery_state.
Pass criteria placeholders: p99 latency ≤ X ms, drop rate ≤ X/1k frames, false rejects ≤ X/day, recovery success ≥ X%.
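One way to enforce the minimal field set is to reject log records that omit required fields at emission time. This sketch assumes a hypothetical JSON event format, with field names taken from the list above:

```python
import json, time

# Required schema fields for a session event (subset of the lists above).
REQUIRED_FIELDS = {"session_id", "tester_id", "auth_level",
                   "rule_pack_version", "reason_code", "timestamp"}

def make_session_event(**fields) -> str:
    """Serialize a diagnostic session event; reject schema violations early."""
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"missing required log fields: {sorted(missing)}")
    return json.dumps(fields, sort_keys=True)

# Hypothetical usage: every session timeout emits a complete record.
event = make_session_event(
    session_id="S-0001", tester_id="T-42", auth_level="read",
    rule_pack_version="rp-1.3.0", reason_code="RC_TIMEOUT",
    timestamp=time.time())
```

Enforcing the schema at the producer keeps field incidents correlatable across sessions instead of discovering missing fields during triage.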
Cross-page guard (keep interfaces abstract)
  • Allowed: “port type, throughput class, error model, wake/sleep impact, diagnostics session behavior”.
  • Forbidden: “sample point, termination components, SIC waveform symmetry, TVS/CMC parasitic tuning”.
  • Action: when forbidden terms appear, provide a one-line context and link to the dedicated PHY/EMC page.
Three-plane architecture (fast path + policy + operations)
Figure: Ingress (CAN / CAN FD, LIN, FlexRay, Ethernet) → Gateway/TCU stack (security domain) with Data Plane (Classify · Queue · Schedule), Control Plane (Session · ACL · Policy), and Management Plane (Logs · Metrics · Versions) → Egress (Ethernet backbone, DoIP service, OTA cloud, audit). The fast path is isolated from sessions and operations: queueing and limits protect control traffic from diagnostics/OTA bursts.

Diagram intent: enforce a clean split—data plane stays deterministic; control plane decides; management plane records and operates.

Multi-bus ↔ Ethernet Bridging Fundamentals

Bridging is a data-path engineering problem: classify traffic, apply policy, protect control flows with queueing and limits, and contain faults so one noisy endpoint cannot destabilize the entire vehicle network.

Bridge modes: L2 forwarding vs message routing vs proxy termination
L2 forwarding
  • Fit: Ethernet-to-Ethernet segments where L2 boundaries are explicitly controlled.
  • Risk: broadcast/unknown-unicast storms and uncontrolled fan-out under misconfiguration.
  • Rule: require storm control and strict isolation policies; do not rely on “best effort” forwarding.
Message routing
  • Fit: CAN/LIN/FlexRay ↔ Ethernet where traffic is mapped by ID/address/service class.
  • Risk: rule-table growth, rule conflicts, and “works on bench, fails in field” version drift.
  • Rule: version rule packs and default-safe behavior on mismatch (deny/shape + audit).
Proxy termination
  • Fit: diagnostics/OTA/security where the gateway must enforce authorization and produce audit trails.
  • Risk: state explosion and resource exhaustion if proxy logic leaks into the fast path.
  • Rule: keep proxy decisions in the control plane; keep the data plane deterministic.
Filtering & rate limiting: prevent storms, false diagnostics, and DoS-like overload
  • Filter keys: port, source ECU, service class (control / diagnostics / OTA / logs), and session identity.
  • Admission control: bound concurrent diagnostic sessions; reject excess sessions with a reason code and audit log.
  • Rate caps: apply per-session and per-source limits; add backoff when repeated timeouts indicate retry storms.
  • Congestion policy: define drop order that protects control traffic; record drop reason for field triage.
  • Safe defaults: unknown traffic is shaped or denied (never broadcasted) and always audited.
Verification placeholders: storm trigger time ≤ X s, queue depth peak ≤ X, drop rate ≤ X/1k frames, session reject is always logged.
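Per-session caps of this kind are commonly implemented as token buckets, which bound the sustained rate while tolerating short bursts. The parameters below are arbitrary illustrations:

```python
# Hypothetical per-session token-bucket limiter: caps sustained rate while
# allowing short bursts; a misbehaving session simply runs out of tokens.
class TokenBucket:
    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s     # sustained refill rate (frames/s)
        self.capacity = burst      # maximum burst size
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller emits the drop reason code and audit entry
```

A denied `allow()` call is the point where the bridge would record the drop with its reason, keeping throttling explainable rather than silent.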
Queueing: protect control flows while keeping diagnostics explainable
  • Service-class queues: Control / Diagnostics / OTA / Logs as the primary split.
  • Fairness knobs: per-ECU or per-session sub-queues to prevent single-source starvation.
  • Scheduling: strict priority or minimum-service guarantees for control; diagnostics are shaped, not “randomly dropped”.
  • Explainability: every throttle/drop should map to a policy rule and emit a reason code.
Pass criteria placeholders: control traffic p99 latency ≤ X ms under full diagnostic load; diagnostics p99 latency ≤ X ms in service mode.
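A minimal sketch of "strict priority with a minimum-service guarantee", assuming just two classes (control and diagnostics) and a hypothetical 1-in-N diagnostics slot:

```python
from collections import deque

# Hypothetical two-class scheduler: control is strictly prioritized, but
# diagnostics gets a minimum service share, so it is shaped, not starved.
class ClassScheduler:
    def __init__(self, min_diag_every: int = 4):
        self.queues = {"control": deque(), "diagnostics": deque()}
        self.min_diag_every = min_diag_every  # serve diagnostics 1-in-N
        self._since_diag = 0

    def enqueue(self, svc_class: str, frame):
        self.queues[svc_class].append(frame)

    def dequeue(self):
        ctrl, diag = self.queues["control"], self.queues["diagnostics"]
        # Minimum-service guarantee: force a diagnostics slot periodically.
        if diag and self._since_diag >= self.min_diag_every:
            self._since_diag = 0
            return diag.popleft()
        if ctrl:                       # strict priority otherwise
            self._since_diag += 1
            return ctrl.popleft()
        if diag:
            self._since_diag = 0
            return diag.popleft()
        return None
```

Under full control load, diagnostics still receives one slot in N, which keeps service tools responsive without letting a tester burst starve control traffic.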
Fault containment: isolate, circuit-break, and degrade instead of rebooting everything
  • Isolation domains: by port, by session, and by service class (preferred for service stability).
  • Circuit breaker states: Closed → Open → Half-open; transitions require explicit reasons and timers.
  • Degrade modes: keep control + restrict diagnostics + pause OTA; or service-only mode for recovery.
  • Auditability: every isolation event produces an audit_id and correlates with queue and session metrics.
Recovery placeholders: isolation duration ≤ X s, half-open success ≥ X%, system-level reboot rate ≤ X/day.
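The Closed → Open → Half-open containment logic can be sketched as a small state machine. Thresholds, timers, and reason strings below are placeholders:

```python
# Hypothetical port circuit breaker: Closed -> Open -> Half-open, with
# explicit reasons and timers, so a noisy port is isolated instead of
# rebooting the whole gateway.
class CircuitBreaker:
    def __init__(self, fail_threshold: int, open_secs: float):
        self.state = "closed"
        self.failures = 0
        self.fail_threshold = fail_threshold
        self.open_secs = open_secs
        self.opened_at = None
        self.last_reason = None

    def record(self, ok: bool, now: float, reason: str = ""):
        if self.state == "open" and now - self.opened_at >= self.open_secs:
            self.state = "half_open"                 # probe traffic allowed
        if ok:
            if self.state == "half_open":
                self.state, self.failures = "closed", 0  # recovery confirmed
            elif self.state == "closed":
                self.failures = 0
        else:
            self.failures += 1
            self.last_reason = reason
            if self.state == "half_open" or self.failures >= self.fail_threshold:
                self.state, self.opened_at = "open", now  # isolate the port

    def allows_traffic(self) -> bool:
        return self.state != "open"
```

Every transition carries a reason and a timestamp, so isolation events can be correlated with the queue and session metrics required by the audit rules above.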
Output artifact: Bridge Policy Table (Ingress → Class → Action → Exception)
Ingress | Class | Match Keys | Action | Queue & Rate | Exception | Audit
CAN Port A | Control | ECU / ID range | Forward | Q1 · min service | Service mode | rule_pack_version
Tester / DoIP | Diagnostics | session_id / auth_level | Proxy / Shape | Q2 · cap X | Factory mode | reason_code
Cloud / OTA | OTA | campaign_id / version | Shape / Pause | Q3 · cap X | Degrade mode | audit_id

Implementation rule: every forward/drop/shape decision must be attributable to a single policy row and emit a stable reason code.

Typical bridging path (CAN → policy → queue/limit → Ethernet)
Figure: Ingress (CAN, LIN, FlexRay, Ethernet) → bridge pipeline: Parse → Classify (service / ECU) → Policy match (action · reason) → Queues (Q1 / Q2 / Q3) → Limit / circuit-breaker open → Egress (Ethernet backbone, DoIP service, OTA cloud). Key points: classify by service/ECU, apply policy with reason codes, queue by class, limit per session, and isolate on repeated faults.

Diagram intent: show where decisions happen (policy), where protection happens (queues/limits), and where containment happens (circuit breaker).

Diagnostics Path: DoIP Session, Addressing, and Serviceability

Stable diagnostics requires three things: a deterministic admission gate, a versioned address-mapping contract, and a minimal logging schema that makes field failures explainable without reproducing the exact harness setup.

End-to-end chain (Tester → DoIP → Gateway → Target ECU) and first failure checks
  • Session overload: too many concurrent sessions or retries can starve control traffic; check admission counters and queue watermarks first.
  • Mapping drift: address-table version mismatch can look like random timeouts; verify mapping_id and rule_pack_version alignment.
  • Policy mismatch: an auth-level downgrade may cause silent rejects; require explicit reason codes and audit IDs.
  • Resource collapse: CPU/memory spikes or encryption overhead can trigger watchdog resets; correlate session events with health logs.
Fast triage order: admission/queues → mapping/version → auth gate → system health.
Address mapping: an engineering contract (not a spec reprint)
  • Goal: translate DoIP-side logical addressing into stable target identities without ambiguity.
  • Versioning: every mapping must carry mapping_id and rule_pack_version for field correlation and rollback.
  • Conflicts: duplicates and gaps must resolve to safe defaults (reject/shape + audit) rather than “best-effort forward”.
  • Self-check: validate coverage, uniqueness, and default actions before enabling service mode in production.
Artifact: “Mapping Pack” = {mapping_id, rule_pack_version, default_action, audit_id policy}.
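The self-check step can be automated before service mode is enabled. This sketch assumes a hypothetical Mapping Pack dictionary whose keys match the artifact fields above:

```python
# Hypothetical Mapping Pack validator: checks coverage, uniqueness, and a
# safe default action before service mode is enabled in production.
def validate_mapping_pack(pack: dict, required_targets: set) -> list:
    """Return a list of violations; an empty list means the pack is usable."""
    violations = []
    mappings = pack.get("mappings", {})        # logical_addr -> target id
    if pack.get("default_action") not in ("reject", "shape"):
        violations.append("default_action must be reject or shape")
    targets = list(mappings.values())
    if len(targets) != len(set(targets)):
        violations.append("duplicate target identities")
    missing = required_targets - set(targets)
    if missing:
        violations.append(f"uncovered targets: {sorted(missing)}")
    for key in ("mapping_id", "rule_pack_version"):
        if key not in pack:
            violations.append(f"missing version field: {key}")
    return violations
```

Running this check at rule-pack load time turns "works on bench, fails in field" version drift into an explicit, logged rejection.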
Authentication gate: read vs write vs programming must be explicit and audited
  • Read-only: lowest risk, still requires session identity and rate limits.
  • Write: requires elevated auth_level and strict per-target quotas; reject is never silent.
  • Programming/flash: highest risk; enforce strong authorization, maintenance mode constraints, and mandatory audit trails.
  • Reject behavior: always return a stable reason_code and record an audit_id with timestamps and latency.
Pass criteria placeholders: false rejects ≤ X/day, unauthorized writes = 0, programming attempts always logged with audit_id.
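A gate of this shape makes reject behavior explicit: every decision returns a stable reason code and appends an audit record. The auth level names, reason codes, and audit sink are illustrative assumptions:

```python
# Hypothetical auth gate: read/write/flash map to minimum auth levels, and
# every reject returns a stable reason_code plus an audit entry.
AUTH_RANK = {"none": 0, "read": 1, "write": 2, "programming": 3}
REQUIRED = {"read": "read", "write": "write", "flash": "programming"}

audit_log = []  # stand-in for the gateway's structured audit sink

def gate(op: str, session_auth: str, session_id: str) -> tuple:
    """Return (allowed, reason_code); rejects are never silent."""
    needed = REQUIRED.get(op)
    if needed is None:
        decision = (False, "RC_UNKNOWN_OP")
    elif AUTH_RANK[session_auth] >= AUTH_RANK[needed]:
        decision = (True, "RC_OK")
    else:
        decision = (False, "RC_AUTH_TOO_LOW")
    audit_log.append({"session_id": session_id, "op": op,
                      "auth_level": session_auth,
                      "reason_code": decision[1]})
    return decision
```

Because the audit append happens before the return, even allowed operations leave a trail, which is what makes programming attempts always traceable.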
Serviceability: minimal logging fields that make failures explainable
  • Time: timestamp (unified timebase), duration/latency_ms.
  • Session: session_id, tester_id, auth_level, start/stop reason.
  • Target: target_ecu, logical_addr, physical_addr (or stable target ID).
  • Operation: service_id, payload_len, status (ok/fail/timeout/reject), reason_code.
  • Resources: queue_id, queue_depth, drop_count_delta, throttle_events.
  • Versions: rule_pack_version, mapping_id, cert_version (if applicable).
Rule: every timeout/reject must be attributable to one gate (admission / mapping / auth / resource) and must emit a reason code.
DoIP session flow (state + auth gate + audit points)
Figure: Tester (DoIP client) → Gateway (DoIP server) → Target ECU (diagnostic service). Session flow: Idle → Session up → Auth Gate → Read / Write / Flash, with an audit point at each operation. Key points: versioned mapping, explicit auth levels, admission limits, and audit IDs for every reject/timeout.

Diagram intent: show the auth gate as a mandatory step and mark audit points that enable field debugging without reproducing the entire setup.

OTA Lifecycle: State Machine, Rollback, and Dependency Control

Automotive-grade OTA requires recoverability. The lifecycle must define durable state per step, strict activation gates, and rollback behavior that keeps the vehicle serviceable under power loss, weak links, or reboots.

Layered OTA model: Campaign → Download → Verify → Install → Activate → Confirm
  • Campaign: define target set, allowed windows, rollout groups, and dependency rules as a versioned contract.
  • Download: chunked transfer with resume; rate caps protect control traffic under weak links.
  • Verify: signature and hash checks; reject on mismatch with explicit reason codes and audit IDs.
  • Install: write to staging/A-B slot; keep the current image untouched until activation is safe.
  • Activate: switch pointers/slots via an atomic flag; ensure a deterministic boot path.
  • Confirm: commit only after health signals pass; otherwise trigger rollback or safe service mode.
Principle: activation is separated from installation; confirmation is the only commit point.
Recoverable state machine: power loss, reboots, and weak links
  • Durable progress: each step persists minimal fields to resume or roll back without guessing.
  • Atomic boundaries: define where interruption is allowed (download, staged install) vs guarded (activate switch).
  • Restart rules: on reboot, recover from the last durable state and follow deterministic transitions.
  • Failure semantics: verification failures never activate; install failures keep the old image bootable.
Pass criteria placeholders: reboot-resume success ≥ X%, activation switch is always atomic, bricks = 0 over X cycles.
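The restart rule above ("recover from the last durable state, deterministic transitions") can be sketched as a pure lookup on durable fields. The state names follow the lifecycle described in this section, while the field shapes are assumptions:

```python
# Hypothetical reboot-recovery routine: the next state after a power loss is
# derived only from durable fields, never from heuristics.
SAFE_NEXT = {
    "download": "download",   # resume from received_ranges
    "verify":   "verify",     # re-run checks; verification is idempotent
    "install":  "install",    # staging slot is re-writable; old image intact
    "activate": "confirm",    # atomic flag already switched: boot new image
    "confirm":  "confirm",    # still inside the confirm window
}

def recover_after_reboot(durable: dict) -> str:
    state = durable.get("state")
    if state == "activate" and not durable.get("activation_flag"):
        return "install"      # switch never committed: old slot boots
    if state == "confirm" and durable.get("confirm_result") == "fail":
        return "rollback"
    return SAFE_NEXT.get(state, "idle")  # unknown state falls to safe idle
```

The key property is that the function is total: every combination of durable fields maps to exactly one next state, so no reboot leaves the OTA engine guessing.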
A/B (dual image) rollback: triggers, window, and non-rollback exceptions
  • Triggers: boot-loop counters, health-check failure, missing critical services, or explicit negative confirmation.
  • Confirm window: commit only after stable operation across defined cycles/time; otherwise rollback automatically.
  • Non-rollback cases: if rollback is disallowed (e.g., mandatory security update), fail into a restricted safe mode with full diagnostics.
  • Auditability: every rollback records audit_id, reason_code, and the last known durable state.
Recovery placeholders: rollback completion ≤ X s, safe-mode entry ≤ X s, post-rollback serviceability preserved.
Multi-ECU dependency control: version matrix, ordering, and damage containment
  • Version matrix: define compatible sets and minimum versions; reject activation when dependencies are not satisfied.
  • Ordering: enforce explicit sequences per domain role (gateway services, targets, then optional modules) with rollback points.
  • Stop-loss: if any critical ECU fails, pause the campaign and keep the vehicle in a known serviceable mode.
  • Confirm scope: confirmation checks both ECU health and dependency satisfaction as a whole.
Metric placeholders: dependency-violation activations = 0, partial-complete campaigns ≤ X%, recovery success ≥ X%.
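Version-matrix enforcement reduces to a comparison per ECU. This sketch assumes simple dotted numeric versions and a hypothetical matrix of minimums:

```python
# Hypothetical dependency check: activation is rejected unless every ECU in
# the compatible set meets its minimum version.
def version_ok(installed: str, minimum: str) -> bool:
    """Compare dotted numeric versions component-wise."""
    return (tuple(map(int, installed.split(".")))
            >= tuple(map(int, minimum.split("."))))

def activation_allowed(fleet_versions: dict, matrix: dict) -> tuple:
    """matrix: ecu -> minimum version; returns (allowed, violations)."""
    violations = [ecu for ecu, min_v in matrix.items()
                  if not version_ok(fleet_versions.get(ecu, "0.0.0"), min_v)]
    return (len(violations) == 0, violations)
```

An ECU absent from the reported fleet versions counts as a violation, which implements the "reject activation when dependencies are not satisfied" rule rather than assuming the best.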
Output artifacts: OTA state machine table + durable fields checklist
State | Entry | Do | Durable Fields | Exit | Fail → | Safety Note
Download | Campaign accepted | Chunked fetch + resume | package_id, received_ranges, bytes_done, hash_state | All chunks received | Retry / Pause | Old image untouched
Verify | Download complete | Signature + hash checks | signature_ok, hash_ok, verified_version, audit_id | Verified | Abort | Never activate on fail
Install | Verify ok | Write to staging slot | staging_slot, write_offset, progress, result | Installed | Retry / Abort | Old boot slot preserved
Activate | Install ok | Atomic slot switch | next_boot_slot, activation_flag, time | Boot new image | Rollback | Switch must be atomic
Confirm | Boot ok | Health checks + commit | confirm_deadline, signals, result, reason_code | Committed | Rollback / Safe | Vehicle stays serviceable

Implementation rule: durable fields must be sufficient to determine the next state after reboot without heuristic guesses.

Durable fields checklist (minimum)
  • Campaign: campaign_id, target_set_hash, rollout_group, policy_window
  • Download: package_id, received_ranges, bytes_done, chunk_hash_state
  • Verify: signature_ok, hash_ok, verified_version, audit_id
  • Install: staging_slot, write_offset, install_progress, install_result
  • Activate: next_boot_slot, activation_flag, activation_time
  • Confirm: confirm_deadline, health_signals, confirm_result, reason_code
  • Audit: audit_id, mapping_id, rule_pack_version (for correlation with gateway policies)
OTA state machine (main path + rollback branch)
Figure: Main path Campaign → Download → Verify → Install → Activate → Confirm, with a persist point after each step; failures branch to Rollback (old slot) or restricted Service Mode. Commit happens only after Confirm; failures route to rollback or restricted service mode with full audit logs.

Diagram intent: highlight durable persistence points and show rollback/safe-mode paths without dense text.

Secure Gateway Integration: Trust Chain, Keys, and Policy Enforcement

Secure gateways are operational systems: a trust chain anchors identity, key management keeps credentials alive across the vehicle lifecycle, and policy enforcement produces auditable decisions for diagnostics and OTA.

Trust chain: secure boot → HSM/root key → runtime identity → policy
  • Secure boot: establishes a trusted software identity for the gateway/TCU runtime.
  • HSM/root key: anchors cryptographic operations; private roots remain non-exportable.
  • Runtime identity: produces stable device_id/cert_id used by sessions and policy decisions.
  • Policy tie-in: identity + session attributes map to allow/deny/shape decisions with audit IDs.
Rule: policy enforcement must never depend on unauthenticated identity claims.
Key management: lifecycle, rotation, revocation, factory injection, service updates
  • Lifecycle: issue → activate → rotate → revoke/expire with explicit ownership and audit trails.
  • Rotation: time-based and event-based rotation; support rollback of configuration but not of root identity.
  • Revocation: define behavior on expired/revoked credentials (restricted mode vs deny-all) by policy.
  • Factory injection: bind identity to hardware roots; record injection batch and provisioning version.
  • Service updates: update credentials in service mode without breaking diagnostics access.
Pass criteria placeholders: rotation success ≥ X%, expired-credential outages ≤ X/day, revocation effects are predictable and logged.
Secure comms boundary: where TLS/VPN terminates and who owns certificates
  • Termination point: terminate at the gateway for fine-grained audit/policy, or upstream for simplified roles; make it explicit.
  • Certificate ownership: define who rotates and who revokes; enforce a single source of truth for cert versions.
  • Failure semantics: handshake failure routes to restricted mode or deny by policy; never fall back silently.
  • Audit: record peer_id, cert_id, channel, and reason_code for every deny or downgrade.
Policy placeholder: “no silent downgrade”; all downgrades emit audit_id and reason_code.
Policy enforcement: allowlists, diagnostics privilege, OTA authorization, domain isolation
  • Network allowlist: permitted peers, ports, and services; default deny for unknown traffic.
  • Diagnostics privilege: read/write/flash mapped to auth_level; enforce quotas per session and per target.
  • OTA authorization: campaign must be signed/approved; activation depends on dependency satisfaction and policy windows.
  • Domain isolation: cross-domain flows require explicit permits; log every cross-boundary decision.
Correlation fields: rule_id, audit_id, reason_code, mapping_id, rule_pack_version.
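Default-deny allowlisting with the correlation fields above can be sketched as a keyed lookup. The peer names and rule identifiers are hypothetical:

```python
# Hypothetical network allowlist: permitted (peer, service) pairs map to a
# rule_id; anything else falls through to default deny with an audit reason.
ALLOWLIST = {
    ("ota_backend",  "https"): "rule_ota_01",
    ("service_tool", "doip"):  "rule_diag_02",
}

def decide(peer: str, service: str) -> dict:
    rule_id = ALLOWLIST.get((peer, service))
    return {
        "action": "allow" if rule_id else "deny",  # unknown -> default deny
        "rule_id": rule_id or "rule_default_deny",
        "reason_code": "RC_ALLOW" if rule_id else "RC_NOT_ALLOWLISTED",
    }
```

Every decision, including an allow, carries a rule_id and reason_code, so cross-boundary flows remain attributable in the audit trail.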
IDS / anomaly (minimal viable): rate, session, and replay-like indicators
  • Rate anomalies: sudden spikes per peer/service; link to throttling and circuit-breaker triggers.
  • Session anomalies: abnormal failures, timeouts, or concurrency patterns; tie to admission gates.
  • Replay-like patterns: repeated identical requests in short windows; enforce policy-based rejection and auditing.
  • Actionability: anomaly signals must trigger degrade/isolation/audit escalation instead of being passive dashboards.
Output placeholder: anomaly_id + counter_id + threshold + outcome + audit_id.
Trust chain and enforcement flow (Boot → HSM → Keys → TLS → Policy → Logging)
Figure: Trust and enforcement chain Boot → HSM → Keys → TLS/VPN → Policy → Logging, with runtime identity anchored by the HSM and anomaly signals feeding policy. Key points: identity anchored by HSM, explicit comms termination, policy with reason codes, and audit logs for every decision.

Diagram intent: show a closed loop from trust anchoring to policy enforcement and auditability, with anomaly inputs driving action.

Functional Safety & Reliability: Fail-Operational vs Fail-Silent

A gateway/TCU failure must have a defined outcome. Safety objectives grade impacts by function class, health monitoring closes the loop from detection to action, and degraded modes preserve serviceability without allowing fault propagation.

Fail-Operational vs Fail-Silent (engineering meaning)
  • Fail-Operational: preserve a minimal set of essential functions under fault, typically via controlled degradation.
  • Fail-Silent: stop emitting potentially harmful effects, isolate external connectivity, and block cross-domain propagation.
  • Per-function decision: control, diagnostics, OTA, and external connectivity do not share the same allowed outcome.
  • Evidence requirement: every degrade/isolate decision must be explainable with audit_id and reason_code.
Design rule: “fail-operational” is scoped and explicit; everything else defaults to “fail-silent” across trust boundaries.
Safety objectives: grade impacts by function class and required behavior
Function Class | Failure Impact | Required Outcome | Recovery Target | Audit Minimum
Control (critical) | High | Fail-operational (minimal set) or isolate to local domain | ≤ X s | mode_id, detector_id, action_id, audit_id
Diagnostics | Medium | Degraded (service-only) with strict privileges | ≤ X s | session_id, reason_code, audit_id
OTA | Medium | Fail-safe pause + recover/rollback (no partial activation) | ≤ X min | campaign_id, state, audit_id
External connectivity | High | Fail-silent by default (isolate) unless explicitly allowed | ≤ X s | peer_id, channel, rule_id, reason_code

Implementation note: safety objectives must map to explicit degraded modes; “undefined behavior” is treated as a design failure.

Watchdog & health monitoring: detect, classify, and recover
  • Deadlock / stalls: heartbeat gaps, scheduling delay spikes, and unresponsive service endpoints.
  • Memory pressure: heap high-watermark, allocation failures, handle growth, and leak-rate estimates.
  • Queue blockage: queue depth saturation, tail latency, drops, and backpressure trigger counts.
  • Session health: abnormal timeouts, failed handshakes, and runaway concurrency.
Actions (layered recovery)
  • Soft recovery: restart a service, clear a stuck queue, re-load rule packs, re-open sessions.
  • Containment: circuit breaker, rate limiting, deny-by-policy, cross-domain blocking.
  • Hard recovery: controlled reboot, revert to last known-good config, enter service-only mode.
Pass criteria placeholders: detection time ≤ X ms, recovery time ≤ X s, false triggers ≤ X/day, audit completeness = 100%.
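The layered recovery order (soft, then containment, then hard) can be expressed as an escalation function keyed on repeat count. Fault names and thresholds are illustrative:

```python
# Hypothetical escalation policy: soft recovery first, containment on
# repetition, hard recovery (service-only mode) as the last resort.
def choose_action(fault: str, repeat_count: int) -> str:
    soft = {"stalled_service": "restart_service",
            "stuck_queue":     "clear_queue",
            "rule_pack_error": "reload_rule_pack"}
    if repeat_count <= 1 and fault in soft:
        return soft[fault]            # soft recovery first
    if repeat_count <= 3:
        return "circuit_break"        # contain before rebooting anything
    return "service_only_mode"        # hard recovery, preserves diagnostics
```

Tracking `repeat_count` per fault within a time window (not shown) is what prevents a flapping service from oscillating between soft restarts forever.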
Degraded modes: explicit policy matrix
Mode | Allowed | Denied | Enter | Exit | Required Logs
Mode 1 (Control-preserve) | Minimal control flows, critical routing, bounded queues | External access, non-essential cross-domain traffic | Resource stress, repeated stalls | Health passes Y cycles | mode_id, detector_id, action_id
Mode 2 (Service-only) | Diagnostics + logging, strict auth_level gates | OTA activation, broad routing, external sessions | Policy failure, rule-pack mismatch | Service confirmation | session_id, rule_id, reason_code
Mode 3 (Silent / Isolated) | Local safe logging only, bounded watchdog recovery | External connectivity, cross-domain flows | Untrusted state, repeated failed recoveries | Manual service reset | audit_id, fault_id, last_state
Pass criteria placeholders: wrong-mode entries ≤ X/1k hours, isolation time ≤ X ms, recovery success ≥ X%.
Fault injection hooks: verify detection and actions
Injection | Expected Detector | Expected Action | Pass Criteria
CPU saturation (X%) | sched-delay / heartbeat timeout | rate limit + degrade to Mode 1 | detect ≤ X ms; action ≤ X ms
Memory allocation failures | heap watermark / alloc-fail counter | restart service; if repeated → Mode 2 | recovery ≤ X s; logs complete
Queue blockage / flood | queue depth + tail latency | circuit breaker + drop policy | no cross-domain collapse
Policy engine stalls | policy heartbeat + rule-pack integrity | Mode 2 or Mode 3 depending on trust | deny-by-default is enforced
Evidence fields (minimum): timestamp, fault_id, detector_id, threshold_id, action_id, mode_id, audit_id, reason_code.
Health monitoring closed loop (Detect → Decide → Act → Record → Recover/Isolate)
Figure: Signals (CPU, memory, queues, sessions) → decision (rules, thresholds) → actions (rate limit, breaker, restart, isolate) → record (audit log, Mode 1/2/3). Closed loop: detect anomalies → decide thresholds → act (contain/recover) → record evidence → recover or isolate.

Diagram intent: present a complete reliability loop with explicit actions and audit evidence, without protocol-level details.

Physical Integration Envelope: EMC & Protection Boundaries for Gateways

Gateway/TCU ports define the system boundary. Protection must be layered from the connector inward, return paths must be planned, and configurable drive/slew must follow a system policy that balances emission, margin, and robustness.

Port boundary layering: connector → surge → ESD → common-mode → clean domain
  • Connector / shield: define the physical entry and shield bonding point(s).
  • Surge layer: manage energy and return paths; keep surge loops out of sensitive signal ground.
  • ESD layer: clamp fast events close to the entry; minimize inductive distance to the return path.
  • Common-mode layer: suppress radiated/common-mode energy before the clean IC domain.
  • Clean domain: keep protocol ICs inside a clearly defined “quiet zone” behind the protection stack.
Placement rule: protection components stay near the connector; clean-domain routing starts after the last boundary element.
Return paths & ground: chassis, shield, and surge loops
  • Keep “dirty return” away: ESD/surge return must not traverse sensitive digital/analog reference regions.
  • Shield continuity: bonding must be explicit; floating or intermittent shield connections often amplify emissions.
  • Single controlled tie: define where clean reference and chassis/body ground connect (if required), and keep it deterministic.
  • Black-box logging: record port, trigger type, and the resulting action (reset/isolate) for service correlation.
Pass criteria placeholders: post-event resets ≤ X/100 tests, false wake ≤ X/day, recover ≤ X s with port attribution.
Configurable slew/drive: system policy, not per-board guesswork
  • Slower edges: reduce emissions but shrink timing margin and increase sensitivity to noise and loading.
  • Stronger drive: improves robustness but can increase crosstalk and radiated energy.
  • Policy table: define profiles by harness class (length, node count, environment) and validate against a fixed checklist.
Policy placeholders (example profiles)
  • Harness A: low emission profile (slew low, drive medium)
  • Harness B: balanced profile (slew medium, drive medium)
  • Harness C: high robustness profile (slew high, drive high with strict containment)
When isolation is required: decision criteria (system-level)
  • Uncontrolled ground potential differences: domains with unpredictable reference offsets across operating conditions.
  • High disturbance interfaces: external-facing ports or long harness segments with strong coupling risk.
  • Security partitioning: boundaries where external connectivity must not influence safety-critical domains.
  • Fault containment: when a single-port event must not impact the rest of the network.
Validation placeholder: isolation decision documented, test evidence stored, and service diagnostics remain available.
Output artifacts: port protection stack + return-path checklist
Port Type | Layer Stack | Placement | Return Target | Risk Notes
Ethernet / External | Surge → ESD → CM → Clean | PCB edge / connector-side | Chassis / body ground | Reset, false wake, session drops
Diagnostics port | ESD → CM → Clean | Connector-side | Controlled tie point | False diagnostics triggers
Power entry | Surge → ESD → Filtering | At entry + tight loop | Chassis/body + power return | Brownout / reset storms
Return-path checklist (minimum)
  • “Dirty return” paths do not cross clean reference regions.
  • Shield bonding is explicit and mechanically reliable.
  • Protection components are connector-close with short return loops.
  • Clean/dirty zone boundary is drawn and enforced in layout review.
Connector protection “onion” (Surge → ESD → Common-mode → Clean domain + return arrows)
Figure: Connector and shield feed the protection layers Surge → ESD → Common-mode → Clean IC; signal flows inward to the clean domain, surge/ESD return paths route to chassis/body ground, and a configurable slew/drive policy applies at the port.

Diagram intent: visualize protection layering at the connector and emphasize return-path direction without waveform-level details.

Performance Budgeting: Latency, Throughput, CPU/Memory, and Congestion

Budgeting turns performance into engineering contracts: an end-to-end latency model with measurable boundaries, a throughput-to-resource accounting view, congestion protections that preserve serviceability, and pass criteria that can be accepted in production.

End-to-end latency decomposition (measurable boundaries)
  • Ingress: first observable entry point until classification begins (driver scheduling included).
  • Classify: policy match (routing/ACL/session) and rule-pack lookup decision.
  • Queue: waiting time under contention; often the dominant term in P99 latency.
  • Process: forwarding/encapsulation, copy count, crypto checks, and log field assembly.
  • Egress: shaping, rate limiting, and transmission scheduling to the next hop.
Measurement rule: each stage must expose time markers (t0…tN) so P50/P95/P99 can be reported per stage, not just end-to-end.
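The stage decomposition above can be sketched as a per-stage percentile report. This is a minimal illustration, not a shipped tool: the stage names and the `[t0..t5]` sample layout are assumptions taken from the text, and `stage_percentiles` is a hypothetical helper name.

```python
# Sketch: per-stage latency percentiles from the t0..t5 markers described
# above. Stage names and the sample layout are illustrative assumptions.
from statistics import quantiles

STAGES = ["ingress", "classify", "queue", "process", "egress"]

def stage_percentiles(samples):
    """samples: list of [t0..t5] marker lists in seconds for one run.
    Returns {stage: {"p50", "p95", "p99"}} in milliseconds."""
    report = {}
    for i, stage in enumerate(STAGES):
        # Stage latency = difference of adjacent markers, reported in ms.
        deltas = sorted((s[i + 1] - s[i]) * 1000.0 for s in samples)
        cuts = quantiles(deltas, n=100, method="inclusive")  # cuts[k-1] ~ Pk
        report[stage] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return report
```

Reporting per stage (not only end-to-end) is what makes the Queue term visible as the dominant P99 contributor under contention.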
Throughput-to-resource accounting (DoIP / OTA)
  • CPU budget: packet/session management, policy evaluation, crypto checks, and logging pipeline overhead.
  • Memory budget: session state, buffers, queue depths, retry windows, and log staging buffers.
  • Copy budget: copy count is a primary multiplier for CPU and memory bandwidth under OTA payloads.
  • Storage budget: write/flush policies can dominate tail latency; measure it explicitly as a stage.
Practical budgeting targets (placeholders)
  • Peak throughput: ≥ X Mbps for Y seconds
  • Sustained throughput: ≥ X Mbps for Y minutes
  • CPU headroom: peak CPU ≤ X% with P99 latency within budget
  • Memory headroom: peak memory ≤ X% with stable queue depths
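As a minimal sketch of how these placeholder targets become an automated accept/reject step, the check below compares one run's metrics against targets; the key names and concrete limits are illustrative stand-ins for the program's X/Y values.

```python
# Sketch: accept/reject one run against the placeholder budget targets above.
# Key names and thresholds are illustrative, not a fixed schema.
HIGHER_IS_BETTER = {"peak_mbps", "sustained_mbps"}

def check_headroom(metrics, targets):
    """metrics/targets: dicts sharing keys such as peak_mbps, peak_cpu_pct,
    peak_mem_pct, p99_ms. Returns (ok, sorted list of violated keys)."""
    violations = []
    for key, limit in targets.items():
        if key in HIGHER_IS_BETTER:
            if metrics[key] < limit:   # throughput must meet or exceed target
                violations.append(key)
        elif metrics[key] > limit:     # CPU/memory/latency must stay under cap
            violations.append(key)
    return (not violations, sorted(violations))
```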
Congestion control: backpressure, watermarks, and drop policy (preserve serviceability)
  • Traffic classes: Control / Diagnostics / OTA / Telemetry must not share a single “best-effort” queue.
  • Watermarks: define low/high thresholds per queue to trigger shaping and admission control.
  • Backpressure: reject new sessions, delay non-critical transfers, and rate-limit at the boundary.
  • Drop rules: drop-by-class and drop-by-session to prevent one client from starving critical services.
  • Containment: circuit breakers for floods; avoid retry/log storms that amplify congestion.
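The watermark/backpressure behavior can be illustrated with a single queue using high/low thresholds as hysteresis: admission stops at the high watermark and resumes only below the low watermark, so the system does not oscillate. The class name and thresholds are assumptions for illustration.

```python
# Sketch: high/low watermark hysteresis for admission control on one queue.
# Capacity and thresholds are illustrative placeholders.
class WatermarkQueue:
    def __init__(self, capacity=64, low=16, high=48):
        self.items, self.capacity = [], capacity
        self.low, self.high = low, high
        self.admitting = True          # backpressure flag seen by callers

    def offer(self, item):
        """Admit item unless backpressure is active or the queue is full."""
        if len(self.items) >= self.high:
            self.admitting = False     # assert backpressure at high watermark
        if not self.admitting or len(self.items) >= self.capacity:
            return False               # caller must shape, retry, or drop-by-class
        self.items.append(item)
        return True

    def take(self):
        item = self.items.pop(0)
        if len(self.items) <= self.low:
            self.admitting = True      # release backpressure below low watermark
        return item
```

In a real gateway each traffic class gets its own watermarked queue, so a telemetry flood trips backpressure on Q3 without touching the diagnostics queue.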
Policy table skeleton (engineering contract)
Traffic Class | Queue | Priority | Rate Limit | Drop Rule | Protection
Control | Q0 | Highest | Min guarantee + cap | Never drop by default | Admission control
Diagnostics | Q1 | High | Cap per session | Drop by session on flood | Breaker + audit
OTA | Q2 | Medium | Windowed throttle | Drop non-critical chunks | Resume + persist
Telemetry | Q3 | Low | Aggressive cap | Drop-first | No starvation
Pass criteria placeholders: Q0 starvation = 0; Q1 timeouts ≤ X/1k; Q2 throughput ≥ X with P99 within budget; breaker trips ≤ X/day.
Metrics definition (make results comparable)
  • Latency: P50/P95/P99 per stage and end-to-end, plus tail-spike counts (above X ms).
  • Throughput: sustained (Y minutes) and peak (Z seconds) with resource headroom recorded.
  • Sessions: max concurrent sessions with handshake success and timeout rate thresholds.
  • Congestion: queue depth distribution, drop rate, and backpressure trigger counts.
  • Stability: degrade/recover counts and recovery time after congestion or stress.
Measurement rule: every pass criterion must reference a measurement method (tool, window, sampling, and normalization).
Pass criteria placeholders (acceptance contract)
  • Latency: end-to-end P99 ≤ X ms; stage-level P99 ≤ X ms (Queue must remain bounded).
  • Sessions: max concurrent sessions ≥ X with timeout rate ≤ Y% over Z minutes.
  • Throughput: peak ≥ X Mbps and sustained ≥ X Mbps while CPU ≤ Y% and memory ≤ Z%.
  • Loss/timeout: drop ≤ X/1k and timeout ≤ X/1k per traffic class.
  • Recovery: congestion recovery ≤ X s and no mode oscillation above X/hour.
Evidence fields (minimum): run_id, workload_id, timestamps, stage metrics, queue stats, session stats, resource stats, config_version.
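The minimum evidence fields listed above lend themselves to a mechanical completeness check before a run is accepted. The record layout below is an assumption; the field names mirror the text.

```python
# Sketch: completeness check for the minimum evidence fields listed above.
# The dict-based record layout is an assumption for illustration.
REQUIRED = {"run_id", "workload_id", "timestamps", "stage_metrics",
            "queue_stats", "session_stats", "resource_stats", "config_version"}

def missing_evidence_fields(record):
    """Return the required keys that are absent (or None) in a record,
    sorted so reports are stable and comparable across runs."""
    return sorted(k for k in REQUIRED if record.get(k) is None)
```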
Output artifact: performance budget template (stage / metric / target / method / margin)
Stage | Metric | Target | Measurement Method | Margin
Ingress | P99 latency | ≤ X ms | timestamp t0→t1 (fixed window) | X%
Classify | P99 latency | ≤ X ms | timestamp t1→t2 (rule eval) | X%
Queue | P99 wait | ≤ X ms | queue timestamp t2→t3 | X%
Process | CPU / copies | ≤ X% / ≤ X | profiling + counters | X%
Egress | P99 latency / drop | ≤ X ms / ≤ X | timestamp t4→t5 + stats | X%
Budget waterfall (stage bars + margin)
Total budget = Ingress + Classify + Queue + Process + Egress + Margin (all P99 terms). Reserve margin for worst-case contention, logging spikes, and recovery actions. Rule: budget by stage, validate by workload, accept by pass criteria (P99 + sessions + throughput + loss).

Engineering Checklist: Design → Bring-up → Production

Gate-based checklists turn complex gateway/TCU programs into repeatable execution. Each gate defines required inputs, checks, evidence artifacts, and pass criteria placeholders so cross-team delivery remains consistent from first bring-up to production.

Design Gate (freeze boundaries and contracts)
  • Inputs: architecture boundary, policy tables, minimum log fields, key/cert policy, state machines, performance budget.
  • Checks: deny-by-default rules, explicit exceptions, audit fields completeness, persistence points, budget measurability.
  • Outputs: frozen config_version / rule_pack_version, acceptance draft (pass criteria placeholders).
  • Evidence: review record, signed tables, baseline workload definition (workload_id).
Pass criteria placeholders: boundary completeness = 100%, policy coverage = 100%, audit fields present in all flows, budgets defined for P99.
Bring-up Gate (stability, stress, and recovery)
  • Inputs: test tools, logging pipeline, workload profiles, congestion tests, fault-injection matrix.
  • Checks: session stability, pressure and congestion behavior, P99 within budget, recovery actions verified end-to-end.
  • Outputs: baseline performance report (run_id), known-good config, validated degraded-mode behavior.
  • Evidence: P99 report, queue stats report, fault-injection report, recovery timing report.
Pass criteria placeholders: max sessions ≥ X, P99 latency ≤ X, timeouts ≤ X/1k, recovery ≤ X s, evidence artifacts complete.
Production Gate (consistency and traceability)
  • Inputs: release bundle, cert injection flow, station self-test, traceability schema.
  • Checks: config/cert version alignment, self-test coverage, consistent pass criteria validation, audit attribution fields.
  • Outputs: trace bundle (device_id, config_version, cert_version), factory pass report.
  • Evidence: station logs, regression comparison report, sampling audit report.
Pass criteria placeholders: version match = 100%, self-test pass = 100%, trace fields present, production drift ≤ X per batch.
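The three gates above share one evaluation shape: a gate passes only when every check is true and every evidence slot is filled. The sketch below models that contract; the gate and field names are illustrative, not a fixed schema.

```python
# Sketch: gate evaluation as an explicit data check, mirroring the
# inputs → checks → outputs + evidence structure above.
def gate_passes(gate):
    """gate: {"checks": {name: bool}, "evidence": {name: artifact-or-None}}.
    Passes only if every check is True and every evidence slot is filled."""
    checks_ok = all(gate["checks"].values())
    evidence_ok = all(v is not None for v in gate["evidence"].values())
    return checks_ok and evidence_ok

def first_blocked_gate(gates):
    """gates: ordered list of (name, gate). Returns the first failing gate
    name, or None when the program may proceed to production."""
    for name, gate in gates:
        if not gate_passes(gate):
            return name
    return None
```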
Checklist table (items + owner + method + pass criteria + evidence)
Gate | Item | Owner | Method | Pass Criteria | Evidence
Design | Boundary + policy tables complete | System / Security | Review + diff | Coverage = 100% | Signed tables
Bring-up | P99 within budget under stress | Test | Load test | P99 ≤ X ms | run_id report
Bring-up | Congestion protections verified | System | Queue stats | No starvation | queue report
Production | Version + cert alignment | Factory | Station self-test | Match = 100% | station logs
3-gate flow (Inputs → Checks → Outputs + Evidence)
Rule: each gate has explicit inputs, checks, outputs, and evidence with pass criteria placeholders.

H2-11 · Applications: Diagnostics / Gateway / TCU Patterns

This section maps common system shapes to practical constraints, required modules, and serviceability expectations. Each pattern keeps boundaries clear: it focuses on gateway/TCU system integration (bridging, DoIP/OTA, security policy, logging), and avoids PHY-level deep dives that belong to sibling pages.

Pattern → Constraints → Modules → Verification (includes example material numbers)
A) Central Gateway (centralized)
  • Typical scene: Multiple CAN/LIN domains converge to one gateway; DoIP diagnostics and OTA coordination are centralized.
  • Key constraints: High session concurrency, queue isolation (diagnostics vs control vs logging), strict fault containment, predictable P99 latency.
  • Serviceability minimum: session-id, tester-id, target ECU, service id, result code, duration, drop reason, policy decision.
  • Related sibling pages to link: CAN FD transceiver / Selective wake / Ethernet PHY & switch / EMC & port protection.
Example BOM candidates (material numbers)
  • Gateway compute / SoC: NXP S32G274AABK0CUCT, Renesas R8A779F0, Infineon SAK-TC397XX-256F300S-BD
  • Automotive Ethernet switch: NXP SJA1105TEL, NXP SJA1105EL
  • Automotive Ethernet PHY: NXP TJA1100, TI DP83TC811R-Q1
  • CAN FD transceiver / controller: TI TCAN1044-Q1, TI TCAN4550
  • LIN transceiver: TI TLIN1029-Q1, TI TLIN1021-Q1
  • Secure element / TPM: NXP SE050A2HQ1/Z01SHZ, Infineon OPTIGA TPM SLB 9672 FW16
  • Safety PMIC / SBC: NXP MFS2633HMBA0AD, Infineon TLF35584QVVS1
  • Port ESD (recommended-for-new): Nexperia PESD2ETH100T-Q, Nexperia PESD2CANFD24LT-Q
B) Zonal Gateway + Ethernet Backbone
  • Typical scene: Many LIN/CAN nodes aggregated per zone; zonal gateways uplink to an Ethernet backbone; DoIP/OTA policy is shared or centralized.
  • Key constraints: Broadcast storm containment, per-zone rate limiting, deterministic forwarding under congestion, isolation of “noisy” zones.
  • Pitfall to guard: Unbounded retries across zones turning into global queue collapse; missing per-zone “circuit breaker”.
  • Related sibling pages to link: Selective wake / SBC with CAN/LIN / CAN FD transceiver / Ethernet PHY & switch.
Example BOM candidates (material numbers)
  • Zonal SBC / CAN: NXP UJA1169A, TI TCAN4550
  • CAN FD transceiver: TI TCAN1044-Q1
  • Ethernet backbone switch: NXP SJA1105TEL
  • Ethernet PHY: NXP TJA1100, TI DP83TC811R-Q1
  • Power & safety monitor: NXP MFS2633HMBA0AD, Infineon TLF35584QVVS1
C) TCU-as-Gateway (TCU terminates external link + policies)
  • Typical scene: TCU is the external termination point (TLS/VPN), and also enforces gateway policies for DoIP/OTA.
  • Key constraints: Trust chain clarity (boot→keys→runtime policy), strong logging/forensics, robust rollback, strict separation between external and in-vehicle domains.
  • Pitfall to guard: “Security termination” placed too deep (policy after bridging), causing untrusted traffic to consume internal queues.
  • Related sibling pages to link: Secure gateway / selective wake / DoIP diagnostics / Ethernet PHY.
Example BOM candidates (material numbers)
  • Compute: NXP S32G274AABK0CUCT, Renesas R8A779F0
  • External trust anchor: Infineon OPTIGA TPM SLB 9672 FW16, NXP SE050A2HQ1/Z01SHZ
  • In-vehicle networking: NXP SJA1105TEL, TI TCAN1044-Q1, TI TLIN1029-Q1
  • Safety PMIC: Infineon TLF35584QVVS1
D) Service Tool / Factory Mode (diagnostics throughput + traceability)
  • Typical scene: Factory flashing, EOL test, service station diagnostics, controlled bypass policies, strong traceability fields.
  • Key constraints: High throughput, strict access levels, deterministic test time, audit-ready logs (who/what/when/result).
  • Pitfall to guard: “Test-only” backdoors leaking into field images; missing policy attestation tags.
  • Related sibling pages to link: DoIP diagnostics / OTA lifecycle / security policy enforcement.
Example BOM candidates (material numbers)
  • Compute / safety MCU option: Infineon SAK-TC397XX-256F300S-BD
  • DoIP-facing Ethernet PHY: TI DP83TC811R-Q1
  • CAN FD access: TI TCAN4550
  • Trust anchor for station auth: Infineon OPTIGA TPM SLB 9672 FW16
Figure 11 — Four system patterns (mobile-stacked)
Four stacked block-diagram patterns: A) Central Gateway (zone CAN/LIN buses into a policy/routing core with queues and rate limiting, facing the DoIP tester and cloud/OTA), B) Zonal Gateway + Ethernet backbone (zonal GWs with per-zone backpressure uplinked through a QoS switch to central DoIP/OTA/logging services), C) TCU-as-Gateway (TLS/VPN terminates at the TCU policy core with trust, logging, and limiting before the in-vehicle buses), and D) Factory/Service mode (service tool → gateway auth + audit log → target ECUs/flash).

Use the pattern choice to drive: (1) queue isolation boundaries, (2) rate-limit placement, (3) trust termination location, and (4) minimum logs for field diagnosis.

H2-12 · IC Selection Logic: What to Choose and Why (with material numbers)

Selection is driven by policy boundaries and serviceability requirements first, then by compute/network/security/power modules. Material numbers below are illustrative references; always confirm AEC grade, package, longevity, and safety documentation for the target program.

1) Gateway compute (MCU/SoC)
  • Choose by: session concurrency, copy budget (DMA/zero-copy), security acceleration, real-time partitioning, safety concept (ASIL targets), and I/O count.
  • Validation hook: CPU headroom at peak DoIP + OTA + logging; verify P99 latency with congestion + encryption enabled.
  • Example parts: NXP S32G274AABK0CUCT, Renesas R8A779F0, Renesas R8A779G0, Infineon SAK-TC397XX-256F300S-BD, TI TDA4VM-Q1
2) In-vehicle networking (Ethernet / CAN / LIN / FlexRay)
  • Ethernet topology: port count, QoS/TSN needs, mirroring for diagnostics, and storm control placement.
  • Bus access strategy: discrete transceivers vs integrated SBC/controller; decide by wake policy, SPI bandwidth, and failure isolation.
  • Example parts — Switch: NXP SJA1105TEL, NXP SJA1105EL; Ethernet PHY: NXP TJA1100, TI DP83TC811R-Q1; CAN FD transceiver/controller: TI TCAN1044-Q1, TI TCAN4550; LIN transceiver: TI TLIN1029-Q1, TI TLIN1021-Q1; FlexRay (if required): NXP TJA1080A, Infineon TLE9221SX
3) Security (trust anchor + policy enforcement)
  • Choose by: root-of-trust availability, key storage capacity, crypto throughput, update/rotation method, and auditability.
  • Validation hook: attested boot chain + policy versioning + log integrity (tamper evidence).
  • Example parts: NXP SE050A2HQ1/Z01SHZ, Infineon OPTIGA TPM SLB 9672 FW16
4) Power, reset, and safety monitoring
  • Choose by: rail count, fail-safe outputs, watchdog concept, wake sources, and required safety diagnostics coverage.
  • Validation hook: brownout/ignition cranking recovery + OTA power-loss recovery + watchdog-induced safe state.
  • Example parts: NXP MFS2633HMBA0AD, Infineon TLF35584QVVS1
5) Port protection (keep SI while meeting ESD)
  • Choose by: capacitance budget, surge model assumptions, and placement feasibility (return path length dominates).
  • Validation hook: post-ESD link stability + insertion loss / reflection sanity checks on the real harness.
  • Example parts (recommended-for-new): Nexperia PESD2ETH100T-Q, Nexperia PESD2CANFD24LT-Q, Nexperia PESD2CANFD36UU-Q
Figure 12 — Decision flow (scenario → module combo → verification)
Four-lane flow showing how each scenario (Central Gateway, Zonal Gateway, TCU-as-Gateway, Factory/Service) maps to a module combination, its critical specs, and its verification hooks for gateway and TCU systems.

Practical acceptance placeholders (replace X/Y with program targets): P99 latency, max concurrent sessions, peak OTA throughput, drop/timeout rate, and post-ESD stability.


H2-13 · FAQs (Diagnostics / Gateway / TCU)

Long-tail troubleshooting only. Each answer is a fixed 4-line engineering path: Likely cause → Quick check → Fix → Pass criteria (threshold placeholder X).

DoIP connects, but diagnostic services intermittently time out — check concurrency or queue starvation first?
Likely cause: session concurrency spikes and control/telemetry traffic steals queue time, starving diagnostic request/response processing.
Quick check: correlate timeouts with active_sessions, queue watermarks, and per-class drop/latency counters at ingress→classify→egress.
Fix: reserve a diagnostics class (queue + minimum service rate) and enforce per-source rate limiting; cap non-diagnostic bursts via backpressure/circuit-breaker.
Pass criteria: timeout rate ≤ X/1k requests and P99 diag latency ≤ X ms at X concurrent sessions; diag queue watermark < X%.
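The reserved-diagnostics-class fix can be illustrated with a quantum-based scheduler: each round, every class gets its configured quantum of service, so a telemetry flood cannot reduce the diagnostics share below its guarantee. Class names and weights are illustrative assumptions.

```python
# Sketch: quantum-per-round scheduling that guarantees the diagnostics class
# a minimum service rate even when another class floods. Weights are examples.
def schedule(queues, weights, rounds):
    """queues: {cls: list of pending items}; weights: {cls: quantum per round}.
    Returns the number of items served per class after `rounds` rounds."""
    served = {cls: 0 for cls in queues}
    for _ in range(rounds):
        for cls, quantum in weights.items():
            for _ in range(quantum):       # serve at most `quantum` items
                if queues[cls]:
                    queues[cls].pop(0)
                    served[cls] += 1
    return served
```

With weight 2 for diagnostics and 1 for telemetry, diagnostics is served twice as fast regardless of how deep the telemetry backlog grows.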
OTA download completes, but installation fails — missing persisted fields or dependency matrix mismatch?
Likely cause: install step lacks required persisted state (resume tokens, verified hash, selected slot) or dependency rules reject the target set.
Quick check: inspect state store for required fields per step (campaign_id, artifact_id, hash_ok, slot, precheck_result) and compare actual ECU versions to the declared dependency matrix.
Fix: make install preconditions explicit and persisted; validate dependencies before install; emit a single “reject reason code” for serviceability.
Pass criteria: install success ≥ X% across X cycles; dependency rejects always carry reason_code; no “missing_field” failures in logs over X installs.
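A minimal sketch of the explicit-precondition fix: required persisted fields are named once, checked before install, and every rejection carries a single reason code. Field names follow the quick-check above; the exact reason-code strings are assumptions.

```python
# Sketch: explicit, persisted install preconditions with a single reject
# reason code. Reason-code strings are illustrative assumptions.
REQUIRED_STATE = ("campaign_id", "artifact_id", "hash_ok", "slot", "precheck_result")

def install_precheck(state, installed_versions, dependency_matrix):
    """Return (True, None) when install may start, else (False, reason_code)."""
    for field in REQUIRED_STATE:
        if state.get(field) is None:
            return False, f"missing_field:{field}"
    if state["hash_ok"] is not True:
        return False, "hash_mismatch"
    for ecu, needed in dependency_matrix.items():
        if installed_versions.get(ecu) != needed:
            return False, f"dependency_mismatch:{ecu}"
    return True, None
```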
After update, diagnostics wake-up intermittently fails — certificate/policy rotation or routing table version mismatch?
Likely cause: policy or certificate set is updated without synchronized versioning across wake filters and routing, causing valid wake frames to be dropped.
Quick check: compare policy_version, cert_bundle_version, and route_table_version in the wake event log; verify rejection reason (auth vs filter vs route-miss).
Fix: enforce atomic rollout (version bundle + monotonic policy) and add compatibility window; fail closed on external access but fail open for local service-mode (if allowed by safety concept).
Pass criteria: wake success ≥ X% within X s; all wake drops tagged with reason_code; version skew events = 0 over X days.
Stress test OK, but in-vehicle hot / low-voltage resets increase — watchdog source or power policy?
Likely cause: watchdog triggers due to scheduler stalls under thermal throttling or brownout/reset policy causes repeated reboot loops.
Quick check: read reset_reason histogram (wdt/brownout/thermal/exception) and correlate with temperature and supply monitors; check whether queue deadlocks precede WDT.
Fix: add health monitor + staged degraded modes (shed non-critical traffic, preserve diag) and tune power-up/down policy; ensure watchdog window matches worst-case stall budget.
Pass criteria: unexpected resets ≤ X/24h at X°C and low-voltage profiles; recovery time ≤ X s; reset_reason always populated.
IDS generates too many alerts and hurts availability — tune thresholds first or build whitelist/baseline first?
Likely cause: thresholds are applied without traffic baseline per mode (factory/service/road), causing normal bursts to be flagged as anomalies.
Quick check: split alerts by mode, source, service type, and rate bucket; compare against baseline percentiles (P50/P95/P99) rather than raw averages.
Fix: establish whitelist + per-mode baselines, then set thresholds to target false-positive rate; add alert rate limiting so IDS cannot starve core routing.
Pass criteria: false-positive ≤ X/hour and IDS CPU ≤ X%; alert storms capped to ≤ X/min without impacting P99 routing latency.
Diagnostic write is denied but read works — permission tiering or security access unlock flow?
Likely cause: access control differentiates read vs write (role/level), or the unlock step is missing/expired for this session/policy version.
Quick check: log the decision tuple: (session_id, tester_id, requested_service, access_level, policy_version, deny_reason); verify unlock state and TTL for that ECU/service group.
Fix: define explicit levels (read / write / flash) and bind unlock to session + target ECU; make denies deterministic with a single reason code and guidance.
Pass criteria: authorized writes succeed ≥ X%; unauthorized writes always denied with deny_reason; no ambiguous denies (reason missing) over X attempts.
With multi-ECU diagnostics, one ECU gets “starved” — how to adjust queue classification and rate limiting?
Likely cause: queues are keyed only by service type (diag) and not by ECU/tenant, so a “noisy” ECU consumes the shared budget.
Quick check: break down throughput and drops per ECU address (logical/physical) and per-source tester; verify fairness metrics (per-ECU share, queue wait time).
Fix: add per-ECU (or per-tenant) sub-queues with weighted fairness; apply token-bucket limits per ECU and per tester to prevent domination.
Pass criteria: minimum per-ECU service rate ≥ X req/s; starvation time ≤ X ms; fairness index ≥ X during X-ECU load.
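The per-ECU token-bucket fix can be sketched as follows: each ECU gets its own bucket created on first request, so a noisy ECU exhausts only its own budget. Rates, capacities, and the `admit` helper name are illustrative assumptions.

```python
# Sketch: per-ECU token buckets so one noisy ECU cannot dominate the shared
# diagnostics budget. Rates and capacities are illustrative placeholders.
class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), 0.0

    def allow(self, now):
        """Refill by elapsed time, then spend one token if available."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets = {}  # ecu_id -> TokenBucket, created lazily on first request

def admit(ecu_id, now, rate=10.0, capacity=5):
    bucket = buckets.setdefault(ecu_id, TokenBucket(rate, capacity))
    return bucket.allow(now)
```

The same structure applies per tester: key the bucket map by (tester_id, ecu_id) to bound both dimensions.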
OTA repeats installation after power-loss recovery — where is idempotency missing?
Likely cause: state machine transitions are not atomic; the “completed” marker is not persisted (or persisted after side effects), so reboot re-enters install.
Quick check: inspect persisted state ordering: (downloaded→verified→installed→activated→confirmed) and confirm each step writes a monotonic marker before executing irreversible actions.
Fix: implement idempotent guards (operation_id + step checkpoint), use monotonic state with write-ahead markers, and ensure activation/confirm are strictly separated.
Pass criteria: after forced power loss at any step, resume never regresses more than X state; duplicate installs = 0 over X fault-injection runs.
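The monotonic-marker fix can be sketched with a checkpointed step runner: a completed step is never re-run, and resume after reboot starts exactly at the first unfinished step. Persistence is modeled as a dict here; a real unit would back it with flash.

```python
# Sketch: monotonic OTA steps with write-ahead intent and completion markers,
# so resume after power loss never re-runs a checkpointed step. The dict-based
# store stands in for flash-backed persistence.
STEPS = ["downloaded", "verified", "installed", "activated", "confirmed"]

def resume_point(store):
    """Index of the next step to run after a reboot."""
    return store.get("checkpoint", -1) + 1

def run_step(store, index, action):
    """Execute step `index` at most once; returns True if work actually ran."""
    if index <= store.get("checkpoint", -1):
        return False               # idempotent guard: already completed
    store["intent"] = index        # write-ahead marker: step is starting
    action()                       # the step body must itself be re-runnable
    store["checkpoint"] = index    # monotonic completion marker
    return True
```

If power is lost mid-step, `intent` exceeds `checkpoint` on reboot, signaling exactly which step to re-run; activation and confirm stay separate steps so each gets its own marker.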
Enabling encryption drops throughput a lot — copy count or crypto offload capability first?
Likely cause: extra memory copies dominate, or crypto runs on CPU without acceleration, turning throughput into a compute-bound pipeline.
Quick check: measure bytes-copied per payload, CPU cycles per byte, and DMA usage; compare “crypto on/off” with identical queue/traffic to isolate overhead source.
Fix: reduce copies (zero-copy buffers, scatter/gather), enable acceleration/offload where available, and pre-size queues to avoid encryption-induced backpressure.
Pass criteria: sustained throughput ≥ X Mbps with encryption; CPU ≤ X%; copy count ≤ X per payload; P99 latency ≤ X ms.
Unstable external link slows in-vehicle buses — how to set isolation domains and circuit breakers?
Likely cause: external retries and reconnect storms consume shared CPU/queues, leaking failure across domains due to missing isolation and breaker policy.
Quick check: correlate external link state changes with internal queue growth, CPU spikes, and diag/control latency; confirm whether domain boundaries share the same limiter/worker pool.
Fix: isolate external domain (separate queues + worker budget), apply circuit breaker on repeated failures, and degrade gracefully (keep control/diag alive while external is throttled).
Pass criteria: external flaps do not change internal P99 latency by more than X%; breaker trips within X failures; internal diag/control timeout rate ≤ X/1k.
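A minimal sketch of the breaker policy: after N consecutive failures the external domain is shed for a cooldown window, so reconnect storms stop consuming shared CPU and queues. The threshold and cooldown values are illustrative.

```python
# Sketch: failure-count circuit breaker isolating the external link so
# reconnect storms cannot leak into internal domains. Thresholds are examples.
class CircuitBreaker:
    def __init__(self, trip_after=3, cooldown=30.0):
        self.trip_after, self.cooldown = trip_after, cooldown
        self.failures, self.open_until = 0, 0.0

    def allow(self, now):
        """External work is permitted only while the breaker is closed."""
        return now >= self.open_until

    def record(self, success, now):
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.trip_after:
                self.open_until = now + self.cooldown  # shed external work
                self.failures = 0
```

While the breaker is open, control and diagnostics keep their full worker budget; external retries resume only after the cooldown elapses.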
Field complaint “intermittent and not reproducible” — which minimum black-box fields are missing?
Likely cause: missing correlation keys prevent joining events across planes (session, policy decision, route version, queue state, reset reasons), so failures cannot be reconstructed.
Quick check: verify whether each failure log has: timestamp, session_id, tester_id, ecu_id, service_id, result/deny_reason, policy_version, route_table_version, queue_watermark, reset_reason (if any).
Fix: enforce a minimal schema and unique run_id per diagnostic/OTA operation; add sampling for high-rate counters and keep last-N ring buffer for pre-failure context.
Pass criteria: ≥ X% of incidents are explainable within X minutes using logs alone; missing-required-field rate ≤ X%.
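The minimal-schema plus last-N ring buffer fix can be sketched directly: every event is checked against the required fields, and gaps are recorded so the missing-field rate is measurable. The field list mirrors the quick-check above; the `BlackBox` class itself is an assumption.

```python
# Sketch: last-N ring buffer plus required-field check for black-box logs.
# The field list mirrors the quick-check above; the class is illustrative.
from collections import deque

REQUIRED_FIELDS = ("timestamp", "session_id", "tester_id", "ecu_id",
                   "service_id", "result", "policy_version",
                   "route_table_version", "queue_watermark")

class BlackBox:
    def __init__(self, last_n=256):
        self.ring = deque(maxlen=last_n)   # bounded pre-failure context

    def record(self, event):
        """Store the event and return the list of missing required fields."""
        missing = [f for f in REQUIRED_FIELDS if f not in event]
        self.ring.append(dict(event, missing_fields=missing))
        return missing
```

On an incident, the exported ring holds the last N events with their gaps tagged, so "intermittent and not reproducible" becomes a join over correlation keys instead of guesswork.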
Factory flashing passes, but service fails — which injection step (cert/version) is inconsistent?
Likely cause: factory and service pipelines use different key/cert bundles, policy versions, or metadata stamping, causing auth/compatibility mismatch in the field.
Quick check: compare artifacts and stamps: firmware version, policy_version, cert_bundle_id, provisioning_profile, and signing chain across factory vs service; confirm that logs report the active bundle IDs.
Fix: unify provisioning profiles, make injection steps deterministic and audited, and add a “configuration attestation” record that must match before allowing service operations.
Pass criteria: provisioning mismatches = 0 over X vehicles; service auth success ≥ X%; every unit exports attestation with bundle IDs and version stamps.