Temperature and aging don’t “randomly” break in-vehicle fieldbuses—they systematically consume timing margin, shift thresholds, and trigger thermal protection behaviors.
This page shows how to define the right temperature metrics, predict the dominant drift terms, design stable recovery (no retry storms), and validate with repeatable sweeps and black-box logs.
H2-1 · Scope, Definitions & Temperature Vocabulary
Why this chapter exists
Temperature discussions fail when the temperature “name” is ambiguous. Lock the vocabulary first (Ta/Tc/Tj, θJA/θJC, drift vs tolerance),
then every later section becomes measurable and repeatable.
Scope guard (anti-overlap)
Do: define temperature references, measurement points, and conversion workflow.
Do not: expand into EMC/ESD, wake policies, protocol timing, or functional safety trees.
Outcome: later chapters can state pass/fail criteria without vocabulary drift.
Core temperature references (use one name at a time)
Ta (Ambient)
System/environment reference. Affects airflow, enclosure, and nearby heat sources.
Tc (Case)
Package surface reference. Useful for comparing builds only when the measurement point is controlled.
Tj (Junction)
Silicon reference. Most directly related to thermal shutdown and long-term aging risk.
θJA depends heavily on board, copper, airflow, and neighbors. θJC is more stable but still requires a defined case point.
ΔT & thermal network
Treat heat flow as a path (junction → package → board → chassis/air). A single θ number is never the whole story.
Drift vs tolerance
Tolerance: part-to-part spread at a fixed condition. Drift: the same part changes with temperature/time.
Mission profile
Real vehicles run cycles: cold start, warm-up, peaks, hot soak, cool-down. Aging depends on time-at-temp and cycling.
Minimal workflow (turn vocabulary into an engineering action)
Name the temperature reference explicitly: Ta, Tc, or Tj.
Lock the measurement point (sensor location) and measurement method (contact vs estimate vs internal).
Build a heat-path assumption: power P + thermal path (θ segments) → estimated Tj.
Run a sanity check: if the estimate is implausible, the model is wrong (not the silicon).
Map the vocabulary to later pass/fail metrics: thermal trip, recovery time, drift budget, and aging exposure.
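The workflow above can be sketched in a few lines. The θ value, power figure, and the 175 °C plausibility bound below are illustrative placeholders, not datasheet values:

```python
# Sketch of the workflow: name the reference, estimate Tj from power and a
# theta segment, and sanity-check plausibility. The theta, power, and the
# 175 C absolute bound are illustrative placeholders, not datasheet values.

def estimate_tj(t_ref_c, power_w, theta_c_per_w):
    """Tj = Tref + P * theta; pair Ta with thetaJA or Tc with thetaJC."""
    return t_ref_c + power_w * theta_c_per_w

def plausible(tj_est_c, t_ref_c, tj_abs_max_c=175.0):
    """Step 4: an implausible estimate means the model is wrong, not the silicon."""
    # Junction cannot sit below its own reference under positive dissipation,
    # and an estimate far beyond absolute maximum points at a broken model.
    return t_ref_c <= tj_est_c <= tj_abs_max_c

tj = estimate_tj(t_ref_c=85.0, power_w=0.6, theta_c_per_w=45.0)  # Ta + P*thetaJA
print(round(tj, 1), plausible(tj, 85.0))  # prints: 112.0 True
```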
Diagram · Temperature reference map (Ta → Tc → Tj)
H2-2 · Automotive Temperature Grades & Mission Profiles
Practical framing
A temperature “grade” is a capability label. A mission profile is the real workload. Reliability and aging depend on
time-at-temperature and thermal cycling, not a single worst-case number.
Commonly used AEC-Q100 grades (engineering shorthand)
Grade 0
Typical range: -40°C to +150°C. Use for the hottest zones and worst-case hot-soak exposure.
Grade 1
Typical range: -40°C to +125°C. Common for many ECUs with moderate under-hood thermal coupling.
Grade 2
Typical range: -40°C to +105°C. Common for cabin/body domain with controlled airflow and lower heat density.
Grade 3
Typical range: -40°C to +85°C. Use only when placement and mission profile keep silicon away from hot soak and dense heat sources.
Selection rule of thumb (avoid under-spec)
Match the grade to the mission profile peak + hot soak, not just ambient air.
Budget for board-level ΔT from nearby power devices and copper constraints.
Aging risk scales with time-at-high-Tj and cycling amplitude; design for margins, not optimism.
Mission profile template (copy-and-fill)
Use a profile to connect “grade” to real exposure. Keep it minimal but complete enough to feed validation and drift budgets.
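A profile can feed a time-at-temperature calculation directly. This sketch assumes a simple Arrhenius acceleration model; the 0.7 eV activation energy and the 55 °C use reference are common illustrative assumptions, not program values:

```python
import math

# Sketch: weight each mission-profile segment's time-at-temperature by an
# Arrhenius acceleration factor. The 0.7 eV activation energy and 55 C use
# reference are illustrative assumptions, not program values.

BOLTZMANN_EV_PER_K = 8.617e-5

def arrhenius_af(tj_c, t_use_c=55.0, ea_ev=0.7):
    """Acceleration of aging at tj_c relative to the use-reference temperature."""
    t_use_k, tj_k = t_use_c + 273.15, tj_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / tj_k))

def equivalent_hours(profile):
    """Accelerated-equivalent exposure for a list of (Tj C, hours) segments."""
    return sum(hours * arrhenius_af(tj_c) for tj_c, hours in profile)

# One day: short hot-soak peak, warm-up/operation, parked/cool.
profile = [(110.0, 1.0), (85.0, 6.0), (40.0, 17.0)]
print(round(equivalent_hours(profile), 1))
```

Note how the single hot-soak hour dominates the equivalent exposure: aging budgets are driven by the hot tail of the profile, not the average temperature.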
H2-3 · Drift-Sensitive Parameters & Fast Checks
Intent
Convert field symptoms (hot CRC spikes, cold-start link failure, error counters rising after warm-up) into a measurable parameter shortlist.
This chapter focuses on what drifts and how to measure it, not on device-specific register recipes.
Do / Don’t (anti-overlap)
Do: list drift-sensitive parameters and the fastest measurement method for each.
Do not: provide device register tuning, protocol-specific timing configuration, or EMC/ESD component selection.
Use later chapters for: timing budgets (H2-4), recovery strategies (H2-7), validation matrices (H2-9).
TX driver (output levels & drive)
What drifts: VOD / dominant level margin, drive strength, rise/fall time, symmetry.
Fast check: measure dominant/recessive levels + edge times at cold/room/hot; compare symmetry.
Log: error counters vs temperature point; any TX timeout/thermal events.
RX front-end (threshold & fail-safe)
What drifts: input threshold, hysteresis, fail-safe decision margin.
Fast check: validate RX decoding at minimum edge rate and worst-case common-mode near limits.
Log: RX framing/bit errors, dominant/recessive mis-detect events.
Common-mode tolerance (CM window)
What drifts: effective common-mode headroom under temperature and load.
Fast check: force controlled ground offset/common-mode shift in lab across temperature points.
Log: error rate vs offset and temperature, identify the boundary condition.
Timing (prop delay & loop delay symmetry)
What drifts: TX→bus and bus→RX delays; forward/reverse symmetry (skew).
Fast check: measure round-trip delay at cold/room/hot; track asymmetry vs node count.
Log: CRC/errors vs measured delay shift and harness configuration.
Slew rate (edge shaping)
What drifts: effective rise/fall time and edge symmetry under temperature and load.
Fast check: compare edge times at identical bus load across temperature points.
Log: sampling-related errors vs edge-time changes (correlation).
Oscillator / clock (frequency error & start-up)
What drifts: frequency offset vs temperature, start-up stability, divider rounding sensitivity.
Fast check: measure frequency error at cold/room/hot; verify start-up at cold crank conditions.
Log: link-up time vs temperature, retries required, and any clock fault indicators.
Diagnostics thresholds (fault detect edges)
What drifts: diagnostic comparators and internal thresholds, affecting false alarms vs missed faults.
Fast check: sweep temperature while applying controlled fault stimuli at safe levels; observe detect consistency.
Log: fault event counters vs temperature; flag any temperature-correlated mis-detection.
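The cold/room/hot fast checks above share one pattern: measure at three points, compute the worst-case shift from the room baseline, and compare it against a drift budget. A minimal sketch, with placeholder parameter names and budgets:

```python
# Sketch of the shared fast-check pattern: three temperature points per
# parameter, worst-case shift vs the room baseline, compared to a drift
# budget. Parameter names and budgets below are placeholders.

def drift(points):
    """Worst-case |cold/hot - room| shift, in the parameter's own units."""
    base = points["room"]
    return max(abs(points["cold"] - base), abs(points["hot"] - base))

measurements = {
    # parameter: (three-point values, allowed drift budget)
    "loop_delay_ns": ({"cold": 232.0, "room": 225.0, "hot": 241.0}, 20.0),
    "rise_time_ns":  ({"cold": 48.0,  "room": 45.0,  "hot": 55.0},   8.0),
    "osc_error_ppm": ({"cold": -180.0, "room": -20.0, "hot": 150.0}, 300.0),
}

for name, (points, budget) in measurements.items():
    d = drift(points)
    print(f"{name}: drift={d:g} budget={budget:g} "
          f"{'PASS' if d <= budget else 'FAIL'}")
```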
Diagram · Drift map (parameters → symptoms)
H2-4 · Timing Margin vs Temperature
Intent
Explain why “scope looks fine” can still fail: the usable sampling window is a budget. Temperature shifts delay, symmetry, clock error,
and edge position — progressively consuming margin until CRC and error frames spike.
Do / Don’t (keep it general, not a CAN FD-only page)
Do: build a timing budget template and show which terms are temperature-sensitive.
Do not: give protocol-specific register/segment configuration recipes.
Goal: identify the dominant term and verify it by measurement.
Timing margin budget template (copy-and-fill)
1) Harness propagation
What it is: cable/harness delay and reflections occupy a fixed portion of the window.
Measure: confirm harness length/topology; characterize delay on the real harness when possible.
Note: temperature usually changes this term less than the silicon terms, but it sets the baseline.
2) Transceiver delays
What it is: TX→bus and bus→RX propagation, plus loop delay symmetry (skew).
Measure: round-trip delay at cold/room/hot; track asymmetry across nodes/vendors.
Temp sensitivity: often a dominant drift term under high speed or heavy loading.
3) Controller clock error
What it is: oscillator frequency error and divider quantization affect sampling alignment.
Measure: frequency error vs temperature; verify start-up and stability at cold conditions.
Aging link: long-term drift and temperature cycling can shift this over product life.
4) Edge/slew contribution
What it is: slower or asymmetric edges shrink the stable region and move the effective sampling boundary.
Measure: rise/fall time vs temperature at identical bus load; correlate with CRC/error counters.
Risk: scope “looks acceptable” while the stable window is already marginal.
5) Temperature drift reserve
What it is: reserved margin for combined temperature drift and model uncertainty.
Define: system-level budget target X (time units) for worst-case cold/hot.
Pass criteria: reserve ≥ X across the mission profile extremes.
6) Dominant-term identification
Method: change one condition at a time (temperature point, harness load, node count, edge shaping).
Goal: find the term that moves the most and correlates with errors.
Output: a single “dominant drift term” to target in design and validation.
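The budget template can be expressed directly: sum the consuming terms at each temperature point, compute the reserve, and pick the term that moved the most. The 500 ns window and all term values below are illustrative placeholders:

```python
# Sketch of the budget: sum the consuming terms per temperature point, compute
# the reserve against the usable window, and identify the dominant drift term.
# The 500 ns window and all term values are illustrative placeholders.

def margin_reserve(window_ns, terms):
    """Reserve = usable window minus every consuming term."""
    return window_ns - sum(terms.values())

def dominant_term(cold, hot):
    """The term with the largest cold-to-hot shift is the one to target first."""
    return max(cold, key=lambda k: abs(hot[k] - cold[k]))

cold = {"harness_prop": 110.0, "xcvr_delay": 150.0,
        "clock_error": 30.0, "edge_slew": 40.0}
hot  = {"harness_prop": 112.0, "xcvr_delay": 185.0,
        "clock_error": 45.0, "edge_slew": 60.0}

for label, terms in (("cold", cold), ("hot", hot)):
    print(label, "reserve_ns =", margin_reserve(500.0, terms))
print("dominant:", dominant_term(cold, hot))  # prints: dominant: xcvr_delay
```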
H2-5 · Aging Mechanisms & Long-Term Drift
Intent
Translate “works at launch but degrades after years” into measurable indicators. Focus on observable spec shifts
(threshold, delay, leakage, thermal path) and how to detect them with practical tests and logs.
Do / Don’t (anti-overlap)
Do: map aging mechanisms to measurable spec-level shifts and system symptoms.
Do: prefer before/after (baseline vs stressed) comparisons and correlation with logs.
Do not: turn into semiconductor physics encyclopedia or ASIL fault-tree coverage.
Boundary: no EMC/ESD component selection; no device-specific register recipes.
Symptom-first entry (what gets noticed in the field)
“Intermittent failures” after years
Likely indicators: delay skew creeping, threshold margin shrinking, thermal events becoming frequent.
Fast evidence: error counters correlate with temperature/time-in-state; reproduce near boundary temperature points.
“High-temp aging makes RX more picky”
Likely indicators: RX threshold/hysteresis shift; reduced stable sampling window.
Fast evidence: worst-case common-mode and slow-edge cases fail at hot after stress but pass at baseline.
“Standby current creeps up”
Likely indicators: leakage increase; thermal headroom shrinks; shutdown triggers earlier under the same load.
Fast evidence: Iq vs temperature curve shifts upward after stress; shutdown frequency increases.
EM (electromigration) → resistance & drive headroom
Accelerated by: sustained high current density, high temperature, long duty cycles.
Spec shifts: effective drive margin reduces, edge/symmetry degrade, local heating worsens.
System symptoms: CRC/error frames rise under high load; failures become temperature-sensitive.
Detect: baseline vs post-stress edge-time/level comparison; correlate errors with load and temperature point.
NBTI / PBTI → threshold drift & timing
Accelerated by: high temperature and long bias time (mission profile matters).
Spec shifts: input threshold shifts, propagation delays increase, margins thin.
System symptoms: “works at room, fails at hot/cold edge”; cold-start sensitivity increases.
Detect: before/after threshold-margin tests; delay/loop symmetry checks across temperature points.
HCI (hot-carrier) → speed & edge behavior
Accelerated by: high switching stress, large voltage stress, high activity.
Spec shifts: delay drift, slew/edge changes, symmetry degradation under load.
System symptoms: borderline timing failures emerge; CRC spikes show up first at the fastest modes.
Detect: edge-time and delay vs activity sweep; correlate with fastest-mode error counters.
Thermo-mechanical stress (cycling/fatigue) → thermal path & recovery
Accelerated by: temperature cycling, vibration, repeated heat-soak and cool-down.
Spec shifts: effective thermal resistance worsens, local hot spots intensify, recovery becomes slower.
System symptoms: thermal shutdown becomes frequent; “intermittent” failures near certain temperature points.
Detect: compare shutdown frequency and recovery time under identical load before/after cycling.
Risk ranking (engineering decision view)
Exposure
Time-at-high-Tj, cycling count, peak duration, and duty cycles decide the acceleration of long-term drift.
Sensitivity
Systems with thin timing/threshold reserve are more likely to show field failures for the same amount of drift.
Detectability
Intermittent issues are risky because they are hard to reproduce; require temperature-tagged logs and controlled sweeps.
Pass criteria placeholders (set by system budget)
Δthreshold ≤ X · Δdelay/loop symmetry ≤ X · leakage/Iq increase ≤ X ·
thermal trip frequency increase ≤ X / hour (under identical load).
Diagram · Mechanism → symptom → test method
H2-6 · Thermal Shutdown Behavior
Intent
Standardize how thermal events are described and diagnosed: trip point, hysteresis, cooldown, latched vs auto-retry,
and the bus-visible behavior for TX/RX and error counters — distinct from TxD dominant timeout.
Thermal shutdown state model (engineering vocabulary)
Trip
Entry condition when internal junction temperature crosses the trip threshold. Define the measurement reference (Tj vs proxy).
Hysteresis
The temperature delta required to exit shutdown. Small hysteresis risks repeated in/out oscillation under pulsed loads.
Cooldown
A time component may persist even after the temperature drops below the recovery threshold; treat recovery as a timed gate if applicable.
Latched vs auto-retry
Some devices require host intervention to return to normal. Others auto-retry after cooldown. Log which behavior applies.
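The vocabulary above maps naturally onto a small state machine. This is a behavioral sketch with placeholder thresholds, not any device's actual trip logic:

```python
# Sketch of the thermal-shutdown vocabulary as a state machine: trip threshold,
# hysteresis, cooldown timer, and latched vs auto-retry as a policy flag.
# All thresholds are placeholders, not device values.

class ThermalShutdown:
    def __init__(self, trip_c=165.0, hysteresis_c=15.0, cooldown_s=5.0,
                 latched=False):
        self.trip_c, self.hyst_c = trip_c, hysteresis_c
        self.cooldown_s, self.latched = cooldown_s, latched
        self.state = "NORMAL"            # NORMAL | SHUTDOWN | COOLDOWN
        self._cool_elapsed = 0.0
        self.host_cleared = False        # only consulted when latched

    def step(self, tj_c, dt_s=1.0):
        if self.state == "NORMAL" and tj_c >= self.trip_c:
            self.state = "SHUTDOWN"      # trip: TX typically forced off/fail-safe
        elif self.state == "SHUTDOWN" and tj_c <= self.trip_c - self.hyst_c:
            self.state, self._cool_elapsed = "COOLDOWN", 0.0
        elif self.state == "COOLDOWN":
            if tj_c >= self.trip_c:      # re-heated during cooldown
                self.state = "SHUTDOWN"
            else:
                self._cool_elapsed += dt_s
                if (self._cool_elapsed >= self.cooldown_s
                        and (not self.latched or self.host_cleared)):
                    self.state = "NORMAL"
        return self.state

ts = ThermalShutdown()
trace = [ts.step(t) for t in (170.0, 155.0, 149.0) + (140.0,) * 5]
print(trace)
```

With small hysteresis and a pulsed load, this model shows the in/out oscillation the text warns about; a latched policy would keep it in SHUTDOWN until `host_cleared` is set.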
Bus-visible behavior (what can be observed)
TX outward state
During shutdown/cooldown, TX may force recessive, tri-state, or enter a limited/fail-safe state. Confirm with bus-level observation.
RX outward state
RX may remain active, switch to fail-safe, or present stable but degraded decode. Observe error counters and any persistent faults.
Error counters & bus-off
Track TEC/REC growth, bus-off occurrences, and recovery time. Thermal events often show strong correlation with temperature and duty cycle.
Thermal shutdown vs TxD dominant timeout (boundary check)
Thermal: triggered by internal temperature; recovery depends on cooling and hysteresis/cooldown.
Dominant timeout: triggered by TxD staying dominant too long; recovery depends on TxD release and internal timer policy.
Fast discriminator: thermal events correlate with load/temperature; dominant timeout correlates with TxD behavior/software states.
Reproducible diagnostic loop (minimum)
1) Tag every dropout
Record temperature point, duty cycle/load state, and time since boot. Without temperature tags, “intermittent” remains unbounded.
2) Separate thermal vs timeout
Check whether TxD was held dominant; check whether thermal flags/events correlate with rising temperature under identical bus conditions.
3) Pass criteria placeholders
No repeated trip oscillation more than Y times within X minutes under steady load;
recovery time ≤ X; post-recovery error counters stabilize within Y seconds.
Diagram · Thermal shutdown state machine (TX/RX outward behavior)
H2-7 · Recovery Strategies & Rate Limiting
Intent
Prevent “reconnect storms” after thermal events. Build a rate-limited recovery loop with cooling gates, controlled retries,
graded de-rating, safe rejoin steps, and black-box fields that make postmortems reproducible.
Stop-the-storm triage (field-first)
Reconnect faster → fails faster
Likely cause: retry loop re-heats the device; cooling gate is missing or too weak.
Quick check: compare retry cadence vs temperature proxy and shutdown count.
Fix: enforce cooldown gate + exponential backoff + max retries per time window.
Pass criteria: no more than N retries within Y minutes under steady load.
Recovers, then bus-off repeats
Likely cause: rejoin is too aggressive; TX resumes before the bus is stable.
Quick check: measure TEC/REC trend during the first seconds after recovery.
Fix: staged rejoin (silent observe → limited TX → normal).
Pass criteria: TEC/REC stop rising within X seconds after rejoin.
Recovers but error rate stays high
Likely cause: residual thermal stress or drift keeps margins thin; immediate full-speed resumes too soon.
Quick check: run temperature-tagged error counters at reduced load vs normal load.
Fix: de-rate ladder (speed/drive) + longer cooldown; keep “listen-only” at higher levels.
Pass criteria: error counters stable within X per Y minutes after stabilization window.
Recovery loop blueprint (rate-limited)
1) Detect
Thermal flag/event (if available) or bus symptoms (bus-off, sustained TEC/REC growth, repeated error frames) + temperature tag.
2) Cooling gate
Pass at least one gate: temperature (Tproxy<X and dT/dt<Y), time (wait X + observe Y),
or power (load<X% / current<X).
3) Rate limit retries
Use one or combine: exponential backoff (1→2→4→8s… cap X),
window cap (≤N per Y min), token bucket (k tokens/min; each attempt costs 1–m).
4) De-rate ladder
If an attempt fails, step down: lower speed → lower drive → listen-only → isolate/disable.
Only step up after stable windows.
5) Rejoin steps
Clear/record counters → silent observe → limited TX (rate/drive constrained) → normal mode.
Abort and backoff if TEC/REC rises or bus-off repeats.
6) Black-box fields
Policy level (L0–L4), retry index, backoff time, retry cost (token), last stable window length.
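Steps 2 and 3 can be combined into one retry governor: exponential backoff with a cap plus a rolling-window cap. The constants are placeholders for the system budget:

```python
# Sketch: exponential backoff with a cap, combined with a retry-window cap
# (no more than max_per_window attempts in any rolling window). All constants
# are placeholders to be replaced by the system budget.

class RetryGovernor:
    def __init__(self, base_s=1.0, cap_s=60.0, window_s=600.0, max_per_window=4):
        self.base_s, self.cap_s = base_s, cap_s
        self.window_s, self.max_per_window = window_s, max_per_window
        self.failures = 0            # consecutive failed attempts
        self.attempt_times = []      # timestamps of recent attempts

    def backoff_s(self):
        """Delay before the next attempt: 1 -> 2 -> 4 -> 8 s ... capped."""
        return min(self.base_s * (2 ** self.failures), self.cap_s)

    def may_attempt(self, now_s):
        """Rolling-window cap; a cooling gate would be checked in addition."""
        self.attempt_times = [t for t in self.attempt_times
                              if now_s - t < self.window_s]
        return len(self.attempt_times) < self.max_per_window

    def record(self, now_s, success):
        self.attempt_times.append(now_s)
        self.failures = 0 if success else self.failures + 1

g = RetryGovernor()
for t in (0.0, 2.0, 6.0, 14.0):      # four failing attempts
    if g.may_attempt(t):
        g.record(t, success=False)
print(g.backoff_s(), g.may_attempt(20.0))  # prints: 16.0 False
```

The cooling gate and de-rate ladder sit around this governor: even when `may_attempt` returns True, an attempt should only fire after a temperature or time gate passes.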
Pass criteria placeholders
After recovery, no repeated trip oscillation more than Y times within X minutes;
rejoin stability window ≥ X; error counters stop rising within Y seconds.
Diagram · Recovery timeline (temperature + bus state + rate limiting)
H2-8 · Thermal Design Hooks
Intent
Explain why the same IC behaves differently on different boards. Provide a thermal-path checklist and a practical junction
temperature estimation loop (power → thermal resistance → Tj), plus measurement hygiene to make comparisons meaningful.
Why board-to-board thermal behavior differs
Heat source proximity
Nearby DC/DCs, power switches, and dense hot zones raise local ambient and reduce cooling headroom.
Copper spreading & vias
Copper area continuity and thermal via density dominate conduction into inner planes and the backside.
Chassis / airflow interface
Mechanical contact, thermal pads, enclosure conduction, and airflow create large unit-to-unit differences.
PCB thermal-path checklist (actionable)
Placement
Avoid local hot pockets and blocked airflow zones.
Keep distance from high-power converters and switches where possible.
Ensure a continuous copper spreading region is available near the package.
Copper spreading
Expand copper around pads to spread heat laterally.
Avoid split planes that cut the heat path into islands.
Connect to inner planes for additional spreading area.
Thermal vias
Use via arrays near/under the package to couple layers.
Ensure vias connect to meaningful spreading copper on other layers.
Check that solder mask/land patterns do not unintentionally block the thermal route.
Thermometry
Define temperature reference: Ta / Tc / Tj-proxy and keep it consistent.
Use identical load profiles when comparing boards; log duty cycle and airflow conditions.
Treat probe attachment and IR emissivity as potential sources of comparison error.
Junction temperature estimation loop (power → θ → Tj)
Estimate power (P): use typical and peak power with duty-cycle weighting (mission profile).
Choose thermal reference: Ta with θJA or Tc with θJC (keep the reference consistent).
Compute ΔT: ΔT = P × θ (use the matching θ definition for the chosen reference).
Compute Tj: Tj = Tref + ΔT (Tref is Ta or Tc).
Validate: compare predicted vs measured Tc/Tproxy under identical load; fold residual into margin.
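The loop reduces to two formulas plus a residual check. The θ values, power, trip threshold, and the measured case temperature below are illustrative placeholders:

```python
# Sketch of the loop: predict Tj from a named reference, cross-check against a
# measured case temperature, and fold the residual into the trip margin.
# Theta values, power, and the 165 C trip are illustrative placeholders.

def predict_tj(t_ref_c, power_w, theta_c_per_w):
    """Tj = Tref + P * theta, with a consistently chosen reference/theta pair."""
    return t_ref_c + power_w * theta_c_per_w

def margin_to_trip(tj_pred_c, trip_c, residual_c):
    """Treat |predicted - measured| at the case point as model uncertainty."""
    return trip_c - tj_pred_c - abs(residual_c)

theta_ja, theta_jc = 45.0, 8.0       # placeholder C/W for this board variant
p_w, ta_c = 0.8, 85.0                # duty-weighted power, ambient reference
tj_pred = predict_tj(ta_c, p_w, theta_ja)
tc_pred = tj_pred - p_w * theta_jc   # expected case temperature
tc_meas = 118.0                      # measured on the defined case point
print(round(margin_to_trip(tj_pred, 165.0, tc_meas - tc_pred), 1))  # prints: 40.6
```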
Pass criteria placeholders
Under peak load for X minutes, Tproxy stays at least Y°C below trip;
recovery time ≤ X; trip frequency ≤ X per hour for the same mission profile.
H2-9 · Validation Matrix (Sweep, Soak, Cycling)
Intent
Build a credible temperature validation plan. Combine sweep (find boundaries), soak (prove steady-state margin), and cycling
(expose stress-driven intermittents), with consistent windows, counters, and pass criteria that enable regression.
Evidence chain (what “credible” means)
Reproducible setup
Fixed definition of Tproxy (Ta/Tc/Tj-proxy), fixed log windows, and consistent load profiles (duty, burstiness, bus load).
Corner coverage
Validate worst thermal corner, cold-start corner, and boundary corners (fine sweep around the sensitive temperature band).
Comparable pass criteria
Use the same metric windows: error frames per window, bus-off per hour, recovery time, and thermal trip count per mission segment.
Temperature sweep (find boundaries)
Set: choose step size in the sensitive band; define window length.
Stabilize: require dT/dt < Y and Tproxy stable for X minutes.
Stimulate: fixed bus load pattern + controlled burst (repeatable).
Judge: record counters and mark the first failing temperature band for deeper sweep.
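The sweep procedure can be sketched as a loop with a stabilization gate and one identical stimulus window per step; the callbacks and error threshold here are stand-ins for the real rig:

```python
# Sketch of the sweep loop: fixed steps, a stabilization gate before stimulus,
# one identical window per step, first failing band as output. The callbacks
# and error threshold are stand-ins for the real rig.

def sweep(points_c, stabilized, run_window, max_errors):
    """Return the first temperature whose window metric fails, else None."""
    for t in points_c:
        if not stabilized(t):        # dT/dt gate must pass before stimulating
            continue
        if run_window(t) > max_errors:
            return t                 # mark this band for a finer follow-up sweep
    return None

first_fail = sweep(
    points_c=range(90, 130, 5),                 # degree-C steps in the band
    stabilized=lambda t: True,                  # rig: dT/dt < Y check
    run_window=lambda t: 0 if t < 115 else 12,  # rig: errors per window
    max_errors=3,
)
print(first_fail)  # prints: 115
```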
Temperature soak (prove steady-state margin)
Set: hold Tmin and Tmax long enough to reach true steady state.
Stimulate: sustained typical and peak segments; include recovery attempts only by policy.
Record: TEC/REC trend, bus-off, recovery duration, and trip count.
Judge: counters must not drift upward across windows (no silent degradation).
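The judge step ("no silent degradation") can start as a deliberately simple trend check; a real plan would apply a proper statistical trend test over more windows:

```python
# Sketch of the soak judge: counters must not drift upward across windows.
# "Upward drift" is simplified here to strictly increasing window counts;
# a real plan would use a proper trend test.

def drifts_upward(window_counts):
    return len(window_counts) >= 2 and all(
        b > a for a, b in zip(window_counts, window_counts[1:]))

print(drifts_upward([3, 2, 4, 3, 3]))   # noisy but flat -> False
print(drifts_upward([1, 2, 4, 7, 11]))  # silent degradation -> True
```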
Thermal cycling (expose intermittents)
Profile: define cycle amplitudes and dwell times (cold soak → hot soak).
Trigger: log event snapshots at cycle transitions and after dwell completion.
Watch: intermittent failures that appear only after multiple cycles.
Regress: reproduce the failure at a single temperature band using sweep to localize margin loss.
Pass criteria placeholders (consistent windows)
In each window: error frames ≤ X / minute; bus-off ≤ X / hour;
recovery duration ≤ X; thermal trips ≤ X per mission segment.
Window definition (length, counters, resets) must be versioned.
Diagram · Verification matrix flow (set → stabilize → stimulate → record → judge → regress)
H2-10 · Production & Field Monitoring
Intent
Turn temperature drift and aging into observable, locatable signals. Define a minimum logging set, event windows, and a black-box
pipeline that helps separate “thermal” vs “harness/topology” vs “external disturbance” using data rather than guesswork.
Minimum viable logging (start with 6 fields)
Required 6
Tproxy value + label (Ta/Tc/Tj-proxy)
Thermal event / trip count
Bus-off count
Recovery duration (start/end timestamps or elapsed)
TEC/REC (peak and delta around events)
VBAT / critical rail minimum within the event window
Strongly recommended
Node role (ECU / gateway / endpoint) + topology ID (harness/stub option)
Rate / de-rate policy level tag
Load tag (utilization tier) and duty/burst profile ID
dT/dt estimate around the event
Metric window version (prevents “same name, different meaning”)
Eventization (avoid raw-log floods)
Triggers (examples)
bus-off transition
error passive exceeds X seconds
thermal event / trip flag
recovery start → recovery end
Window aggregation
For each event window, store: start/end timestamps, max/mean Tproxy, VBAT min, TEC/REC peak and slope,
counts (bus-off, error frames), and recovery duration. Keep the window definition fixed and versioned.
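Window aggregation can be sketched as one compact record per event. The field names mirror the minimum logging set and are placeholders for the program's actual record format:

```python
from dataclasses import dataclass

# Sketch of one aggregated event window. Field names mirror the minimum
# logging set and the window-aggregation list; they are placeholders for
# the program's actual record format.

@dataclass
class EventWindow:
    window_version: str
    t_start_ms: int
    t_end_ms: int
    tproxy_max_c: float
    tproxy_mean_c: float
    vbat_min_v: float
    tec_peak: int
    rec_peak: int
    busoff_count: int
    recovery_ms: int

def summarize(samples, window_version, t_start_ms, t_end_ms,
              busoff_count=0, recovery_ms=0):
    """Collapse raw samples into one compact record instead of streaming them."""
    tproxy = [s["tproxy_c"] for s in samples]
    return EventWindow(
        window_version, t_start_ms, t_end_ms,
        max(tproxy), sum(tproxy) / len(tproxy),
        min(s["vbat_v"] for s in samples),
        max(s["tec"] for s in samples), max(s["rec"] for s in samples),
        busoff_count, recovery_ms,
    )

samples = [{"tproxy_c": 101.0, "vbat_v": 11.8, "tec": 96, "rec": 12},
           {"tproxy_c": 104.0, "vbat_v": 11.2, "tec": 128, "rec": 20}]
print(summarize(samples, "v3", 0, 2000, busoff_count=1, recovery_ms=850))
```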
Service payload
Report compact event summaries (not raw streams) to a service tool or backend. Preserve topology ID and policy level tags
for comparison across vehicles and boards.
Data-driven separation: thermal vs harness/topology vs external disturbance
Looks thermal
Errors correlate with Tproxy and approach-trip neighborhood. De-rate or load reduction improves stability.
Recovery requires time/cooling gates; dT/dt and trip count track the failure frequency.
Looks harness / topology
Same temperature, different harness/topology ID changes the outcome. Higher node count or longer harness increases error rate.
Symptoms reproduce without needing a high Tproxy or thermal neighborhood.
Looks external disturbance
Weak temperature correlation; failures appear as bursts. Event timestamps cluster around specific vehicle actions
(motor start, relay switching). VBAT minima or transient tags align with the error spikes.
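The three separation rules can be written as a first-pass classifier over per-window statistics; the evidence keys and the 0.7 correlation threshold are illustrative, not calibrated values:

```python
# Sketch: the separation rules as a first-pass classifier over per-window
# statistics computed offline. The evidence keys and the 0.7 correlation
# threshold are illustrative, not calibrated values.

def classify(window):
    if window["tproxy_error_corr"] > 0.7 and window["near_trip"]:
        return "thermal"
    if window["topology_changes_outcome"] and not window["near_trip"]:
        return "harness/topology"
    if window["burst_errors"] and window["vbat_dip_aligned"]:
        return "external disturbance"
    return "inconclusive"

print(classify({"tproxy_error_corr": 0.85, "near_trip": True,
                "topology_changes_outcome": False,
                "burst_errors": False, "vbat_dip_aligned": False}))
```

Every classification should then be confirmed with a controlled reproduction, as the production-gate checks below require.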
H2-11 · Engineering Checklists (Three Gates)
Intent
Provide an executable checklist engineers can run. Items stay strictly within temperature/aging scope and include Quick checks and
Pass criteria placeholders. Where relevant, example material part numbers are listed for direct BOM adoption.
Example materials (MPN palette used in this checklist)
Analog temperature sensor: TI TMP235-Q1 (example orderable code: TMP235AQDBZTQ1)
Digital temperature sensor: ADI ADT7420
VBAT / rail monitor: TI INA226-Q1 (I²C current/power monitor)
Note: part numbers are examples; verify qualification, package, derating, and availability against program requirements.
Design Gate
Lock definitions, thermal path assumptions, recovery policy, and observability before layout freeze.
D1 · Temperature vocabulary and Tproxy measurement locked
Quick check
Confirm a single definition for Tproxy (Ta/Tc/Tj-proxy label), sensor placement, and sampling window. Compare Tproxy trend against a second reference point
(case or ambient) during a controlled load step.
Pass criteria
Tproxy label and location documented; window length fixed; Tproxy noise ≤ X over Y seconds
in steady state; load step produces consistent ΔTproxy within ±X%.
Example parts: TI TMP235-Q1 (analog T sensor),
ADI ADT7420 (digital T sensor).
D2 · Tj estimation sanity check (power → θ → Tj)
Quick check
Under peak load script, measure rail current and estimate worst-case dissipation. Run a simple Tj estimate using documented θ assumptions,
then compare Tproxy rise trend as a sanity check (trend consistency matters more than absolute accuracy).
Pass criteria
Worst-case estimated Tj ≤ (limit − X margin). Peak-load measurement confirms dissipation model within
±X%. θ assumptions versioned and tied to board variant.
Example parts: TI INA226-Q1 (rail current/power monitor),
TI TMP235-Q1 or ADI ADT7420 (Tproxy).
D3 · Thermal path reviewed (layout bottlenecks)
Quick check
Review layout for heat-spreading copper, thermal vias, and proximity to other heat sources. Identify the top 3 likely thermal bottlenecks
(pad area, via density, airflow blockage) and tag them for bring-up verification.
Pass criteria
Bottlenecks documented; board variant has a thermal path checklist completed; design includes at least one direct measurement point for Tproxy and a repeatable load script for thermal baselining.
Example parts (measurement aids): TI TMP235-Q1 or ADI ADT7420 (Tproxy reference for thermal baseline).
D4 · Recovery policy defined (rate limiting + de-rate ladder)
Quick check
Define explicit cooling gates (Tproxy threshold + cooldown time) and retry backoff to prevent reconnect storms.
Include at least one de-rate level (reduced rate / limited duty / receive-only mode) for near-trip conditions.
Pass criteria
Recovery attempts ≤ N per Y minutes; cooldown gates prevent oscillation.
De-rate ladder documented and testable.
Example parts (for policy inputs): TI TMP235-Q1 / ADI ADT7420 (Tproxy), TI INA226-Q1 (rail droop correlation).
D5 · Observability fields defined (event windows)
Quick check
Ensure event windows capture: Tproxy, thermal trips, bus-off count, recovery duration, TEC/REC peak+delta, and VBAT/rail minimum.
Verify the window definition is versioned and included in records.
Pass criteria
A triggered event produces a complete record with timestamps; at least X events can be stored without loss;
window version is always present.
Example parts: Fujitsu MB85RS64V (SPI FRAM for event storage),
TI INA226-Q1 (rail metrics), TI TMP235-Q1 or ADI ADT7420 (temperature).
Bring-up Gate
Locate temperature boundaries early; prove recovery stability; establish correlations using consistent windows.
B1 · Boundary sweep (find first failing temperature band)
Quick check
Perform a step sweep across the suspected sensitive band (step size fixed). At each step: stabilize (dT/dt gate),
run the same stimulus window, then record window metrics.
Pass criteria
First-fail band repeats within ±X°C across two runs; metrics are comparable (same window definition version).
Example parts: ADI ADT7420 (fine-resolution T readout) or TI TMP235-Q1 (analog Tproxy).
B2 · Soak at extremes (prove steady-state margin)
Quick check
Hold Tmin and Tmax long enough to reach true steady state. Run typical and peak segments.
Check for trending counters (error frames, bus-off, recovery duration) across windows.
Pass criteria
No monotonic drift of counters across K windows; bus-off ≤ X / hour; recovery duration ≤ X.
Example parts: TI INA226-Q1 (power/rail tracking), TI TMP235-Q1 or ADI ADT7420 (Tproxy).
B3 · Controlled recovery test (no reconnect storm)
Quick check
Near the trip neighborhood, trigger a thermal event (or simulate near-trip conditions) and verify recovery gating:
cooldown check, retry backoff, and de-rate ladder transitions.
Pass criteria
Recovery attempts ≤ N in Y minutes; no oscillation between states;
stable operation resumes within X after cooldown gate is satisfied.
Example parts: TI TMP235-Q1 / ADI ADT7420 (cooldown gate input),
Fujitsu MB85RS64V (store recovery event windows for later regression).
B4 · Correlation triage (thermal vs rail vs non-thermal)
Quick check
Tag each window with Tproxy, VBAT minimum, and load level. Repeat one failing case with de-rate or load reduction.
Correlate improvement with reduced temperature rise and/or reduced power.
Pass criteria
Correlation conclusion is reproducible: “thermal-dominant” vs “rail-dominant” vs “non-thermal-like” based on window stats.
At least X repeated trials support the classification.
Example parts: TI INA226-Q1 (VBAT/rail min and power),
ADI ADT7420 or TI TMP235-Q1 (Tproxy).
Production Gate
Fix corner coverage and make temperature/aging issues observable in production and service.
P1 · Corner set fixed and station-to-station comparable
Quick check
Freeze corner set definitions (worst thermal, cold-start, boundary, recovery). Verify every station uses the same stimulus script,
the same window definition version, and the same logging fields.
Pass criteria
Station-to-station metric delta ≤ X for baseline windows; corner set version and window version always recorded.
Example parts (for comparability): TI INA226-Q1 (rail/power in test rigs),
ADI ADT7420 / TI TMP235-Q1 (temperature reference).
P2 · Eventization pipeline produces compact service payloads
Quick check
Trigger representative events (bus-off, recovery, thermal flag). Confirm the system writes a compact event summary with
timestamps, window stats, topology/policy tags, and window version.
Pass criteria
Each event produces a complete summary record; storage supports ≥ X events; readout works via service tooling without raw log floods.
Example parts: Fujitsu MB85RS64V (nonvolatile event storage),
TI INA226-Q1 (rail min/power in event window), TI TMP235-Q1 / ADI ADT7420 (temperature).
P3 · Field triage rules operational (thermal vs non-thermal separation)
Quick check
Using stored event windows, classify a case with the triage rules: “thermal-like” (Tproxy correlation, near-trip neighborhood, de-rate helps),
“rail-like” (VBAT minima aligns), or “non-thermal-like” (burst errors without Tproxy correlation). Validate the classification with a controlled reproduction.
Pass criteria
For a known reference case, classification matches reproduction results in ≥ X out of Y trials; required 6 fields are always present.
Example parts: TI INA226-Q1 (rail minima),
Fujitsu MB85RS64V (event retention),
TI TMP235-Q1 / ADI ADT7420 (temperature correlation).
Diagram · Three Gates (Design → Bring-up → Production)
Each entry follows the same structure: Likely cause / Quick check / Fix / Pass criteria (numeric thresholds as X, Y, N placeholders, always tied to a measurement window).
▸ High-temperature intermittent bus-off: thermal shutdown or timing margin eaten?
Likely cause:
The event is either heat-triggered state changes (near trip) or margin collapse from temperature-driven delay/threshold drift.
Quick check:
Correlate bus-off timestamps with Tproxy vs trip distance (ΔT to trip), and with rail minima. Repeat once with a temporary de-rate (lower data rate or reduced dominant duty) and compare bus-off rate.
Fix:
If heat-triggered, add cooldown gating + retry backoff; if margin-driven, recompute worst-case temperature timing budget and apply a safer operating point (de-rate ladder / tighter window control).
Pass criteria:
At Tmax soak, bus-off ≤ X/hour over Y-minute windows; ΔT to trip ≥ X°C in steady state; recovery duration ≤ X seconds.
▸ Cold-start link failure at Tmin: marginal power-up sequencing or cold threshold drift?
Likely cause:
Either power-up sequencing (VBAT ramp, reset release timing) is marginal at cold, or receiver thresholds/edge rates shift enough to break early communication windows.
Quick check:
Log VBAT minimum during ramp, reset cause, and first-success timestamp at Tmin. Re-run with a delayed network start (wait for stable rails) and compare first-link success rate.
Fix:
Add a cold-start gate: “rail stable + cooldown time met” before network entry; widen reset-release margin; apply conservative start mode until stable temperature/rails are confirmed.
Pass criteria:
At Tmin, first-link success ≥ X% over N cold starts; VBAT min ≥ X V during ramp; time-to-first-traffic ≤ X seconds.
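The cold-start gate from the fix can be sketched as a simple predicate. The sample window, rail floor, and settling time are illustrative placeholders for the X values in the pass criteria, not recommended numbers:

```python
def cold_start_gate(vbat_samples_v, elapsed_ms,
                    rail_floor_v=9.0,     # placeholder "VBAT min >= X V"
                    stable_samples=5,     # consecutive good samples required
                    min_wait_ms=100):     # minimum settling time
    """Allow network entry only once VBAT has stayed above the floor for
    N consecutive samples AND a minimum settling time has elapsed."""
    if elapsed_ms < min_wait_ms:
        return False
    recent = vbat_samples_v[-stable_samples:]
    return (len(recent) == stable_samples
            and all(v >= rail_floor_v for v in recent))
```

Until the gate opens, the node stays in the conservative start mode described above; the gate never blocks indefinitely because VBAT either stabilizes or the rail-like triage path takes over.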
▸ After thermal shutdown, the link keeps dropping: retry storm or cooldown gate too aggressive?
Likely cause:
Reconnect attempts happen while junction temperature is still near trip, causing repeated fault entry (oscillation) and compounding bus error counters.
Quick check:
Plot a single timeline: Tproxy, thermal-trip flag (or equivalent), recovery attempts, and bus state (error-active/passive/bus-off). Look for “reconnect attempts while ΔT-to-trip is small”.
Fix:
Implement cooldown gates (temperature + time), exponential backoff, and a de-rate ladder (reduced load / receive-only) until ΔT-to-trip stays healthy.
Pass criteria:
Recovery attempts ≤ N per Y minutes; no oscillation for Y minutes; ΔT-to-trip ≥ X°C before network re-entry; recovery duration ≤ X seconds.
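The cooldown gate plus exponential backoff can be combined in one small state machine. A sketch under assumed thresholds (the minimum margin and backoff bounds stand in for the X/Y placeholders; the class and method names are hypothetical):

```python
class RecoveryGate:
    """Cooldown-gated reconnect with exponential backoff (sketch)."""

    def __init__(self, trip_c, min_margin_c=15.0,
                 base_backoff_s=1.0, max_backoff_s=60.0):
        self.trip_c = trip_c
        self.min_margin_c = min_margin_c      # ΔT-to-trip must exceed this
        self.base_backoff_s = base_backoff_s
        self.max_backoff_s = max_backoff_s
        self.backoff_s = base_backoff_s

    def may_reconnect(self, tproxy_c, since_last_attempt_s):
        """Both gates must pass: temperature margin AND elapsed backoff."""
        margin_ok = (self.trip_c - tproxy_c) >= self.min_margin_c
        backoff_ok = since_last_attempt_s >= self.backoff_s
        return margin_ok and backoff_ok

    def on_failed_attempt(self):
        # Doubling the wait caps the retry rate and kills retry storms.
        self.backoff_s = min(self.backoff_s * 2, self.max_backoff_s)

    def on_success(self):
        self.backoff_s = self.base_backoff_s
```

A de-rate ladder (reduced load / receive-only) would sit in front of `may_reconnect`, selecting the operating mode once re-entry is allowed.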
▸ Same device runs hotter on a different PCB: θJA assumption or real power mismatch?
Likely cause:
Thermal path differences (copper/vias/airflow/nearby heat sources) invalidate θ assumptions, or real operating dissipation is higher (duty cycle, load, rail droop).
Quick check:
Run identical load script and compare: rail current/power, Tproxy rise rate (dT/dt), and steady-state ΔT. If power is similar but ΔT is higher, the board thermal path is the main suspect.
Fix:
Update θ model per board variant; add/repair thermal spreading and vias; isolate heat sources; add airflow/heat sinking where needed; tighten power accounting and dominant-duty policies.
Pass criteria:
For the same script, ΔT steady-state difference between boards ≤ X°C; power accounting error ≤ ±X%; ΔT-to-trip margin ≥ X°C at Tmax ambient.
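The quick check reduces to one arithmetic comparison: if ΔT ≈ P·θJA, then similar power with a larger ΔT implicates the board thermal path. A minimal sketch, with the 10 % power tolerance as an assumed placeholder:

```python
def delta_t_expected(power_w, theta_ja_c_per_w):
    """Steady-state rise over ambient from the simple theta model."""
    return power_w * theta_ja_c_per_w

def board_thermal_suspect(power_a_w, dt_a_c, power_b_w, dt_b_c,
                          power_tolerance=0.10):
    """True if dissipation matches within tolerance but board B runs a
    larger ΔT -- pointing at copper/vias/airflow, not real power."""
    similar_power = (abs(power_a_w - power_b_w)
                     <= power_tolerance * max(power_a_w, power_b_w))
    return similar_power and dt_b_c > dt_a_c
```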
▸ Lab is OK, vehicle fails at high temperature: which mission profile item is most often missing?
Likely cause:
Hot-soak + rapid restart + peak load bursts are missing in the lab plan, so the true worst thermal transient and recovery loop never got exercised.
Quick check:
Reproduce “park hot-soak → restart → peak traffic” with a scripted sequence. Compare Tproxy peak, ΔT-to-trip margin, and recovery attempts vs steady-state lab soak.
Fix:
Add the missing mission segment into the verification matrix and tune de-rate/recovery gates for that transient; ensure logging windows capture the restart and first traffic phase.
Pass criteria:
During hot-soak restart, ΔT-to-trip ≥ X°C; bus-off ≤ X per event; recovery attempts ≤ N; recovery duration ≤ X seconds.
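The missing mission segment is easiest to keep in the matrix as a data-driven sequence. A sketch only: the step names, parameters, and hook interface are hypothetical; real soak/restart/traffic hooks would come from the chamber and bus-load tooling.

```python
# Hypothetical mission segment: park hot-soak -> rapid restart -> peak burst.
HOT_SOAK_RESTART = [
    ("soak",    {"ambient_c": 85, "minutes": 60}),
    ("restart", {"max_off_s": 5}),
    ("traffic", {"utilization_pct": 90, "minutes": 10}),
]

def run_sequence(steps, hooks, log):
    """Drive each phase through its hook and record it in the log, so the
    restart and first-traffic windows are always captured."""
    for name, params in steps:
        log.append((name, params))
        hooks[name](**params)
```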
▸ Only fails after thermal cycling: solder/package stress or parameter drift?
Likely cause:
Thermal cycling introduces intermittent physical effects (micro-cracks/contact resistance) or shifts thermal resistance, which then looks like “new drift” under the same load.
Quick check:
Compare pre/post-cycle baselines: steady-state ΔT at the same power, and event rate at the same temperature point. Look for hysteresis (depends on recent thermal history) and sensitivity to mechanical stress (tap/fixture change).
Fix:
If physical/intermittent, improve mechanical/thermal coupling and assembly controls; if drift-like, tighten operating margins, update the verification matrix to include cycle-induced worst cases, and enhance black-box capture around transitions.
Pass criteria:
After N cycles, baseline ΔT shift ≤ X°C at the same power; event rate increase ≤ X%; no intermittent resets over Y hours at boundary temperature.
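The pre/post comparison from the quick check can be folded into one verdict function. The shift and rate thresholds are placeholders for the X values in the pass criteria; the tap-sensitivity flag encodes the mechanical-stress probe:

```python
def cycling_verdict(dt_pre_c, dt_post_c, rate_pre, rate_post,
                    max_dt_shift_c=3.0,      # placeholder "ΔT shift <= X deg C"
                    max_rate_increase=0.2,   # placeholder "rate increase <= X%"
                    tap_sensitive=False):
    """Separate physical/intermittent damage from drift-like shifts after
    thermal cycling, using same-power / same-temperature baselines."""
    if tap_sensitive:
        # Behavior changes with mechanical stress -> physical effect.
        return "physical-intermittent"
    dt_shift = dt_post_c - dt_pre_c
    rate_up = (rate_post - rate_pre) / rate_pre if rate_pre else 0.0
    if dt_shift > max_dt_shift_c or rate_up > max_rate_increase:
        return "drift-like"
    return "within-baseline"
```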
▸ Errors appear only in one temperature band: how to “temperature sweep” to find the boundary term?
Likely cause:
A specific temperature-sensitive contributor dominates only in a narrow band (delay/threshold shift, recovery gating edge, or rail behavior tied to temperature).
Quick check:
Run a step sweep with fixed step size and a fixed stabilization gate (dT/dt below a threshold before logging). At each point, log the same window stats (error frames, bus-off, recovery duration, rail minima) and mark the first-fail and last-pass temperatures.
Fix:
Use the boundary band as the new qualification corner; tune cooldown thresholds or de-rate ladder to avoid operating right on the cliff; tighten power/rail stability in that band.
Pass criteria:
First-fail temperature repeats within ±X°C; within the pass region, error frames ≤ X per Y minutes; boundary mitigation removes failures across an X°C guard band.
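The sweep bookkeeping is two small pieces: a stabilization gate and a boundary bracket over the swept points. A sketch, with the dT/dt gate value as an assumed placeholder:

```python
def stabilized(dt_dt_c_per_min, gate_c_per_min=0.5):
    """Stabilization gate: only log a window once |dT/dt| is below the gate."""
    return abs(dt_dt_c_per_min) < gate_c_per_min

def sweep_boundary(points):
    """points: list of (temperature_c, failed) from a monotonic step sweep.
    Returns (last_pass_c, first_fail_c) bracketing the boundary band."""
    last_pass = first_fail = None
    for temp_c, failed in points:
        if failed:
            if first_fail is None:
                first_fail = temp_c
        elif first_fail is None:
            last_pass = temp_c
    return last_pass, first_fail
```

Repeating the sweep and checking that `first_fail` recurs within ±X°C is exactly the pass criterion above.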
▸ Shutdown threshold “looks like it drifted”: sensor/algorithm error or aging-driven thermal resistance change?
Likely cause:
The observed shift is often a measurement-proxy mismatch (Tproxy location/filters/windowing) or a real ΔT change due to thermal path degradation, not a true internal trip change.
Quick check:
Re-run the same controlled load and compare: power, Tproxy rise rate, and steady-state ΔT at the moment the event occurs. Validate that the Tproxy mapping and window version are identical across runs/boards.
Fix:
Calibrate or relocate Tproxy; standardize filtering/windowing; if ΔT increased at the same power, repair thermal path bottlenecks and update θ per board aging condition.
Pass criteria:
Across repeats, the event occurs within ±X°C in Tproxy when window version is unchanged; ΔT at the same power changes ≤ X°C; mapping error ≤ X%.
▸ Field logs are insufficient: which minimum 6 fields determine thermal involvement?
Likely cause:
Without the right fields, thermal-like failures are misclassified as random, forcing part swaps without root cause.
Quick check:
Verify each event window records these 6 fields: (1) Tproxy, (2) thermal trip/warn count, (3) bus-off count, (4) recovery duration, (5) TEC/REC peak (or delta), (6) VBAT/rail minimum. Include timestamp + window version if available.
Fix:
Implement a compact black-box record with fixed windowing; store a bounded number of recent events; add a service readout path that does not require full raw logs.
Pass criteria:
For ≥ X% of incidents, all 6 fields are present; window version is always present; storage holds ≥ N events; readout succeeds within X minutes at service.
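The compact black-box record can be sketched as a fixed schema plus a bounded ring of recent events. Field names mirror the six-field list above but are illustrative, not a defined log format:

```python
from dataclasses import dataclass, asdict

# The six minimum fields plus timestamp and window version.
@dataclass
class BlackBoxEvent:
    timestamp_s: int
    window_version: int
    tproxy_c: float            # (1) temperature proxy
    thermal_trip_count: int    # (2) trip/warn count
    busoff_count: int          # (3) bus-off count
    recovery_duration_s: float # (4) recovery duration
    tec_rec_peak: int          # (5) TEC/REC peak (or delta)
    vbat_min_v: float          # (6) rail minimum

class BlackBox:
    """Bounded store of recent events -- compact enough for a service
    readout that does not require pulling full raw logs."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.events = []

    def record(self, ev: BlackBoxEvent):
        self.events.append(ev)
        if len(self.events) > self.capacity:
            self.events.pop(0)  # drop oldest; newest events survive
```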
▸ Lowering the data rate makes it stable: margin came back or power/temperature dropped?
Likely cause:
Both effects can occur: lower rate increases timing slack and can reduce switching loss and dominant duty, lowering junction temperature.
Quick check:
Repeat the failing window at two rates while logging Tproxy and rail/power. If Tproxy peak drops materially at the lower rate, thermal/power is a major contributor; if Tproxy is unchanged but errors disappear, margin is dominant.
Fix:
Use a de-rate ladder that reacts to ΔT-to-trip and error trends; lock the operating point away from the boundary; refine timing budget assumptions at worst temperature.
Pass criteria:
At the chosen operating point, error frames ≤ X per Y minutes; ΔT-to-trip ≥ X°C; Tproxy peak reduction (if de-rate) ≥ X°C vs baseline; no bus-off over Y hours soak.
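The two-rate experiment from the quick check yields a verdict from two comparisons. A sketch with an assumed "material drop" threshold standing in for the X°C placeholder:

```python
def rate_ab_verdict(tproxy_peak_high_c, tproxy_peak_low_c,
                    errors_high, errors_low,
                    material_drop_c=5.0):   # placeholder "X deg C"
    """Repeat the failing window at two data rates. A material Tproxy drop
    at the low rate implicates power/temperature; errors vanishing with an
    unchanged Tproxy implicates timing margin."""
    temp_dropped = (tproxy_peak_high_c - tproxy_peak_low_c) >= material_drop_c
    errors_gone = errors_low == 0 and errors_high > 0
    if temp_dropped:
        return "thermal/power contributor"
    if errors_gone:
        return "margin dominant"
    return "inconclusive"
```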
▸ Many nodes at the same temperature make it worse: bus load increases power or margins stack up?
Likely cause:
Higher bus utilization and dominant duty can raise dissipation, while stacked propagation/delay variation across many nodes can consume system margin.
Quick check:
Log per-window bus utilization, dominant duty estimate, and Tproxy rise; compare “few nodes” vs “many nodes” runs at the same ambient. Check whether Tproxy peak and error rate scale with utilization.
Fix:
Cap duty via scheduling or throttling under high temperature, apply de-rate policies when utilization is high, and validate worst-case timing budget with the maximum node count in the matrix.
Pass criteria:
At max node count, error frames ≤ X per Y minutes; Tproxy peak ≤ X°C; bus-off = 0 over Y hours; utilization cap enforced at X%.
▸ “Overheating but the case is not hot”: how to prove junction temperature is truly over-limit?
Likely cause:
Junction-to-case thermal gradient can be large under localized power, so case touch/IR is not a reliable proxy. A mislocated sensor can also under-report the hot spot.
Quick check:
Compute a conservative Tj estimate using measured power and documented θ path (choose a worst-case θ). Cross-check with Tproxy rise rate and with thermal warning/trip events during a repeatable load step.
Fix:
Move or add Tproxy nearer the thermal hot spot; repair thermal spreading; reduce peak power via de-rate; add cooldown gating and event-window logging to capture peak conditions.
Pass criteria:
Worst-case estimated Tj ≤ (limit − X°C); Tproxy mapping error ≤ X%; no thermal trip events over Y hours at peak script; ΔT-to-trip margin ≥ X°C.
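The conservative estimate from the quick check is the simple theta arithmetic Tj ≈ Tc + P·θJC, evaluated with a worst-case θ. A sketch, with the guard band as an assumed placeholder:

```python
def tj_estimate_c(tcase_c, power_w, theta_jc_c_per_w):
    """Conservative junction estimate: Tj ~= Tc + P * theta_JC, using a
    worst-case theta from the documented thermal path."""
    return tcase_c + power_w * theta_jc_c_per_w

def tj_over_limit(tcase_c, power_w, theta_jc_c_per_w,
                  tj_limit_c, guard_c=10.0):  # placeholder "limit - X deg C"
    """True if the worst-case estimate violates (limit - guard): the case
    can feel cool while the junction is already over-limit under
    localized power."""
    return tj_estimate_c(tcase_c, power_w, theta_jc_c_per_w) > (tj_limit_c - guard_c)
```

With Tc = 60 °C, P = 6 W, and θJC = 15 °C/W, the estimate is 150 °C: well over a 140 °C guard-banded limit even though the case reads only 60 °C.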