Lighting Protection & Metering: eFuse, Event Logs, Telemetry
Lighting protection & metering is about making surges/ESD/faults survivable while turning every protection action into auditable evidence (waveforms + event logs + counters). With a coordinated clamp→limit→disconnect chain and crash-safe logging, field failures become measurable, explainable, and serviceable instead of guesswork.
H2-1. Page Mission & System Boundary (What We Protect, What We Prove)
This page treats protection as a controlled action and metering/logging as the proof layer. The goal is not only to survive transients, but to explain what happened (event), what the hardware did (action), and what the product experienced (outcome) using a consistent evidence schema.
1) Protection Domains (what must be protected)
- Input domain (AC/DC/PoE feed): highest energy exposure; dominated by surge/lightning coupling and line disturbances.
- Bus domain (DC link & distribution): vulnerable to UVLO/brownout and repetitive burst disturbances; often where resets begin.
- Load domain (LED strings / wiring): open/short events, hot-plug transients, and reverse energy injection from long cable runs.
- Logic & I/O domain (MCU, aux rails, PHY-side rails): low-energy but fragile; ESD/EFT can create silent leakage or repeated resets.
Domain separation exists to decide where energy is allowed to go and which parts may sacrificially fail (clamps) versus which parts must survive (switching silicon, metering core, and control rails).
2) Evidence Chain (what must be proven)
- Event: surge / ESD / EFT (burst) / lightning-like common-mode coupling.
- Action: clamp, current-limit, disconnect, auto-retry, latch-off (and any thermal foldback behavior).
- Outcome: brownout/reset, hard failure, degraded brightness, abnormal heating, communication errors, parameter drift.
A useful system makes the action observable and the outcome attributable. That requires both a fast protection path (trip) and a persistent proof path (counters + logs + timestamps).
3) Deliverables from This Page (what the reader can reuse)
- Reference architecture: clamp → limit/disconnect → sense → event log → telemetry export.
- Protection action taxonomy: how “limit vs latch vs retry” decisions map to survivability and serviceability.
- Minimum evidence schema: which fields must exist so a transient can be reconstructed after the fact.
- Verification evidence points: the two most valuable waveforms and the log alignment rules.
H2-2. Threat Model for Luminaires (Surge/ESD/EFT/Lightning vs Real Symptoms)
Standards terminology is only useful when it predicts field behavior. The practical model is symptom → likely threat class → first evidence points. The same symptom can be produced by different threats, and the same threat can produce both catastrophic failures and “still works but degraded” outcomes—so evidence must be designed in.
1) Symptom groups (start from what is observable)
- Reset / reboot: repeated restarts, flickering on/off, intermittent recovery after power cycle.
- Hard fail: no light output, no response, or protective shutdown that never recovers.
- Degrade: still turns on, but runs hotter, dimmer, or derates earlier than before.
- Comms errors: sporadic CRC/errors, missing acknowledgements, unstable auxiliary interfaces.
- Drift: thresholds shift, current accuracy changes, nuisance trips increase over time.
“Degrade” and “drift” are the costly failures: the product appears functional but silently loses margin. Without metering + event logs, these cases are easily misattributed to general design quality rather than transient exposure.
2) Threat classes (translate into damage mechanism + signature)
- Surge: high-energy line transient → clamp stress, silicon overstress, latch-up, and parameter shift.
  Typical signature: clamp activation + bus voltage dip + current-limit/disconnect event.
- ESD: localized discharge at I/O/connector → latent leakage or intermittent interface weakness.
  Typical signature: I/O fault flags + rising comms error count + unexplained standby current.
- EFT/Burst: repetitive fast transients → repeated UVLO/brownout, reset storms, nuisance trips.
  Typical signature: reset_reason + brownout counter + short-duration undervoltage events.
- Lightning-like common-mode coupling: large CM energy → ground reference shift and multi-rail anomalies.
  Typical signature: simultaneous anomalies across domains + elevated event density around the incident time window.
3) Evidence-first triage (the “two waveforms” rule)
- Waveform #1: input/bus voltage after the clamp stage (captures residual stress and brownout depth).
- Waveform #2: eFuse/high-side switch current limit or disconnect timing (captures the protective action).
If only one log can be kept, preserve: event_type, peak (V/I), duration, trip_action, rail_id, temp_bin, and a monotonic time base. These fields allow reconstruction even without absolute wall-clock time.
H2-3. Protection Chain Topology (Clamp → Limit → Disconnect → Recover)
A reliable luminaire protection design is not “adding more parts.” It is a repeatable 4-stage action chain: Clamp controls peak stress, Limit controls energy, Disconnect defines the survival boundary, and Recover defines serviceability. Each stage must leave a traceable log signature.
1) The 4 stages and what they guarantee
- Clamp (TVS/MOV/GDT): reduces voltage spikes to a defined ceiling; may be sacrificial under extreme energy.
- Limit (eFuse/HSS): enforces a programmable current/energy envelope; prevents “long-duration stress.”
- Disconnect (latch-off / ORing / reverse block): isolates the downstream rail when the envelope is exceeded.
- Recover (auto-retry vs latched + service): defines whether the product self-heals or requires intervention.
Practical rule: Clamp decides peak, Limit decides energy, Disconnect decides survival, and Recover decides user experience. If any stage is missing, its job falls to whatever downstream silicon happens to be weakest, and the failure becomes unpredictable.
2) Sacrificial vs recoverable: when each is required
- Sacrificial-first (clamps accept damage): when external energy is unbounded or line coupling is harsh (outdoor, long cable runs).
- Recoverable-first (eFuse/HSS self-manages): when downtime/false trips are costly (commercial, networked nodes, frequent switching).
- Hybrid is common: clamp absorbs peak; eFuse limits energy and sets the action policy.
The design objective is not “never fail.” It is controlled failure or controlled recovery, with evidence to prove which occurred.
3) Coupling to metering/logging: actions must leave evidence
- Counters (count): how often protection activates (event density).
- Peak & duration (peak, duration): how severe each event was (stress intensity).
- Reason & action (trip_reason, trip_action): why the stage engaged and what it did (limit vs disconnect vs retry).
Minimum log fields that make the chain diagnosable: event_type, trip_reason, trip_action, peak_v/peak_i, duration, count, rail_id, temp_bin, plus a monotonic time base to correlate repeated bursts.
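As a rough illustration of how the peak and duration fields might be derived from a sampled trace, the sketch below assumes uniformly spaced samples and a single fixed threshold. The function name and parameters are hypothetical and not taken from any particular eFuse or metering part.

```python
from typing import Optional, Sequence, Tuple

def peak_and_duration(samples: Sequence[float], threshold: float,
                      sample_period_s: float) -> Optional[Tuple[float, float]]:
    """Return (peak, duration_s) of the first excursion above threshold, or None."""
    peak = None
    count = 0
    for x in samples:
        if x > threshold:
            # Inside the excursion: track the maximum and count samples.
            peak = x if peak is None else max(peak, x)
            count += 1
        elif peak is not None:
            break  # first excursion has ended; later excursions are ignored
    if peak is None:
        return None
    return peak, count * sample_period_s
```

Only the first excursion is reported here; a real logger would likely emit one record per excursion and increment count for each.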
H2-4. eFuse / High-Side Switch Deep Dive (SOA, Inrush, Fault Modes)
An eFuse/high-side switch is a programmable power risk executor: it shapes inrush, enforces a current/energy envelope, and exposes state and reason codes so protection becomes measurable. In lighting, it effectively upgrades a fuse from “one-time disconnect” into a state machine with evidence.
1) Capabilities that matter in luminaires
- Inrush shaping: soft-start, dv/dt control, hot-plug handling to avoid nuisance trips.
- Fault handling: programmable current limit, short-circuit response, thermal protection, OV/UV gating.
- Energy direction: reverse current block / ideal-diode ORing behavior to prevent back-feed damage.
- Observability: fault pins + telemetry registers (reason, peak, counters) to build an evidence trail.
2) SOA + repetition: why “survived once” can still fail later
- Peak vs energy: a clamp handles peak; the eFuse limits time-integrated stress on the pass FET.
- Thermal accumulation: dense auto-retry cycles can create heat oscillation (cool → retry → re-trip).
- Field diagnosis requirement: log retry_count_window, duration, and temp_bin to correlate failures with event density and recovery time.
Mechanism-first takeaway: repeated disturbances turn into failure when the protection policy allows too much “on-time under stress” before a full disconnect, or retries are too aggressive for the thermal time constant.
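The thermal accumulation argument can be made concrete with a deliberately simplified first-order model. It assumes each trip adds a fixed temperature rise and that the excess decays exponentially with a single time constant between retries. All names and numbers are illustrative assumptions, not device data.

```python
import math

def peak_rise_after_retries(rise_per_trip_k: float, retry_interval_s: float,
                            tau_s: float, n_retries: int) -> float:
    """Peak temperature rise above ambient after n trips with exponential cooling."""
    excess = 0.0
    peak = 0.0
    for _ in range(n_retries):
        excess += rise_per_trip_k                        # heating during on-time under stress
        peak = max(peak, excess)
        excess *= math.exp(-retry_interval_s / tau_s)    # cooling before the next retry
    return peak
```

With a retry interval much longer than tau_s, the peak stays near a single trip's rise. With an interval much shorter than tau_s, the rises stack almost linearly, which is the "retries too aggressive for the thermal time constant" case.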
3) Failure modes (and what the evidence should look like)
- Nuisance trips: thresholds too tight or sensing noise → elevated count with short duration, weak peaks.
- Too-slow disconnect: limit response not fast enough → high peak_i and long duration before trip_action.
- Clamp dies first: clamp absorbs energy because limit is ineffective → rising clamp_count and higher residual peak_v.
- Reverse energy injection: missing reverse block/ORing → abnormal reverse events and downstream rail anomalies.
These signatures map directly to state transitions in the eFuse state machine below; logs should capture reason, action, peak, and duration per transition.
4) Interfaces to MCU / metering SoC (make trips measurable)
- Hardware pins: FAULT / PG / EN / IMON (align waveforms with log commits).
- Digital telemetry: I²C/PMBus registers for reason codes, counters, temperature, and configuration.
- Sampling sync: trip edge triggers “pre/post” capture and a crash-safe log commit.
H2-5. Current/Voltage Sensing for Protection & Metering (Shunt/Hall + Sampling Strategy)
Protection needs peak + duration to justify a trip action; metering needs energy + trend to explain degradation over time. A robust luminaire therefore uses a dual-path sensing strategy: a fast path for threshold decisions and a slow path for RMS/average/integration—synchronized with a pre/post event window.
1) Current sensing: shunt vs Hall (what changes the evidence quality)
- Shunt (resistor + amplifier): high bandwidth and linearity, ideal for capturing short over-current peaks. Key trade-offs are loss/heat and common-mode robustness during surges.
- Hall (magnetic sensing): galvanic isolation and low insertion loss, useful when the measurement domain must be isolated. Trade-offs include offset/temperature drift and limited peak fidelity for very fast events.
- Evidence-first selection: protection evidence prioritizes no-saturation peak capture; metering evidence prioritizes stable, calibratable drift behavior.
If the sensor saturates during the disturbance, peak is understated and root-cause becomes guesswork.
Sensor robustness during common-mode excursions is therefore part of the protection chain, not an afterthought.
2) Voltage sensing: divider vs isolated sampling (avoid injecting the transient)
- Divider sampling: simplest and fast, but must be protected so surge energy is not injected into the logic domain.
- Isolated sampling: separates hazardous domain from measurement domain; often safer for evidence retention but may trade bandwidth and cost.
- Common-mode reality: surge/lightning-like events can shift reference potentials; the measurement chain must remain functional to preserve bus_min, V_valley, and event timing.
Metering is only valuable if it survives the incident. Measurement-domain protection is a prerequisite for a credible evidence chain.
3) Sampling strategy: one sensor, two paths, one timeline
- Fast protect channel: comparator/window thresholding to capture peak and duration, generate the trip edge, and label trip_reason.
- Slow metering channel: ADC sampling for RMS/average/integration to produce Wh, Ah, runtime_h, and drift metrics.
- Synchronization: the trip edge freezes a pre/post buffer so causality can be proven (e.g., “bus dipped first” vs “limit engaged first”).
Minimal synchronization requirement: the trip edge triggers (1) a log commit and (2) a pre/post snapshot capture. Without both, the system can say “a trip happened” but cannot explain the timeline.
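One way to picture the pre/post snapshot is a small ring buffer that freezes at the trip edge. The sketch below is a host-side model under stated assumptions: the class name, the on_trip hook (standing in for "commit the log record now"), and the convention that the trip sample is the first post sample are all hypothetical.

```python
from collections import deque
from typing import Callable, Optional

class PrePostCapture:
    """Keeps the last `pre` samples and freezes them plus `post` samples at a trip edge."""

    def __init__(self, pre: int, post: int,
                 on_trip: Optional[Callable[[], None]] = None):
        if post < 1:
            raise ValueError("post must be >= 1 (the trip sample is the first post sample)")
        self._pre = deque(maxlen=pre)
        self._post_len = post
        self._post = None          # becomes a list once the trip edge is seen
        self._on_trip = on_trip    # hypothetical hook, e.g. issue the log commit
        self.snapshot = None       # filled once the post window is complete

    def push(self, sample: float, trip_edge: bool = False) -> None:
        if self.snapshot is not None:
            return                 # frozen until the snapshot is read out
        if self._post is None:
            if not trip_edge:
                self._pre.append(sample)
                return
            self._post = []
            if self._on_trip:
                self._on_trip()    # log commit and snapshot share the same edge
        self._post.append(sample)
        if len(self._post) == self._post_len:
            self.snapshot = {"pre": list(self._pre), "post": self._post}
```

Because the commit hook and the buffer freeze are driven by the same edge, the resulting record and snapshot share one timeline by construction.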
H2-6. Metering SoC Architecture (Energy, Runtime, Health Indicators)
A metering SoC becomes valuable when it converts raw V/I/P samples into accounts the luminaire can defend: energy and runtime for usage, counters and valleys for disturbance exposure, and drift metrics for health. The output must be structured as a dashboard plus an event trail.
1) Core accounts (the minimum set worth keeping)
- Instant/average: V/I/P (instant + average) for trend baselining and anomaly detection.
- Energy: Wh (and optionally Ah) to quantify usage and support lifecycle accounting.
- Runtime: runtime_h, switch_count as exposure proxies for stress and service cycles.
- Temperature linkage: temp_bins for lifetime inference and for explaining derating/degradation.
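The energy and runtime accounts above reduce to simple integration over the slow metering samples. The following is a minimal sketch assuming a DC bus (so instantaneous power is V times I), rectangular integration, and a fixed sample interval. The class and its update method are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class EnergyAccounts:
    wh: float = 0.0
    ah: float = 0.0
    runtime_h: float = 0.0
    switch_count: int = 0
    _was_on: bool = False

    def update(self, v: float, i: float, dt_s: float, on: bool) -> None:
        """Accumulate one slow-path sample covering dt_s seconds."""
        if on:
            hours = dt_s / 3600.0
            self.wh += v * i * hours      # energy: integral of V*I over time
            self.ah += i * hours          # charge: integral of I over time
            self.runtime_h += hours
        if on and not self._was_on:
            self.switch_count += 1        # count off-to-on transitions only
        self._was_on = on
```

For example, one hour of 1 s samples at 48 V and 0.5 A accumulates roughly 24 Wh, 0.5 Ah, and 1 h of runtime.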
2) Health indicators (turn incidents into diagnosable signals)
- Protection density: protect_trip_count_window, event_density.
- Power integrity: brownout_count, bus_min_hist (valley distribution, not only min).
- Performance drift: P_avg_drift / I_error_trend to detect aging, contamination, or thermal path degradation.
- Policy quality: retry_count_window, latch_count to expose nuisance trips vs true hazards.
These indicators answer three field questions with evidence: “How often?”, “How severe?”, and “Is it getting worse?”
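"How often?" is essentially a lifetime counter plus a sliding-window counter. The sketch below models that pair on a monotonic time base. It stores timestamps in memory, which is an assumption suited to a host-side model; an embedded implementation would more likely use fixed-size time buckets. All names are illustrative.

```python
from collections import deque

class WindowCounter:
    """Lifetime count plus a count of events within the last window_s seconds."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self.lifetime = 0
        self._times = deque()

    def _expire(self, now_s: float) -> None:
        # Drop events that have aged out of the window.
        while self._times and now_s - self._times[0] > self.window_s:
            self._times.popleft()

    def record(self, t_s: float) -> None:
        self.lifetime += 1
        self._times.append(t_s)
        self._expire(t_s)

    def count_in_window(self, now_s: float) -> int:
        self._expire(now_s)
        return len(self._times)

    def density_per_hour(self, now_s: float) -> float:
        return self.count_in_window(now_s) * 3600.0 / self.window_s
```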
3) Data flow and retention: periodic snapshots vs event-driven commits
- Periodic snapshot: low-rate dashboard updates for energy, runtime, temperature bins, and drift baselines.
- Event-driven commit: on trip/reset, record compact incident facts (reason/action/peak/duration/counters).
- Wear-aware storage: store summaries and rolling histograms; keep only the most recent N detailed incidents.
Evidence design goal: stable long-term trends plus a short incident trail that survives resets and power loss.
4) Interfaces (export the dashboard, not raw noise)
- I²C / PMBus / UART: read the dashboard blocks (energy/runtime/health/counters) and fetch the latest incidents.
- Two-level API: summary view for service and monitoring; detail view for forensic reconstruction.
H2-7. Event Logging That Survives Reality (Timestamp, NVM, Wear, Integrity)
A luminaire log is only useful if it survives power loss, can be recovered after reboot, and remains verifiable after long-term wear. The recommended approach is an append-only ring log with a two-step commit marker and per-record integrity checks.
1) Minimal event schema (strongly recommended as a hard requirement)
- Identity: event_id, event_type (surge/esd/oc/ot/uv/ov).
- Severity: peak_value (V/I), duration.
- Action: trip_action (limit/latch/retry), plus trip_reason if available.
- Localization: rail_id (which domain), temperature_bin.
- Density: counters (lifetime + recent window counts).
- Timeline: time_base (RTC or monotonic counter for ordering).
The schema is designed to answer field questions with evidence: what happened, how severe, what the protection did, where it occurred, and whether events are becoming more frequent.
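To show how compact such a record can be, here is one possible fixed-size binary layout with a trailing CRC-32. The field widths, the enum-as-integer encoding, and the split of counters into window and lifetime values are assumptions made for illustration; the schema above does not prescribe them.

```python
import struct
import zlib
from dataclasses import dataclass, astuple

# Little-endian, no padding: 30-byte body (assumed widths), followed by a 4-byte CRC-32.
_FMT = "<IBBBBfIHIQ"

@dataclass
class EventRecord:
    event_id: int          # u32
    event_type: int        # u8 enum (surge/esd/oc/ot/uv/ov)
    trip_action: int       # u8 enum (limit/latch/retry)
    rail_id: int           # u8
    temperature_bin: int   # u8
    peak_value: float      # f32, volts or amps depending on event_type
    duration_us: int       # u32
    count_window: int      # u16
    count_lifetime: int    # u32
    time_base: int         # u64 monotonic ticks

    def pack(self) -> bytes:
        body = struct.pack(_FMT, *astuple(self))
        return body + struct.pack("<I", zlib.crc32(body))

    @classmethod
    def unpack(cls, raw: bytes):
        """Return the record, or None if the CRC does not match."""
        body, (crc,) = raw[:-4], struct.unpack("<I", raw[-4:])
        if zlib.crc32(body) != crc:
            return None
        return cls(*struct.unpack(_FMT, body))
```

At 34 bytes per record, a small FRAM or EEPROM can hold hundreds of incidents before the ring wraps.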
2) NVM choice: match the medium to the write pattern
- FRAM: best for frequent small writes (counters, compact event records), strong power-loss tolerance.
- EEPROM: suitable for moderate logging rates; control update frequency to avoid hot-spot wear.
- Flash: large capacity but erase/write granularity demands append-only design and careful recovery rules.
Reliability comes from the log structure first; the storage medium determines endurance and recovery margins.
3) Crash-safe write rule: payload first, commit last
- Stage A (payload): write the full record body and CRC, but do not “publish” it yet.
- Stage B (commit): write a small commit marker that turns the record valid.
- On reboot: scan backward for the last valid commit marker to rebuild head/tail.
Avoid “index-first” updates. If power fails mid-write, the commit marker cleanly separates valid vs incomplete records.
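The payload-first, commit-last rule can be modelled on a bytearray standing in for NVM. This sketch makes several simplifying assumptions: fixed-size slots, a byte-writable medium (FRAM-like, no erase blocks), a sequence number that never wraps, and a recovery pass that scans every slot for the highest valid sequence rather than scanning backward. The names and the fail_before_commit flag exist only to simulate a power loss.

```python
import struct
import zlib

SLOT_PAYLOAD = 16
SLOT_SIZE = 4 + SLOT_PAYLOAD + 4 + 1    # seq + payload + crc32 + commit marker
COMMIT, ERASED = 0xA5, 0xFF

class RingLog:
    def __init__(self, nvm: bytearray):
        assert len(nvm) % SLOT_SIZE == 0
        self.nvm = nvm
        self.slots = len(nvm) // SLOT_SIZE
        self.next_seq, self.head = self._recover()

    def _read(self, slot: int):
        """Return (seq, payload) if the slot is committed and its CRC is valid."""
        base = slot * SLOT_SIZE
        raw = bytes(self.nvm[base:base + SLOT_SIZE])
        body, crc, commit = raw[:-5], raw[-5:-1], raw[-1]
        if commit != COMMIT or struct.pack("<I", zlib.crc32(body)) != crc:
            return None
        return struct.unpack("<I", body[:4])[0], body[4:]

    def _recover(self):
        best = None
        for slot in range(self.slots):
            rec = self._read(slot)
            if rec and (best is None or rec[0] > best[0]):
                best = (rec[0], slot)
        if best is None:
            return 0, 0
        return best[0] + 1, (best[1] + 1) % self.slots

    def append(self, payload: bytes, fail_before_commit: bool = False) -> None:
        assert len(payload) == SLOT_PAYLOAD
        base = self.head * SLOT_SIZE
        self.nvm[base + SLOT_SIZE - 1] = ERASED          # unpublish the old record
        body = struct.pack("<I", self.next_seq) + payload
        self.nvm[base:base + 20] = body                  # Stage A: payload
        self.nvm[base + 20:base + 24] = struct.pack("<I", zlib.crc32(body))
        if fail_before_commit:
            return                                       # simulated power loss
        self.nvm[base + SLOT_SIZE - 1] = COMMIT          # Stage B: publish
        self.next_seq += 1
        self.head = (self.head + 1) % self.slots

    def records(self):
        recs = [r for r in (self._read(s) for s in range(self.slots)) if r]
        return [payload for _, payload in sorted(recs)]
```

Constructing a new RingLog over the same bytearray after an interrupted append plays the role of a reboot: the torn record is ignored and the head lands on the slot it occupied.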
4) Integrity: make logs verifiable, not just present
- CRC per record: detects torn writes and bit flips; invalid CRC means “do not trust.”
- Optional signature/MAC concept: if tamper resistance is required, authenticate the record payload. (Keep the implementation details outside this page’s scope.)
A verifiable log supports warranty disputes and forensic analysis without relying on assumptions.
H2-8. Surge/ESD/Lightning Evidence Playbook (What to Capture, Where to Probe)
Field debugging succeeds when evidence collection is disciplined: capture the smallest set of signals that can prove causality. The default triage is two waveforms (input voltage and protection current/limit), with an optional third (MCU rail/reset) when the failure looks like a system reboot rather than a pure power event.
1) Probe points that answer “what happened” vs “what reacted”
- Input & clamp: before/after clamp to quantify residual stress (supports peak_v evidence).
- eFuse/HSS VIN/VOUT: confirm limit/disconnect behavior (supports trip_action).
- Current sense / IMON: observe current limiting dynamics and event duration (supports peak_i, duration).
- MCU rail / reset: attribute resets to power integrity vs unrelated logic (supports brownout_count correlation).
2) Two-waveform triage (default SOP)
- Waveform #1: V_in and/or V_after_clamp (residual voltage and valleys).
- Waveform #2: I_limit or IMON at the protection stage (action timing and severity).
- Optional #3: MCU_VDD or RESET (prove reboot causality).
The goal is not to “measure everything” but to prove the sequence: disturbance → protection action → system consequence.
3) Trigger alignment (pre/post window): prove causality, not correlation
- Trigger source: use the protection trip edge (FAULT/PG change) as the primary alignment anchor.
- Pre window: capture the disturbance arrival and clamp response.
- Post window: capture limit/disconnect behavior and any brownout/reset aftermath.
Alignment matters more than raw sample rate. Without a shared time base, “before/after” cannot be proven.
4) Use the log to prove exposure (did a surge event actually happen?)
- Type + severity: event_type + high peak_v/peak_i + meaningful duration supports real exposure.
- Density: rising counters (especially in a time window) indicates a harsh electrical environment.
- Power integrity trend: changes in bus_min_hist or increased brownout_count show weakening margins over time.
H2-9. Coordination Rules (TVS/MOV/GDT vs eFuse vs Downstream Loads)
Protection parts do not add up automatically. Coordination is about energy going to the right path, actions happening in the right order, and sensitive ICs never becoming the transient “energy sink.” A coordinated chain uses a clamp stage to shave peaks, a limit stage to control energy injection, and a disconnect/retry policy to prevent thermal accumulation and nuisance trips.
1) Coordination objectives (what “good” looks like)
- Route energy correctly: large transient energy is handled by the clamp path, not by metering/MCU rails.
- Protect expensive silicon: the most fragile domains (metering SoC, ADC front-end, MCU) must see reduced residual stress.
- Maintain system continuity: short disturbances should not become repeated resets or “false faults.”
2) Common failure patterns (why “more parts” can die faster)
- TVS placed too late: the transient hits switches or metering ICs first; the clamp acts after damage occurs.
- eFuse limiting too slow: MOV absorbs repeated energy, overheats, and drifts or fails after multiple events.
- Thresholds too tight: EFT-like bursts trigger nuisance trips, creating repeated retry/reset storms.
When coordination fails, the log typically shows either “no action during high peak” or “too many actions during low severity.”
3) Layered threshold strategy (relative order matters more than absolute numbers)
- Clamp threshold: earliest action to reduce peak stress (V_after_clamp should be bounded).
- Limit threshold: engages after clamp to control energy injection (I_limit timing defines severity control).
- Disconnect threshold: triggers only when abnormal conditions persist (prevents thermal accumulation).
- Retry policy: distinguishes “short disturbance” vs “persistent fault” using duration and count windows.
Coordination is proven by sequence: peak arrives → clamp reduces residual → limit shapes current → disconnect/retry only if needed.
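The retry policy's "short disturbance vs persistent fault" decision can be sketched as a small pure function over duration and the window retry count. The gate, cap, and exponential backoff values below are placeholders chosen for illustration, not recommended settings.

```python
from typing import Tuple

def next_action(duration_ms: float, retry_count_window: int,
                short_gate_ms: float = 1.0, retry_cap: int = 3,
                base_backoff_s: float = 0.5) -> Tuple[str, float]:
    """Return (trip_action, backoff_s) for an over-limit event."""
    if duration_ms < short_gate_ms:
        return ("limit", 0.0)     # short disturbance: ride through under current limit
    if retry_count_window < retry_cap:
        # Persistent but not yet repeated: disconnect, then retry with growing backoff.
        return ("retry", base_backoff_s * 2 ** retry_count_window)
    return ("latch", 0.0)         # repeated persistent fault: stop retrying
```

Growing the backoff with each retry in the window gives the pass element time to cool, which addresses the thermal accumulation concern directly.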
4) Evidence-based checks (use logs + two waveforms to validate coordination)
- High peak without action: high peak_v / residual voltage but no trip_action indicates missing coupling between sensing and protection timing.
- Action that causes collapse: trip_action=limit with worsening bus_min_hist indicates limit settings or downstream hold-up margin mismatch.
- Nuisance storm signature: very short duration + high retry_count_window suggests EFT-like bursts being interpreted as real faults.
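These three checks are simple enough to express as rules over exported log fields. The sketch below assumes the valley histogram has already been reduced to a single bus_min_delta_v (change of a representative valley versus a baseline, negative meaning deeper valleys). That reduction, the thresholds, and the finding labels are all hypothetical.

```python
from typing import List, Optional

def coordination_findings(peak_v: float, v_residual_max: float,
                          trip_action: Optional[str], duration_ms: float,
                          retry_count_window: int, bus_min_delta_v: float,
                          short_ms: float = 1.0, storm_retries: int = 5,
                          valley_margin_v: float = 1.0) -> List[str]:
    """Flag coordination signatures from one event record plus trend data."""
    findings = []
    if peak_v > v_residual_max and trip_action is None:
        findings.append("high_peak_without_action")
    if trip_action == "limit" and bus_min_delta_v < -valley_margin_v:
        findings.append("action_causes_collapse")
    if duration_ms < short_ms and retry_count_window >= storm_retries:
        findings.append("nuisance_storm")
    return findings
```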
H2-10. Telemetry & Maintenance (From Logs to Action)
Metering and logging are only valuable when they drive decisions. The recommended model is a closed loop: events + snapshots are reduced into KPI signals, then mapped to actions for field maintenance, warranty handling, and predictive service—without requiring any specific cloud platform assumptions.
1) Device-side telemetry outputs (what to export)
- Periodic snapshot: Wh, runtime_h, temp bins, drift summaries, valley histogram summaries.
- Event report: event_type, peak, duration, trip_action, rail_id, time_base, window counters.
- Rate control: during event storms, report counts/summary first and limit detailed records to the latest N.
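The rate-control rule ("summary first, latest N in detail") might look like the following on the reporting side. The report shape and the function name are assumptions; events are taken to be dicts ordered oldest to newest.

```python
from collections import Counter
from typing import Dict, List

def build_report(events: List[Dict], max_detail: int = 3) -> Dict:
    """Summarize all events by type and attach only the newest max_detail records."""
    summary = dict(Counter(e["event_type"] for e in events))
    # Guard max_detail == 0: a slice of events[-0:] would return the whole list.
    detail = events[-max_detail:] if max_detail > 0 else []
    return {"total": len(events), "summary": summary, "detail": detail}
```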
2) Local maintenance workflow (field-ready sequence)
- Read dashboard first: energy/runtime, trip density, brownout exposure, drift indicators.
- Read latest N events: confirm type/severity/action and timeline ordering.
- Capture 2-wave evidence: V_in (or V_after_clamp) + I_limit, aligned to the trip edge.
The fastest root-cause path is: dashboard anomaly → event pattern → two waveforms to prove sequence.
3) Reset/clear policy (protect evidence value)
- Who can clear: restrict to authenticated service access or physical presence.
- What to clear: clear window counters and health score; keep lifetime counters for history and warranty evidence.
- When to clear: clear only after a maintenance action (replace clamp parts, fix grounding, clean thermal path) to create a new epoch.
Clearing is not housekeeping; it defines a before/after comparison window for maintenance effectiveness.
4) Decision mapping (KPI → recommended action)
- High surge density: rising event density → inspect grounding/wiring, verify external surge protection condition.
- Power drift + hotter bins: increasing drift with hotter bins → check thermal path, contamination, aging indicators.
- Many brownouts/valleys: increased brownout exposure → point to input quality or hold-up margin issues (without expanding into PFC design).
H2-11. Compliance-Ready Artifacts (What to Document for EMC/Surge Readiness)
This section is a pre-test evidence pack: it does not teach standards or lab procedures. It defines the minimal artifacts that make surge/ESD readiness auditable and test failures quickly reproducible: (1) protection-chain BOM evidence, (2) threshold & policy table, and (3) event-log field guide with sample exports, plus a fail-package for fast root-cause replay.
1) Protection-chain BOM evidence (category + placement + rating + MPN examples)
Provide a BOM-level list that allows a reviewer to validate energy path and exposure boundaries without guessing. Include: category, location, nominal rating fields, and example MPNs (verify ratings against the target design).
| Stage / Category | Where it sits (example) | What to document (rating fields) | Example MPNs (verify suitability) |
|---|---|---|---|
| TVS diode (fast clamp) | After connector, before sensitive rails (P2) | VRWM / VCL / peak pulse power, package, polarity | Vishay SMBJ58A, SMCJ58A; Littelfuse SMBJ33A |
| MOV (energy absorber) | Across input (line-to-line / line-to-PE), “energy path” | VAC rating, surge current, energy rating, thermal class | EPCOS/TDK S14K275; Bourns MOV-14D471K |
| GDT (high-energy shunt) | Line-to-PE (or shield/earth) for large common-mode events | DC sparkover, impulse sparkover, surge rating, insulation | Bourns 2038-09-SM; Littelfuse CG3-230L |
| eFuse (programmable limit) | After clamp, before downstream conversion/loads (P3/P4) | ILIM range, t_blank/debounce, SOA, thermal shutdown behavior | TI TPS25947, TPS25940A; ADI/LTC LTC4368 (surge stopper class) |
| High-side switch (protected load switch) | Domain isolation (LED load vs control rail), per-rail (rail_id) | RDS(on), current limit, fault flag timing, reverse current handling | Infineon PROFET™ examples: BTS7002-1EPP, BTS50010-1TAD |
| Current sense (shunt) | Near eFuse output or LED load return (P5/IMON) | mΩ value, TCR, power rating, Kelvin routing note | Vishay WSL2512 series; Isabellenhütte ISA-WELD (example family) |
| Isolated current sense (optional) | When sense domain must stay isolated from surge reference | CM range, isolation rating, bandwidth/latency, output type | TI AMC1301, AMC3301; ADI ADuM3190 (isolation/control class) |
| NVM for logs | Local ring log storage (H2-7) | Endurance, write granularity, power-loss behavior | Fujitsu MB85RS64V (FRAM); Microchip 24LC256 (EEPROM); Winbond W25Q64JV (SPI Flash) |
The MPN list is provided as concrete examples for documentation completeness. Always confirm electrical ratings, creepage/clearance, and safety constraints against the specific luminaire architecture.
2) Threshold & policy table (limit / latch / retry)
Before any lab run, produce a single-page “behavior contract” that maps conditions to detection and action. Keep it device-side and design-focused (no platform assumptions).
| Condition | Detection basis | Action | Timing / counters | Evidence fields |
|---|---|---|---|---|
| OC / short | IMON exceeds ILIM; duration integration | Limit → disconnect (if persistent) | t_blank, t_limit_max, retry backoff, retry cap | event_type, peak_i, duration, trip_action, counters |
| OV / surge-like | V_after_clamp peak + window count | Clamp first; limit injection; optional latch if repeated | count_window thresholds, escalation policy | peak_v, duration, trip_action, rail_id, temperature_bin |
| UV / brownout | Bus valley histogram; PG/reset correlation | Graceful retry; avoid reset storm | debounce, min-off time, window caps | bus_min_hist, brownout_count, time_base |
| EFT nuisance | Short repetitive bursts; low severity but high frequency | Reject/ignore via debounce + caps; avoid false faults | short duration gate, retry_count_window threshold | duration (short), retry_count_window (high) |
3) Event-log field guide + sample export
Provide a field dictionary and a small sample export that shows how events are interpreted. The goal is “explainable evidence” (what happened, how severe, what the system did, where it occurred, and when).
- Field dictionary: field → meaning → unit → source (sensor/logical) → notes
- Sample export: at least 3 entries (surge-like, OC, brownout) with valid CRC/commit markers
- Time base: if RTC is unreliable, keep a monotonic counter for ordering
Sample export (illustrative)
CSV-like rows (device-side)
event_id,event_type,peak_v,peak_i,duration_ms,trip_action,rail_id,temp_bin,time_base,counters,crc_ok,commit_ok
10421,surge,860V,2.1A,0.12,limit,INPUT,25-50C,mono:8891231,win=3;life=57,1,1
10422,oc,52V,6.8A,4.5,disconnect,LED_BUS,50-75C,mono:8891450,win=2;life=12,1,1
10423,brownout,48V,1.0A,18.0,retry,CTRL_RAIL,25-50C,mono:8892102,win=5;life=90,1,1
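A service tool could parse this export with a few lines of standard-library code. The sketch below takes the illustrative rows above as its input format; the unit suffixes (V, A), the mono: prefix, and the win=…;life=… counter encoding are read from that sample, and the function name is hypothetical.

```python
import csv
import io

SAMPLE = """\
event_id,event_type,peak_v,peak_i,duration_ms,trip_action,rail_id,temp_bin,time_base,counters,crc_ok,commit_ok
10421,surge,860V,2.1A,0.12,limit,INPUT,25-50C,mono:8891231,win=3;life=57,1,1
10422,oc,52V,6.8A,4.5,disconnect,LED_BUS,50-75C,mono:8891450,win=2;life=12,1,1
10423,brownout,48V,1.0A,18.0,retry,CTRL_RAIL,25-50C,mono:8892102,win=5;life=90,1,1
"""

def parse_export(text: str) -> list:
    """Parse export rows, skipping any row whose CRC or commit flag is not set."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["crc_ok"] != "1" or row["commit_ok"] != "1":
            continue  # untrusted record: do not use as evidence
        counters = dict(kv.split("=") for kv in row["counters"].split(";"))
        rows.append({
            "event_id": int(row["event_id"]),
            "event_type": row["event_type"],
            "peak_v": float(row["peak_v"].rstrip("V")),
            "peak_i": float(row["peak_i"].rstrip("A")),
            "duration_ms": float(row["duration_ms"]),
            "trip_action": row["trip_action"],
            "rail_id": row["rail_id"],
            "temp_bin": row["temp_bin"],
            "time_base": int(row["time_base"].split(":", 1)[1]),
            "win": int(counters["win"]),
            "life": int(counters["life"]),
        })
    return rows
```

Sorting the parsed rows by time_base then gives the ordering needed to line events up against captured waveforms.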
4) Minimal fail-package (what to bring back when a test fails)
A failure is only actionable when it can be replayed as a short evidence bundle. Keep the package minimal and consistent:
- Waveforms (2 required, 1 optional): V_in (or V_after_clamp) + I_limit/IMON; optional MCU_VDD/RESET.
- Log export: last N events around the failure point (include time_base, trip_action, window counters, CRC/commit validity).
- Conditions sheet: input setup, grounding mode, temperature range (temp_bin), load mode/power level.
This bundle is intentionally independent of any cloud platform. It is designed to minimize re-test cycles and guesswork.
H2-12. FAQs ×12 (Evidence-Routed, No Scope Creep)
Each answer routes the symptom to the first evidence to capture (waveforms + log fields) and the chapter to use. The intent is long-tail coverage without introducing new design scope.
1. Adding a TVS makes reboots more frequent: clamp placement or eFuse thresholds?
Capture V_after_clamp and I_limit/IMON around the reboot edge, then check trip_action, duration, and retry_count_window to confirm whether the system is clamping late or tripping too aggressively.
2. After lightning it still lights but looks dimmer: what “degradation signal” should logs show?
event_density, hotter temperature_bin, and long-term P_avg_drift (or equivalent).
3. EFT testing keeps triggering protection: threshold too tight or sensing-path noise?
Capture V_after_clamp and I_limit/IMON with a short pre/post window; then inspect duration and retry_count_window. If duration is tiny yet retries surge, thresholds/debounce are likely too tight; if IMON is noisy, the sensing path needs filtering or placement correction.
4. After a surge, occasional lockups occur with no hard damage: how to align waveforms and logs using an “event window”?
Capture V_in/V_after_clamp and I_limit/IMON, then match the captured edge to time_base and the nearest event_id. A consistent alignment shows whether the lockup follows a brownout valley, a retry storm, or a single long-duration limit.
5. eFuse auto-retry keeps heating the system: should it switch to latch-off, and what evidence proves it?
retry_count_window, repeated trip_action=retry without recovery, and temperature bins trending upward during retries. Capture I_limit waveform to see whether each retry hits the same limit plateau and duration. If the fault persists across retries and thermal stress accumulates, a latch-off escalation policy is safer than endless retry.
6. Same batch, some sites show extreme surge counts: check grounding first or input wiring first?
event_density and peak_v distributions across sites, then validate with two probes: V_in/V_after_clamp and the clamp path current (or IMON proxy). If one site has consistently higher residual voltage after clamp, input wiring/SPD placement is suspect; if counts spike with ground reference shifts, grounding is suspect.
7. High-side switch thermal trips frequently: real short-circuit or excessive inrush?
Capture I_limit/IMON during power-up and during the thermal trip. Check duration, trip_action, and whether events cluster at enable edges. If trips correlate with turn-on edges and short windows, tune soft-start/inrush control; if trips occur during steady run, suspect load fault or wiring.
8. Energy (Wh) metering is inaccurate: sampling bandwidth limits or calibration/temperature drift?
temperature_bin, calibration version, and min/max to detect drift signatures.
9. Logs sometimes disappear: how to survive power-loss writes without corrupting the index?
crc_ok, commit_ok, and a monotonic time_base. Validate by forced power interruption tests: the system should recover to a consistent pointer and preserve earlier committed events.
10. Protection triggers but MCU records nothing: timing issue or reset interrupts logging?
Capture MCU_VDD/RESET along with V_after_clamp and I_limit. Then check whether commit_ok fails around that time and whether a brownout counter increments. If reset precedes commit, move logging to an always-powered domain or use a two-phase commit that finalizes quickly before rails collapse.
11. What is the minimal “warranty-grade” event log field set?
event_id, event_type, peak_value (V or I), duration, trip_action, rail_id, temperature_bin, counters (window + lifetime), and time_base (RTC or monotonic). Add integrity flags (crc_ok, commit_ok) so exported logs remain trustworthy after power-loss recovery and field service handling.
12. On-site triage, lightning strike or grid disturbance: how to tell quickly?
peak_v and clamp activity with relatively sparse counts, sometimes followed by degradation trends. Grid disturbance shows frequent bus valleys, rising brownout_count, and resets that align to input dips rather than clamp residual peaks. Capture two waveforms: V_in/V_after_clamp and I_limit, then compare the event density and valley histogram over time. The correct classification drives the next inspection step.