EMC, Safety & Energy Metering Subsystem
Central Thesis: This page distills a reusable, evidence-based EMC + safety + metering subsystem—covering protection, isolation/leakage safety, accuracy under noise, and brownout-safe event logging—so teams can diagnose failures fast and reduce field uncertainty without redesigning each product from scratch.
It focuses on measurable targets, “first two measurements,” and minimum viable logs that turn ESD/EFT/surge, leakage trips, and metering drift/tamper into actionable root-cause paths.
H2-1. Boundary & Architecture: What this subsystem is (and is not)
This subsystem page defines a reusable evidence chain for EMC protection, isolation/leakage safety, energy metering, and event recording. It is designed to be cited by device pages via interface points and acceptance checks—without turning into device-specific architecture.
- Clear scope contract
- Reusable deliverables
- Interface points to cite
- Acceptance-style wording
1.1 What this page delivers (reusable deliverables)
The subsystem is written as a set of portable building blocks. Each block is specified using the same engineering language: interface point → evidence to capture → acceptance check. This keeps the page reusable while preventing scope creep.
EMC protection: Defines protection zoning, clamp placement rules, and selection dimensions (clamp behavior, energy handling, parasitics), plus evidence points to prove “no reset / no damage / no silent corruption”.
Cite by: entry node + residual spike + reset/log check
Isolation and leakage safety: Defines barrier type boundaries, PCB keepout/slot intent, and leakage monitoring discriminators (true leakage vs transient common-mode events), with evidence and acceptance wording.
Cite by: barrier boundary + leakage loop + false-trip discriminator
Energy metering: Defines where to sense, which sensor class fits the constraints, and how accuracy survives EMI/noise and temperature drift. Focuses on drift budget, not just “typical accuracy”.
Cite by: metering tap + noise coupling + drift budget checks
Event recording: Defines minimum viable log fields, brownout-safe write strategy, timestamp trust model, and a forensics workflow: “2 waveforms + 3 counters + 1 log dump” to isolate the root cause fast.
Cite by: log continuity + reset cause + export point
1.2 Where it applies (AC/DC/PoE/SELV entry contexts)
The subsystem is anchored at the “system boundary” where external energy and disturbances enter. It supports AC mains front-ends, low-voltage DC entry (12V/24V), and PoE/SELV interfaces by describing what to clamp, what to filter, what to isolate, what to measure, and what to log—without assuming a specific appliance.
AC mains front-ends
- Primary risks: surge energy, leakage/touch safety, insulation margin stress.
- Subsystem emphasis: entry protection stack + leakage monitoring discriminators + event log continuity.
- Citeable interface points: AC entry clamp node, barrier boundary, fault log export.
Low-voltage DC entry (12V/24V)
- Primary risks: EFT bursts, ground bounce, cable-coupled common-mode spikes.
- Subsystem emphasis: zoning + filtering + evidence-first verification.
- Citeable interface points: I/O protection at connector, sensitive-rail droop evidence, reset/log counters.
PoE/SELV interfaces
- Primary risks: interface ESD/EFT, common-mode noise coupling across PHY barriers.
- Subsystem emphasis: interface clamp + isolation boundary hygiene + metering tap noise immunity.
- Citeable interface points: port-side clamp point, barrier parasitic coupling control, logging of link drops.
Metering on a shared ground
- When metering shares ground with noisy power: accuracy must be protected by layout, filtering, and drift budget.
- Subsystem emphasis: “return path design” + “metering trust loop” + “black-box logging”.
- Citeable interface points: metering tap + drift checks + tamper/log fields.
1.3 What it explicitly does NOT cover (scope contract)
This page does not replace a device’s system design. It will not provide device-specific schematics, control loops, or protocol-stack walkthroughs. If a reader needs those, the correct approach is to reference the appropriate device page.
- Not covered: device architecture deep dives (e.g., motor drives, HVAC thermodynamics), protocol stacks, cloud/app/OS tutorials.
- Allowed only as one-line handoff: “For device-specific integration details, see the sibling page.”
1.4 How sibling pages should cite this subsystem (interface + acceptance)
Each sibling page should cite this subsystem using a consistent “fill-in template” so readers can verify outcomes without repeating theory.
- Interface point: Identify the node/boundary (entry clamp node / isolation barrier / metering tap / log export).
- Evidence to capture: Two waveforms + counters + a log excerpt (example set listed below).
- Acceptance wording: Use outcome-based checks (“no reset”, “no silent corruption”, “log continuity preserved”).
Evidence primitives
- Waveforms: connector residual spike, sensitive-rail droop, common-mode burst envelope.
- Counters: reset cause, watchdog reason, comm error counter, protection trip count.
- Logs: event_id, timestamp/sequence, peak/duration, affected rail/interface, firmware version/hash.
H2-2. Threat Model & Compliance Targets: turning standards into design targets
Threat names are not design inputs. This chapter converts ESD/EFT/surge/leakage/hi-pot into measurable targets, evidence points, and subsystem-level countermeasure strategies.
- Coupling-path taxonomy
- Outcome-based targets
- First evidence to measure
- Threat→chapter mapping
2.1 Taxonomy by coupling path (the map that matters)
EMC and safety failures repeat because energy enters through a small set of coupling paths. Classifying threats by where energy couples produces stable, reusable design rules that remain valid across different devices.
ESD: Fast rise-time discharge couples into I/O, chassis seams, and exposed metal. The design goal is to clamp at the boundary and control the return path.
Primary: clamp placement + return path
EFT: Repeated bursts couple into long cables, relay wiring, and connector pins. The design goal is to prevent rail droop, false triggers, and silent corruption.
Primary: filtering + zoning + isolation
Surge: Large energy couples through the power entry network. The design goal is to share energy across protection elements and verify post-stress aging signals.
Primary: MOV/GDT/TVS + fuse/limiter
Leakage: Leakage increases via Y-cap paths, humidity/contamination, and insulation wear. The design goal is to monitor leakage and discriminate real faults vs transients.
Primary: monitoring + discrimination
2.2 Converting threats into outcome-based targets (verify, don’t guess)
Targets should be written as outcomes that can be verified on a bench with minimal ambiguity. Absolute standard levels vary by product class, so the subsystem uses stable acceptance language that device pages can bind to their final test plan.
- ESD target: no hang/reset, no permanent damage, and no silent state corruption (log continuity preserved).
- EFT target: no false trips, no communications collapse, and no lost events (counters and logs remain coherent).
- Surge target: entry protection remains within safe temperature/leakage drift; device continues normal operation without degraded safety margin.
- Leakage/hi-pot target: leakage faults are detected reliably; transient common-mode events do not cause chronic false trips.
2.3 First evidence to capture (two measurements first)
To keep debugging deterministic, each threat category starts with the same evidence discipline: capture the boundary electrical stress, capture the internal consequence, and verify log continuity.
- ESD: 2) sensitive-rail droop / reset-cause
- EFT: 2) error counters + false trip count
- Surge: 2) post-stress log + function check
- Leakage: 2) trip discriminator evidence
2.4 Threat → Coupling → Countermeasure → Evidence (chapter map)
Countermeasures are specified as strategy combinations, not single parts. Each strategy maps to later chapters where placement, selection dimensions, and verification points are defined in detail.
H2-3. Protection Zoning: where to clamp, where to filter, where to isolate
Most field failures come from poor zoning and uncontrolled return paths—not from “weak parts”. This chapter turns protection into a repeatable layout rule set: 3 zones + return-path priority + two-point evidence.
- 3-zone definition
- Return-path priority
- Clamp vs filter decision
- Two-point evidence
3.1 The 3-zone model (Entry / Interface / Sensitive)
Protection must be designed as space + current loops. Each zone has a different goal and a different “allowed” component set. Mixing goals across zones is the fastest way to create instability.
Entry zone: The boundary where energy enters (AC/DC/PoE). The goal is to handle higher energy and keep large currents out of internal reference planes. Typical elements: MOV/GDT, higher-energy TVS, CMC, fuse/limiter.
Rule: Do not place entry protection deep inside
Interface zone: The boundary for connectors and I/O. The goal is fast clamping and controlled impedance shaping without injecting noise into sensitive rails. Typical elements: low-C TVS arrays, series R / ferrite, small RC.
Rule: TVS must sit at the connector boundary
Sensitive zone: Where state and accuracy live (MCU, AFE, metering, clocks). The goal is stable reference and rail integrity. Typical elements: local decoupling, quiet reference routing, isolation boundary.
Rule: No protection return current through sensitive ground
3.2 Return-path priority (ground / chassis / PE)
Return path is the real design knob. The same TVS can be stable or unstable depending on where the discharge current is forced to flow.
- Priority 1: High-current ESD/surge energy returns to chassis/PE (if available) using the shortest and widest path.
- Priority 2: If PE is not present, return to the nearest boundary reference point without crossing the sensitive reference area.
- Priority 3: Sensitive references connect to noisy returns only through a controlled tie (single-point / constrained coupling).
3.3 Clamp-first vs filter-first (decision conditions)
“Clamp first” and “filter first” are both valid—when used in the correct zone and aligned to signal constraints. Use a stable decision based on bandwidth and energy, not habit.
3.4 Checklist + evidence (two measurements first)
A zoning plan is only real if it can be proven by two measurements: boundary stress and internal consequence. Capture both, then verify state continuity.
Checklist
- Protection order matches zone: connector boundary → clamp → shaping → sensitive.
- TVS-to-connector trace is short; return path is wide and local.
- Filter reference is consistent (no “ground hop” between elements).
- Sensitive zone keep-out prevents protection-loop current from crossing.
Evidence
- Point A: connector residual spike after clamp (boundary evidence).
- Point B: sensitive-rail droop/noise envelope (internal evidence).
- State: reset-cause / watchdog reason / error counters / log continuity.
H2-4. ESD & TVS Selection: clamp behavior + capacitance + robustness
TVS selection is not “pick the highest power”. It is a trade among clamp behavior, dynamic resistance, capacitance, leakage, and layout parasitics. This chapter provides a decision system and evidence checks to avoid unstable ports and hidden degradation.
- Parameter-to-risk mapping
- Interface-specific Cj budgeting
- Combination strategies
- ESD evidence checks
4.1 Translate TVS parameters into engineering risk
A TVS datasheet only becomes useful when each parameter is mapped to what can break in the system. The key is to distinguish “survives” from “stays stable and accurate”.
- Vclamp: sets the peak voltage that reaches the protected node. Too high means the IC sees stress even if the TVS survives.
- Rd (dynamic resistance): determines how much Vclamp rises at high peak current. Larger Rd means worse real-world clamping.
- Ipp / peak power: relates to single-event survivability, but does not guarantee “no reset / no silent corruption”.
- Cj (junction capacitance): loads signal lines and can create instability (touch drift, audio distortion, high-speed margin loss).
- Reverse leakage: impacts standby power and bias errors; leakage drift with temperature can create field-only issues.
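The Vclamp and Rd bullets above can be illustrated with a first-order estimate. This is a minimal sketch that assumes the common linear approximation (clamp voltage equals breakdown voltage plus the rise across the dynamic resistance); the part values are hypothetical and do not come from any datasheet.

```python
def clamp_voltage(v_br, r_dyn, i_peak):
    """First-order clamp estimate: breakdown voltage plus the I*R rise across Rd.

    v_br   -- breakdown voltage in volts (assumed value)
    r_dyn  -- dynamic resistance in ohms (assumed value)
    i_peak -- peak pulse current in amps (assumed value)
    """
    return v_br + r_dyn * i_peak


# Two hypothetical parts with the same breakdown voltage but different Rd,
# both hit by the same illustrative 30 A peak current.
low_rd_part = clamp_voltage(v_br=6.0, r_dyn=0.2, i_peak=30.0)
high_rd_part = clamp_voltage(v_br=6.0, r_dyn=1.0, i_peak=30.0)

print(f"low-Rd part clamps near {low_rd_part:.1f} V")
print(f"high-Rd part clamps near {high_rd_part:.1f} V")
```

Under these assumptions the higher-Rd part lets several times more voltage reach the protected pin even though both parts share the same nominal breakdown voltage, which is why Rd belongs in the comparison alongside peak power.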
4.2 Interface-specific Cj budgeting (low-C vs robust)
Capacitance budget should be set by interface sensitivity. Use low capacitance where signal integrity is fragile, and prioritize robustness where it is not.
4.3 Combination strategies (TVS + R/FB/CMC)
Protection stability often improves when a small impedance element reduces di/dt and ringing, leaving the TVS to clamp the remaining peak. The combination must be placed within the correct zone (H2-3).
- TVS + series R: limits peak current and adds damping; useful when the line tolerates small series impedance.
- TVS + ferrite bead: shapes high-frequency energy and reduces injection into sensitive rails; verify it does not create resonance with line capacitance.
- TVS + CMC: suppresses common-mode energy on paired lines; helps prevent burst-induced ground bounce from becoming a functional failure.
4.4 Evidence + failure pattern (TVS placed too far)
A classic failure is “TVS exists but the IC still gets hit” because trace inductance turns distance into voltage. The fix is usually placement and loop control, not a larger TVS.
Failure pattern
- ESD applied at connector → occasional reset, port damage, or latent instability.
- Measured residual spike at connector remains high; ringing persists.
- TVS is placed far from connector; return path crosses internal planes.
Fix
- Move TVS to the connector boundary; shorten TVS-to-connector trace.
- Force return current to close locally (chassis/PE or boundary reference).
- Add small series R/FB if signal constraints allow; re-check evidence.
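The claim that trace inductance turns distance into voltage follows from V = L · di/dt. The sketch below works one example under stated assumptions: roughly 1 nH per mm of trace (a common rule of thumb, not a figure from this page) and an illustrative 30 A current step with a 1 ns rise. The function name and all numbers are hypothetical.

```python
def inductive_overshoot(l_nh_per_mm, length_mm, delta_i_a, rise_ns):
    """Voltage developed across a trace by a fast current step: V = L * di/dt."""
    inductance_h = l_nh_per_mm * length_mm * 1e-9  # nH -> H
    di_dt = delta_i_a / (rise_ns * 1e-9)           # A per second
    return inductance_h * di_dt


# Same assumed current step, two TVS-to-connector distances.
far_tvs = inductive_overshoot(l_nh_per_mm=1.0, length_mm=10.0, delta_i_a=30.0, rise_ns=1.0)
near_tvs = inductive_overshoot(l_nh_per_mm=1.0, length_mm=2.0, delta_i_a=30.0, rise_ns=1.0)

print(f"10 mm of trace adds about {far_tvs:.0f} V on top of the clamp voltage")
print(f" 2 mm of trace adds about {near_tvs:.0f} V on top of the clamp voltage")
```

With these numbers the extra voltage scales directly with trace length, so shortening the path does more than choosing a larger TVS. Real boards differ in trace geometry and pulse shape, so treat the result as an order-of-magnitude check to compare against the measured residual spike.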
H2-5. EFT & Surge Front-End: MOV/GDT/TVS + fuse + inrush as a system
Treat EFT and surge as energy management, not single-part selection. A stable front-end uses coordinated roles: MOV absorbs energy, GDT handles high current, TVS clamps fast edges, while fuse and inrush control keep the system predictable.
- EFT vs surge behavior
- Entry stack as a system
- MOV aging & fuse coordination
- Post-test acceptance checks
5.1 EFT vs surge (repeated small hits vs one big hit)
EFT tends to create functional instability through repeated fast transients and coupling on harness/entry nodes. Surge is dominated by energy and heat, where survivability and controlled failure (safe disconnect) matter most.
5.2 Typical entry protection stack (series path + shunt path)
A robust front-end separates the series path (normal current flow and inrush control) from the shunt path (transient energy absorption). The stack works only when energy is diverted at the boundary and return paths are short and controlled (see zoning rules in H2-3).
Series path
- Fuse / breaker → sets the ultimate “safe disconnect”.
- Inrush control (NTC / limiter) → prevents repetitive stress and nuisance trips.
- EMI element (CMC / filter) → reduces burst injection into internal rails.
- DC bus → feeds downstream converters / loads (not expanded here).
Shunt path
- MOV absorbs surge energy (watch aging and heat).
- GDT handles extreme current when applicable (coordination matters).
- TVS clamps fast edges and residual spikes close to the protected node.
- Return closes to chassis/PE or the entry reference node (do not cross sensitive references).
5.3 Reliability & coordination (MOV aging, thermal, fuse interaction)
Front-end robustness is a lifecycle problem. MOVs can degrade with repeated surges and temperature, and coordination with a fuse/breaker defines whether failures remain safe.
- MOV aging: repeated stress can shift clamp behavior and increase leakage; monitor drift rather than assuming “still OK”.
- Thermal runaway risk: if energy is repeatedly dumped into a hot MOV without a disconnect path, temperature rise accumulates.
- Fuse coordination: the system should disconnect safely under abnormal energy, rather than leaving a degraded shunt element as a leakage/heat source.
- TVS role: clamp fast residual spikes; do not use it as the primary energy absorber for large surge energy.
5.4 Do / Don’t (placement and return discipline)
Do
- Place MOV/GDT/TVS in the Entry Zone, as close to the boundary as possible.
- Provide a short, wide return to chassis/PE or an entry reference node.
- Keep hot components away from sensitive references and isolation keepouts.
- Record post-test drift (temp/leakage/insulation) and keep event logs continuous.
Don’t
- Don’t place shunt absorbers deep on the board (energy will travel through internal copper first).
- Don’t let transient return current cross metering/AFE ground references.
- Don’t rely on “high Ipp” TVS as a surge-energy solution.
- Don’t ignore leakage drift after stress (degradation can be silent).
H2-6. Isolation Barrier: selecting the right isolation and keeping it real in layout
Isolation is a boundary discipline: device choice + creepage/clearance intent + CMTI behavior + isolated power noise control. This chapter focuses on the isolation “anatomy” and evidence that the barrier stays real under common-mode transients.
- Isolation type boundaries
- CMTI-driven evidence
- Layout keepout/slot intent
- Parasitic coupling control
6.1 Isolation types and boundaries (digital / analog / power)
Isolation choices should follow the signal and accuracy needs, not a default part. The barrier must be consistent for data and power.
- Digital isolators: control and data paths; focus on delay/bandwidth/power and CMTI behavior under fast common-mode edges.
- Optocouplers: suitable for some slower paths; watch long-term drift/aging and ensure the layout preserves the barrier.
- Isolated amplifiers / isolated AFE: used when analog fidelity and common-mode range are key drivers.
- Isolated DC-DC: defines whether the power domain truly respects the barrier or rebuilds a bridge across it.
6.2 Key specs that predict field behavior (CMTI, working voltage, lifetime terms)
Instead of memorizing labels, link each specification to what can be measured and what can fail.
6.3 Layout reality: keepout, slots, and parasitic coupling paths
Layout determines whether the barrier is real. Any copper crossing the boundary (or running too close) creates a parasitic capacitor that can bypass isolation under fast edges.
- Keepout intent: prevent copper, stitching, and long traces from bridging hot-to-cold domains.
- Slots: reduce surface leakage paths and shrink the parasitic coupling channel.
- Crossing traces: avoid routing signals or planes that create a predictable “capacitive bridge”.
- Controlled coupling: if a coupling element is required, keep it explicit and predictable (not accidental).
6.4 Isolated power noise: don’t rebuild a bridge across the barrier
Isolated DC-DC can silently defeat an isolation barrier if its return currents or filtering loops cross domains. Keep power loops local to each side and avoid large area loops spanning the barrier.
- Input filtering of isolated DC-DC should close its loop on the hot side without injecting into the cold reference.
- Output decoupling should close its loop on the cold side without spanning the barrier.
- Verify common-mode events do not translate into rail droop or false switching across the barrier.
6.5 Evidence under common-mode transients (bit errors, resets, log continuity)
Isolation must be verified as system behavior. Under common-mode stress, the barrier is “real” only if counters and logs remain consistent while data paths stay error-free.
- Link counters: CRC/bit errors, dropouts, unexpected retransmits.
- System state: reset cause / watchdog / brownout indicators.
- Continuity: event log remains monotonic and complete across stress.
- Errors or resets often indicate a bypass path via parasitic capacitance, not an “isolation chip issue”.
- Fix is usually keepout/slot discipline and power-loop control, then re-check evidence.
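The "counters and logs remain consistent" check can be made mechanical. The following is a minimal sketch, assuming the log exposes a per-record sequence number and that error/reset counters can be read before and after the stress run; the function and counter names are hypothetical.

```python
def log_continuity(seq_nos):
    """Return (monotonic, gaps) for sequence numbers read back from the event log.

    monotonic is False if any number repeats or goes backwards.
    gaps lists (previous, current) pairs where records are missing in between.
    """
    monotonic = True
    gaps = []
    for prev, cur in zip(seq_nos, seq_nos[1:]):
        if cur <= prev:
            monotonic = False
        elif cur != prev + 1:
            gaps.append((prev, cur))
    return monotonic, gaps


def stress_passed(counters_before, counters_after, seq_nos):
    """Pass only if error/reset counters did not move and the log has no holes."""
    monotonic, gaps = log_continuity(seq_nos)
    unchanged = all(counters_after[name] == counters_before[name]
                    for name in counters_before)
    return monotonic and not gaps and unchanged


before = {"crc_errors": 0, "resets": 1}
after = {"crc_errors": 0, "resets": 1}
print(stress_passed(before, after, [7, 8, 9]))
print(log_continuity([1, 2, 3, 5]))
```

Only counters that are expected to stay flat (link errors, resets, watchdog events) belong in the comparison; counters that legitimately increment during a test would need a separate rule.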
H2-7. Leakage & Touch Safety: monitoring, limiting, and diagnosing leakage paths
Turn leakage and touch-safety risks into measurable engineering evidence. Map leakage loops, choose monitoring that matches the path, and distinguish nuisance trips from real insulation degradation using two-point measurements and logs.
- Leakage path map
- RCD/CT/insulation monitoring boundaries
- Nuisance trips vs missed detection
- Field debug checklist (symptom → evidence → isolate)
7.1 Leakage path map (from hot side to touchable metal / cold side)
Leakage is a loop, not a single node. Start by identifying whether the path is intentional (EMI components), conditional (moisture/contamination), or aging/fault driven (degraded insulation). Each class has different symptoms and different evidence.
7.2 Monitoring options (RCD/GFCI, CT/differential sensing, insulation monitoring)
Monitoring must match the leakage loop. A trip device proves a differential imbalance, but not necessarily the root path. A sensing chain provides trend and correlation—only if it can separate real leakage from transient common-mode injection.
RCD/GFCI
- Best for catching true differential leakage loops.
- May nuisance-trip under fast transients if the system injects common-mode spikes into the sensing loop.
- Use logs and two-point measurements to separate “spike” events from baseline trend.
CT/differential sensing
- Enables leakage trend tracking and event records (time-correlated with resets and EMI).
- Requires bandwidth/filters that avoid treating fast CM bursts as real leakage.
- Evidence output should include peak, RMS/mean, duration, and timestamp.
Insulation monitoring
- Helps detect degradation paths before catastrophic failure.
- Most valuable when correlated to humidity/contamination and post-stress drift.
- Focus on trend and repeatability, not one-time readings.
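The evidence output listed for a sensing chain (peak, RMS/mean, duration, timestamp) can be reduced from a window of samples. This is a minimal sketch, assuming leakage samples in milliamps at a fixed sample period; the function name, field names, and the spike threshold are hypothetical.

```python
import math


def summarize_leakage(samples_ma, sample_period_ms, spike_threshold_ma, timestamp):
    """Reduce one window of leakage-current samples to a compact evidence record."""
    n = len(samples_ma)
    peak = max(abs(s) for s in samples_ma)
    mean = sum(samples_ma) / n
    rms = math.sqrt(sum(s * s for s in samples_ma) / n)
    # Duration counts how long the magnitude stayed above the spike threshold.
    samples_over = sum(1 for s in samples_ma if abs(s) > spike_threshold_ma)
    return {
        "timestamp": timestamp,
        "peak_ma": peak,
        "mean_ma": mean,
        "rms_ma": rms,
        "duration_ms": samples_over * sample_period_ms,
    }


window = [0.1, 0.1, 5.0, 5.0, 0.1]
record = summarize_leakage(window, sample_period_ms=1.0,
                           spike_threshold_ma=1.0, timestamp=1000)
print(record)
```

Keeping both the peak and the RMS/mean in one record lets later analysis tell a short spike on a flat baseline apart from a baseline that has actually moved.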
7.3 Nuisance trips vs missed detection (common sources and discriminators)
The highest-risk failure mode is confusing transient injection with true insulation leakage. Use discriminators that separate spike-only events from baseline drift and link both to entry stress and EMI coupling.
Nuisance trip (false)
- Transient-driven spike: short pulses aligned with switching/entry events.
- EMI capacitor injection: higher baseline but stable and repeatable across humidity.
- CM burst coupling: sensor chain sees spikes while insulation trend stays flat.
Missed detection (danger)
- Slow drift: leakage baseline rises over days/weeks (aging/contamination).
- Humidity sensitivity: strong correlation with moisture/condensation cycles.
- Post-stress change: drift appears after surge/EFT events and stays elevated.
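The two classes above can be separated with a simple rule: a rising baseline points at real leakage growth, while spikes on a flat baseline point at transient injection. The sketch below assumes one (baseline RMS, peak) pair per observation window and uses a least-squares slope; the thresholds and names are hypothetical and would need tuning per product.

```python
def baseline_slope(values):
    """Least-squares slope of values against their index (units per window)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def classify_leakage(windows, slope_limit, spike_limit):
    """windows: list of (baseline_rms, peak) tuples, oldest first (at least two)."""
    slope = baseline_slope([baseline for baseline, _ in windows])
    spikes = sum(1 for _, peak in windows if peak > spike_limit)
    if slope > slope_limit:
        return "baseline_drift"   # candidate real leakage: investigate insulation
    if spikes > 0:
        return "spike_only"       # candidate nuisance trip: investigate coupling
    return "normal"


flat_with_spike = [(0.2, 0.3), (0.2, 5.0), (0.2, 0.3), (0.2, 0.3)]
rising_baseline = [(0.2, 0.3), (0.3, 0.4), (0.4, 0.5), (0.5, 0.6)]
print(classify_leakage(flat_with_spike, slope_limit=0.05, spike_limit=1.0))
print(classify_leakage(rising_baseline, slope_limit=0.05, spike_limit=1.0))
```

Drift is checked first on purpose: a missed detection is the more dangerous error, so a rising baseline should win even when spikes are also present.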
7.4 Evidence fields (engineering records without certification procedures)
Record what makes leakage explainable. The goal is reproducible evidence that links leakage behavior to environment and to subsystem events.
- Environment tags: humidity/temperature, “dry vs wet” surface condition, contamination notes.
- Leakage waveform class: spike peak, duration, repetition; baseline RMS/mean trend.
- Action & state: trip/no-trip, response delay category, reset causes (if any).
- Correlation: entry stress events, EMI sampling points, rail ripple, event-log continuity timestamp.
7.5 Field debug checklist (symptom → two measurements → isolate → first fix)
Use a consistent field workflow. Each symptom starts with two measurements and ends with an isolation decision.
H2-8. EMI Filtering & Grounding: designing the return path, not just adding parts
EMI control is return-path control. Identify common-mode vs differential-mode behavior, place filters to shrink loop area, and verify improvements using two measurements first: entry noise plus sensitive-rail ripple, correlated to resets, metering error, and logs.
- CM vs DM quick mapping
- Placement rules (short, executable)
- Chassis/PE/logic ground principles
- Two measurements first + correlation
8.1 Common-mode vs differential-mode (recognize first, then filter)
Adding parts without identifying the mode often moves noise rather than reducing it. Use mode recognition to choose the right element and the right placement.
8.2 Placement rules (short, executable)
These rules keep the filter effective and prevent protection elements from turning into injection paths.
- Boundary placement: place entry/interface filters at the boundary so noise is contained before it enters sensitive domains.
- Close the return: capacitor return loops must be short and local; long returns create big radiating loops.
- CMC discipline: a CMC needs a controlled return strategy; otherwise it shifts CM energy into sensitive references.
- Minimize high di/dt loop area: shrink current loops rather than stacking parts deep inside the board.
- Keep noise out of metering references: prevent transient return currents from crossing measurement/AFE reference nodes.
8.3 Ground / chassis / PE principles (high-level, device-agnostic)
Grounding is about defining where current is allowed to return. Separate roles at a principle level: chassis/PE anchors high-frequency return behavior, while logic/measurement reference must remain stable and protected from large transient currents.
- Chassis/PE: serves as a high-frequency return anchor and safety reference point (concept level).
- Logic/measurement ground: defines ADC/meter references; keep transient return currents from crossing it.
- Controlled connections: any connection between chassis/PE and logic reference must be intentional and predictable, not accidental.
8.4 Two measurements first (entry noise + sensitive-rail ripple) and correlation
EMI success must show up in both the electrical domain and the system domain. Start with two measurements and correlate them to symptoms and logs.
Measurement 1: entry noise
- Capture at the entry boundary (conducted noise sampling point or near-field proxy).
- Look for mode changes (CM vs DM behavior) after placement adjustments.
Measurement 2: sensitive-rail ripple
- Capture rail ripple and reference noise near metering/AFE/MCU.
- Verify the ripple improvement aligns with reductions in resets, metering drift, and log anomalies.
Correlation
- Reset cause / watchdog counters do not increase under stress.
- Metering error and leakage false-trip rates decrease.
- Event log remains continuous with stable timestamps.
H2-9. Energy Metering Architecture: where to sense, how to keep accuracy under noise
Metering accuracy is architecture: tap location + sensor type + front-end chain + noise coexistence. The goal is not only “accurate when clean,” but “continuous and explainable under switching noise and stress.”
- Tap options (AC / DC / high-side / low-side)
- Shunt vs CT vs Rogowski selection
- Front-end chain choice logic
- Accuracy killers + evidence checklist
9.1 Where to sense: AC line vs DC bus, high-side vs low-side
Start with the tap location. The same sensor behaves very differently depending on whether it sits on the AC line, on the DC bus, or on a high-side/low-side node where return currents and common-mode stress differ.
9.2 Sensor choice: shunt vs CT vs Rogowski (engineering tradeoffs)
Pick sensors by what breaks accuracy in the real environment: overload behavior, low-current resolution, bandwidth needs, and EMI sensitivity. Avoid choosing by “spec sheet power” alone.
9.3 Front-end chain: ΣΔ modulator vs metering AFE vs integrated SoC
The front-end chain must produce evidence, not only measurements. Prefer chains that expose saturation flags, missing-sample counters, coefficient integrity checks, and stable timestamping under stress.
ΣΔ modulator
- Strong fit when isolation and high resolution are required.
- Works best with a disciplined return path and controlled references.
- Evidence focus: overload recovery and continuity flags.
Metering AFE
- Good for integrated energy computation and calibration support.
- Evidence focus: calibration coefficient CRC and internal status flags.
- Use when “explainable accuracy” is a primary requirement.
Integrated SoC
- Great for size/cost integration when evidence export is sufficient.
- Evidence focus: counters/log fields must remain accessible.
- Avoid if stress behavior cannot be observed and recorded.
9.4 Keeping accuracy under noise (and not losing samples)
Accuracy fails in three ways: the input saturates, the reference is injected, or the system loses continuity. Treat metering as a chain that must stay stable through switching noise and entry stress.
9.5 Selection matrix + “accuracy killers” checklist
Use a compact decision matrix for early architecture selection, then verify against the most common accuracy killers.
H2-10. Calibration, Drift & Anti-Tamper: making metering trustworthy over life
Trust comes from a closed loop: calibration evidence, drift discrimination, tamper signals, and event logs that survive stress. This chapter focuses on engineering evidence and fields—without legal or certification claims.
- Factory calibration vs in-field self-check
- Drift sources + discriminators
- Tamper signals → sensing → log fields
- Evidence-to-log mapping + no overclaim
10.1 Calibration strategy: factory calibration vs in-field self-check
Factory calibration provides controlled reference conditions and repeatability. In-field self-check is a sanity loop that detects “no longer trustworthy” behavior and triggers service evidence—without pretending to replace calibration.
10.2 Drift sources and discriminators (temperature, aging, magnetic effects)
Drift is diagnosable when it is categorized and correlated. Treat drift sources as classes with observable signatures, then decide what must be logged for explainability.
10.3 Anti-tamper: signals → sensing → log fields
Anti-tamper is evidence. For each tamper category, define one sensing method and one log field so that service can distinguish user behavior, environment influence, and true tampering intent.
- tamper_cover_open + timestamp
- tamper_magnetic_bias + magnitude
- tamper_bypass_or_reverse
- tamper_waveform_anomaly + score
- meter_chain_integrity
10.4 Evidence → event log mapping (what must be recorded)
Trust requires that evidence survives stress events. Record calibration state, self-check results, drift deltas, tamper flags, and continuity counters with stable timestamps so the service readout can reconstruct what happened.
- Calibration: version/date, coefficient CRC, authorization state, reference tags.
- Self-check: trigger reason, pass/fail, deviation magnitude, temperature tag.
- Drift evidence: baseline trend slope, post-stress delta, persistence indicator.
- Tamper: category field + magnitude/score + duration + recovery marker.
- Integrity: missing-sample counter, timestamp continuity marker, reset cause linkage.
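The "coefficient CRC" item in the calibration record can be sketched as a stored blob that carries its own checksum. This is an illustration only: the three coefficients, their float32 layout, and the use of CRC-32 from Python's zlib are assumptions, and a real metering front-end may define its own checksum and register layout.

```python
import struct
import zlib


def pack_coeffs(gain, offset, phase):
    """Serialize calibration coefficients and append a CRC-32 of the body."""
    body = struct.pack("<fff", gain, offset, phase)
    return body + struct.pack("<I", zlib.crc32(body))


def coeffs_valid(blob):
    """Recompute the CRC over the body and compare it with the stored value."""
    body = blob[:-4]
    (stored,) = struct.unpack("<I", blob[-4:])
    return zlib.crc32(body) == stored


blob = pack_coeffs(gain=1.0025, offset=-0.004, phase=0.12)
corrupted = bytes([blob[0] ^ 0x01]) + blob[1:]   # single flipped bit

print(coeffs_valid(blob))
print(coeffs_valid(corrupted))
```

A failed check is itself evidence: logging the mismatch as an event, rather than silently reloading defaults, keeps the trust loop explainable.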
10.5 Don’t overclaim (engineering evidence, not legal guarantees)
This chapter provides an engineering trust loop: evidence fields, discriminators, and event logging. It does not claim legal compliance, certification completion, or absolute tamper-proof behavior. Trust is demonstrated by logs, coefficient integrity, anomaly statistics, and repeatable re-tests.
H2-11. Event Recording & Forensics: what to log, how to timestamp, how to survive brownouts
Event recording is an engineering SOP, not a checkbox. A good on-device “black box” logs the right fields at the right time, survives brownouts, and enables fast root-cause triage with a minimal evidence set.
- Event taxonomy + dictionary
- MVS (minimum viable schema)
- FRAM/EEPROM/Flash + ring buffer
- Timestamp vs sequence strategy
- Brownout survivability
- 5-step forensics SOP
11.1 Log taxonomy: turn symptoms into a consistent event dictionary
The event dictionary must be consistent across product lines: one event_id plus a scoped subcode (e.g., rail index, trip source, channel index). This makes filtering and forensics deterministic.
- UVLO/BOR, rail dip, inrush fail, watchdog reset correlation
  - Required: rail_id, min_v, dip_ms, reset_cause
- ESD/EFT hit counters, surge counter, eFuse/OVP/OCP/OTP trips
  - Required: trip_source, peak, duration_ms, recovery state
- Tamper flags, overcurrent signatures, coefficient CRC mismatch
  - Required: tamper_type, score, coeff_crc, sample continuity
- Leakage trip, insulation degrade flags, nuisance-trip pattern detection
  - Required: leak_trip, threshold_bin, duration_ms, retry/lockout state
11.2 MVS (Minimum Viable Schema): the smallest set that still enables reconstruction
Avoid “data dumping.” A record should be small, deterministic, CRC-protected, and sufficient to answer: what happened, which domain, how severe, in what order, under which firmware build.
- schema_version, event_id, subcode, fw_version, config_crc
- rtc_state (valid/invalid), timestamp (optional), seq_no (mandatory), uptime_ms
- rail_id, min_v, dip_ms, peak, duration_ms, reset_cause, crc
When rtc_state is invalid, seq_no + uptime_ms must still reconstruct the timeline.
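One way to keep a record small, deterministic, and CRC-protected is a fixed binary layout. The sketch below is illustrative only: the field widths, the little-endian byte order, the integer encodings (for example min_mv in millivolts instead of min_v), and the use of CRC-32 are all assumptions, not part of the schema above.

```python
import struct
import zlib
from collections import namedtuple

# Assumed field order and widths (little-endian, no padding).
Record = namedtuple(
    "Record",
    "schema_version event_id subcode fw_version config_crc "
    "rtc_state timestamp seq_no uptime_ms "
    "rail_id min_mv dip_ms peak duration_ms reset_cause",
)
_BODY = struct.Struct("<BHBHIBIIIBHHHHB")
_CRC = struct.Struct("<I")
RECORD_SIZE = _BODY.size + _CRC.size  # fixed size, so slots never vary


def pack_record(rec):
    """Serialize a Record and append a CRC-32 over the body."""
    body = _BODY.pack(*rec)
    return body + _CRC.pack(zlib.crc32(body))


def unpack_record(raw):
    """Return a Record, or None if the length or CRC does not match."""
    if len(raw) != RECORD_SIZE:
        return None
    body = raw[:_BODY.size]
    (crc,) = _CRC.unpack(raw[_BODY.size:])
    if zlib.crc32(body) != crc:
        return None
    return Record(*_BODY.unpack(body))
```

A fixed record size keeps worst-case write time predictable, which matters for the brownout time budget discussed later in this chapter.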
11.3 When to log: triggers, pre/post snapshots, and early commit for reset-type events
Many failures become unreproducible because “the interesting context” was only in RAM and disappeared after reset. Treat logging as a lightweight state-machine snapshot, not a waveform recorder.
- Trigger points: BOR/UVLO IRQ, trip latch, leakage trip, tamper flag change, coefficient CRC mismatch, missing-sample spike.
- Pre-event snapshot: keep the last N “min/peak/counter” values (no big payloads).
- Post-event snapshot: record recovery state (auto-recover / lockout / degraded accuracy).
- Early commit: for reset-type events, write the record before the system resets (or at BOR early warning if available).
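The trigger and snapshot rules above can be sketched as a small tracker that keeps only compact values in RAM and hands them to the commit routine the moment a trigger fires. This is a hypothetical illustration: the class name, the window depth, and the commit callback are assumptions, and a real firmware version would run in the interrupt or early-warning path.

```python
from collections import deque


class SnapshotTracker:
    """Keeps the last N compact (rail_min_mv, peak, counter) samples in RAM."""

    def __init__(self, depth=4):
        self._window = deque(maxlen=depth)  # old samples fall off automatically

    def update(self, rail_min_mv, peak, counter):
        """Record one periodic sample; no waveforms, only summary values."""
        self._window.append((rail_min_mv, peak, counter))

    def on_trigger(self, event_id, commit):
        """Freeze the pre-event context and commit it immediately,
        before any recovery or reset handling runs."""
        record = {
            "event_id": event_id,
            "pre": list(self._window),
            "worst_min_mv": min((s[0] for s in self._window), default=None),
            "worst_peak": max((s[1] for s in self._window), default=None),
        }
        commit(record)
        return record
```

The post-event snapshot (auto-recover, lockout, or degraded accuracy) would be written as a second record once the recovery state is known, so the first record never waits on it.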
11.4 Storage strategy: FRAM vs EEPROM vs Flash, ring buffer, and write-life control
Storage selection is about write determinism and survivability, not only capacity. Use a ring buffer with a fixed record size and a commit byte written last.
- FRAM: fast, deterministic writes; excellent endurance for frequent events. Best fit for: BOR/UVLO/rail dip and “last record must survive” scenarios.
- EEPROM: works for moderate event rates; pay attention to page write behavior. Best fit for: infrequent critical events + configuration snapshots.
- Flash: great for large history, but frequent small writes are risky without buffering. Best fit for: aggregated counters, periodic summaries, lower-frequency events.
Record layout: [header | payload | crc | commit]. Write the commit byte last; on boot, scan for the latest valid seq_no.
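A minimal sketch of the ring buffer with a commit byte written last, using a bytearray as a stand-in for the NVM device. The slot size, the 0xA5 commit marker, the 0xFF erased state, the seq_no-to-slot mapping, and the power_fails_before_commit test hook are all assumptions for illustration; seq_no wraparound is not handled here.

```python
import struct
import zlib

SLOT_PAYLOAD = 8            # assumed fixed payload size in bytes
COMMIT = 0xA5               # assumed commit marker; erased state assumed 0xFF
_HDR = struct.Struct("<I")  # header carries only seq_no in this sketch
SLOT_SIZE = _HDR.size + SLOT_PAYLOAD + 4 + 1  # header + payload + crc + commit


class RingLog:
    def __init__(self, slots):
        self.slots = slots
        self.mem = bytearray(b"\xFF" * (slots * SLOT_SIZE))  # simulated NVM

    def write(self, seq_no, payload, power_fails_before_commit=False):
        if len(payload) != SLOT_PAYLOAD:
            raise ValueError("payload must match the fixed slot size")
        base = (seq_no % self.slots) * SLOT_SIZE
        body = _HDR.pack(seq_no) + payload
        # Phase 1: invalidate the slot, then write header, payload, and CRC.
        self.mem[base + SLOT_SIZE - 1] = 0xFF
        self.mem[base:base + len(body)] = body
        crc_at = base + len(body)
        self.mem[crc_at:crc_at + 4] = struct.pack("<I", zlib.crc32(body))
        if power_fails_before_commit:
            return  # simulated brownout: the slot stays uncommitted
        # Phase 2: the commit byte is the last thing written.
        self.mem[base + SLOT_SIZE - 1] = COMMIT

    def latest(self):
        """Boot-time scan: return (seq_no, payload) of the newest valid slot."""
        best = None
        for i in range(self.slots):
            slot = bytes(self.mem[i * SLOT_SIZE:(i + 1) * SLOT_SIZE])
            body, crc, commit = slot[:-5], slot[-5:-1], slot[-1]
            if commit != COMMIT or struct.pack("<I", zlib.crc32(body)) != crc:
                continue  # torn or never-written slot
            seq_no = _HDR.unpack(body[:4])[0]
            if best is None or seq_no > best[0]:
                best = (seq_no, body[4:])
        return best
```

If power fails between the two phases, the boot scan skips the torn slot and falls back to the previous committed record, so the log never returns a half-written entry.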
- SPI FRAM: Infineon/Cypress FM25V10, FM25V02
- I²C FRAM: Fujitsu MB85RC256V
- I²C EEPROM: Microchip 24LC256, 24AA256
- SPI NOR Flash: Winbond W25Q64JV, Macronix MX25L6406E
- RTC: Microchip MCP7940N, NXP PCF85063A, ADI/Maxim DS3231
- Reset supervisor / BOR helper: TI TPS3839, ADI/Maxim MAX809
- Power-path / hold-up assist (examples): TI TPS2121, ADI LTC4412
- Supercap (hold-up energy, example): Panasonic EEC-F5R5H105
11.5 Timestamping: RTC + backup, and “time untrusted” fallback with sequence numbers
Time is not always trustworthy. The log must carry an explicit RTC validity state and a monotonic sequence number.
- When time is trusted: RTC has backup supply, passes sanity checks, and does not jump unexpectedly.
- When time is untrusted: first power-up, RTC backup lost, or detected time discontinuity → mark rtc_state=invalid.
- Always required: seq_no increments monotonically and survives resets (stored in NVM with wear-safe strategy).
Use timestamp only if rtc_state=valid. Otherwise sort by seq_no, then uptime_ms.
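The ordering rule can be sketched as a sort key that falls back to seq_no whenever any record in the set carries untrusted time. This is an illustrative reading of the rule, not the only possible one: the RTC_VALID encoding and the dict-based record shape are assumptions.

```python
RTC_VALID = 1  # assumed encoding of rtc_state


def timeline(records):
    """Order records for reconstruction.

    Timestamps are used only when every record has trusted time;
    otherwise the order comes from seq_no, then uptime_ms.
    """
    all_trusted = all(r["rtc_state"] == RTC_VALID for r in records)
    if all_trusted:
        return sorted(records, key=lambda r: (r["timestamp"], r["seq_no"]))
    return sorted(records, key=lambda r: (r["seq_no"], r["uptime_ms"]))


def display_time(r):
    """Show wall-clock time only when the RTC was valid for this record."""
    if r["rtc_state"] == RTC_VALID:
        return f"t={r['timestamp']}"
    return f"seq={r['seq_no']} up={r['uptime_ms']}ms"
```

Mixing trusted and untrusted timestamps in one sort is the usual way a timeline ends up scrambled, which is why the fallback applies to the whole set and not per record.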
11.6 Forensics SOP: the minimal evidence set + a 5-step reconstruction method
The fastest triage uses a minimal, repeatable evidence set: two waveforms, three counters, and one log dump. This is enough to separate “power collapse” from “measurement chain collapse” and from “protection latch behavior.”
- 2 waveforms: (1) most sensitive rail near MCU/AFE, (2) entry/interface noise proxy near protection zone.
- 3 counters: missing samples, saturation/overload flag count, reset/trip counter.
- 1 log dump: last K records + schema_version + fw_version.
- Sort events by seq_no (or by trusted timestamp).
- Find the first “system turning-point” (BOR/UVLO/trip/leakage/tamper-integrity).
- Time-align that event to the two waveforms (rail dip vs entry spike window).
- Use counters to decide: saturation vs missing samples vs reset chain.
- Output the first fix direction: entry energy control / return path / clamp placement / recovery policy (no over-claim).
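The first two reconstruction steps and the counter decision can be sketched in a few lines. The event_id strings and the "largest counter wins" rule are assumptions made for illustration; the text above only names which counters to compare, not how to weight them.

```python
# Assumed event_id names for turning-point events.
TURNING_POINTS = {"BOR", "UVLO", "TRIP", "LEAKAGE", "TAMPER_INTEGRITY"}


def first_turning_point(records):
    """Steps 1-2: sort by seq_no and return the earliest turning-point event."""
    for r in sorted(records, key=lambda r: r["seq_no"]):
        if r["event_id"] in TURNING_POINTS:
            return r
    return None


def dominant_chain(missing_samples, saturation_flags, reset_or_trip):
    """Step 4 (crude heuristic): report which counter moved the most."""
    counters = {
        "missing samples": missing_samples,
        "saturation": saturation_flags,
        "reset chain": reset_or_trip,
    }
    name, value = max(counters.items(), key=lambda kv: kv[1])
    return name if value > 0 else "inconclusive"
```

Step 3 (time-aligning the turning point to the two waveforms) remains a bench activity; the code only narrows which log record to align against.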
Figure F11 — Black box recorder: event sources → queue → NVM → service export (survive brownouts)
The black box recorder is a deterministic pipeline. Brownout survivability is achieved by early commit, hold-up energy for the last write, and a two-phase commit record format.
Every service export should carry schema_version + fw_version so field forensics can attribute changes.
H2-12. FAQs (Evidence-based; mapped to chapters)
Each answer stays inside this subsystem evidence chain: Protection, Isolation, Leakage Safety, Metering Trust, and Event Recording. No device-specific system designs are assumed.
1) After adding TVS, resets become more frequent: capacitance load or return path?
Treat this as an energy-routing problem, not a “TVS brand” problem. First capture two points: residual spike at the connector (after TVS) and the sensitive rail dip/BOR counter during the same hit. If spikes clamp but rails still dip, the return path is injecting ground bounce; if rails stay solid but the interface degrades, TVS capacitance/impedance is the likely load. Log event_id + reset_cause + seq_no.
2) ESD does not crash the system, but metering “jumps”: front-end hit or rail ripple?
Separate “measurement chain saturation” from “reference corruption.” First capture metering AFE flags (overrange/saturation/missing-sample) and the metering-domain rail ripple around the ESD event. If the jump aligns with front-end saturation while rails remain stable, the input path needs protection/limiting; if the jump aligns with a rail dip or BOR warning, the power/return path is dominating. Preserve the record with event_id, peak, and seq_no.
3) EFT causes packet loss, but surge does not: why, and which filter stage first?
EFT is a repeated fast-edge aggressor, so coupling paths and return-path discontinuities matter more than bulk energy. First measure a common-mode noise proxy at the interface and the internal sensitive-rail ripple while logging drop counters. If loss occurs without rail dip, prioritize interface-side common-mode control (CMC placement + return-path continuity); if loss correlates with rail dips, prioritize entry energy management and rail resilience. Confirm by correlating seq_no with counter spikes.
4) Surge passes, but MOV runs hot / leakage increases: how to judge aging risk?
“Pass” is not “healthy,” so post-test evidence must be captured. First record MOV temperature rise under a defined input condition and measure leakage/insulation-related indicators before and after the surge campaign. If both temperature rise and leakage drift upward, aging/derating risk is likely; if only temperature rises, thermal coupling or fuse/MOV coordination is more likely. Store the test stamp (fw_version/config_crc) with the log snapshot for traceability.
5) The isolator misbehaves on transients: insufficient CMTI or isolated supply coupling?
Use time alignment to discriminate CMTI stress from supply coupling. First collect link error/CRC counters and isolated-supply ripple or ground-reference movement across the barrier during the transient window. If errors line up with the common-mode edge while isolated supply remains quiet, CMTI/layout/keepout is the limiter; if errors track isolated-supply ripple, the isolated power and its return loop are coupling noise into the receiver. Log event_id + duration_ms + error_count with seq_no.
6) Leakage alarms worsen in humidity: what are the three most common leakage paths?
Focus on measurable paths: (1) Y-cap related common-mode leakage across the safety boundary, (2) surface leakage from moisture/contamination films, and (3) insulation damage/aging creating a persistent resistive path. First record humidity/condensation conditions and the leakage-sense output (or trip timing) trend. Strong humidity correlation points to surface leakage; switching-state correlation points to Y-cap/common-mode behavior; slow monotonic growth points to aging. Store threshold_bin and duration_ms in the log.
7) RCD/GFCI nuisance trips: surge common-mode or Y-cap design effect?
Distinguish impulse-driven imbalance from an elevated steady leakage baseline. First capture a transient common-mode proxy during the trip window and the steady leakage baseline under normal operation. Trips that occur only during impulse windows suggest common-mode transients and return-path discontinuities; a consistently high baseline suggests Y-cap/insulation/leakage design drivers. Apply the first fix accordingly: improve common-mode control/placement for transients, or reduce baseline leakage contributors for steady issues. Log trip_source + peak + seq_no.
8) Same metering concept, new PCB revision is less accurate: sampling loop or ground reference?
Assume layout parasitics changed the measurement reference. First measure the metering input noise floor (at the AFE/ADC pins) and the sensitive-domain rail ripple/ground bounce under the same load condition. If noise tracks switching states and rail ripple, the ground/return reference is compromised; if noise changes mainly with routing distance/loop geometry, the sampling loop and its coupling are dominant. Implement the smallest fix first (loop area, reference continuity, shielding distance), then confirm by repeating the same calibration points and logging drift deltas.
9) Readings go low at high current: CT saturation or shunt thermal drift?
Use waveform shape versus time/temperature correlation. First observe the sensor output shape at peak current and capture a temperature proxy (or time-under-load) alongside the measurement drift. A sudden compression/flattening at peaks suggests CT saturation or magnetic bias; a gradual drift correlated with temperature rise suggests shunt self-heating/thermal coefficient effects. Apply the appropriate fix: avoid magnetic saturation conditions for CT paths, or reduce shunt heating and improve thermal modeling for shunt paths. Record peak, duration_ms, and temperature bin in the log.
10) Metering is suspected of tampering: minimal anti-tamper sensing + log fields?
Keep the design auditable with a minimal, deterministic loop. First select one tamper signal class (cover-open, magnetic bias, or bypass/reverse signature) and one plausibility statistic (waveform anomaly score or coefficient CRC integrity). If the signal asserts without a corresponding log record, the recorder chain is not trustworthy; if logs exist without fw_version/config_crc and record CRC, forensic attribution fails. Minimum fields: tamper_type, score, coeff_crc, seq_no, and timestamp/rtc_state. Tie each tamper event to an evidence snapshot.
11) Frequent power loss causes “last event missing”: choose FRAM or add hold-up?
Start from the time budget: can the last record be committed before rails collapse? First measure BOR early-warning-to-reset time and estimate worst-case record write time with two-phase commit. If the time window is too short, hold-up energy (supercap/backup rail) or earlier commit is mandatory; if the window is sufficient but records still drop, move to deterministic storage such as FRAM. Practical parts for architecture discussion: SPI FRAM FM25V02/FM25V10 or I²C FRAM MB85RC256V plus a supervisor like TPS3839 to stabilize reset behavior. Verify by seq_no continuity across power cycles.
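The time-budget check can be worked as simple arithmetic. The sketch below assumes an ideal capacitor feeding a constant-power load, so the usable energy is E = ½·C·(V_start² − V_min²) and the hold-up time is E / P. It ignores ESR, converter efficiency, and leakage, all of which shorten the real window (ESR in particular for supercaps). The function names, the safety margin of 2, and the example values are hypothetical.

```python
def holdup_time_s(c_farads, v_start, v_min, load_w):
    """Ideal hold-up time of a capacitor discharging into a constant-power load."""
    if v_start <= v_min or load_w <= 0 or c_farads <= 0:
        raise ValueError("need v_start > v_min and positive C and load")
    energy_j = 0.5 * c_farads * (v_start ** 2 - v_min ** 2)
    return energy_j / load_w


def commit_fits(window_s, worst_write_s, margin=2.0):
    """True if the worst-case record write fits in the window with margin."""
    return worst_write_s * margin <= window_s


# Hypothetical example: 100 uF on a 3.3 V rail, usable down to 2.7 V,
# with a 33 mW load (about 10 mA at 3.3 V).
window = holdup_time_s(100e-6, 3.3, 2.7, 0.033)
```

With these assumed numbers the ideal window is roughly 5.5 ms, so a 1 ms worst-case write fits with a 2x margin while a 5 ms write does not. Measured BOR early-warning-to-reset time should replace the estimate whenever it is available.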
12) Use “two waveforms + one log” to tell ESD vs EFT vs power collapse: how?
Use a minimal reconstruction method. First capture (1) a connector/interface spike proxy and (2) the most sensitive internal rail waveform, then export the last K log records with reset_cause and seq_no. If spikes are large but rails remain stable, the aggressor is likely ESD/EFT coupling; if rails dip or BOR triggers, power collapse dominates. Differentiate ESD versus EFT by repetition patterns and event counters: single sharp hits versus repeated bursts. Confirm by aligning event_id timestamps/seq_no with the waveform windows.
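The decision order in this answer can be sketched as a small classifier. The boolean inputs are assumed to come from the two waveform captures and the log dump; the thresholds behind "spike_large" and "rail_dipped" are deliberately left to the caller, and the single-versus-repeated rule is a simplification.

```python
def classify(spike_large, rail_dipped, bor_triggered, hits_in_window):
    """Rough first-pass label from two waveforms plus one log dump."""
    # Rail evidence is checked first: a dip or BOR means power collapse dominates.
    if rail_dipped or bor_triggered:
        return "power collapse"
    if not spike_large:
        return "inconclusive"
    # Rails stable with a large interface spike: separate by repetition pattern.
    if hits_in_window > 1:
        return "EFT (repeated bursts)"
    return "ESD (single hit)"
```

The label is only a starting direction; it should be confirmed by aligning seq_no with the waveform windows as described above.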