Link Health & Black-Box: CRC/Drop Logging for Ethernet Forensics

Q: CRC looks “high” but throughput still looks normal — denominator/window artifact or real bursts?

Likely cause: mismatch in denominator (per-frame/per-byte/per-time) or window (avg hides burst); mixed layers (PHY vs MAC vs switch). Quick check: compare {CRC_avg_ppm, window=60s} vs {CRC_peak_ppm, peakWindow=5s} and confirm denominator = frames (FCS-checked). Fix: publish a Metric Spec: {name, scope(layer/port), increment_condition, denominator, window, reset_rule}; add peak-window reporting. Pass criteria: CRC_peak_ppm < X1 (peakWindow=Y1 s) AND CRC_p95_ppm < X2 (slidingWindow=Y2 s), per-port.

Q: Drop counters rise, but packet capture shows “no loss” — ingress/egress definition or mirror tap point?

Likely cause: drop counted at {ingress/queue/policer} but capture tap is {egress}; or capture sampling misses micro-bursts. Quick check: read {ingress_drop, egress_drop, queue_drop} per-port in the same window; record {mirror_location} and {capture_rate}. Fix: require drop attribution fields in the evidence bundle: {drop_stage, queue_id, reason_code}; standardize capture tap placement. Pass criteria: attribution_coverage=100% AND |drop_observed(drop_stage) − drop_counter(drop_stage)| / denominator < X% over Y minutes.

Q: CRC spikes only during temperature ramp — time alignment issue or power/clock event correlation?

Likely cause: temp sampling not aligned to counter window; missing {clock_status/sync_state} makes correlation ambiguous. Quick check: align timelines on one timebase: {counter_ts, event_ts, temp_ts, power_ts}; compute Δt between “temp step” and “CRC burst”. Fix: evidence bundle must include {timebase_type, uptime, clock_status, sync_state, offset_samples}; apply consistent windowing. Pass criteria: alignment_error_ms=|Δt| < X ms AND correlation_reproducibility ≥ X% across N ramps (same window/denominator).

Q: Link flaps (up/down) but no power alarm — autoneg retry/EEE transitions or reset source missing?

Likely cause: unlogged link-state transitions ({autoneg retry, EEE enter/exit}) or silent reset ({driver reset, watchdog}). Quick check: inspect event log around each flap: {link_down, link_up, autoneg_retry, EEE_transition, reset_reason} with timestamps. Fix: enforce event taxonomy with minimum fields: {timestamp, severity, source, context}; log reset_reason and driver_reset explicitly. Pass criteria: event_missing_rate=0 for flap windows AND flap_rate ≤ X / hour in the defined operating mode.

Q: Only one port misbehaves — per-port counters missing or global aggregation masking the fault?

Likely cause: only global counters are read; port mapping {port_id ↔ phy_addr} inconsistent; per-queue/per-port attribution absent. Quick check: collect per-port {CRC_ppm, drop_rate, flap_count} for all ports in the same window; verify port map checksum/version. Fix: store port map in the evidence bundle: {port_id, phy_addr, role, speed}; default dashboards to per-port view. Pass criteria: time_to_isolate_port < X minutes AND per-port anomaly explains ≥ X% of the global deviation.

Q: Reboot “fixes it”, but the issue returns hours later — ring buffer overwrite or retention policy losing evidence?

Likely cause: ring buffer too shallow; trigger too strict/too noisy; retention purge removes pre-failure context. Quick check: review {overwrite_count, retention_days, snapshot_success} and confirm pre/post windows cover symptom lead-up. Fix: apply two-tier logging: {low-rate health sampling} + {triggered snapshot}; tune {threshold, cooldown, pre/post window}. Pass criteria: evidence_preservation_rate ≥ X% over N recurring incidents AND overwrite_count=0 within retention window.

Q: “ESD test passed”, but the link later becomes more fragile — what degradation signature is fastest to check?

Likely cause: baseline drift: higher {CRC baseline}, higher {flap/hour}, stronger {temperature sensitivity} even if pass/fail still passes. Quick check: compare pre/post test baselines: {CRC_avg_ppm, flap_rate, peak_burst_ppm, P95} with identical window and mode. Fix: add trend snapshots to the bundle: {daily_baseline, peak_window}; tag events with {test_phase, test_id} for traceability. Pass criteria: baseline_shift_ppm ≤ X and flap_rate_increase ≤ X% after test (same config, same window).

Q: PoE-powered system drops sporadically — what is the first time-alignment to run (PoE fault vs link flap)?

Likely cause: {PoE port state change, power-limit, UV} triggers reset or link renegotiation; event timing is not correlated to counters. Quick check: align {PoE_fault_ts, brownout_ts, reset_ts} with {link_down_ts, CRC_burst_ts} on the same timebase. Fix: log PoE/PoDL states as typed events with {timestamp, port_id, reason_code}; include power telemetry metadata. Pass criteria: alignment_error_ms < X ms AND co-occurrence_rate(PoE_fault, link_down) ≤ X% after mitigation.

Q: Fails only in a specific operating mode — how should triggers be written to catch it (pre/post window)?

Likely cause: trigger uses only averages; burst/derivative conditions are missing; cooldown mis-tuned (either floods or misses). Quick check: replay history: test rules {metric>X for Y s} and {Δmetric>X within Y s}; measure hit/miss/false-positive rates. Fix: define trigger schema: {signal, threshold, duration, derivative, cooldown, preWindow, postWindow}; require pre/post capture. Pass criteria: false_positive ≤ X/day AND miss_rate ≤ X% with preWindow ≥ X s and postWindow ≥ X s.

Q: Same issue behaves differently across firmware versions — which three version fields are most often missing?

Likely cause: evidence bundle lacks {fw_version}, {config_profile_id}, {counter_schema_version/driver_build_id} → no valid A/B comparison. Quick check: validate bundle envelope completeness: required_fields_present? {fw, config, schema} + {device_id, uptime, timebase}. Fix: enforce “ticket envelope” schema: export must fail fast if required fields are missing; version fields become non-optional. Pass criteria: field_completeness=100% AND schema_validation_pass=100% across exported incidents.

← Back to: Industrial Ethernet & TSN

Link Health & Black-Box turns intermittent Ethernet faults into reproducible evidence.

It standardizes counters, event logs, and time correlation so each incident exports a minimal evidence bundle that supports fast root-cause and measurable verification.

H2-1 · What “Link Health & Black-Box” Means (Scope + Deliverables)

Scope guard (to prevent cross-page overlap)

In scope: link-health counters + event timeline + temperature/power correlation + an exportable evidence bundle.
Out of scope: magnetics/CMC selection, mode suppression design, PoE center-tap clamping layout details (capture as events only; design details belong to the protection pages).

Related pages for design details: Magnetics & Common-Mode Chokes · PoE / Data Co-Design

Definition

Link up answers “does it connect now?” while link health answers “does it remain stable across time, environment, and load?” A practical link-health model treats the link as a rate of badness (CRC/error bursts, drop bursts, flap bursts) rather than a binary state.

Key distinction (field-ready)

Stability: low error/drop rate over long windows (minutes → hours).
Robustness: error rate does not spike under temperature/power/EMI events.
Recoverability: when incidents happen, evidence survives and RCA is repeatable.

Why it matters

Industrial Ethernet failures are often intermittent: short bursts that vanish before a technician arrives. A black-box converts “cannot reproduce” into a time-aligned incident record, enabling fast triage and consistent fixes.

Typical field symptoms (3 high-frequency cases)

Intermittent packet loss: throughput “looks fine” but real-time control misses deadlines.
Occasional link flaps: short up/down cycles caused by marginal conditions.
CRC spikes tied to environment: temperature ramps or power droops correlate with error bursts.

Pass criteria template (use as acceptance gate)

Error rate stays within X per 10⁶ frames over Y minutes at nominal conditions.
During stress events (temp/power), bursts are captured with pre/post windows and exported successfully.
Incidents are reproducibly classified (drop vs CRC vs flap) with consistent metric definitions.

Deliverables (Minimum Evidence Bundle)

A field-ready black-box is not “lots of logs”. It is a structured evidence bundle that makes RCA repeatable across teams and sites.

Bundle contents (MVP)

Counters snapshot: CRC/FCS + drops + link flaps + autoneg/EEE transitions (layer-tagged).
Event log timeline: link/power/thermal/reset/PoE events (status only, no layout details).
Environment: temperature, supply rails status, brownout flags, uptime.
Configuration: speed/duplex/EEE state, port role, queue/policer mode, firmware version.
Time base: monotonic timestamp + wall-clock (optional PTP state recorded as metadata).

What this enables (RCA loop)

Classify incidents by signature: CRC-burst vs drop-burst vs flap-burst.
Correlate with environment: temperature ramps / power events / external disturbances.
Produce a repeatable “fix + verify” checklist with clear pass criteria (X).

Diagram: Field incident → observability → evidence bundle → RCA → fix/verify (minimal text, box-first)

H2-2 · Metrics Map: Counters You Must Have (PHY / MAC / Switch / Stack)

A useful metrics map answers two questions: what to collect and how to interpret. To keep the system field-friendly, counters are grouped by layer and by incident signature.

Error counters (CRC/FCS, symbol, alignment) Drop counters (RX drop, queue drop, policer drop) Volatility counters (flap, autoneg retry, EEE transitions)

Layer 1 — PHY (signal-to-bits)

Must-have (MVP)

Symbol/PCS errors: increment when decoding fails at the PHY/PCS boundary.
Link up/down count: flaps per hour (burst detection uses a short window).
Autoneg retries (if present): negotiation loops reveal marginal channels.

Optional (high value when available)

SNR/quality hint: any vendor “link quality” indicator should be stored with a timestamp.
EEE transitions: enter/exit counts help explain latency spikes or wake-related glitches.

Layer 2 — MAC / Controller (bits-to-frames)

Must-have (MVP)

FCS/CRC errors: frames failing integrity checks at MAC receive.
RX drops: descriptor exhaustion, buffer overrun, or policy drops (must be distinguishable if possible).
TX retries / underrun: reveals starvation or bus contention, often misread as “link issue”.

Optional (high value)

Queue occupancy / watermark: supports “drop burst” root-cause classification.
Interrupt/coalescing stats: helps separate CPU scheduling artifacts from true link corruption.

Layer 3 — Switch / Bridge (frames-to-queues)

Must-have (MVP)

Per-port drops: ingress/egress drop counters (must indicate direction).
Storm / discard reason: if the switch provides drop reason classes, store the class id.
Link flaps per port: preserves “one bad port” signatures.

Optional (high value)

Queue depth histogram: supports “congestion vs corruption” separation.
Mirror/telemetry counters: use summary stats; full packet capture is not required as MVP.

Layer 4 — Stack / Driver (frames-to-app)

Must-have (MVP)

Socket/driver drops: distinguish “stack drop” from “wire drop”.
Retry/timeouts: store counts and the time window used for rate calculation.
CPU load / scheduling hint: minimal metadata to flag starvation incidents.

Optional (high value)

Per-flow or per-endpoint rollup: identifies a single bad talker without capturing payload.
Latency sampling hooks: keep it as summary stats (P50/P95) to avoid heavy logging.

4-field discipline (required for every metric)

Without strict definitions, different tools will show “conflicting” numbers. Each counter must ship with a small spec:

Definition

When does it increment? Which frames/ports/roles are included?

Denominator

Per frame? Per byte? Per second? “Rate” must declare the denominator.

Window

Instant / sliding window / peak window. Bursts require a short window plus a long window.

Unit

Count, rate, ppm, or percentage. Choose one and stick to it.

Diagram: Layered metrics map (PHY → MAC → Switch → Stack) with representative counters

H2-3 · Counter Hygiene: Definitions, Denominators, Windows

Counter numbers become actionable only when they are comparable across devices, tools, and time. This section standardizes definitions, denominators, and windowing so “the numbers look wrong” becomes a solvable engineering issue.

Common mismatches (why numbers disagree)

1) Scope mismatch

Symptom: one device shows CRC=0 while another shows CRC bursts on the same incident.
Quick check: per-port? per-direction (ingress/egress)? frame class filtered?
Fix: enforce port + direction tagging; define included frame classes explicitly.

2) Denominator mismatch

Symptom: error “rate” differs by 10×–100× between dashboards.
Quick check: per-frame vs per-byte vs per-time? mixed denominators?
Fix: standardize: errors per 10⁶ frames (or per second) and document it.

3) Window mismatch

Symptom: long-window averages look fine, but real-time control still fails.
Quick check: only one long window reported? no peak or burst metric?
Fix: always publish short + long windows and a burst/peak indicator.

4) Reset / rollover mismatch

Symptom: counters “jump” or clear unexpectedly after a short incident.
Quick check: boot reset vs link-session reset? counter width and wrap?
Fix: log boot_id/uptime, reset reason, and wrap detection flags.

Metric spec template (minimum fields)

Every counter should ship with a short “spec” so different teams do not invent their own definitions.

Metric ID / Name: stable name for long-term trend and version migration.
Layer: PHY / MAC / Switch / Stack (matches the layered map).
Scope: port + direction + included frame classes.
Increment condition: one sentence “when X happens, +1”.
Reset condition: boot / link-down / manual clear / session reset.
Denominator: per-frame / per-byte / per-time (choose one).
Window policy: short + long + peak / P95 (fixed output set).
Units: count, rate, ppm, or percentage (one canonical unit).
Rollover behavior: width + wrap detection + reporting rule.
Validation hook: how to inject and verify the counter increments correctly.

Windowing rules (burst-proof reporting)

Default window set (recommended)

Instant: quick “now” sanity check.
Short window: 1s–5s for burst detection.
Long window: 5–60 minutes for stability trends.
Peak / P95: captures the worst user experience hidden by averages.

Which metrics require peak / P95

CRC / FCS rate: peak window catches short corruption storms.
Drop rate: peak window separates “burst drop” from mild congestion.
Flap rate: peak reveals repeated up/down cycles in minutes.

Pass criteria (threshold placeholders)

Pass criteria must be expressed with window and denominator. Use placeholders (X, Y) and bind them to the system class.

CRC_rate_5min: ≤ X per 10⁶ frames over Y minutes.
Drop_rate_5min: ≤ X per 10⁶ frames; burst count reported separately.
Flap_count_60min: ≤ X per hour per port.
P95(CRC_rate_5s): ≤ X per 10⁶ frames (captures hidden bursts).
Evidence completeness: bundle includes the same windows + timebase metadata for replay.

Diagram: Same burst → different denominators/windows → different reported numbers

H2-4 · Event Taxonomy: What to Log Besides Counters

Counters show what is happening; events explain when and under what conditions. A black-box requires an event dictionary with minimal fields so incidents can be correlated and replayed.

Minimum event fields (required)

timestamp: monotonic time (and optional wall-clock).
severity: info / warn / critical.
source: phy, mac, switch, psu, thermal, poe, firmware.
event_id: stable identifier for parsing and long-term compatibility.
context: small key/value set (port, speed, rail_id, temp_c, state).

Protection and PoE events record status only (flags, states, limits). Design details belong to the protection pages.

Link events

Examples: link up/down, flap burst, autoneg retry, EEE enter/exit.
Context: port, speed/duplex, role, link partner id (if available).

Power events

Examples: UV/OV, brownout, rail drop, reset cause.
Context: rail_id, min/max voltage (if available), reset_reason, uptime.

Thermal events

Examples: board temperature threshold crossing, thermal throttle flag.
Context: temp_c, sensor_id, throttle_state, fan state (if present).

Protection events (status only)

Examples: ESD flag, surge detect flag, protection fault latch.
Context: port, fault_code, latch_state, test_mode flag.

PoE / PoDL events (status only)

Examples: detect/class, power limit, fault/overcurrent, thermal limit.
Context: class, power_w, limit_state, fault_code, port.

Diagram: Event log taxonomy tree (minimal text, many box elements)

H2-5 · Time Correlation: Single Timebase for Forensics

Forensics fails when counters, events, and environmental logs cannot be aligned on a single axis. Use a monotonic timeline as the primary source of truth, and attach optional wall/PTP time with explicit sync state.

Timebase options (rules of use)

Required: Monotonic time

Purpose: stable ordering and duration, immune to wall-clock jumps.
Strength: always available, consistent across re-sync events.
Failure mode: resets on reboot, so boot_id + uptime must be logged.

Recommended: Wall clock (RTC/system)

Purpose: human-readable timestamps for field service workflows.
Strength: ties logs to tickets, screenshots, and external systems.
Failure mode: can jump during set-time; must record time_jump and offset.

Optional: PTP time

Purpose: cross-device alignment when PTP is available.
Strength: enables multi-node correlation with a shared epoch.
Failure mode: quality depends on sync; must log sync_state and clock_status.

Drift handling (what to record)

Calibration point: record (monotonic_ts, wall_ts/ptp_ts, offset_ms) at X-minute intervals (X = system class policy).
Sync state change: log every transition (locked / holdover / free-run / unknown) with monotonic_ts.
Time jump detection: if wall/PTP jumps by ≥ X ms, emit a time_jump event and keep monotonic ordering.
Reboot continuity: attach boot_id + uptime to every record so timelines can be stitched after resets.

The goal is not perfect time, but explicit clock quality so every timestamp can be trusted (or discounted) correctly.

What to include in the bundle (minimum)

boot_id + uptime_ms (required for continuity).
monotonic_ts for every counter and event record (primary axis).
clock_source (monotonic / rtc / ptp) and clock_status (locked/holdover/free-run).
sync_state (PTP/RTC validity flags and state changes).
offset_records (last N calibration points: monotonic ↔ wall/PTP).
time_jump_flag + jump magnitude if detected.

Missing boot_id/uptime/monotonic_ts makes the bundle non-forensic and should fail validation.

Diagram: Counters + Events + Temp/Power aligned to one monotonic axis

H2-6 · Sampling & Triggers: Ring Buffer, Snapshot, “Flight Recorder”

A black-box must be lightweight during normal operation yet preserve evidence when an incident happens. Use a two-level strategy: baseline sampling for health, plus triggered snapshots backed by a ring buffer.

Baseline sampling (lightweight health)

What: windowed CRC/drop/flap + temp/power state + clock sync state.
How often: every X seconds/minutes (policy by system class).
Cost guard: record size ≤ X bytes; daily budget ≤ X.

Trigger rules (examples, status-only)

Trigger A · CRC rate burst

Condition: CRC_rate_5s ≥ X per 10⁶ frames.
Debounce/Cooldown: require X consecutive samples; cooldown X s.
Snapshot scope: counters + event log + clock status + temp/power states.

Trigger B · Link flap storm

Condition: flap_count_60s ≥ X per port.
Debounce/Cooldown: ignore repeats within X s after a snapshot.
Snapshot scope: port state + negotiation/EEE status + recent error windows.

Trigger C · Environment shock (status)

Condition: temp_jump ≥ X °C in X s, or brownout/PoE fault flag asserted.
Debounce/Cooldown: one snapshot per X s per port/rail.
Snapshot scope: clock sync state + counters + event dictionary entries around the flag.

Triggers treat PoE/Protection as status sources only and do not expand into hardware design details.

Ring buffer design (pre/post windows)

Indexing: use monotonic_ts so pre/post extraction is deterministic.
Pre-trigger window: keep last X seconds for “lead-up”.
Post-trigger window: keep next X seconds for “aftermath”.
Size budget: bundle cap X KB/MB; define overwrite policy for non-critical records.
Freeze rule: on trigger, freeze pointers and seal an incident_id for export.

Pass criteria (evidence preservation)

Seal latency: trigger → bundle sealed ≤ X seconds.
Pre-window coverage: ≥ X% of the configured pre seconds preserved.
Post-window coverage: ≥ X seconds captured after trigger.
Uniqueness: incident_id unique; bundle binds boot_id + monotonic_ts range.

Diagram: Ring buffer with pre-trigger and post-trigger windows (flight recorder)

H2-7 · Evidence Bundle: What “One Incident” Should Contain

A usable black-box is defined by a single incident bundle: a bounded, exportable object that contains identity, configuration context, and replayable evidence.

Identity (traceability)

incident_id (unique), device_id, port_id.
boot_id + uptime_range (stitch across resets).
firmware_version + schema_version (long-term decoding compatibility).
capture_reason (trigger rule ID) + timebase status (clock_status/sync_state).

Context (configuration snapshot)

Link: speed/duplex, autoneg mode, EEE enable, flow-control enable.
Queues (status-only): queue_id, drop_policy name, shaping_enabled flag.
Sampling: baseline_period, pre_window, post_window, debounce/cooldown.
Metric hygiene: denominator and window set identifiers for every counter series.

Evidence (replayable minimum)

Key counters (windowed)

CRC/FCS error windows (short/long/peak), drop windows (RX/queue), link flap windows.
Autoneg retries, EEE transitions (state changes), and selected “top offenders” per port.

Event log slice

timestamp (monotonic), severity, source, context key-value list.
Include the last N link state changes (up/down/negotiation/EEE state).

Environment (status-only)

temp range, power rail flags (UV/OV/brownout), reset reason flags.
clock status and sync state for the incident time range.

Optional packet summary (no full capture required)

pcap_hash or payload hash, ethertype histogram, length histogram, peak frame-rate.
Keep only summaries that support “incident happened” without large storage cost.

Output A · Summary (one screen)

incident_id port trigger time_range

Top anomalies: CRC burst / drop burst / flap storm (auto-ranked).
Environment: temp range + power flags + clock status.
Next check hint: correlation pointers (e.g., “errors align with brownout flag”).

Output B · Details (engineering replay)

Counters series: short/long/peak windows for CRC/drop/flap, with denominators and window IDs.

Event sequence: ordered by monotonic_ts, includes severity/source/context.

Config snapshot: speed/EEE/flow-control/queue flags and sampling settings for the incident.

Output C · Raw (structured export)

Raw counters dump (units, denominators, windows, timestamps).
Raw event records (structured key-value or TLV).
Clock calibration points and offset records for correlation.
Integrity fields (chunk_crc, bundle_hash) for verification.

Diagram: One incident as a single evidence bundle (Summary / Details / Raw)

H2-8 · Storage, Retention, and Integrity

Reliability comes from power-fail safe commit, a clear retention policy, and integrity checks that can detect corruption or tampering.

Storage options (strategy-only)

High-frequency small writes: prefer low-latency NVM-class storage for commit metadata.
Large bundles / long retention: use block storage with wear-aware policies.
Hybrid approach: metadata/journal on fast NVM, payload on bulk storage.

Power-fail safe write (atomic commit)

Prepare journal: write a header with incident_id, schema_version, and payload size placeholders.
Write payload chunks: append data with per-chunk CRC for local fault isolation.
Seal hash: compute bundle_hash (e.g., SHA-256) and store it in the header region.
Flip valid bit (A/B header): commit by switching the active header atomically.
Verify on load: reject non-committed bundles and mark corrupted records without blocking others.

Retention policy (prevent useless overwrite)

Severity buckets: critical / warn / info with separate caps (N incidents or M MB).
Export gate: exported bundles can be moved to a secondary tier or become overwrite-eligible.
Write control: limit snapshot storms with cooldown and de-dup by incident fingerprint.

Integrity checks (verify and detect corruption)

chunk_crc: detect localized corruption without scanning the entire bundle blindly.
bundle_hash: verify full-bundle integrity after reconstruction.
schema_version: prevent mis-decoding due to format drift.
Optional escalation: signing/key management belongs to the Security page when required.

Diagram: Ring buffer → Snapshot → Atomic commit → Export (power-fail safe)

H2-9 · Remote Export & Field Workflow (forensics pipeline)

A repeatable forensics pipeline requires two things: a consistent bundle format and a field workflow that preserves context from collection to RCA and verification.

Export channels (transport only)

Interfaces are treated as transport. Every channel must export the same incident bundle schema and ticket envelope.

Local interactive

CLI / Web UI export for cabinet-side service.
Offline-friendly: export to a file without relying on upstream connectivity.
Guarantee: identical bundle schema, identical hashing fields.

Remote management

Management-plane export (e.g., CLI automation / REST / management protocols).
Batch pull: retrieve bundle index first, then selected incident bundles.
Guarantee: export bundles or bundle indices only, not full captures.

Discovery assist

Topology/port identification and field labeling support (e.g., discovery protocols).
Use for: “which device/port is the failing link” and “where to attach a ticket”.
Boundary: discovery data supplements the ticket envelope; it does not replace evidence.

Ticket envelope (work-order wrapper)

Exported evidence should be wrapped by a ticket envelope so multiple incidents remain traceable and comparable.

Ticket: ticket_id, site_id, operator_role/shift.
Context: symptom_tag, service_window, notes (structured).
Asset: device_id, port_id, firmware_version.
Evidence: incident_id_list, bundle_count, bundle_hash_list.

Field SOP (collection → RCA → verification)

Identify: confirm device/port and assign symptom_tag.
Freeze window: record the affected time range and timebase status.
Export: export recent top incidents + current summary snapshot.
Annotate: fill ticket envelope fields and structured现场变更 notes.
Transfer: upload via gateway/server; keep offline copy if needed.
Replay: correlate counters/events/env timelines per incident_id.
Infer: produce evidence-backed hypotheses (correlations only).
Verify: run regression steps and check metrics return within threshold X.

If network is unavailable: export offline → store locally → upload later; preserve hashes and ticket envelope.

If too many incidents: export index + top offenders + last N; avoid uncontrolled bundle explosion.

If time is not synchronized: correlate by monotonic timeline and keep sync_state in the envelope.

RCA report template (structured)

1) Symptoms: impact scope, frequency, triggers, affected ports.

2) Timeline: counters/events/env aligned to one axis.

3) Evidence: incident_id references + key metric windows.

4) Inference: evidence-backed correlations and hypotheses only.

5) Verification: regression steps + pass criteria threshold X.

6) Conclusion: fix action, residual risks, monitoring plan.

Diagram: Device → Gateway → Server → RCA → Fix/Verify (forensics pipeline)

H2-10 · Engineering Checklist (Design → Bring-up → Production)

A practical black-box must pass three gates: Design, Bring-up, and Production. Each checklist item should be verifiable and includes a pass criteria placeholder X.

Design Gate

Metric spec completeness: name/scope/increment/reset/denominator/windows. Pass criteria: X
Trigger rules: thresholds + cooldown + de-dup fingerprint. Pass criteria: X
Bundle schema frozen: schema_version + required fields. Pass criteria: X
Summary/Details/Raw defined: one-screen summary and structured export. Pass criteria: X
Timebase policy: monotonic main axis + sync_state fields. Pass criteria: X
Storage budget: baseline + snapshot volume ceilings. Pass criteria: X
Retention buckets: critical/warn/info with overwrite rules. Pass criteria: X
Ticket envelope fields: ticket_id + site + symptom + hashes. Pass criteria: X

Bring-up Gate

Counter alignment: consistent windows/denominators across endpoints. Pass criteria: X
CRC burst injection: trigger latency and pre/post capture coverage. Pass criteria: X
Link flap injection: flap counts and negotiation transitions captured. Pass criteria: X
Drop burst injection: queue drops visible and ranked correctly. Pass criteria: X
Power-fail tests: non-committed records discarded; committed records readable. Pass criteria: X
Time discontinuity tests: monotonic stitching + sync_state labeling. Pass criteria: X
Trigger storm control: cooldown and de-dup prevent bundle explosion. Pass criteria: X
Export path validation: offline or remote export success with hash verification. Pass criteria: X

Production Gate

Default thresholds frozen: manufacturing config consistency. Pass criteria: X
Compression/limits: bounded CPU/IO overhead with low evidence loss. Pass criteria: X
Regression suite: fault injection + normal load scenarios each release. Pass criteria: X
Schema compatibility: multi-version decoding success rate. Pass criteria: X
Tool mapping frozen: field changes require a controlled change process. Pass criteria: X
Field SOP adoption: envelope completeness and reporting compliance. Pass criteria: X
Retention audit: critical exports traceable and not overwritten prematurely. Pass criteria: X
Fix→Verify closure: metrics return within threshold after remediation. Pass criteria: X

Diagram: Three gates checklist (Design → Bring-up → Production)

H2-11 · Applications (Where Black-Box Saves You Most)

Use cases focus on why evidence matters and what the black-box should capture, without expanding into magnetics/EMI design or protocol deep dives.

A) Long cable + noisy plant: “works on bench, fails in the field”

Symptom: CRC bursts, sporadic drops, link flap under motor/relay switching.
What to log: PHY error counters (CRC/FCS, symbol/alignment), link-up/down timeline, autoneg retry/EEE transitions, plus temp & rail events.
What it reveals: correlation between error bursts and environmental/power events, and whether faults are link-quality vs buffer/queue artifacts.
Example silicon building blocks (part numbers): Marvell Alaska 88E1512/88E151x (PTP + VCT cable test feature family), Microchip VSC8574 (multi-port GbE PHY w/ IEEE 1588), TI DP83640 (IEEE 1588 hardware timestamp PHY), Microchip KSZ9567R (managed switch w/ IEEE 1588 support).

B) PoE / PoDL systems: “power events and link symptoms are coupled”

Symptom: link drops only when load steps, inrush, or port power limit events occur.
What to log: brownout/UV/OV flags, PD class / PSE port state, power-limit / fault events, plus link counters around the same timestamp window.
What it reveals: whether drops are caused by power collapse, port power cycling, or unrelated data-plane congestion.
Example power controllers (part numbers): TI TPS23881 (802.3bt PSE controller), TI TPS2373 (high-power PoE PD interface), Analog Devices LTC4291 (802.3bt PSE controller), Skyworks Si3406 (PoE PD).

C) TSN / deterministic traffic: “rare jitter or drops break closed-loop control”

Symptom: end-to-end latency spikes, occasional missed cycles, or micro-bursts causing queue drops.
What to log: per-queue drops, shaping/gate schedule state changes, timestamp quality, and port-level CRC/error counters.
What it reveals: whether the issue is traffic shaping / scheduling vs physical link integrity.
Example TSN-capable switches (part numbers): NXP SJA1105 (AVB/TSN switch family), Microchip LAN9662 (TSN GbE switch with integrated CPU), Microchip KSZ9567R (managed switch w/ IEEE 1588/AVB hooks).

D) Single-Pair Ethernet long reach: “intermittent margin over long runs”

Symptom: occasional retries/drops at certain temperature or cable routing conditions.
What to log: link-state transitions, error counters, and environment (temp/power) around incidents.
What it reveals: whether failures are driven by temperature drift, power events, or gradual link-quality degradation.
Example SPE PHY (part number): Analog Devices ADIN1100 (10BASE-T1L).

Diagram: Symptom → Evidence → Conclusion (field-friendly storyboard)

Reading tip: the black-box is judged by whether one incident becomes reproducible evidence, not by raw counter volume.

H2-12 · IC Selection Logic (What Features to Demand)

Selection is based on fields + verification criteria. Parts listed are example building blocks (not recommendations and not sales).

1) Readable, layered counters (PHY / MAC / Switch)

What: counters must be accessible per port, with clear reset rules and stable naming.
Why: without layered attribution, “drops” cannot be separated into link-quality vs queue/policer artifacts.
How to verify: read counters at idle, inject a controlled burst, confirm monotonic increments and consistent denominators (packets/bytes/time).
Example silicon (part numbers): Microchip KSZ9567R (managed switch counters + IEEE 1588), Microchip LAN9662 (TSN switch with integrated CPU and switch telemetry), Marvell Alaska 88E1512 family (PHY diagnostics feature family).

2) Time correlation hooks (PTP hardware timestamp or equivalent)

What: hardware timestamping (preferred) or standardized time markers to anchor counters/events to a single timebase.
Why: forensics requires aligning “CRC burst” with “power dip” or “thermal ramp” on the same axis.
How to verify: generate periodic sync events; confirm timestamp monotonicity across reboots and quantify drift / offset capture in the evidence bundle.
Example PHYs / switches (part numbers): TI DP83640 (IEEE 1588 hardware timestamp PHY), Microchip VSC8574 (IEEE 1588-capable multi-port PHY), Marvell Alaska 88E1510/88E1512/88E1514 family (PTP timestamping), NXP SJA1105 (AVB/TSN switch family).

3) Event sources (temp / power / reset / PoE port state)

What: hardware-visible events with timestamped severity and context (no deep analog design discussion).
Why: link health anomalies are often driven by environment and supply integrity, not by packet rate alone.
How to verify: sweep temperature and load; induce controlled brownout; confirm events appear with correct ordering and no missing windows.
Example sensors / PoE controllers (part numbers): TI TMP117 (digital temperature sensor), TI INA226 (I2C current/power monitor), TI TPS23881 (PoE PSE), TI TPS2373 (PoE PD), Analog Devices LTC4291 (PoE PSE).

4) Snapshot capability + nonvolatile storage options

What: ring buffer + triggered snapshot, with power-fail-safe commit and versioned record format.
Why: incidents are rare; evidence must survive reboot and power loss without corruption.
How to verify: trigger snapshot, then cut power during commit; boot and validate journal/CRC and record continuity.
Example NVM building blocks (part numbers): Infineon FM25V10-G (SPI F-RAM), Fujitsu MB85RS64V (SPI FRAM), Winbond W25Q128JV (SPI NOR flash), Micron MTFC32GBCAQTC-IT (industrial eMMC).

5) Export pathway (field-friendly evidence retrieval)

What: deterministic “export” path to retrieve one incident bundle without full packet captures.
Why: field service depends on consistent artifacts (ID/version/config + summary + raw).
How to verify: export under CPU load; confirm bundle integrity hash and complete metadata fields.
Example integrated platforms (part numbers): Microchip LAN9662 (switch + integrated CPU for local collection/export), Microchip KSZ9567R (managed switch as a telemetry source), Analog Devices ADIN1100 (SPE PHY as a monitored endpoint).

Diagram: Requirements → Candidate → Verification → Pass (selection loop)

Acceptance should require a repeatable injection test that proves: counters increment correctly, snapshots survive power loss, and export produces an intact incident bundle.

Request a Quote

Name

Company

Part Number(s) / BOM

Quantity & Target Lead Time

Alternates Allowed

Temperature Grade

Package / Footprint

Compliance

Budget Window

Lot Size / Qty

Message

Attachment

Accepted Formats

pdf, csv, xls, xlsx, zip

Attachment

Drag & drop files here or use the button below.

H2-13 · FAQs (Field Troubleshooting, Black-Box Evidence Only)

Scope guard: answers close only long-tail field troubleshooting using counters + event log + time correlation + triggers + evidence bundle. No EMC/magnetics/TSN parameter deep dives.

CRC looks “high” but throughput still looks normal — denominator/window artifact or real bursts?

Likely cause: mismatch in denominator (per-frame/per-byte/per-time) or window (avg hides burst); mixed layers (PHY vs MAC vs switch).

Quick check: compare {CRC_avg_ppm, window=60s} vs {CRC_peak_ppm, peakWindow=5s} and confirm denominator = frames (FCS-checked).

Fix: publish a Metric Spec: {name, scope(layer/port), increment_condition, denominator, window, reset_rule}; add peak-window reporting.

Pass criteria: CRC_peak_ppm < X1 (peakWindow=Y1 s) AND CRC_p95_ppm < X2 (slidingWindow=Y2 s), per-port.

Drop counters rise, but packet capture shows “no loss” — ingress/egress definition or mirror tap point?

Likely cause: drop counted at {ingress/queue/policer} but capture tap is {egress}; or capture sampling misses micro-bursts.

Quick check: read {ingress_drop, egress_drop, queue_drop} per-port in the same window; record {mirror_location} and {capture_rate}.

Fix: require drop attribution fields in the evidence bundle: {drop_stage, queue_id, reason_code}; standardize capture tap placement.

Pass criteria: attribution_coverage=100% AND |drop_observed(drop_stage) − drop_counter(drop_stage)| / denominator < X% over Y minutes.

CRC spikes only during temperature ramp — time alignment issue or power/clock event correlation?

Likely cause: temp sampling not aligned to counter window; missing {clock_status/sync_state} makes correlation ambiguous.

Quick check: align timelines on one timebase: {counter_ts, event_ts, temp_ts, power_ts}; compute Δt between “temp step” and “CRC burst”.

Fix: evidence bundle must include {timebase_type, uptime, clock_status, sync_state, offset_samples}; apply consistent windowing.

Pass criteria: alignment_error_ms=|Δt| < X ms AND correlation_reproducibility ≥ X% across N ramps (same window/denominator).

Link flaps (up/down) but no power alarm — autoneg retry/EEE transitions or reset source missing?

Likely cause: unlogged link-state transitions ({autoneg retry, EEE enter/exit}) or silent reset ({driver reset, watchdog}).

Quick check: inspect event log around each flap: {link_down, link_up, autoneg_retry, EEE_transition, reset_reason} with timestamps.

Fix: enforce event taxonomy with minimum fields: {timestamp, severity, source, context}; log reset_reason and driver_reset explicitly.

Pass criteria: event_missing_rate=0 for flap windows AND flap_rate ≤ X / hour in the defined operating mode.

Only one port misbehaves — per-port counters missing or global aggregation masking the fault?

Likely cause: only global counters are read; port mapping {port_id ↔ phy_addr} inconsistent; per-queue/per-port attribution absent.

Quick check: collect per-port {CRC_ppm, drop_rate, flap_count} for all ports in the same window; verify port map checksum/version.

Fix: store port map in the evidence bundle: {port_id, phy_addr, role, speed}; default dashboards to per-port view.

Pass criteria: time_to_isolate_port < X minutes AND per-port anomaly explains ≥ X% of the global deviation.

Reboot “fixes it”, but the issue returns hours later — ring buffer overwrite or retention policy losing evidence?

Likely cause: ring buffer too shallow; trigger too strict/too noisy; retention purge removes pre-failure context.

Quick check: review {overwrite_count, retention_days, snapshot_success} and confirm pre/post windows cover symptom lead-up.

Fix: apply two-tier logging: {low-rate health sampling} + {triggered snapshot}; tune {threshold, cooldown, pre/post window}.

Pass criteria: evidence_preservation_rate ≥ X% over N recurring incidents AND overwrite_count=0 within retention window.

“ESD test passed”, but the link later becomes more fragile — what degradation signature is fastest to check?

Likely cause: baseline drift: higher {CRC baseline}, higher {flap/hour}, stronger {temperature sensitivity} even if pass/fail still passes.

Quick check: compare pre/post test baselines: {CRC_avg_ppm, flap_rate, peak_burst_ppm, P95} with identical window and mode.

Fix: add trend snapshots to the bundle: {daily_baseline, peak_window}; tag events with {test_phase, test_id} for traceability.

Pass criteria: baseline_shift_ppm ≤ X and flap_rate_increase ≤ X% after test (same config, same window).

PoE-powered system drops sporadically — what is the first time-alignment to run (PoE fault vs link flap)?

Likely cause: {PoE port state change, power-limit, UV} triggers reset or link renegotiation; event timing is not correlated to counters.

Quick check: align {PoE_fault_ts, brownout_ts, reset_ts} with {link_down_ts, CRC_burst_ts} on the same timebase.

Fix: log PoE/PoDL states as typed events with {timestamp, port_id, reason_code}; include power telemetry metadata.

Pass criteria: alignment_error_ms < X ms AND co-occurrence_rate(PoE_fault, link_down) ≤ X% after mitigation.

Fails only in a specific operating mode — how should triggers be written to catch it (pre/post window)?

Likely cause: trigger uses only averages; burst/derivative conditions are missing; cooldown mis-tuned (either floods or misses).

Quick check: replay history: test rules {metric>X for Y s} and {Δmetric>X within Y s}; measure hit/miss/false-positive rates.

Fix: define trigger schema: {signal, threshold, duration, derivative, cooldown, preWindow, postWindow}; require pre/post capture.

Pass criteria: false_positive ≤ X/day AND miss_rate ≤ X% with preWindow ≥ X s and postWindow ≥ X s.

Same issue behaves differently across firmware versions — which three version fields are most often missing?

Likely cause: evidence bundle lacks {fw_version}, {config_profile_id}, {counter_schema_version/driver_build_id} → no valid A/B comparison.

Quick check: validate bundle envelope completeness: required_fields_present? {fw, config, schema} + {device_id, uptime, timebase}.

Fix: enforce “ticket envelope” schema: export must fail fast if required fields are missing; version fields become non-optional.

Pass criteria: field_completeness=100% AND schema_validation_pass=100% across exported incidents.

A counter suddenly resets — rollover, reboot, or driver reset? How to distinguish in logs?

Likely cause: rollover not flagged; reset_reason missing; driver reset silently reinitializes stats without a typed event.

Quick check: correlate {uptime jump}, {reset_reason}, {driver_reset_event}, {rollover_flag/high_word} around the reset timestamp.

Fix: define reset semantics: rollover extends with high-word; reboot emits reset event; driver reset emits typed event and splits epochs.

Pass criteria: reset_cause_determinable=100% AND trend_stitch_error ≤ X% across epoch boundaries.

“Health score” is questioned — how to prove it is tied to reproducible tests and thresholds (X)?

Likely cause: score is not auditable: inputs/weights/windows/thresholds and calibration tests are not documented.

Quick check: list score schema: {inputs(counters/events), windowing, normalization, thresholds, calibration_suite}; confirm monotonicity vs injected fault level.

Fix: publish scoring spec + calibration suite mapping; store {score_inputs_snapshot, thresholds_version} in each incident bundle.

Pass criteria: monotonicity=TRUE AND repeatability ≥ X% over N injections; score variance ≤ X under stable conditions.

Link Health & Black-Box: CRC/Drop Logging for Ethernet Forensics

Link Health & Black-Box: CRC/Drop Logging for Ethernet Forensics

H2-1 · What “Link Health & Black-Box” Means (Scope + Deliverables)

H2-2 · Metrics Map: Counters You Must Have (PHY / MAC / Switch / Stack)

H2-3 · Counter Hygiene: Definitions, Denominators, Windows

H2-4 · Event Taxonomy: What to Log Besides Counters

H2-5 · Time Correlation: Single Timebase for Forensics

H2-6 · Sampling & Triggers: Ring Buffer, Snapshot, “Flight Recorder”

H2-7 · Evidence Bundle: What “One Incident” Should Contain

H2-8 · Storage, Retention, and Integrity

H2-9 · Remote Export & Field Workflow (forensics pipeline)

H2-10 · Engineering Checklist (Design → Bring-up → Production)

H2-11 · Applications (Where Black-Box Saves You Most)

H2-12 · IC Selection Logic (What Features to Demand)

Request a Quote

Accepted Formats

Attachment

H2-13 · FAQs (Field Troubleshooting, Black-Box Evidence Only)

Explore

Categories

Get in Touch

Link Health & Black-Box: CRC/Drop Logging for Ethernet Forensics

Link Health & Black-Box: CRC/Drop Logging for Ethernet Forensics

H2-1 · What “Link Health & Black-Box” Means (Scope + Deliverables)

H2-2 · Metrics Map: Counters You Must Have (PHY / MAC / Switch / Stack)

H2-3 · Counter Hygiene: Definitions, Denominators, Windows

H2-4 · Event Taxonomy: What to Log Besides Counters

H2-5 · Time Correlation: Single Timebase for Forensics

H2-6 · Sampling & Triggers: Ring Buffer, Snapshot, “Flight Recorder”

H2-7 · Evidence Bundle: What “One Incident” Should Contain

H2-8 · Storage, Retention, and Integrity

H2-9 · Remote Export & Field Workflow (forensics pipeline)

H2-10 · Engineering Checklist (Design → Bring-up → Production)

H2-11 · Applications (Where Black-Box Saves You Most)

H2-12 · IC Selection Logic (What Features to Demand)

Recommended topics you might also need

Request a Quote

Accepted Formats

Attachment

H2-13 · FAQs (Field Troubleshooting, Black-Box Evidence Only)

Explore

Categories

Get in Touch