Link Health & Black-Box: CRC/Drop Logging for Ethernet Forensics
← Back to: Industrial Ethernet & TSN
H2-1 · What “Link Health & Black-Box” Means (Scope + Deliverables)
- In scope: link-health counters + event timeline + temperature/power correlation + an exportable evidence bundle.
- Out of scope: magnetics/CMC selection, mode suppression design, PoE center-tap clamping layout details (capture as events only; design details belong to the protection pages).
Link up answers “does it connect now?” while link health answers “does it remain stable across time, environment, and load?” A practical link-health model treats the link as a rate of badness (CRC/error bursts, drop bursts, flap bursts) rather than a binary state.
- Stability: low error/drop rate over long windows (minutes → hours).
- Robustness: error rate does not spike under temperature/power/EMI events.
- Recoverability: when incidents happen, evidence survives and RCA is repeatable.
Industrial Ethernet failures are often intermittent: short bursts that vanish before a technician arrives. A black-box converts “cannot reproduce” into a time-aligned incident record, enabling fast triage and consistent fixes.
- Intermittent packet loss: throughput “looks fine” but real-time control misses deadlines.
- Occasional link flaps: short up/down cycles caused by marginal conditions.
- CRC spikes tied to environment: temperature ramps or power droops correlate with error bursts.
- Error rate stays within X per 10⁶ frames over Y minutes at nominal conditions.
- During stress events (temp/power), bursts are captured with pre/post windows and exported successfully.
- Incidents are reproducibly classified (drop vs CRC vs flap) with consistent metric definitions.
A field-ready black-box is not “lots of logs”. It is a structured evidence bundle that makes RCA repeatable across teams and sites.
- Counters snapshot: CRC/FCS + drops + link flaps + autoneg/EEE transitions (layer-tagged).
- Event log timeline: link/power/thermal/reset/PoE events (status only, no layout details).
- Environment: temperature, supply rails status, brownout flags, uptime.
- Configuration: speed/duplex/EEE state, port role, queue/policer mode, firmware version.
- Time base: monotonic timestamp + wall-clock (optional PTP state recorded as metadata).
- Classify incidents by signature: CRC-burst vs drop-burst vs flap-burst.
- Correlate with environment: temperature ramps / power events / external disturbances.
- Produce a repeatable “fix + verify” checklist with clear pass criteria (X).
H2-2 · Metrics Map: Counters You Must Have (PHY / MAC / Switch / Stack)
A useful metrics map answers two questions: what to collect and how to interpret. To keep the system field-friendly, counters are grouped by layer and by incident signature.
- Symbol/PCS errors: increment when decoding fails at the PHY/PCS boundary.
- Link up/down count: flaps per hour (burst detection uses a short window).
- Autoneg retries (if present): negotiation loops reveal marginal channels.
- SNR/quality hint: any vendor “link quality” indicator should be stored with a timestamp.
- EEE transitions: enter/exit counts help explain latency spikes or wake-related glitches.
- FCS/CRC errors: frames failing integrity checks at MAC receive.
- RX drops: descriptor exhaustion, buffer overrun, or policy drops (must be distinguishable if possible).
- TX retries / underrun: reveals starvation or bus contention, often misread as “link issue”.
- Queue occupancy / watermark: supports “drop burst” root-cause classification.
- Interrupt/coalescing stats: helps separate CPU scheduling artifacts from true link corruption.
- Per-port drops: ingress/egress drop counters (must indicate direction).
- Storm / discard reason: if the switch provides drop reason classes, store the class id.
- Link flaps per port: preserves “one bad port” signatures.
- Queue depth histogram: supports “congestion vs corruption” separation.
- Mirror/telemetry counters: use summary stats; full packet capture is not required as MVP.
- Socket/driver drops: distinguish “stack drop” from “wire drop”.
- Retry/timeouts: store counts and the time window used for rate calculation.
- CPU load / scheduling hint: minimal metadata to flag starvation incidents.
- Per-flow or per-endpoint rollup: identifies a single bad talker without capturing payload.
- Latency sampling hooks: keep it as summary stats (P50/P95) to avoid heavy logging.
Without strict definitions, different tools will show “conflicting” numbers. Each counter must ship with a small spec:
H2-3 · Counter Hygiene: Definitions, Denominators, Windows
Counter numbers become actionable only when they are comparable across devices, tools, and time. This section standardizes definitions, denominators, and windowing so “the numbers look wrong” becomes a solvable engineering issue.
- Symptom: one device shows CRC=0 while another shows CRC bursts on the same incident.
- Quick check: per-port? per-direction (ingress/egress)? frame class filtered?
- Fix: enforce port + direction tagging; define included frame classes explicitly.
- Symptom: error “rate” differs by 10×–100× between dashboards.
- Quick check: per-frame vs per-byte vs per-time? mixed denominators?
- Fix: standardize: errors per 10⁶ frames (or per second) and document it.
- Symptom: long-window averages look fine, but real-time control still fails.
- Quick check: only one long window reported? no peak or burst metric?
- Fix: always publish short + long windows and a burst/peak indicator.
- Symptom: counters “jump” or clear unexpectedly after a short incident.
- Quick check: boot reset vs link-session reset? counter width and wrap?
- Fix: log boot_id/uptime, reset reason, and wrap detection flags.
Every counter should ship with a short “spec” so different teams do not invent their own definitions.
- Metric ID / Name: stable name for long-term trend and version migration.
- Layer: PHY / MAC / Switch / Stack (matches the layered map).
- Scope: port + direction + included frame classes.
- Increment condition: one sentence “when X happens, +1”.
- Reset condition: boot / link-down / manual clear / session reset.
- Denominator: per-frame / per-byte / per-time (choose one).
- Window policy: short + long + peak / P95 (fixed output set).
- Units: count, rate, ppm, or percentage (one canonical unit).
- Rollover behavior: width + wrap detection + reporting rule.
- Validation hook: how to inject and verify the counter increments correctly.
- Instant: quick “now” sanity check.
- Short window: 1s–5s for burst detection.
- Long window: 5–60 minutes for stability trends.
- Peak / P95: captures the worst user experience hidden by averages.
- CRC / FCS rate: peak window catches short corruption storms.
- Drop rate: peak window separates “burst drop” from mild congestion.
- Flap rate: peak reveals repeated up/down cycles in minutes.
Pass criteria must be expressed with window and denominator. Use placeholders (X, Y) and bind them to the system class.
- CRC_rate_5min: ≤ X per 10⁶ frames over Y minutes.
- Drop_rate_5min: ≤ X per 10⁶ frames; burst count reported separately.
- Flap_count_60min: ≤ X per hour per port.
- P95(CRC_rate_5s): ≤ X per 10⁶ frames (captures hidden bursts).
- Evidence completeness: bundle includes the same windows + timebase metadata for replay.
H2-4 · Event Taxonomy: What to Log Besides Counters
Counters show what is happening; events explain when and under what conditions. A black-box requires an event dictionary with minimal fields so incidents can be correlated and replayed.
- timestamp: monotonic time (and optional wall-clock).
- severity: info / warn / critical.
- source: phy, mac, switch, psu, thermal, poe, firmware.
- event_id: stable identifier for parsing and long-term compatibility.
- context: small key/value set (port, speed, rail_id, temp_c, state).
- Examples: link up/down, flap burst, autoneg retry, EEE enter/exit.
- Context: port, speed/duplex, role, link partner id (if available).
- Examples: UV/OV, brownout, rail drop, reset cause.
- Context: rail_id, min/max voltage (if available), reset_reason, uptime.
- Examples: board temperature threshold crossing, thermal throttle flag.
- Context: temp_c, sensor_id, throttle_state, fan state (if present).
- Examples: ESD flag, surge detect flag, protection fault latch.
- Context: port, fault_code, latch_state, test_mode flag.
- Examples: detect/class, power limit, fault/overcurrent, thermal limit.
- Context: class, power_w, limit_state, fault_code, port.
H2-5 · Time Correlation: Single Timebase for Forensics
Forensics fails when counters, events, and environmental logs cannot be aligned on a single axis. Use a monotonic timeline as the primary source of truth, and attach optional wall/PTP time with explicit sync state.
- Purpose: stable ordering and duration, immune to wall-clock jumps.
- Strength: always available, consistent across re-sync events.
- Failure mode: resets on reboot, so boot_id + uptime must be logged.
- Purpose: human-readable timestamps for field service workflows.
- Strength: ties logs to tickets, screenshots, and external systems.
- Failure mode: can jump during set-time; must record time_jump and offset.
- Purpose: cross-device alignment when PTP is available.
- Strength: enables multi-node correlation with a shared epoch.
- Failure mode: quality depends on sync; must log sync_state and clock_status.
- Calibration point: record (monotonic_ts, wall_ts/ptp_ts, offset_ms) at X-minute intervals (X = system class policy).
- Sync state change: log every transition (locked / holdover / free-run / unknown) with monotonic_ts.
- Time jump detection: if wall/PTP jumps by ≥ X ms, emit a time_jump event and keep monotonic ordering.
- Reboot continuity: attach boot_id + uptime to every record so timelines can be stitched after resets.
- boot_id + uptime_ms (required for continuity).
- monotonic_ts for every counter and event record (primary axis).
- clock_source (monotonic / rtc / ptp) and clock_status (locked/holdover/free-run).
- sync_state (PTP/RTC validity flags and state changes).
- offset_records (last N calibration points: monotonic ↔ wall/PTP).
- time_jump_flag + jump magnitude if detected.
H2-6 · Sampling & Triggers: Ring Buffer, Snapshot, “Flight Recorder”
A black-box must be lightweight during normal operation yet preserve evidence when an incident happens. Use a two-level strategy: baseline sampling for health, plus triggered snapshots backed by a ring buffer.
- What: windowed CRC/drop/flap + temp/power state + clock sync state.
- How often: every X seconds/minutes (policy by system class).
- Cost guard: record size ≤ X bytes; daily budget ≤ X.
- Condition: CRC_rate_5s ≥ X per 10⁶ frames.
- Debounce/Cooldown: require X consecutive samples; cooldown X s.
- Snapshot scope: counters + event log + clock status + temp/power states.
- Condition: flap_count_60s ≥ X per port.
- Debounce/Cooldown: ignore repeats within X s after a snapshot.
- Snapshot scope: port state + negotiation/EEE status + recent error windows.
- Condition: temp_jump ≥ X °C in X s, or brownout/PoE fault flag asserted.
- Debounce/Cooldown: one snapshot per X s per port/rail.
- Snapshot scope: clock sync state + counters + event dictionary entries around the flag.
- Indexing: use monotonic_ts so pre/post extraction is deterministic.
- Pre-trigger window: keep last X seconds for “lead-up”.
- Post-trigger window: keep next X seconds for “aftermath”.
- Size budget: bundle cap X KB/MB; define overwrite policy for non-critical records.
- Freeze rule: on trigger, freeze pointers and seal an incident_id for export.
- Seal latency: trigger → bundle sealed ≤ X seconds.
- Pre-window coverage: ≥ X% of the configured pre seconds preserved.
- Post-window coverage: ≥ X seconds captured after trigger.
- Uniqueness: incident_id unique; bundle binds boot_id + monotonic_ts range.
H2-7 · Evidence Bundle: What “One Incident” Should Contain
A usable black-box is defined by a single incident bundle: a bounded, exportable object that contains identity, configuration context, and replayable evidence.
- incident_id (unique), device_id, port_id.
- boot_id + uptime_range (stitch across resets).
- firmware_version + schema_version (long-term decoding compatibility).
- capture_reason (trigger rule ID) + timebase status (clock_status/sync_state).
- Link: speed/duplex, autoneg mode, EEE enable, flow-control enable.
- Queues (status-only): queue_id, drop_policy name, shaping_enabled flag.
- Sampling: baseline_period, pre_window, post_window, debounce/cooldown.
- Metric hygiene: denominator and window set identifiers for every counter series.
- CRC/FCS error windows (short/long/peak), drop windows (RX/queue), link flap windows.
- Autoneg retries, EEE transitions (state changes), and selected “top offenders” per port.
- timestamp (monotonic), severity, source, context key-value list.
- Include the last N link state changes (up/down/negotiation/EEE state).
- temp range, power rail flags (UV/OV/brownout), reset reason flags.
- clock status and sync state for the incident time range.
- pcap_hash or payload hash, ethertype histogram, length histogram, peak frame-rate.
- Keep only summaries that support “incident happened” without large storage cost.
Output A · Summary (one screen)
- Top anomalies: CRC burst / drop burst / flap storm (auto-ranked).
- Environment: temp range + power flags + clock status.
- Next check hint: correlation pointers (e.g., “errors align with brownout flag”).
Output B · Details (engineering replay)
Output C · Raw (structured export)
- Raw counters dump (units, denominators, windows, timestamps).
- Raw event records (structured key-value or TLV).
- Clock calibration points and offset records for correlation.
- Integrity fields (chunk_crc, bundle_hash) for verification.
H2-8 · Storage, Retention, and Integrity
Reliability comes from power-fail safe commit, a clear retention policy, and integrity checks that can detect corruption or tampering.
- High-frequency small writes: prefer low-latency NVM-class storage for commit metadata.
- Large bundles / long retention: use block storage with wear-aware policies.
- Hybrid approach: metadata/journal on fast NVM, payload on bulk storage.
- Prepare journal: write a header with incident_id, schema_version, and payload size placeholders.
- Write payload chunks: append data with per-chunk CRC for local fault isolation.
- Seal hash: compute bundle_hash (e.g., SHA-256) and store it in the header region.
- Flip valid bit (A/B header): commit by switching the active header atomically.
- Verify on load: reject non-committed bundles and mark corrupted records without blocking others.
- Severity buckets: critical / warn / info with separate caps (N incidents or M MB).
- Export gate: exported bundles can be moved to a secondary tier or become overwrite-eligible.
- Write control: limit snapshot storms with cooldown and de-dup by incident fingerprint.
- chunk_crc: detect localized corruption without scanning the entire bundle blindly.
- bundle_hash: verify full-bundle integrity after reconstruction.
- schema_version: prevent mis-decoding due to format drift.
- Optional escalation: signing/key management belongs to the Security page when required.
H2-9 · Remote Export & Field Workflow (forensics pipeline)
A repeatable forensics pipeline requires two things: a consistent bundle format and a field workflow that preserves context from collection to RCA and verification.
Interfaces are treated as transport. Every channel must export the same incident bundle schema and ticket envelope.
- CLI / Web UI export for cabinet-side service.
- Offline-friendly: export to a file without relying on upstream connectivity.
- Guarantee: identical bundle schema, identical hashing fields.
- Management-plane export (e.g., CLI automation / REST / management protocols).
- Batch pull: retrieve bundle index first, then selected incident bundles.
- Guarantee: export bundles or bundle indices only, not full captures.
- Topology/port identification and field labeling support (e.g., discovery protocols).
- Use for: “which device/port is the failing link” and “where to attach a ticket”.
- Boundary: discovery data supplements the ticket envelope; it does not replace evidence.
Exported evidence should be wrapped by a ticket envelope so multiple incidents remain traceable and comparable.
- Ticket: ticket_id, site_id, operator_role/shift.
- Context: symptom_tag, service_window, notes (structured).
- Asset: device_id, port_id, firmware_version.
- Evidence: incident_id_list, bundle_count, bundle_hash_list.
- Identify: confirm device/port and assign symptom_tag.
- Freeze window: record the affected time range and timebase status.
- Export: export recent top incidents + current summary snapshot.
- Annotate: fill ticket envelope fields and structured现场变更 notes.
- Transfer: upload via gateway/server; keep offline copy if needed.
- Replay: correlate counters/events/env timelines per incident_id.
- Infer: produce evidence-backed hypotheses (correlations only).
- Verify: run regression steps and check metrics return within threshold X.
RCA report template (structured)
H2-10 · Engineering Checklist (Design → Bring-up → Production)
A practical black-box must pass three gates: Design, Bring-up, and Production. Each checklist item should be verifiable and includes a pass criteria placeholder X.
- Metric spec completeness: name/scope/increment/reset/denominator/windows. Pass criteria: X
- Trigger rules: thresholds + cooldown + de-dup fingerprint. Pass criteria: X
- Bundle schema frozen: schema_version + required fields. Pass criteria: X
- Summary/Details/Raw defined: one-screen summary and structured export. Pass criteria: X
- Timebase policy: monotonic main axis + sync_state fields. Pass criteria: X
- Storage budget: baseline + snapshot volume ceilings. Pass criteria: X
- Retention buckets: critical/warn/info with overwrite rules. Pass criteria: X
- Ticket envelope fields: ticket_id + site + symptom + hashes. Pass criteria: X
- Counter alignment: consistent windows/denominators across endpoints. Pass criteria: X
- CRC burst injection: trigger latency and pre/post capture coverage. Pass criteria: X
- Link flap injection: flap counts and negotiation transitions captured. Pass criteria: X
- Drop burst injection: queue drops visible and ranked correctly. Pass criteria: X
- Power-fail tests: non-committed records discarded; committed records readable. Pass criteria: X
- Time discontinuity tests: monotonic stitching + sync_state labeling. Pass criteria: X
- Trigger storm control: cooldown and de-dup prevent bundle explosion. Pass criteria: X
- Export path validation: offline or remote export success with hash verification. Pass criteria: X
- Default thresholds frozen: manufacturing config consistency. Pass criteria: X
- Compression/limits: bounded CPU/IO overhead with low evidence loss. Pass criteria: X
- Regression suite: fault injection + normal load scenarios each release. Pass criteria: X
- Schema compatibility: multi-version decoding success rate. Pass criteria: X
- Tool mapping frozen: field changes require a controlled change process. Pass criteria: X
- Field SOP adoption: envelope completeness and reporting compliance. Pass criteria: X
- Retention audit: critical exports traceable and not overwritten prematurely. Pass criteria: X
- Fix→Verify closure: metrics return within threshold after remediation. Pass criteria: X
H2-11 · Applications (Where Black-Box Saves You Most)
- Symptom: CRC bursts, sporadic drops, link flap under motor/relay switching.
- What to log: PHY error counters (CRC/FCS, symbol/alignment), link-up/down timeline, autoneg retry/EEE transitions, plus temp & rail events.
- What it reveals: correlation between error bursts and environmental/power events, and whether faults are link-quality vs buffer/queue artifacts.
- Example silicon building blocks (part numbers): Marvell Alaska 88E1512/88E151x (PTP + VCT cable test feature family), Microchip VSC8574 (multi-port GbE PHY w/ IEEE 1588), TI DP83640 (IEEE 1588 hardware timestamp PHY), Microchip KSZ9567R (managed switch w/ IEEE 1588 support).
- Symptom: link drops only when load steps, inrush, or port power limit events occur.
- What to log: brownout/UV/OV flags, PD class / PSE port state, power-limit / fault events, plus link counters around the same timestamp window.
- What it reveals: whether drops are caused by power collapse, port power cycling, or unrelated data-plane congestion.
- Example power controllers (part numbers): TI TPS23881 (802.3bt PSE controller), TI TPS2373 (high-power PoE PD interface), Analog Devices LTC4291 (802.3bt PSE controller), Skyworks Si3406 (PoE PD).
- Symptom: end-to-end latency spikes, occasional missed cycles, or micro-bursts causing queue drops.
- What to log: per-queue drops, shaping/gate schedule state changes, timestamp quality, and port-level CRC/error counters.
- What it reveals: whether the issue is traffic shaping / scheduling vs physical link integrity.
- Example TSN-capable switches (part numbers): NXP SJA1105 (AVB/TSN switch family), Microchip LAN9662 (TSN GbE switch with integrated CPU), Microchip KSZ9567R (managed switch w/ IEEE 1588/AVB hooks).
- Symptom: occasional retries/drops at certain temperature or cable routing conditions.
- What to log: link-state transitions, error counters, and environment (temp/power) around incidents.
- What it reveals: whether failures are driven by temperature drift, power events, or gradual link-quality degradation.
- Example SPE PHY (part number): Analog Devices ADIN1100 (10BASE-T1L).
H2-12 · IC Selection Logic (What Features to Demand)
Why: without layered attribution, “drops” cannot be separated into link-quality vs queue/policer artifacts.
How to verify: read counters at idle, inject a controlled burst, confirm monotonic increments and consistent denominators (packets/bytes/time).
Example silicon (part numbers): Microchip KSZ9567R (managed switch counters + IEEE 1588), Microchip LAN9662 (TSN switch with integrated CPU and switch telemetry), Marvell Alaska 88E1512 family (PHY diagnostics feature family).
Why: forensics requires aligning “CRC burst” with “power dip” or “thermal ramp” on the same axis.
How to verify: generate periodic sync events; confirm timestamp monotonicity across reboots and quantify drift / offset capture in the evidence bundle.
Example PHYs / switches (part numbers): TI DP83640 (IEEE 1588 hardware timestamp PHY), Microchip VSC8574 (IEEE 1588-capable multi-port PHY), Marvell Alaska 88E1510/88E1512/88E1514 family (PTP timestamping), NXP SJA1105 (AVB/TSN switch family).
Why: link health anomalies are often driven by environment and supply integrity, not by packet rate alone.
How to verify: sweep temperature and load; induce controlled brownout; confirm events appear with correct ordering and no missing windows.
Example sensors / PoE controllers (part numbers): TI TMP117 (digital temperature sensor), TI INA226 (I2C current/power monitor), TI TPS23881 (PoE PSE), TI TPS2373 (PoE PD), Analog Devices LTC4291 (PoE PSE).
Why: incidents are rare; evidence must survive reboot and power loss without corruption.
How to verify: trigger snapshot, then cut power during commit; boot and validate journal/CRC and record continuity.
Example NVM building blocks (part numbers): Infineon FM25V10-G (SPI F-RAM), Fujitsu MB85RS64V (SPI FRAM), Winbond W25Q128JV (SPI NOR flash), Micron MTFC32GBCAQTC-IT (industrial eMMC).
Why: field service depends on consistent artifacts (ID/version/config + summary + raw).
How to verify: export under CPU load; confirm bundle integrity hash and complete metadata fields.
Example integrated platforms (part numbers): Microchip LAN9662 (switch + integrated CPU for local collection/export), Microchip KSZ9567R (managed switch as a telemetry source), Analog Devices ADIN1100 (SPE PHY as a monitored endpoint).