Edge UPF Appliance: Hardware Architecture & Validation
An Edge UPF Appliance is “done” only when its packet-rate forwarding, inline crypto, PHY/retimer link margin, PTP/SyncE timing, and PMBus power evidence are all measurable, repeatable, and explainable with counters and logs. In practice, the fastest way to keep performance stable is to treat every failure as an evidence problem: correlate drop and queue statistics, link error counters, clock alarms, and brownout events to pinpoint whether the limit is the pipeline, the link, the timebase, or the power rails.
H2-1 · What an “Edge UPF Appliance” is (and what it is NOT)
An Edge UPF Appliance is a purpose-built, edge-deployable user-plane box designed to sustain high packet-rate forwarding (small packets, bursty traffic) while keeping latency predictable, maintaining operational evidence (counters, timestamps, event logs), and optionally inserting inline link/user-plane crypto without turning the system into a general security gateway.
This page is intentionally hardware- and validation-centric: it focuses on what must be true inside the appliance (data path, ports, timing distribution, power telemetry) so the box can be proven stable in the field.
What it is NOT (boundary rules to prevent scope drift)
- Not a SmartNIC/DPU card: the engineering focus is the appliance (ports, thermals, power, management evidence), not PCIe card topology.
- Not a ZTNA / security gateway: security coverage is limited to inline crypto insertion and boot/identity evidence, not full policy/inspection stacks.
- Not a boundary-clock switch / time hub: timing discussion is restricted to internal clock-tree integrity and alarms that affect appliance timestamps and stability.
“Done” means the box can be proven, not just claimed
- Meets both Gbps throughput and Mpps packet rate targets (especially 64B/IMIX).
- Controls tail latency (e.g., P99) and jitter under congestion, bursts, and crypto on/off.
- Recovers deterministically from brownouts and thermal events with traceable evidence.
- Exports a minimum evidence set: drop reasons, queue depth, link error trends, timing alarms, PMBus faults, and reset causes.
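The evidence requirement above can be made mechanical. A minimal sketch in Python, assuming the box exports a flat set of counter names (the names and domain grouping are illustrative, not a vendor schema):

```python
# Sketch of a "minimum evidence set" completeness check. All counter
# names are illustrative placeholders, not a specific device's registers.
REQUIRED_EVIDENCE = {
    "forwarding": ["drop_by_reason", "queue_depth"],
    "link":       ["fec_corrected", "link_flap"],
    "timing":     ["timebase_alarm", "ref_select_state"],
    "power":      ["brownout_event", "reset_cause"],
}

def missing_evidence(exported):
    """Return, per evidence domain, the counters the box fails to export."""
    gaps = {}
    for domain, names in REQUIRED_EVIDENCE.items():
        absent = [n for n in names if n not in exported]
        if absent:
            gaps[domain] = absent
    return gaps

# A box exporting only forwarding/link counters is not "done":
gaps = missing_evidence({"drop_by_reason", "queue_depth",
                         "fec_corrected", "link_flap"})
```

A check like this belongs in the acceptance pipeline: the gaps dict is empty only when every domain can answer "what happened?" after a field event.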
H2-2 · System block diagram: planes, ports, and evidence paths
The fastest way to avoid design ambiguity is to draw the appliance as three planes and force every requirement to attach to a plane and a measurable evidence path. The data plane must carry the packet pipeline at edge rates; the control plane must keep policies and sessions stable; and the management plane must export the minimum evidence set that makes field failures diagnosable.
Three planes (and the role each plays)
- Data plane: forwarding + optional crypto insertion; owns packet-rate, latency, and drop behavior.
- Control plane: session/route/policy orchestration; owns correctness under churn and failover.
- Management plane: OOB access, telemetry collection, event logs, and safe upgrade/rollback evidence.
Ports and “where truth comes from”
- Front-panel ports: traffic I/O; PHY/FEC/link error trends are first-line evidence for physical issues.
- OOB management: stable access to logs/telemetry even when the data plane is degraded.
- Time I/O (optional): internal timing distribution and alarms (PLL lock, phase error) that affect timestamps.
- Power input: PMBus rails, faults, brownouts, and temperature sensors that explain resets and throttling.
The diagram below intentionally uses two arrow classes: thick arrows for traffic and thin arrows for evidence. This keeps mobile readability while preserving a field-debuggable model.
H2-3 · Performance targets that drive hardware choices (Gbps vs Mpps vs latency)
Edge UPF performance is not a single number. A box can look “fast” on throughput (Gbps) yet fail in production when the traffic mix shifts to small packets, bursts, or crypto-enabled flows. Hardware choices must be driven by a joint target: Gbps, Mpps (64B/IMIX), and tail latency (P99), plus an evidence-backed loss and recovery profile.
Why “Gbps is high but 64B Mpps is low” is a common failure mode
- Per-packet work dominates: parsing, metadata updates, and counter writes consume fixed cycles per packet. As packet size shrinks, the pipeline becomes packet-rate limited even if link bandwidth is underused.
- Lookup stalls: flow/ACL/QoS lookups can introduce variable latency (hash collisions, misses, memory contention), reducing sustained Mpps and increasing tail latency.
- Microbursts overwhelm queues: short bursts can fill buffers faster than scheduling can drain, producing drops while “average throughput” still appears acceptable.
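The arithmetic behind this failure mode is worth making explicit. A small Python helper using the standard 20B per-frame Ethernet overhead (8B preamble + 12B inter-frame gap):

```python
def line_rate_mpps(rate_gbps, frame_bytes):
    """Theoretical packet rate at line rate for a given frame size,
    including 20B of per-frame Ethernet overhead (preamble + IFG)."""
    wire_bits = (frame_bytes + 20) * 8
    return rate_gbps * 1e9 / wire_bits / 1e6

# 100G at 64B demands ~148.8 Mpps; at 1500B only ~8.2 Mpps.
mpps_64b  = line_rate_mpps(100, 64)
mpps_1500 = line_rate_mpps(100, 1500)
```

The gap is roughly 18×: a pipeline comfortably sized for "average" packets can be an order of magnitude short of per-packet budget once the mix shifts to 64B.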
Low latency and low jitter translate to concrete hardware constraints
- Buffer depth vs tail latency: deeper buffers improve burst tolerance but increase queue residence time (P99/P999). Shallow buffers reduce latency but raise sensitivity to burst loss.
- Pipeline determinism: deeper pipelines can stabilize throughput, but exception paths (slow-path handling, rare headers, error recovery) create latency spikes if not isolated and accounted for.
- Clock integrity for measurement stability: consistent timestamps and repeatable latency measurement require stable internal timing distribution and alarms (no Time Hub/GNSS scope here).
Metrics breakdown (what must be measured and proven)
- Throughput: sustained line-rate under representative IMIX and real policy sets.
- Packet rate (Mpps): 64B/128B/IMIX curves, not a single headline number.
- Tail latency: median is not sufficient; P99 (and jitter) must be stable under bursts and crypto on/off.
- Loss curve: the “knee” where drops begin, and drop reasons (buffer overflow vs policy drop vs link errors).
- Congestion recovery: time to drain queues and return to baseline latency once congestion clears.
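Locating the loss-curve “knee” can be automated from sweep data. A sketch, assuming sorted (offered load, loss %) pairs from a traffic generator run:

```python
def loss_knee(offered_mpps, loss_pct, threshold=0.01):
    """Return the first offered load (Mpps) at which loss exceeds
    `threshold` percent. Points must be sorted by offered load;
    None means no knee was observed in the sweep."""
    for load, loss in zip(offered_mpps, loss_pct):
        if loss > threshold:
            return load
    return None

# Loss stays at zero until 40 Mpps, then climbs: the knee is 40.
knee = loss_knee([10, 20, 30, 40, 50], [0.0, 0.0, 0.0, 0.4, 9.1])
```

The knee alone is not sufficient evidence; pair it with drop-by-reason counters at the same load step to distinguish buffer overflow from policy drops.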
H2-4 · Inline forwarding ASIC pipeline (where packets are classified, modified, scheduled)
Sustained edge user-plane performance depends on a predictable data-plane pipeline. The appliance should be reasoned about as a sequence of stages where each stage has (1) a bounded per-packet cost, (2) a set of failure modes, and (3) a minimal set of counters that makes bottlenecks and drops explainable.
Pipeline stages (hardware view, protocol-agnostic)
- Parser: extracts headers and metadata; determines the fast path vs exception path triggers.
- Classifier: assigns traffic class and policy context; tags packets for lookup and scheduling decisions.
- Flow lookup / ACL / QoS: retrieves per-flow state and rules; the main source of variable latency if stalled.
- Header rewrite: applies forwarding decisions; updates encapsulation fields (only the stage role is covered here).
- (Optional) Inline crypto: encrypt/decrypt path integrated into the pipeline; must avoid burst-driven latency spikes.
- Scheduler / Queues: absorbs burstiness; defines loss behavior and tail latency through buffer policy and draining.
Memory trade-offs (why the same features can behave very differently)
- SRAM: deterministic and fast; ideal for hot tables and frequent counters.
- TCAM: strong matching but expensive and power-heavy; useful when rule priority and mask-based matching dominate.
- DRAM: high capacity but higher and more variable latency; can inflate P99 when cold tables or large states are accessed.
Field symptoms → likely stage-level causes (and what to check)
- Mpps drops first while Gbps looks fine: per-packet work or lookup stalls → check lookup miss/collision and pipeline exception counters.
- P99 latency spikes under bursts: queue residence time grows → check queue high-watermark and scheduler drops.
- Drops appear only during microbursts: buffer threshold and draining mismatch → check drop-by-reason with queue depth correlation.
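The symptom table above can be encoded as a first-pass triage rule. A hedged sketch — both the counter names and the comparison heuristic are illustrative, not a silicon API:

```python
def stage_triage(c):
    """First-pass stage attribution from a counter snapshot (a plain dict
    of illustrative counter names). Mirrors the symptom table: lookup or
    per-packet work first, then queueing, else inconclusive."""
    if c.get("lookup_miss", 0) + c.get("exception_pkts", 0) > c.get("queue_hwm_events", 0):
        return "lookup/per-packet work: check miss, collision, exception counters"
    if c.get("queue_hwm_events", 0) > 0 and c.get("sched_drops", 0) > 0:
        return "queueing/scheduling: check high-watermark and drop-by-reason"
    return "no stage-level signature; widen the evidence window"

# Misses and exceptions dominate queue pressure → suspect the lookup stage.
verdict = stage_triage({"lookup_miss": 900, "exception_pkts": 150,
                        "queue_hwm_events": 3})
```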
H2-5 · Where crypto belongs (inline IPsec/MACsec/DTLS) and how it breaks performance
In an Edge UPF appliance, crypto is not “just security.” Where encryption is inserted defines the box’s true limits on Mpps, P99 latency, power/thermals, and operational stability. The goal is to keep encryption measurable, predictable, and recoverable under bursts, small packets, and key updates—without turning the platform into a full security gateway.
Placement choices (port-side vs mid-pipeline) and their impact
- Port-side (MACsec): encryption is attached to the link/port path. This tends to preserve a clean forwarding pipeline, but it pushes power and thermal load toward the port complex and can expose rate-dependent behavior under dense port configurations.
- Mid-pipeline (IPsec/DTLS-style insertion): crypto is integrated near classification/rewrite stages. This enables flexible policy binding, but it can amplify tail-latency spikes if lookup, queueing, and crypto resource contention are not tightly bounded.
How crypto breaks performance (typical failure mechanisms)
- Small-packet collapse: fixed per-packet work (context fetch, sequence/tag handling, counter updates) dominates at 64B, so Mpps drops long before bandwidth is exhausted.
- Latency spikes under bursts: crypto engines and key contexts contend for bandwidth; queue residence time grows and P99 jitter appears, especially when crypto is inserted after classification and before scheduling.
- Thermal-driven derating: enabling crypto increases power density. If cooling headroom is tight, temperature rise can cause rate fallback, higher error rates, or throttling—often misread as “network congestion.”
Common operational traps (symptom → what to check)
- Key rotation causes traffic jitter: correlate key_update_event with P99 spikes and queue high-watermarks.
- Replay/sequence-related drops: verify replay_drop / seq_error counters and confirm drop-by-reason distribution.
- Bypass/fallback ambiguity: ensure the device logs whether crypto is enabled, bypassed, or failed-closed.
- Crypto errors mistaken for PHY faults: separate crypto_err from port FEC/PCS error counters.
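The first trap — key rotation causing jitter — is checkable numerically. A sketch that measures how strongly P99 guardrail violations cluster around key_update_event timestamps (the event feed, units, and 1-second window are assumptions):

```python
def spikes_near_rekey(rekey_ts, p99_samples, guardrail_us, window_s=1.0):
    """Fraction of P99 guardrail violations occurring within `window_s`
    seconds of a key_update_event. `p99_samples` is a list of
    (timestamp_s, p99_us) pairs; a ratio near 1.0 implicates rekey."""
    spikes = [t for t, p99 in p99_samples if p99 > guardrail_us]
    if not spikes:
        return 0.0
    near = sum(any(abs(t - r) <= window_s for r in rekey_ts) for t in spikes)
    return near / len(spikes)

# Both guardrail violations cluster around the rekey at t=10s → ratio 1.0.
ratio = spikes_near_rekey([10.0],
                          [(9.5, 80), (10.2, 150), (10.8, 140), (30.0, 70)],
                          guardrail_us=100)
```

A ratio near 1.0 says "smooth the rekey cadence"; a ratio near 0.0 says look elsewhere (queues, link margin) before touching key management.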
Key-domain isolation (TPM/HSM boundary, evidence only)
- Key domain: key material remains in a protected domain; the pipeline consumes handles/contexts rather than exporting secrets.
- Upgrade/rollback evidence: measured boot and firmware version records provide proof of what crypto implementation was running when an event occurred.
H2-6 · Ethernet PHYs & retimers: port density, SerDes margin, and board-level realities
Port stability is a first-order requirement for an Edge UPF appliance. Even when the forwarding pipeline is strong, weak link margin turns into retries, latency inflation, and intermittent drops. PHYs and retimers must be evaluated as a chain: front-panel connectors, channel loss, equalization, FEC behavior, and the SerDes interface into the forwarding ASIC.
Roles and boundaries (what PHYs and retimers actually do)
- PHY: physical-layer transceiver functions plus PCS/FEC statistics that reveal link health trends.
- Retimer: re-clocks and equalizes signals to restore margin over long or lossy channels; critical for dense ports and high rates.
- Lane mapping (practical impact): dense port designs increase routing complexity; lane swaps and crossovers amplify debug time and failure localization needs.
Board reality (what silently destroys margin)
- Channel + connectors: insertion loss and contact variability create rate-sensitive instability that often appears “random.”
- Power integrity noise: SerDes and PLL sensitivity can make errors rise under high load even if routing is nominal.
- Thermals: retimers/PHYs heat up under traffic; margin shrinks and corrected FEC climbs before uncorrected errors appear.
Field failure modes (symptom → likely evidence)
- Intermittent link flaps: check link_flap counters and whether FEC errors surge first.
- Only fails at high temperature or load: correlate port temperature with fec_corrected trends.
- One data rate unstable while a lower rate is fine: indicates marginal channel/equalization headroom.
- “Looks like congestion” but is link-induced: compare port error counters against queue depth and drop-by-reason.
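The temperature-dependent failure mode can be tested with a simple correlation over logged samples. A sketch using a plain Pearson coefficient with an illustrative 0.8 threshold:

```python
def temp_correlated(temps_c, fec_corrected):
    """Crude test for 'corrected FEC rises with port temperature':
    Pearson correlation above 0.8 (an illustrative threshold, not a
    qualified margin criterion). Inputs are paired log samples."""
    n = len(temps_c)
    mt = sum(temps_c) / n
    mf = sum(fec_corrected) / n
    cov = sum((t - mt) * (f - mf) for t, f in zip(temps_c, fec_corrected))
    st = sum((t - mt) ** 2 for t in temps_c) ** 0.5
    sf = sum((f - mf) ** 2 for f in fec_corrected) ** 0.5
    return st > 0 and sf > 0 and cov / (st * sf) > 0.8

# Corrected-FEC growth tracks temperature → suspect thermal margin loss.
hot_link = temp_correlated([40, 50, 60, 70], [10, 80, 400, 2000])
```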
H2-7 · PTP/SyncE clock tree inside the appliance (timestamp points + jitter cleaning)
An Edge UPF appliance often participates in synchronized measurement, timestamped telemetry, and audit-grade event traces. Even when the box is “just forwarding,” poor internal clock quality turns into unstable timestamps, jittery latency evidence, and false root-cause signals. The engineering goal is simple: maintain a clean time base inside the chassis, expose lock/alarm states, and make any switchover fully traceable.
Why a UPF needs a good clock (practical drivers)
- Timestamp consistency: packet and telemetry timestamps must be comparable across ports and across time windows.
- Repeatable latency evidence: P99 and jitter investigations rely on a stable reference and clear timebase state flags.
- Auditability: if timebase quality changes (holdover, reference loss), logs must prove when and how it happened.
Timestamp points (PHY vs MAC vs ASIC) — what each point “means”
- PHY timestamp: closest to the physical interface; best for link-level timing evidence and reduced internal queue influence. Typical evidence: synce_lock, PHY time counters, port-level timestamp path selection.
- MAC timestamp: reflects MAC scheduling and arbitration effects; useful to expose internal contention sensitivity. Typical evidence: MAC timestamp status, per-port arbitration indicators.
- ASIC timestamp: closest to the forwarding pipeline; captures queue residence effects and data-plane handling variability. Typical evidence: latency_stamp, per-queue timing deltas, pipeline timing state flags.
Jitter cleaning & distribution (inside the chassis only)
- Recovered SyncE input: port-recovered frequency provides a reference that must be validated and monitored.
- PLL / jitter cleaner: cleans phase noise before fanout; exposes lock and holdover states as first-class telemetry.
- Clock fanout: distributes a conditioned clock to PHYs, the forwarding ASIC, and timestamp units.
Redundancy & alarms (switchover evidence, not external time-hub design)
- Reference selection: track which reference is active (A/B/internal) and log any change.
- Alarm taxonomy: loss-of-signal/lock, holdover entry, and recovery must be exportable with timestamps.
- Traceable switchover: each transition should emit ref_select_state, timebase_alarm, and holdover_time.
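The switchover-evidence rule can be captured in a tiny state holder that emits the required fields on every transition. Field and event names follow this page's examples, not a specific timing-driver API:

```python
from dataclasses import dataclass, field

@dataclass
class TimebaseState:
    """Tracks the active timing reference and logs a traceable record
    for every switchover (illustrative field names)."""
    ref_select_state: str = "A"
    events: list = field(default_factory=list)

    def switch(self, new_ref, ts, reason):
        # Each transition records the from->to path, a timestamp,
        # and the alarm that triggered it.
        self.events.append({
            "timestamp": ts,
            "ref_select_state": f"{self.ref_select_state}->{new_ref}",
            "timebase_alarm": reason,
        })
        self.ref_select_state = new_ref

# Reference A loses lock; the box enters holdover on the internal oscillator.
tb = TimebaseState()
tb.switch("internal", ts=1712.5, reason="loss_of_lock_A")
```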
H2-8 · PMBus-based power architecture: rails, sequencing, telemetry, and brownout evidence
A UPF appliance power system must be measurable and provable, not just “sized.” PMBus turns power into an evidence source: rail health, sequencing correctness, fault flags, and brownout timelines. When the box reboots, throttles, or drops links, the power stack should answer: which rail drifted, which protection tripped, and what action was taken.
Rail domains that matter (what each rail class tends to break)
- SerDes / PHY rails: noise or droop shows up as BER/FEC growth and intermittent flaps.
- ASIC core rails: droop becomes resets, forwarding stalls, or counter discontinuities.
- DDR rails: instability can create hard-to-explain errors and unstable tail latency behavior.
- Management rails: determines whether telemetry and event logs survive the fault window.
- Clock rails: power noise can masquerade as timing or timestamp instability.
Sequencing & PG (why “occasional boot failure” is usually rail logic)
- Ordering sensitivity: releasing reset before critical rails are stable can create intermittent failures that only appear under temperature or load variation.
- PG fan-in: multiple PG signals must be debounced and combined deterministically; false PG transitions cause phantom resets.
- Restart policy: retries and backoff should be explicit and logged, so a “flaky boot” has a repeatable explanation.
PMBus telemetry strategy (what to read and what to log)
- Read set: V/I/T/P plus fault flags (UV/OV/OC/OT) per rail domain.
- Sampling: use tiered sampling (fast for critical rails, slower for secondary rails) to avoid missing brownout windows or drowning logs.
- Thresholds: align thresholds with observable service impact (link errors, resets, latency drift), not just datasheet margins.
- Event records: each fault must capture rail ID, pre/post readings, fault code, and recovery action (retry, latch, throttling).
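Tiered sampling is easy to express as a per-rail period table. A sketch — rail names and periods are illustrative placeholders, not a qualified polling plan:

```python
# Fast polling on rails whose droop causes resets or link errors,
# slower polling on secondary rails. Periods are in milliseconds.
SAMPLING_MS = {"asic_core": 1, "serdes": 1, "ddr": 10, "mgmt": 100, "clock": 10}

def due_rails(t_ms):
    """Rails whose sampling period divides the current millisecond tick."""
    return [rail for rail, period in SAMPLING_MS.items() if t_ms % period == 0]

# At t=10ms the fast rails and the 10ms tier are read in the same pass,
# so a brownout window on asic_core is never traded away for ddr coverage.
reads_at_10ms = due_rails(10)
```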
Brownout & outage evidence (prove what happened)
- Reset cause: BOR/WDT/PMIC fault classification should always be preserved across reboot.
- Fault timeline: capture the sequence: UV flag → PG drop → protective action → reboot or recovery.
- Correlation: link brownout events to changes in queue depth, drops, and port error counters to avoid misdiagnosis.
H2-9 · Thermal & mechanical constraints that change silicon choices (without becoming a rack page)
Sustained line-rate operation in an Edge UPF appliance is often limited by thermal headroom and mechanical layout, not by a datasheet throughput number. As temperature rises, link margin shrinks, error correction activity increases, silicon throttles, and jitter/latency evidence becomes noisy. The goal is to make hotspots predictable, instrumented, and controllable—so performance remains stable under real site conditions.
How heat turns into performance loss (practical symptom chain)
- SerDes margin collapses: BER rises → FEC works harder → tail latency and throughput stability degrade.
- Throttling activates: power/temperature caps reduce frequency or lanes → Mpps and P99 drift under load.
- Power stage stress: VRM efficiency and transient behavior shift with temperature → intermittent resets or “unexplained” errors.
Typical hotspots that must be treated as “design domains”
- PHY / retimer bank: dense I/O + equalization power; heat maps directly to link errors and flaps.
- Inline crypto block: enablement can shift power abruptly; affects fan policy and latency stability.
- VRM zone: thermal rise reduces margin and can amplify brownout risk during bursts.
- Forwarding ASIC: the main power density source; any throttling changes forwarding determinism.
Derating policies (engineering strategies that should be provable)
- Fan curve by zones: control and alarm by PHY/ASIC/VRM temperature zones (not a single “box temp”).
- Power caps & throttle states: explicit thresholds with logged activation and recovery.
- Rate fallback: port-speed fallback as a last-resort stability tool; every fallback must be timestamped.
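Zone-based fan control can be sketched as independent linear ramps with a worst-zone-wins policy. The breakpoints below are placeholders, not a qualified fan curve:

```python
def fan_duty(zone_temps):
    """Per-zone fan policy: each zone has its own (ramp-start, full-speed)
    breakpoints and the hottest zone sets the duty cycle. Zone names and
    temperatures are illustrative."""
    ramps = {"phy": (60, 90), "asic": (70, 100), "vrm": (65, 95)}
    duty = 20  # idle floor, percent
    for zone, temp in zone_temps.items():
        start, full = ramps[zone]
        if temp >= start:
            pct = 20 + 80 * min(1.0, (temp - start) / (full - start))
            duty = max(duty, int(pct))
    return duty

# Only the PHY zone is warm; it alone drives the fans to 60%.
d = fan_duty({"phy": 75, "asic": 65, "vrm": 50})
```

The point of the per-zone structure is provability: a logged duty change can be attributed to a named zone instead of a single ambiguous "box temp".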
H2-10 · Observability & field forensics (counters, timestamps, and event logs that prove the box is healthy)
Field stability depends on proving the appliance is healthy using structured evidence, not guesswork. A minimal, well-chosen set of counters, timestamped state transitions, and event logs can distinguish congestion from link errors, timebase instability, or power events—without turning the system into a TAP/probe product.
Minimum counter set (organized by evidence domains)
- Forwarding / queues: drop_by_reason, queue_depth, queue_hwm, latency_stamp (P99/Jitter).
- Link / PHY: link_flap, fec_corrected, fec_uncorrected, pcs_errors (trend).
- Crypto: crypto_err, key_update_event, replay_drop (if present).
- Timing: synce_lock, ptp_lock, pll_unlock, timebase_alarm, ref_select_state.
- Power / reset: brownout_event, uv_fault, pg_drop, wdt_reset, reset_cause, throttle_state.
Correlation patterns (how to tell what is actually happening)
- Congestion-driven: queue_hwm rises first → latency/jitter grows → drops increase with queue/policer reasons.
- Link-margin-driven: fec_corrected rises first → latency becomes noisy → link flaps or throughput instability follows.
- Timebase-driven: pll_unlock or ptp_lock instability → timestamp validity changes → apparent “metric drift” without real congestion.
- Power-driven: uv_fault or brownout_event → pg_drop → reset_cause / wdt_reset → counters reset discontinuities.
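The four patterns suggest a fixed order of checks, since a power event resets the very counters the other patterns rely on. A hedged sketch using this page's illustrative counter names:

```python
def classify_incident(c):
    """Domain-level first-pass classification from a counter snapshot.
    Power is checked first (it invalidates other counters), then
    timebase, then link margin, then congestion. Names are illustrative."""
    if c.get("uv_fault") or c.get("brownout_event"):
        return "power-driven"
    if c.get("pll_unlock") or c.get("timebase_alarm"):
        return "timebase-driven"
    if c.get("fec_corrected_delta", 0) > c.get("queue_hwm_delta", 0):
        return "link-margin-driven"
    if c.get("queue_hwm_delta", 0) > 0:
        return "congestion-driven"
    return "inconclusive; widen the capture window"

# Corrected-FEC growth without queue pressure → link margin, not congestion.
verdict = classify_incident({"fec_corrected_delta": 5000, "queue_hwm_delta": 2})
```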
Event logs that support forensics (not string dumps)
- Structured fields: timestamp, domain, component, severity, reason_code, before/after readings, action_taken.
- Required events: key update, pll unlock/holdover, brownout, rate fallback, fan failure, watchdog resets.
- Pre/post window: preserve a short ring buffer around major events so reboots do not erase the cause.
Upgrade & rollback evidence (availability focus, not a security chapter)
- Version traceability: build ID and configuration hash recorded with every major fault.
- Rollback audit: rollback time, reason, target version, and recovery time must be logged.
- Management survivability: ensure OOB management and logs remain reachable during partial failures.
H2-11 · Validation & production checklist (what proves it’s done)
An Edge UPF Appliance is “done” only when forwarding, inline crypto, link integrity, timing sanity, and power/fault evidence are measurable and repeatable across lab, manufacturing, and field drills—without relying on “it usually works”.
1) Lab validation (engineering acceptance)
- Throughput–Mpps–latency characterization (with/without crypto)
  - Setup: IMIX + fixed sizes (64B/128B/512B/1500B/9K), step-load on queues, enable/disable inline crypto.
  - Pass: No “64B cliff”, no abnormal P99 tails at target utilization; crypto mode meets the minimum service envelope.
  - Evidence: CSV curves + P50/P95/P99 latency histograms + drop reason counters.
- Microburst & congestion recovery
  - Setup: Burst patterns (incast, fan-in), controlled buffer pressure, queue scheduling sanity.
  - Pass: Predictable loss behavior and recovery time; no long-lived head-of-line stalls.
  - Evidence: Queue depth trace, scheduler stats, drop-by-reason counters, recovery timestamps.
- Inline crypto “break points”: small packets, rekey, replay-window stress
  - Setup: Forced rekey/SA update cadence; replay-window sweeps; mixed flows (short + elephant).
  - Pass: No rekey-triggered traffic collapse; replay handling does not push latency beyond the guardrail.
  - Evidence: Crypto engine busy %, crypto error counters, rekey event log with timestamps.
- Ethernet PHY/retimer margin & BER validation
  - Setup: PRBS/eye-margin tools where supported; temperature sweep; cable/backplane variants.
  - Pass: Stable link training; no temperature-only failures; BER stays within the margin target.
  - Evidence: BER report, link flap counters, retrain reasons, per-port error histograms.
- Timing sanity (PTP/SyncE inside the box)
  - Setup: SyncE lock/unlock cycles; PLL holdover; timestamp consistency under load.
  - Pass: Clean alarm behavior; deterministic switchover logic; timestamps remain monotonic and bounded.
  - Evidence: PLL lock state logs, holdover entry/exit events, timestamp delta statistics.
- Power transient / brownout evidence
  - Setup: Input droop, hot-plug, fan-stop, and load steps (SerDes + crypto worst case).
  - Pass: No silent corruption: faults must be observable and attributable (rail, temperature, watchdog, etc.).
  - Evidence: PMBus snapshots (V/I/T/fault flags), reset-cause register, brownout event log.
2) Production tests (manufacturing & line bring-up)
- Port bring-up & link stability screening
  - Method: Loopback fixtures or known-good peers; speed-matrix sweep (per SKU).
  - Pass: Zero unexpected link flaps in the dwell window; consistent training results across units.
  - Record: Per-port link counters + firmware build ID + PHY/retimer configuration checksum.
- Inline crypto self-test (fast path)
  - Method: Known-answer tests (KAT) + short traffic run in crypto-on mode.
  - Pass: No crypto error flags; expected throughput floor is met on the golden pattern.
  - Record: Crypto KAT result + engine telemetry snapshot + event log excerpt.
- PMBus health snapshot & threshold sanity
  - Method: Read rail V/I/T/P + fault status; verify thresholds loaded (OV/UV/OCP/OTP).
  - Pass: Telemetry matches the golden range; thresholds match the signed config bundle.
  - Record: PMBus dump + config signature + rail sequencing result.
- Secure boot / measured boot proof (only the evidence, not the whole security story)
  - Method: Verify boot-chain status bits + TPM measurement presence (where used).
  - Pass: Boot measurement extends successfully; rollback-protection state is consistent with SKU policy.
  - Record: Boot attestation summary + firmware version + rollback counter.
3) Field drills (site acceptance & operability)
- Brownout / power-cycle drill
  - Pass: Service resumes within SLA; root cause remains attributable (no “mystery reboot”).
  - Evidence: Reset cause + PMBus fault snapshot + time-stamped recovery markers.
- Fan failure / thermal derate drill
  - Pass: Controlled derate (or safe shutdown) with alarms; no uncontrolled link instability.
  - Evidence: Fan tach + temperature trend + derate decision log + port error counters.
- Link degradation drill (cable/connector/EMI-like symptoms)
  - Pass: Clear differentiation between congestion vs BER-driven retransmits vs timing alarms.
  - Evidence: Link flap reason, FEC/PCS counters (where available), queue stats, timestamp deltas.
- Upgrade + rollback proof
  - Pass: Upgrade is reversible; rollback produces a verifiable artifact trail.
  - Evidence: Signed image ID, rollback counter, success/fail markers, post-boot health summary.
4) Reference BOM (example part numbers to anchor procurement)
These are example material numbers commonly used to implement the measurable behaviors above. Final selection must match port speeds, thermal limits, and SKU policy.
| Block | Example P/N (specific) | Why it appears in this checklist |
|---|---|---|
| Forwarding / packet processing silicon (examples) | Broadcom BCM56880 (Trident4 series); Broadcom BCM88690 (Jericho2) | Anchors “Gbps vs 64B Mpps vs latency tail” acceptance; counter/queue behavior and deterministic drops matter. |
| Inline crypto accelerator (example family) | Marvell Nitrox CNN35XX-NHB (adapter family) | Anchors crypto-on/off A/B tests, rekey stability, and crypto error telemetry without hand-waving. |
| Multi-rate Ethernet PHY / gearbox (example) | Marvell Alaska C 88X5123 | Anchors link training, BER margin screening, and “temperature-only” failures on dense high-speed ports. |
| 10GBASE-T / NBASE-T PHY (example) | Marvell Alaska X 88X3310P | Anchors copper-port stability tests, link flap reason logging, and manufacturing speed-matrix sweeps. |
| High-speed retimer (example) | Texas Instruments DS280DF810 | Anchors SerDes margin recovery, deterministic training behavior, and repeatable BER screening across builds. |
| Jitter attenuator / clock cleaner (examples) | Skyworks/SiLabs Si5341; Skyworks Si5392 | Anchors SyncE/PTP internal clock sanity, alarm behavior, and holdover entry/exit evidence paths. |
| PMBus power sequencer/monitor (example) | Texas Instruments UCD90120A | Anchors “sequencing/PG mistakes” and production PMBus dumps (rails, thresholds, fault flags). |
| +48V hot-swap & PMBus power monitor (examples) | Analog Devices ADM1272; Analog Devices LTC4287A | Anchors brownout/hot-plug drills with readable evidence: current/voltage/power + fault timestamps. |
| 4.5–60V eFuse (example) | Texas Instruments TPS26633 | Anchors board-level power-path protection and controlled faults that generate actionable logs. |
| Fan control & tach monitoring (example) | Microchip EMC2305 | Anchors fan-failure drill evidence (tach, PWM, alerts) and thermal derate determinism. |
| Multi-channel temperature sensing (example) | Texas Instruments TMP464 | Anchors hotspot attribution (PHY/VRM/ASIC regions) and consistent thermal logs across units. |
| TPM for measured boot evidence (example) | Infineon OPTIGA TPM SLB9670VQ2.0 | Anchors “measured boot / rollback proof” as an artifact, without expanding into full security architecture. |
| OOB management / BMC SoC (example) | ASPEED AST2600 | Anchors persistent event logs, sensor collection, and remote evidence retrieval in field drills. |
Figure F6 — “Done” means evidence across Lab → Production → Field
H2-12 · FAQs ×12
Short, field-focused answers with measurable evidence. Each answer links back to the deep-dive section.
1) Why does a UPF box hit Gbps but fail on Mpps (64B packets)?
Per-packet work (parsing, metadata updates, counter writes) consumes fixed cycles regardless of packet size, so at 64B the pipeline becomes packet-rate limited while link bandwidth is still underused. Verify with 64B/IMIX Mpps curves and pipeline exception counters, not a Gbps headline.
2) What symptoms indicate the bottleneck is tables/queues vs PHY errors?
If fec_corrected rises before queue pressure, suspect link integrity over congestion; if queue high-watermarks and policy-drop reasons lead, suspect tables/queues.
3) Where should inline crypto sit to minimize latency spikes?
Port-side (MACsec) keeps the forwarding pipeline clean but concentrates power and thermal load at the port complex; mid-pipeline (IPsec/DTLS-style) allows flexible policy binding but amplifies P99 spikes if lookup, queueing, and crypto contention are not tightly bounded. Pick the placement whose contention you can measure and bound, and A/B test crypto on/off at 64B.
4) Why does enabling IPsec/MACsec sometimes cause intermittent packet loss?
Correlate key_update_event, replay/drop counters, and crypto busy/queue depth. If loss clusters around updates, smooth the rekey cadence and verify fallback behavior.
5) Retimer vs PHY vs switch—how to decide what is actually needed?
A PHY provides physical-layer transceiver functions plus PCS/FEC statistics; a retimer only re-clocks and equalizes to restore margin over long or lossy channels; a switch adds forwarding and is not a margin fix. If the channel budget closes cleanly, neither helper is needed; if errors are rate- or length-dependent, start with a retimer.
6) “Link is up but error counters grow”—how to debug SI/PI vs clock noise?
If errors rise with load across many ports, suspect power integrity or clock noise and check the jitter cleaner’s lock/alarm telemetry (e.g., Si5341). If only certain ports/cables fail, suspect SI.
7) How to tell PTP/SyncE issues are internal clock-tree faults, not the network?
Watch synce_lock/ptp_lock state, ref_select_state, and timestamp delta statistics under steady input. A clean input with recurring internal alarms points inside the appliance.
8) What telemetry must be logged to prove brownout/VRM trips caused resets?
Preserve reset_cause, per-rail UV/OV/OC/OT fault flags, and pre/post PMBus readings across the reboot. A sequencer such as the UCD90120A or a hot-swap monitor can capture fault context that survives reboots. Without a snapshot at the event moment, field resets become “unattributable” and slow RMA.