Edge Router / SD-WAN Appliance Hardware Architecture
An edge router or SD-WAN appliance succeeds when its datapath, crypto offload, Ethernet interfaces, and power/reset design are sized to the real workload—not the port label. This page shows how to translate traffic (PPS, flows, tunnels) into silicon choices and the counters/tests that prove stability in production and in the field.
H2-1 · What it is: Edge Router vs SD-WAN Appliance
An edge router appliance is a purpose-built network endpoint at the WAN/LAN boundary that forwards packets at line rate while enforcing routing, QoS, and policy. An SD-WAN appliance adds an overlay/underlay control loop—tunnel formation, path measurement, and policy-driven steering—so the same box must sustain higher flow state, frequent rekeys/handshakes, and always-on encryption without turning latency and loss into “random” field failures.
In-scope for this page: hardware architecture (control/data plane split), packet pipeline, IPsec/SSL offload boundary, Ethernet PHY/SerDes stability, PMIC sequencing, watchdog/reset strategy, and validation evidence.
Out-of-scope (link out only): DPI/UTM/IPS detection chains, subscriber AAA/BNG functions, carrier-scale CGNAT internals.
Practical boundary (hardware impact, not marketing):
| Dimension | Edge Router (baseline) | SD-WAN Appliance (what changes) |
|---|---|---|
| Traffic shape | Mixed enterprise flows; QoS/ACL can be heavy; performance often bounded by PPS + queues. | Overlay tunnels dominate; packet headers grow; traffic becomes state-heavy (more flows + more tunnel events). |
| Compute pressure | Fast path lookup + scheduling; control CPU mainly for routing/policy. | More continuous crypto + telemetry; handshake/rekey bursts can spike CPU/DDR even when average throughput looks fine. |
| Failure modes | Queue drops under microbursts; link flaps/BER; mis-sized buffers show up as jitter. | “Encryption on = throughput collapse”, rekey storms, MTU/fragmentation surprises, and latency spikes tied to state churn. |
| What to size | PPS, queue depth/scheduler, ACL scale, port mix, SerDes margin. | Crypto throughput + handshake rate, SA/session capacity, flow-cache behavior, telemetry cost, plus the same port/queue fundamentals. |
Key KPIs (must be stated as measurable items):
Recommended topics (link out only): Firewall/UTM Gateway · BNG/BRAS · Carrier-Grade NAT
H2-2 · Workload model: what really drives silicon choice
Silicon selection fails most often when “port speed” is treated as the requirement. Edge appliances are constrained by packet rate, state scale, crypto behavior, and queue dynamics. A box that looks perfect on large-packet throughput can still collapse under small packets, tunnel churn, or QoS-on microbursts. The correct method is to define workloads first, then map each workload to its dominant bottleneck and to the counters that prove it in validation and field logs.
Three workloads that must be specified (each implies a different bottleneck):
- Small packets / high PPS: typical in mixed enterprise + control traffic (VoIP, IoT chatter, keepalives). Symptoms: latency spikes, queue drops, CPU interrupts rising even when “Gbps” looks low. Stress points: parser/classifier, lookup, queue/scheduler.
- Large packets / high throughput: backups, video, replication. Symptoms: near line-rate until thermals/power throttle or DMA/DDR headroom disappears. Stress points: SerDes margin, DMA engines, DDR bandwidth for buffers/counters.
- Encrypted tunnels / high session churn: IPsec + TLS overlays with frequent rekeys and many sites. Symptoms: “encryption on = throughput drops”, jitter during rekey, sporadic fragmentation/MTU issues. Stress points: crypto throughput + handshake/rekey rate + state tables + DDR descriptors.
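These three shapes can be turned into a first-order packet-rate budget before any silicon is chosen. A minimal sketch, assuming an illustrative 1 Gbps offered load and a made-up traffic mix:

```python
# Back-calculate packet rate from offered load: PPS ~= bps / (8 * avg_pkt_bytes).
# The traffic-mix values below are illustrative assumptions, not measured data.

def pps_for(bps: float, avg_pkt_bytes: float) -> float:
    """First-order packets-per-second estimate for one traffic class."""
    return bps / (8 * avg_pkt_bytes)

# Hypothetical 1 Gbps offered load split across the three workload shapes.
mix = [
    # (share of bps, average packet size in bytes)
    (0.10, 80),     # small packets: VoIP, keepalives, control chatter
    (0.70, 1400),   # large packets: backups, video, replication
    (0.20, 600),    # tunneled traffic: mid-size after encap overhead
]
total_bps = 1e9
total_pps = sum(pps_for(share * total_bps, size) for share, size in mix)
print(f"blended load: {total_pps / 1e6:.2f} Mpps")

# Worst-case sizing: assume the whole link degenerates to 64B frames.
print(f"64B worst case: {pps_for(total_bps, 64) / 1e6:.2f} Mpps")
```

The gap between the blended estimate and the 64B worst case is exactly why the checklist below demands numeric answers rather than a port-rate label.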
Translate requirements into measurable spec lines (avoid ambiguous marketing):
A complete performance spec should include large-packet throughput, small-packet PPS, IPsec/TLS throughput (bi-directional), handshake/rekey rate, and a degradation curve with QoS/ACL/encapsulation enabled.
Capacity checklist (questions that must have numeric answers):
- Concurrency: max tunnels, max SAs/sessions, max flows; worst-case behavior when flow-cache miss rate rises.
- QoS depth: queue count, per-queue depth, scheduler type; drop behavior under microbursts (tail-drop/WRED).
- Crypto behavior: bulk throughput vs handshake rate; rekey bursts and latency impact; DMA/descriptor efficiency.
- Memory headroom: DDR bandwidth margin under telemetry + counters + buffering; ECC requirement for reliability targets.
- Port realism: port mix and PHY features; stability evidence via CRC/FEC/BER counters and link-training events.
H2-3 · Hardware architecture: control plane vs data plane
A modern edge router / SD-WAN appliance is built around a strict separation between control plane and data plane. The control plane decides “what should happen” (routing, policy, tunnel lifecycle), while the data plane executes “what must happen per packet” (parsing, lookups, QoS/queues, encapsulation/decapsulation) with predictable latency. When per-packet work leaks into the control plane, performance degrades non-linearly and appears as intermittent jitter and loss rather than a clean throughput drop.
Hardware design goal: keep the common case on a deterministic fast path, and make slow-path fallbacks explicit, measurable, and recoverable (logs + counters + reset cause).
Control plane (multicore CPU)
Routing/policy decisions, tunnel setup/rekey, configuration, logging, and exception handling. Not sized for per-packet execution.
Data plane (NPU / programmable forwarding)
Per-packet pipeline: parser → classify → lookup → rewrite → QoS/queues → egress. Optimized for PPS, table scale, and deterministic scheduling.
Side accelerators (crypto and optional co-processors)
IPsec/TLS bulk crypto and rekey assistance. Engineering focus is DMA/DDR interaction and the offload boundary—not DPI chains.
Board fabric (PCIe + DDR + storage)
PCIe endpoints, DDR channels, and storage I/O can become hidden bottlenecks via DMA contention, descriptor pressure, and logging bursts.
Fast path vs slow path (boundary definition):
- Fast path: known headers, known flow/policy, tables and keys already resident; NPU executes forwarding, QoS, and counters deterministically.
- Slow path: unknown flows/protocol exceptions, control packets, table misses, or rekey/update events that require CPU orchestration.
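The non-linear degradation is easy to see with a toy blended-cost model; the fast-path and slow-path cycle costs below are assumptions for illustration, not vendor numbers:

```python
# Toy model: effective per-packet cost when a fraction of traffic leaks to the
# slow path. Costs are illustrative (cycles/packet), not measured values.

FAST_COST = 50       # deterministic NPU fast-path cost (assumed)
SLOW_COST = 5000     # CPU exception-path cost (assumed, ~100x worse)

def effective_cost(miss_rate: float) -> float:
    """Blended cycles/packet for a given slow-path (miss) rate."""
    return (1 - miss_rate) * FAST_COST + miss_rate * SLOW_COST

for miss in (0.0, 0.01, 0.05, 0.10):
    # Throughput scales roughly as 1/cost, so a few percent of misses dominates.
    print(f"miss={miss:4.0%}  cost={effective_cost(miss):7.1f}  "
          f"relative throughput={FAST_COST / effective_cost(miss):6.1%}")
```

Under these assumptions a 1% slow-path rate already halves effective throughput, which is why slow-path fallbacks must be explicit and measurable rather than incidental.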
What to verify in architecture reviews:
H2-4 · Packet pipeline & offload boundary
The data plane is best understood as a deterministic packet pipeline with a small set of hotspots. Feature enablement (ACL scale, QoS complexity, tunneling, and encryption) changes the per-packet work and can force traffic onto slower paths when tables miss, state churn spikes, or queues become the dominant limiter. A correct design makes the offload boundary explicit: which steps are always in hardware, which steps are conditionally offloaded, and which steps trigger CPU exception handling.
Typical pipeline (per packet):
- Parse: L2/L3/L4 header extraction
- Classify: policy tagging, ACL match preparation
- Lookup: LPM/flow hash/ACL tables (TCAM/hashed tables)
- Rewrite: header edits (TTL/DSCP); NAT is noted here only as a boundary (its internals are out of scope)
- Encap/Decap: VXLAN/GRE/IPsec tunnel handling
- Crypto: bulk encryption/decryption (offloaded when available)
- Meter/Queue: policing, queue selection, buffering
- Schedule: shaping and scheduler (latency/jitter trade-offs)
- Egress: transmit to PHY with backpressure handling
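Every stage above must fit a fixed per-packet budget at line rate. A quick budget calculation, assuming a 1 GHz pipeline clock (the clock rate is an assumption; the 10G minimum-frame arithmetic is standard):

```python
# Per-packet time/cycle budget at line rate with minimum-size frames.
LINE_RATE_BPS = 10e9
MIN_FRAME_BYTES = 64 + 20   # 64B frame + preamble (8B) + inter-frame gap (12B)
CLOCK_HZ = 1.0e9            # assumed 1 GHz pipeline clock

pps = LINE_RATE_BPS / (8 * MIN_FRAME_BYTES)   # ~14.88 Mpps at 10G
ns_per_pkt = 1e9 / pps                        # ~67 ns budget per packet
cycles_per_pkt = CLOCK_HZ / pps               # cycles available per packet

print(f"{pps / 1e6:.2f} Mpps, {ns_per_pkt:.1f} ns/pkt, "
      f"{cycles_per_pkt:.1f} cycles/pkt")
```

At roughly 67 cycles per packet, a single table miss or memory stall consumes the whole budget, which is why the hotspots below dominate real-world behavior.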
Three hotspots that dominate real-world performance:
Hotspot 1 — Lookup (tables & cache behavior)
Rule count and table misses amplify latency. Flow-cache miss rate and multi-stage lookups can turn “line rate” into jitter and drops.
Hotspot 2 — Crypto (bulk + rekey bursts)
Throughput alone is insufficient: handshake/rekey rate and DMA/DDR interaction often explain “encryption on = collapse”.
Hotspot 3 — Queues/Scheduler (QoS realism)
QoS changes queueing cost. Microbursts and buffer depth decide p99 latency and voice/video quality even when average throughput is stable.
Fast path triggers vs slow path triggers (what forces exceptions):
- Fast path stays valid when: flow/policy/SA is already installed; headers are known; no special exception rules apply.
- Slow path is triggered by: unknown/new flows, protocol exceptions, table misses, control packets, rekey/update events, and heavy per-packet sampling/mirroring.
Common “throughput looks fine but jitter explodes” causes:
- Queueing dominance: scheduler policy mismatched to traffic; queue depth too small/large for microbursts → p99 latency spikes.
- DDR/DMA contention: buffers + counters + descriptors compete; bursts (rekey/log writes) create latency cliffs.
- State churn: short-lived flows and tunnel churn reduce cache hit rate; performance becomes non-linear and intermittent.
Verification signals (minimal, non-invasive): queue drops/depth · crypto utilization/rekey events · DDR bandwidth watermarks · CRC/FEC errors
H2-5 · IPsec/SSL/TLS acceleration in practice
“Crypto offload” only matters when the offload boundary is explicit and measurable. In real deployments, performance collapses more often from handshake/rekey bursts, DMA/DDR contention, and queue-induced jitter than from the cipher primitive itself. A credible platform separates control-plane negotiation from data-plane bulk crypto, and exposes counters that prove where the work is executed.
Engineering focus: define what is handled on the fast path (bulk crypto + encap/decap) vs what triggers the slow path (handshake, policy/SA install, exceptions). Treat “Gbps with IPsec” as incomplete unless handshake rate, tunnel scale, and jitter under rekey are stated.
Where offload can live (and why it behaves differently):
- Integrated crypto engine (on-SoC / near-NPU): lowest copy overhead when descriptors and buffers stay on the same fabric.
- External accelerator (PCIe card/module): can scale bulk throughput, but adds DMA and PCIe scheduling as a first-class bottleneck.
- Pipeline coupling point: the practical boundary is around encap/decap; bulk crypto must align with tunnel processing to avoid extra copies.
Metrics that are actually comparable (publishable spec items):
Common engineering failure modes (with observable symptoms):
- Control-plane bottleneck: many tunnels / frequent rekeys → CPU spikes → intermittent drops and jitter even if bulk crypto is “offloaded”.
- DMA/DDR bottleneck: throughput looks acceptable but p99 latency spikes appear during bursts (descriptor pressure, buffer contention).
- Header growth & MTU: encapsulation + crypto overhead → fragmentation/reassembly → lower goodput and bursty retransmits.
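The header-growth failure mode can be budgeted up front. A sketch that estimates inner MTU and a TCP MSS clamp, assuming typical IPv4 ESP tunnel-mode overheads with AES-GCM; the actual overhead depends on cipher, mode, and options, so verify against the deployed configuration:

```python
# Estimate the effective payload MTU inside an ESP tunnel so post-encap packets
# fit the physical MTU without fragmenting. Overhead values are typical for
# IPv4 ESP tunnel mode with AES-GCM (an assumption, not a universal constant).

PHYS_MTU = 1500
OUTER_IP = 20      # new IPv4 header (tunnel mode)
ESP_HDR  = 8       # SPI + sequence number
IV       = 8       # AES-GCM IV
ESP_TRL  = 2       # pad length + next header
PAD_MAX  = 3       # worst-case padding to a 4-byte boundary
ICV      = 16      # GCM integrity tag

overhead = OUTER_IP + ESP_HDR + IV + ESP_TRL + PAD_MAX + ICV
inner_mtu = PHYS_MTU - overhead
print(f"ESP overhead ~{overhead}B, inner MTU ~{inner_mtu}B")

# TCP MSS clamp inside the tunnel: subtract inner IP (20B) + TCP (20B) headers.
print(f"suggested MSS clamp: {inner_mtu - 40}")
```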
Quick “true offload vs fake offload” checks:
- With encryption enabled, does CPU utilization rise proportionally with encrypted throughput? (A proportional rise often indicates the data path is still CPU-heavy.)
- Do crypto-engine counters (utilization, drops) correlate with traffic and rekey events?
- When tunnel count increases, does performance degrade smoothly or cliff at a threshold? (cliffs usually point to handshake/table/DDR limits)
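The first check can be scripted as a slope test over a throughput sweep: if CPU utilization climbs steeply per encrypted Gbps, the per-packet crypto work is likely still on the CPU. The sample points below are hypothetical measurements, not real data:

```python
# Crude linearity check: fit CPU% vs encrypted Gbps and inspect the slope.
# Sample points are hypothetical measurements from a throughput sweep.

samples = [(1.0, 12.0), (2.0, 14.0), (4.0, 17.0), (8.0, 22.0)]  # (Gbps, CPU%)

n = len(samples)
mean_x = sum(x for x, _ in samples) / n
mean_y = sum(y for _, y in samples) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in samples)
         / sum((x - mean_x) ** 2 for x, _ in samples))

# With true offload the slope should be small (CPU cost roughly flat per Gbps);
# a steep slope suggests the encrypted data path is still CPU-bound.
print(f"CPU cost: {slope:.2f} %CPU per encrypted Gbps")
```

The pass/fail threshold for the slope is platform-specific; the point is to require a measured number rather than a "hardware offload" checkbox.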
H2-6 · Ethernet interfaces: PHY/SerDes/retimers/port mix
Port labels (1G/10G/25G/100G) do not guarantee real-world stability. Interface robustness depends on the PHY/SerDes margin, board/channel loss, connector quality, and how link training and FEC behave under temperature and EMI. A practical hardware view starts from the connector and follows the signal path into the SoC, then maps stability issues back to observable counters such as training retries, CRC/PCS errors, and FEC corrected/uncorrectable events.
Port mix (kept at interface level):
- Connector types: RJ45 / SFP / QSFP (only to frame channel budget; module internals belong elsewhere).
- Management path: MDIO/sideband access to link status, temperature, and error counters enables field troubleshooting.
- Stability proof: link training behavior + error counters are more meaningful than nominal port speed.
PHY-side engineering points that matter:
Link training & autoneg
Training retries and long bring-up time often indicate insufficient margin (loss, reflections, or EMI sensitivity).
FEC behavior
“Corrected” rising steadily is a warning sign; “uncorrectable” correlates with packet loss and service-impact events.
Error counters (CRC/PCS)
CRC/PCS errors help distinguish channel issues from higher-layer problems; correlate with temperature and cable/connector changes.
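"Corrected rising steadily" can be made concrete by trending the corrected-codeword rate across periodic counter samples. A sketch with hypothetical counter values:

```python
# Flag a link whose FEC corrected-codeword *rate* trends upward.
# Samples are (seconds, cumulative corrected count); values are hypothetical.

samples = [(0, 1000), (60, 1600), (120, 2500), (180, 3900)]

rates = [(t1, (c1 - c0) / (t1 - t0))
         for (t0, c0), (t1, c1) in zip(samples, samples[1:])]

# Warn if every interval's rate exceeds the previous one (steady growth),
# which often precedes uncorrectable events and packet loss.
steadily_rising = all(b[1] > a[1] for a, b in zip(rates, rates[1:]))
print(f"rates (cw/s): {[r for _, r in rates]}  rising={steadily_rising}")
```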
When a retimer is justified (rule-based triggers):
- Channel loss is high: long PCB routes, multiple connectors/backplane, dense routing and vias.
- Lane rate is high: SerDes margin shrinks quickly with speed; training becomes sensitive to drift.
- Environment is harsh: temperature swing and EMI push the link over the edge intermittently.
- Soft evidence: training retries climb; FEC corrected grows; occasional uncorrectable bursts appear.
Practical diagnosis hint: if errors track temperature or specific port/cable swaps, treat it as a channel-margin problem first (connector loss, routing, retimer/equalization), not as a routing or policy problem.
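The rule-based triggers above can be encoded as an explicit decision gate. The numeric thresholds below are illustrative placeholders; real limits come from the SerDes and channel loss budget for the specific design:

```python
# Rule-based retimer gate derived from the triggers above. Thresholds are
# illustrative placeholders, not datasheet limits.

def needs_retimer(channel_loss_db: float, lane_gbd: float,
                  training_retries_per_day: float,
                  fec_uncorrectable: int) -> bool:
    """Return True if any trigger suggests adding a retimer/equalization."""
    if channel_loss_db > 20.0:                        # high insertion loss
        return True
    if lane_gbd >= 25.0 and channel_loss_db > 15.0:   # margin shrinks with rate
        return True
    if training_retries_per_day > 5:                  # soft evidence: unstable bring-up
        return True
    if fec_uncorrectable > 0:                         # hard evidence: service impact
        return True
    return False

print(needs_retimer(12.0, 10.3125, 0, 0))   # short 10G channel
print(needs_retimer(17.5, 25.78, 2, 0))     # 25G lane over a lossy channel
```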
H2-7 · Memory, storage, and I/O fabric
Data-plane performance depends on determinism, while memory and I/O are typically shared resources. When DDR, DMA, PCIe, and storage write bursts collide with packet buffering, flow/SA tables, and crypto buffers, the first symptom is often tail latency (p99/p999 jitter), not headline throughput. A robust appliance treats DDR bandwidth and I/O topology as first-class design constraints and publishes evidence (counters, correlation tests) that validate isolation.
Engineering focus: “capacity” rarely fails first. The common failure mode is contention—queues, descriptors, and stats/telemetry compete with crypto and forwarding for the same DDR and DMA resources.
DDR: bandwidth, latency, and contention (what competes for cycles):
- Channels & frequency: more parallelism reduces contention sensitivity and stabilizes tail latency under bursts.
- Key DDR consumers: packet buffers/queues, flow & SA/state tables, crypto DMA buffers, telemetry/stats rings, exception handling.
- Practical symptom mapping: stable average throughput with spiky p99 latency often points to DDR/DMA pressure rather than port speed.
ECC: reliability boundary for high-availability edge devices:
- Why ECC matters: prevents silent corruption of tables, pointers, and descriptors that otherwise appear as intermittent, hard-to-reproduce failures.
- Evidence to expose: correctable/uncorrectable events and whether the platform degrades gracefully or resets.
- Operational implication: the goal is fault containment; a corrupted state table is worse than a clean restart.
Storage: keep roles clear and control write behavior:
eMMC / NAND
OS + configs + bounded logs. Prefer rate-limited logging to avoid burst contention and wear.
NVMe (optional)
High-volume logs, crash dumps, large images. Useful, but can introduce PCIe/DMA contention and tail-latency spikes under bursts.
Write discipline
Control the burstiness: log level, batching, and backpressure matter more than peak device bandwidth.
PCIe topology: isolation is a performance feature:
- Shared root complex risk: port expansion, crypto cards, and NVMe can contend on the same fabric.
- What to validate: correlate NVMe writes or crypto DMA peaks against port jitter and drop counters.
- Design intent: keep critical data-plane DMA paths short and predictable; isolate bulk logging paths where possible.
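The validation step (correlating NVMe writes or crypto DMA peaks against port jitter) can be a simple correlation over time-aligned telemetry samples. The data below is hypothetical:

```python
# Correlate storage-write bursts with port p99 latency over aligned 1s samples.
# Values are hypothetical; in practice both series come from telemetry.
import math

nvme_mbps = [10, 12, 300, 280, 15, 11, 320, 14]   # NVMe write rate (MB/s)
p99_us    = [40, 42, 180, 170, 45, 41, 195, 43]   # port p99 latency (us)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(nvme_mbps, p99_us)
# r close to 1.0 is strong evidence of shared-fabric (PCIe/DDR) contention.
print(f"correlation: {r:.3f}")
```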
H2-8 · Power tree, PMIC, sequencing & watchdogs
Edge routers and SD-WAN appliances fail in the field less from “insufficient wattage” and more from sequencing, reset dependencies, and transient behavior. A stable platform defines power domains, enforces PG/RESET relationships, and uses watchdog and supervisor logic that avoids “busy-time false kills” while still guaranteeing recovery from deadlocks. The difference shows up as repeatable evidence: rail dips, PG chatter, thermal derating, and reset-cause logs that correlate with link drops.
Engineering focus: treat SerDes/DDR/PHY rails as “stability rails”. Many intermittent link issues are power-transient issues first. Publish reset causes and PMIC fault logs so “random resets” become diagnosable events.
Power domains that must be explicit (and why they matter):
- SoC core: compute stability; brownouts often appear as random resets.
- SerDes: margin-sensitive; transient dips can trigger link retraining and sudden drops.
- DDR: training sensitivity; rail instability can cause sporadic crashes or silent corruption if not contained.
- PHY: port stability; thermal or rail noise can increase CRC/FEC events before full link failure.
- Crypto engine: throughput and jitter; rail droop can show up as rekey spikes and throughput collapse.
- Aux (fans/sensors): thermal headroom and derating behavior under load.
Sequencing + PG/RESET dependencies (how to avoid intermittent bring-up failures):
- Order matters: bring up prerequisite rails before DDR/SerDes sensitive rails; verify clocks before releasing resets.
- PG stability: debounce/hold times prevent PG chatter from causing repeated resets and unstable link training.
- Reset granularity: separate resets for SoC, DDR, and PHY allow targeted recovery instead of full system resets.
Watchdog strategy (recover without busy-time false kills):
- Window watchdog: effective against deadlocks but requires careful servicing rules to avoid false triggers under load.
- External supervisor: monitors rails/clocks/heartbeat and can provide cleaner containment than CPU-only watchdogs.
- Brownout reset: ensures clean recovery when rails dip; pair with a reset-cause log for field evidence.
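A health-gated feeding pattern is one way to avoid busy-time false kills: the hardware watchdog is serviced only while every monitored loop keeps proving liveness. A sketch with hypothetical subsystem names and interfaces:

```python
# Health-gated watchdog feeder: the hardware WDT is serviced only when every
# monitored loop has recently completed a cycle. Names are hypothetical.
import time

HEARTBEAT_TIMEOUT_S = 5.0       # per-subsystem staleness limit (assumed)
last_beat = {"routing": 0.0, "tunnel_mgr": 0.0, "telemetry": 0.0}

def heartbeat(name: str) -> None:
    """Called by each subsystem's main loop when it completes a cycle."""
    last_beat[name] = time.monotonic()

def all_healthy(now: float) -> bool:
    return all(now - t < HEARTBEAT_TIMEOUT_S for t in last_beat.values())

def service_watchdog(now: float, kick_hw_wdt) -> bool:
    """Feed the hardware watchdog only if every heartbeat is fresh.

    A stuck subsystem stops the feed, so the WDT fires and the reset cause
    gets logged; busy-but-alive loops keep feeding on time.
    """
    if all_healthy(now):
        kick_hw_wdt()
        return True
    return False

# Example: all loops alive, so the WDT gets kicked.
for name in last_beat:
    heartbeat(name)
print(service_watchdog(time.monotonic(), lambda: None))
```

The same gate works with a windowed watchdog as long as the feed call itself is driven from a periodic loop, not from interrupt context.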
Failure signatures and evidence (make “random” diagnosable):
- Rail dip / PG chatter: PMIC fault logs + PG timestamping correlate with reboots and link retraining.
- Thermal derating: VRM/SoC temperatures trend upward, then throughput drops and CRC/FEC rise.
- Transient → SerDes drop: link retrain counts spike near rail events; ports flap with repeatable temperature/load patterns.
H2-9 · Observability & operations: telemetry that proves stability
Observability is not a list of protocols. It is a proof system that maps field symptoms to specific counters and narrows root cause to a small set of domains: PHY, queues, CPU/control plane, crypto, DDR/DMA, or thermal/power. The minimum goal is a repeatable workflow that explains why drops or jitter happen, and what to check next.
A stable appliance exposes a minimal evidence set: link health, per-queue drops, crypto utilization and handshake rate, CPU pressure, DDR/DMA watermarks, temperature/power, and reset causes. Without these, “random” incidents stay random.
Minimal evidence set (collect these first):
Port & link health
CRC, FEC corrected/uncorrectable, training retries, link flaps.
Queues & drops
Per-queue drops, buffer occupancy (high-watermark), scheduler events (WRED/ECN if used).
CPU / control-plane pressure
CPU load, interrupt/softirq pressure, policy/route install latency, exception rates.
Crypto performance
Crypto engine utilization, handshake/rekey rate, SA counts, replay drops (if present).
DDR / DMA pressure
DDR bandwidth watermark, DMA backlog, descriptor ring occupancy, cache pressure (if exposed).
Thermal / power
SoC/PHY/VRM temperatures, throttle events, power draw, fan speed, reset causes.
Event-focused logging (make incidents diagnosable):
- Boot & reset causes: brownout, watchdog, thermal, software-initiated reboot, fault-latched reset.
- Link events: flap episodes, training failures, sudden FEC/CRC trend shifts.
- Tunnel events: rekey/reconnect bursts, negotiation failures, policy mismatches.
- Control events: policy push failure, route install timeouts, exception bursts.
- Include a snapshot: record a small counter set at the event timestamp to preserve context.
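The snapshot requirement can be implemented as a bounded ring of counter samples that is frozen when an event fires. The counter names below are hypothetical:

```python
# Event-triggered snapshot: keep recent counter samples in a bounded ring
# buffer and freeze a copy on each event. Counter names are hypothetical.
from collections import deque
import time

RING = deque(maxlen=64)          # bounded: survives bursts without growing

def sample_counters() -> dict:
    # In a real system these values come from HW/driver counters.
    return {"ts": time.time(), "q_drops": 0, "fec_corr": 12,
            "crypto_util": 0.41, "ddr_wm": 0.63}

def tick() -> None:
    RING.append(sample_counters())

def on_event(kind: str) -> dict:
    """Freeze context at the event timestamp for later diagnosis."""
    return {"event": kind, "ts": time.time(), "recent": list(RING)}

for _ in range(5):
    tick()
snap = on_event("link_flap")
print(snap["event"], len(snap["recent"]))
```

The fixed `maxlen` keeps the memory cost of observability bounded, which matters because telemetry itself competes for DDR bandwidth.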
Remote diagnosis: the smallest operational set (without deep management stacks):
- Independent access path value: keeps counters and reset causes reachable when the data plane is degraded.
- What must be reachable: link health, per-queue drops, crypto and CPU pressure, DDR watermarks, thermal, reset causes.
H2-10 · Validation & production checklist
Validation is complete only when performance, stability, and recovery are proven together. A credible test plan includes throughput (large packets), PPS (small packets), mixed traffic with QoS features enabled, and crypto tested both as bulk throughput and as handshake/rekey stress. Stability must be demonstrated under thermal stress, fault injection (fan or power disturbances), and long-run endurance while continuously recording the minimal evidence counters.
Pass/fail must be defined as thresholds (goodput, PPS, p99 latency, drops, FEC/CRC trends, recovery time), plus required telemetry records. A number without evidence is not a test result.
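Thresholds stay honest when they are data rather than prose. A sketch of a pass/fail evaluator; the limit values are illustrative, not recommendations:

```python
# Pass/fail expressed as explicit thresholds. Limits are illustrative.
THRESHOLDS = {
    "goodput_gbps":   ("min", 9.0),
    "pps_64b_mpps":   ("min", 10.0),
    "p99_latency_us": ("max", 150.0),
    "drop_pct":       ("max", 0.01),
    "recovery_s":     ("max", 5.0),
}

def evaluate(results: dict) -> list:
    """Return the list of failed checks; an empty list means pass."""
    failures = []
    for key, (kind, limit) in THRESHOLDS.items():
        value = results[key]
        ok = value >= limit if kind == "min" else value <= limit
        if not ok:
            failures.append(f"{key}={value} violates {kind} {limit}")
    return failures

# Hypothetical run: meets everything except p99 latency.
run = {"goodput_gbps": 9.4, "pps_64b_mpps": 12.1,
       "p99_latency_us": 240.0, "drop_pct": 0.003, "recovery_s": 3.2}
print(evaluate(run))
```

Attaching the telemetry records required per scenario to the same structure turns the checklist below into an executable acceptance matrix.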
Performance baseline (measure with features on/off):
- Large-packet throughput: bidirectional goodput with fixed MTU profiles and clear traffic mix.
- Small-packet PPS: include control-like traffic patterns; PPS is often the first limiter.
- Mixed traffic: large + small packets, multi-flow, multi-queue; record p99/p999 latency and per-queue drops.
- QoS impact: compare QoS/ACL features enabled vs disabled; record scheduler and drop counters.
Crypto validation (separate bulk from handshake capacity):
- Bulk crypto throughput: IPsec/SSL enabled, measure bidirectional goodput and tail latency.
- Handshake/rekey stress: tunnel creation rate and rekey cadence; measure stability and jitter during transitions.
- Evidence required: crypto utilization, handshake rate, CPU pressure, queue drops, DDR/DMA watermarks.
Stability & fault injection (prove graceful behavior):
- Thermal steady-state: run mixed traffic at high load until temperatures stabilize; watch FEC/CRC trends and throttle events.
- Fan fault / reduced airflow: confirm controlled derating rather than unpredictable port flaps.
- Power disturbance evidence: validate reset causes, PG stability, and post-recovery link retraining behavior.
- Long-run endurance: multi-hour or multi-day run with continuous counters and bounded drift.
Operational scenarios (field realism):
- Policy churn: frequent route/ACL/QoS updates; verify install latency and data-plane stability.
- Tunnel churn: frequent reconnect/rekey; verify recovery time and avoid cumulative performance collapse.
- Impairment recovery: inject delay/loss/jitter and measure convergence and recovery time with evidence counters.
H2-11 · BOM / IC selection checklist (criteria-first, not part-number dumping)
This section turns the architecture into a sourcing-ready checklist. The goal is simple: match workload → prove KPIs → choose silicon. Each category lists decision criteria first, then provides example part numbers as fast starting points for procurement and engineering screening.
Checklist A — NPU / SoC (datapath capacity and scale)
- Separate KPIs: plain throughput vs encrypted throughput; do not accept a single “Gbps” number.
- PPS matters: confirm 64B/128B PPS under mixed flows (control traffic + tunnels + user traffic).
- State scale: flow table size, update rate, cache hit expectations, and exception path rate limits.
- Policy scale: LPM/ACL/QoS rule limits and how feature enablement changes throughput and jitter.
- Memory coupling: DDR channels, bandwidth headroom, and whether telemetry/logging steals datapath cycles.
- Programmability: SDK maturity, pipeline flexibility, and upgrade safety (policy updates without collapse).
- Thermal/power: peak power under small-packet + crypto load defines heatsink and VRM margins.
- Lifecycle/ecosystem: long-term availability, reference designs, and toolchain stability.
- Marvell OCTEON 10 (CN10K series) — datapath-centric SoCs with strong inline acceleration options.
- NXP Layerscape LS1046A — gateway/router-class SoC family with integrated networking and security engine.
- Intel Xeon D-2700/2800 — x86-based SD-WAN/uCPE route when software ecosystem is the priority.
Checklist B — Crypto offload (IPsec/SSL/TLS: bulk vs handshake)
- Offload position: inline vs lookaside DMA (affects jitter and DDR pressure).
- Bulk throughput: specify cipher/mode, MTU, bidirectional vs unidirectional, and tunnel counts.
- Handshake rate: new tunnel/session rate and rekey cadence; CPU negotiation is often the true limiter.
- SA/session scale: maximum SA count, replay window behavior, and key update impact on latency.
- DMA efficiency: watch for “crypto util looks low but throughput collapses” due to memory contention.
- Pipeline coupling: define where encap/decap happens and which counters expose crypto drops and causes.
- Observability: crypto utilization, rekey time, drops-by-reason are mandatory for field diagnosis.
- Feature realism: do not validate with “single tunnel, big packets only”; include mixed traffic.
- Intel QuickAssist Adapter 8970 / 8960 — add-in acceleration path for bulk crypto and compression use cases.
- Inline crypto engines in OCTEON/Layerscape-class SoCs — integrated path when minimizing PCIe/DDR overhead is key.
Checklist C — Ethernet PHY / SerDes / Retimer (port reliability and margin)
- Port mix definition: RJ45 (multi-rate BASE-T) vs SFP/QSFP (SerDes-facing) with realistic cable/optics constraints.
- Host-side interface: match SoC/switch SerDes modes (USXGMII/XFI/SGMII and friends) without fragile bridges.
- Error visibility: CRC/FEC counters, training failures, flap counts must be readable (MDIO/I²C paths defined).
- Thermal drift: confirm behavior across temperature; “warm-up flaps” are usually margin issues.
- EEE/FEC impacts: validate latency/jitter changes when EEE/FEC features are toggled.
- Retimer decision gate: required when insertion loss / long traces / connectors reduce SerDes eye margin.
- Retimer cost: added latency, power, tuning complexity, and reference clock requirements.
- Production proof: BER/training retry rates under stress, not “link lights on”.
- Marvell Alaska 88E1512 — common GbE PHY reference point.
- Marvell Alaska X 88X3310P — multi-gig/10GBASE-T PHY family example.
- TI DS280DF810 — multi-lane retimer option for high-loss board channels.
Checklist D — PMIC / sequencing / supervisor / watchdog (reliability and evidence chain)
- Power domains list: SoC core, SerDes, DDR, PHY, storage, fans; define dependencies and startup order.
- PG/RESET behavior: thresholds, deglitching, and reset release ordering; avoid “busy-load false resets”.
- Telemetry: rail voltage/current/temp readouts and fault flags that support field root-cause analysis.
- Fault logging: brownout, overtemp, current limit, watchdog resets must be traceable.
- Watchdog mode: windowed vs simple WDT; align timeout strategy to peak control-plane workload.
- Recovery design: controlled restart sequences should avoid port retrain storms and repeated flaps.
- Manufacturing access: PMBus/SMBus/I²C registers and scripts can validate rails quickly.
- Thermal coupling: VRM derating and rail dips must not masquerade as “random networking issues”.
- ADI LTC2977 — multi-rail power system manager (telemetry + sequencing + fault handling).
- ADI LTC2974 — smaller channel-count power system manager variant.
- TI TPS386000 — multi-rail supervisor with watchdog for reset/monitor chains.
Checklist E — System integration quick check (avoid platform-level surprises)
- DDR headroom: confirm no collapse when crypto + telemetry + mixed traffic run together.
- PCIe topology: avoid NIC/crypto/storage contention on a single uplink; plan isolation for peak bursts.
- Counter completeness: must expose queue drops, crypto utilization, DDR watermark, CRC/FEC, reset causes.
- Thermal reality: validate steady-state under worst realistic airflow; watch for derating and link flaps.
- Policy/tunnel churn: frequent updates and rekeys must not trigger latency spikes or exception storms.
- Recovery behavior: prove controlled restarts and predictable link retraining after disturbances.
- Evidence retention: logs and snapshots must survive incidents without exhausting storage or DDR bandwidth.
Procurement shortcut: if a platform cannot provide the evidence counters above, it is not suitable for field operations, regardless of headline throughput claims.
H2-12 · FAQs (Edge Router / SD-WAN Appliance)
Each answer stays within this page’s boundary: datapath/crypto/PHY/power/telemetry/validation.
1. What is the hardware boundary between an edge router and an SD-WAN appliance?
An edge router is sized around underlay forwarding: stable L3 lookups, QoS, and predictable PPS/latency on WAN/LAN interfaces. An SD-WAN appliance adds overlay tunnels, path measurement, and frequent policy updates, which increase state scale (flows/tunnels), crypto requirements, and telemetry needs. The practical boundary is where tunnel count, handshake churn, and measurement cadence start dominating silicon choices.
2. Why can a “10G” port drop to half throughput when QoS/ACL is enabled?
Port speed labels describe the physical link, not the internal packet pipeline. Enabling QoS/ACL adds per-packet classification, additional table lookups, and queueing/scheduling work, which can shift the bottleneck to PPS, TCAM/lookup stages, or memory bandwidth. Throughput collapses when packets hit exception paths, rule scale exceeds fast-table capacity, or queues become the dominant latency source. Always validate with feature-on workloads and queue/drop counters.
3. Which matters more for NPU selection: PPS or throughput, and how to back-calculate from traffic?
Both matter, but the limiter depends on packet size distribution and enabled features. Back-calculate PPS from business traffic using PPS ≈ bps / (8 × average packet size), then apply a worst-case mix (small control packets + tunnel keepalives + user flows). Small packets stress parser/classifier and scheduling; large packets stress SerDes and crypto bulk. Choose silicon by the worst credible PPS with QoS/ACL and tunnels enabled, not by port rate.
4. When IPsec “turns on and slows down,” is the bottleneck usually CPU, DDR, or the crypto engine?
Separate bulk data-plane crypto from control-plane negotiation. If CPU load spikes during tunnel setup/rekey and throughput falls, the bottleneck is often handshake/SA management. If crypto utilization saturates near 100%, the engine is the limiter. If crypto utilization is moderate but throughput collapses with high DDR watermark or DMA stalls, memory contention is likely starving the datapath. Confirm by comparing plain vs encrypted tests and correlating crypto util, DDR watermark, and latency/jitter.
5. How can “real vs fake” IPsec/SSL offload be verified using metrics and tests?
Use two gates: bulk throughput and handshake rate. Run the same tunnel count and cipher mode with offload enabled/disabled, and verify that encrypted throughput rises while CPU does not scale per packet. Stress with many tunnels, small packets, and frequent rekeys; “fake offload” often collapses under churn. Require counters showing crypto utilization, drops-by-reason, and stable latency under mixed traffic. Validation should be repeatable and included in the acceptance matrix.
6. In small-packet traffic, why do latency and jitter explode, and which queue/scheduler stage is usually responsible?
Small packets amplify per-packet overhead and expose queueing behavior. Latency/jitter typically spikes when buffer management, shaping, or hierarchical scheduling becomes the dominant pipeline stage, or when exceptions push traffic into slow paths. Look for queue depth growth, drops/ECN, and scheduler statistics that correlate with jitter bursts. Common root causes include over-subscription of queues, mis-sized buffers for burst absorption, and mismatched QoS policies that create head-of-line blocking across traffic classes.
7. If ports flap or training fails, how can counters isolate PHY vs SerDes vs power issues?
Start at the edge and move inward with timestamp correlation. PHY-side issues show as autoneg/training failures and rising CRC/FEC errors; SerDes margin issues present as repeated CDR/lock events, retrains, or loss-of-signal patterns. Power and thermal issues often correlate with rail dips, current-limit flags, or temperature thresholds, and can trigger synchronized multi-port flaps. Require MDIO-readable counters and system telemetry so link events can be correlated with rail/temperature and reset causes.
8. Why do failures worsen with temperature, and where are the most common weak points in the power tree?
Temperature reduces electrical margin: VRM efficiency derates, IR drop increases, and SerDes/PLL stability can degrade, making link training and timing more fragile. Weak points are commonly SerDes and DDR rails, where small voltage droops translate into link errors or resets. Evidence should include temperature, rail voltage/current telemetry, PG glitches, and reset causes. A credible design shows thermal headroom under worst-case crypto + PPS load and uses deglitched reset chains to avoid cascading restarts.
9. How should a watchdog be set to avoid false triggers under load while still preventing lockups?
Choose the watchdog mode based on failure model: windowed watchdogs detect both “stuck high” and “stuck low” behaviors, while simple watchdogs mainly catch deadlocks. Set timeouts using worst-case control-plane events (policy updates, route reconvergence, tunnel rekey) and feed from a health-checked loop rather than from interrupt-driven code that can mask deadlocks. Always log reset causes and capture pre-timeout state so watchdog resets become diagnosable events, not mysteries.
10. When logs/telemetry reduce performance, how can design stay observable without slowing the datapath?
Avoid per-packet logging and treat observability as a budgeted workload. Prefer hardware counters sampled at intervals, plus event-triggered snapshots for anomalies (drops, flaps, rekeys). Rate-limit exports and separate management traffic from dataplane memory hot paths; uncontrolled telemetry can raise DDR watermark and create DMA contention that collapses PPS. Store detailed logs asynchronously with backpressure, and keep ring buffers in memory for burst capture while maintaining clear caps tied to DDR watermark and CPU utilization.
11. How should production validation cover encrypted throughput, small-packet PPS, and long-run BER stability together?
Use a matrix that combines performance, features, and stress. Measure plain and encrypted throughput with defined tunnel counts and cipher modes; measure PPS with 64B/128B frames under QoS/ACL; then run long-duration BER and link-flap tracking across temperature and airflow conditions. Add tunnel churn (rekey and renegotiation) and power disturbance tests, and require pass/fail thresholds plus counter capture (queue drops, crypto util, DDR watermark, CRC/FEC, reset causes) for every scenario.
12. Which five criteria are most often missed in NPU/crypto/PHY/PMIC/WDT selection?
Five commonly missed criteria are: (1) handshake/rekey rate under high tunnel churn, not just bulk encrypted Gbps; (2) fast-path hit rate when QoS/ACL and tunneling are enabled; (3) counter visibility for drops-by-reason, training failures, crypto util, DDR watermark, and reset causes; (4) DDR bandwidth headroom and watermark behavior under telemetry plus crypto DMA; and (5) deterministic power/reset chains with deglitched PG and diagnosable watchdog resets.