
Core / Aggregation Router Architecture (100G–800G)


A core/aggregation router is defined by how reliably it forwards at massive scale: large tables (FIB/ACL/TCAM), high-speed 100G/400G/800G ports with measurable link margin, predictable buffering/ECN behavior, and chassis-grade power/thermal/HA that can be proven by counters, telemetry, and a repeatable acceptance test pack.

H2-1 · Boundary & Definition

What a Core/Aggregation Router Is (and What It Is Not)

A core/aggregation router is a system engineered for scale, determinism, and service continuity: it must forward large traffic volumes at sustained line rate while holding large forwarding and policy state (FIB, ACL/QoS classification) and surviving faults (component redundancy, fast failover, in-service maintenance).

Engineering boundary (the “why this product exists” test)

  • Scale pressure: large FIB/policy tables and high port density push memory, power, and cooling constraints into the system design.
  • Deterministic forwarding: strict QoS classification, queueing/scheduling, and predictable congestion behavior are first-order requirements.
  • Operational continuity: redundancy, fault-domain isolation, and maintenance without long outages are part of the product definition.

Core vs Aggregation: the difference is practical, not marketing—each role stresses a different failure mode envelope.

Dimension | Core Router (Backbone) | Aggregation Router (Fan-in / Metro)
Primary stress | Sustained throughput, fabric resilience, fault-domain containment, hitless maintenance targets | Bursty fan-in, microbursts, queue behavior, classification/policy density, per-service isolation
Table reality | Large FIB with strict consistency expectations; high confidence in steady-state forwarding | Heavy policy and QoS classification load; “what gets queued/dropped” matters as much as “where it goes”
What breaks first | Thermal/power headroom, fabric bandwidth margin, upgrade/failover envelopes | Queue thresholds, shared buffer contention, policing/shaping corner cases, burst-loss vs latency tradeoffs

Not this page: this topic is not a data-center ToR/leaf-spine design, not a campus/enterprise switch guide, and not an SD-WAN or security gateway feature overview. The focus stays on the router’s hardware forwarding system: PHY/SerDes realities, forwarding pipeline, table/memory, buffering/QoS mechanics, and power/telemetry foundations.

Figure F1 — Where a core/aggregation router sits (and the boundary)
The diagram anchors scope: this page focuses on router hardware forwarding system constraints (scale, determinism, continuity) rather than DC switch topology or security/optical subsystems.
H2-2 · System Anatomy

System Anatomy: Line Cards, Switch Fabric, Control Plane, Management Plane

A core/aggregation router is best understood as three planes sharing one chassis. The data plane forwards packets at line rate, the control plane programs forwarding state and policies, and the management plane makes the box observable and maintainable (telemetry, alarms, fault logs). Keeping these planes cleanly separated is what enables high availability and fast fault isolation.

Chassis building blocks (what matters in hardware terms)

  • Line cards: high-speed ports + PHY/retimer/SerDes + forwarding pipeline + table/memory attachments (e.g., TCAM) + queues/counters.
  • Switch fabric: the internal interconnect that sustains aggregate throughput and defines fault domains (a fabric issue should not collapse the whole system).
  • Route processor / supervisor (SUP): computes and distributes executable forwarding state; should not destabilize steady-state forwarding during transient load.
  • Power & cooling: rail sequencing, margin, and thermal headroom are not “support functions”; they directly show up as link stability and error rates at 100G+.

The most practical way to avoid “architecture talk” is to anchor each plane to interfaces and observable signals. These signals become the fastest path to root cause when the system is stressed.

Plane | What it owns | What to observe (examples)
Data plane | Ingress/egress pipeline, lookup, ACL/QoS classification, queueing/scheduling, counters | Port errors (CRC/FEC), queue depth & drops, ECN marks, per-class throughput, fabric utilization
Control plane | Forwarding state computation and distribution; consistent programming to hardware tables | Table programming latency, update bursts, reconciliation errors, resource exhaustion signals
Management plane | Telemetry, alarms, system health, inventory/FRU, out-of-band access | Rail telemetry (PMBus), temperature hotspots, fan RPM, reset causes, fault logs aligned to dataplane counters

Boundary control (to stay in scope): this chapter defines the management plane only as a telemetry and fault-evidence path. Firmware supply chain, deep secure-boot flows, and controller implementation details belong to the dedicated BMC/security pages and are intentionally not expanded here.

Figure F2 — Chassis anatomy: ports → line cards → fabric (with control + management overlays)
The overlays show how “where to look” aligns with planes: dataplane counters explain forwarding symptoms, control-plane signals explain state programming, and management telemetry provides evidence for power/thermal correlation.
H2-3 · Forwarding Pipeline

Forwarding Pipeline: parse → lookup → ACL/QoS → queue/schedule → rewrite

A core/aggregation router earns its reliability through a deterministic hardware pipeline. Each stage converts “features” into resources (tables, memory bandwidth, queue space), and each stage exposes evidence signals that make debugging and validation repeatable.

What the pipeline must guarantee at scale

  • Line-rate progression: no hidden slow path for common traffic classes under sustained load.
  • Policy determinism: classification and actions (mark/police/shape) are consistent and measurable via counters.
  • Congestion is explainable: drops/ECN marks correlate to queue thresholds and scheduling decisions—not mystery behavior.

Engineering mapping: every pipeline stage has a “what it owns” and a “what proves it” view.

Stage | What it owns | Evidence signals | Typical failure patterns (symptoms)
Ingress MAC/PCS + FEC | Bit recovery, lane alignment, FEC correction envelope | CRC, lane errors, pre-FEC / post-FEC counters, CDR lock | Port-specific link flaps; throughput drops with rising FEC corrections; errors correlated with temperature or power headroom
Parser | Header extraction and feature vector generation for lookup/classification | Parse error counters, “unknown header” drops, exception flags | Only certain encapsulations fail; traffic silently drops when header formats exceed parser capabilities
Lookup | L2/L3/label keying, adjacency selection, next-hop resolution | Lookup miss counters, adjacency miss, per-path hit counters | Blackholes for specific prefixes/classes; asymmetric behavior due to incomplete state programming (without discussing protocol internals)
ACL/QoS classification (TCAM) | Masked match + priority resolution for ACL, policy routing, class mapping | ACL hit/miss, class counters, rule shadow/conflict indicators (if exposed) | Policy appears “random”; rules match unexpectedly due to ordering/shadowing; class distribution diverges from expectations
Queueing & congestion actions | Buffer allocation, drop/mark policy (WRED/ECN), per-class queues | Queue depth watermarks, drop counters, ECN mark counters, tail drops | Microburst loss at low average utilization; latency tail growth; drops concentrated in a subset of classes
Scheduler & egress rewrite | Per-class service order (Strict/WRR/DRR), shaping, header rewrite and egress framing | Per-class throughput, shaping counters, egress error counters | Priority starvation under strict scheduling; rate-limit anomalies; egress-specific errors after rewrite actions

Why TCAM sits in the classification stage

  • Masked match at wire speed: ACL/QoS rules often require wildcard fields and priority resolution.
  • Cost shows up as power/heat: wider keys and more entries increase silicon area and thermal load.
  • Debug requires counters: classification without reliable hit/miss evidence becomes non-actionable.
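The masked-match and priority-resolution behavior described above can be sketched as a toy software model (illustrative only; real TCAM resolves all entries in parallel in silicon, and the entry values here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class TcamEntry:
    value: int   # match pattern for the key bits
    mask: int    # 1-bits are "care" bits; 0-bits are wildcards
    action: str

def tcam_lookup(entries, key):
    """Return the action of the highest-priority matching entry.

    Priority is positional (lower index wins), which is why rule
    ordering and shadowing determine what actually matches.
    """
    for e in entries:
        if (key & e.mask) == (e.value & e.mask):
            return e.action
    return "default"

# Toy ACL: a fully-specified rule shadows a broader wildcard rule.
acl = [
    TcamEntry(value=0xC0A80001, mask=0xFFFFFFFF, action="drop"),
    TcamEntry(value=0xC0A80000, mask=0xFFFF0000, action="permit"),
]
```

Reordering the two entries would make the wildcard rule shadow the specific one, which is exactly the “policy appears random” failure pattern noted in the pipeline table.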

Where SRAM/HBM/DDR typically matter

  • SRAM: low-latency adjacency and fast counters; sensitive to access patterns.
  • HBM/DDR: larger buffers and extended statistics; capacity helps bursts, but bandwidth/latency shape behavior.
  • Queues are physics: congestion decisions depend on actual memory bandwidth and thresholds.
Figure F3 — Packet pipeline map (with observability probes)
The pipeline is intentionally hardware-centric: it links features to silicon resources and to evidence signals used in acceptance tests and rapid root-cause isolation.
H2-4 · Tables & Memory

Tables & Memory: FIB scale, TCAM sizing, counters, and update mechanics

Router performance and stability are bounded by tables and memory physics. “More rules” or “more routes” is not a software-only change: it increases table width, entry count, memory bandwidth, and heat. The practical design question is not “can the box forward,” but “can it forward and update state consistently without outages.”

Table families (what grows in real deployments)

  • FIB entries: executable forwarding prefixes (scale is driven by aggregation strategy and routing domains).
  • Adjacency / next-hop: per-path resolution, ECMP fan-out, and per-neighbor state needed by forwarding.
  • ACL / policy / QoS: masked-match rules for security boundaries, traffic classes, and service guarantees.
  • Label entries: additional forwarding keys for encapsulation labels (used only as a table-type example).

“Table = cost” (why it impacts BOM and thermals)

  • TCAM width: matching more fields per rule increases key width and power per lookup.
  • Entry count: larger policies consume banks/stages and raise baseline power.
  • Heat headroom: TCAM and external memory can become the dominant hotspot, constraining port density.
  • Update risk: high-rate updates can stress programming paths and create transient inconsistency windows.

A minimal but actionable sizing view is to separate match-heavy tables (TCAM) from capacity-heavy state (SRAM/HBM/DDR), then ensure the update mechanism preserves forwarding continuity.

Memory / block | Typical contents (router context) | Engineering tradeoff (why it matters)
TCAM | ACL rules, QoS classification keys, policy routing match sets | Fast masked match, but high power/heat; width and rule ordering affect capacity and predictability
SRAM | Adjacency caches, fast counters, low-latency state used in the hot path | Excellent latency; constrained capacity; access patterns can become bandwidth bottlenecks
HBM / external high-bandwidth | Large buffers, extended counters, scale-out forwarding state (platform dependent) | Capacity and throughput enable burst tolerance; thermal/power budgeting becomes a platform constraint
DDR (control/auxiliary) | Control-plane state, logs, extended telemetry, non-hot-path data | Large capacity; higher latency; not a substitute for deterministic hot-path resources

Update mechanics (hitless principle without vendor specifics)

  • Stage new state: program new entries into a shadow bank or unused region, while old state continues forwarding.
  • Switch atomically: flip a pointer/version selector so lookup uses the new table set.
  • Retire safely: reclaim old entries only after counters and reconciliation confirm continuity.

The objective is to avoid transient “holes” where neither old nor new entries match, and to keep evidence (counters/logs) aligned across the cutover.
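The stage → switch → retire model can be expressed as a minimal sketch (a vendor-neutral illustration with two banks and a version selector standing in for the hardware pointer flip; names are hypothetical):

```python
class ShadowTable:
    """Sketch of the hitless update model: stage -> switch -> retire.

    Two table banks; the active-bank selector plays the role of the
    hardware pointer/version flip. Lookups always see one complete,
    consistent table set -- never a mix of old and new entries.
    """
    def __init__(self, initial):
        self.banks = [dict(initial), {}]
        self.active = 0          # pointer/version selector
        self.staged = False

    def lookup(self, prefix):
        return self.banks[self.active].get(prefix)

    def stage(self, new_state):
        """1) Program new entries into the shadow bank; old state keeps forwarding."""
        self.banks[1 - self.active] = dict(new_state)
        self.staged = True

    def switch(self):
        """2) Atomic flip: lookups now use the new bank."""
        assert self.staged, "never flip to an unprogrammed bank"
        self.active = 1 - self.active
        self.staged = False

    def retire(self):
        """3) Reclaim the old bank only after counters confirm continuity."""
        self.banks[1 - self.active] = {}

t = ShadowTable({"10.0.0.0/8": "nh1"})
t.stage({"10.0.0.0/8": "nh2", "192.168.0.0/16": "nh3"})
old_view = t.lookup("10.0.0.0/8")   # still "nh1": no transient hole
t.switch()
new_view = t.lookup("10.0.0.0/8")   # now "nh2"
t.retire()
```

The key property is that no lookup ever lands between old and new state, which is the “no transient holes” objective stated above.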

Figure F4 — Table taxonomy & memory map (plus the update path)
The diagram separates match-heavy resources (TCAM) from capacity-heavy resources (buffers/state), then shows a vendor-agnostic hitless update model: stage → switch → retire with reconciliation evidence.
H2-5 · PHY & SerDes

100G/400G/800G PHY & SerDes: where retimers and gearboxes fit

High-speed ports are not “a single interface.” They are a chain from MAC/PCS/FEC to SerDes lanes, then through channel losses (connectors, backplane, traces) to the external medium (module or copper). Retimers and gearboxes exist to keep the chain inside a measurable margin envelope across temperature, aging, and load.


Port chain decomposition (what to isolate first)

  • MAC/PCS/FEC: exposes pre/post-FEC behavior and the correction headroom trend.
  • SerDes lanes: the “heart” of timing recovery and equalization; lane-level evidence matters.
  • PHY / retimer / gearbox: restores eye margin and/or maps lane structures to meet channel constraints.
  • Channel: insertion loss, reflections, and crosstalk accumulate across connectors and backplane paths.
  • External medium: treated as an endpoint load (details of module internals are out of scope here).

PAM4 vs NRZ (only the practical engineering delta)

  • PAM4: narrower noise/jitter margin; more sensitive to channel loss and reflections, so equalization and FEC headroom become critical.
  • NRZ: wider margin at lower symbol complexity; scaling to higher throughput pushes channel demands in a different direction.
  • Takeaway: what changes is not “difficulty,” but how fast margin is consumed by loss, jitter, and temperature.

When retimers/gearboxes become unavoidable (decision triggers)

  • Long electrical reach: extended traces, multiple connectors, or backplane paths push insertion loss beyond practical equalization.
  • Backplane + connector stacking: reflections and impedance discontinuities accumulate and shrink the eye quickly.
  • Margin collapses under heat/load: pre-FEC rises with temperature or high system power, even if link stays “up.”
  • Lane alignment issues: lane skew and intermittent CDR lock indicate the chain is operating too close to the edge.
  • Rate/lane mapping constraints: a gearbox may be needed to satisfy lane organization across ASIC ↔ channel ↔ endpoint assumptions.

How to read port health metrics (vendor-neutral intent): interpret trends as “margin consumption,” not only as pass/fail.

Metric | What it indicates | How it helps isolate the weak segment
pre-FEC error trend | Raw channel quality drift (loss/jitter/noise eating eye margin) | Rising pre-FEC with stable post-FEC suggests shrinking headroom; correlate with temperature and power loading
post-FEC errors | Correction envelope exceeded (uncorrectable events) | Signals a hard margin failure; prioritize channel/connector loss, reflections, and retimer placement review
FEC margin / corrected counts | How much “reserve” remains and how hard the system is working to stay clean | Explains “link up but performance unstable” behavior under thermal stress or aging
lane skew / alignment | Multi-lane timing divergence | Points to physical lane imbalance, routing asymmetry, or marginal clock recovery
CDR lock stability | Timing recovery robustness | Intermittent lock suggests jitter/return loss sensitivity; often worsens with temperature and power noise
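The “margin consumption, not pass/fail” reading can be sketched as a small classifier (thresholds are illustrative placeholders, not from any FEC specification; real limits come from the FEC in use):

```python
def classify_link_margin(pre_fec_ber, post_fec_errors, ber_floor=1e-6):
    """Turn raw counters into a margin-consumption verdict.

    pre_fec_ber: time-ordered samples of pre-FEC bit error ratio.
    post_fec_errors: count of uncorrectable (post-FEC) events.
    ber_floor and the 2x trend factor are illustrative thresholds.
    """
    if post_fec_errors > 0:
        return "margin failure"        # correction envelope exceeded
    first, latest = pre_fec_ber[0], pre_fec_ber[-1]
    if latest > ber_floor and latest > 2 * first:
        return "margin shrinking"      # rising pre-FEC with clean post-FEC
    return "healthy"
```

A “margin shrinking” verdict is the cue to correlate with temperature and power loading before the link ever goes down.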
Figure F5 — Port to module/copper chain (test points and risk points)
The figure keeps module internals out of scope and focuses on the electrical chain. Probes and risk points map directly to the metrics used in manufacturing and field troubleshooting.
H2-6 · Buffering & Congestion

Buffering & Congestion: deep buffers, microbursts, VOQ, and ECN boundaries

Core/aggregation routers often fail “in the small time scale”: microbursts can overflow queues even when average utilization looks safe. The practical goal is to make congestion behavior explainable: queue watermarks, drops, and ECN marks must align with a clear buffering model.

Why buffering is a first-order concern in core/aggregation

  • Fan-in aggregation: many ingress streams converge onto fewer egress ports.
  • Long feedback loops: congestion feedback takes time; queues can build before senders react.
  • Class isolation: QoS requires per-class behavior to be stable under bursts, not only at steady state.

Microburst mechanism (why “low average” still drops)

  • Step 1: simultaneous ingress arrivals (parallel flows).
  • Step 2: a fixed egress service rate (the bottleneck).
  • Step 3: queue depth spikes beyond available buffer → ECN marks and/or drops.
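The three-step mechanism can be demonstrated with a toy slot-based queue model (all numbers illustrative, in packets per slot; not a platform model):

```python
def simulate_egress_queue(arrivals, service_rate, buffer_limit, ecn_watermark):
    """Fluid model of a bottleneck egress queue.

    Per slot: add arrivals, drop overflow beyond the buffer, mark ECN
    above the watermark, then serve at the fixed egress rate.
    """
    depth, marks, drops, peak = 0, 0, 0, 0
    for a in arrivals:
        depth += a
        if depth > buffer_limit:          # step 3b: overflow -> drops
            drops += depth - buffer_limit
            depth = buffer_limit
        if depth > ecn_watermark:         # step 3a: mark before dropping
            marks += 1
        peak = max(peak, depth)
        depth = max(0, depth - service_rate)
    return {"peak_depth": peak, "ecn_marks": marks, "drops": drops}

# A single 50-packet burst; average load (5/slot) is half the service rate.
burst = [0, 0, 50, 0, 0, 0, 0, 0, 0, 0]
result = simulate_egress_queue(burst, service_rate=10,
                               buffer_limit=30, ecn_watermark=20)
```

Even though average utilization is 50%, the burst overflows the buffer in one slot, which is exactly why “low average” telemetry alone cannot rule out microburst loss.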

Architectural tools (router-centric, vendor-neutral)

  • Shared buffer: increases statistical multiplexing, but needs robust thresholds to avoid “buffer capture” by aggressive traffic classes.
  • VOQ (Virtual Output Queues): reduces head-of-line blocking and helps isolate contention, at the cost of more queue structures and planning.
  • Admission control: limits enqueue into congested regions earlier to prevent explosive queue growth.
  • ECN marking line: marks before dropping, at a defined queue watermark, making behavior measurable and tunable.

Delivery metrics should be reported as curves and percentiles, not single-point numbers.

Metric | How it is measured | What it proves (or disproves)
Throughput under bursty load | Step-load tests with mixed classes; observe stability of per-class rate | Whether the pipeline sustains line rate without hidden degradation under burst stress
Loss curve (drops vs load) | Increase offered load and burstiness; plot drops vs queue watermarks | Whether buffer thresholds are coherent; distinguishes microburst overflow from persistent congestion
Tail latency (p99/p99.9) | Measure per-class latency distribution while driving microbursts | Whether deep buffers cause unacceptable latency tails; correlates directly with queue depth
ECN marks | Count marks at defined watermarks during controlled congestion | Whether the system signals congestion early enough to reduce drops (without DC-specific PFC deep dive)
Figure F6 — Microburst & buffer model (VOQ + shared buffer + ECN mark line)
The model stays router-centric: it explains burst-driven queue spikes, shows VOQ/shared-buffer roles, and anchors ECN and drops to explicit watermarks (without diving into DC-specific RoCE/PFC tuning).
H2-7 · Clocking (inside the box)

Clocking inside the box: SerDes refclks, jitter, and why it shows up as errors

This chapter stays strictly at the high-speed link layer. Inside a router, SerDes reliability depends on how the reference clock (refclk) is generated and distributed to the switch/NP ASIC and any retimers. If jitter consumes the available margin, the first visible symptoms are usually CRC, rising pre-FEC, and occasionally link flaps.


Typical clock chain for SerDes (vendor-neutral)

  • Timebase (XO) + PLL: creates a stable refclk domain used by high-speed blocks.
  • Fanout / distribution: routes refclk to multiple destinations with controlled skew.
  • Destinations: switch/NP ASIC SerDes and retimers consume refclk and convert it into link timing.

The key is not “having a clock,” but keeping distribution asymmetry and additive jitter low enough that margin remains across temperature and load.

How to think about jitter budget (engineering view)

  • Budget = margin: the link has a finite tolerance window before the eye collapses.
  • Every stage spends budget: PLL noise, fanout additive jitter, coupling and power noise all reduce margin.
  • When budget is gone: eye closure → BER rises → FEC works harder → CRC/retries appear → link stability can degrade.
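The “every stage spends budget” idea can be made concrete with root-sum-of-squares accounting, which is the usual convention for uncorrelated random jitter (the numbers below are illustrative, not from any datasheet):

```python
import math

def remaining_jitter_budget(total_budget_ps, contributors_ps):
    """RSS jitter accounting for uncorrelated random contributors.

    Each stage (PLL noise, fanout additive jitter, power-noise
    coupling) spends part of the budget; what remains is the margin
    before the eye collapses.
    """
    spent = math.sqrt(sum(j * j for j in contributors_ps))
    return total_budget_ps - spent

# Illustrative: 5 ps RMS budget, PLL + fanout contributors.
margin = remaining_jitter_budget(5.0, [1.0, 2.0])
```

Note the RSS consequence: adding a 1 ps contributor to an already-tight chain costs far less than 1 ps of margin, but once the budget is near zero, any small addition pushes the link over the edge.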

Symptom mapping (what clock issues usually look like)

  • Intermittent CRC / retries: margin is thinning but not fully collapsed; often correlates with temperature or peak load.
  • Rising pre-FEC (stable post-FEC): the channel is getting noisier; FEC headroom is being consumed.
  • Link flaps / unstable lock: the system has crossed a robustness boundary; clock distribution and power noise become prime suspects.
  • Only some ports are sensitive: points to distribution skew/asymmetry or board-level coupling, not a uniform “network configuration” issue.
Figure F7 — SerDes clock tree and the jitter → error chain
The diagram is strictly “inside the box” and links clock distribution quality to measurable port symptoms without turning into a PTP/SyncE timing chapter.
H2-8 · Power & PMBus

Power architecture & PMBus telemetry: multi-rail VRMs, sequencing, and brownout-proofing

A core/aggregation router is a multi-rail power system. Many failures do not present as an immediate reboot: marginal rails often show up first as errors, link flaps, or unstable performance under heat and load. PMBus-enabled digital power provides the evidence trail: V/I/P/T telemetry plus fault flags that can be aligned to symptom timestamps.


Typical rails (and what tends to be most sensitive)

  • Core rail: severe droop tends to trigger system-level resets or reboots.
  • SerDes/retimer rails: noise or droop commonly appears as BER/FEC stress, CRC, or link instability.
  • Memory rails (DDR/HBM): marginal power can manifest as instability under sustained load and thermal stress.
  • TCAM rail: power/thermal headroom affects deterministic behavior when table activity is high.
  • I/O + auxiliary rails: management and side functions; failures can look “random” without telemetry correlation.

PMBus telemetry (what to capture for soft-fault localization)

  • Vitals: voltage, current, power, temperature (V/I/P/T) per rail.
  • Status: UV/OV/OC/OT and latched fault codes (even short events matter).
  • Correlation: align telemetry timestamps to CRC spikes, link flaps, or reboot events.
  • Trend view: rising power/temperature plus shrinking voltage margin is a common precursor to field issues.
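The correlation step can be sketched as a timestamp-window join between symptom events and latched rail faults (the data structures and rail names here are hypothetical; real inputs come from PMBus status logs and dataplane counters):

```python
def correlate_faults(symptom_ts, rail_events, window_s=2.0):
    """Map each symptom timestamp to rails with a fault in the window.

    symptom_ts: timestamps of CRC spikes / link flaps / reboots.
    rail_events: (rail_name, event_ts, fault_flag) tuples from PMBus logs.
    window_s: alignment tolerance between the two clock sources.
    """
    hits = {}
    for ts in symptom_ts:
        hits[ts] = [(rail, flag) for (rail, ev_ts, flag) in rail_events
                    if abs(ev_ts - ts) <= window_s]
    return hits

# Hypothetical example: a SerDes-rail UV event near a CRC spike.
events = [("VDD_SERDES", 100.5, "UV"), ("VDD_CORE", 300.0, "OC")]
hits = correlate_faults([100.0, 200.0], events)
```

A symptom with no rail hit is just as informative: it redirects attention from power to channel, thermal, or clocking causes.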

Sequencing / reset and brownout-proofing (principles + verification points)

  • Sequencing: ensure rails that gate stable I/O and high-speed blocks reach regulation before dependent domains are released.
  • PG and debounce: PG thresholds and deglitching prevent false resets while still catching real droops.
  • Brownout behavior: define whether the platform should degrade, alarm, or reset under short undervoltage events.
  • Margining (acceptance test): controlled, small voltage shifts should not immediately cause CRC/link instability if margin is healthy.
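The margining acceptance point can be stated as a tiny pass/fail rule (the ±3% window and the input shape are illustrative assumptions, not a standard requirement):

```python
def margining_verdict(test_points, limit_pct=3.0):
    """Voltage-margining acceptance sketch.

    test_points: (offset_pct, crc_errors) pairs from a controlled sweep.
    Within +/-limit_pct of nominal, a healthy link should show zero CRC
    errors; errors inside the window mean margin is already consumed.
    """
    in_window = [(off, crc) for off, crc in test_points
                 if abs(off) <= limit_pct]
    return {
        "pass": all(crc == 0 for _, crc in in_window),
        "worst_in_window": max((crc for _, crc in in_window), default=0),
    }

healthy = margining_verdict([(-3.0, 0), (0.0, 0), (3.0, 0), (5.0, 12)])
marginal = margining_verdict([(-3.0, 4), (0.0, 0), (3.0, 0)])
```

Errors outside the window (the 5% point above) are acceptable evidence of where margin ends; errors inside it fail the unit.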

The scope stays inside the box: multi-rail VRMs and PMBus evidence. 48 V hot-swap/eFuse design and PoE power are intentionally out of scope.

Fault | Where it often occurs | PMBus evidence | Common symptom
UV | SerDes/retimer or core rails under burst load | Undervoltage flag + droop trend in V/I/P | CRC spikes, rising pre-FEC; in severe cases link flap or reboot
OC | VRM current limits during peak traffic/thermal events | Overcurrent flag + current plateau | Sudden instability when load peaks; errors that correlate with traffic bursts
OT | Hotspots near TCAM/memory/VRM zones | Temperature rise + OT flag or throttling indicators | Performance degradation, link instability in heat-soak conditions
Figure F8 — Power rails & PMBus map (faults → symptoms)
The diagram focuses on multi-rail VRMs, PMBus evidence, and symptom mapping. 48 V front-end hot-swap/eFuse and PoE power topics are intentionally excluded.
H2-9 · Thermal & Mechanical

Thermal & mechanical reality: airflow, hotspots, and derating the right things

In a core/aggregation router, temperature problems rarely look like “a simple overheat.” Hotspots can first appear as port instability, higher error rates, or performance drift under heat-soak. The practical job is to build a closed loop: airflow path → hotspot map → sensors → fan curve → derating triggers.


Where hotspots come from (and why density matters)

  • ASIC: high power + adjacent SerDes blocks; heat reduces link margin.
  • TCAM: activity and temperature can compound power; sustained load creates hot plates.
  • Retimers: often clustered near dense ports; local heating can make “one row of ports” more sensitive.
  • VRMs: conversion losses concentrate near large loads; temperature reduces efficiency and margin.

Port density increases thermal density. That is why thermal issues often show spatial correlation (specific cards/port banks).

Sensor placement (evidence, not averages)

  • Inlet: boundary condition for the chassis (what the cooling system actually gets).
  • Outlet: measures airflow effectiveness (ΔT across the box).
  • Near hotspots: ASIC, TCAM, retimer banks, VRM zones (where failures start).
  • Correlation rule: hotspot sensors must explain error spikes better than “average chassis temperature.”

Fan curves and derating (do the least harmful thing first)

  • Fan curve goal: keep hotspot sensors below thresholds while avoiding permanent high RPM (noise and wear).
  • Heat-soak vs transient: validate both steady-state and burst-driven heating (traffic + power spikes).
  • Derate the right thing: protect link stability first (reduce port rate/active ports), then protect throughput, then optimize acoustics and lifetime.
  • Closed-loop triggers: tie derating to hotspot sensors and clear recovery conditions (avoid oscillation).
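The closed-loop trigger with anti-oscillation hysteresis can be sketched in a few lines (the temperature thresholds and the linear fan curve are illustrative, not a platform tuning):

```python
def fan_and_derate(hotspot_c, derating, derate_on=95.0, derate_off=85.0):
    """One control step: hotspot temperature -> fan duty + derating state.

    The derate_on/derate_off split is the hysteresis band: once derating
    engages, it only clears well below the trip point, preventing the
    enter/exit oscillation the text warns about.
    """
    # Illustrative linear fan curve, clamped to 30..100% duty.
    duty = min(100.0, max(30.0, 30.0 + (hotspot_c - 40.0) * 1.4))
    if derating:
        derating = hotspot_c > derate_off   # clear only below recovery point
    else:
        derating = hotspot_c >= derate_on   # engage at the trip point
    return duty, derating
```

Feeding this the hotspot sensors (not the chassis average) is what makes the loop actionable, per the sensor-placement rule above.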
Engineering action | What it improves | What it prevents
Improve conduction path (heatsink/contact/thermal interface) | Reduces hotspot peak temperature and gradient | Local margin collapse that shows up as errors or instability
Airflow shaping (baffles/flow guides/recirculation control) | Delivers air where it matters (ASIC/retimer/VRM zones) | “Cool inlet, hot hotspot” false confidence
Derating (ports/rate/power cap) | Preserves system stability under thermal stress | Error storms, link flaps, cascading faults triggered by heat
Figure F9 — Thermal map block diagram (airflow + hotspots + sensors + fan control loop)
The diagram stays inside the chassis. It does not cover site-level environmental monitoring (door/water/smoke), which belongs to the Site Env & Power Monitor page.
H2-10 · Reliability & HA

Reliability & high availability: redundancy, hitless upgrades, watchdogs, and fault domains

High availability is an engineering contract, not a slogan. It is built from redundancy domains, clear fault-domain boundaries, and verifiable criteria for failover and upgrades. A correct design proves that a single component failure does not cascade into chassis-wide outage.


Redundancy domains (concept + verification intent)

  • PSU: loss of one supply should not drop forwarding stability or trigger mass port errors.
  • Fans: loss of one fan should keep hotspots under control without immediate instability.
  • Dual SUP / route processor: failover should preserve forwarding with bounded control-plane impact.
  • Fabric redundancy: loss of a fabric element should not create a chassis-wide outage.
  • Line-card isolation: a failing card should be fenced off without dragging the whole box.

Hitless / ISSU requirements (criteria, not vendor features)

  • State continuity: the system must preserve the minimum state required for stable forwarding.
  • Table consistency: FIB/ACL/QoS programming must remain coherent across the transition.
  • Failover budget: there is a finite time window for switchover before traffic impact is visible.
  • Rollback safety: failed upgrade paths must return to a known-good state without widening the blast radius.
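The failover-budget criterion is directly measurable from a steady probe stream: the worst inter-arrival gap during switchover must stay inside the budget. A minimal sketch (the 50 ms default is an illustrative budget, not a standard requirement):

```python
def failover_impact_ms(rx_timestamps_ms, budget_ms=50.0):
    """Worst receive gap in a constant-rate probe stream vs the budget.

    rx_timestamps_ms: arrival times of probe packets spanning the
    switchover event. Any gap beyond the normal inter-packet spacing
    is traffic impact attributable to the failover.
    """
    gaps = [b - a for a, b in zip(rx_timestamps_ms, rx_timestamps_ms[1:])]
    worst = max(gaps, default=0.0)
    return {"worst_gap_ms": worst, "within_budget": worst <= budget_ms}

# 10 ms probe spacing; one 65 ms gap during the switchover window.
verdict = failover_impact_ms([0, 10, 20, 30, 95, 105])
```

Reporting the worst gap (rather than total loss) keeps the result comparable across probe rates and directly checks the bounded-impact contract.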

Fault domains and proof (telemetry + logs as evidence)

  • Fault domain goal: keep failures local (line card, fabric, SUP, or power/thermal domains).
  • Watchdogs & health monitors: detect abnormal behavior early and trigger domain-level containment actions.
  • Proof rule: telemetry and logs must show that only the affected domain degraded while other domains remained stable.
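The proof rule lends itself to a direct check. A hypothetical sketch, assuming per-domain error-counter snapshots taken before and after a drill (domain names and counters are illustrative):

```python
# Illustrative check of the proof rule: after a fault drill, only the
# affected domain's error counters may move beyond tolerance.

def domain_contained(before, after, affected, tolerance=0):
    """True if every domain other than `affected` stayed within tolerance."""
    for domain in before:
        delta = after[domain] - before[domain]
        if domain != affected and delta > tolerance:
            return False
    return True

before = {"lc1": 10, "lc2": 4,   "fabric_a": 0, "sup": 1}
after  = {"lc1": 10, "lc2": 812, "fabric_a": 0, "sup": 1}   # lc2 was the drill target
contained = domain_contained(before, after, affected="lc2")
```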
| Acceptance check | What to observe | What it proves |
| --- | --- | --- |
| PSU pull test | No chassis-wide error storm; telemetry shows clean takeover | Power redundancy without cascading instability |
| Fan failure test | Hotspot temperatures remain controlled; no sudden link flaps | Thermal domain resilience and correct fan curve triggers |
| SUP failover | Bounded transition; forwarding stability remains within budget | Control-plane redundancy and switchover robustness |
| Fabric element isolation | Traffic impact remains localized; alarms identify the correct domain | Fabric redundancy and fault-domain containment |
| Line-card isolation | Affected card fenced; rest of chassis stays stable | Blast-radius control and domain boundaries |
Figure F10 — Fault-domain diagram (redundancy components and isolation boundaries)
Fault domains — redundancy and isolation boundaries: a chassis with PSU A/B, fan tray, SUP A (active, control-plane services) and SUP B (standby, failover ready), redundant Fabric A/B paths, and per-card line-card fault domains (Line Card 1…N), with SPOF markers flagging single-point risks. Proof rule: failover and upgrades must be demonstrated by bounded impact plus domain-local telemetry/log evidence.
The diagram highlights redundancy domains and explicit fault-domain boundaries. Security boot/TPM/HSM topics are intentionally excluded and belong to dedicated security/BMC pages.

H2-11 · Validation & Bring-up Checklist: what proves it’s done (and debugs fast)

This chapter turns the router’s core hardware risks—high-speed links, forwarding behavior under congestion, table scale/updates, power/thermal stability, and redundancy—into a repeatable validation script. Each step defines: what to run, what to log, pass/fail criteria, and the next measurement when it fails.

Principle: verify power + telemetry health first, then links, then forwarding, then tables + updates, then redundancy, then soak. This prevents “electrical issues” from being misdiagnosed as “network issues”.

Figure F11 — Bring-up flow (evidence-driven)
Bring-up state flow: POWER-UP (reset + sequencing ok) → PMBUS HEALTH (UV/OC/OT = 0) → PORT PRBS (pre/post-FEC, CRC; on errors, check refclk, retimer, rails, noise) → FORWARD PERF (throughput + drop curves) → TABLE STRESS (scale + update consistency) → REDUNDANCY (failover budget ok) → SOAK TEST (thermal steady-state + logs). Evidence bundle: CSV counters + screenshots, PMBus fault logs + curves.

A) Step-by-step bring-up script (inputs → outputs)

Goal: convert system validation into a deterministic flow with unambiguous evidence artifacts.

  • Step 0 — Power-up: confirm sequencing, resets, and stable boot to a known state.
  • Step 1 — PMBus health: verify UV/OC/OT and fault logs are clean before any high-speed tests.
  • Step 2 — Port PRBS: PRBS + loopbacks; capture pre/post-FEC, CRC/FEC counters, lane skew, CDR lock.
  • Step 3 — Forwarding: line-rate throughput vs packet size; drop curves; queue depth/ECN observations.
  • Step 4 — Table stress: fill FIB/ACL/QoS tables; test update rate + hitless update behavior (criteria-only).
  • Step 5 — Redundancy drills: pull PSU/fan/SUP/line-card (where applicable); measure failover budget and fault containment.
  • Step 6 — Soak: long-run at thermal steady-state; verify error counters remain flat and logs stay clean.
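The gated ordering of these steps can be sketched as a small runner that stops at the first failing stage, so later symptoms are never collected on top of an earlier fault. Stage checks below are placeholders for real instrument/telemetry reads:

```python
# Minimal sketch of the gated bring-up order: each stage must pass before
# the next runs, so electrical faults are never misread as network faults.

def run_bringup(stages):
    """Run (name, check_fn) pairs in order; return (passed_stages, failed_stage)."""
    passed = []
    for name, check in stages:
        if not check():
            return passed, name        # stop: later stages would be misleading
        passed.append(name)
    return passed, None

stages = [
    ("power_up",     lambda: True),
    ("pmbus_health", lambda: True),
    ("port_prbs",    lambda: False),   # e.g. post-FEC errors still climbing
    ("forwarding",   lambda: True),    # never reached in this run
]
passed, failed = run_bringup(stages)
```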

B) Pass/Fail criteria (make “done” measurable)

Goal: avoid “it looks stable” by using counters, curves, and logs as acceptance truth.

  • Power: no persistent PMBus faults; no brownout signatures; rails remain within margin test envelopes.
  • Links: post-FEC error-free over the agreed window; stable CDR lock; CRC/FEC counters do not drift.
  • Forwarding: expected throughput for target packet sizes; drop curve matches buffer/ECN intent.
  • Tables: target scale achieved; updates do not break hit counters, do not trigger unexpected drops/reboots.
  • HA: failover stays within the outage budget; single fault remains in its fault domain.
  • Thermal: steady-state temps within limits; no thermally-correlated error bursts (CRC/link flap).
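“Counters do not drift” becomes testable once a sampling window and an agreed increase limit are fixed. A hedged sketch with illustrative samples:

```python
# Turn "counters do not drift" into a rule: sample a cumulative error counter
# over the soak window and bound its total increase. Limits are per-agreement.

def counter_flat(samples, max_increase=0):
    """True if a cumulative counter rises by no more than max_increase."""
    return (samples[-1] - samples[0]) <= max_increase and \
           all(b >= a for a, b in zip(samples, samples[1:]))  # sanity: monotonic

crc_samples = [120, 120, 120, 121, 121]    # one stray CRC over the window
flat_strict = counter_flat(crc_samples)                   # zero-increase rule
flat_agreed = counter_flat(crc_samples, max_increase=5)   # agreed envelope
```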

C) Acceptance checklist (copy/paste friendly)

| Area | What to run | What to log | Pass signal |
| --- | --- | --- | --- |
| PMBus / rails | Read telemetry, clear/verify fault logs, run margin test (safe envelope) | VIN/VOUT/IOUT/Temp/FaultCode | No UV/OC/OT; no recurring fault entries |
| Port PRBS | PRBS on target speeds; host/line loopback; lane mapping sanity | pre-FEC BER, post-FEC, CRC/FEC | post-FEC stays clean; counters flat |
| Forwarding perf | Line-rate tests (varied pkt sizes); microburst patterns | throughput, drop, queue depth, ECN | Expected curves; no unexplained tail spikes |
| Table scale | Fill FIB/ACL/QoS to targets; hit tests; update burst tests | hit/miss, update rate, error counters | Scale achieved; updates do not destabilize |
| Redundancy drills | PSU/fan/SUP/fabric/line-card fail (as supported) | failover time, service impact, logs | Within budget; fault contained |
| Soak | Long-run at nominal + elevated load (safe) with stable airflow | temp traces + link/error counters | No thermally-driven error bursts |

D) Fast debug map (symptom → first measurements → next isolation)

| Symptom | First measurements (must be logged) | Next isolation step |
| --- | --- | --- |
| Random CRC bursts | CRC/FEC counters, CDR lock, refclk alarms, PMBus rail ripple events | Swap port ↔ port; check refclk distribution; correlate with rail margin/temperature; bypass retimer path (if possible) |
| Link flap on certain ports | lane skew, pre-FEC BER, training status, module/cable changes, thermal sensors | Control environment: fixed airflow; retimer/connector path check; validate insertion loss suspects by pattern |
| Good PRBS, bad traffic | queue depth, drop reason, ECN marks, scheduler stats | Reproduce with microburst; verify admission/VOQ behavior; check counter alignment across ingress/egress |
| Scale works, updates break | update rate, hit counter consistency, control-plane CPU health, error logs | Reduce update burst; verify atomicity expectations; locate the stage where inconsistency starts (lookup vs counters) |
| Thermal-triggered errors | hotspot temps, fan PWM/RPM, PMBus temps on VRMs, error counters vs time | Pinpoint hotspot rail/device; adjust fan curve; validate derating actions (rate/ports) and re-run soak |
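The common thread in this map is time-aligned evidence. A minimal correlation sketch, assuming burst and telemetry timestamps share one clock (event names are hypothetical):

```python
# Sketch of the time-aligned evidence trail: for each CRC-burst timestamp,
# find rail/thermal telemetry events within a correlation window.

def correlate(bursts, events, window_s=1.0):
    """Map each burst timestamp to telemetry events within ±window_s."""
    return {b: [e for (t, e) in events if abs(t - b) <= window_s] for b in bursts}

crc_bursts = [100.2, 250.7]                     # seconds since test start
telemetry  = [(100.0, "vrm_ripple"),
              (180.0, "fan_step"),
              (250.5, "hotspot_peak")]
matches = correlate(crc_bursts, telemetry)
```

A burst with no telemetry match is itself a finding: it points at a gap in sampling rate or sensor placement rather than at a known contributor.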

E) Delivery artifacts (what “done” must include)

  • Bring-up record: the flow in Figure F11 with timestamps and outcomes for each step.
  • Port/link report: PRBS configs, pre/post-FEC, CRC/FEC counters, lane mapping notes, key screenshots.
  • Forwarding curves: throughput vs packet size; drop curves; queue/ECN traces (CSV + plots).
  • Table scale & update: achieved scale, update rate tests, any exception logs, counter consistency checks.
  • Power/thermal evidence: PMBus fault logs (clean), rail telemetry snapshots, steady-state thermal traces.
  • Redundancy drills: failover timing, fault-domain containment proof, “single point of failure” notes.

F) Tools & Silicon References (Part Numbers) — for faster bring-up

The part numbers below are practical reference anchors for building observability and accelerating fault isolation (they are not endorsements, and availability varies by region/BOM strategy).

High-speed link conditioning (retimers / equalizers)

  • Marvell MV-CHA180C0C — 800G PAM4 DSP retimer (monitoring + PRBS/loopback capabilities are useful for isolation).
  • TI DS280DF810 — 28Gbps 8-channel retimer (board-level reach/robustness reference for mid-rate SerDes paths).
  • Semtech GN8112 — 112G PAM4 quad linear equalizer (useful as a “lossy copper reach” reference block).

Refclks / jitter control (when errors look “random”)

  • ADI AD9545 — dual-DPLL clock synchronizer/jitter cleaner class reference for PHY clocks.
  • Skyworks (SiLabs) Si5341 — ultra-low jitter clock generator family reference (multi-output clock trees).
  • ADI HMC7044 — jitter attenuator class reference (high-performance clock distribution use cases).

PMBus / rail telemetry (turn power into evidence)

  • ADI LTC2977 — 8-channel PMBus power system manager (sequencing + telemetry + fault logs).
  • Infineon XDPE192C4C-0000 — digital multiphase controller class (PMBus-managed VR rails).
  • TI INA238 — current/voltage/power monitor (alert-driven “rail anomaly” capture).

Thermal sensing & fan control (prevent heat-driven errors)

  • TI TMP117 — high-accuracy digital temperature sensor reference (hotspot mapping).
  • Microchip EMC2305 — SMBus fan controller (multi-fan curves + closed-loop options).

Lab bring-up instruments (models)

  • Anritsu MP1900A — modular high-performance BERT platform (PHY-layer validation workflows).
  • Keysight M8040A — high-performance BERT (PAM4/NRZ coverage for high-speed links).
  • Keysight IxNetwork — traffic generation + measurement suite (1G→800G class traffic scenarios).

What to record with these tools

  • Retimers/equalizers: PRBS settings, loopback mode, SNR/eye stats (if available), error counters.
  • Clock devices: ref selection events, holdover transitions, jitter alarms (if available), correlation to CRC bursts.
  • PMBus managers: fault logs (timestamped), margin test results, rail telemetry snapshots under load.
  • Thermal/fans: hotspot temps vs time, fan PWM/RPM vs time, error counters vs time.
  • BERT/traffic: test pattern, duration, link settings, counters/curves, and DUT configuration hash.
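The “DUT configuration hash” mentioned above can be produced by canonicalizing the configuration before hashing, so reruns with identical settings compare equal regardless of key order. The keys below are illustrative:

```python
# Canonicalize-then-hash: sort keys and fix separators so semantically equal
# configurations always yield the same hash across reruns and tools.

import hashlib
import json

def config_hash(config: dict) -> str:
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

cfg_a = {"pattern": "PRBS31", "speed": "400G", "fec": "RS544"}
cfg_b = {"speed": "400G", "fec": "RS544", "pattern": "PRBS31"}  # same config, reordered
```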


H2-12 · FAQs (Core / Aggregation Router)

Each answer stays within this page’s scope: forwarding pipeline, tables/TCAM, high-speed ports, buffering/ECN, in-box clocking, PMBus power telemetry, thermal reality, HA/ISSU principles, and acceptance evidence.

1) Core router vs data center switch — what is the practical boundary?

The boundary is usually proven by scale + services of the forwarding plane, not by port speed. Core/aggregation routers prioritize large L3/MPLS-class table sets, strict fault domains, serviceable chassis design (line cards + fabric), and predictable behavior during updates and failovers. A data center switch is often optimized for a different buffer model, different table mix, and different operational envelope.

H2-1/2 · table mix · chassis HA
2) Why does a router need TCAM, and how should it be sized?

TCAM is commonly used where multi-field matches must be fast and deterministic (ACLs, classification, policy matches) inside the forwarding pipeline. Sizing starts from the rule families (ACL/QoS/policy), then applies width reality (fields expand entries), growth margin, and redundancy requirements. Validation is practical: fill to target scale and confirm hit behavior, counter stability, and power/thermal headroom.

H2-3/4 · TCAM width vs entries · validation
3) FIB scale vs update rate — what breaks first in real deployments?

Three things often fail before “raw lookup” fails: (1) the update path (control-to-ASIC programming) becomes bursty and creates transient inconsistency, (2) telemetry/counter coherence drifts so the box “forwards” but evidence becomes unreliable, or (3) hitless conditions are violated during redundancy events (failover/ISSU) because state and tables cannot remain aligned under update pressure.

H2-4/10/11 · update mechanics · hitless criteria
4) When do 400G/800G designs require retimers or gearboxes?

Retimers become necessary when the channel (backplane + connectors + traces + cable) consumes too much margin: insertion loss, reflections, or temperature sensitivity causes unstable training, rising pre-FEC errors, or frequent CDR unlocks. Gearboxes typically appear when lane rates or electrical interfaces must be adapted while preserving link integrity. The decision should be evidence-based: PRBS results, margin trends, and repeatable failure location along the port chain.

H2-5 · margin evidence · port chain
5) Pre-FEC looks fine but post-FEC worsens — what does it indicate?

This often points to a measurement mismatch (different windows, different sampling points), or to bursty/clustered errors that stress correction more than average BER suggests. It can also indicate a hidden contributor that changes with time—refclk jitter events, rail noise, or thermal drift—creating short error storms. The fix is to time-align counters, correlate with clock/power/temperature telemetry, and reproduce under controlled load and airflow.

H2-5/7/11 · time alignment · correlation
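The clustering effect can be illustrated numerically: RS(544,514) corrects up to 15 symbol errors per codeword, so the same average error count is harmless when spread but uncorrectable when concentrated. The error distributions below are synthetic:

```python
# Toy illustration of why bursty errors hurt post-FEC more than the same
# average BER spread evenly. RS(544,514) corrects up to 15 symbol errors per
# codeword; 20 errors in one codeword are fatal, 20 spread across 20
# codewords are all corrected.

T_CORRECTABLE = 15   # symbol-error correction limit of RS(544,514)

def uncorrectable_codewords(errors_per_codeword):
    """Count codewords whose symbol-error count exceeds the FEC limit."""
    return sum(1 for e in errors_per_codeword if e > T_CORRECTABLE)

spread    = [1] * 20                 # 20 errors across 20 codewords
clustered = [20] + [0] * 19          # same 20 errors in one codeword
```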
6) “Microburst drops” with low average utilization — how can it be proven?

Average utilization hides microbursts because the queue can overflow in microseconds while the long-term average stays low. Proof requires high time-resolution evidence: queue depth samples (or watermark), drop counters with timestamps, and a bursty traffic pattern that reproduces the symptom. The clean demonstration shows multiple ingress flows converging to one egress queue, a queue spike, and drops/ECN marks occurring at the same moment.

H2-6/11 · queue depth · burst reproduction
7) Deep buffers reduce loss but increase latency — how should ECN/thresholds be set?

The practical goal is to keep loss low without creating unacceptable tail latency. ECN marking thresholds should be placed below the hard-drop line so congestion feedback happens before queues saturate. Thresholds are tuned using measured curves: queue depth vs time, drop probability vs load, and latency percentiles under bursty traffic. Different service classes can use different thresholds, but the evidence must remain consistent and repeatable.

H2-6 · ECN mark line · tail latency
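The “mark below the drop line” intent maps onto a RED-style marking ramp. A sketch with placeholder thresholds, to be replaced by the measured curves described above:

```python
# RED-style ECN ramp: no marking below min_th, linear ramp to max_p at
# max_th, capped above. Hard drop sits separately, above the mark line.

def ecn_mark_probability(depth, min_th, max_th, max_p=1.0):
    """Marking probability as a function of instantaneous queue depth."""
    if depth <= min_th:
        return 0.0
    if depth >= max_th:
        return max_p
    return max_p * (depth - min_th) / (max_th - min_th)

p_low  = ecn_mark_probability(10, min_th=20, max_th=80)   # below mark line
p_mid  = ecn_mark_probability(50, min_th=20, max_th=80)   # mid-ramp
p_high = ecn_mark_probability(90, min_th=20, max_th=80)   # saturated
```

Placing `max_th` well below the hard-drop depth gives senders congestion feedback before the queue saturates, which is the behavior the latency-percentile curves should confirm.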
8) Link flaps only on certain ports — how to isolate clocking vs power vs thermal?

Start with spatial correlation: do failures cluster by port bank, line card, or proximity to retimers/VRMs? Then isolate in a strict order: fix airflow and ambient, lock refclk source/distribution, and correlate error bursts with PMBus rail telemetry and hotspot temperatures. Swap ports/modules/cables to test locality, and use loopbacks to separate “inside-the-box” from “outside” contributors. The winning approach is a time-aligned evidence trail.

H2-7/8/9/11 · spatial correlation · time-aligned evidence
9) PMBus shows no fault, yet ASIC errors rise — what telemetry is missing?

“No PMBus fault” usually means no threshold was crossed, not that power is perfect. Many failures come from fast transients, localized droop, or rail noise that is too brief or too local to be captured by coarse telemetry. Missing pieces often include higher-rate rail sampling/event tagging, margin-test logs under load, hotspot temperature sensors near sensitive blocks, and tightly time-aligned port/ASIC error counters. Without these, correlation is slow and MTTR increases.

H2-8/11 · telemetry gaps · margin evidence
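One of the missing pieces, fast-droop capture, can be approximated by scanning high-rate rail samples for short excursions below nominal minus margin. All values below are synthetic:

```python
# Illustrative droop detector: coarse PMBus polling misses events shorter
# than its poll interval, so scan fast samples for sub-floor excursions.

def droop_events(samples_mv, nominal_mv, margin_mv, min_len=1):
    """Return (start_index, length) runs where the rail dips below floor."""
    events, start = [], None
    floor = nominal_mv - margin_mv
    for i, v in enumerate(samples_mv + [nominal_mv]):   # sentinel closes a run
        if v < floor and start is None:
            start = i
        elif v >= floor and start is not None:
            if i - start >= min_len:
                events.append((start, i - start))
            start = None
    return events

rail = [750, 749, 702, 698, 748, 750]    # brief droop a slow poller would miss
events = droop_events(rail, nominal_mv=750, margin_mv=30)
```

Timestamping these events with the same clock as the port error counters is what turns “PMBus shows no fault” into an answerable correlation question.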
10) How to validate hitless failover/ISSU without service impact?

Hitless validation is a criteria test: forwarding continuity within a budget, stable counters/queues, and controlled fault domains. A safe method uses a staged run: freeze large table updates, run a representative traffic profile, then trigger planned failover/ISSU while recording loss/latency and key counters. The result must show bounded impact and clean recovery, with logs proving the event stayed inside its redundancy domain rather than cascading across the chassis.

H2-10/11 · budgeted impact · domain containment
11) Which counters/logs are “must-have” to reduce MTTR?

Must-have evidence spans three layers: (1) PHY/port (CRC, FEC, lane status, CDR/training), (2) pipeline/queues (drop reason, queue depth/watermarks, ECN marks, scheduler stats), and (3) system domains (PMBus fault logs, rail/temperature snapshots, reset causes, redundancy events). The key is time alignment across layers, so symptoms can be correlated to a specific domain (port bank, rail, thermal zone, or failover event).

H2-3/8/10/11 · three-layer observability · time alignment
12) What is a minimal acceptance test pack for a new chassis or line card?

A minimal pack should still be evidence-complete: (a) PMBus health + margin snapshots, (b) port PRBS/loopback results with pre/post-FEC and CRC/FEC counters, (c) forwarding throughput + drop/ECN curves under bursty patterns, (d) table scale + update consistency tests, (e) redundancy drills within budget, and (f) soak test with thermal steady-state. Each item must produce artifacts (CSV + screenshots + logs) so results are reproducible and auditable.

H2-11 · evidence pack · reproducible