
Core / Aggregation Router Architecture (100G–800G)


A core/aggregation router is defined by how reliably it forwards at massive scale: large tables (FIB/ACL/TCAM), high-speed 100G/400G/800G ports with measurable link margin, predictable buffering/ECN behavior, and chassis-grade power/thermal/HA that can be proven by counters, telemetry, and a repeatable acceptance test pack.

H2-1 · Boundary & Definition

What a Core/Aggregation Router Is (and What It Is Not)

A core/aggregation router is a system engineered for scale, determinism, and service continuity: it must forward large traffic volumes at sustained line rate while holding large forwarding and policy state (FIB, ACL/QoS classification) and surviving faults (component redundancy, fast failover, in-service maintenance).

Engineering boundary (the “why this product exists” test)

  • Scale pressure: large FIB/policy tables and high port density push memory, power, and cooling constraints into the system design.
  • Deterministic forwarding: strict QoS classification, queueing/scheduling, and predictable congestion behavior are first-order requirements.
  • Operational continuity: redundancy, fault-domain isolation, and maintenance without long outages are part of the product definition.

Core vs Aggregation: the difference is practical, not marketing—each role stresses a different failure mode envelope.

Dimension | Core Router (Backbone) | Aggregation Router (Fan-in / Metro)
Primary stress | Sustained throughput, fabric resilience, fault-domain containment, hitless maintenance targets | Bursty fan-in, microbursts, queue behavior, classification/policy density, per-service isolation
Table reality | Large FIB with strict consistency expectations; high confidence in steady-state forwarding | Heavy policy and QoS classification load; “what gets queued/dropped” matters as much as “where it goes”
What breaks first | Thermal/power headroom, fabric bandwidth margin, upgrade/failover envelopes | Queue thresholds, shared buffer contention, policing/shaping corner cases, burst-loss vs latency tradeoffs

Not this page: this topic is not a data-center ToR/leaf-spine design, not a campus/enterprise switch guide, and not an SD-WAN or security gateway feature overview. The focus stays on the router’s hardware forwarding system: PHY/SerDes realities, forwarding pipeline, table/memory, buffering/QoS mechanics, and power/telemetry foundations.

Figure F1 — Where a core/aggregation router sits (and the boundary)
The diagram anchors scope: this page focuses on router hardware forwarding system constraints (scale, determinism, continuity) rather than DC switch topology or security/optical subsystems.
H2-2 · System Anatomy

System Anatomy: Line Cards, Switch Fabric, Control Plane, Management Plane

A core/aggregation router is best understood as three planes sharing one chassis. The data plane forwards packets at line rate, the control plane programs forwarding state and policies, and the management plane makes the box observable and maintainable (telemetry, alarms, fault logs). Keeping these planes cleanly separated is what enables high availability and fast fault isolation.

Chassis building blocks (what matters in hardware terms)

  • Line cards: high-speed ports + PHY/retimer/SerDes + forwarding pipeline + table/memory attachments (e.g., TCAM) + queues/counters.
  • Switch fabric: the internal interconnect that sustains aggregate throughput and defines fault domains (a fabric issue should not collapse the whole system).
  • Route processor / supervisor (SUP): computes and distributes executable forwarding state; should not destabilize steady-state forwarding during transient load.
  • Power & cooling: rail sequencing, margin, and thermal headroom are not “support functions”; they directly show up as link stability and error rates at 100G+.

The most practical way to avoid “architecture talk” is to anchor each plane to interfaces and observable signals. These signals become the fastest path to root cause when the system is stressed.

Plane | What it owns | What to observe (examples)
Data plane | Ingress/egress pipeline, lookup, ACL/QoS classification, queueing/scheduling, counters | Port errors (CRC/FEC), queue depth & drops, ECN marks, per-class throughput, fabric utilization
Control plane | Forwarding state computation and distribution; consistent programming to hardware tables | Table programming latency, update bursts, reconciliation errors, resource exhaustion signals
Management plane | Telemetry, alarms, system health, inventory/FRU, out-of-band access | Rail telemetry (PMBus), temperature hotspots, fan RPM, reset causes, fault logs aligned to dataplane counters

Boundary control (to stay in scope): this chapter defines the management plane only as a telemetry and fault-evidence path. Firmware supply chain, deep secure-boot flows, and controller implementation details belong to the dedicated BMC/security pages and are intentionally not expanded here.

Figure F2 — Chassis anatomy: ports → line cards → fabric (with control + management overlays)
The overlays show how “where to look” aligns with planes: dataplane counters explain forwarding symptoms, control-plane signals explain state programming, and management telemetry provides evidence for power/thermal correlation.
H2-3 · Forwarding Pipeline

Forwarding Pipeline: parse → lookup → ACL/QoS → queue/schedule → rewrite

A core/aggregation router earns its reliability through a deterministic hardware pipeline. Each stage converts “features” into resources (tables, memory bandwidth, queue space), and each stage exposes evidence signals that make debugging and validation repeatable.

What the pipeline must guarantee at scale

  • Line-rate progression: no hidden slow path for common traffic classes under sustained load.
  • Policy determinism: classification and actions (mark/police/shape) are consistent and measurable via counters.
  • Congestion is explainable: drops/ECN marks correlate to queue thresholds and scheduling decisions—not mystery behavior.

Engineering mapping: every pipeline stage has a “what it owns” and a “what proves it” view.

Stage | What it owns | Evidence signals | Typical failure patterns (symptoms)
Ingress MAC/PCS + FEC | Bit recovery, lane alignment, FEC correction envelope | CRC, lane errors, pre-FEC / post-FEC counters, CDR lock | Port-specific link flaps; throughput drops with rising FEC corrections; errors correlated with temperature or power headroom
Parser | Header extraction and feature vector generation for lookup/classification | Parse error counters, “unknown header” drops, exception flags | Only certain encapsulations fail; traffic silently drops when header formats exceed parser capabilities
Lookup | L2/L3/label keying, adjacency selection, next-hop resolution | Lookup miss counters, adjacency miss, per-path hit counters | Blackholes for specific prefixes/classes; asymmetric behavior due to incomplete state programming (without discussing protocol internals)
ACL/QoS classification (TCAM) | Masked match + priority resolution for ACL, policy routing, class mapping | ACL hit/miss, class counters, rule shadow/conflict indicators (if exposed) | Policy appears “random”; rules match unexpectedly due to ordering/shadowing; class distribution diverges from expectations
Queueing & congestion actions | Buffer allocation, drop/mark policy (WRED/ECN), per-class queues | Queue depth watermarks, drop counters, ECN mark counters, tail drops | Microburst loss at low average utilization; latency tail growth; drops concentrated in a subset of classes
Scheduler & egress rewrite | Per-class service order (Strict/WRR/DRR), shaping, header rewrite and egress framing | Per-class throughput, shaping counters, egress error counters | Priority starvation under strict scheduling; rate-limit anomalies; egress-specific errors after rewrite actions

Why TCAM sits in the classification stage

  • Masked match at wire speed: ACL/QoS rules often require wildcard fields and priority resolution.
  • Cost shows up as power/heat: wider keys and more entries increase silicon area and thermal load.
  • Debug requires counters: classification without reliable hit/miss evidence becomes non-actionable.
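The masked-match and priority-resolution behavior described above can be sketched as a toy software model (illustrative only; real TCAM resolves all entries in parallel in silicon, and the entry values here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class TcamEntry:
    value: int   # match pattern for the key bits
    mask: int    # 1-bits are "care" bits; 0-bits are wildcards
    action: str

def tcam_lookup(entries, key):
    """Return the action of the highest-priority matching entry.

    Priority is positional (lower index wins), which is why rule
    ordering and shadowing determine what actually matches.
    """
    for e in entries:
        if (key & e.mask) == (e.value & e.mask):
            return e.action
    return "default"

# Toy ACL: a fully-specified rule shadows a broader wildcard rule.
acl = [
    TcamEntry(value=0xC0A80001, mask=0xFFFFFFFF, action="drop"),
    TcamEntry(value=0xC0A80000, mask=0xFFFF0000, action="permit"),
]
```

Reordering the two entries would make the wildcard rule shadow the specific one, which is exactly the “policy appears random” failure pattern noted in the pipeline table.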

Where SRAM/HBM/DDR typically matter

  • SRAM: low-latency adjacency and fast counters; sensitive to access patterns.
  • HBM/DDR: larger buffers and extended statistics; capacity helps bursts, but bandwidth/latency shape behavior.
  • Queues are physics: congestion decisions depend on actual memory bandwidth and thresholds.
Figure F3 — Packet pipeline map (with observability probes)
The pipeline is intentionally hardware-centric: it links features to silicon resources and to evidence signals used in acceptance tests and rapid root-cause isolation.
H2-4 · Tables & Memory

Tables & Memory: FIB scale, TCAM sizing, counters, and update mechanics

Router performance and stability are bounded by tables and memory physics. “More rules” or “more routes” is not a software-only change: it increases table width, entry count, memory bandwidth, and heat. The practical design question is not “can the box forward,” but “can it forward and update state consistently without outages.”

Table families (what grows in real deployments)

  • FIB entries: executable forwarding prefixes (scale is driven by aggregation strategy and routing domains).
  • Adjacency / next-hop: per-path resolution, ECMP fan-out, and per-neighbor state needed by forwarding.
  • ACL / policy / QoS: masked-match rules for security boundaries, traffic classes, and service guarantees.
  • Label entries: additional forwarding keys for encapsulation labels (used only as a table-type example).

“Table = cost” (why it impacts BOM and thermals)

  • TCAM width: matching more fields per rule increases key width and power per lookup.
  • Entry count: larger policies consume banks/stages and raise baseline power.
  • Heat headroom: TCAM and external memory can become the dominant hotspot, constraining port density.
  • Update risk: high-rate updates can stress programming paths and create transient inconsistency windows.

A minimal but actionable sizing view is to separate match-heavy tables (TCAM) from capacity-heavy state (SRAM/HBM/DDR), then ensure the update mechanism preserves forwarding continuity.

Memory / block | Typical contents (router context) | Engineering tradeoff (why it matters)
TCAM | ACL rules, QoS classification keys, policy routing match sets | Fast masked match, but high power/heat; width and rule ordering affect capacity and predictability
SRAM | Adjacency caches, fast counters, low-latency state used in the hot path | Excellent latency; constrained capacity; access patterns can become bandwidth bottlenecks
HBM / external high-bandwidth | Large buffers, extended counters, scale-out forwarding state (platform dependent) | Capacity and throughput enable burst tolerance; thermal/power budgeting becomes a platform constraint
DDR (control/auxiliary) | Control-plane state, logs, extended telemetry, non-hot-path data | Large capacity; higher latency; not a substitute for deterministic hot-path resources

Update mechanics (hitless principle without vendor specifics)

  • Stage new state: program new entries into a shadow bank or unused region, while old state continues forwarding.
  • Switch atomically: flip a pointer/version selector so lookup uses the new table set.
  • Retire safely: reclaim old entries only after counters and reconciliation confirm continuity.

The objective is to avoid transient “holes” where neither old nor new entries match, and to keep evidence (counters/logs) aligned across the cutover.
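The stage → switch → retire model can be expressed as a minimal sketch (a vendor-neutral illustration with two banks and a version selector standing in for the hardware pointer flip; names are hypothetical):

```python
class ShadowTable:
    """Sketch of the hitless update model: stage -> switch -> retire.

    Two table banks; the active-bank selector plays the role of the
    hardware pointer/version flip. Lookups always see one complete,
    consistent table set -- never a mix of old and new entries.
    """
    def __init__(self, initial):
        self.banks = [dict(initial), {}]
        self.active = 0          # pointer/version selector
        self.staged = False

    def lookup(self, prefix):
        return self.banks[self.active].get(prefix)

    def stage(self, new_state):
        """1) Program new entries into the shadow bank; old state keeps forwarding."""
        self.banks[1 - self.active] = dict(new_state)
        self.staged = True

    def switch(self):
        """2) Atomic flip: lookups now use the new bank."""
        assert self.staged, "never flip to an unprogrammed bank"
        self.active = 1 - self.active
        self.staged = False

    def retire(self):
        """3) Reclaim the old bank only after counters confirm continuity."""
        self.banks[1 - self.active] = {}

t = ShadowTable({"10.0.0.0/8": "nh1"})
t.stage({"10.0.0.0/8": "nh2", "192.168.0.0/16": "nh3"})
old_view = t.lookup("10.0.0.0/8")   # still "nh1": no transient hole
t.switch()
new_view = t.lookup("10.0.0.0/8")   # now "nh2"
t.retire()
```

The key property is that no lookup ever lands between old and new state, which is the “no transient holes” objective stated above.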

Figure F4 — Table taxonomy & memory map (plus the update path)
The diagram separates match-heavy resources (TCAM) from capacity-heavy resources (buffers/state), then shows a vendor-agnostic hitless update model: stage → switch → retire with reconciliation evidence.
H2-5 · PHY & SerDes

100G/400G/800G PHY & SerDes: where retimers and gearboxes fit

High-speed ports are not “a single interface.” They are a chain from MAC/PCS/FEC to SerDes lanes, then through channel losses (connectors, backplane, traces) to the external medium (module or copper). Retimers and gearboxes exist to keep the chain inside a measurable margin envelope across temperature, aging, and load.


Port chain decomposition (what to isolate first)

  • MAC/PCS/FEC: exposes pre/post-FEC behavior and the correction headroom trend.
  • SerDes lanes: the “heart” of timing recovery and equalization; lane-level evidence matters.
  • PHY / retimer / gearbox: restores eye margin and/or maps lane structures to meet channel constraints.
  • Channel: insertion loss, reflections, and crosstalk accumulate across connectors and backplane paths.
  • External medium: treated as an endpoint load (details of module internals are out of scope here).

PAM4 vs NRZ (only the practical engineering delta)

  • PAM4: narrower noise/jitter margin; more sensitive to channel loss and reflections, so equalization and FEC headroom become critical.
  • NRZ: wider margin at lower symbol complexity; scaling to higher throughput pushes channel demands in a different direction.
  • Takeaway: what changes is not “difficulty,” but how fast margin is consumed by loss, jitter, and temperature.

When retimers/gearboxes become unavoidable (decision triggers)

  • Long electrical reach: extended traces, multiple connectors, or backplane paths push insertion loss beyond practical equalization.
  • Backplane + connector stacking: reflections and impedance discontinuities accumulate and shrink the eye quickly.
  • Margin collapses under heat/load: pre-FEC rises with temperature or high system power, even if link stays “up.”
  • Lane alignment issues: lane skew and intermittent CDR lock indicate the chain is operating too close to the edge.
  • Rate/lane mapping constraints: a gearbox may be needed to satisfy lane organization across ASIC ↔ channel ↔ endpoint assumptions.

How to read port health metrics (vendor-neutral intent): interpret trends as “margin consumption,” not only as pass/fail.

Metric | What it indicates | How it helps isolate the weak segment
pre-FEC error trend | Raw channel quality drift (loss/jitter/noise eating eye margin) | Rising pre-FEC with stable post-FEC suggests shrinking headroom; correlate with temperature and power loading
post-FEC errors | Correction envelope exceeded (uncorrectable events) | Signals a hard margin failure; prioritize channel/connector loss, reflections, and retimer placement review
FEC margin / corrected counts | How much “reserve” remains and how hard the system is working to stay clean | Explains “link up but performance unstable” behavior under thermal stress or aging
lane skew / alignment | Multi-lane timing divergence | Points to physical lane imbalance, routing asymmetry, or marginal clock recovery
CDR lock stability | Timing recovery robustness | Intermittent lock suggests jitter/return loss sensitivity; often worsens with temperature and power noise
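The “margin consumption, not pass/fail” reading can be sketched as a small classifier (thresholds are illustrative placeholders, not from any FEC specification; real limits come from the FEC in use):

```python
def classify_link_margin(pre_fec_ber, post_fec_errors, ber_floor=1e-6):
    """Turn raw counters into a margin-consumption verdict.

    pre_fec_ber: time-ordered samples of pre-FEC bit error ratio.
    post_fec_errors: count of uncorrectable (post-FEC) events.
    ber_floor and the 2x trend factor are illustrative thresholds.
    """
    if post_fec_errors > 0:
        return "margin failure"        # correction envelope exceeded
    first, latest = pre_fec_ber[0], pre_fec_ber[-1]
    if latest > ber_floor and latest > 2 * first:
        return "margin shrinking"      # rising pre-FEC with clean post-FEC
    return "healthy"
```

A “margin shrinking” verdict is the cue to correlate with temperature and power loading before the link ever goes down.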
Figure F5 — Port to module/copper chain (test points and risk points)
The figure keeps module internals out of scope and focuses on the electrical chain. Probes and risk points map directly to the metrics used in manufacturing and field troubleshooting.
H2-6 · Buffering & Congestion

Buffering & Congestion: deep buffers, microbursts, VOQ, and ECN boundaries

Core/aggregation routers often fail “in the small time scale”: microbursts can overflow queues even when average utilization looks safe. The practical goal is to make congestion behavior explainable: queue watermarks, drops, and ECN marks must align with a clear buffering model.

Why buffering is a first-order concern in core/aggregation

  • Fan-in aggregation: many ingress streams converge onto fewer egress ports.
  • Long feedback loops: congestion feedback takes time; queues can build before senders react.
  • Class isolation: QoS requires per-class behavior to be stable under bursts, not only at steady state.

Microburst mechanism (why “low average” still drops)

  • Step 1: simultaneous ingress arrivals (parallel flows).
  • Step 2: a fixed egress service rate (the bottleneck).
  • Step 3: queue depth spikes beyond available buffer → ECN marks and/or drops.
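The three-step mechanism can be demonstrated with a toy slot-based queue model (all numbers illustrative, in packets per slot; not a platform model):

```python
def simulate_egress_queue(arrivals, service_rate, buffer_limit, ecn_watermark):
    """Fluid model of a bottleneck egress queue.

    Per slot: add arrivals, drop overflow beyond the buffer, mark ECN
    above the watermark, then serve at the fixed egress rate.
    """
    depth, marks, drops, peak = 0, 0, 0, 0
    for a in arrivals:
        depth += a
        if depth > buffer_limit:          # step 3b: overflow -> drops
            drops += depth - buffer_limit
            depth = buffer_limit
        if depth > ecn_watermark:         # step 3a: mark before dropping
            marks += 1
        peak = max(peak, depth)
        depth = max(0, depth - service_rate)
    return {"peak_depth": peak, "ecn_marks": marks, "drops": drops}

# A single 50-packet burst; average load (5/slot) is half the service rate.
burst = [0, 0, 50, 0, 0, 0, 0, 0, 0, 0]
result = simulate_egress_queue(burst, service_rate=10,
                               buffer_limit=30, ecn_watermark=20)
```

Even though average utilization is 50%, the burst overflows the buffer in one slot, which is exactly why “low average” telemetry alone cannot rule out microburst loss.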

Architectural tools (router-centric, vendor-neutral)

  • Shared buffer: increases statistical multiplexing, but needs robust thresholds to avoid “buffer capture” by aggressive traffic classes.
  • VOQ (Virtual Output Queues): reduces head-of-line blocking and helps isolate contention, at the cost of more queue structures and planning.
  • Admission control: limits enqueue into congested regions earlier to prevent explosive queue growth.
  • ECN marking line: marks before dropping, at a defined queue watermark, making behavior measurable and tunable.

Delivery metrics should be reported as curves and percentiles, not single-point numbers.

Metric | How it is measured | What it proves (or disproves)
Throughput under bursty load | Step-load tests with mixed classes; observe stability of per-class rate | Whether the pipeline sustains line rate without hidden degradation under burst stress
Loss curve (drops vs load) | Increase offered load and burstiness; plot drops vs queue watermarks | Whether buffer thresholds are coherent; distinguishes microburst overflow from persistent congestion
Tail latency (p99/p99.9) | Measure per-class latency distribution while driving microbursts | Whether deep buffers cause unacceptable latency tails; correlates directly with queue depth
ECN marks | Count marks at defined watermarks during controlled congestion | Whether the system signals congestion early enough to reduce drops (without DC-specific PFC deep dive)
Figure F6 — Microburst & buffer model (VOQ + shared buffer + ECN mark line)
The model stays router-centric: it explains burst-driven queue spikes, shows VOQ/shared-buffer roles, and anchors ECN and drops to explicit watermarks (without diving into DC-specific RoCE/PFC tuning).
H2-7 · Clocking (inside the box)

Clocking inside the box: SerDes refclks, jitter, and why it shows up as errors

This chapter stays strictly at the high-speed link layer. Inside a router, SerDes reliability depends on how the reference clock (refclk) is generated and distributed to the switch/NP ASIC and any retimers. If jitter consumes the available margin, the first visible symptoms are usually CRC, rising pre-FEC, and occasionally link flaps.


Typical clock chain for SerDes (vendor-neutral)

  • Timebase (XO) + PLL: creates a stable refclk domain used by high-speed blocks.
  • Fanout / distribution: routes refclk to multiple destinations with controlled skew.
  • Destinations: switch/NP ASIC SerDes and retimers consume refclk and convert it into link timing.

The key is not “having a clock,” but keeping distribution asymmetry and additive jitter low enough that margin remains across temperature and load.

How to think about jitter budget (engineering view)

  • Budget = margin: the link has a finite tolerance window before the eye collapses.
  • Every stage spends budget: PLL noise, fanout additive jitter, coupling and power noise all reduce margin.
  • When budget is gone: eye closure → BER rises → FEC works harder → CRC/retries appear → link stability can degrade.
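The “every stage spends budget” idea can be made concrete with root-sum-of-squares accounting, which is the usual convention for uncorrelated random jitter (the numbers below are illustrative, not from any datasheet):

```python
import math

def remaining_jitter_budget(total_budget_ps, contributors_ps):
    """RSS jitter accounting for uncorrelated random contributors.

    Each stage (PLL noise, fanout additive jitter, power-noise
    coupling) spends part of the budget; what remains is the margin
    before the eye collapses.
    """
    spent = math.sqrt(sum(j * j for j in contributors_ps))
    return total_budget_ps - spent

# Illustrative: 5 ps RMS budget, PLL + fanout contributors.
margin = remaining_jitter_budget(5.0, [1.0, 2.0])
```

Note the RSS consequence: adding a 1 ps contributor to an already-tight chain costs far less than 1 ps of margin, but once the budget is near zero, any small addition pushes the link over the edge.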

Symptom mapping (what clock issues usually look like)

  • Intermittent CRC / retries: margin is thinning but not fully collapsed; often correlates with temperature or peak load.
  • Rising pre-FEC (stable post-FEC): the channel is getting noisier; FEC headroom is being consumed.
  • Link flaps / unstable lock: the system has crossed a robustness boundary; clock distribution and power noise become prime suspects.
  • Only some ports are sensitive: points to distribution skew/asymmetry or board-level coupling, not a uniform “network configuration” issue.
Figure F7 — SerDes clock tree and the jitter → error chain
The diagram is strictly “inside the box” and links clock distribution quality to measurable port symptoms without turning into a PTP/SyncE timing chapter.
H2-8 · Power & PMBus

Power architecture & PMBus telemetry: multi-rail VRMs, sequencing, and brownout-proofing

A core/aggregation router is a multi-rail power system. Many failures do not present as an immediate reboot: marginal rails often show up first as errors, link flaps, or unstable performance under heat and load. PMBus-enabled digital power provides the evidence trail: V/I/P/T telemetry plus fault flags that can be aligned to symptom timestamps.


Typical rails (and what tends to be most sensitive)

  • Core rail: severe droop tends to trigger system-level resets or reboots.
  • SerDes/retimer rails: noise or droop commonly appears as BER/FEC stress, CRC, or link instability.
  • Memory rails (DDR/HBM): marginal power can manifest as instability under sustained load and thermal stress.
  • TCAM rail: power/thermal headroom affects deterministic behavior when table activity is high.
  • I/O + auxiliary rails: management and side functions; failures can look “random” without telemetry correlation.

PMBus telemetry (what to capture for soft-fault localization)

  • Vitals: voltage, current, power, temperature (V/I/P/T) per rail.
  • Status: UV/OV/OC/OT and latched fault codes (even short events matter).
  • Correlation: align telemetry timestamps to CRC spikes, link flaps, or reboot events.
  • Trend view: rising power/temperature plus shrinking voltage margin is a common precursor to field issues.
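The correlation step can be sketched as a timestamp-window join between symptom events and latched rail faults (the data structures and rail names here are hypothetical; real inputs come from PMBus status logs and dataplane counters):

```python
def correlate_faults(symptom_ts, rail_events, window_s=2.0):
    """Map each symptom timestamp to rails with a fault in the window.

    symptom_ts: timestamps of CRC spikes / link flaps / reboots.
    rail_events: (rail_name, event_ts, fault_flag) tuples from PMBus logs.
    window_s: alignment tolerance between the two clock sources.
    """
    hits = {}
    for ts in symptom_ts:
        hits[ts] = [(rail, flag) for (rail, ev_ts, flag) in rail_events
                    if abs(ev_ts - ts) <= window_s]
    return hits

# Hypothetical example: a SerDes-rail UV event near a CRC spike.
events = [("VDD_SERDES", 100.5, "UV"), ("VDD_CORE", 300.0, "OC")]
hits = correlate_faults([100.0, 200.0], events)
```

A symptom with no rail hit is just as informative: it redirects attention from power to channel, thermal, or clocking causes.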

Sequencing / reset and brownout-proofing (principles + verification points)

  • Sequencing: ensure rails that gate stable I/O and high-speed blocks reach regulation before dependent domains are released.
  • PG and debounce: PG thresholds and deglitching prevent false resets while still catching real droops.
  • Brownout behavior: define whether the platform should degrade, alarm, or reset under short undervoltage events.
  • Margining (acceptance test): controlled, small voltage shifts should not immediately cause CRC/link instability if margin is healthy.
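The margining acceptance point can be stated as a tiny pass/fail rule (the ±3% window and the input shape are illustrative assumptions, not a standard requirement):

```python
def margining_verdict(test_points, limit_pct=3.0):
    """Voltage-margining acceptance sketch.

    test_points: (offset_pct, crc_errors) pairs from a controlled sweep.
    Within +/-limit_pct of nominal, a healthy link should show zero CRC
    errors; errors inside the window mean margin is already consumed.
    """
    in_window = [(off, crc) for off, crc in test_points
                 if abs(off) <= limit_pct]
    return {
        "pass": all(crc == 0 for _, crc in in_window),
        "worst_in_window": max((crc for _, crc in in_window), default=0),
    }

healthy = margining_verdict([(-3.0, 0), (0.0, 0), (3.0, 0), (5.0, 12)])
marginal = margining_verdict([(-3.0, 4), (0.0, 0), (3.0, 0)])
```

Errors outside the window (the 5% point above) are acceptable evidence of where margin ends; errors inside it fail the unit.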

The scope stays inside the box: multi-rail VRMs and PMBus evidence. 48 V hot-swap/eFuse design and PoE power are intentionally out of scope.

Fault | Where it often occurs | PMBus evidence | Common symptom
UV | SerDes/retimer or core rails under burst load | Undervoltage flag + droop trend in V/I/P | CRC spikes, rising pre-FEC; in severe cases link flap or reboot
OC | VRM current limits during peak traffic/thermal events | Overcurrent flag + current plateau | Sudden instability when load peaks; errors that correlate with traffic bursts
OT | Hotspots near TCAM/memory/VRM zones | Temperature rise + OT flag or throttling indicators | Performance degradation, link instability in heat-soak conditions
Figure F8 — Power rails & PMBus map (faults → symptoms)
The diagram focuses on multi-rail VRMs, PMBus evidence, and symptom mapping. 48 V front-end hot-swap/eFuse and PoE power topics are intentionally excluded.
H2-9 · Thermal & Mechanical

Thermal & mechanical reality: airflow, hotspots, and derating the right things

In a core/aggregation router, temperature problems rarely look like “a simple overheat.” Hotspots can first appear as port instability, higher error rates, or performance drift under heat-soak. The practical job is to build a closed loop: airflow path → hotspot map → sensors → fan curve → derating triggers.


Where hotspots come from (and why density matters)

  • ASIC: high power + adjacent SerDes blocks; heat reduces link margin.
  • TCAM: activity and temperature can compound power; sustained load creates hot plates.
  • Retimers: often clustered near dense ports; local heating can make “one row of ports” more sensitive.
  • VRMs: conversion losses concentrate near large loads; temperature reduces efficiency and margin.

Port density increases thermal density. That is why thermal issues often show spatial correlation (specific cards/port banks).

Sensor placement (evidence, not averages)

  • Inlet: boundary condition for the chassis (what the cooling system actually gets).
  • Outlet: measures airflow effectiveness (ΔT across the box).
  • Near hotspots: ASIC, TCAM, retimer banks, VRM zones (where failures start).
  • Correlation rule: hotspot sensors must explain error spikes better than “average chassis temperature.”

Fan curves and derating (do the least harmful thing first)

  • Fan curve goal: keep hotspot sensors below thresholds while avoiding permanent high RPM (noise and wear).
  • Heat-soak vs transient: validate both steady-state and burst-driven heating (traffic + power spikes).
  • Derate the right thing: protect link stability first (reduce port rate/active ports), then protect throughput, then optimize acoustics and lifetime.
  • Closed-loop triggers: tie derating to hotspot sensors and clear recovery conditions (avoid oscillation).
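The closed-loop trigger with anti-oscillation hysteresis can be sketched in a few lines (the temperature thresholds and the linear fan curve are illustrative, not a platform tuning):

```python
def fan_and_derate(hotspot_c, derating, derate_on=95.0, derate_off=85.0):
    """One control step: hotspot temperature -> fan duty + derating state.

    The derate_on/derate_off split is the hysteresis band: once derating
    engages, it only clears well below the trip point, preventing the
    enter/exit oscillation the text warns about.
    """
    # Illustrative linear fan curve, clamped to 30..100% duty.
    duty = min(100.0, max(30.0, 30.0 + (hotspot_c - 40.0) * 1.4))
    if derating:
        derating = hotspot_c > derate_off   # clear only below recovery point
    else:
        derating = hotspot_c >= derate_on   # engage at the trip point
    return duty, derating
```

Feeding this the hotspot sensors (not the chassis average) is what makes the loop actionable, per the sensor-placement rule above.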
Engineering action | What it improves | What it prevents
Improve conduction path (heatsink/contact/thermal interface) | Reduces hotspot peak temperature and gradient | Local margin collapse that shows up as errors or instability
Airflow shaping (baffles/flow guides/recirculation control) | Delivers air where it matters (ASIC/retimer/VRM zones) | “Cool inlet, hot hotspot” false confidence
Derating (ports/rate/power cap) | Preserves system stability under thermal stress | Error storms, link flaps, cascading faults triggered by heat
Figure F9 — Thermal map block diagram (airflow + hotspots + sensors + fan control loop)
The diagram stays inside the chassis. It does not cover site-level environmental monitoring (door/water/smoke), which belongs to the Site Env & Power Monitor page.
H2-10 · Reliability & HA

Reliability & high availability: redundancy, hitless upgrades, watchdogs, and fault domains

High availability is an engineering contract, not a slogan. It is built from redundancy domains, clear fault-domain boundaries, and verifiable criteria for failover and upgrades. A correct design proves that a single component failure does not cascade into chassis-wide outage.


Redundancy domains (concept + verification intent)

  • PSU: loss of one supply should not drop forwarding stability or trigger mass port errors.
  • Fans: loss of one fan should keep hotspots under control without immediate instability.
  • Dual SUP / route processor: failover should preserve forwarding with bounded control-plane impact.
  • Fabric redundancy: loss of a fabric element should not create a chassis-wide outage.
  • Line-card isolation: a failing card should be fenced off without dragging the whole box.

Hitless / ISSU requirements (criteria, not vendor features)

  • State continuity: the system must preserve the minimum state required for stable forwarding.
  • Table consistency: FIB/ACL/QoS programming must remain coherent across the transition.
  • Failover budget: there is a finite time window for switchover before traffic impact is visible.
  • Rollback safety: failed upgrade paths must return to a known-good state without widening the blast radius.
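The failover-budget criterion is directly measurable from a steady probe stream: the worst inter-arrival gap during switchover must stay inside the budget. A minimal sketch (the 50 ms default is an illustrative budget, not a standard requirement):

```python
def failover_impact_ms(rx_timestamps_ms, budget_ms=50.0):
    """Worst receive gap in a constant-rate probe stream vs the budget.

    rx_timestamps_ms: arrival times of probe packets spanning the
    switchover event. Any gap beyond the normal inter-packet spacing
    is traffic impact attributable to the failover.
    """
    gaps = [b - a for a, b in zip(rx_timestamps_ms, rx_timestamps_ms[1:])]
    worst = max(gaps, default=0.0)
    return {"worst_gap_ms": worst, "within_budget": worst <= budget_ms}

# 10 ms probe spacing; one 65 ms gap during the switchover window.
verdict = failover_impact_ms([0, 10, 20, 30, 95, 105])
```

Reporting the worst gap (rather than total loss) keeps the result comparable across probe rates and directly checks the bounded-impact contract.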

Fault domains and proof (telemetry + logs as evidence)

  • Fault domain goal: keep failures local (line card, fabric, SUP, or power/thermal domains).
  • Watchdogs & health monitors: detect abnormal behavior early and trigger domain-level containment actions.
  • Proof rule: telemetry and logs must show that only the affected domain degraded while other domains remained stable.
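The proof rule lends itself to a direct check. A hypothetical sketch, assuming per-domain error-counter snapshots taken before and after a drill (domain names and counters are illustrative):

```python
# Illustrative check of the proof rule: after a fault drill, only the
# affected domain's error counters may move beyond tolerance.

def domain_contained(before, after, affected, tolerance=0):
    """True if every domain other than `affected` stayed within tolerance."""
    for domain in before:
        delta = after[domain] - before[domain]
        if domain != affected and delta > tolerance:
            return False
    return True

before = {"lc1": 10, "lc2": 4,   "fabric_a": 0, "sup": 1}
after  = {"lc1": 10, "lc2": 812, "fabric_a": 0, "sup": 1}   # lc2 was the drill target
contained = domain_contained(before, after, affected="lc2")
```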
| Acceptance check | What to observe | What it proves |
| --- | --- | --- |
| PSU pull test | No chassis-wide error storm; telemetry shows clean takeover | Power redundancy without cascading instability |
| Fan failure test | Hotspot temperatures remain controlled; no sudden link flaps | Thermal domain resilience and correct fan curve triggers |
| SUP failover | Bounded transition; forwarding stability remains within budget | Control-plane redundancy and switchover robustness |
| Fabric element isolation | Traffic impact remains localized; alarms identify the correct domain | Fabric redundancy and fault-domain containment |
| Line-card isolation | Affected card fenced; rest of chassis stays stable | Blast-radius control and domain boundaries |
Figure F10 — Fault-domain diagram (redundancy components and isolation boundaries)
Fault domains — redundancy and isolation boundaries: a chassis with PSU A/B, fan tray, SUP A (active, control-plane services) and SUP B (standby, failover ready), redundant Fabric A/B paths, and per-card line-card fault domains (Line Card 1…N), with SPOF markers flagging single-point risks. Proof rule: failover and upgrades must be demonstrated by bounded impact plus domain-local telemetry/log evidence.
The diagram highlights redundancy domains and explicit fault-domain boundaries. Security boot/TPM/HSM topics are intentionally excluded and belong to dedicated security/BMC pages.

H2-11 · Validation & Bring-up Checklist: what proves it’s done (and debugs fast)

This chapter turns the router’s core hardware risks—high-speed links, forwarding behavior under congestion, table scale/updates, power/thermal stability, and redundancy—into a repeatable validation script. Each step defines: what to run, what to log, pass/fail criteria, and the next measurement when it fails.

Principle: verify power + telemetry health first, then links, then forwarding, then tables + updates, then redundancy, then soak. This prevents “electrical issues” from being misdiagnosed as “network issues”.

Figure F11 — Bring-up flow (evidence-driven)
Bring-up state flow: POWER-UP (reset + sequencing ok) → PMBUS HEALTH (UV/OC/OT = 0) → PORT PRBS (pre/post-FEC, CRC; on errors, check refclk, retimer, rails, noise) → FORWARD PERF (throughput + drop curves) → TABLE STRESS (scale + update consistency) → REDUNDANCY (failover budget ok) → SOAK TEST (thermal steady-state + logs). Evidence bundle: CSV counters + screenshots, PMBus fault logs + curves.

A) Step-by-step bring-up script (inputs → outputs)

Goal: convert system validation into a deterministic flow with unambiguous evidence artifacts.

  • Step 0 — Power-up: confirm sequencing, resets, and stable boot to a known state.
  • Step 1 — PMBus health: verify UV/OC/OT and fault logs are clean before any high-speed tests.
  • Step 2 — Port PRBS: PRBS + loopbacks; capture pre/post-FEC, CRC/FEC counters, lane skew, CDR lock.
  • Step 3 — Forwarding: line-rate throughput vs packet size; drop curves; queue depth/ECN observations.
  • Step 4 — Table stress: fill FIB/ACL/QoS tables; test update rate + hitless update behavior (criteria-only).
  • Step 5 — Redundancy drills: pull PSU/fan/SUP/line-card (where applicable); measure failover budget and fault containment.
  • Step 6 — Soak: long-run at thermal steady-state; verify error counters remain flat and logs stay clean.
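The gated ordering of these steps can be sketched as a small runner that stops at the first failing stage, so later symptoms are never collected on top of an earlier fault. Stage checks below are placeholders for real instrument/telemetry reads:

```python
# Minimal sketch of the gated bring-up order: each stage must pass before
# the next runs, so electrical faults are never misread as network faults.

def run_bringup(stages):
    """Run (name, check_fn) pairs in order; return (passed_stages, failed_stage)."""
    passed = []
    for name, check in stages:
        if not check():
            return passed, name        # stop: later stages would be misleading
        passed.append(name)
    return passed, None

stages = [
    ("power_up",     lambda: True),
    ("pmbus_health", lambda: True),
    ("port_prbs",    lambda: False),   # e.g. post-FEC errors still climbing
    ("forwarding",   lambda: True),    # never reached in this run
]
passed, failed = run_bringup(stages)
```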

B) Pass/Fail criteria (make “done” measurable)

Goal: avoid “it looks stable” by using counters, curves, and logs as acceptance truth.

  • Power: no persistent PMBus faults; no brownout signatures; rails remain within margin test envelopes.
  • Links: post-FEC error-free over the agreed window; stable CDR lock; CRC/FEC counters do not drift.
  • Forwarding: expected throughput for target packet sizes; drop curve matches buffer/ECN intent.
  • Tables: target scale achieved; updates do not break hit counters, do not trigger unexpected drops/reboots.
  • HA: failover stays within the outage budget; single fault remains in its fault domain.
  • Thermal: steady-state temps within limits; no thermally-correlated error bursts (CRC/link flap).
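“Counters do not drift” becomes testable once a sampling window and an agreed increase limit are fixed. A hedged sketch with illustrative samples:

```python
# Turn "counters do not drift" into a rule: sample a cumulative error counter
# over the soak window and bound its total increase. Limits are per-agreement.

def counter_flat(samples, max_increase=0):
    """True if a cumulative counter rises by no more than max_increase."""
    return (samples[-1] - samples[0]) <= max_increase and \
           all(b >= a for a, b in zip(samples, samples[1:]))  # sanity: monotonic

crc_samples = [120, 120, 120, 121, 121]    # one stray CRC over the window
flat_strict = counter_flat(crc_samples)                   # zero-increase rule
flat_agreed = counter_flat(crc_samples, max_increase=5)   # agreed envelope
```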

C) Acceptance checklist (copy/paste friendly)

| Area | What to run | What to log | Pass signal |
| --- | --- | --- | --- |
| PMBus / rails | Read telemetry, clear/verify fault logs, run margin test (safe envelope) | VIN/VOUT/IOUT/Temp/FaultCode | No UV/OC/OT; no recurring fault entries |
| Port PRBS | PRBS on target speeds; host/line loopback; lane mapping sanity | pre-FEC BER, post-FEC, CRC/FEC | post-FEC stays clean; counters flat |
| Forwarding perf | Line-rate tests (varied pkt sizes); microburst patterns | throughput, drop, queue depth, ECN | Expected curves; no unexplained tail spikes |
| Table scale | Fill FIB/ACL/QoS to targets; hit tests; update burst tests | hit/miss, update rate, error counters | Scale achieved; updates do not destabilize |
| Redundancy drills | PSU/fan/SUP/fabric/line-card fail (as supported) | failover time, service impact, logs | Within budget; fault contained |
| Soak | Long-run at nominal + elevated load (safe) with stable airflow | temp traces + link/error counters | No thermally-driven error bursts |

D) Fast debug map (symptom → first measurements → next isolation)

| Symptom | First measurements (must be logged) | Next isolation step |
| --- | --- | --- |
| Random CRC bursts | CRC/FEC counters, CDR lock, refclk alarms, PMBus rail ripple events | Swap port ↔ port; check refclk distribution; correlate with rail margin/temperature; bypass retimer path (if possible) |
| Link flap on certain ports | lane skew, pre-FEC BER, training status, module/cable changes, thermal sensors | Control environment: fixed airflow; retimer/connector path check; validate insertion loss suspects by pattern |
| Good PRBS, bad traffic | queue depth, drop reason, ECN marks, scheduler stats | Reproduce with microburst; verify admission/VOQ behavior; check counter alignment across ingress/egress |
| Scale works, updates break | update rate, hit counter consistency, control-plane CPU health, error logs | Reduce update burst; verify atomicity expectations; locate the stage where inconsistency starts (lookup vs counters) |
| Thermal-triggered errors | hotspot temps, fan PWM/RPM, PMBus temps on VRMs, error counters vs time | Pinpoint hotspot rail/device; adjust fan curve; validate derating actions (rate/ports) and re-run soak |
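The common thread in this map is time-aligned evidence. A minimal correlation sketch, assuming burst and telemetry timestamps share one clock (event names are hypothetical):

```python
# Sketch of the time-aligned evidence trail: for each CRC-burst timestamp,
# find rail/thermal telemetry events within a correlation window.

def correlate(bursts, events, window_s=1.0):
    """Map each burst timestamp to telemetry events within ±window_s."""
    return {b: [e for (t, e) in events if abs(t - b) <= window_s] for b in bursts}

crc_bursts = [100.2, 250.7]                     # seconds since test start
telemetry  = [(100.0, "vrm_ripple"),
              (180.0, "fan_step"),
              (250.5, "hotspot_peak")]
matches = correlate(crc_bursts, telemetry)
```

A burst with no telemetry match is itself a finding: it points at a gap in sampling rate or sensor placement rather than at a known contributor.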

E) Delivery artifacts (what “done” must include)

  • Bring-up record: the flow in Figure F11 with timestamps and outcomes for each step.
  • Port/link report: PRBS configs, pre/post-FEC, CRC/FEC counters, lane mapping notes, key screenshots.
  • Forwarding curves: throughput vs packet size; drop curves; queue/ECN traces (CSV + plots).
  • Table scale & update: achieved scale, update rate tests, any exception logs, counter consistency checks.
  • Power/thermal evidence: PMBus fault logs (clean), rail telemetry snapshots, steady-state thermal traces.
  • Redundancy drills: failover timing, fault-domain containment proof, “single point of failure” notes.

F) Tools & Silicon References (Part Numbers) — for faster bring-up

The part numbers below are practical reference anchors for building observability and accelerating fault isolation (they are not endorsements, and availability varies by region/BOM strategy).

High-speed link conditioning (retimers / equalizers)

  • Marvell MV-CHA180C0C — 800G PAM4 DSP retimer (monitoring + PRBS/loopback capabilities are useful for isolation).
  • TI DS280DF810 — 28Gbps 8-channel retimer (board-level reach/robustness reference for mid-rate SerDes paths).
  • Semtech GN8112 — 112G PAM4 quad linear equalizer (useful as a “lossy copper reach” reference block).

Refclks / jitter control (when errors look “random”)

  • ADI AD9545 — dual-DPLL clock synchronizer/jitter cleaner class reference for PHY clocks.
  • Skyworks (SiLabs) Si5341 — ultra-low jitter clock generator family reference (multi-output clock trees).
  • ADI HMC7044 — jitter attenuator class reference (high-performance clock distribution use cases).

PMBus / rail telemetry (turn power into evidence)

  • ADI LTC2977 — 8-channel PMBus power system manager (sequencing + telemetry + fault logs).
  • Infineon XDPE192C4C-0000 — digital multiphase controller class (PMBus-managed VR rails).
  • TI INA238 — current/voltage/power monitor (alert-driven “rail anomaly” capture).

Thermal sensing & fan control (prevent heat-driven errors)

  • TI TMP117 — high-accuracy digital temperature sensor reference (hotspot mapping).
  • Microchip EMC2305 — SMBus fan controller (multi-fan curves + closed-loop options).

Lab bring-up instruments (models)

  • Anritsu MP1900A — modular high-performance BERT platform (PHY-layer validation workflows).
  • Keysight M8040A — high-performance BERT (PAM4/NRZ coverage for high-speed links).
  • Keysight IxNetwork — traffic generation + measurement suite (1G→800G class traffic scenarios).

What to record with these tools

  • Retimers/equalizers: PRBS settings, loopback mode, SNR/eye stats (if available), error counters.
  • Clock devices: ref selection events, holdover transitions, jitter alarms (if available), correlation to CRC bursts.
  • PMBus managers: fault logs (timestamped), margin test results, rail telemetry snapshots under load.
  • Thermal/fans: hotspot temps vs time, fan PWM/RPM vs time, error counters vs time.
  • BERT/traffic: test pattern, duration, link settings, counters/curves, and DUT configuration hash.
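The “DUT configuration hash” mentioned above can be produced by canonicalizing the configuration before hashing, so reruns with identical settings compare equal regardless of key order. The keys below are illustrative:

```python
# Canonicalize-then-hash: sort keys and fix separators so semantically equal
# configurations always yield the same hash across reruns and tools.

import hashlib
import json

def config_hash(config: dict) -> str:
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

cfg_a = {"pattern": "PRBS31", "speed": "400G", "fec": "RS544"}
cfg_b = {"speed": "400G", "fec": "RS544", "pattern": "PRBS31"}  # same config, reordered
```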


H2-12 · FAQs (Core / Aggregation Router)

Each answer stays within this page’s scope: forwarding pipeline, tables/TCAM, high-speed ports, buffering/ECN, in-box clocking, PMBus power telemetry, thermal reality, HA/ISSU principles, and acceptance evidence.

1) Core router vs data center switch — what is the practical boundary?

The boundary is usually proven by scale + services of the forwarding plane, not by port speed. Core/aggregation routers prioritize large L3/MPLS-class table sets, strict fault domains, serviceable chassis design (line cards + fabric), and predictable behavior during updates and failovers. A data center switch is often optimized for a different buffer model, different table mix, and different operational envelope.

H2-1/2 · table mix · chassis HA
2) Why does a router need TCAM, and how should it be sized?

TCAM is commonly used where multi-field matches must be fast and deterministic (ACLs, classification, policy matches) inside the forwarding pipeline. Sizing starts from the rule families (ACL/QoS/policy), then applies width reality (fields expand entries), growth margin, and redundancy requirements. Validation is practical: fill to target scale and confirm hit behavior, counter stability, and power/thermal headroom.

H2-3/4 · TCAM width vs entries · validation
3) FIB scale vs update rate — what breaks first in real deployments?

Three things often fail before “raw lookup” fails: (1) the update path (control-to-ASIC programming) becomes bursty and creates transient inconsistency, (2) telemetry/counter coherence drifts so the box “forwards” but evidence becomes unreliable, or (3) hitless conditions are violated during redundancy events (failover/ISSU) because state and tables cannot remain aligned under update pressure.

H2-4/10/11 · update mechanics · hitless criteria
4) When do 400G/800G designs require retimers or gearboxes?

Retimers become necessary when the channel (backplane + connectors + traces + cable) consumes too much margin: insertion loss, reflections, or temperature sensitivity causes unstable training, rising pre-FEC errors, or frequent CDR unlocks. Gearboxes typically appear when lane rates or electrical interfaces must be adapted while preserving link integrity. The decision should be evidence-based: PRBS results, margin trends, and repeatable failure location along the port chain.

H2-5 · margin evidence · port chain
5) Pre-FEC looks fine but post-FEC worsens — what does it indicate?

This often points to a measurement mismatch (different windows, different sampling points), or to bursty/clustered errors that stress correction more than average BER suggests. It can also indicate a hidden contributor that changes with time—refclk jitter events, rail noise, or thermal drift—creating short error storms. The fix is to time-align counters, correlate with clock/power/temperature telemetry, and reproduce under controlled load and airflow.

H2-5/7/11 · time alignment · correlation
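The clustering effect can be illustrated numerically: RS(544,514) corrects up to 15 symbol errors per codeword, so the same average error count is harmless when spread but uncorrectable when concentrated. The error distributions below are synthetic:

```python
# Toy illustration of why bursty errors hurt post-FEC more than the same
# average BER spread evenly. RS(544,514) corrects up to 15 symbol errors per
# codeword; 20 errors in one codeword are fatal, 20 spread across 20
# codewords are all corrected.

T_CORRECTABLE = 15   # symbol-error correction limit of RS(544,514)

def uncorrectable_codewords(errors_per_codeword):
    """Count codewords whose symbol-error count exceeds the FEC limit."""
    return sum(1 for e in errors_per_codeword if e > T_CORRECTABLE)

spread    = [1] * 20                 # 20 errors across 20 codewords
clustered = [20] + [0] * 19          # same 20 errors in one codeword
```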
6) “Microburst drops” with low average utilization — how can it be proven?

Average utilization hides microbursts because the queue can overflow in microseconds while the long-term average stays low. Proof requires high time-resolution evidence: queue depth samples (or watermark), drop counters with timestamps, and a bursty traffic pattern that reproduces the symptom. The clean demonstration shows multiple ingress flows converging to one egress queue, a queue spike, and drops/ECN marks occurring at the same moment.

H2-6/11 · queue depth · burst reproduction
7) Deep buffers reduce loss but increase latency — how should ECN/thresholds be set?

The practical goal is to keep loss low without creating unacceptable tail latency. ECN marking thresholds should be placed below the hard-drop line so congestion feedback happens before queues saturate. Thresholds are tuned using measured curves: queue depth vs time, drop probability vs load, and latency percentiles under bursty traffic. Different service classes can use different thresholds, but the evidence must remain consistent and repeatable.

H2-6 · ECN mark line · tail latency
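The “mark below the drop line” intent maps onto a RED-style marking ramp. A sketch with placeholder thresholds, to be replaced by the measured curves described above:

```python
# RED-style ECN ramp: no marking below min_th, linear ramp to max_p at
# max_th, capped above. Hard drop sits separately, above the mark line.

def ecn_mark_probability(depth, min_th, max_th, max_p=1.0):
    """Marking probability as a function of instantaneous queue depth."""
    if depth <= min_th:
        return 0.0
    if depth >= max_th:
        return max_p
    return max_p * (depth - min_th) / (max_th - min_th)

p_low  = ecn_mark_probability(10, min_th=20, max_th=80)   # below mark line
p_mid  = ecn_mark_probability(50, min_th=20, max_th=80)   # mid-ramp
p_high = ecn_mark_probability(90, min_th=20, max_th=80)   # saturated
```

Placing `max_th` well below the hard-drop depth gives senders congestion feedback before the queue saturates, which is the behavior the latency-percentile curves should confirm.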
8) Link flaps only on certain ports — how to isolate clocking vs power vs thermal?

Start with spatial correlation: do failures cluster by port bank, line card, or proximity to retimers/VRMs? Then isolate in a strict order: fix airflow and ambient, lock refclk source/distribution, and correlate error bursts with PMBus rail telemetry and hotspot temperatures. Swap ports/modules/cables to test locality, and use loopbacks to separate “inside-the-box” from “outside” contributors. The winning approach is a time-aligned evidence trail.

H2-7/8/9/11 · spatial correlation · time-aligned evidence
9) PMBus shows no fault, yet ASIC errors rise — what telemetry is missing?

“No PMBus fault” usually means no threshold was crossed, not that power is perfect. Many failures come from fast transients, localized droop, or rail noise that is too brief or too local to be captured by coarse telemetry. Missing pieces often include higher-rate rail sampling/event tagging, margin-test logs under load, hotspot temperature sensors near sensitive blocks, and tightly time-aligned port/ASIC error counters. Without these, correlation is slow and MTTR increases.

H2-8/11 · telemetry gaps · margin evidence
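One of the missing pieces, fast-droop capture, can be approximated by scanning high-rate rail samples for short excursions below nominal minus margin. All values below are synthetic:

```python
# Illustrative droop detector: coarse PMBus polling misses events shorter
# than its poll interval, so scan fast samples for sub-floor excursions.

def droop_events(samples_mv, nominal_mv, margin_mv, min_len=1):
    """Return (start_index, length) runs where the rail dips below floor."""
    events, start = [], None
    floor = nominal_mv - margin_mv
    for i, v in enumerate(samples_mv + [nominal_mv]):   # sentinel closes a run
        if v < floor and start is None:
            start = i
        elif v >= floor and start is not None:
            if i - start >= min_len:
                events.append((start, i - start))
            start = None
    return events

rail = [750, 749, 702, 698, 748, 750]    # brief droop a slow poller would miss
events = droop_events(rail, nominal_mv=750, margin_mv=30)
```

Timestamping these events with the same clock as the port error counters is what turns “PMBus shows no fault” into an answerable correlation question.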
10) How to validate hitless failover/ISSU without service impact?

Hitless validation is a criteria test: forwarding continuity within a budget, stable counters/queues, and controlled fault domains. A safe method uses a staged run: freeze large table updates, run a representative traffic profile, then trigger planned failover/ISSU while recording loss/latency and key counters. The result must show bounded impact and clean recovery, with logs proving the event stayed inside its redundancy domain rather than cascading across the chassis.

H2-10/11 · budgeted impact · domain containment
11) Which counters/logs are “must-have” to reduce MTTR?

Must-have evidence spans three layers: (1) PHY/port (CRC, FEC, lane status, CDR/training), (2) pipeline/queues (drop reason, queue depth/watermarks, ECN marks, scheduler stats), and (3) system domains (PMBus fault logs, rail/temperature snapshots, reset causes, redundancy events). The key is time alignment across layers, so symptoms can be correlated to a specific domain (port bank, rail, thermal zone, or failover event).

H2-3/8/10/11 · three-layer observability · time alignment
12) What is a minimal acceptance test pack for a new chassis or line card?

A minimal pack should still be evidence-complete: (a) PMBus health + margin snapshots, (b) port PRBS/loopback results with pre/post-FEC and CRC/FEC counters, (c) forwarding throughput + drop/ECN curves under bursty patterns, (d) table scale + update consistency tests, (e) redundancy drills within budget, and (f) soak test with thermal steady-state. Each item must produce artifacts (CSV + screenshots + logs) so results are reproducible and auditable.

H2-11 · evidence pack · reproducible