
BNG / BRAS: Subscriber Gateway Architecture, QoS, and Scaling


A BNG/BRAS is the subscriber gateway at the broadband edge: it terminates IPoE/PPPoE sessions and enforces per-subscriber policy/QoS while producing trustworthy accounting and telemetry. The real engineering challenge is keeping tail latency bounded, drops attributable, and session setup stable at scale, especially during bursts, AAA degradation, and HA events.

What BNG/BRAS Is: Subscriber Termination, Policy/QoS, Accounting

A BNG/BRAS (Broadband Network Gateway / Broadband Remote Access Server) is the access aggregation gateway where subscriber sessions are terminated, per-subscriber policies and QoS are enforced, and accounting records/counters are produced for OSS/BSS.

Boundary in one line: BNG/BRAS is the “subscriber edge” between access aggregation (DSL/FTTH/Metro Ethernet) and the IP edge/core, focusing on sessions, service policy, and HQoS—not generic security or NAT.

This page stays inside the BNG/BRAS scope. CGNAT, firewall/UTM/DPI, optical transport, and access-node internals are treated only as boundaries.

Typical responsibilities (what it must do well):

  • Terminate subscriber access (PPPoE or IPoE/DHCP) and maintain a consistent subscriber/session state.
  • Authenticate/authorize subscribers via AAA (commonly RADIUS) and apply service profiles (bandwidth, VLAN/VRF, ACL, etc.).
  • Enforce per-subscriber and per-service QoS (HQoS): shaping, policing, scheduling, and queue management at scale.
  • Generate accounting: per-subscriber/per-service counters and records (usage, session events, errors), suitable for billing and audits.
  • Aggregate uplinks toward the IP edge/core (LAG/ECMP scenarios) while keeping session stability and predictable QoS behavior.
  • Expose operational telemetry (session setup rate, drops by reason, queue depth/latency signals) to close the “field feedback” loop.

What it is NOT (scope boundaries):

  • Not a CGNAT platform: NAT may exist elsewhere; BNG typically hands traffic to NAT/service edge when needed.
  • Not a firewall/UTM/DPI box: it may apply basic ACLs/policy rules, but deep security inspection belongs to security gateways.
  • Not an access node (OLT/DSLAM/ONT): it terminates subscriber sessions after access aggregation, not optical line termination.
  • Not an optical transport element (DWDM/ROADM/OTN): those handle photonics/optical switching, not subscriber policy/QoS.

Why this boundary matters in engineering:

  • BNG performance is dominated by “subscriber-scale fairness” (HQoS + deep queues), not only raw forwarding throughput.
  • Session lifecycle health (setup rate, AAA latency, churn storms) often fails before line-rate datapath limits are reached.
  • “Looks stable” can be misleading: averages may hide microburst drops and latency tail growth in deep queues.
Figure F1 — Where BNG/BRAS sits: access aggregation to edge/core, plus control/services

Where BNG/BRAS Sits: Reference Topology and Two Critical Paths

BNG/BRAS typically sits after access aggregation and before the IP edge/core. To keep designs readable and debuggable, treat it as a device with two distinct paths: session establishment (control interactions) and user data forwarding (fast path).

Access domains that commonly feed into BNG/BRAS:

  • DSL access aggregation (subscriber concentration before session termination).
  • FTTH/FTTx aggregation (subscriber traffic converges before policy/QoS enforcement).
  • Metro Ethernet (point-to-point or aggregated L2 delivery into the BNG).

Upstream handoff targets (only what matters for BNG behavior):

  • IP backbone / edge: determines congestion patterns that drive queue depth and latency tails.
  • MPLS/service edge: often provides service separation; BNG focuses on subscriber policy/QoS at ingress.
  • Peering/transit: impacts burstiness and microburst exposure at uplinks.

Engineering principle: many “subscriber complaints” originate from uplink congestion + queue behavior, even when the session state is healthy. Separating setup path vs data path prevents misdiagnosis.

Interface terms (keep them practical, keep them scoped):

  • VLAN / QinQ: used to map access domains and subscriber groups; mis-planning can break subscriber-to-policy binding.
  • LAG: uplink scale/HA; uneven hashing can create “single-link hot spots” that inflate latency even when total capacity looks fine.
  • ECMP: upstream multipath; asymmetry can complicate fault isolation and can amplify tail-latency when one path congests.
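Hash skew is easy to reason about with a toy model. The sketch below (hypothetical flow keys and rates; the hash is illustrative, not any vendor's LAG algorithm) shows how a few heavy flows can concentrate on one member while aggregate utilization looks healthy:

```python
import hashlib
from collections import Counter

def lag_member(flow_key, n_members):
    # Per-flow hash keeps a flow on one member (no reordering), which is
    # exactly why one heavy flow cannot be spread across members.
    digest = hashlib.sha256(repr(flow_key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_members

def member_skew(flows, n_members):
    # flows: iterable of (flow_key, offered_bps); returns per-member load
    # and the max/mean ratio (1.0 = perfectly even).
    load = Counter({m: 0 for m in range(n_members)})
    for key, bps in flows:
        load[lag_member(key, n_members)] += bps
    mean = sum(load.values()) / n_members
    return load, (max(load.values()) / mean if mean else 0.0)

# Fifty light flows plus three heavy ones: aggregate utilization can look
# fine while whichever member the heavy flows land on runs hot.
flows = [((f"10.0.{i}.2", "203.0.113.7", 80), 100) for i in range(50)]
flows += [((f"192.0.2.{i}", "198.51.100.9", 443), 8000) for i in range(3)]
load, ratio = member_skew(flows, 4)
print(dict(load), round(ratio, 2))
```

A skew ratio well above 1.0 is the "single-link hot spot" signature: per-member counters, not aggregate utilization, reveal it.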

What to verify early (before deep tuning):

  • Session path health: setup rate, AAA latency, retries/timeouts, churn storms (session failures are not forwarding failures).
  • Forwarding path health: drops by reason (ingress vs queue vs egress), queue depth indicators, uplink counters.
  • Correlation: access events (link flap/aggregation changes) vs uplink congestion vs policy pushes.
Figure F2 — End-to-end view: session setup path vs data forwarding path

Subscriber Session Lifecycle: IPoE vs PPPoE, State Machine, and Failure Hotspots

BNG/BRAS reliability is often constrained by session lifecycle dynamics, not raw forwarding. Two access models dominate in practice: IPoE (DHCP-based) and PPPoE (session-based PPP). Both can “look fine” at average load but collapse under setup-rate spikes, AAA latency tails, or stale state.

Key diagnostic rule: treat session establishment as a multi-stage pipeline. When users report “drops,” first identify which stage failed (discover/authorize/maintain/cleanup), then confirm with stage counters + timeout distributions.

IPoE (DHCP) lifecycle (what matters at the BNG edge):

  • DORA: Discover → Offer → Request → Ack (address assignment and binding).
  • Relay/options: relay path and options influence how a subscriber maps to policy and services.
  • Lease maintenance: Renew/Rebind behavior determines how “silent failures” surface.

PPPoE lifecycle (what matters at the BNG edge):

  • Discovery: PADI → PADO → PADR → PADS (session creation).
  • PPP negotiation: LCP then IPCP (link/IP parameters), followed by policy attachment.
  • Keepalive: echo/keepalive timers decide how quickly a bad path is detected and how storms form.
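The lifecycle can be captured as a small table-driven state machine. The sketch below models the PPPoE lane only, with invented state names, and counts any illegal transition as a stage failure, which is the per-stage evidence this page keeps recommending:

```python
# Legal next-states per stage: invented names covering PADI/PADO/PADR/PADS,
# LCP/IPCP negotiation, policy attach (ONLINE), and teardown.
PPPOE_NEXT = {
    "INIT":       {"PADI_SENT"},
    "PADI_SENT":  {"PADO_RCVD", "INIT"},     # discovery timeout -> retry
    "PADO_RCVD":  {"PADR_SENT"},
    "PADR_SENT":  {"SESSION_UP", "INIT"},    # PADS received, or timeout
    "SESSION_UP": {"LCP_DONE", "TERMINATED"},
    "LCP_DONE":   {"IPCP_DONE", "TERMINATED"},
    "IPCP_DONE":  {"ONLINE", "TERMINATED"},  # ONLINE = policy attached
    "ONLINE":     {"TERMINATED"},            # keepalive failure or logout
    "TERMINATED": set(),
}

class Session:
    def __init__(self):
        self.state = "INIT"
        self.stage_fail = {}   # per-stage failure counters

    def advance(self, new_state):
        if new_state in PPPOE_NEXT[self.state]:
            self.state = new_state
            return True
        self.stage_fail[self.state] = self.stage_fail.get(self.state, 0) + 1
        return False

s = Session()
for step in ("PADI_SENT", "PADO_RCVD", "PADR_SENT",
             "SESSION_UP", "LCP_DONE", "IPCP_DONE", "ONLINE"):
    s.advance(step)
print(s.state)   # ONLINE
```

Aggregating `stage_fail` across sessions tells you which stage a churn storm is actually failing in, which is the diagnostic rule above made concrete.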

Most common “blow-up” patterns (symptom → likely cause → where to look):

  • Session flapping (users drop/reconnect). Likely cause: keepalive thresholds too aggressive; intermittent access issues; state cleanup lag; policy attachment delays. Evidence: session up/down rate, keepalive fail counters, cleanup timer expirations, “duplicate session” rejects, stage-specific timeouts.
  • Reconnect storm (setup-rate spikes). Likely cause: AAA latency tail; retry amplification; relay/option mismatch causing repeated failures; control-plane CPU saturation. Evidence: setup attempts per second, AAA RTT histogram, retry counters, queue backlog for control tasks, top timeout stage (DHCP vs PPPoE discovery vs authZ).
  • Address pool exhaustion (IPoE fails to bind). Likely cause: pool depleted by churn; stale leases; mis-tuned lease times; cleanup lag after outages. Evidence: pool free count trend, NAK/decline counters, lease renew failures, expired-but-not-released count, binding table occupancy.
  • Authorization timeout (policy not applied). Likely cause: AAA/policy dependency slow or down; synchronous path blocks session completion; circuit not protected. Evidence: authZ timeout counters, pending-policy queue depth, fallback/deny decisions, rate-limit activations, dependency health alarms.
  • Stale session residue (“ghost” occupancy). Likely cause: session not cleaned on link flap; accounting stop not emitted; inconsistent state across HA pair. Evidence: stale-state detector counters, long-lived sessions without traffic, cleanup retries, session-table watermark and “orphan” entries.

Engineering observability checklist (stage-level, not vague):

  • Stage counters: success/fail per stage (DHCP DORA stages; PPPoE discovery; LCP/IPCP; policy attach).
  • Timeout distribution: not only averages—track p95/p99 setup latency and AAA RTT tails.
  • Retry pressure: retry rate is the “storm multiplier” that converts a small slowdown into a widespread outage.
  • State watermarks: session-table occupancy, stale cleanup backlog, duplicate-session rejects.
Figure F3 — Two-lane session state machine with timeout and fault-injection points

Control Plane at Scale: AAA, Policy, Accounting, and Why CPU/Memory Fail First

In large deployments, session stability is commonly limited by control-plane capacity and dependency latency tails (AAA, policy stores, accounting sinks) long before datapath throughput is saturated. The practical goal is not “faster code,” but failure containment: prevent a slow dependency from triggering a global retry storm.

Scale insight: a small rise in AAA or policy latency can multiply into a large rise in setup rate (retries), which then consumes CPU/memory and further increases latency — a classic positive feedback loop.
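A back-of-the-envelope model makes the feedback loop concrete. The sketch below is an open-loop worst case with made-up timeout and retry numbers; it assumes every attempt sees the same (tail) latency:

```python
def control_plane_attempts(arrival_rate, dep_latency, client_timeout, max_retries):
    """Open-loop worst case: if dependency latency exceeds the client timeout,
    every session costs (1 + max_retries) attempts instead of one."""
    attempts_per_session = 1 if dep_latency <= client_timeout else 1 + max_retries
    return arrival_rate * attempts_per_session

print(control_plane_attempts(100, 0.05, 2.0, 3))   # 100 attempts/s: healthy
print(control_plane_attempts(100, 2.5, 2.0, 3))    # 400 attempts/s: 4x storm
```

The real system is worse than this model: the extra attempts themselves raise dependency latency, which is why rate limiting and circuit breaking (below) must break the loop externally.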

AAA at BNG scale (three phases, three failure modes):

  • AuthN (authentication): identity verification; failure mode is timeout/retry amplification.
  • AuthZ (authorization): policy/profile retrieval; failure mode is inconsistent service mapping and long “pending policy” queues.
  • Acct (accounting): start/update/stop records; failure mode is backlog (I/O) and missing stop records that create stale state.

RADIUS is most common; Diameter may appear in some environments but is treated here only as a boundary label.

Policy delivery: two axes that drive control-plane load:

  • Per-subscriber (base service): bandwidth profile, queue tree selection, domain/VRF binding, baseline ACL rules.
  • Per-service (business intent): traffic class mapping, time-of-day rules, quota triggers, walled-garden transitions.
  • Dynamic triggers (load multipliers): usage thresholds, time windows, alarms/maintenance events — each can cause bursts of updates.

Where CPU/memory/I/O are consumed first (practical bottlenecks):

  • Session/state tables (memory pressure). Why it scales badly: high churn increases concurrent “in-flight” sessions; stale state remains when stop records or cleanup are delayed. Mitigation: hard watermarks; stale-state detectors; fast cleanup paths; backoff on retries; limit setup concurrency.
  • Accounting/logging (I/O backlog). Why it scales badly: per-session events generate sustained writes; bursts during storms overflow queues and block critical paths. Mitigation: batch/aggregate updates; async emission; bounded queues with drop policies; replay-safe sinks; separate critical vs verbose logs.
  • AAA/policy dependencies (latency tails). Why it scales badly: tail latency (p99) drives retries; a “slow but alive” dependency is often worse than a hard fail. Mitigation: caching with explicit TTL/invalidation; circuit breakers; fail-open/fail-close policy per service tier; dependency health gating.
  • Control CPU / GC (scheduler collapse). Why it scales badly: serialization, timers, retry handling, and logging compete for CPU; storms create timer explosions and queue thrash. Mitigation: rate limiting; token buckets per stage; coarse timers; priority lanes (setup vs maintenance); load shedding under watermarks.

Control-plane protection toolbox (must be explicit, testable):

  • Layered caching: reduce dependency RTT sensitivity; define TTL and “safe fallback” behavior.
  • Batching: accounting/policy updates should be aggregated; specify batch size and max delay.
  • Async write + backpressure: keep the session pipeline non-blocking; define queue bounds and drop strategy.
  • Rate limit + circuit breaker: isolate slow dependencies; define trigger thresholds and recovery timers.

The correct “fail-open vs fail-close” choice is service-dependent and should be documented as an explicit policy, not an accident.
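Two of these tools can be sketched in a few lines each; the parameters below are illustrative, not recommended production values:

```python
import time

class TokenBucket:
    """Per-stage rate limiter: admit new setup work only when a token is free."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst          # tokens/s, bucket depth
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

class CircuitBreaker:
    """Trip after N consecutive failures; stay open for a recovery window,
    then allow one probe (half-open) instead of hammering a slow dependency."""
    def __init__(self, threshold, recovery_s):
        self.threshold, self.recovery_s = threshold, recovery_s
        self.failures, self.opened_at = 0, None

    def call_allowed(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.recovery_s:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return True
        return False

    def record(self, ok):
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```

Both objects expose their trigger events naturally (denied `allow()` calls, breaker open/close transitions), which is exactly the control-plane protection telemetry the checklist asks for.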

Figure F4 — Simplified control-plane sequence: BNG ↔ AAA ↔ Policy ↔ OSS/BSS (accounting)

Data Plane Pipeline: NPU/ASIC Fast Path, Lookups, and Counters at Line Rate

In a BNG/BRAS, the data plane must enforce per-subscriber policy and QoS while forwarding at high speed. This is why NPU/ASIC fast paths are used: at line rate, the system must parse, classify, look up subscriber/policy state, update counters, and enqueue packets without collapsing the CPU.

Performance reality: throughput is not only bps. The hard limit often shows up in pps (especially at minimum packet sizes) and under burst/microburst conditions. A design that “passes bps” can still fail in pps, counters, or queue mapping.

Typical fast-path pipeline (what happens to each packet):

  • Parser: extract L2/L3/L4 headers and key fields.
  • Classification: map packet to subscriber/service class (policy context).
  • Lookup: subscriber/session state + policy/QoS pointers; basic ACL matching (boundary).
  • Counters: update per-subscriber / per-class / per-interface usage and drops.
  • Queuing: place packet into the correct queue level (HQoS tree).
  • Shaping / Scheduling / Policing: enforce rates and fairness.
  • Egress: transmit to uplink, update egress stats, apply final shaping rules.
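A minimal software model of this pipeline (field names and the queue limit are invented for illustration; real fast paths run in NPU/ASIC hardware, not Python) shows the key discipline: every early exit increments a named drop-reason counter instead of dropping silently:

```python
from collections import Counter

drop = Counter()   # per-stage drop-reason counters: the "evidence" the text asks for

def fast_path(pkt, sessions, queues, queue_limit=64):
    """Minimal per-packet walk: parse -> classify/lookup -> count -> enqueue."""
    if "src" not in pkt:                                    # parser
        drop["parse_malformed"] += 1
        return False
    sub = sessions.get(pkt["src"])                          # subscriber lookup
    if sub is None:
        drop["lookup_miss"] += 1
        return False
    sub["bytes"] = sub.get("bytes", 0) + pkt["len"]         # usage counters
    q = queues.setdefault(sub["class"], [])
    if len(q) >= queue_limit:                               # HQoS leaf queue
        drop["queue_overflow"] += 1
        return False
    q.append(pkt)                                           # handed to scheduler
    return True

sessions = {"10.0.0.5": {"class": "be"}}
queues = {}
fast_path({"src": "10.0.0.5", "len": 1400}, sessions, queues)
fast_path({"len": 64}, sessions, queues)                     # malformed
fast_path({"src": "10.9.9.9", "len": 64}, sessions, queues)  # unknown subscriber
print(dict(drop))
```

When drop reasons are attributed per stage like this, "drops are rising" becomes a localizable statement rather than a dead end.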

Table boundaries (what must be hardware-friendly):

  • Subscriber/session table: state + bindings + pointers to policy/QoS objects.
  • ACL / TCAM (boundary use): rule matching for policy enforcement and service separation.
  • Flow/service counters: usage counters and drop-reason counters that must not block forwarding.

Why NPU/ASIC is required (practical argument, not marketing):

  • Line-rate counters + HQoS at subscriber scale require many state updates per packet.
  • Minimum packets + bursts stress pps; per-packet CPU handling quickly becomes the bottleneck.
  • Deterministic queuing decisions must happen before buffers overflow under microbursts.

Measurement targets (use these to avoid false confidence):

  • bps (throughput). Why it matters: large packets can hide weak classification/lookup behavior. Verify: stable throughput across mixed traffic classes and subscriber counts.
  • pps (packet rate). Why it matters: minimum packets expose per-packet work limits. Verify: no unexpected drops or counter collapse at high pps.
  • Minimum packet size. Why it matters: worst-case pps; reveals parser and lookup capacity. Verify: forwarding + counters + queue mapping remain stable.
  • Burst / microburst. Why it matters: short bursts overflow queues before averages rise. Verify: drop reasons are explainable; latency tail stays bounded for priority traffic.
  • Drop reasons. Why it matters: “drops” without reasons cannot be fixed. Verify: per-stage and per-queue drop counters are consistent with symptoms.

Fast-path observability (keep it stage-based):

  • Lookup health: table occupancy, miss/fallback counts, update rate watermarks.
  • Queue health: per-queue depth, overflow counters, tail latency indicators by class.
  • Counter integrity: update backlog (if any), aggregation windows, counter reset events.
Figure F5 — NPU/ASIC pipeline: latency/resource/drop points per stage

Deep Queues & HQoS: Fairness Under Bursts, and How to Prove QoS

BNG/BRAS platforms often aggregate large numbers of subscribers, which makes bursts inevitable. Deep buffers can reduce immediate loss during burst aggregation and upstream congestion, but they can also create latency tail inflation (bufferbloat). The engineering goal is to deliver per-subscriber fairness and service-class guarantees without letting delay tails explode.

Deep-queue trade-off: deeper buffers reduce short-term drops, but can increase latency and jitter. HQoS structures and queue limits must be designed with service intent in mind.

Why deep queues exist in BNG (three practical drivers):

  • Aggregation bursts: many subscribers burst simultaneously, overwhelming uplink drain rate for short windows.
  • Upstream congestion: the “next hop” can congest; buffering absorbs transient mismatch.
  • Per-subscriber fairness: queues enable controlled scheduling rather than uncontrolled tail drops.

HQoS hierarchy (a typical shaping/scheduling structure):

  • Port level: total uplink capacity; global scheduling boundary.
  • Service/VLAN group: separation by domain/service bundle.
  • Subscriber level: fairness and rate caps per subscriber.
  • Service class / flow class: voice/video/best-effort class behavior within each subscriber.

Where shaping, policing, and scheduling belong (avoid the common mistake):

  • Shaping is often applied where fairness is required (subscriber and/or service-group levels).
  • Scheduling decides which queue drains first (service classes and priority).
  • Policing enforces hard limits and can drop aggressively; it should be used deliberately for specific tiers.
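Scheduling fairness across classes is often implemented with a deficit round robin variant. The sketch below is a generic DRR, not any platform's scheduler, with byte counts as "packets" and illustrative quanta:

```python
def drr_drain(queues, quanta, rounds=1):
    """Deficit round robin across service classes: each queue may send up to
    its accumulated quantum (in bytes) per round, giving weighted fairness."""
    sent = {name: [] for name in queues}
    deficit = {name: 0 for name in queues}
    for _ in range(rounds):
        for name, q in queues.items():
            deficit[name] += quanta[name]
            while q and q[0] <= deficit[name]:
                pkt = q.pop(0)           # pkt is a byte count here
                deficit[name] -= pkt
                sent[name].append(pkt)
            if not q:
                deficit[name] = 0        # empty queues do not hoard credit
    return sent

# Voice gets a small quantum but small packets; best-effort gets a large
# quantum but 1500-byte packets, so voice drains fully in two rounds.
queues = {"voice": [100] * 5, "be": [1500] * 5}
sent = drr_drain(queues, {"voice": 300, "be": 1500}, rounds=2)
print({k: len(v) for k, v in sent.items()})
```

The quanta ratio is the fairness knob: it bounds how many bytes a greedy class can take per round, independent of its packet sizes.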

Bufferbloat vs loss (engineering judgment, not ideology):

  • If real-time traffic is critical, prioritize bounded delay via strict queue limits and priority handling.
  • If throughput-oriented traffic dominates, allow moderate buffering but monitor tail latency.
  • AQM (RED/WRED/CoDel) is considered when tail latency grows under mixed traffic and frequent bursts; it is not mandatory in every deployment.
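As one concrete AQM shape, the classic RED drop curve can be sketched as follows (thresholds and max probability are illustrative, not tuning advice):

```python
import random

def wred_drop(avg_queue, min_th, max_th, max_p):
    """Classic RED curve on a smoothed queue depth: no drops below min_th,
    a linear ramp up to max_p at max_th, forced drop above max_th."""
    if avg_queue < min_th:
        return False
    if avg_queue >= max_th:
        return True
    p = max_p * (avg_queue - min_th) / (max_th - min_th)
    return random.random() < p

print(wred_drop(5, 10, 30, 0.1))    # False: below min threshold
print(wred_drop(40, 10, 30, 0.1))   # True: past max threshold
```

WRED applies different (min_th, max_th, max_p) triples per class on the same queue, which is how it protects priority traffic while signaling bulk flows early.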

QoS proof (make it measurable and auditable):

  • Real-time (voice/interactive). Prove: bounded delay tail, bounded jitter, low loss under bursts. Test outline: mixed-load run with burst injection; track p95/p99 latency by class and drop reasons.
  • Streaming (video). Prove: stable throughput with controlled loss; avoid repeated stall patterns. Test outline: concurrent video-like flows plus background best-effort; verify class bandwidth and overflow counters.
  • Best-effort (bulk). Prove: fairness across subscribers; predictable sharing under contention. Test outline: many-subscriber test; compare per-subscriber throughput distribution and queue occupancy.
Figure F6 — HQoS tree + burst-to-queue-to-latency concept view

Crypto & ACL Boundary: What a BNG Enforces (and What It Does Not)

Security-related functions in a BNG/BRAS are best described as policy enforcement and isolation at subscriber scale—without turning the platform into a deep security inspection engine. This chapter defines practical boundaries for ACLs, crypto, subscriber isolation, and accounting/audit counters so performance and operability remain predictable.

Boundary rule: enforce access, isolation, and measurable accounting at line rate. Avoid designs that rely on per-packet logging or heavy inspection on the forwarding path.

ACLs: placement, scale, hit cost, and logging strategy

  • Insertion points: ACLs can be applied at ingress (early drop, save resources) and/or egress (service edge boundary). Placement should match the policy goal and the drop visibility needed.
  • Rule scale: per-subscriber rule explosion is avoided by using shared templates (service tiers / domains) with subscriber bindings, rather than unique rule sets for every subscriber.
  • Hit cost: long rule chains and frequent updates increase match pressure; keep high-hit rules simple and avoid unnecessary rule depth.
  • Logging: prefer counters + sampling + trigger logs (threshold-based). Avoid per-packet logs that can flood CPU and storage during attacks or churn events.
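The template approach can be sketched as a simple binding table; tier names and rules below are invented for illustration:

```python
# Shared rule templates keyed by service tier: N subscribers x M rules
# collapses to M template rules plus N lightweight bindings.
ACL_TEMPLATES = {
    "basic":   [("deny", "tcp", 25), ("permit", "any", None)],
    "premium": [("permit", "any", None)],
}
bindings = {}   # subscriber_id -> template name

def bind(subscriber_id, tier):
    if tier not in ACL_TEMPLATES:
        raise ValueError(f"unknown tier: {tier}")
    bindings[subscriber_id] = tier

def rules_for(subscriber_id):
    # One indirection at lookup time instead of a per-subscriber rule copy.
    return ACL_TEMPLATES[bindings[subscriber_id]]

bind("sub-001", "basic")
print(rules_for("sub-001")[0])
```

The same shape maps well to TCAM: the shared templates occupy a fixed number of entries, and per-subscriber state stays in cheaper table memory.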

Crypto: what is typically inside BNG scope vs out of scope

  • In-scope: management-plane protection and limited tunnel/edge security functions when required by deployment—kept operationally stable under load.
  • Out-of-scope (boundary): large-scale, compute-heavy security processing is usually handled by dedicated security appliances or service edges rather than consuming BNG forwarding resources.

Subscriber isolation: separation that prevents cross-subscriber impact

  • Session separation: PPPoE or IPoE bindings establish the subscriber boundary used by policy and accounting.
  • VRF/domain separation: routing/forwarding contexts keep domains isolated (no cross-domain leakage by default).
  • Anti-spoofing: source-validation techniques can be used to reduce address spoofing; uRPF is relevant as a boundary mechanism (mention only).

Accounting & audit counters: a “trusted chain” without overload

  • Counter layers: subscriber counters + port/queue counters create an internal cross-check that supports troubleshooting and billing confidence.
  • Stability under stress: counter export and audit signals should not backpressure the forwarding path; use aggregation windows and rate controls.
Figure F7 — ACL/Crypto insertion points mapped onto the fast-path pipeline

Uplink & Ethernet PHY Considerations: What Breaks at 100G/400G and How to Localize Drops

High-speed uplinks (100G/400G and beyond) can fail in ways that look like “congestion” even when the root cause is physical-layer instability. The practical goal is to localize the problem layer before changing policies: start with PHY/FEC counters, then check MAC drops, then confirm queue drops.

Troubleshooting principle: always localize the layer first. Changing HQoS or ACL rules without confirming PHY/MAC health often hides the real root cause.

Uplink forms (impact only):

  • Multi-port uplinks: capacity and redundancy, but traffic distribution and hot spots must be observable.
  • LAG: resilience and aggregation; member link instability can translate into churn-like symptoms.
  • ECMP: path spreading; hash skew can produce single-path congestion even when aggregate capacity looks fine.

PHY/SerDes effects that matter to BNG operations:

  • Bit errors & FEC pressure: increased correction activity can impact effective throughput and tail latency.
  • Link flap: short up/down events cause visible service instability (timeouts, churn, accounting noise).
  • PCS/MAC symptoms: physical issues can propagate upward and appear as MAC drops or queue overflows.

How to read FEC/PCS statistics (engineering-only):

  • Trends beat snapshots: rising counters that correlate with temperature, optics changes, or time-of-day patterns are more actionable.
  • Compare both ends: align local counters with peer counters to distinguish local faults from upstream issues.
  • Correlate with symptoms: check whether counter changes align with MAC drops, queue drops, and tail latency shifts.
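Trend extraction over counter samples can be as simple as a least-squares slope; the sketch below (sample values invented) flags a counter rising faster than a per-hour budget:

```python
def counter_slope(samples):
    """Least-squares slope of (t_seconds, counter_value) samples; a steady
    rise matters more than any single snapshot."""
    n = len(samples)
    mt = sum(t for t, _ in samples) / n
    mc = sum(c for _, c in samples) / n
    num = sum((t - mt) * (c - mc) for t, c in samples)
    den = sum((t - mt) ** 2 for t, _ in samples)
    return num / den if den else 0.0

def trending_up(samples, per_hour_budget):
    # Convert the per-second slope to per-hour and compare to a budget.
    return counter_slope(samples) * 3600 > per_hour_budget

fec_corrected = [(0, 0), (60, 6), (120, 12)]   # +0.1 corrections/s
print(trending_up(fec_corrected, 100))
```

Running the same check on local and peer counters, then comparing the two slopes, implements the "compare both ends" rule above.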

Localization order (recommended):

  • PHY / FEC. Trouble signs: BER/FEC counters rising, link flap, unstable error trends. Next action: stabilize the link first; confirm peer-side behavior.
  • MAC. Trouble signs: MAC drops/errors increase while PHY remains stable. Next action: confirm MAC-level congestion/drops before altering policy.
  • Queue / HQoS. Trouble signs: queue depth/overflow and tail latency rise with stable PHY/MAC. Next action: tune HQoS boundaries and queue limits with class intent.
Figure F8 — Counter-to-drop localization chain (PHY → MAC → Queue)

High Availability & Upgrade: HA/ISSU Where “No Session Drop” Is the Metric

In a BNG/BRAS, high availability is not a checkbox—it is measured by subscriber experience during failures and maintenance. The primary target is session continuity (or a strictly bounded disconnect rate), while secondary targets include switchover time, reconnect storm suppression, and accounting continuity.

Acceptance-first principle: define the pass/fail metrics before building HA or running ISSU. A “successful upgrade” is one where key service KPIs do not spike.

HA modes (impact only):

  • Active/Standby: one node forwards; the standby maintains enough state to take over with bounded churn.
  • Clustered designs: distribute sessions across nodes to reduce blast radius; failure should be localized to a subset of sessions.

Stateful switchover: what must be synchronized vs what can be rebuilt

  • Must sync: subscriber binding/session identity, policy “final state” pointers, and critical timers that prevent false retries.
  • Nice to sync: accounting aggregation windows and key counters used for cross-checking and troubleshooting.
  • Rebuild (common): deep queue internals and short-lived instantaneous measurements; design for controlled rebuild rather than full migration.

ISSU / hot upgrade boundaries:

  • Control vs data plane: the data plane should remain stable while control software changes, otherwise the event behaves like an outage.
  • Version compatibility: the upgrade window requires compatibility for key state fields (at minimum, the “must sync” set).
  • Rollback policy: define rollback triggers based on KPI deltas (setup failure rate, auth latency spikes, abnormal drop reasons).
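A KPI-delta rollback gate can be sketched as a comparison against the pre-upgrade baseline; the metric names and limits below are illustrative, not a platform schema:

```python
def rollback_needed(baseline, current, limits):
    """Compare post-upgrade KPIs against the pre-upgrade baseline; any metric
    whose relative increase exceeds its limit is a rollback trigger."""
    breaches = []
    for metric, limit in limits.items():
        base = baseline.get(metric, 0.0)
        now = current.get(metric, 0.0)
        delta = (now - base) / base if base else now
        if delta > limit:
            breaches.append((metric, round(delta, 2)))
    return breaches

limits   = {"setup_fail_rate": 0.5, "auth_latency_p99": 0.3}
baseline = {"setup_fail_rate": 0.02, "auth_latency_p99": 0.120}
post     = {"setup_fail_rate": 0.05, "auth_latency_p99": 0.130}
print(rollback_needed(baseline, post, limits))
```

Making the gate a pure function of recorded KPIs keeps the rollback decision auditable: the evidence that triggered it is the function's input.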

Failure drills: scenarios that reveal real HA readiness

  • Link loss: uplink member loss, LAG events, or upstream cut.
  • Upstream congestion: microburst + sustained congestion patterns.
  • AAA unreachable: timeouts and partial reachability.
  • Address pool anomalies: depletion, slow allocation, or inconsistent state during churn.

Drill acceptance criteria (example template):

  • Switchover event. Targets: bounded switchover time, bounded disconnect rate. Observables: session count, setup rate, teardown rate, churn indicators. Pass if KPIs spike briefly then return to baseline; fail if sustained churn or mass disconnects.
  • AAA unreachable. Targets: controlled setup degradation, no storm. Observables: auth latency, AAA errors/timeouts, setup success rate. Pass if retries are rate-limited and success recovers; fail if retry storms amplify the outage.
  • Congestion injection. Targets: per-class QoS preserved, drops localized. Observables: queue depth, drop reason, per-class tail latency. Pass if priority classes remain bounded; fail if drops spread or tail latency explodes.
  • Pool anomaly. Targets: detect fast, recover cleanly. Observables: DHCP/allocation failures, setup rate, session failures by reason. Pass if failures are attributed and recovered; fail if sessions flap or accounting becomes inconsistent.
Figure F9 — HA state synchronization scope: Must Sync vs Nice-to-Sync vs Rebuild

Telemetry & Field Evidence: Proving Stability with Metrics (Not Averages)

Telemetry should form a closed loop: collect the minimum metrics that explain user experience, trigger alerts based on thresholds and trends, localize the failing layer, apply controlled mitigations, then update baselines and drills. Stability is proven by tail behavior (p95/p99), drop reasons, and rate changes—not by averages.

Evidence rule: prefer counters + tail percentiles and keep logs minimal with sampling and rate limits. Unlimited logs can create a second outage.

Minimum metric set (actionable buckets):

  • Session health: session count, setup rate, teardown rate, churn indicators.
  • AAA/control plane: auth latency, AAA errors/timeouts, policy apply failures (as counters).
  • Forwarding truth: queue depth, drop reason, per-class tail latency (p95/p99).
  • System health: CPU/memory, control-plane protection triggers (rate-limit/circuit-breaker events).
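Tail percentiles are cheap to compute from raw samples; the nearest-rank sketch below (sample values invented) shows why the mean can look fine while p99 exposes the tail:

```python
def percentile(samples, p):
    """Nearest-rank percentile: simple, and it only ever reports a value
    that was actually observed."""
    ranked = sorted(samples)
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

# 100 fast setups plus two slow ones: the mean barely moves,
# but p99 reveals the tail users actually feel.
latencies = [0.05] * 100 + [2.0] * 2
print(round(sum(latencies) / len(latencies), 3), percentile(latencies, 99))
```

Production systems usually approximate percentiles from histograms or digests rather than raw samples, but the acceptance logic is the same: report the tail, not the mean.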

Common traps (“looks stable” but is not):

  • Averages hide tail: mean latency remains low while p99 spikes cause real user complaints.
  • Microbursts: utilization looks fine but short bursts overflow queues and create instant drops.
  • Accounting delay: aggregation windows and export backlog can distort near-real-time billing views.

Alert strategy (simple but robust):

  • Thresholds: catch immediate failures (timeouts, hard drops).
  • Trends (rate-of-change): detect degradation before full failure (latency slope, churn slope).
  • Correlation: connect symptoms to likely causes (AAA errors ↑ → setup fails ↑ → churn ↑).
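Threshold and trend triggers can be combined in one small check; the window size and limits below are illustrative:

```python
def should_alert(history, threshold, roc_limit):
    """Fire on an absolute breach OR a fast rate of change.
    history: recent samples, oldest first; roc is average delta per sample."""
    if history[-1] >= threshold:
        return "threshold"
    if len(history) >= 2:
        roc = (history[-1] - history[0]) / (len(history) - 1)
        if roc >= roc_limit:
            return "trend"            # degrading fast, even if still "green"
    return None

print(should_alert([1, 2, 10], 8, 5))   # threshold
print(should_alert([1, 4, 7], 8, 2))    # trend
print(should_alert([1, 1, 1], 8, 2))    # None
```

The "trend" branch is what catches a churn slope or latency slope before the hard threshold is hit, buying time for rate limiting instead of firefighting.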

Minimal logging set (avoid self-inflicted outages):

  • Always log: major state transitions (switchover, rollback), AAA unreachable, pool anomalies.
  • Sample log: only during anomaly windows; increase sampling temporarily and revert automatically.
  • Never do: unlimited per-packet/per-flow logs on the fast path.
Figure F10 — Observability closed loop: Metrics → Alerts → Localization → Mitigation → Postmortem

Validation & Acceptance Checklist: What “Done” Looks Like for a BNG/BRAS

This chapter converts architecture goals into a repeatable acceptance workflow. “Pass” is defined by measurable evidence: throughput (bps/pps), tail latency (p95/p99), drop reasons, session scale, and controlled degradation during failures and maintenance.

Evidence pack (minimum): counters + p95/p99 percentiles + drop reason attribution + minimal event logs (rate-limited, sampled). Avoid unlimited per-packet logging on the fast path.

Acceptance scope (must cover all):

  • Performance: bps/pps, minimum-size packets, and per-class tail latency under multi-queue load.
  • Scale: max sessions, setup rate, churn/reconnect storms, and AAA degradation behavior.
  • QoS: SLA validation per class, bufferbloat checks, and “AQM effective” behavior validation.
  • Stability: 72-hour soak with staged load and controlled churn; no KPI drift or runaway backlogs.
  • Fault drills: AAA down, link flap, upstream congestion; recovery must be bounded and explainable.
Figure F11 — Acceptance “test stack” pyramid: Functional → Performance → Scale → Stability → Fault Drills
[Figure: acceptance test-stack pyramid, built bottom-up — Functional (session bring-up, AAA reachability, policy apply) → Performance (bps/pps, min packet, p99 latency, drop reason) → Scale (max sessions, setup rate, storm control) → Stability (72h soak, backlog drift, safe logging) → Fault Drills (AAA down, link flap, congestion recovery). Evidence pack: counters, p95/p99, drop reasons, minimal logs, runbook. Acceptance = repeatable tests + bounded degradation + explainable evidence.]

Copyable acceptance checklist (example table):

  • Functional — IPoE / PPPoE session bring-up with AAA. Traffic model: low-rate, mixed packet sizes. Evidence: setup success rate, auth latency, failure-reason counters. Pass: no persistent setup failures; retries are controlled; failures are attributable.
  • Performance — Line-rate forwarding + minimum-size packets. Traffic model: 64B focus, mixed bps/pps. Evidence: pps/bps, queue depth, drop reasons, p99 latency. Pass: throughput meets target; tail latency bounded; drops localized and explainable.
  • QoS — Per-class SLA validation under contention. Traffic model: multi-queue load, bursts. Evidence: per-class p95/p99, per-queue drops, queue-depth time series. Pass: priority classes remain bounded; bufferbloat is detected/mitigated as designed.
  • Scale — Max sessions + sustained setup rate. Traffic model: ramp sessions; steady hold. Evidence: session count, setup rate, auth latency, CPU/mem, table watermarks. Pass: no uncontrolled churn; setup rate does not collapse; control plane remains stable.
  • Storm — Reconnect storm injection. Traffic model: forced flap / timed churn. Evidence: retry counters, setup-fail reasons, rate-limit triggers, backlog. Pass: storm is suppressed; system returns to baseline without cascading failures.
  • Stability — 72h soak with staged load phases. Traffic model: low→mid→high, controlled bursts. Evidence: KPI drift, backlog growth, tail-latency trends, log volume. Pass: no drift or runaway backlogs; logs remain bounded; no hidden tail regressions.
  • Fault drill — AAA down / partial reachability. Traffic model: normal load + AAA outage window. Evidence: AAA timeouts, setup success, churn, fallback counters. Pass: graceful degradation; recovery is bounded; no retry-storm amplification.
  • Fault drill — Link flap / uplink member loss. Traffic model: normal load + flap pattern. Evidence: link events, queue drops, session churn, switchover timers. Pass: bounded service impact; stable convergence; no mass session collapse.
  • Fault drill — Upstream congestion / microburst stress. Traffic model: short bursts + sustained congestion. Evidence: queue-depth peaks, drop reasons, per-class tail latency. Pass: drops are attributed; priority services remain bounded; mitigation behaves as designed.

Concrete test-stack BOM (example ordering part numbers):

  • Traffic generator / scale rig — SPT-N4U-110 / SPT-N4U-220 (Spirent TestCenter N4U compact chassis, 110V/220V): repeatable bps/pps, minimum-packet stress, multi-queue contention, and controlled churn for setup-rate and storm tests.
  • Chassis accessory — ACC-2017A / ACC-2018A (single-slot / dual-slot card carrier for N4U/N12U): builds a consistent hardware stack for identical test runs and evidence capture across labs.
  • 25GbE NIC (quad-port) — E810-XXVDA4 (aka E810XXVDA4; Intel Ethernet Network Adapter, 4×25/10/1GbE): cost-effective high-port-density host for traffic sink/source, counter validation, and mixed packet-size performance tests.
  • 100GbE NIC (dual-port) — MCX623106AC-CDAT (NVIDIA ConnectX-6 Dx, 2×100GbE QSFP56, Crypto + Secure Boot options): high-rate test host for line-rate forwarding, microburst stress, and queue/drop-reason attribution at 100GbE.
  • SyncE/PTP-capable NIC — MCX623106GC-CDAT (NVIDIA ConnectX-6 Dx with Enhanced-SyncE & PTP GM support + GNSS/PPS): for when acceptance requires time-alignment evidence (timestamped loss/latency correlation) without moving into clock-servo design.
  • Test fabric reference — BCM56880 (Trident4 series; Broadcom switch ASIC family for high-density fabrics): keeps the bottleneck out of the test fabric and maintains predictable L2/L3 switching capacity during BNG stress runs.

Note: part numbers above are intended as concrete examples for a replicable lab stack. Equivalent vendor platforms can be substituted, as long as the traffic model, observables, and evidence pack remain identical.


FAQs: BNG / BRAS Engineering Questions

Each answer stays within this page’s scope: subscriber termination, AAA/policy/QoS, accounting, fast path, uplinks, HA, and telemetry.

1) Do BNG and BRAS still differ in modern products, and how to state the boundary cleanly?

In modern deployments, “BNG” and “BRAS” often refer to the same functional block: subscriber termination plus policy/QoS enforcement and accounting at the broadband edge. The least controversial boundary is a function-based definition: terminate IPoE/PPPoE sessions, apply per-subscriber policy and QoS, and export accounting records. Avoid historical role debates and describe what the box must do on the wire.

Proof points: session bring-up counters + policy apply counters + accounting export status.
2) PPPoE vs IPoE (DHCP): what are the biggest differences in scale, failure modes, and operations?

PPPoE introduces a session protocol handshake and keepalive behavior, while IPoE relies on DHCP allocation/lease lifecycle and relay behavior. At scale, PPPoE pain often shows up as handshake timeouts, session churn, and retry storms; IPoE pain often shows up as address-pool depletion, lease thrash, and relay/option misbehavior. Operations should track setup success rate, tail auth latency, retry counters, and allocation failures separately for the chosen access mode.

Proof points: setup rate vs success rate, retry counters, address pool/lease failure counters, tail auth latency.
3) Why do auth timeouts and disconnects appear when setup rate spikes, even if session count is not huge?

Setup rate stresses the control plane first: session table allocations, timer wheels, policy transactions, and AAA request concurrency. Even with moderate steady-state sessions, bursts can push AAA round-trip time into the tail, causing timeouts and retries that amplify load into a storm. Common triggers include CPU saturation, memory/GC pressure, synchronous logging/accounting paths, and AAA backends entering partial degradation. The fix path starts with tail latency and retry suppression, not with headline throughput.

Proof points: p95/p99 auth latency, timeout/retry counters, CPU/mem, backlog/export watermarks.
4) “Throughput is fine but user experience is bad” — is it usually deep queues/bufferbloat, and how to disprove it?

Deep queues can absorb bursts without dropping packets, keeping throughput high while inflating tail latency (bufferbloat), but this must be proven with evidence. Disprove it by comparing per-class p95/p99 latency and queue depth time series under the same offered load: if tail latency stays bounded and queue depth does not build up, bufferbloat is not the primary cause. Also check drop reasons: if drops occur upstream or at a different stage, local queue tuning will not fix the symptom.

Proof points: per-class p95/p99 latency, queue depth peaks, drop-reason attribution before/after AQM enable.
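The disproof procedure reduces to a two-signal check, sketched here as a hypothetical helper (`bufferbloat_suspected` is illustrative): bufferbloat is only a credible cause when tail latency and queue depth move together under the same offered load.

```python
def bufferbloat_suspected(p99_ms, latency_budget_ms, queue_depth_series, depth_watermark):
    """Two-signal evidence check: blown tail latency AND queue buildup.
    If either signal is absent, look elsewhere (e.g. upstream drops)."""
    tail_blown = p99_ms > latency_budget_ms
    queue_builds = max(queue_depth_series) >= depth_watermark
    return tail_blown and queue_builds

# Tail is blown and the queue builds up: bufferbloat is a credible cause.
print(bufferbloat_suspected(480.0, 50.0, [200, 900, 950], 800))  # True
# Tail is blown but the local queue stays shallow: the problem is elsewhere.
print(bufferbloat_suspected(480.0, 50.0, [40, 60, 55], 800))     # False
```

The second case is the important one operationally: it tells you local queue tuning (deeper buffers, AQM changes) will not fix the symptom.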
5) What does HQoS really solve, and when does it become harder to tune or easier to misconfigure?

HQoS solves fairness and isolation at aggregation scale by applying scheduling and shaping across a hierarchy (port → service/VLAN → subscriber → flow/class). It becomes harder when classification/mapping is ambiguous, the hierarchy is too deep, or shaping/policing points are placed inconsistently, causing priority inversion or starving classes. Misconfiguration is suspected when counters do not “close” (class usage vs subscriber totals) or when tail latency becomes unstable under mixed traffic despite correct bandwidth sums.

Proof points: per-class counters alignment, per-subscriber fairness checks, tail latency under contention, queue drop distributions.
6) After enabling ACLs, performance drops a lot — what are the three most common reasons?

The top three causes are: (1) rules push traffic onto a slower path due to table limits or rule compilation fallbacks; (2) overly broad matches create high hit-rate checks that add per-packet lookup cost; (3) logging/counters are configured too aggressively (e.g., logging on frequent hits), shifting load to the control plane. The fastest way to validate is to compare fast/slow-path counters and CPU load before and after enabling the ACL set.

Proof points: fast/slow-path counters, table utilization/compile status, CPU/mem deltas, pps drop vs latency rise.
7) Why do accounting/statistics often “not match” — lost counters, sampling, or export pipeline problems?

Mismatches usually come from three stages: (1) counter generation (reset on session events, wrap, or per-direction semantics); (2) aggregation windows (time alignment and batching make near-real-time views look inconsistent); (3) export pipeline (backlog, retries, or partial loss). A reliable approach is to verify closure: interface totals vs session totals vs exported totals, all aligned to the same time window and session churn markers. This turns “blame” into a measurable root cause.

Proof points: export backlog/watermarks, window timestamps, churn markers, closure checks across interface/session/export.
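The closure check can be sketched as follows (a hypothetical function; it assumes all three totals are already aligned to the same window and churn markers, which is the hard part in practice, and uses a relative tolerance for sampling effects):

```python
def closure_check(interface_total, session_totals, exported_total, tolerance=0.01):
    """Verify the accounting stages 'close' over one aligned window:
    interface counters vs summed per-session counters vs exported records.
    Returns the first stage that breaks closure, or None if everything matches."""
    session_sum = sum(session_totals)

    def off(a, b):
        return abs(a - b) > tolerance * max(a, b, 1)

    if off(interface_total, session_sum):
        return "counter_generation"   # wrap, reset-on-event, or direction semantics
    if off(session_sum, exported_total):
        return "export_pipeline"      # backlog, retries, or partial loss
    return None

print(closure_check(1_000_000, [400_000, 600_000], 1_000_000))  # None: closes
print(closure_check(1_000_000, [400_000, 600_000], 700_000))    # export_pipeline
```

Returning the failing stage rather than a boolean is the point: it turns "the numbers don't match" into a named root-cause bucket.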
8) The physical link stays up but sessions flap — how to localize across PHY, MAC, and queue layers?

Start at PHY counters (symbol/PCS/FEC-related errors and quality indicators): silent error growth can trigger higher-layer timeouts without a hard link-down. Then check MAC counters (drops, pause/congestion hints, error frames) to determine whether the issue is framing/congestion-related. Finally, check queue depth and drop reasons to detect microbursts or sustained congestion that inflate tail latency and cause control-plane timeouts. Time-correlate counter spikes with session churn to confirm causality.

Proof points: PHY error counters → MAC drops/errors → queue depth/drops → session churn timeline correlation.
9) When AAA is unreachable, should a BNG fail-open or fail-close, and how to choose in practice?

The choice is a risk trade-off: fail-open preserves service continuity but can increase unauthorized access risk; fail-close protects policy but can create widespread outages. A practical engineering policy is conditional: treat established sessions differently from new sessions during an outage window, use short-lived cached authorization where allowed, and always enforce retry suppression to prevent storms. The decision must be paired with observables (AAA timeouts, setup failures, churn) and a bounded recovery plan once AAA returns.

Proof points: AAA timeout rate, setup failure reasons, churn slope, cache hit/miss counters, retry rate-limit triggers.
10) HA switchover keeps traffic flowing, yet a redial storm happens — why, and how to suppress it?

“No forwarding gap” does not guarantee “no control-plane churn.” Redial storms often happen when timers and retry windows are not preserved across switchover, causing many sessions to falsely conclude failure and re-auth simultaneously. The storm can also be triggered by external dependencies (AAA/policy/address allocation) being temporarily stressed during switchover. Suppression requires controlled backoff and rate-limit gates, staged recovery (stabilize existing sessions before opening new setups), and acceptance drills that measure churn and tail latency.

Proof points: churn and retry counters during switchover, auth tail latency spikes, rate-limit/circuit-breaker events, recovery-to-baseline time.
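The two suppression mechanisms named above — controlled backoff and a staged setup gate — can be sketched like this (illustrative only; in practice they live in client/CPE state machines and BNG rate-limit logic, and the parameter names are assumptions):

```python
import random

def redial_delay(attempt, base=1.0, cap=60.0, jitter=0.5):
    """Exponential backoff with jitter: spreads simultaneous re-auth attempts
    so a switchover does not synchronize thousands of clients into one storm."""
    delay = min(cap, base * (2 ** attempt))
    return delay * (1 - jitter) + random.random() * delay * jitter

class SetupGate:
    """Staged recovery: admit only N new session setups per tick while
    existing sessions stabilize; the rest wait and retry (not dropped)."""
    def __init__(self, per_tick):
        self.per_tick = per_tick
        self.admitted = 0
    def new_tick(self):
        self.admitted = 0
    def admit(self):
        if self.admitted < self.per_tick:
            self.admitted += 1
            return True
        return False

gate = SetupGate(per_tick=100)
results = [gate.admit() for _ in range(250)]
print(sum(results))  # 100 admitted this tick; the rest back off via redial_delay
```

Jitter matters more than the exponent here: without it, all backoff timers expire together and the storm simply repeats on a schedule.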
11) How to design stress tests that cover microbursts, minimum packets, and mixed QoS queues?

Use a three-part test suite: (1) minimum-packet pps test (e.g., 64B emphasis) to expose true packet-processing limits; (2) microburst injection on top of steady background traffic to reveal queue overflow and tail behavior; (3) mixed-class contention with defined SLAs to validate HQoS correctness. Evidence must include per-class p95/p99 latency, queue depth peaks, and drop reasons—not only average throughput. Repeatability matters: keep the traffic model and evidence pack identical across runs.

Proof points: bps+pps, per-class p95/p99, queue depth peaks, drop reasons, runbook + exported evidence bundle.
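Part (2) of the suite, microburst injection over steady background, can be modeled as a simple offered-load schedule (illustrative parameters): the average stays modest while the peak is 10× the floor, which is exactly what average-utilization graphs miss.

```python
def offered_load(t, base_pps, burst_pps, burst_period, burst_len):
    """Per-millisecond offered load: steady background plus periodic microbursts
    that are short enough to overflow queues without moving the average much."""
    in_burst = (t % burst_period) < burst_len
    return base_pps + (burst_pps if in_burst else 0)

# 1 s at 1 ms resolution: 10 ms bursts every 100 ms on a 100 kpps floor.
trace = [offered_load(t, base_pps=100_000, burst_pps=900_000,
                      burst_period=100, burst_len=10) for t in range(1000)]
print(max(trace), min(trace))        # peak 1,000,000 pps vs floor 100,000 pps
print(sum(trace) / len(trace))       # average only 190,000 pps: bursts are invisible
```

Keeping this schedule (and the evidence pack) identical across runs is what makes the results repeatable, per the last point above.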
12) How to operationalize queue depth, drop reasons, and tail latency into actionable alerts and reports?

Operationalization needs three layers: metrics, alert logic, and localization hints. Track short-window queue depth peaks, categorized drop reasons, and per-class p95/p99 latency; then alert using thresholds plus rate-of-change (trend) to catch degradation before outages. Add correlation rules (AAA errors ↑ → setup fails ↑ → churn ↑, or queue depth peaks ↑ → tail latency ↑) to reduce false positives. Reporting should highlight tail metrics and drop-reason composition, since averages routinely hide customer-visible issues.

Proof points: threshold+trend alerts, correlation outcomes, weekly tail latency and drop-reason distribution reports, incident-to-mitigation closure loop.