BNG / BRAS: Subscriber Gateway Architecture, QoS, and Scaling
A BNG/BRAS is the subscriber gateway at the broadband edge: it terminates IPoE/PPPoE sessions and enforces per-subscriber policy/QoS while producing trustworthy accounting and telemetry. The real engineering challenge is keeping tail latency bounded, drops attributable, and session setup stable at scale—especially during bursts, AAA degradation, and HA events.
What BNG/BRAS Is: Subscriber Termination, Policy/QoS, Accounting
A BNG/BRAS (Broadband Network Gateway / Broadband Remote Access Server) is the access aggregation gateway where subscriber sessions are terminated, per-subscriber policies and QoS are enforced, and accounting records/counters are produced for OSS/BSS.
Boundary in one line: BNG/BRAS is the “subscriber edge” between access aggregation (DSL/FTTH/Metro Ethernet) and the IP edge/core, focusing on sessions, service policy, and HQoS—not generic security or NAT.
This page stays inside the BNG/BRAS scope. CGNAT, firewall/UTM/DPI, optical transport, and access-node internals are treated only as boundaries.
Typical responsibilities (what it must do well):
- Terminate subscriber access (PPPoE or IPoE/DHCP) and maintain a consistent subscriber/session state.
- Authenticate/authorize subscribers via AAA (commonly RADIUS) and apply service profiles (bandwidth, VLAN/VRF, ACL, etc.).
- Enforce per-subscriber and per-service QoS (HQoS): shaping, policing, scheduling, and queue management at scale.
- Generate accounting: per-subscriber/per-service counters and records (usage, session events, errors), suitable for billing and audits.
- Aggregate uplinks toward the IP edge/core (LAG/ECMP scenarios) while keeping session stability and predictable QoS behavior.
- Expose operational telemetry (session setup rate, drops by reason, queue depth/latency signals) to close the “field feedback” loop.
What it is NOT (scope boundaries):
- Not a CGNAT platform: NAT may exist elsewhere; BNG typically hands traffic to NAT/service edge when needed.
- Not a firewall/UTM/DPI box: it may apply basic ACLs/policy rules, but deep security inspection belongs to security gateways.
- Not an access node (OLT/DSLAM/ONT): it terminates subscriber sessions after access aggregation, not optical line termination.
- Not an optical transport element (DWDM/ROADM/OTN): those handle photonics/optical switching, not subscriber policy/QoS.
Why this boundary matters in engineering:
- BNG performance is dominated by “subscriber-scale fairness” (HQoS + deep queues), not only raw forwarding throughput.
- Session lifecycle health (setup rate, AAA latency, churn storms) often fails before line-rate datapath limits are reached.
- “Looks stable” can be misleading: averages may hide microburst drops and latency tail growth in deep queues.
Where BNG/BRAS Sits: Reference Topology and Two Critical Paths
BNG/BRAS typically sits after access aggregation and before the IP edge/core. To keep designs readable and debuggable, treat it as a device with two distinct paths: session establishment (control interactions) and user data forwarding (fast path).
Access domains that commonly feed into BNG/BRAS:
- DSL access aggregation (subscriber concentration before session termination).
- FTTH/FTTx aggregation (subscriber traffic converges before policy/QoS enforcement).
- Metro Ethernet (point-to-point or aggregated L2 delivery into the BNG).
Upstream handoff targets (only what matters for BNG behavior):
- IP backbone / edge: determines congestion patterns that drive queue depth and latency tails.
- MPLS/service edge: often provides service separation; BNG focuses on subscriber policy/QoS at ingress.
- Peering/transit: impacts burstiness and microburst exposure at uplinks.
Engineering principle: many “subscriber complaints” originate from uplink congestion + queue behavior, even when the session state is healthy. Separating setup path vs data path prevents misdiagnosis.
Interface terms (keep them practical, keep them scoped):
- VLAN / QinQ: used to map access domains and subscriber groups; mis-planning can break subscriber-to-policy binding.
- LAG: uplink scale/HA; uneven hashing can create “single-link hot spots” that inflate latency even when total capacity looks fine.
- ECMP: upstream multipath; asymmetry can complicate fault isolation and can amplify tail-latency when one path congests.
What to verify early (before deep tuning):
- Session path health: setup rate, AAA latency, retries/timeouts, churn storms (session failures are not forwarding failures).
- Forwarding path health: drops by reason (ingress vs queue vs egress), queue depth indicators, uplink counters.
- Correlation: access events (link flap/aggregation changes) vs uplink congestion vs policy pushes.
Subscriber Session Lifecycle: IPoE vs PPPoE, State Machine, and Failure Hotspots
BNG/BRAS reliability is often constrained by session lifecycle dynamics, not raw forwarding. Two access models dominate in practice: IPoE (DHCP-based) and PPPoE (session-based PPP). Both can “look fine” at average load but collapse under setup-rate spikes, AAA latency tails, or stale state.
Key diagnostic rule: treat session establishment as a multi-stage pipeline. When users report “drops,” first identify which stage failed (discover/authorize/maintain/cleanup), then confirm with stage counters + timeout distributions.
IPoE (DHCP) lifecycle (what matters at the BNG edge):
- DORA: Discover → Offer → Request → Ack (address assignment and binding).
- Relay/options: relay path and options influence how a subscriber maps to policy and services.
- Lease maintenance: Renew/Rebind behavior determines how “silent failures” surface.
PPPoE lifecycle (what matters at the BNG edge):
- Discovery: PADI → PADO → PADR → PADS (session creation).
- PPP negotiation: LCP then IPCP (link/IP parameters), followed by policy attachment.
- Keepalive: echo/keepalive timers decide how quickly a bad path is detected and how storms form.
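Both lifecycles are stage pipelines, and the diagnostic rule above (find the failing stage first) can be sketched as a tiny helper. This is a minimal illustration: the stage names, outcome labels, and dict shapes are assumptions for the sketch, not a vendor state machine.

```python
# Sketch: localize the first failing setup stage from per-stage outcomes.
# Stage names and outcome labels ('ok' / 'timeout' / 'reject') are illustrative.

IPOE_STAGES = ["discover", "offer", "request", "ack", "policy_attach"]
PPPOE_STAGES = ["padi", "pado", "padr", "pads", "lcp", "ipcp", "policy_attach"]

def first_failed_stage(stages, outcomes):
    """Walk the pipeline in order; return the first stage that is not 'ok'.

    outcomes: dict stage -> outcome label (a missing stage means it was
    never reached, which is itself the answer).
    """
    for stage in stages:
        if outcomes.get(stage) != "ok":
            return stage
    return None  # the whole pipeline succeeded
```

Applied to a reported "drop," this turns a vague complaint into a stage name (e.g. LCP timed out after PPPoE discovery succeeded), which is exactly what the stage counters in the table below should confirm.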
Most common “blow-up” patterns (symptom → likely cause → where to look):
| Symptom | Likely cause | What to watch (evidence) |
|---|---|---|
| Session flapping (users drop/reconnect) | Keepalive thresholds too aggressive; intermittent access issues; state cleanup lag; policy attachment delays. | session_up/down rate, keepalive fail counters, cleanup timer expirations, “duplicate session” rejects, stage-specific timeouts. |
| Reconnect storm (setup-rate spikes) | AAA latency tail; retry amplification; relay/option mismatch causing repeated failures; control-plane CPU saturation. | Setup attempts per second, AAA RTT histogram, retry counters, queue backlog for control tasks, top timeout stage (DHCP vs PPPoE discovery vs authZ). |
| Address pool exhaustion (IPoE fails to bind) | Pool depleted by churn; stale leases; mis-tuned lease times; cleanup lag after outages. | Pool free count trend, NAK/decline counters, lease renew failures, expired-but-not-released count, binding table occupancy. |
| Authorization timeout (policy not applied) | AAA/policy dependency slow or down; synchronous path blocks session completion; circuit not protected. | AuthZ timeout counters, pending-policy queue depth, fallback/deny decisions, rate-limit activations, dependency health alarms. |
| Stale session residue (“ghost” occupancy) | Session not cleaned on link flap; accounting stop not emitted; inconsistent state across HA pair. | Stale-state detector counters, long-lived sessions without traffic, cleanup retries, session-table watermark and “orphan” entries. |
Engineering observability checklist (stage-level, not vague):
- Stage counters: success/fail per stage (DHCP DORA stages; PPPoE discovery; LCP/IPCP; policy attach).
- Timeout distribution: not only averages—track p95/p99 setup latency and AAA RTT tails.
- Retry pressure: retry rate is the “storm multiplier” that converts a small slowdown into a widespread outage.
- State watermarks: session-table occupancy, stale cleanup backlog, duplicate-session rejects.
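To make "track p95/p99, not averages" concrete: a nearest-rank percentile over raw samples is enough for a stage-latency dashboard. This is a minimal sketch (real collectors usually keep histograms or sketches rather than raw samples); the sample values are invented for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) over raw latency samples."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Setup latencies in ms with one tail outlier: the mean (~37 ms) looks
# fine, but p99 exposes exactly the subscriber who complained.
setup_ms = [12, 11, 13, 12, 14, 250, 12, 11, 13, 12]
```

The same shape applies to AAA RTT: alert on the p99 series, and keep the mean only as context.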
Control Plane at Scale: AAA, Policy, Accounting, and Why CPU/Memory Fail First
In large deployments, session stability is commonly limited by control-plane capacity and dependency latency tails (AAA, policy stores, accounting sinks) long before datapath throughput is saturated. The practical goal is not “faster code,” but failure containment: prevent a slow dependency from triggering a global retry storm.
Scale insight: a small rise in AAA or policy latency can multiply into a large rise in setup rate (retries), which then consumes CPU/memory and further increases latency — a classic positive feedback loop.
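The feedback loop can be quantified with a small model: if each attempt times out with probability f and clients retry up to r more times, the expected attempts per session grow as a geometric series. A sketch under the simplifying assumption that attempts fail independently:

```python
def attempts_per_session(timeout_prob, max_retries):
    """Expected attempts per session: 1 + f + f^2 + ... + f^r.

    Assumes each attempt times out independently with probability f
    and the client retries at most max_retries times.
    """
    return sum(timeout_prob ** k for k in range(max_retries + 1))

# A latency tail that pushes timeout probability from 5% to 50% (with 3
# retries) raises offered setup load by ~1.8x -- and that extra load is
# precisely what pushes the dependency further into its tail.
```

The practical consequence: retry suppression (backoff, per-stage rate limits) attacks the multiplier directly, whereas adding raw control-plane CPU only moves the point where the loop closes.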
AAA at BNG scale (three phases, three failure modes):
- AuthN (authentication): identity verification; failure mode is timeout/retry amplification.
- AuthZ (authorization): policy/profile retrieval; failure mode is inconsistent service mapping and long “pending policy” queues.
- Acct (accounting): start/update/stop records; failure mode is backlog (I/O) and missing stop records that create stale state.
RADIUS is most common; Diameter may appear in some environments but is treated here only as a boundary label.
Policy delivery: two axes that drive control-plane load:
- Per-subscriber (base service): bandwidth profile, queue tree selection, domain/VRF binding, baseline ACL rules.
- Per-service (business intent): traffic class mapping, time-of-day rules, quota triggers, walled-garden transitions.
- Dynamic triggers (load multipliers): usage thresholds, time windows, alarms/maintenance events — each can cause bursts of updates.
Where CPU/memory/I/O are consumed first (practical bottlenecks):
| Bottleneck | Why it scales badly | Mitigation (engineering shape) |
|---|---|---|
| Session/state tables (memory pressure) | High churn increases concurrent “in-flight” sessions; stale state remains when stop records or cleanup are delayed. | Hard watermarks; stale-state detectors; fast cleanup paths; backoff on retries; limit setup concurrency. |
| Accounting/logging (I/O backlog) | Per-session events generate sustained writes; bursts during storms overflow queues and block critical paths. | Batch/aggregate updates; async emission; bounded queues with drop policies; replay-safe sinks; separate critical vs verbose logs. |
| AAA/policy dependencies (latency tails) | Tail latency (p99) drives retries; a “slow but alive” dependency is often worse than a hard fail. | Caching with explicit TTL/invalidation; circuit breakers; fail-open/fail-close policy per service tier; dependency health gating. |
| Control CPU / GC (scheduler collapse) | Serialization, timers, retry handling, and logging compete for CPU; storms create timer explosions and queue thrash. | Rate limiting; token buckets per stage; coarse timers; priority lanes (setup vs maintenance); load shedding under watermarks. |
Control-plane protection toolbox (must be explicit, testable):
- Layered caching: reduce dependency RTT sensitivity; define TTL and “safe fallback” behavior.
- Batching: accounting/policy updates should be aggregated; specify batch size and max delay.
- Async write + backpressure: keep the session pipeline non-blocking; define queue bounds and drop strategy.
- Rate limit + circuit breaker: isolate slow dependencies; define trigger thresholds and recovery timers.
The correct “fail-open vs fail-close” choice is service-dependent and should be documented as an explicit policy, not an accident.
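As one concrete shape for the "rate limit" item: a token bucket per setup stage bounds the attempt rate regardless of retry pressure. A minimal sketch with illustrative parameters (real implementations sit on the control-plane ingress path and feed a shed-or-queue decision):

```python
class TokenBucket:
    """Per-stage rate limiter: at most `rate` events/s with `burst` slack."""

    def __init__(self, rate, burst):
        self.rate = rate      # refill rate, tokens per second
        self.burst = burst    # bucket capacity (tolerated burst size)
        self.tokens = burst
        self.last = 0.0       # timestamp of the last refill

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False          # shed or queue this attempt

# Illustrative sizing: 100 setup attempts/s per line card, burst of 20.
setup_gate = TokenBucket(rate=100.0, burst=20.0)
```

The same trigger-threshold/recovery-timer discipline applies to the circuit breaker: both are only useful if their parameters are documented and exercised in drills, per the "explicit, testable" requirement above.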
Data Plane Pipeline: NPU/ASIC Fast Path, Lookups, and Counters at Line Rate
In a BNG/BRAS, the data plane must enforce per-subscriber policy and QoS while forwarding at high speed. This is why NPU/ASIC fast paths are used: at line rate, the system must parse, classify, look up subscriber/policy state, update counters, and enqueue packets without collapsing the CPU.
Performance reality: throughput is not only bps. The hard limit often shows up in pps (especially at minimum packet sizes) and under burst/microburst conditions. A design that “passes bps” can still fail in pps, counters, or queue mapping.
Typical fast-path pipeline (what happens to each packet):
- Parser: extract L2/L3/L4 headers and key fields.
- Classification: map packet to subscriber/service class (policy context).
- Lookup: subscriber/session state + policy/QoS pointers; basic ACL matching (boundary).
- Counters: update per-subscriber / per-class / per-interface usage and drops.
- Queuing: place packet into the correct queue level (HQoS tree).
- Shaping / Scheduling / Policing: enforce rates and fairness.
- Egress: transmit to uplink, update egress stats, apply final shaping rules.
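The per-packet pipeline above can be sketched as straight-line code to show why every stage must be cheap and why every drop needs a reason. Everything here is an illustrative assumption (the field names, the VLAN+IP lookup key, the drop-reason labels), not an NPU microcode model:

```python
from collections import defaultdict, deque

def fast_path(packet, sessions, counters, queues):
    """Parse/classify -> lookup -> count -> enqueue; return drop reason or None."""
    key = (packet["vlan"], packet["dst_ip"])      # classification key (simplified)
    session = sessions.get(key)                   # subscriber/session lookup
    if session is None:
        counters["drop_no_session"] += 1          # attributable drop, stage 'lookup'
        return "drop_no_session"
    q = queues[(session["sub_id"], session["svc_class"])]
    if len(q) >= session["queue_limit"]:
        counters["drop_queue_full"] += 1          # attributable drop, stage 'queue'
        return "drop_queue_full"
    q.append(packet)                              # HQoS enqueue; scheduler drains later
    counters[("sub_bytes", session["sub_id"])] += packet["len"]  # accepted usage
    return None

sessions = {(100, "203.0.113.7"): {"sub_id": "S1", "svc_class": "be", "queue_limit": 2}}
counters = defaultdict(int)
queues = defaultdict(deque)
```

Note that the two drop paths increment different counters: that per-stage attribution is what makes the "drops by reason" telemetry in later sections possible at all.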
Table boundaries (what must be hardware-friendly):
- Subscriber/session table: state + bindings + pointers to policy/QoS objects.
- ACL / TCAM (boundary use): rule matching for policy enforcement and service separation.
- Flow/service counters: usage counters and drop-reason counters that must not block forwarding.
Why NPU/ASIC is required (practical argument, not marketing):
- Line-rate counters + HQoS at subscriber scale require many state updates per packet.
- Minimum packets + bursts stress pps; per-packet CPU handling quickly becomes the bottleneck.
- Deterministic queuing decisions must happen before buffers overflow under microbursts.
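The pps argument is plain arithmetic: every Ethernet frame carries a fixed 20 B of wire overhead (8 B preamble + 12 B inter-frame gap), so minimum-size frames maximize the per-packet work rate. A quick calculation:

```python
def line_rate_pps(link_bps, frame_bytes):
    """Max frames/s at a given link rate, including Ethernet wire overhead."""
    wire_bits = (frame_bytes + 8 + 12) * 8   # preamble + inter-frame gap per frame
    return link_bps / wire_bits

# 100GbE: ~148.8 Mpps at 64B frames vs ~8.1 Mpps at 1518B frames --
# roughly an 18x difference in lookups, counter updates, and enqueues
# per second for the same bps figure.
```

This is why a "passes bps" result on large packets says little about fast-path headroom: the 64B pps figure is the one that bounds per-packet pipeline work.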
Measurement targets (use these to avoid false confidence):
| Metric | Why it matters | What to verify |
|---|---|---|
| bps (throughput) | Large packets can hide weak classification/lookup behavior. | Stable throughput across mixed traffic classes and subscriber counts. |
| pps (packet rate) | Minimum packets expose per-packet work limits. | No unexpected drops or counter collapse at high pps. |
| Minimum packet | Worst-case pps; reveals parser and lookup capacity. | Forwarding + counters + queue mapping remain stable. |
| Burst / microburst | Short bursts overflow queues before averages rise. | Drop reasons are explainable; latency tail stays bounded for priority traffic. |
| Drop reasons | “Drops” without reasons cannot be fixed. | Per-stage and per-queue drop counters are consistent with symptoms. |
Fast-path observability (keep it stage-based):
- Lookup health: table occupancy, miss/fallback counts, update rate watermarks.
- Queue health: per-queue depth, overflow counters, tail latency indicators by class.
- Counter integrity: update backlog (if any), aggregation windows, counter reset events.
Deep Queues & HQoS: Fairness Under Bursts, and How to Prove QoS
BNG/BRAS platforms often aggregate large numbers of subscribers, which makes bursts inevitable. Deep buffers can reduce immediate loss during burst aggregation and upstream congestion, but they can also create latency tail inflation (bufferbloat). The engineering goal is to deliver per-subscriber fairness and service-class guarantees without letting delay tails explode.
Deep-queue trade-off: deeper buffers reduce short-term drops, but can increase latency and jitter. HQoS structures and queue limits must be designed with service intent in mind.
Why deep queues exist in BNG (three practical drivers):
- Aggregation bursts: many subscribers burst simultaneously, overwhelming uplink drain rate for short windows.
- Upstream congestion: the “next hop” can congest; buffering absorbs transient mismatch.
- Per-subscriber fairness: queues enable controlled scheduling rather than uncontrolled tail drops.
HQoS hierarchy (a typical shaping/scheduling structure):
- Port level: total uplink capacity; global scheduling boundary.
- Service/VLAN group: separation by domain/service bundle.
- Subscriber level: fairness and rate caps per subscriber.
- Service class / flow class: voice/video/best-effort class behavior within each subscriber.
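Under full contention, each hierarchy level divides its parent's capacity by weight, and composing that split level by level gives each leaf's guaranteed floor. A minimal sketch of that composition (weights and level names are illustrative; borrowing/burst behavior is deliberately ignored):

```python
def weighted_split(capacity_bps, weights):
    """Divide one level's capacity among its children in proportion to weight."""
    total = sum(weights.values())
    return {child: capacity_bps * w / total for child, w in weights.items()}

# Port -> service group -> subscriber -> class, all fully contended:
port = 10e9
groups = weighted_split(port, {"residential": 3, "business": 1})
subs = weighted_split(groups["business"], {"sub_a": 1, "sub_b": 1})
classes = weighted_split(subs["sub_a"], {"voice": 2, "video": 2, "be": 1})
```

The composed floors are what a QoS proof test should verify under load; if measured per-class throughput diverges from these numbers while all levels are saturated, the hierarchy or the classification mapping is wrong.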
Where shaping, policing, and scheduling belong (avoid the common mistake):
- Shaping is often applied where fairness is required (subscriber and/or service-group levels).
- Scheduling decides which queue drains first (service classes and priority).
- Policing enforces hard limits and can drop aggressively; it should be used deliberately for specific tiers.
Bufferbloat vs loss (engineering judgment, not ideology):
- If real-time traffic is critical, prioritize bounded delay via strict queue limits and priority handling.
- If throughput-oriented traffic dominates, allow moderate buffering but monitor tail latency.
- AQM (RED/WRED/CoDel) is considered when tail latency grows under mixed traffic and frequent bursts; it is not mandatory in every deployment.
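The deep-queue trade-off is easy to quantify: a standing queue converts directly into added delay at the queue's drain rate, which is why "deep" must be judged against each class's delay budget rather than in bytes. A minimal sketch:

```python
def standing_queue_delay_ms(queue_bytes, drain_bps):
    """Delay added by a standing queue of queue_bytes draining at drain_bps."""
    return queue_bytes * 8 / drain_bps * 1000

# The same 1.25 MB buffer is ~1 ms at a 10 Gb/s uplink but 100 ms on a
# 100 Mb/s subscriber shaper -- harmless for bulk, fatal for voice.
```

This is also why queue limits must be set per level of the HQoS tree: a buffer depth that is sane at the port level can be bufferbloat at the subscriber level.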
QoS proof (make it measurable and auditable):
| Service class | What to prove | How to test (outline) |
|---|---|---|
| Real-time (voice/interactive) | Bounded delay tail, bounded jitter, low loss under bursts. | Mixed-load run with burst injection; track p95/p99 latency by class and drop reasons. |
| Streaming video | Stable throughput with controlled loss; avoid repeated stall patterns. | Concurrent video-like flows + background BE; verify class bandwidth and overflow counters. |
| Best-effort bulk | Fairness across subscribers; predictable sharing under contention. | Many-subscriber test; compare per-subscriber throughput distribution and queue occupancy. |
Crypto & ACL Boundary: What a BNG Enforces (and What It Does Not)
Security-related functions in a BNG/BRAS are best described as policy enforcement and isolation at subscriber scale—without turning the platform into a deep security inspection engine. This chapter defines practical boundaries for ACLs, crypto, subscriber isolation, and accounting/audit counters so performance and operability remain predictable.
Boundary rule: enforce access, isolation, and measurable accounting at line rate. Avoid designs that rely on per-packet logging or heavy inspection on the forwarding path.
ACLs: placement, scale, hit cost, and logging strategy
- Insertion points: ACLs can be applied at ingress (early drop, save resources) and/or egress (service edge boundary). Placement should match the policy goal and the drop visibility needed.
- Rule scale: per-subscriber rule explosion is avoided by using shared templates (service tiers / domains) with subscriber bindings, rather than unique rule sets for every subscriber.
- Hit cost: long rule chains and frequent updates increase match pressure; keep high-hit rules simple and avoid unnecessary rule depth.
- Logging: prefer counters + sampling + trigger logs (threshold-based). Avoid per-packet logs that can flood CPU and storage during attacks or churn events.
Crypto: what is typically inside BNG scope vs out of scope
- In-scope: management-plane protection and limited tunnel/edge security functions when required by deployment—kept operationally stable under load.
- Out-of-scope (boundary): large-scale, compute-heavy security processing is usually handled by dedicated security appliances or service edges rather than consuming BNG forwarding resources.
Subscriber isolation: separation that prevents cross-subscriber impact
- Session separation: PPPoE or IPoE bindings establish the subscriber boundary used by policy and accounting.
- VRF/domain separation: routing/forwarding contexts keep domains isolated (no cross-domain leakage by default).
- Anti-spoofing: source-validation techniques can be used to reduce address spoofing; uRPF is relevant as a boundary mechanism (mention only).
Accounting & audit counters: a “trusted chain” without overload
- Counter layers: subscriber counters + port/queue counters create an internal cross-check that supports troubleshooting and billing confidence.
- Stability under stress: counter export and audit signals should not backpressure the forwarding path; use aggregation windows and rate controls.
Uplink & Ethernet PHY Considerations: What Breaks at 100G/400G and How to Localize Drops
High-speed uplinks (100G/400G and beyond) can fail in ways that look like “congestion” even when the root cause is physical-layer instability. The practical goal is to localize the problem layer before changing policies: start with PHY/FEC counters, then check MAC drops, then confirm queue drops.
Troubleshooting principle: always localize the layer first. Changing HQoS or ACL rules without confirming PHY/MAC health often hides the real root cause.
Uplink forms (impact only):
- Multi-port uplinks: capacity and redundancy, but traffic distribution and hot spots must be observable.
- LAG: resilience and aggregation; member link instability can translate into churn-like symptoms.
- ECMP: path spreading; hash skew can produce single-path congestion even when aggregate capacity looks fine.
PHY/SerDes effects that matter to BNG operations:
- Bit errors & FEC pressure: increased correction activity can impact effective throughput and tail latency.
- Link flap: short up/down events cause visible service instability (timeouts, churn, accounting noise).
- PCS/MAC symptoms: physical issues can propagate upward and appear as MAC drops or queue overflows.
How to read FEC/PCS statistics (engineering-only):
- Trends beat snapshots: rising counters that correlate with temperature, optics changes, or time-of-day patterns are more actionable.
- Compare both ends: align local counters with peer counters to distinguish local faults from upstream issues.
- Correlate with symptoms: check whether counter changes align with MAC drops, queue drops, and tail latency shifts.
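"Trends beat snapshots" can be implemented as a simple rate over a window of (timestamp, counter) samples; alerting on the rate rather than the absolute count is what catches a degrading link early. A sketch assuming a monotonic counter with no wrap handling:

```python
def counter_rate_per_s(samples):
    """samples: [(t_seconds, counter_value), ...] oldest first, monotonic counter."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two distinct timestamps")
    return (v1 - v0) / (t1 - t0)

# Same FEC corrected-codeword counter value, very different stories:
steady = counter_rate_per_s([(0, 1_000_000), (600, 1_000_600)])   # ~1/s: benign
rising = counter_rate_per_s([(0, 1_000_000), (600, 4_000_000)])   # 5000/s: inspect optics
```

Computing the same rate on both ends of the link (per "compare both ends" above) is what separates a local optics problem from an upstream one.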
Localization order (recommended):
| Layer to check | What indicates trouble | Next action |
|---|---|---|
| PHY / FEC | BER/FEC counters rising, link flap, unstable error trends. | Stabilize the link first; confirm peer-side behavior. |
| MAC | MAC drops/errors increase while PHY remains stable. | Confirm MAC-level congestion/drops before altering policy. |
| Queue / HQoS | Queue depth/overflow and tail latency rise with stable PHY/MAC. | Tune HQoS boundaries and queue limits with class intent. |
High Availability & Upgrade: HA/ISSU Where “No Session Drop” Is the Metric
In a BNG/BRAS, high availability is not a checkbox—it is measured by subscriber experience during failures and maintenance. The primary target is session continuity (or a strictly bounded disconnect rate), while secondary targets include switchover time, reconnect storm suppression, and accounting continuity.
Acceptance-first principle: define the pass/fail metrics before building HA or running ISSU. A “successful upgrade” is one where key service KPIs do not spike.
HA modes (impact only):
- Active/Standby: one node forwards; the standby maintains enough state to take over with bounded churn.
- Clustered designs: distribute sessions across nodes to reduce blast radius; failure should be localized to a subset of sessions.
Stateful switchover: what must be synchronized vs what can be rebuilt
- Must sync: subscriber binding/session identity, policy “final state” pointers, and critical timers that prevent false retries.
- Nice to sync: accounting aggregation windows and key counters used for cross-checking and troubleshooting.
- Rebuild (common): deep queue internals and short-lived instantaneous measurements; design for controlled rebuild rather than full migration.
ISSU / hot upgrade boundaries:
- Control vs data plane: the data plane should remain stable while control software changes, otherwise the event behaves like an outage.
- Version compatibility: the upgrade window requires compatibility for key state fields (at minimum, the “must sync” set).
- Rollback policy: define rollback triggers based on KPI deltas (setup failure rate, auth latency spikes, abnormal drop reasons).
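"Rollback triggers based on KPI deltas" can be stated as a table of per-KPI tolerances compared against the pre-upgrade baseline. The KPI names and tolerance values below are illustrative assumptions; the point is that the decision is data-driven and reproducible:

```python
def breached_kpis(baseline, current, tolerance):
    """Return the KPIs whose post-upgrade delta exceeds its allowed tolerance."""
    return sorted(
        k for k, allowed in tolerance.items()
        if current.get(k, baseline[k]) - baseline[k] > allowed
    )

def should_rollback(baseline, current, tolerance):
    """Any breached KPI triggers rollback."""
    return bool(breached_kpis(baseline, current, tolerance))

# Illustrative pre-upgrade baseline and allowed post-upgrade deltas:
baseline  = {"setup_fail_pct": 0.5, "auth_p99_ms": 80.0, "drops_unattributed": 0}
tolerance = {"setup_fail_pct": 1.0, "auth_p99_ms": 40.0, "drops_unattributed": 0}
```

Writing the tolerances down before the maintenance window is the acceptance-first principle in practice: "rollback or not" stops being a judgment call made under pressure.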
Failure drills: scenarios that reveal real HA readiness
- Link loss: uplink member loss, LAG events, or upstream cut.
- Upstream congestion: microburst + sustained congestion patterns.
- AAA unreachable: timeouts and partial reachability.
- Address pool anomalies: depletion, slow allocation, or inconsistent state during churn.
Drill acceptance table (example template):
| Drill | Targets | Key observables | Pass / Fail criteria |
|---|---|---|---|
| Switchover event | Bounded switchover time, bounded disconnect rate | session count, setup rate, teardown rate, churn indicators | Pass if KPIs spike briefly then return to baseline; fail if sustained churn or mass disconnect |
| AAA unreachable | Controlled setup degradation, no storm | auth latency, AAA errors/timeouts, setup success rate | Pass if retries are rate-limited and success recovers; fail if retry storms amplify outage |
| Congestion injection | Per-class QoS preserved, drops localized | queue depth, drop reason, per-class tail latency | Pass if priority classes remain bounded; fail if drops spread or tail latency explodes |
| Pool anomaly | Detect fast, recover cleanly | DHCP/alloc failures, setup rate, session failures by reason | Pass if failures are attributed and recovered; fail if sessions flap or accounting becomes inconsistent |
Telemetry & Field Evidence: Proving Stability with Metrics (Not Averages)
Telemetry should form a closed loop: collect the minimum metrics that explain user experience, trigger alerts based on thresholds and trends, localize the failing layer, apply controlled mitigations, then update baselines and drills. Stability is proven by tail behavior (p95/p99), drop reasons, and rate changes—not by averages.
Evidence rule: prefer counters + tail percentiles and keep logs minimal with sampling and rate limits. Unlimited logs can create a second outage.
Minimum metric set (actionable buckets):
- Session health: session count, setup rate, teardown rate, churn indicators.
- AAA/control plane: auth latency, AAA errors/timeouts, policy apply failures (as counters).
- Forwarding truth: queue depth, drop reason, per-class tail latency (p95/p99).
- System health: CPU/memory, control-plane protection triggers (rate-limit/circuit-breaker events).
Common traps (“looks stable” but is not):
- Averages hide tail: mean latency remains low while p99 spikes cause real user complaints.
- Microbursts: utilization looks fine but short bursts overflow queues and create instant drops.
- Accounting delay: aggregation windows and export backlog can distort near-real-time billing views.
Alert strategy (simple but robust):
- Thresholds: catch immediate failures (timeouts, hard drops).
- Trends (rate-of-change): detect degradation before full failure (latency slope, churn slope).
- Correlation: connect symptoms to likely causes (AAA errors ↑ → setup fails ↑ → churn ↑).
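Combining the threshold and trend rules, a per-metric evaluator fires on either a hard limit or a sustained slope. A minimal sketch; the limits and the auth-latency example values are illustrative:

```python
def evaluate(metric_history, hard_limit, slope_limit_per_s):
    """metric_history: [(t_seconds, value), ...] oldest first.

    Returns 'threshold' (immediate failure), 'trend' (degrading), or None.
    """
    t_last, v_last = metric_history[-1]
    if v_last > hard_limit:
        return "threshold"                 # catch the failure that already happened
    t0, v0 = metric_history[0]
    slope = (v_last - v0) / (t_last - t0)
    if slope > slope_limit_per_s:
        return "trend"                     # catch the failure that is coming
    return None

# Auth latency p99 (ms): still under a 500 ms hard limit, but climbing
# ~2 ms/s over the window -- the trend rule fires first.
history = [(0, 80.0), (60, 200.0)]
```

The correlation step then links fired evaluators across metrics (AAA errors with setup failures with churn), which is what turns alerts into a likely cause rather than a page.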
Minimal logging set (avoid self-inflicted outages):
- Always log: major state transitions (switchover, rollback), AAA unreachable, pool anomalies.
- Sample log: only during anomaly windows; increase sampling temporarily and revert automatically.
- Never do: unlimited per-packet/per-flow logs on the fast path.
Validation & Acceptance Checklist: What “Done” Looks Like for a BNG/BRAS
This chapter converts architecture goals into a repeatable acceptance workflow. “Pass” is defined by measurable evidence: throughput (bps/pps), tail latency (p95/p99), drop reasons, session scale, and controlled degradation during failures and maintenance.
Evidence pack (minimum): counters + p95/p99 percentiles + drop reason attribution + minimal event logs (rate-limited, sampled). Avoid unlimited per-packet logging on the fast path.
Acceptance scope (must cover all):
- Performance: bps/pps, minimum-size packets, and per-class tail latency under multi-queue load.
- Scale: max sessions, setup rate, churn/reconnect storms, and AAA degradation behavior.
- QoS: SLA validation per class, bufferbloat checks, and validation that AQM (where deployed) behaves as intended.
- Stability: 72-hour soak with staged load and controlled churn; no KPI drift or runaway backlogs.
- Fault drills: AAA down, link flap, upstream congestion; recovery must be bounded and explainable.
Copyable acceptance checklist (example table):
| Category | Test case | Traffic model | Observables (evidence) | Pass criteria (bounded) |
|---|---|---|---|---|
| Functional | IPoE / PPPoE session bring-up with AAA | Low-rate, mixed packet sizes | setup success rate, auth latency, failure reason counters | No persistent setup failures; retries are controlled; failures are attributable |
| Performance | Line-rate forwarding + minimum-size packets | 64B focus, mixed bps/pps | pps/bps, queue depth, drop reasons, p99 latency | Throughput meets target; tail latency bounded; drops localized and explainable |
| QoS | Per-class SLA validation under contention | Multi-queue load, bursts | per-class p95/p99, per-queue drops, queue depth time series | Priority classes remain bounded; bufferbloat is detected/mitigated as designed |
| Scale | Max sessions + sustained setup rate | Ramp sessions; steady hold | session count, setup rate, auth latency, CPU/mem, table watermarks | No uncontrolled churn; setup rate does not collapse; control-plane remains stable |
| Storm | Reconnect storm injection | Forced flap / timed churn | retry counters, setup fail reasons, rate-limit triggers, backlog | Storm is suppressed; system returns to baseline without cascading failures |
| Stability | 72h soak with staged load phases | Low→mid→high, controlled bursts | KPI drift, backlog growth, tail latency trends, log volume | No drift/runaway backlogs; logs remain bounded; no hidden tail regressions |
| Fault drill | AAA down / partial reachability | Normal load + AAA outage window | AAA timeouts, setup success, churn, fallback counters | Graceful degradation; recovery is bounded; no retry storm amplification |
| Fault drill | Link flap / uplink member loss | Normal load + flap pattern | link events, queue drops, session churn, switchover timers | Bounded service impact; stable convergence; no mass session collapse |
| Fault drill | Upstream congestion / microburst stress | Short bursts + sustained congestion | queue depth peaks, drop reasons, per-class tail latency | Drops are attributed; priority services remain bounded; mitigation behaves as designed |
Concrete test-stack BOM (example ordering part numbers):
| Role | Part number | What it is | Why it is used in acceptance |
|---|---|---|---|
| Traffic generator / scale rig | SPT-N4U-110 / SPT-N4U-220 | Spirent TestCenter N4U compact chassis (110V/220V) | Repeatable bps/pps, minimum-packet stress, multi-queue contention, controlled churn for setup-rate and storm tests |
| Chassis accessory | ACC-2017A / ACC-2018A | Single-slot / dual-slot card carrier for N4U/N12U | Build a consistent hardware stack for identical test runs and evidence capture across labs |
| 25GbE NIC (quad-port) | E810-XXVDA4 (aka E810XXVDA4) | Intel Ethernet Network Adapter (4×25/10/1GbE) | Cost-effective high-port-density host for traffic sink/source, counter validation, and mixed packet-size performance tests |
| 100GbE NIC (dual-port) | MCX623106AC-CDAT | NVIDIA ConnectX-6 Dx 2×100GbE QSFP56 (Crypto + Secure Boot options) | High-rate test host for line-rate forwarding, microburst stress, and queue/drop-reason attribution at 100GbE |
| SyncE/PTP-capable NIC | MCX623106GC-CDAT | NVIDIA ConnectX-6 Dx with Enhanced-SyncE & PTP GM support + GNSS/PPS | When acceptance requires time alignment evidence (timestamped loss/latency correlation) without moving into clock-servo design |
| Test fabric reference | BCM56880 (Trident4 series) | Broadcom switch ASIC family reference for high-density fabrics | Keep the bottleneck out of the test fabric; maintain predictable L2/L3 switching capacity during BNG stress runs |
Note: part numbers above are intended as concrete examples for a replicable lab stack. Equivalent vendor platforms can be substituted, as long as the traffic model, observables, and evidence pack remain identical.
FAQs: BNG / BRAS Engineering Questions
Each answer stays within this page’s scope: subscriber termination, AAA/policy/QoS, accounting, fast path, uplinks, HA, and telemetry.
1) Do BNG and BRAS still differ in modern products, and how to state the boundary cleanly?
In modern deployments, “BNG” and “BRAS” often refer to the same functional block: subscriber termination plus policy/QoS enforcement and accounting at the broadband edge. The least controversial boundary is a function-based definition: terminate IPoE/PPPoE sessions, apply per-subscriber policy and QoS, and export accounting records. Avoid historical role debates and describe what the box must do on the wire.
2) PPPoE vs IPoE (DHCP): what are the biggest differences in scale, failure modes, and operations?
PPPoE introduces a session protocol handshake and keepalive behavior, while IPoE relies on DHCP allocation/lease lifecycle and relay behavior. At scale, PPPoE pain often shows up as handshake timeouts, session churn, and retry storms; IPoE pain often shows up as address-pool depletion, lease thrash, and relay/option misbehavior. Operations should track setup success rate, tail auth latency, retry counters, and allocation failures separately for the chosen access mode.
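The per-mode observables named above can be tracked in a small structure. A minimal sketch in Python; `AccessModeStats`, `record`, and the simulated burst are illustrative, not a vendor telemetry schema:

```python
from dataclasses import dataclass

# Hypothetical per-access-mode counter bucket; field names are
# illustrative, not tied to any product's telemetry model.
@dataclass
class AccessModeStats:
    attempts: int = 0
    successes: int = 0
    retries: int = 0
    alloc_failures: int = 0  # IPoE: pool depletion; PPPoE: session-table full

    @property
    def setup_success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 1.0

stats = {"pppoe": AccessModeStats(), "ipoe": AccessModeStats()}

def record(mode: str, ok: bool, retried: bool = False, alloc_fail: bool = False):
    s = stats[mode]
    s.attempts += 1
    s.successes += ok
    s.retries += retried
    s.alloc_failures += alloc_fail

# Example burst: IPoE hits address-pool depletion while PPPoE stays healthy.
for _ in range(100):
    record("pppoe", ok=True)
for i in range(100):
    record("ipoe", ok=(i < 80), alloc_fail=(i >= 80))

print(f'PPPoE success rate: {stats["pppoe"].setup_success_rate:.2f}')
print(f'IPoE success rate:  {stats["ipoe"].setup_success_rate:.2f}')
```

Keeping the two modes in separate buckets is the point: a single blended success rate hides which failure family (handshake vs. allocation) is actually degrading.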
3) Why do auth timeouts and disconnects appear when setup rate spikes, even if session count is not huge?
Setup rate stresses the control plane first: session table allocations, timer wheels, policy transactions, and AAA request concurrency. Even with moderate steady-state sessions, bursts can push AAA round-trip time into the tail, causing timeouts and retries that amplify load into a storm. Common triggers include CPU saturation, memory/GC pressure, synchronous logging/accounting paths, and AAA backends entering partial degradation. The fix path starts with tail latency and retry suppression, not with headline throughput.
4) “Throughput is fine but user experience is bad” — is it usually deep queues/bufferbloat, and how to disprove it?
Deep queues can hide drops and keep throughput high while inflating tail latency (bufferbloat), but it must be proven with evidence. Disprove by comparing per-class p95/p99 latency and queue depth time series under the same offered load: if tail latency stays bounded and queue depth does not build up, bufferbloat is not the primary cause. Also check drop reasons: if drops occur upstream or at a different stage, local queue tuning will not fix the symptom.
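The per-class tail comparison can be reproduced offline. A minimal sketch using a nearest-rank percentile over hypothetical latency samples (all names and numbers illustrative), showing how a healthy mean can coexist with a bad p99:

```python
import statistics

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over a sample window."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Hypothetical latency samples (ms) per class under the same offered load:
# voice is stable; best-effort sees periodic 40 ms queue-buildup spikes.
voice = [1.0 + 0.1 * (i % 5) for i in range(200)]
best_effort = [2.0 + (40.0 if i % 50 == 49 else 0.5) for i in range(200)]

for name, samples in (("voice", voice), ("best-effort", best_effort)):
    print(f"{name}: p95={percentile(samples, 95):.1f}ms "
          f"p99={percentile(samples, 99):.1f}ms "
          f"mean={statistics.mean(samples):.1f}ms")
```

In this synthetic data the best-effort mean stays near 3 ms while p99 sits at 42 ms, which is the shape bufferbloat produces; if the real p95/p99 stays flat under load, look upstream instead.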
5) What does HQoS really solve, and when does it become harder to tune or easier to misconfigure?
HQoS solves fairness and isolation at aggregation scale by applying scheduling and shaping across a hierarchy (port → service/VLAN → subscriber → flow/class). It becomes harder when classification/mapping is ambiguous, the hierarchy is too deep, or shaping/policing points are placed inconsistently, causing priority inversion or starving classes. Misconfiguration is suspected when counters do not “close” (class usage vs subscriber totals) or when tail latency becomes unstable under mixed traffic despite correct bandwidth sums.
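The counter-closure suspicion can be automated. A minimal sketch, assuming hypothetical per-class byte counters and a tolerance of roughly one MTU of in-flight bytes:

```python
# Per-subscriber closure check: class byte counts should sum to the
# subscriber total within a small tolerance (in-flight/rounding slack).
def closes(class_bytes: dict[str, int], subscriber_total: int,
           tolerance: int = 1500) -> bool:
    return abs(sum(class_bytes.values()) - subscriber_total) <= tolerance

sub_a = {"voice": 10_000, "video": 250_000, "best_effort": 740_000}
print(closes(sub_a, 1_000_000))   # sums exactly
print(closes(sub_a, 1_200_000))   # 200 KB unaccounted: a mapping/ambiguity hint
```

A persistent closure gap points at ambiguous classification (traffic counted in no class, or double-counted), which is cheaper to find this way than by staring at per-queue graphs.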
6) After enabling ACLs, performance drops a lot — what are the three most common reasons?
The top three causes are: (1) rules push traffic onto a slower path due to table limits or rule compilation fallbacks; (2) overly broad matches create high hit-rate checks that add per-packet lookup cost; (3) logging/counters are configured too aggressively (e.g., logging on frequent hits), shifting load to the control plane. The fastest way to validate is to compare fast/slow-path counters and CPU load before and after enabling the ACL set.
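The fast/slow-path comparison can be reduced to a single ratio taken before and after the change. A minimal sketch with hypothetical counter snapshots (all numbers illustrative):

```python
# A jump in slow-path share after enabling the ACL set, together with
# rising CPU load, points at a compilation fallback (cause 1) rather
# than raw per-packet lookup cost (cause 2).
def slow_path_share(fast_pkts: int, slow_pkts: int) -> float:
    total = fast_pkts + slow_pkts
    return slow_pkts / total if total else 0.0

before = slow_path_share(fast_pkts=9_900_000, slow_pkts=100_000)
after = slow_path_share(fast_pkts=6_000_000, slow_pkts=4_000_000)
print(f"slow-path share: before={before:.1%} after={after:.1%}")
```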
7) Why do accounting/statistics often “not match” — lost counters, sampling, or export pipeline problems?
Mismatches usually come from three stages: (1) counter generation (reset on session events, wrap, or per-direction semantics); (2) aggregation windows (time alignment and batching make near-real-time views look inconsistent); (3) export pipeline (backlog, retries, or partial loss). A reliable approach is to verify closure: interface totals vs session totals vs exported totals, all aligned to the same time window and session churn markers. This turns “blame” into a measurable root cause.
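The closure verification can be expressed as pairwise checks across the three stages. A minimal sketch with illustrative byte totals and a hypothetical percentage tolerance:

```python
# Closure across generation -> aggregation -> export, all aligned to the
# same time window. Tolerance covers in-flight traffic and batching skew.
def closure_report(interface_total: int, session_total: int,
                   exported_total: int, tol_pct: float = 0.5) -> dict[str, bool]:
    def within(a: int, b: int) -> bool:
        return abs(a - b) <= max(a, b) * tol_pct / 100
    return {
        "generation_ok": within(interface_total, session_total),
        "export_ok": within(session_total, exported_total),
    }

report = closure_report(interface_total=10_000_000,
                        session_total=9_990_000,
                        exported_total=9_200_000)
print(report)  # generation closes; the export pipeline is losing data
```

Which check fails tells you which stage to blame, turning "the numbers don't match" into a localized, measurable defect.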
8) The physical link stays up but sessions flap — how to localize across PHY, MAC, and queue layers?
Start at PHY counters (symbol/PCS/FEC-related errors and quality indicators): silent error growth can trigger higher-layer timeouts without a hard link-down. Then check MAC counters (drops, pause/congestion hints, error frames) to determine whether the issue is framing/congestion-related. Finally, check queue depth and drop reasons to detect microbursts or sustained congestion that inflate tail latency and cause control-plane timeouts. Time-correlate counter spikes with session churn to confirm causality.
9) When AAA is unreachable, should a BNG fail-open or fail-close, and how to choose in practice?
The choice is a risk trade-off: fail-open preserves service continuity but can increase unauthorized access risk; fail-close protects policy but can create widespread outages. A practical engineering policy is conditional: treat established sessions differently from new sessions during an outage window, use short-lived cached authorization where allowed, and always enforce retry suppression to prevent storms. The decision must be paired with observables (AAA timeouts, setup failures, churn) and a bounded recovery plan once AAA returns.
10) HA switchover keeps traffic flowing, yet a redial storm happens — why, and how to suppress it?
“No forwarding gap” does not guarantee “no control-plane churn.” Redial storms often happen when timers and retry windows are not preserved across switchover, causing many sessions to falsely conclude failure and re-auth simultaneously. The storm can also be triggered by external dependencies (AAA/policy/address allocation) being temporarily stressed during switchover. Suppression requires controlled backoff and rate-limit gates, staged recovery (stabilize existing sessions before opening new setups), and acceptance drills that measure churn and tail latency.
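The rate-limit gate for staged recovery can be sketched as a token bucket applied only to new setups, leaving established sessions untouched; `SetupGate` and its parameters are illustrative:

```python
# Token-bucket admission gate for new session setups after switchover:
# a synchronized redial wave is admitted at a bounded, configurable rate.
class SetupGate:
    def __init__(self, rate_per_s: float, burst: float):
        self.rate, self.burst = rate_per_s, burst
        self.tokens, self.last_t = burst, 0.0

    def admit(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last_t) * self.rate)
        self.last_t = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # rejected setups back off and retry later

gate = SetupGate(rate_per_s=100.0, burst=50.0)
# 1000 redials arriving in the same instant: only the burst is admitted.
admitted = sum(gate.admit(now=0.0) for _ in range(1000))
print(admitted)  # 50
```

Combined with client-side backoff, this keeps AAA and address allocation inside their capacity envelope while the standby stabilizes.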
11) How to design stress tests that cover microbursts, minimum packets, and mixed QoS queues?
Use a three-part test suite: (1) minimum-packet pps test (e.g., 64B emphasis) to expose true packet-processing limits; (2) microburst injection on top of steady background traffic to reveal queue overflow and tail behavior; (3) mixed-class contention with defined SLAs to validate HQoS correctness. Evidence must include per-class p95/p99 latency, queue depth peaks, and drop reasons—not only average throughput. Repeatability matters: keep the traffic model and evidence pack identical across runs.
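The microburst part of the suite can be specified as an explicit offered-rate schedule for the traffic generator, which also makes runs repeatable; a minimal sketch where every rate and period is illustrative:

```python
# Per-millisecond offered rates: steady background load plus periodic
# short bursts, the pattern that exposes queue overflow and tail growth.
def offered_rates(duration_ms: int, background_pps: int,
                  burst_pps: int, burst_every_ms: int,
                  burst_len_ms: int) -> list[int]:
    rates = []
    for t in range(duration_ms):
        in_burst = (t % burst_every_ms) < burst_len_ms
        rates.append(background_pps + (burst_pps if in_burst else 0))
    return rates

# 1 s run: 100 kpps background, +900 kpps bursts of 2 ms every 100 ms.
sched = offered_rates(1000, background_pps=100_000, burst_pps=900_000,
                      burst_every_ms=100, burst_len_ms=2)
print(max(sched), min(sched), sum(1 for r in sched if r > 100_000))
```

Because the schedule is data, the identical traffic model can be replayed across runs and labs, which is the repeatability requirement stated above.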
12) How to operationalize queue depth, drop reasons, and tail latency into actionable alerts and reports?
Operationalization needs three layers: metrics, alert logic, and localization hints. Track short-window queue depth peaks, categorized drop reasons, and per-class p95/p99 latency; then alert using thresholds plus rate-of-change (trend) to catch degradation before outages. Add correlation rules (AAA errors ↑ → setup fails ↑ → churn ↑, or queue depth peaks ↑ → tail latency ↑) to reduce false positives. Reporting should highlight tail metrics and drop-reason composition, since averages routinely hide customer-visible issues.
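The threshold-plus-trend alert logic can be sketched in a few lines; `should_alert`, the window, and the limits are illustrative assumptions:

```python
# Fire when the short-window peak exceeds a hard limit OR the metric is
# rising fast, catching degradation before it becomes an outage.
def should_alert(window: list[float], threshold: float,
                 slope_limit: float) -> bool:
    peak = max(window)
    slope = (window[-1] - window[0]) / max(1, len(window) - 1)
    return peak >= threshold or slope >= slope_limit

queue_depth = [10, 12, 15, 19, 24, 30]  # still below threshold, but climbing
print(should_alert(queue_depth, threshold=100, slope_limit=3.0))  # trend fires
print(should_alert([90, 92, 91, 93, 92, 91], 100, 3.0))           # flat: quiet
```

The same shape applies to drop-reason counts and per-class p99 latency; correlation rules then combine several such signals before paging anyone.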