xPON OLT: MAC/PHY, DBA Scheduling, Optics, Timing & Uplink
An xPON OLT is the access-network “traffic + timing coordinator” that aggregates many ONUs, schedules upstream TDMA (DBA/grants), and maps PON QoS into Ethernet uplinks with verifiable telemetry. Most real-world failures are not “mystery optics”; they are interactions between burst-mode reception, DBA stability, uplink microbursts, and clock/timestamp integrity that the right evidence can pinpoint.
H2-1 · What is an xPON OLT (and what it is not)
Core idea
An xPON OLT is the access-network headend that multiplexes many subscribers over a shared optical distribution network. It owns upstream TDMA scheduling (DBA), burst optical reception, PON MAC/PHY processing, and the observability needed to diagnose capacity, latency, and stability—while presenting aggregated Ethernet uplinks and time/sync to the rest of the network.
Responsibility boundary (OLT vs ONU/ONT vs upstream network)
This table is the scope contract: for the ONU/ONT, only interface obligations are described, not internal design.
| Domain | Owns (must get right) | Interfaces / observable evidence |
|---|---|---|
| OLT | DBA scheduler (T-CONT/grants), upstream burst-mode receive behavior, PON MAC/PHY pipeline (GEM/XGEM, FEC), Ethernet uplink buffering/QoS mapping, clock distribution inside the OLT, alarms/logs/PM counters. | Grant maps & slot utilization, burst-miss counters, FEC corrected/uncorrected counters, ranging outcomes, queue watermarks & drop stats, uplink error counters, reference lock/holdover state, event logs. |
| ONU/ONT | Responds to OLT grants (bursts in assigned slots), reports queue state per the protocol, stays within optical power and burst timing constraints, supports OAM per service profile. | ONU registration/ranging messages, queue report validity, burst alignment and preamble detectability, optical power & alarms as seen by OLT, service OAM/PM visibility at the interface. |
| Upstream network | Treats OLT as an Ethernet aggregation device; provides backhaul capacity and (optionally) time reference. IP services and subscriber policy engines are outside this page. | Uplink link state, congestion signals (drops/ECN), time reference status presented to OLT (if used), northbound telemetry. |
What problems an OLT must solve (engineering-first)
- Upstream fairness and latency control: convert per-service demand into grants (DBA) while avoiding slot fragmentation and micro-starvation.
- Burst-mode reception under wide dynamic range: reacquire baseline/threshold quickly enough that preamble and payload are both reliably detected.
- Pipeline integrity: keep MAC/PHY processing (encapsulation + FEC) and uplink aggregation from turning predictable traffic into unpredictable jitter.
- Field diagnosability: publish counters and logs that let operations prove where loss/jitter originates (PON optics, DBA, or uplink).
- RangingFail↑ + unstable SlotUtilization often points to DBA timing/guard time or noisy reports from endpoints.
- UplinkDrops↑ with clean PON counters often points to uplink buffering/microbursts and QoS mapping.
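Fingerprints like these two can be encoded as a first-pass triage rule. A minimal sketch, assuming hypothetical counter names and thresholds (no vendor MIB defines these exact keys):

```python
def triage(c):
    """First-pass triage from combined counter evidence.
    Keys and thresholds are illustrative, not vendor-defined."""
    if c.get("ranging_fail_delta", 0) > 0 and c.get("slot_util_variance", 0.0) > 0.1:
        return "DBA"      # grant timing / guard time / noisy endpoint reports
    if c.get("uplink_drops_delta", 0) > 0 and c.get("fec_uncorrected_delta", 0) == 0:
        return "uplink"   # buffering / microbursts / QoS mapping
    return "inconclusive" # collect more evidence before acting
```

The point is not the thresholds but the shape: a verdict requires at least two independent counters agreeing, and “inconclusive” is an acceptable output.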
Typical OLT form factors (only what changes engineering)
- Chassis / line card: higher port density stresses power delivery, thermal zones, and serviceability. Telemetry and fault isolation become mandatory.
- Pizza-box / fixed system: tighter cost/power envelopes force integration trade-offs; debug hooks and counter coverage should not be sacrificed.
- Pluggable vs fixed optics: changes how DDM, alarms, and replacement workflows are implemented; optical health must remain observable at the OLT boundary.
Figure F1 — OLT system context (scope map)
The diagram anchors the entire page: DBA, optical AFE, timing/sync, OAM, and Ethernet uplink are OLT-owned blocks.
H2-2 · PON flavors & compatibility: GPON vs XG-PON vs XGS-PON vs 50G-PON
What stays the same (architecture invariants)
Regardless of generation, an OLT remains a shared-medium scheduler + burst optics endpoint + aggregation device with evidence-grade counters.
- Downstream broadcast, upstream TDMA: upstream stability still depends on grant timing and slot utilization.
- DBA remains the control loop: queue reports → grants → utilization/counters feedback.
- Optical AFE is still decisive: burst recovery and dynamic range define “hidden loss” behavior.
- Evidence matters: FEC/BER + burst-miss + ranging + queue stats remain the minimum telemetry set.
What changes (sorted by hardware pressure)
- Optics / burst receiver stress: higher rates and coexistence increase sensitivity to receiver recovery, thresholding, and power-step handling.
- PHY clocking and jitter budget: faster line rates tighten the tolerance to phase noise and distribution skew inside the OLT.
- FEC footprint and interpretation: stronger FEC can mask optics margin until counters and tail latency reveal the cost.
- Scheduling granularity pressure: more bursty traffic pushes DBA to avoid fragmentation and guard-time waste.
- Power/thermal headroom: denser optics + faster SerDes + bigger ASICs raise steady heat and transient power demands.
Coexistence & mixed-deployment checklist (OLT-side only)
- Start with evidence, not guesses: compare (a) FEC corrected/uncorrected, (b) burst-miss / preamble detect, (c) ranging outcomes, and (d) slot utilization stability.
- Differentiate optics vs scheduling: optics issues tend to raise burst-miss and FEC together; scheduling issues tend to destabilize utilization and create patterned latency.
- Guard against alarm storms: use debounced thresholds and “sustained” windows for optics drift and reference switching events.
- Plan uplink headroom: an apparently “clean PON” can still fail user experience if uplink microbursts exceed buffers and QoS mapping is misaligned.
Figure F2 — Mode matrix (engineering pressure map)
The matrix avoids fragile spec tables and instead highlights where OLT hardware is typically stressed: burst Rx, FEC, clocking, and uplink class.
H2-3 · MAC/PHY data path: downstream broadcast + upstream TDMA (where latency/jitter comes from)
Core idea
In xPON, latency and jitter are not caused by a single “slow block.” They emerge from the interaction of discrete grant cycles (TDMA), burst receiver recovery, FEC block processing, queueing/shaping, uplink congestion, and where hardware timestamps are taken inside the OLT.
Downstream pipeline (OLT → many endpoints): mostly deterministic, still buffer-shaped
Downstream is broadcast-like and continuous from the OLT perspective, so variability usually comes from classification, batching, and FEC granularity.
- Ingress classification & service mapping: frames are mapped into service containers and priority classes before encapsulation.
- GEM/XGEM “pipe” stage: encapsulation and (if enabled) link-layer encryption add predictable processing; variability comes from batching/aggregation.
- FEC as a block-stage: FEC commonly operates on fixed-size blocks; that block granularity can add a stable but non-zero latency floor.
- PCS/PHY clock domain: clock-domain boundaries and rate adaptation are typically stable, but can amplify jitter when upstream queues are bursty.
- Optical Tx emission: transmission is generally predictable; alarms and drift are the key operational hooks.
Upstream pipeline (many endpoints → OLT): TDMA + burst reception creates fragile variability
Upstream is the engineering “hard mode”: bursts arrive with power steps and timing constraints, and every missed preamble can become hidden loss.
- Timeslot execution: upstream capacity is delivered in discrete grants, so demand is translated into a slot plan, not a continuous stream.
- Burst-mode receiver recovery: each burst forces fast reacquisition (baseline/threshold/AGC behavior) so that preamble + payload are correctly detected.
- CDR / clock recovery window: recovery stability determines whether payload is sampled reliably; failures often manifest as burst-miss and elevated FEC activity.
- FEC interpretation: strong FEC can hide marginal optics until corrected/uncorrected counters and tail latency reveal the true cost.
- Ingress queues: once decoded, frames still enter queueing domains that can re-introduce jitter (especially under microbursts or QoS remapping).
Latency & jitter sources (typed by field signature)
| Signature | Typical OLT-side cause | What to check first (evidence) |
|---|---|---|
| Periodic / sawtooth | Grant cycle effects, DBA recomputation cadence, slot-plan reshaping under changing demand. | Slot utilization stability, grant-map statistics (fragmentation), per-class queue watermarks, tail-latency pattern. |
| Sudden spikes | Uplink microbursts exceeding buffers, shaping/QoS remap, transient head-of-line blocking. | Uplink drop/ECN counters, queue peaks aligned with spikes, PON counters remaining clean. |
| Slow drift | Optics margin drift (temperature/aging), reference clock switching/holdover transitions. | DDM trends, burst-miss trend, FEC corrected trend, reference lock/holdover log entries. |
| “Timestamp lies” | Timestamp taken before/after buffers, crossing clock domains, or taken on a congested path. | Timestamp insertion point, queueing in front of the timestamp, ref lock state, consistency across ingress/egress points. |
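The first three signatures in the table can be roughly separated from a raw latency series. A sketch with invented thresholds (real classifiers would be calibrated against fleet baselines):

```python
import statistics

def classify_latency(samples):
    """Rough field-signature classifier for a latency series (same units
    throughout). All thresholds are illustrative, not calibrated."""
    p50 = statistics.median(samples)
    half = len(samples) // 2
    h1 = statistics.mean(samples[:half])   # first half of the window
    h2 = statistics.mean(samples[half:])   # second half of the window
    if max(samples) > 5 * p50:
        return "sudden-spikes"             # check uplink drops/ECN, queue peaks
    if h2 > 1.5 * h1:
        return "slow-drift"                # check DDM trends, holdover log
    if statistics.pstdev(samples) > 0.2 * statistics.mean(samples):
        return "periodic-sawtooth"         # check slot utilization, grant map
    return "stable"
```

“Timestamp lies” cannot be detected from one series alone; it needs comparison across stamp points, which is why the table sends you to the insertion point first.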
Failure fingerprints (fast triage using combined evidence)
These fingerprints help separate optics/burst recovery from scheduling from uplink congestion.
Figure F3 — Frame & timeslot pipeline (where delay is inserted)
Top lane: downstream pipeline. Bottom lane: upstream burst pipeline. Markers highlight typical jitter/latency insertion points.
H2-4 · DBA scheduler deep dive: T-CONTs, queues, grant calculation, and why it breaks in the field
Core idea
DBA is a closed-loop controller inside the OLT: it turns demand signals into a discrete slot plan (grants), then uses utilization and error evidence to keep fairness, latency, and stability under changing traffic, optics margin, and endpoint behavior.
DBA as a control loop (not a list of acronyms)
Understanding the loop is the fastest path to diagnosing “it worked in the lab, but collapses at peak hours.”
Inputs that shape the scheduler (grouped by trust and stability)
| Input group | What it represents | Why it can destabilize |
|---|---|---|
| Policy constraints | QoS class, SLA targets, minimum guarantees, priority rules, maximum burst constraints. | Conflicting objectives (fairness vs tail latency vs utilization) require explicit trade-offs. |
| Demand signals | Reported backlog / demand indicators and observed consumption over recent windows. | Noisy/late/invalid reports create oscillations and over/under-allocation patterns. |
| Physical overhead | Guard time, burst recovery window, and practical slot minimums for reliable reception. | Small-grant fragmentation magnifies overhead and drives congestion collapse. |
| Historical evidence | Utilization stability, corrected vs uncorrected FEC, burst-miss, ranging outcomes. | Misreading evidence can push the loop into the wrong corrective direction. |
Core mechanics: grant sizing, slot planning, and guard time reality
- Grant sizing: chooses how much upstream “work” to schedule per service container, balancing minimum guarantees and burst efficiency.
- Slot planning: packs grants into a time map; the map quality determines whether upstream looks stable or chaotic.
- Guard time overhead: a fixed per-burst cost; when grants become too small, the overhead dominates and effective capacity shrinks.
- Collision risk: tighter windows and unstable timing increase the probability that bursts land outside valid reception windows.
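The guard-time point can be made concrete with a one-line model. The byte-time costs below are made up for illustration; real guard/preamble/delimiter overheads are generation-specific:

```python
def upstream_efficiency(grant_bytes, per_burst_overhead_bytes):
    """Fraction of upstream time carrying payload, with guard time,
    preamble, and delimiter modeled as one fixed per-burst cost."""
    return grant_bytes / (grant_bytes + per_burst_overhead_bytes)

# The same offered load split into many small grants wastes capacity:
large = upstream_efficiency(8000, 200)   # few large bursts
small = upstream_efficiency(500, 200)    # fragmented slot plan
```

Because the overhead is per burst, halving grant size roughly doubles the overhead share; this is the mechanism behind fragmentation-driven congestion collapse.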
Why DBA breaks in the field (four common failure modes)
OLT-side protection policies (keep the loop stable)
Only OLT-side levers are listed here: detection, containment, and graceful fallback.
- Sanity checks: validate demand signal range and rate-of-change to reject implausible reports.
- Smoothing / hysteresis: avoid chasing instantaneous noise; stabilize decisions over a controlled window.
- Minimum-grant enforcement: prevent pathological fragmentation by enforcing a floor grant size per class.
- Containment: rate-limit or quarantine abnormal endpoints that destabilize the map.
- Fallback allocation: apply a conservative, stable plan under uncertainty (protecting latency-critical classes).
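Several of these levers fit into one allocation step. A sketch in which every constant (smoothing factor, floor, clamp ratio) is a placeholder that a real scheduler would tune per class:

```python
def next_grant(reported, ewma_prev, alpha=0.25, floor=256, ceiling=65536,
               max_step=2.0):
    """One DBA stabilization step (sketch): reject implausible reports,
    smooth with an EWMA, and enforce a floor/ceiling on the grant."""
    # Sanity check: clamp implausible jumps relative to the smoothed state.
    if ewma_prev > 0 and reported > max_step * ewma_prev:
        reported = max_step * ewma_prev
    # Smoothing: avoid chasing instantaneous report noise.
    ewma = alpha * reported + (1 - alpha) * ewma_prev
    # Minimum-grant enforcement prevents pathological fragmentation.
    grant = min(max(int(ewma), floor), ceiling)
    return grant, ewma
```

Note the order: validation before smoothing, smoothing before enforcement; inverting it lets a single bogus report distort the smoothed state.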
Minimum evidence set for DBA health (what must be measurable)
- Slot plan quality: fragmentation indicators, average grant size per class, guard time waste proxies.
- Utilization stability: not only mean utilization, but variance over time windows.
- Demand consistency: whether demand signals correlate with observed throughput and queue drain.
- Physical success evidence: burst-miss and FEC corrected/uncorrected trends to avoid “scheduling the impossible.”
- Service impact evidence: tail latency and drops per priority class.
Figure F4 — DBA control loop (inputs → plan → execution → evidence)
The diagram shows DBA as a closed-loop controller with explicit stability logic and evidence feedback.
H2-5 · Optical Tx/Rx AFEs: burst-mode receiver, laser driver, APC/ATC, DDM—engineering choices & pitfalls
Core idea
OLT optics is not just “power in / power out.” The field failures that look random (BER drift, burst-miss, intermittent deregistration) usually come from burst-mode recovery limits, dynamic-range stress, thermal/aging drift, reflections/contamination, and how evidence is read from DDM and counters.
OLT optical port boundary: Tx chain + Rx chain + sensors + control loops
This chapter stays at OLT-side blocks and their interfaces to the PON MAC/PHY. Endpoint internals are intentionally excluded.
Tx engineering choices: laser driver, power control, and thermal drift
- Laser driver & modulation behavior: launch stability is shaped by edge control, amplitude control, and timing cleanliness.
- APC (automatic power control): stabilizes optical launch power, but does not guarantee stable BER if noise/reflections rise downstream.
- ATC (thermal control): keeps the operating point stable; slow temperature drift is a common trigger for “intermittent but repeatable” issues.
- Practical pitfall: “Power looks fine” can coexist with rising corrected errors because APC holds power while margin is lost elsewhere.
Rx engineering choices: burst-mode recovery dominates “random” field behavior
Upstream reception is bursty and heterogeneous. The receiver must rapidly adapt across large amplitude steps and timing constraints.
- Dynamic range stress: mixed distances and loss create strong/weak bursts; front-end linearity and recovery determine whether weak bursts survive.
- Fast recovery window: the receiver must settle quickly enough to capture the preamble and valid payload sampling region.
- Threshold / baseline stability: drift or imperfect settling turns into preamble mis-detect, burst-miss, or elevated corrected errors.
- Limiter/AGC trade: aggressive AGC helps range but can create recovery delay; conservative AGC can cause saturation and false decisions.
DDM and alarms: trends create evidence, not verdicts
DDM is most valuable as a correlated time series: temperature, power, and bias trends should be read alongside burst-miss and FEC counters.
| Observed pattern | Likely interpretation (OLT-side) | What to check next |
|---|---|---|
| Temp↑ + corrected↑ | Margin shrinks with thermal drift; recovery window and threshold stability become critical. | Event timing vs temperature ramps, burst-miss spikes, holdover/clock changes excluded. |
| Tx power stable, BER drifts | APC is holding launch power while reflections/contamination/noise change the receive condition. | Connector events, cleaning cycles, reflection-sensitive behavior, corrected/uncorrected split. |
| Bias changes over weeks | Aging or control-loop compensation; margin can slowly erode before failures become visible. | Bias trend slope, corresponding corrected trend, alarm thresholds and hysteresis. |
| LOS/LOF edges | Edge-of-valid bursts or intermittent detect; can look like random drops higher up the stack. | LOS/LOF timestamps aligned with burst-miss and uncorrected error events. |
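One way to turn “Temp↑ + corrected↑” into evidence rather than a hunch is to correlate the two trends over the same window. The sample values below are invented; the method, not the numbers, is the point:

```python
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

temp      = [40, 42, 45, 48, 52, 55]   # degC samples over the window
corrected = [10, 12, 20, 35, 60, 90]   # FEC corrected per interval
```

A strong positive correlation supports the thermal-margin interpretation but does not prove it; the table’s “what to check next” column (burst-miss spikes, clock events excluded) is still required.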
“Power looks normal but BER drifts”: five common paths (with field fingerprints)
Figure F5 — Optical AFE blocks (Tx/Rx, APC/ATC, DDM, and MAC/PHY boundary)
This block view shows where burst recovery lives, where evidence is measured, and where the MAC/PHY interface begins.
H2-6 · Timing & synchronization inside an OLT: PTP/SyncE/ToD distribution and why timestamping lies
Core idea
Timing inside an OLT is a clock tree with monitoring points. Most “timestamp errors” come from the stamp point and the path: queues, buffers, and clock-domain crossings add bias that looks like jitter—even when the timing input itself is healthy.
OLT timing scope: what is controlled inside the box
- Reference intake: SyncE / PTP / ToD enter the OLT as a reference for internal distribution.
- Cleaning and distribution: a jitter-cleaning stage produces a stable internal reference and fans it out to PHY blocks.
- Timestamp units: hardware timestamps must be placed close to the effective I/O boundary to avoid queue-induced bias.
- Monitoring and alarms: lock state, phase error proxies, and holdover transitions must be logged with stable thresholds.
Clock tree: reference in → jitter cleaning → fanout to uplink and PON PHY
The same reference can look “good” at the input and still produce poor timestamps if distribution points and stamp points are poorly chosen.
Why timestamps “lie”: stamp point and path bias
| Cause | What it looks like | Evidence to confirm |
|---|---|---|
| Stamp before a queue | Apparent delay increases with load; spikes align with queue peaks. | Queue watermark correlation; moving stamp point reduces variance. |
| Clock-domain crossing bias | Slow “wander” or inconsistent offset across ports. | Distribution monitor points; compare uplink vs PON domain behavior. |
| Stamp on a congested path | Sharp spikes during microbursts; time error follows traffic pattern. | Ingress/egress counters, drops, and time-error spikes co-occur. |
| Reference switch / holdover entry | Step changes or slow drift after ref loss; alarms may flap. | Holdover logs, lock status, alarm hysteresis events. |
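The “stamp before a queue” row is easy to reproduce with a toy single-server queue (all numbers arbitrary): the same packets measure a constant delay when stamped after the queue, and a growing staircase that looks like jitter when stamped at arrival:

```python
def delays(arrivals, service, stamp_after_queue):
    """Toy FIFO queue with one server: returns the 'measured' delay per
    packet, depending on where the timestamp is taken."""
    t = 0.0
    out = []
    for a in arrivals:
        start = max(t, a)                # wait if the server is busy
        t = start + service              # departure time
        stamp = start if stamp_after_queue else a
        out.append(t - stamp)            # delay as seen from the stamp point
    return out

burst = [0.0, 0.0, 0.0, 0.0]             # four packets arrive together
```

Moving the stamp point past the queue removes the load dependence, which is exactly the confirmation step the table suggests.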
Holdover and alarms inside an OLT: stable thresholds prevent alarm storms
- Detect: reference loss or quality drop triggers a state transition (lock → holdover).
- Degrade: holdover quality typically degrades over time; alarms should reflect duration and severity.
- Hysteresis: alarm thresholds should avoid flapping under short disturbances.
- Log evidence: holdover entry/exit timestamps + quality proxies make “why it drifted” provable.
Figure F6 — OLT clock tree (ref in → cleaner PLL → fanout → PHY/timestamps + monitor points)
A clock tree view that highlights monitor points and the two biggest timestamp pitfalls: stamp point and queue bias.
H2-7 · Ethernet uplinks & aggregation: buffering, shaping, QoS mapping, and microburst survival
Core idea
OLT uplinks are where bursty PON traffic meets statistical Ethernet multiplexing. Throughput can look fine while microbursts create queue spikes, tail-latency jumps, ECN marks, or drops. Survival depends on mapping, buffering, shaping, and evidence from counters.
PON QoS to Ethernet QoS: mapping as an engineering contract
Mapping is not a label conversion. It defines which flows share queues, which flows get shaping, and which counters represent “truth” under load.
- Input side: PON-side service classes and flow grouping (e.g., T-CONT/flows) provide intent: priority, bandwidth guarantees, and delay sensitivity.
- OLT translation: classification + mapping places traffic into a limited set of internal queues and schedulers.
- Output side: Ethernet priorities and uplink queues decide who gets served first when bursts collide.
- Failure mode: poor mapping can starve low priority, inflate latency for “interactive” flows, or hide congestion behind retries.
Where congestion forms: aggregation points and oversubscription
Why microbursts happen (and why PON makes them worse)
- Bursty arrivals: upstream scheduling and flow aggregation create clustered arrivals rather than smooth Poisson-like streams.
- Superposition peaks: multiple sources align in time, producing short high-rate spikes that exceed uplink drain rate.
- Queue spikes: when instantaneous ingress exceeds egress, queue depth jumps and tail latency follows.
- Operational fingerprint: average throughput stays stable while P95/P99 latency spikes, ECN marks rise, or shallow buffers drop bursts.
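A per-tick fluid model is enough to show the fingerprint: identical average load, radically different buffer outcome. Buffer and drain values below are arbitrary:

```python
def queue_sim(arrival_bytes, drain_per_tick, buffer_limit):
    """Per-tick queue model: ingress bytes vs a constant drain rate.
    Returns (queue-depth trace, total dropped bytes)."""
    q, drops, trace = 0, 0, []
    for a in arrival_bytes:
        q += a
        if q > buffer_limit:             # tail-drop the overflow
            drops += q - buffer_limit
            q = buffer_limit
        q = max(0, q - drain_per_tick)   # egress drains at a fixed rate
        trace.append(q)
    return trace, drops

# Same total bytes over 10 ticks, smooth vs one microburst:
smooth = [100] * 10
burst  = [1000] + [0] * 9
```

Averaged over the window both patterns are 100 B/tick against a 100 B/tick drain, yet only the microburst drops bytes; this is why throughput graphs stay flat while P99 and drop counters move.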
Shaping, early signals, and controlled loss: what to enable (without a data-center deep dive)
- Shaping: caps burstiness for selected classes, trading peak rate for predictable latency and fewer drops.
- WRED/ECN (policy view): turns late tail-drop into earlier signals (marks) to prevent buffer blow-ups—if counters are monitored.
- Common pitfall: overly aggressive thresholds reduce goodput; overly relaxed thresholds cause tail-drop bursts and retransmit storms.
- Practical selection: prioritize consistent behavior (stable P99, controlled marks/drops) over maximum short-run peak throughput.
How to read uplink evidence: throughput vs PPS vs latency vs loss vs retransmits
| Metric | What it really indicates | Misread risk |
|---|---|---|
| Throughput (Gbps) | Sustained transfer rate; can remain high even when burst-sensitive flows suffer. | Assuming “Gbps OK” means “no congestion.” |
| PPS / Mpps | Packet processing stress; small packets amplify queue management and per-packet overhead. | Blaming optics/DBA while the real bottleneck is packet-rate. |
| Latency (P50/P95/P99) | Queueing reality; tail (P99) reacts first to microbursts and shallow buffers. | Only looking at averages and missing spikes. |
| ECN marks / WRED drops | Early congestion or controlled loss; often rises before hard tail-drop. | Disabling marks/drops without understanding the buffer dynamics. |
| Retransmits | Symptom of loss or excessive jitter; points back to queue behavior. | Treating retransmits as root cause instead of evidence. |
Figure F7 — QoS mapping & buffers (PON flows → OLT queues/shapers → uplink queues → evidence counters)
This view shows where oversubscription forms and which counters prove microburst survival.
H2-8 · OAM, performance monitoring & alarms: what to measure so ops can actually fix problems
Core idea
Counters become operational signal only after thresholds, debounce, and event correlation. A workable OLT OAM design separates transient noise from persistent faults and produces forensic logs that align optics, burst reception, DBA, uplink, and system health.
Minimum observability set: what must be measured (so diagnosis converges)
The goal is domain isolation: optics vs burst vs FEC vs DBA vs uplink vs system.
- Optics domain: edge alarms and DDM trend slopes separate gradual margin loss from sudden link loss.
- Burst domain: burst-miss spikes explain “random” drops that do not show as continuous LOS.
- FEC domain: corrected vs uncorrected reveals whether errors are being rescued or escaping into loss.
- DBA domain: abnormal patterns identify upstream scheduling instability without relying on customer symptoms.
- Uplink domain: watermarks and marks/drops pinpoint microburst and oversubscription behavior.
- System domain: thermal/power and reset causes prevent misattribution to “network” problems.
Symptom-to-domain triage: turning complaints into a short evidence list
| Observed symptom | First domain to check | Evidence to look for |
|---|---|---|
| Intermittent drops | Burst + optics edge | Burst-miss spikes, LOS/LOF edges, DDM temperature correlation |
| BER drift | Optics + FEC | Corrected slope rising, DDM trend changes, uncorrected events |
| P99 latency spikes | Uplink queues | Watermark spikes, marks rising, short drop bursts |
| “Congestion” complaints | DBA + uplink | DBA anomalies, uplink drop/mark counters, queue depth telemetry |
| Sudden widespread outage | System + timing/power | Power faults, thermal limits, holdover events, reset cause logs |
Alarm grading and debounce: separating transient noise from persistent faults
- Grade levels: informational, warning, critical—based on severity and duration.
- Debounce: require persistence time or repeated events before escalation.
- Suppression: avoid alarm storms by rate-limiting repeated identical alarms and linking dependent alarms.
- Reset rules: define explicit recovery conditions, not just “counter went down once.”
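Grading, debounce, and explicit reset rules compose into a small state machine. A sketch, assuming made-up persistence counts; real values depend on the counter’s sampling interval:

```python
class DebouncedAlarm:
    """Raise only after `persist` consecutive bad samples; clear only
    after `clear_n` consecutive good samples (explicit recovery rule)."""
    def __init__(self, threshold, persist=3, clear_n=5):
        self.threshold, self.persist, self.clear_n = threshold, persist, clear_n
        self.bad = self.good = 0
        self.active = False

    def sample(self, value):
        if value > self.threshold:
            self.bad += 1
            self.good = 0
            if self.bad >= self.persist:
                self.active = True       # escalate only on persistence
        else:
            self.good += 1
            self.bad = 0
            if self.active and self.good >= self.clear_n:
                self.active = False      # never clear on one good sample
        return self.active
```

The asymmetry (raise slowly, clear even more slowly) is what prevents flapping under a noisy counter.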
Event logs: what must be recorded to enable forensic correlation
Logs must include timestamps and identifiers (port/ONU/module/queue) so counter spikes can be explained and reproduced.
Northbound integration (name only): exporting signal, not raw noise
- SNMP: expose stable counters and graded alarms; avoid flooding with raw transient spikes.
- NETCONF: export configuration and structured state; use for controlled retrieval of diagnostic context.
- Streaming telemetry: deliver time-series watermarks and marks/drops trends for rapid correlation.
Figure F8 — Telemetry & alarm pipeline (counters → thresholds → debounce → event log → northbound)
This pipeline shows exactly where “noise becomes signal” and where operations-ready evidence is produced.
H2-9 · Power, thermal & reliability: multi-rail sequencing, hot-swap, fan control, and “random resets”
Core idea
“Random resets” are usually measurable. The root cause is often a combination of input protection behavior, multi-rail dependency windows, thermal hotspots, and protection policies. Reliability improves when telemetry + reset-cause logs turn intermittent events into evidence.
Power entry: 48V/12V protection as the first instability amplifier
Entry protection is designed to save hardware during hot-plug, short events, or inrush. Under marginal conditions it can also create brief brownouts.
- Hot-swap / eFuse / fuse: limits inrush and trips on overcurrent; transient limiting can pull downstream rails toward UV thresholds.
- Key observable signals: fault flags, current sense, input voltage droop, “power-good” timing edges.
- Operational signature: resets cluster around plug events, load steps, or temperature-driven current increases.
Multi-rail power tree: what matters is dependency, not the number of rails
Sequencing & resets: the four parameters that decide whether bring-up is repeatable
- Order: which rails must be valid before others are enabled (core → I/O → memory is common).
- Delay: minimum settle time before deasserting reset or starting DDR/SerDes training.
- Threshold: PG comparators and UV limits must match real rail dynamics, not ideal targets.
- Debounce: filtering prevents noisy PG edges from generating spurious resets.
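Order and delay can be checked offline from recorded power-good timestamps. A sketch with hypothetical rail names and a single minimum settle time (real sequencers have per-rail windows):

```python
def check_sequence(pg_events, required_order, min_settle_ms):
    """Validate PG-assert timestamps against a required rail order and
    a minimum settle time before the next rail is considered valid."""
    times = dict(pg_events)              # rail name -> PG-assert time (ms)
    for earlier, later in zip(required_order, required_order[1:]):
        gap = times[later] - times[earlier]
        if gap < min_settle_ms:
            return False, f"{later} valid only {gap} ms after {earlier}"
    return True, "sequence OK"

# Example capture from a bring-up log (values invented):
events = [("core", 0.0), ("io", 2.5), ("memory", 6.0)]
```

Threshold and debounce cannot be verified from timestamps alone; they need the PG comparator settings and the analog rail waveforms.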
Thermal hotspots and control policies: fan curves, throttling, and protective resets
Thermal is not only about absolute temperature. It is about gradients, hotspots, and policy transitions (normal → throttle → protect).
- Typical hotspots: switch/NP ASIC, SerDes banks, optics cages, and local DC/DC stages.
- Control strategy: fan curve + sensor placement; throttle thresholds that avoid oscillation and link retraining loops.
- Protection behavior: overtemp or VRM limiting can cause sudden link loss, re-training, or a protective reset.
Random resets: the minimum forensic set that turns “intermittent” into evidence
| Signature | Most likely domain | First evidence to check |
|---|---|---|
| Resets on load steps | Entry limiting / rail transient | Input droop, UV flags, PG timing edges |
| Resets after warm-up | Thermal / VRM current limiting | Hotspot temp slope, fan state, rail current rise |
| Occasional lock-ups | Sequencing margin / training | Reset deassert timing vs clocks/PG, retraining counters |
| WDT resets | System health / software stall | WDT reason + preceding thermal/power anomalies timeline |
Figure F9 — Power tree + sensors (entry protection → rails → loads → sensors → controller/alarms)
This diagram shows where to instrument and how reset evidence is produced.
H2-10 · Bring-up & debug playbook: from “no link/no ranging” to “high FEC/packet loss”
Core idea
Bring-up succeeds when each layer has a clear “done” signal and a small evidence set. Debug should narrow domains in order: physical → link → burst/ranging → scheduling (DBA) → uplink queues. This playbook focuses on OLT-side observations and counters.
Bring-up order: the shortest path from “power on” to “services stable”
- Power: stable rails + clean PG edges + no recurring faults.
- Clocks: PLL lock and no frequent reference switching events.
- Uplink: link up + stable error counters + queue watermarks reasonable under load.
- PON PHY/optics: no persistent LOS/LOF, DDM values in range.
- Ranging/registration: ONU registration stable; burst-miss/collision counters do not grow abnormally.
- Service flow: FEC corrected stable, low uncorrected, acceptable tail latency.
Symptom map: what “no link / no ranging / high FEC / packet loss” usually means
| Symptom | Likely domain | First evidence |
|---|---|---|
| No light / LOS | Physical optics / module state | LOS/LOF edges, DDM readings, module fault flags |
| No ranging / unstable registration | Burst reception / timing windows | Burst-miss, collision/guard indicators, registration retries |
| High FEC corrected | Margin degradation (optics/thermal) | Corrected slope, DDM trends, temperature correlation |
| Packet loss / latency spikes | Scheduling or uplink queuing | DBA anomalies, queue watermarks, marks/drops, P99 latency |
Minimum observation points: the evidence set that prevents guessing
High FEC but “still works”: treat corrected errors as a leading indicator
- Corrected rising: the system is spending margin to keep service alive; treat as an early warning.
- Uncorrected events: indicate service is already escaping into loss; escalate severity.
- Most productive correlation: corrected slope vs DDM trends vs thermal sensors vs time-of-day load steps.
- Field survival action: alarm thresholds should track trends (slope + persistence), not only absolute values.
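The “slope + persistence” idea reduces to a least-squares trend check over a sliding window. Window size and slope limit below are placeholders; real thresholds come from fleet baselines:

```python
def slope_alarm(series, window=6, slope_limit=2.0):
    """Flag a corrected-FEC trend when the least-squares slope over the
    last `window` samples exceeds slope_limit per interval."""
    xs = list(range(window))
    ys = series[-window:]
    n = float(window)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den > slope_limit
```

A flat counter at a high absolute value stays quiet, while a fast-rising counter alarms early; that is the inversion of a naive absolute threshold.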
Uplink seems “OK” but experience is poor: restrict the search to DBA + mapping + microbursts
- DBA domain: check for abnormal grant/report patterns and burst-miss growth under load.
- Mapping domain: confirm service classes land in intended queues/shapers (no accidental sharing of tail latency).
- Microburst domain: use watermarks, marks/drops, and P99 latency to prove burst absorption failure.
Figure F10 — Debug decision tree (physical → link → scheduling → uplink)
A practical decision tree that narrows the domain using a small evidence set at each branch.
Validation & troubleshooting: proving “done” and enabling fast field triage
Three-layer validation plan (each test must produce evidence)
| Layer | Stimulus / method | Evidence to capture (counters/logs) | Pass criteria (engineering intent) |
|---|---|---|---|
| A) Performance & stability | Soak run (steady traffic + storage load), temperature step (heat-up/cool-down), link-margin disturbance (short/long cable paths), NVMe sustained read/write. | NIC: FEC/CRC/PCS trends, retrain/downshift count; PCIe: AER rate + link up/down; NVMe: SMART (temp/throttle/errors), timeouts, tail (p99/p999 if available); Power/Thermal: rail telemetry, sensor points, throttling flags. | No retrain storms or drive drop; errors do not drift upward into instability; tail latency spikes remain explainable and repeatable (thermal/GC/link evidence aligned). |
| B) Reliability | PCIe AER fault injection (or controlled stimulus that triggers AER), bay hot-unplug/hot-plug drills, firmware rollback readiness (principle-level). | AER class (Correctable/Non-fatal/Fatal), port/bay attribution, reset-domain behavior (what restarts), NVMe “unsafe shutdown” deltas, firmware event logs (upgrade/rollback markers). | Blast radius stays inside the intended reset domain; one bay/port failure remains isolatable; rollback path exists and is testable without introducing new instability signatures. |
| C) Power disaster drills | Brownout/AC drop repeats (including “bounce”), recovery-time measurement, post-event consistency checks. | Power-fail detect → flush → safe-state sequence timestamps, unsafe shutdown count, PLP/hold-up related events, rail telemetry dips, thermal flags and fan state around the event. | Repeatable recovery distribution; no state-machine oscillation under power bounce; unsafe shutdown behavior matches expectation and remains explainable via hold-up/detect evidence. |
Evidence bundle template (field-ready)
- Time base: one consistent timestamp source for all logs/counters; record start/end boundaries of each drill.
- Identity mapping: port ID (NIC cage/port), bay/slot ID (NVMe), PCIe downstream port mapping, reset-domain label.
- Network snapshot: FEC corrected/uncorrected, CRC/PCS errors, retrain/downshift events.
- PCIe snapshot: AER counts by class, link down/up events, surprise down markers (if present).
- NVMe snapshot: SMART (temperature, throttle, media errors), timeout counters, unsafe shutdown count delta.
- Power+thermal snapshot: rail telemetry minima during events, thermal sensor maxima, throttling flags, fan PWM/health.
Troubleshooting map (symptom → evidence → first action)
| Symptom | Evidence (check in order) | Likely domain | First action |
|---|---|---|---|
| Throughput OK but p99 spikes | 1) NVMe tail + SMART throttle/temperature 2) PCIe correctable AER rate aligned with spikes 3) NIC FEC/CRC trend aligned with spikes | NVMe thermal/GC or PCIe margin, then network margin | Isolate hot bay, then isolate PCIe port/cable path |
| Frequent drive drop / timeout | 1) PCIe link retrain / AER bursts 2) Bay power/connectors (slot-level evidence) 3) SSD firmware events + SMART media errors | PCIe reset-domain/margin, then bay hardware, then SSD | Pin to bay/port; avoid node-wide resets |
| Link flap / downshift | 1) FEC/CRC/PCS error ramp vs temperature 2) Retimer/NIC telemetry (if available) 3) Power/clock stability markers | Link margin (retimer/clock/power/thermal) | Swap cable/module path; verify thermal + rail margin |
| Post-powerloss cache anomaly | 1) Unsafe shutdown + PLP/hold-up logs 2) Hold-up budget vs load; “bounce” behavior 3) Fail-detect trigger stability (no oscillation) | Power detect/hold-up window and power bounce handling | Increase hold-up margin; stabilize fail-detect behavior |
Concrete material numbers (examples for validation, triage, and replacements)
Use platform-approved FRU lists for final procurement; the items below are common, field-proven references for the four fault domains.
A) Network (NIC / Ethernet)
- Intel Ethernet Adapter: E810-CQDA2 (100GbE class), E810-XXVDA4 (25GbE class)
- NVIDIA / Mellanox: ConnectX-6 Dx NIC family; ConnectX-7 NIC family (select speed/port count per node design)
B) PCIe fabric (switch / retimer)
- Broadcom / PLX PCIe switches: PEX88096 (Gen4 class), PEX89144 (Gen5 class)
- Astera Labs PCIe retimers: Aries product family (used for margin recovery on long/complex paths)
C) NVMe SSD (data center class)
- Samsung: PM9A3 (DC NVMe family)
- Solidigm: D7-P5520 / D7-P5620 (DC NVMe families)
- Micron: 7450 series (DC NVMe family)
D) Power / telemetry / thermal (IC-level part numbers)
- Hot-swap (ADI / LT): LTC4282, LTC4286
- eFuse (TI): TPS25982, TPS25947
- Power/Current monitor (TI): INA228, INA229
- Temperature sensor (TI): TMP117
- Fan controller (Microchip): EMC2305