ROADM Control: WSS/VOA Loops, Optical Power AFE & Drivers

A ROADM primarily controls per-wavelength routing and optical power by closing a loop between WSS/VOA actuators and tap-PD measurement AFEs. This page shows how to design stable control (ramp/blanking), manage calibration/temperature drift, and use telemetry to prove long-term power stability in the field.

H2-1 · Scope & Control Loop

What a ROADM Really Controls

This chapter sets the scope: a ROADM is controlled as an optical routing + power equalization system. The focus is the setpoint → actuator → sensor loop that makes per-path or per-channel optical power reproducible over temperature, aging, and reconfiguration.

Controls: optical path state (which wavelength goes to which degree/add/drop) and optical power/attenuation (dBm or relative dB targets).
Actuators: WSS (wavelength-selective switching/attenuation shaping) + VOA (continuous attenuation for equalization & transients).
Sensors: tap coupler → photodiode → power AFE (TIA/ADC/Vref) to produce a trustworthy measured-power value.
Not covered: coherent DSP, transceiver CDR/retiming, PAM4 modules, OTN switching & mapping, router & switch ASICs.

1) Setpoints: define “power” precisely before closing a loop

  • Absolute target (dBm): aims at a calibrated optical power at a defined measurement point (ingress/egress/per-channel).
  • Relative target (dB flattening): minimizes channel-to-channel deviation (ΔdB) around a reference, often more robust to absolute sensor offset.
  • Transient policy: ramp-rate limits, step-size limits, and blanking windows during re-route prevent overshoot-triggered alarms.
  • Granularity pledge: a per-channel promise requires per-channel measurement or a validated proxy; otherwise the loop can only guarantee coarse equalization.
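
The ramp-rate limit in the transient policy can be sketched as a working setpoint that moves toward the requested target by at most a fixed amount per update. This is a minimal illustration, not a reference implementation: the function names, the 1 dB/s rate, and the 0.5 s update period are assumptions chosen for the example.

```python
def ramp_setpoint(current_db: float, target_db: float,
                  max_rate_db_per_s: float, dt_s: float) -> float:
    """Advance the working setpoint one update toward target_db.

    The change per update is limited to max_rate_db_per_s * dt_s.
    """
    max_step = max_rate_db_per_s * dt_s
    if max_step <= 0:
        raise ValueError("ramp rate and update period must be positive")
    delta = target_db - current_db
    if abs(delta) <= max_step:
        return target_db          # close enough: land exactly on the target
    return current_db + (max_step if delta > 0 else -max_step)


def ramp_profile(start_db: float, target_db: float,
                 max_rate_db_per_s: float, dt_s: float) -> list[float]:
    """List every working setpoint from start_db until target_db is reached."""
    points = [start_db]
    while points[-1] != target_db:
        points.append(ramp_setpoint(points[-1], target_db,
                                    max_rate_db_per_s, dt_s))
    return points


# Hypothetical example: a 2 dB step, limited to 1 dB/s, updated every 0.5 s.
profile = ramp_profile(-10.0, -8.0, max_rate_db_per_s=1.0, dt_s=0.5)
```

The loop then tracks the working setpoint rather than the raw target, so a large requested step reaches the actuator as a series of small ones.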

2) Actuator reality: repeatability beats theoretical resolution

  • WSS control knobs (abstracted): route/port map, per-channel attenuation, and (if available) passband shaping parameters.
  • VOA job: fine, continuous attenuation to flatten power and to soften step changes during switching events.
  • Non-idealities: hysteresis, mechanical backlash (if present), thermal lag, and drift mean “same command” ≠ “same dB” without calibration.
  • Engineering rule: keep a defined safe state (park/through/block) and a deterministic recovery path back to closed-loop control.

3) Measurement truth: power AFE sets the floor for control quality

  • Dynamic range must cover expected tap power across all operating states without ADC clipping or quantization-dominated noise.
  • Offset & drift: PD responsivity tempco, TIA offset, and reference drift must be budgeted (absolute vs relative errors).
  • Noise vs loop stability: aggressive filtering reduces noise but increases latency; loop gain and sampling cadence must be chosen with that added delay in mind.
  • Calibration hooks: provide a path for offset/gain trim and temperature compensation (LUT/coefficients) to preserve long-term consistency.
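
To make the measurement chain concrete, the sketch below converts a raw ADC code into a line-power estimate in dBm, with hooks for offset and gain trim. All component values (16-bit ADC, 2.5 V reference, 100 kΩ TIA, 1.0 A/W responsivity, 1% tap) and the parameter names are illustrative assumptions, and temperature compensation is left out for brevity.

```python
import math
from typing import Optional


def adc_code_to_dbm(code: int, *, adc_bits: int = 16, vref_v: float = 2.5,
                    tia_ohms: float = 100e3,
                    responsivity_a_per_w: float = 1.0,
                    tap_fraction: float = 0.01,
                    offset_v: float = 0.0,
                    gain_trim: float = 1.0) -> Optional[float]:
    """Estimate line power (dBm) from a raw ADC code.

    Chain: code -> volts -> PD current -> tap power -> line power.
    offset_v and gain_trim are the calibration hooks (assumed linear model).
    Returns None when the reading is at or below the zero floor.
    """
    full_scale = (1 << adc_bits) - 1
    volts = code / full_scale * vref_v - offset_v   # remove AFE offset
    if volts <= 0:
        return None
    pd_amps = volts / tia_ohms                      # TIA: V = I * R
    tap_watts = pd_amps / responsivity_a_per_w      # PD: I = R_pd * P
    line_watts = gain_trim * tap_watts / tap_fraction
    return 10 * math.log10(line_watts * 1e3)        # watts -> mW -> dBm


# With the assumed values, a full-scale code corresponds to 2.5 mW on the line.
full_scale_dbm = adc_code_to_dbm(65535)
```

Returning `None` instead of a very negative number gives the controller an explicit "no valid estimate" signal to gate on.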

4) The minimal closed-loop contract (what “control” must guarantee)

  • Convergence: reaches setpoint within a defined time without oscillation under typical re-route/step conditions.
  • Consistency: repeated route/attenuation commands return similar measured power after compensation.
  • Bounded behavior: no unsafe overshoot during ramp; alarms use debounce/blanking to avoid false trips.
  • Observability: log setpoint, measured power, actuator command, temperature, and fault codes to prove stability later.
How the rest of the page is organized: every later chapter ties back to this loop. The power AFE chapter explains measurement truth, the driver chapter explains actuator repeatability, and the calibration chapter explains how to map codes/steps → dB/dBm across temperature and aging.
Figure F1 — ROADM control loop: setpoint → actuator → sensor (with calibration/temp compensation)
[Figure F1 diagram: Setpoint & Policy (target in dBm or dB; ramp / limits) → Controller (PI loop logic + limits, ADC filter) → Actuators (WSS control, VOA driver; stepper, thermal) → optical path → Optical Power Measurement (tap, PD, power AFE with TIA + ADC) → measured power (dBm estimate with noise / drift handling) → feedback to the controller. A calibration block (LUT + temperature compensation, offset / gain, command → dB map) supports both sides.]

Reading guide: the loop’s quality is bounded by measurement truth (tap+PD+AFE drift/noise) and actuator repeatability (hysteresis/backlash/thermal lag). Calibration ties “codes/steps” to meaningful dB/dBm outcomes.

H2-2 · Node Architecture

Node-Level Architecture: Degrees, Add/Drop, and Where WSS/VOA Sit

This chapter places WSS/VOA back into a ROADM node so measurement points and control points are unambiguous. The main design lever is control granularity: what can be guaranteed depends on where power is measured and how the loop is closed.

1) Degrees & paths: where control happens

  • Degrees are the directional ports (to adjacent spans). The node must steer selected wavelengths among degrees and add/drop paths.
  • WSS typically sits on paths where wavelength selection/port mapping is needed (express routing + selective add/drop handling).
  • VOA sits where continuous attenuation is needed: equalization, power limiting during switching, or per-path trimming after WSS actions.
  • Safe state must be explicit: define “through/blocked/parked” behavior for actuator failure or controller reset.

2) Measurement points map: what each sensor location enables

  • Ingress tap: supports input normalization and coarse equalization; cannot guarantee per-channel output without a proxy model.
  • Egress tap: validates delivered power at a node boundary; good for per-degree flattening and alarm thresholds.
  • Group/segment taps: enable sectional balancing (e.g., add/drop group), reducing sensor count vs full per-channel instrumentation.
  • Per-channel taps: enable true per-channel equalization, but multiply AFE channels, calibration burden, and telemetry volume.

3) Granularity choices: the “no free lunch” table (hardware meaning)

  • Per-degree equalization: few sensors; simplest calibration; robust. Typical promise: stable output power envelope per degree.
  • Group equalization: moderate sensor count; calibration scales with groups; operationally useful when channels are treated as bands.
  • Per-channel setpoint: requires measurement granularity + actuator mapping accuracy; drift handling and logs become mandatory.
  • Engineering guardrail: do not promise finer control than the measurement topology can observe and verify.

4) Bypass & recovery: what “robust” looks like in the field

  • Bypass path (if present): isolates failed actuator segments and keeps the node in a predictable degrade mode.
  • Reconfiguration windows: apply ramp/blanking to avoid false alarms when paths are switching.
  • State persistence: store last stable setpoint + actuator command + temperature snapshot to recover deterministically after reset.
  • Bring-up order: sensor readiness (Vref/ADC stable) → calibration load → closed-loop enable → normal operation.
Practical planning tip: the node map connects forward to (a) power AFE design (error budget), (b) driver/actuator repeatability, and (c) calibration LUTs. The node drawing below serves as the “map” for the troubleshooting and telemetry chapters later.
Figure F2 — Simplified multi-degree ROADM node (optical paths + control/monitor layers)
[Figure F2 diagram: Degree A and Degree B (each to an adjacent span) and client-side add/drop paths meet in an optical switching region containing the WSS (route + wavelength-selective attenuation) and two VOAs (equalize, limit). Taps feed the PD + power AFE (TIA + ADC), whose measured power drives alarms and equalization feedback. The controller (state machine, loop) commands the stepper and thermal drivers and defines the safe state (through / block / park).]

Use this map to prevent scope creep: the diagram only includes optical paths + WSS/VOA control + power monitoring + drivers. Granularity is determined by where taps exist and what the controller can observe.

H2-3 · WSS Control Fundamentals

WSS Control Fundamentals: Ports, Passbands, and Repeatability Knobs

A WSS is best treated as a multi-dimensional control surface, not a single knob. Practical control focuses on route mapping, per-channel attenuation, and (when supported) passband shaping, while repeatability is bounded by actuator non-idealities and calibration drift.

1) Control dimensions dictionary: inputs → outputs → calibration

  • Port / Route map: chooses the destination degree/add-drop for each wavelength group. Output is a verifiable path state (through / drop / blocked).
  • Per-channel attenuation: sets a channel’s loss (dB) to hit a target power (dBm) at a defined measurement point. Requires a code→dB mapping (LUT).
  • Passband shape / tilt: selects presets (width/tilt/edge) that influence “control after-effects” (adjacent-channel anomalies, stability margin, consistency). From the control side, what matters is what the firmware can select and validate, not the underlying optical physics.
  • Control granularity rule: per-channel promises require per-channel observability (tap topology or validated proxy); otherwise only per-degree/group targets can be guaranteed.

2) KPIs defined from a control perspective: measurable + actionable

  • Insertion loss variation: how much delivered power changes for the same configuration across temperature, time, and reboot cycles.
  • Return-to-setpoint repeatability: distribution of measured power after repeated route/attenuation toggles (captures hysteresis/backlash effects).
  • After-effect indicators: control-side proxies for “too aggressive shaping” (e.g., rising adjacent alarm counts or channel-to-channel imbalance events).
  • Settling time: time to re-enter a defined error band after switching or a setpoint step; directly determines blanking windows and ramp policies.
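
Two of these KPIs reduce to short calculations on logged power samples. The sketch below computes settling time as the moment after which the error stays inside the band for the rest of the record, and summarizes return-to-setpoint repeatability as mean, standard deviation, and span. Function names, the ±0.1 dB band, and the sample values are hypothetical.

```python
import statistics
from typing import Optional, Sequence


def settling_time(samples_dbm: Sequence[float], setpoint_dbm: float,
                  band_db: float, dt_s: float) -> Optional[float]:
    """Time after which |error| stays within band_db until the record ends.

    Returns None if the record is empty or its final sample is out of band.
    """
    last_outside = -1
    for i, p in enumerate(samples_dbm):
        if abs(p - setpoint_dbm) > band_db:
            last_outside = i
    if last_outside == len(samples_dbm) - 1:
        return None
    return (last_outside + 1) * dt_s


def repeatability(landings_dbm: Sequence[float]) -> dict:
    """Summarize measured power after repeated toggles to the same setting."""
    return {
        "mean": statistics.fmean(landings_dbm),
        "stdev": statistics.pstdev(landings_dbm),
        "span": max(landings_dbm) - min(landings_dbm),
    }


# Hypothetical step response sampled every 10 ms, target -10 dBm, band 0.1 dB.
trace = [-12.0, -10.8, -9.7, -10.2, -10.05, -9.98, -10.01]
t_settle = settling_time(trace, -10.0, band_db=0.1, dt_s=0.01)
stats = repeatability([-10.0, -10.1, -9.9])
```

A settling time measured this way gives a lower bound for the blanking window: alarms gated for less than `t_settle` will see the transient.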

3) Layered control strategy: coarse → fine

  • Coarse layer: route selection, initial attenuation preset, and temperature/position preconditions that move the system into a controllable region.
  • Fine layer: small-step power trimming using the measured-power loop (handles tap/PD/AFE drift and small actuator nonlinearity).
  • Guardrails: rate-limit command changes, clamp integrators, and define deterministic “safe states” if convergence fails or sensors are invalid.
  • Practical outcome: errors left by the coarse layer show up as integrator saturation and oscillation in the fine loop; the fine loop should never be tasked with “fixing topology mistakes.”

4) Repeatability breakdown: symptom → evidence → fix

  • Resolution / quantization: power responds in steps; fix via smaller command increments or iterative micro-trim with feedback.
  • Hysteresis / backlash: up/down approaches require different commands; fix via bidirectional LUTs or fixed-direction approach strategies.
  • Temperature drift: systematic offset vs temperature; fix via temperature-binned LUTs and controlled update cadence (slow correction loop).
  • Aging drift: the command needed for the same outcome creeps over weeks or months; fix via re-calibration triggers and telemetry-based health scoring (trend + thresholds).
Implementation hint: keep WSS “control dimensions” explicit in telemetry (route preset, attenuation code, shape preset) so field evidence can separate measurement errors from actuator non-repeatability.
Figure F3 — WSS control dimensions: route, attenuation, and passband shaping (plus coarse/fine layering)
[Figure F3 diagram: control inputs (route map: port select / degree; attenuation: target dB / code; passband shape: preset width / tilt) → WSS control interface (config + limits, LUT code → dB, coarse → fine: route/shape then power trim) → verifiable outputs (path state: through / drop / block; channel power: measured dBm + error; after-effect KPIs: repeatability / IL drift). Repeatability knobs to track and compensate: resolution, hysteresis, temperature drift, aging drift.]

The diagram stays “control-level”: it lists what can be set, what can be verified, and which knobs determine repeatability. Hardware physics is intentionally abstracted into measurable behaviors and calibration requirements.

H2-4 · VOA Roles & Stability

VOA Roles: Power Equalization, Transient Handling, and Stability

A VOA is not “just attenuation.” In ROADM control it is a stabilizing actuator that (1) flattens steady-state power error and (2) constrains switching transients so alarms do not fire on expected reconfiguration events.

1) Two jobs, two operating modes: steady-state vs transient

  • Steady-state flattening: reduce power error (dBm) or channel imbalance (ΔdB) to a defined band at the chosen measurement point.
  • Transient handling: during route changes and restarts, apply ramp/limits so measured power stays within safe envelopes and does not cause alarm storms.
  • Engineering contract: steady-state is judged by final error and drift; transient mode is judged by overshoot, settle time, and safe-state compliance.

2) Setpoint strategy: dBm vs dB + granularity

  • Absolute (dBm) target: requires calibrated measurement truth; best when the node must deliver a known boundary power.
  • Relative (dB) target: robust to sensor offset; best for flattening and consistency when absolute accuracy is not guaranteed.
  • Per-degree vs per-channel: finer granularity increases AFE channel count, LUT size, temperature compensation complexity, and telemetry volume.
  • Scope guard: never promise per-channel control without a measurement topology that can observe and validate per-channel behavior.

3) Stability pitfalls (and how they show up): practical mechanisms

  • Sampling too slow: feedback arrives late, producing chase/oscillation. Symptom: command changes lag measured swings.
  • Over-filtering: reduces noise but adds delay; symptom: slow convergence followed by overshoot when the filter “catches up.”
  • Integrator windup: after switching windows or sensor invalid intervals, the integrator saturates; symptom: large overshoot right after validity returns.
  • Command step too large: actuator increments are coarse; symptom: repeated over/under-correction (sawtooth behavior).

4) Protection logic: ramp · limit · blanking

  • Ramp (setpoint slew): limit dB/s or dBm/s so the measurement chain and actuator can follow without overshoot.
  • Command limit: clamp per-update actuator changes (code/step per cycle) to prevent instability under noisy measurements.
  • Blanking/debounce: suppress alarms during a defined re-route window; log the window so evidence remains auditable.
  • Safe fallback: if convergence fails, enter a deterministic state (hold/park/through) and expose a clear fault code + recovery sequence.
Tuning workflow: validate the measurement chain first (noise + drift), then tune ramp/limits for switching safety, and finally tune the steady-state loop for convergence without windup. This order prevents “fixing stability” by hiding sensor problems.
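
The command limit, the integrator clamp, and the hold-on-invalid behavior described above can be combined in one small controller update. The following is a sketch under stated assumptions, not a tuned design: the class name, the gains, and the sign convention (a positive command raises measured power by the same number of dB) are all hypothetical, and a real VOA command would map through a calibration LUT.

```python
from dataclasses import dataclass


@dataclass
class PiLimiter:
    """PI update with integrator clamp, per-update step limit, and hold."""
    kp: float
    ki: float
    dt_s: float
    integ_limit: float      # clamp on integrator state (command units)
    max_step: float         # largest command change allowed per update
    integ: float = 0.0
    command: float = 0.0

    def update(self, setpoint_db: float, measured_db: float,
               valid: bool = True) -> float:
        if not valid:
            # Sensing invalid: no integration and no actuator motion.
            return self.command
        error = setpoint_db - measured_db
        self.integ += self.ki * error * self.dt_s
        # Anti-windup: keep the integrator inside a fixed band.
        self.integ = max(-self.integ_limit, min(self.integ_limit, self.integ))
        desired = self.kp * error + self.integ
        # Command limit: bound the change applied in this cycle.
        step = max(-self.max_step, min(self.max_step, desired - self.command))
        self.command += step
        return self.command


# Hypothetical plant: measured power = -12 dBm + command (in dB).
ctrl = PiLimiter(kp=0.2, ki=2.0, dt_s=0.1, integ_limit=5.0, max_step=0.5)
first = ctrl.update(-10.0, -12.0)
held = ctrl.update(-10.0, -30.0, valid=False)   # bad sample is ignored
for _ in range(300):
    ctrl.update(-10.0, -12.0 + ctrl.command)
final_power = -12.0 + ctrl.command
```

In this example the first update is cut from 0.8 dB to the 0.5 dB step limit, the invalid sample leaves both the command and the integrator untouched, and the loop then settles on the -10 dBm target.
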
Figure F4 — VOA closed-loop time behavior: setpoint, measured power, and actuator command (ramp/limit/blanking)
[Figure F4 plot: level vs time for the setpoint (ramped, slew-limited), measured power, and actuator command (limited), with the re-route window and blanking interval marked against the alarm threshold. Overshoot is kept bounded by ramp + limits. Design goal: converge without oscillation, avoid false alarms during switching, and prevent integrator windup via clamps.]

The plot is intentionally “control-level”: ramp limits how fast targets change, command limits bound actuator steps, and blanking prevents alarms from triggering on expected reconfiguration transients.

H2-5 · Optical-Power AFE

Optical-Power AFE Design: PD Front-End, Dynamic Range, and Error Budget

Power control quality is bounded by measurement truth. This section breaks the tap-to-digital chain into range, offset, consistency, and noise, so “inaccurate power” can be traced to a specific stage: tap coupler, photodiode, TIA/AFE, ADC/reference, or digital filtering and calibration.

1) Measurement chain map: tap → PD → AFE → ADC → digital

  • Tap coupler: defines how much optical power is sampled. Primary risks are ratio tolerance and slow drift (contamination / connector state).
  • Photodiode (PD): converts optical power to current. Primary risks are responsivity drift (temperature) and rising dark current at high temperature.
  • TIA / power AFE: converts current to voltage. Primary risks are gain error, offset, and front-end noise that appears as power jitter.
  • ADC + Vref: digitizes the signal. Primary risks are reference drift (dominates long-term offset) and nonlinearity that breaks LUT assumptions.
  • Digital filter & calibration: improves stability but adds delay. Poor settings can cause slow convergence or “chasing” in closed-loop control.

2) Dynamic-range decomposition: avoid saturation + avoid noise floor

  • Start from the controlled point: expected min/max optical power (dBm) at the tap location sets the required measurement span.
  • Tap ratio tolerance: worst-case ratio shifts the AFE input range; budget headroom so the “high-power corner” does not clip.
  • PD current range: low-power corner must stay above dark-current and noise-dominated regions; high-power corner must not overload the TIA/ADC.
  • TIA gain choice: higher gain improves low-power resolution but reduces headroom. Range decisions should be made before chasing higher ADC bits.
  • Filtering as a trade: stronger filtering reduces jitter but increases latency; choose a bandwidth that matches the loop’s stability margin (links to ramp/limits).
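
The decomposition above can be checked numerically before any hardware is chosen. The sketch below pushes the high-power and low-power corners through the tap, PD, and TIA and reports whether the top clips and how many ADC codes remain at the bottom. Every value (a 0 to -30 dBm span, 1% ±10% tap, 1.0 A/W, 100 kΩ, 2.5 V full scale, 16 bits) is an illustrative assumption, and dark current and noise are ignored.

```python
def dbm_to_watts(p_dbm: float) -> float:
    """Convert dBm to watts."""
    return 1e-3 * 10 ** (p_dbm / 10)


def tia_volts(line_dbm: float, tap_fraction: float,
              responsivity_a_per_w: float, tia_ohms: float) -> float:
    """TIA output for a given line power: P * tap * responsivity * R."""
    return (dbm_to_watts(line_dbm) * tap_fraction
            * responsivity_a_per_w * tia_ohms)


def range_check(min_dbm: float, max_dbm: float, *, tap_nominal: float,
                tap_tol: float, responsivity_a_per_w: float,
                tia_ohms: float, adc_fullscale_v: float,
                adc_bits: int) -> dict:
    """Evaluate both corners with the worst-case tap ratio for each."""
    v_high = tia_volts(max_dbm, tap_nominal * (1 + tap_tol),
                       responsivity_a_per_w, tia_ohms)
    v_low = tia_volts(min_dbm, tap_nominal * (1 - tap_tol),
                      responsivity_a_per_w, tia_ohms)
    lsb_v = adc_fullscale_v / 2 ** adc_bits
    return {
        "v_high": v_high,
        "clips": v_high >= adc_fullscale_v,
        "codes_at_min": v_low / lsb_v,
    }


report = range_check(-30.0, 0.0, tap_nominal=0.01, tap_tol=0.10,
                     responsivity_a_per_w=1.0, tia_ohms=100e3,
                     adc_fullscale_v=2.5, adc_bits=16)
```

With these assumed numbers the high corner sits at about 1.1 V with headroom to spare, while the low corner lands on only a couple of dozen codes, which is the situation the "range before ADC bits" rule is meant to expose.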

3) Error budget template: absolute · relative · short-term

  • Absolute offset (dBm): tap ratio calibration, PD responsivity, TIA gain, and Vref drift accumulate into a steady power shift.
  • Channel-to-channel consistency (ΔdB): multi-channel AFE matching and thermal gradients determine how flat the node can equalize.
  • Short-term noise (jitter): TIA/ADC noise and EMI coupling set the reading jitter that can drive control dithering and false alarms.
  • Engineering rule: use absolute budgets for “delivered boundary power” claims, and relative budgets for flattening and repeatability.

4) Environmental drift discrimination: trend → root cause

  • Temperature: systematic drift that correlates with temperature points to PD responsivity or AFE gain/offset tempco; handle via temperature-binned LUTs.
  • Humidity / contamination: slow monotonic drift on a single path often points to optical coupling changes rather than Vref drift.
  • Self-check hooks: periodic “dark/zero” sampling windows or reference checkpoints (if present) separate AFE drift from optical-path drift.
  • Evidence discipline: log temperature, Vref status, and measured-power trends so the system can prove whether the drift is common-mode or path-specific.
Design mindset: range decisions come first (prevent clipping), then match and stabilize references (reduce long-term offset), then tune filtering for jitter without destabilizing closed-loop power control.
Figure F5 — Power-measurement chain error map (stacked sources along tap → PD → AFE → ADC/Vref → digital)
[Figure F5 diagram: error sources along the chain. Tap coupler: ratio tolerance, path drift. PD: tempco, dark current. TIA / AFE: gain error, offset + noise. ADC + Vref: INL / range, Vref drift. Digital filter: delay, LUT. These propagate into three error-budget outputs that control can trust: absolute offset (dBm shift), consistency (ΔdB flatness), and short-term noise (reading jitter).]

Use this map to localize “wrong power”: common-mode drift points to Vref/AFE, while path-specific slow drift often points to tap/optical coupling. Short-term jitter usually originates from TIA/ADC noise or EMI coupling.

H2-6 · Actuator Drivers

Actuator Drivers: Stepper/Microstepping and Thermal/TEC Control

In optics, “driver quality” is measured by repeatable positioning and stable temperature without introducing vibration, audible noise, or measurement interference. This section focuses on what matters for stepper and thermal channels: current regulation, microstepping linearity, hold strategies, loop limiting, and fail-safe recovery.

1) Stepper driver essentials: position repeatability

  • Current regulation accuracy: torque margin and repeatability depend on accurate phase current; poor regulation increases missed-step risk.
  • Microstepping linearity: nonlinearity creates small-step “wobble” that appears as optical power jitter after settling.
  • Decay mode / ripple: choices affect audible noise, EMI, and heating; optics-sensitive assemblies benefit from low-ripple behavior.
  • Step-loss detection (if available): converts “silent misalignment” into a logged fault, enabling deterministic recovery procedures.

2) Mechanics-to-optics translation: backlash · vibration · hold

  • Backlash / friction: the same target reached from different directions can land differently; mitigate via bidirectional compensation or fixed-direction approaches.
  • Vibration sensitivity: overly aggressive stepping near the final target can excite resonance; use settle windows and smaller terminal steps.
  • Hold strategy: holding current improves stiffness but increases heating and drift; reduced-hold or sleep modes cut heating and drift but give up stiffness, so the assembly must tolerate disturbances.
  • State machine pattern: move → settle → hold/relax → monitor, with clear transitions on alarms and re-route events.

3) Thermal / TEC / heater control: slow + stable

  • Power stage choice: PWM offers efficiency but can inject ripple; linear modes are cleaner but dissipate more heat—select based on noise sensitivity.
  • Loop behavior: thermal plants have lag; limiting and anti-windup prevent overshoot and oscillation during step changes.
  • Sensor placement: distance and thermal coupling create delay; account for lag in control cadence and limit settings.
  • Practical tuning: validate sensor integrity, then limit output slew, then tune the steady-state controller for bounded settling.

4) Fail-safe & recovery: park · protect · restore

  • Protection: over-current, over-temperature, open/short detection should force deterministic output states and raise unambiguous fault codes.
  • Safe state: define a “park/disable/hold” behavior for each actuator to prevent uncontrolled optical drift during faults.
  • Recovery flow: sensor self-check → driver enable → homing/park verify → reload LUTs → re-enter closed-loop control.
  • Evidence: log last command, current/temperature snapshots, and whether step-loss or thermal saturation occurred.
Control coupling: stepper and thermal subsystems are not independent—heating changes mechanics and alignment, while vibration can corrupt power readings. Driver strategy must be coordinated with the power-measurement chain and loop limits.
Figure F6 — Actuator driver channels: command path and feedback path (stepper + thermal/TEC)
[Figure F6 diagram: controller (DAC / PWM outputs, state machine) → limits + safe state (slew / clamp, park / disable) → stepper driver (current regulation, microstepping) → motor / mechanism (backlash / friction, hold strategy), and → thermal driver (PWM / linear, PID + limits) → heater / TEC (thermal lag, sensor placement). Feedback sensing (current sense, temperature sense) returns to the controller through the ADC.]

The diagram emphasizes what optics needs: stable actuation with bounded noise and clear recovery states. Sensing (current + temperature) closes the loop and makes faults diagnosable instead of silent.

H2-7 · Calibration & Compensation

Calibration & Compensation: LUTs, Temperature Drift, Hysteresis, and Aging

“Set dB” becomes real only after two contracts are enforced: (1) measurement calibration (raw ADC to power estimate), and (2) actuation calibration (target attenuation to stable driver commands). This section shows how LUTs, temperature compensation, hysteresis modeling, and drift monitoring keep ROADM power control accurate over months and years.

1) What is calibrated: measurement + actuation

  • Measurement calibration: raw ADC → calibrated reading. Covers offset/gain, channel matching, and reference-driven long-term drift.
  • Actuation calibration: target dB (or dBm) → command. Captures nonlinear attenuation curves and maps targets into stable actuator inputs.
  • Verification lens: measurement calibration is judged by offset, consistency (ΔdB), and noise/jitter; actuation calibration is judged by repeatability and settling behavior.

2) LUT design for attenuation: LUT(T, dir)

  • Nonlinearity first: LUTs convert “desired attenuation” into “effective commands” that match real optical behavior.
  • Temperature indexed: store LUT slices per temperature bin or apply compact correction terms; keep updates in a slow loop to avoid control jitter.
  • Direction aware: hysteresis requires up/down LUTs or direction-conditioned correction so repeated setpoints land consistently.
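
One way to picture LUT(T, dir) is as a set of code tables keyed by temperature bin and approach direction, interpolated first along attenuation and then across temperature. The sketch below is a minimal illustration under that assumption; the class name, the table layout, and all table values are made up for the example.

```python
from bisect import bisect_right


def interp(xs, ys, x):
    """Piecewise-linear interpolation over sorted xs, clamped at both ends."""
    if x <= xs[0]:
        return float(ys[0])
    if x >= xs[-1]:
        return float(ys[-1])
    i = bisect_right(xs, x)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


class AttenuationLut:
    """Map target attenuation (dB) to a driver code for a given T and direction."""

    def __init__(self, atten_db, slices):
        self.atten_db = atten_db      # shared attenuation grid, ascending
        self.slices = slices          # {(temp_c, direction): [codes...]}

    def command(self, target_db: float, temp_c: float, direction: str) -> int:
        temps = sorted(t for (t, d) in self.slices if d == direction)
        # Interpolate along attenuation inside each temperature slice...
        codes = [interp(self.atten_db, self.slices[(t, direction)], target_db)
                 for t in temps]
        # ...then across temperature between slices.
        return round(interp(temps, codes, temp_c))


# Hypothetical tables: two temperature bins, separate up/down curves.
lut = AttenuationLut(
    atten_db=[0.0, 5.0, 10.0],
    slices={
        (25.0, "up"): [0, 500, 1000],
        (25.0, "down"): [20, 520, 1020],
        (65.0, "up"): [40, 560, 1080],
        (65.0, "down"): [60, 580, 1100],
    },
)
code_up = lut.command(5.0, 45.0, "up")
code_down = lut.command(5.0, 45.0, "down")
```

The same 5 dB target yields different codes for the two directions, which is how a direction-aware table absorbs hysteresis instead of leaving it to the feedback loop.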

3) Temperature compensation strategy: priority + cadence

  • Priority: first stabilize the power estimate (PD responsivity, TIA gain/offset, Vref), then compensate actuator mapping (command→dB).
  • Cadence separation: apply lightweight temp correction per sample, but update model parameters slowly (seconds/minutes) to prevent oscillation.
  • Sensor lag: account for thermal delay and gradients; compensation should not assume temperature readings are instantaneous truth.

4) Hysteresis & backlash modeling: repeatability knobs

  • Bidirectional curves: maintain separate up/down LUT paths or apply direction-conditioned offsets near sensitive regions.
  • Pre-bias approach: intentionally approach a final setpoint from a consistent direction to reduce landing variance.
  • Control policy: define “move → settle → hold/relax” windows and log direction so repeatability can be proven and tuned.

5) Aging & drift monitoring: trend → trigger

  • Drift indicators: increasing offset trend, worsening ΔdB flatness, or rising command cost (more command for same outcome).
  • Trigger types: threshold (absolute error), slope (rate of change), and consistency (channel-to-channel statistics).
  • Recal order: validate measurement chain first, then local recal (per channel/module), then full recal only if necessary.
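
The three trigger types can be evaluated from a short history of logged offsets. The sketch below uses an ordinary least-squares slope for the rate-of-change trigger and a max-min spread for the consistency trigger; these estimators, the function names, and all limits are assumptions for illustration rather than prescribed choices.

```python
from typing import Sequence


def slope(times: Sequence[float], values: Sequence[float]) -> float:
    """Least-squares slope of values vs times (units of value per time)."""
    n = len(values)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    den = sum((t - mean_t) ** 2 for t in times)
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return num / den


def recal_triggers(times_days: Sequence[float], offset_db: Sequence[float],
                   channel_offsets_db: Sequence[float], *,
                   abs_limit_db: float, slope_limit_db_per_day: float,
                   spread_limit_db: float) -> dict:
    """Evaluate threshold, slope, and consistency triggers on logged data."""
    spread = max(channel_offsets_db) - min(channel_offsets_db)
    return {
        "threshold": abs(offset_db[-1]) > abs_limit_db,
        "slope": abs(slope(times_days, offset_db)) > slope_limit_db_per_day,
        "consistency": spread > spread_limit_db,
    }


# Hypothetical log: offset creeping 0.01 dB/day, still far below 0.5 dB.
fired = recal_triggers(
    times_days=[0, 1, 2, 3, 4],
    offset_db=[0.00, 0.01, 0.02, 0.03, 0.04],
    channel_offsets_db=[0.04, 0.05, 0.03],
    abs_limit_db=0.5, slope_limit_db_per_day=0.005, spread_limit_db=0.2,
)
```

Here only the slope trigger fires: the trend flags the drift long before the absolute threshold would.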

6) Practical evidence for field robustness: logs that matter

  • Store: temperature, Vref health, calibrated power, command, direction, settle time, and residual error after settle.
  • Correlate: common-mode drift hints at AFE/Vref; path-specific monotonic drift hints at coupling contamination or module-specific aging.
  • Prove: drift detection should be explainable via logs, not only via end-user alarms.
Key concept: keep compensation in a slow loop and keep the closed-loop power controller stable. Recalibration must be triggered by trends, not by instantaneous noise.
Figure F7 — LUT + temperature compensation structure (measurement and actuation contracts + re-cal triggers)
[Figure F7 diagram: measurement contract (raw ADC → calibration: offset + gain → temperature compensation for PD/TIA/Vref, fed by a temperature sensor → power estimate: dBm / ΔdB / jitter); actuation contract (target dB / dBm → LUT(T, dir) nonlinear map → driver command: step / code / PWM → actuator outcome: attenuation realized); drift monitor and recalibration trigger (trend metrics: offset / ΔdB / cost → trigger rules: threshold / slope → recal routine: local → full update).]

The diagram separates fast control (power loop) from slow correctness (temperature compensation and drift-based re-cal), preventing noise-driven “self-relearning” while keeping long-term accuracy.

H2-8 · Control Firmware Architecture

Control Firmware Architecture: State Machines, Safety Interlocks, and Sequencing

ROADM incidents often happen during transitions: boot, add/drop changes, re-route events, or fault recovery. A robust firmware architecture enforces a strict order: sensing validity → calibrated estimates → closed-loop enable → bounded actuation. This section defines state machines, interlocks, and rate separation so power control remains safe and deterministic.

1) State machine definition: entry/exit rules

  • Boot: initialize clocks, power rails, and interfaces; no actuator motion permitted.
  • Self-test: validate ADC/Vref, temperature sensors, and driver presence; produce explicit pass/fail codes.
  • Cal load: load calibration versions and LUT slices; reject stale or incompatible calibration sets.
  • Closed-loop enable: start with small-step limits and blanking windows; require stability for N cycles.
  • Normal: steady regulation, alarms armed; drift monitoring continues in the slow loop.
  • Fault/Degrade: deterministic degrade actions (hold/park/block) with defined recovery conditions.

2) Sequencing & interlocks: no valid sensing → no loop

  • Measure first: require Vref stable and ADC ready before declaring “power estimate valid.”
  • Enable second: closed-loop may only start after calibration is loaded and sensor health is confirmed.
  • Move last: actuator moves must be bounded by ramp, slew limits, and safe-state rules.
  • Transition guard: during re-route, temporarily adjust thresholds and rate limits to prevent false alarms and integral windup.
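
The states and interlocks above can be expressed as a pure transition function, which keeps the ordering rules testable in isolation. The sketch below is one possible encoding: the `Inputs` fields, the `n_stable` default of 10 cycles, and the choice to return to loop enable after recovery are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    BOOT = auto()
    SELF_TEST = auto()
    CAL_LOAD = auto()
    LOOP_ENABLE = auto()
    NORMAL = auto()
    FAULT = auto()


@dataclass
class Inputs:
    init_done: bool = False
    self_test_ok: bool = False
    cal_valid: bool = False
    sensing_valid: bool = False   # Vref stable and ADC ready
    stable_cycles: int = 0        # consecutive in-band control cycles
    fault: bool = False           # actuator fault or power-limit violation
    recovery_ok: bool = False     # e.g. fault clear and re-home OK


def next_state(state: State, inp: Inputs, n_stable: int = 10) -> State:
    """Return the next state; interlocks gate every forward transition."""
    if state is State.BOOT:
        return State.SELF_TEST if inp.init_done else State.BOOT
    if state is State.FAULT:
        cleared = inp.recovery_ok and inp.sensing_valid and not inp.fault
        return State.LOOP_ENABLE if cleared else State.FAULT
    if inp.fault:
        return State.FAULT
    if state is State.SELF_TEST:
        return State.CAL_LOAD if inp.self_test_ok else State.SELF_TEST
    if state is State.CAL_LOAD:
        ready = inp.cal_valid and inp.sensing_valid
        return State.LOOP_ENABLE if ready else State.CAL_LOAD
    # LOOP_ENABLE or NORMAL: losing valid sensing always exits the loop.
    if not inp.sensing_valid:
        return State.FAULT
    if state is State.LOOP_ENABLE and inp.stable_cycles >= n_stable:
        return State.NORMAL
    return state


def motion_allowed(state: State) -> bool:
    """Actuator motion is permitted only while the loop is enabled."""
    return state in (State.LOOP_ENABLE, State.NORMAL)
```

Because the function has no side effects, each interlock ("no valid sensing, no loop") can be checked with a one-line test instead of a hardware bring-up.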

3) Safety degrade policies: hold · park · block

  • Hold last: use for short sensor interruptions where actuation is still trusted; resume only after validity persists for N cycles.
  • Park/disable: use for actuator or driver faults, step-loss detection, or over-temperature events where continued motion risks drift.
  • Block/clamp: use for power limit violations where safety requires immediate suppression regardless of control objectives.
  • Recovery rules: explicit conditions such as “fault clear,” “re-home OK,” and “power stable” prevent oscillatory recovery loops.

4) Rate separation: fast loop vs slow loop

  • Fast loop: filtering, small-step correction, bounded actuator changes; handles short-term noise without destabilizing the system.
  • Slow loop: temperature compensation updates, LUT slice management, and drift monitoring; prevents noise-driven parameter churn.
  • Separation rule: slow-loop decisions must pass through fast-loop limiters, never direct large actuator steps.

5) Telemetry that prevents mystery faults: prove behavior

  • Log: state transitions, reason codes, sensor validity windows, limit activations, and last stable power bands.
  • Snapshot: power estimate, temperature, driver currents, and command values at every fault entry.
  • Explain: alarms should map to a clear state and a clear interlock, not to ambiguous “out-of-range” messages.
Engineering rule: never allow integral buildup or large actuator motion while sensing is invalid. A deterministic state machine plus interlocks converts transients into controlled transitions.
Figure F8 — ROADM control state machine (safe enable, degrade paths, and recovery conditions)
[Figure F8 diagram: Boot (init only) → Self-test (ADC/Vref) → Cal Load (LUT ok) → Loop Enable (limits + N) → Normal (steady). Sensor invalid (Vref/ADC/temp), actuator fault (OC/OT/step-loss), and power limit (clamp / block) route to Fault / Degrade (hold / park / block). Recovery gates (valid for N cycles, re-home OK) lead back to Loop Enable.]

The state machine enforces a strict order: sensing validity and calibration loaded come before loop enable, while faults route into deterministic degrade actions with explicit recovery gates.

H2-9 · Telemetry, Alarms & Field Evidence

Telemetry, Alarms, and Field Evidence: Proving Stability Over Months

Long-term stability is proven, not assumed. The minimum viable telemetry set must explain both slow drift and intermittent faults by preserving control context, measured outcomes, actuation intent, and environmental conditions. This section defines what to record, how to compute trend metrics, and how to reduce false alarms without hiding real failures.

1) Minimum viable telemetry: fields that explain

  • Control context: channel ID, state, mode, setpoint (dB/dBm), timestamp.
  • Observation: measured power, filtered estimate, estimate-valid flag.
  • Actuation: actuator command (step/code/PWM), direction, ramp/limit active.
  • Environment: temperature, driver current, Vref/rail health, supply status.
  • Alarms: alarm code, latched/not, debounce counters, blanking active.
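
The five field groups above could be carried in one record per sample. The sketch below is illustrative only; the class name `TelemetryFrame`, the field names, and the units are assumptions, not a defined schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TelemetryFrame:
    # Control context
    timestamp_s: float
    channel_id: int
    state: str
    mode: str
    setpoint_db: float
    # Observation
    measured_db: float
    estimate_db: float
    estimate_valid: bool
    # Actuation
    command: int
    direction: int            # +1, -1, or 0 (assumed encoding)
    limit_active: bool
    # Environment
    temperature_c: float
    driver_current_ma: float
    vref_ok: bool
    # Alarms
    alarm_code: int = 0
    alarm_latched: bool = False
    debounce_count: int = 0
    blanking_active: bool = False

    def error_db(self):
        """Setpoint minus filtered estimate, or None when the estimate is invalid."""
        if not self.estimate_valid:
            return None
        return self.setpoint_db - self.estimate_db
```

Returning `None` for an invalid estimate, rather than a stale number, keeps downstream trend metrics from silently averaging bad samples; `asdict(frame)` gives a flat dictionary for export.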

2) Trend metrics for months of proof: ΔdB · cmd/day · RMS

  • Channel consistency (ΔdB): quantify flatness using max-min or percentile spread across channels.
  • Command drift rate: slope of command vs time under constant setpoint (early aging/contamination indicator).
  • Closed-loop error RMS: statistical error under steady windows (separates noise from real control bias).
  • Alarm quality: ratio of debounced alarms vs raw threshold hits (tracks false-alarm pressure).
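
As a rough illustration of the first three metrics, the functions below compute a max-min spread, a least-squares command slope, and an RMS error. The function names and the choice of units (days, actuator codes, dB) are assumptions for the example.

```python
import math


def channel_spread_db(powers_db):
    """Max-min spread across channels (the simplest ΔdB flatness figure)."""
    return max(powers_db) - min(powers_db)


def command_drift_per_day(times_days, commands):
    """Least-squares slope of actuator command vs time, in codes per day."""
    n = len(times_days)
    mean_t = sum(times_days) / n
    mean_c = sum(commands) / n
    num = sum((t - mean_t) * (c - mean_c) for t, c in zip(times_days, commands))
    den = sum((t - mean_t) ** 2 for t in times_days)
    return num / den


def error_rms_db(setpoint_db, measured_db):
    """RMS of (measured - setpoint) over a steady window."""
    return math.sqrt(sum((m - setpoint_db) ** 2 for m in measured_db) / len(measured_db))
```

A percentile spread can replace the max-min figure when a single outlier channel should not dominate the flatness metric.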

3) False-alarm reduction: threshold · debounce · blanking

  • Threshold: static limits for steady state; widened limits during transition windows if needed.
  • Debounce: require N consecutive samples or time T beyond threshold; store counters for evidence.
  • Blanking: on re-route/enable/move, gate alarms until power and command are stable for N cycles.
  • Three classes: real over-limit, transition artifact, and noise spike must be distinguishable in logs.
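
One possible way to combine the threshold, debounce, and blanking rules above is a single gate object that also keeps the counters needed as evidence. This is a sketch under assumed names (`AlarmGate`, `start_transition`, `update`) and a symmetric error limit.

```python
class AlarmGate:
    """Threshold + N-sample debounce + transition blanking, with evidence counters."""

    def __init__(self, limit_db, debounce_n, stable_n):
        self.limit_db = limit_db      # steady-state error limit
        self.debounce_n = debounce_n  # consecutive over-limit samples to alarm
        self.stable_n = stable_n      # in-limit samples needed to end blanking
        self.over_count = 0
        self.stable_count = 0
        self.blanking = False
        self.raw_hits = 0             # every threshold crossing, gated or not
        self.blanked_hits = 0         # crossings seen while blanking was active

    def start_transition(self):
        """Call on re-route / enable / move to open the blanking window."""
        self.blanking = True
        self.stable_count = 0
        self.over_count = 0

    def update(self, error_db):
        """Feed one sample; return True only for a debounced, unblanked alarm."""
        over = abs(error_db) > self.limit_db
        if over:
            self.raw_hits += 1
        if self.blanking:
            if over:
                self.blanked_hits += 1
                self.stable_count = 0
            else:
                self.stable_count += 1
                if self.stable_count >= self.stable_n:
                    self.blanking = False
            return False
        self.over_count = self.over_count + 1 if over else 0
        return self.over_count >= self.debounce_n
```

Because `raw_hits` and `blanked_hits` are kept separately from the debounced result, logs can still distinguish a real over-limit event from a transition artifact or a single noise spike.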

4) Field evidence replay: ring buffer snapshot

  • Pre-window: store short high-rate history (ring buffer) to capture the lead-up to a fault.
  • Fault instant: record state, interlock reason, alarm code, sensor validity, and command values.
  • Post-window: record recovery behavior (did it settle, oscillate, or re-enter fault?).
  • Bundle rule: one “fault package” must reconstruct setpoint/measured/command/temp/current over time.
Practical rule: long-term proof needs both rollups (slow, long retention) and snapshots (fast, short retention). A small, well-chosen field set is more valuable than noisy bulk logging.
Figure F9 — Telemetry dataflow (collect → aggregate → thresholds → ring buffer → export) with minimum required fields
[Figure F9 diagram: Telemetry Dataflow (Minimum Fields + Trends + Evidence). Sources (setpoint / estimate, command / temp / current) → Collector (timestamped frames, valid flags) → Minimum Fields (time · state · setpoint · measured · command · temp · current · alarm). Fast evidence path: Ring Buffer (pre / post window) → Event Snapshot (fault package). Slow proof path: Aggregator (rollups) → Trend Metrics (ΔdB · cmd/day). Alarms & export: Threshold (limits by mode) → Debounce (N samples / T) → Blanking Gate (transition window) → API / CLI.]

The pipeline separates fast evidence (ring-buffer snapshots) from slow proof (rollups and trends), while alarm gates (debounce + blanking) reduce false positives without losing forensic detail.
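
The fast evidence path can be sketched with a bounded deque for the pre-window and a short capture for the post-window. `FaultRecorder` and its fields are hypothetical names; frames are treated as opaque objects, and `post_len` is assumed to be at least 1.

```python
from collections import deque


class FaultRecorder:
    """Keeps a rolling pre-window and emits one 'fault package' per fault."""

    def __init__(self, pre_len, post_len):
        self.pre = deque(maxlen=pre_len)  # ring buffer: oldest frames drop off
        self.post_len = post_len          # assumed >= 1
        self.pending = None               # package still collecting post frames
        self.packages = []                # completed fault packages

    def record(self, frame, fault_reason=None):
        """Store one frame; pass fault_reason on the frame where a fault is entered."""
        if self.pending is not None:
            # Still inside a post-window: later faults are folded into this package.
            self.pending["post"].append(frame)
            if len(self.pending["post"]) >= self.post_len:
                self.packages.append(self.pending)
                self.pending = None
        elif fault_reason is not None:
            self.pending = {
                "reason": fault_reason,
                "pre": list(self.pre),    # snapshot of the lead-up
                "fault": frame,
                "post": [],
            }
        self.pre.append(frame)
```

Folding faults that occur during the post-window into the open package is a simplification; a real recorder might instead extend the window or chain packages.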

H2-10 · Validation & Production Checklist

Validation & Production Checklist: How to Know It’s Done

A ROADM control design is “done” only when it is verifiable at three levels: R&D validation (stability and robustness), production (repeatable calibration and actuator health), and field acceptance (self-test, re-calibration, and safe rollback). This section provides measurable test outputs and a stage-by-stage matrix for sign-off.

1) R&D validation: dynamics · temp · repeat

  • Step response: overshoot, settle time, steady error, limit activations under controlled setpoint changes.
  • Temperature sweep: scan temperature bins and record offset, ΔdB spread, command drift, and stability margins.
  • Repeatability: approach setpoints from different initial conditions and directions; quantify landing variance.
  • Fault drills: sensor invalid, actuator fault, and power limit scenarios must enter deterministic degrade states.
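
For the step-response item, overshoot, settle time, and steady error can be extracted from a logged trace roughly as follows. The function name, the ±0.1 dB settle band, and the trace format are assumptions for illustration.

```python
def step_metrics(times_s, measured_db, start_db, target_db, band_db=0.1):
    """Return (overshoot_db, settle_time_s, steady_error_db) for one setpoint step.

    settle_time_s is measured from the first sample and is None if the trace
    never stays inside the +/- band_db window around the target.
    """
    direction = 1.0 if target_db >= start_db else -1.0
    # Overshoot: largest excursion past the target in the direction of the step.
    overshoot_db = max(0.0, max((m - target_db) * direction for m in measured_db))
    # Settle time: first sample after the last out-of-band sample.
    last_bad = -1
    for i, m in enumerate(measured_db):
        if abs(m - target_db) > band_db:
            last_bad = i
    if last_bad + 1 < len(measured_db):
        settle_time_s = times_s[last_bad + 1] - times_s[0]
    else:
        settle_time_s = None
    steady_error_db = measured_db[-1] - target_db
    return overshoot_db, settle_time_s, steady_error_db
```

Example: a step from -10 dB to -3 dB whose trace reads -10, -4, -2.5, -3.05, -3.0 at one-second intervals overshoots by 0.5 dB and settles at the 3 s sample.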

2) Production checklist: fast · traceable

  • Zero/offset: dark/zero points captured; baseline noise recorded and checked against limits.
  • Reference points: gain/offset calibration written with versioning; CRC checked after programming.
  • LUT integrity: LUT(T) slices or coefficients validated; incompatible versions rejected.
  • Actuator health: limits, current/temperature sensing, and basic homing/parking verified.
  • Traceability: serial number binds to calibration version, date, and test result summary.

3) Field acceptance: self-test · re-cal · rollback

  • Self-test: verify sensor validity, driver health, and trend anomalies without disrupting normal operation.
  • Re-cal strategy: local updates first (channel/module) with minimal downtime; reject changes that worsen error.
  • Rollback: keep active + previous parameter sets; revert on post-cal failure, unstable loops, or alarm storms.
  • Fail-safe: failures must land in hold/park/block deterministically, with explicit recovery gates.

4) Required test output headers: fields only

  • Dynamics: step ID, setpoint, measured, overshoot, settle time, steady error, temp, limit flags.
  • Temp sweep: temp bin, offset, ΔdB spread, cmd drift, Vref status, pass/fail tag.
  • Repeatability: start state, direction, final error, settle time, repeats, hysteresis tag.
  • Production: serial, cal version, CRC, noise baseline, actuator current, sensor checks.
Sign-off rule: every stage must output machine-checkable artifacts (logs, CRC, matrix results). If a test cannot produce evidence, it cannot close risk.
Figure F10 — Validation matrix (rows = tests, columns = R&D / Production / Field) with checkboxes and evidence artifacts
[Figure F10 matrix: Validation Matrix (R&D · Production · Field) + Evidence. Columns: Test Item, R&D, Production, Field, Evidence. Rows and their evidence artifacts:
  • Step response (overshoot / settle): log/report
  • Temperature sweep (offset / ΔdB / drift): scan log
  • Repeatability (direction / hysteresis): stats
  • Calibration programming (version / CRC): CRC record
  • Actuator health (limits / current / temp): self-test
  • Field re-calibration + rollback safety: snapshot
In the figure, checkboxes indicate where each test is required; the Evidence column lists the artifact to retain.]

The matrix converts “done” into stage-specific evidence: dynamics and robustness in R&D, programming integrity in production, and safe re-cal/rollback behavior in the field.

H2-11 · Failure Modes & Troubleshooting Playbook

Failure Modes & Troubleshooting Playbook (Symptom → Cause → Evidence → Fix)

This playbook turns common ROADM control issues into repeatable diagnostics. Each case uses a fixed four-line template: Symptom, Likely causes, What to check, and Corrective action, with example suspect parts to speed isolation. Keep evidence aligned to the minimum telemetry set: time, state, setpoint, measured, command, temperature, current, and alarm.

1) Power reading is jumpy / noisy

Subsystem: Measurement · Confidence: High
  • Symptom: Measured power shows fast jitter; alarm counters may “flicker” even with stable setpoints.

  • Likely causes: PD bias instability, TIA noise pickup, reference drift/noise, ADC configuration, grounding/return coupling, insufficient or poorly tuned digital filtering.

  • What to check: Freeze actuator command and observe measured RMS; correlate jitter to temp/current; verify estimate-valid flag; compare raw vs filtered samples.

  • Corrective action: Stabilize PD bias and reference routing; improve decoupling/return paths; adjust ADC sampling and digital filters; add transition blanking if jitter is switch-induced.

    Suspect parts (examples): TI DDC112/DDC232, ADI AD7124-8, TI ADS124S08, ADI ADR4525/ADR4550, TI REF5025/REF5050

2) Power reading offset shifts (all channels move together)

Subsystem: Measurement · Confidence: High
  • Symptom: Multiple channels show a similar dB/dBm offset change after reboot, temperature change, or long uptime.

  • Likely causes: Reference (Vref) drift, calibration version mismatch, incorrect zero/dark-current handling, partial initialization after brownout, shared rail noise coupling.

  • What to check: Compare current calibration version/CRC to expected; inspect reset cause and rail-valid flags; run a dark/zero check point if available; verify Vref health telemetry.

  • Corrective action: Roll back to previous parameter set; enforce sequencing (Vref/ADC ready before closed-loop); add CRC + dual-image persistence; tighten Vref filtering and layout.

    Suspect parts (examples): ADI ADR4525/ADR4550, TI REF5025/REF5050, TI TPS3823/TPS3839, Maxim MAX706

3) Closed-loop oscillation

Subsystem: Control · Confidence: High
  • Symptom: Measured power and command show periodic swings around the setpoint; alarms may latch after repeated crossings.

  • Likely causes: Loop gain too high, sampling/actuation latency, insufficient filtering, integral windup, actuator deadband + aggressive integrator, missing ramp limits on transitions.

  • What to check: Overlay setpoint/measured/command over time; detect command saturation/limit flags; verify state transitions (blanking gate) during re-route/enable; compare fast vs slow loop rates.

  • Corrective action: Reduce integral gain and add anti-windup; separate fast small-step loop from slow temperature compensation; add ramp/limiters and transition blanking; increase measurement filtering (without hiding real drift).

    Suspect parts (examples): (Control-dominant) verify ADC sampling path and actuator driver current evidence with TI INA240; validate supervisor behavior with TI TPS3823
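
The corrective actions for oscillation (anti-windup, ramp limits, no integration while sensing is invalid) can be combined in one small controller. This is a hedged sketch, not a tuned design: `PiController` and its parameters are hypothetical, the anti-windup method shown is conditional integration, and the output is an abstract drive value where a positive error calls for a larger output.

```python
class PiController:
    """PI loop with conditional-integration anti-windup and a per-cycle ramp limit."""

    def __init__(self, kp, ki, out_min, out_max, max_step, initial=0.0):
        self.kp, self.ki = kp, ki
        self.out_min, self.out_max = out_min, out_max
        self.max_step = max_step      # largest output change allowed per cycle
        self.integral = 0.0
        self.output = initial

    def update(self, error_db, sensor_valid):
        """error_db = setpoint - measured. Returns the new (rate-limited) output."""
        if not sensor_valid:
            # Hold: no integration and no actuator motion while sensing is invalid.
            return self.output
        candidate = self.integral + self.ki * error_db
        raw = self.kp * error_db + candidate
        clamped = min(self.out_max, max(self.out_min, raw))
        # Anti-windup: keep integrating only if unsaturated, or if the error
        # is pulling the output back away from the limit it is pinned against.
        unwinding = (raw > self.out_max and error_db < 0) or \
                    (raw < self.out_min and error_db > 0)
        if raw == clamped or unwinding:
            self.integral = candidate
        # Ramp limit: bound the per-cycle change in the commanded output.
        delta = max(-self.max_step, min(self.max_step, clamped - self.output))
        self.output += delta
        return self.output
```

With a large error the output climbs by at most `max_step` per cycle and the integral stays frozen while the unclamped demand is saturated, which is the behavior the playbook asks for before any gain retuning.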

4) Slow convergence / never reaches target

Subsystem: Control · Confidence: Medium
  • Symptom: Error decays very slowly or stalls at a non-zero level; command creeps without producing expected optical change.

  • Likely causes: Overly conservative step limits, actuator static friction, deadband in command→attenuation mapping, measurement averaging too long, integrator clamped by safety limits.

  • What to check: Compare command increments vs measured response (gain); check limiter/anti-windup flags; test a controlled “breakaway” step to cross static friction; verify direction-dependent behavior.

  • Corrective action: Use two-stage moves (breakaway then fine trim); tune limits by mode; refine LUT slopes near deadband; shorten averaging during acquisition, then increase smoothing in steady state.

    Suspect parts (examples): Trinamic TMC2209/TMC5160, TI DRV8711 (actuator response), TI INA240 (current evidence)

5) Channel-to-channel mismatch (ΔdB too large)

Subsystem: Calibration · Confidence: High
  • Symptom: With the same strategy, some channels remain consistently high/low; ΔdB spread grows with temperature or direction changes.

  • Likely causes: Tap ratio tolerance, LUT mismatch, missing temperature compensation, hysteresis not modeled (up vs down curves), inconsistent sensor placement or thermal gradients.

  • What to check: Plot ΔdB vs temperature; test approach from both directions to expose hysteresis; verify LUT(T) bin selection and CRC; compare sensor locations and thermal lag.

  • Corrective action: Add bi-directional LUTs; increase calibration points for consistency; prioritize temperature compensation for PD/Vref first, then actuator; enforce per-channel offsets with version control.

    Suspect parts (examples): TMP117 / ADT7420 (temperature sensing), I²C EEPROM 24LCxx (parameter storage), ADR4525/REF5025 (shared reference)
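
A bi-directional LUT, as suggested in the corrective action, can be as simple as two interpolation tables selected by approach direction. The sketch below assumes each table is a sorted list of (attenuation_dB, command) points; `BidirectionalLut` and the example numbers are hypothetical.

```python
from bisect import bisect_right


def interp(points, x):
    """Piecewise-linear interpolation over sorted (x, y) points, clamped at the ends."""
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_right(xs, x)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


class BidirectionalLut:
    """Separate attenuation->command curves for increasing and decreasing moves."""

    def __init__(self, up_points, down_points):
        self.up = up_points        # used when attenuation is being increased
        self.down = down_points    # used when attenuation is being decreased

    def command_for(self, target_db, current_db):
        table = self.up if target_db >= current_db else self.down
        return interp(table, target_db)
```

For example, with an up curve of `[(0, 0), (10, 1000)]` and a down curve offset by 40 codes, the same 5 dB target maps to 500 when approached from 2 dB and to 540 when approached from 8 dB.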

6) Actuator positioning is inaccurate (repeatability poor)

Subsystem: Actuation · Confidence: High
  • Symptom: Same command leads to different attenuation; large direction dependence; occasional “missed” moves with no matching optical response.

  • Likely causes: Missed steps, insufficient holding current, backlash/hysteresis, microstepping nonlinearity at low current, inadequate homing/parking logic.

  • What to check: Correlate command vs current and optical change; verify limit/homing counters; compare results from consistent approach direction; inspect thermal dependence (friction changes).

  • Corrective action: Tune motor current and microstepping mode; add homing/parking and approach-direction rules; apply post-move “settle and trim” with small steps; improve holding strategy after reaching target.

    Suspect parts (examples): Trinamic TMC5160/TMC2209, TI DRV8711, TI INA240

7) Actuator buzzing / heat is abnormal

Subsystem: Actuation · Confidence: Medium
  • Symptom: Audible noise, excess heating, or current ripple increases during hold or microstepping; optical output may jitter mechanically.

  • Likely causes: Wrong decay mode, overly high hold current, PWM frequency interacting with mechanics, unstable current regulation, insufficient thermal path.

  • What to check: Trend driver current and temperature; compare behavior across microstep/decay settings; check whether noise appears only in certain states (hold vs move).

  • Corrective action: Reduce hold current where possible; select quieter current-control modes; move PWM out of sensitive bands; enforce thermal limits and degrade strategies for prolonged hold.

    Suspect parts (examples): Trinamic TMC2209 (quiet modes), TI DRV8711 (current control flexibility), TMP117 (thermal evidence)

8) Thermal loop runs away or oscillates

Subsystem: Thermal · Confidence: High
  • Symptom: Temperature overshoots and hunts; TEC/heater command saturates; optical performance drifts with unstable temperature control.

  • Likely causes: Sensor placement lag, missing output limits, integral windup, incorrect polarity/sign, inadequate current limiting, poor thermal coupling.

  • What to check: Plot temperature vs command; confirm limit flags and saturation time; validate sensor location vs controlled element; verify polarity and fail-safe triggers.

  • Corrective action: Add explicit output limits + anti-windup; slow the loop to match thermal time constants; reposition sensors; enforce safe “park” on thermal faults with deterministic recovery gates.

    Suspect parts (examples): ADI ADN8834, Maxim MAX1968 (TEC control), TI TMP117 / ADI ADT7420 (temperature sensors)

9) False alarms during switching / re-route

Subsystem: Alarms · Confidence: High
  • Symptom: Alarms spike only during transitions; steady-state is clean; operators see “brief red” events.

  • Likely causes: Missing/short blanking window, debounce too short or inconsistent, thresholds not mode-aware, estimate-valid not used to gate alarms.

  • What to check: Confirm blanking active and debounce counters around the event; check state transitions and timing; verify whether alarms correlate with estimate-valid drops.

  • Corrective action: Implement mode-aware thresholds; debounce with explicit counters; apply blanking gates for known transition windows and re-enable only after power/command stability criteria are met.

    Suspect parts (examples): (Logic-dominant) validate supervisors/reset behavior: TI TPS3823/TPS3839; confirm timestamp integrity via system timebase

10) Long-term drift exceeds spec

Subsystem: Calibration · Confidence: Medium
  • Symptom: Over weeks/months, command drift rate increases; consistency degrades; re-cal events become frequent.

  • Likely causes: Optical contamination, mechanical wear, PD responsivity drift, reference aging, missing trend-based triggers, overly rare re-calibration.

  • What to check: Trend cmd/day, ΔdB spread, and error RMS; compare drift to temperature exposure; verify re-cal trigger thresholds and parameter version history.

  • Corrective action: Add trend-based re-cal triggers (slope/threshold); refresh LUT(T) bins; verify reference stability; apply maintenance actions for contamination and enforce safe rollback to last good calibration.

    Suspect parts (examples): ADR4525/REF5025 (reference aging), DDC112/AD7124-8 (measurement chain evidence), 24LCxx (parameter persistence)
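
A trend-based trigger can be expressed as a two-tier threshold check on the three trend metrics. The limits below are placeholder values chosen only to make the example concrete, and `recal_action` is a hypothetical name; real thresholds would come from the drift budget.

```python
def recal_action(cmd_per_day, spread_db, err_rms_db,
                 soft_limits=(2.0, 0.5, 0.2),
                 full_limits=(10.0, 1.5, 0.5)):
    """Map trend metrics to 'none', 'soft_correction', or 'full_recal'.

    Limits are (command drift in codes/day, channel spread in dB, error RMS in dB).
    All numeric defaults are illustrative assumptions.
    """
    metrics = (abs(cmd_per_day), spread_db, err_rms_db)
    if any(m > lim for m, lim in zip(metrics, full_limits)):
        return "full_recal"
    if any(m > lim for m, lim in zip(metrics, soft_limits)):
        return "soft_correction"
    return "none"
```

Checking the full-recalibration tier first means a single badly drifting metric escalates immediately, while mild drift only earns a bounded soft correction.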

11) Intermittent reset causes state loss

Subsystem: Firmware / Power · Confidence: High
  • Symptom: After a reset, power settings do not restore; closed-loop re-enables with wrong parameters or wrong sequence.

  • Likely causes: Brownout or watchdog resets, non-atomic parameter writes, missing CRC checks, state machine enabling closed-loop before sensors are valid.

  • What to check: Inspect reset-cause logs; verify parameter CRC/version; confirm boot sequence: rails/Vref/ADC ready → calibration load → enable loop; check last-known-good snapshot availability.

  • Corrective action: Implement dual-image parameter storage with CRC; add explicit boot gating; restore only after sensor validity; log a complete “boot evidence package” for forensic review.

    Suspect parts (examples): TI TPS3823/TPS3839, Maxim MAX706 (supervisor/watchdog), I²C EEPROM 24LCxx (persistence)
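
Dual-image storage with CRC can be illustrated in a few lines. The sketch uses JSON plus CRC-32 from Python's `zlib` purely for readability; the image layout, the `version` field, and the function names are assumptions, and embedded firmware would use a fixed binary layout instead.

```python
import json
import zlib


def pack_image(params, version):
    """Serialize parameters and append a 4-byte CRC-32 of the payload."""
    payload = json.dumps({"version": version, "params": params}, sort_keys=True).encode()
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unpack_image(image):
    """Return the decoded record, or None if the image is too short or fails CRC."""
    if len(image) < 5:
        return None
    payload, stored_crc = image[:-4], int.from_bytes(image[-4:], "big")
    if zlib.crc32(payload) != stored_crc:
        return None
    return json.loads(payload)


def load_best(image_a, image_b):
    """Pick the valid image with the highest version; None if both are bad."""
    valid = [r for r in (unpack_image(image_a), unpack_image(image_b)) if r is not None]
    return max(valid, key=lambda r: r["version"]) if valid else None
```

Writing the new image to the inactive slot and only then treating it as current means an interrupted write leaves the previous image intact, so `load_best` falls back to it on the next boot.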

Use the template consistently: when troubleshooting, always capture time, state, setpoint, measured, command, temperature, current, alarm. If any of these is missing, diagnosis becomes guesswork.
Figure F11 — Symptom-to-check decision tree (reading → sensor chain / positioning → driver & mechanics / thermal → loop)
[Figure F11 diagram: Decision Tree: Symptom → What to Check. Start from symptom (use time/state/setpoint/measured). Branches:
  • Reading issues (noise / offset / flicker) → Check (Measurement): ADC / Vref, PD bias / TIA, filter / averaging, ground / return.
  • Positioning issues (repeatability / missed move) → Check (Actuation): microstepping, hold current, backlash / hysteresis, homing / limits.
  • Thermal issues (runaway / oscillation) → Check (Thermal): sensor lag / placement, PID saturation, current limits, fail-safe gates.
Tip: Always capture time/state/setpoint/measured/command/temp/current/alarm before changing parameters.]

The decision tree routes symptoms into measurement, actuation, or thermal checks. Keep the first pass minimal and evidence-driven, then apply fixes in safe steps (limits, blanking, rollback-ready parameter updates).

H2-12 · FAQs ×12

FAQs (ROADM WSS/VOA Control, Power AFE, Calibration, Firmware)

These FAQs focus on what a ROADM node actually controls: wavelength routing/attenuation state, optical-power measurement accuracy, stable closed-loop behavior, and safe field operation. Answers are intentionally evidence-driven and map to the sections above.

1) What is the practical boundary between a ROADM WSS and a VOA?

In control terms, a WSS primarily sets wavelength-to-port routing and per-channel passband behavior (including discrete or calibrated attenuation), while a VOA is the continuous attenuation element used for power equalization and transient limiting. Both can affect power, but WSS defines routing/granularity, and VOA stabilizes level and dynamics when setpoints or paths change.

Mapping: H2-1 (scope), H2-2 (node architecture), H2-4 (VOA loop roles)

2) Should the power setpoint be in dBm or dB, and what are the common traps?

Use dBm when an absolute output level matters (alarm thresholds, launch power), but it requires a trustworthy absolute calibration chain (tap ratio, PD/TIA, ADC, Vref). Use dB when the goal is relative equalization (flattening across channels/degrees) because it can be more robust to fixed offsets. The trap is mixing them: relative control can hide absolute drift, and absolute control can amplify calibration errors.

Mapping: H2-4 (VOA strategy), H2-5 (AFE error budget)

3) Why can a “stable” power reading still be very inaccurate?

Stability often means low short-term noise, not correctness. Large errors can come from systematic terms such as tap ratio tolerance, PD responsivity drift, TIA gain/offset, ADC nonlinearity, or reference (Vref) drift. If those terms shift slowly, the trace looks stable while the absolute value is wrong. A good rule is to separate noise (RMS jitter) from offset/gain in the measurement error budget.

Mapping: H2-5 (optical-power AFE & error budget)

4) How does tap coupler tolerance get amplified, or canceled, in a ROADM?

Tap ratio tolerance becomes a direct dB/dBm error if the control loop relies on absolute measured power without per-unit calibration. It is amplified further when absolute limits are tight or when multiple stages compare values from different taps. It can be largely canceled by per-channel calibration (offset/gain) and by using relative equalization (dB targets) that references channels against each other. The key is making the tolerance a modeled term in the LUT and verification tests.

Mapping: H2-5 (tap/PD/TIA terms), H2-7 (LUT + compensation)
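
A short worked example shows how a tap ratio error turns into a dB error and why relative targets cancel it. The numbers are illustrative assumptions (a nominal 1% tap that is actually 1.1%), and the function names are hypothetical.

```python
import math


def tapped_power_dbm(line_power_dbm, actual_ratio):
    """Power reaching the photodiode for a given true line power and tap ratio."""
    return line_power_dbm + 10.0 * math.log10(actual_ratio)


def inferred_power_dbm(pd_power_dbm, assumed_ratio):
    """Line power inferred from the tapped power using the assumed tap ratio."""
    return pd_power_dbm - 10.0 * math.log10(assumed_ratio)


# Assumed example: nominal 1% tap, actual 1.1% tap, true line power 0 dBm.
pd_dbm = tapped_power_dbm(0.0, 0.011)
uncalibrated = inferred_power_dbm(pd_dbm, 0.010)   # about +0.41 dBm: direct error
calibrated = inferred_power_dbm(pd_dbm, 0.011)     # about 0 dBm after per-unit cal

# Relative view: two channels through the same tap share the same offset,
# so their difference is unaffected by the tap error.
ch_a = inferred_power_dbm(tapped_power_dbm(0.0, 0.011), 0.010)
ch_b = inferred_power_dbm(tapped_power_dbm(-2.0, 0.011), 0.010)
delta_db = ch_a - ch_b                              # about 2 dB, the true difference
```

The absolute error is 10·log10(actual/nominal), roughly 0.41 dB for a 10% ratio error, while the channel-to-channel difference stays at the true 2 dB; the cancellation holds only when the compared channels share the same tap.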

5) What are the most common sources of closed-loop oscillation, and how to localize them fast?

The top causes are too much loop gain, latency (sampling + filtering + actuator response), and integral windup when commands hit limits. Localize quickly by plotting setpoint / measured / command together: oscillation with command saturation points to windup, oscillation with strong phase lag points to latency, and stick-slip patterns suggest actuator deadband/hysteresis. A safe first move is reducing integral gain and adding anti-windup plus ramp limits.

Mapping: H2-4 (VOA loop), H2-8 (firmware sequencing), H2-11 (playbook)

6) For actuator backlash/hysteresis, is software LUT enough or is a mechanical strategy required?

A bi-directional LUT and approach-direction rules often fix repeatability without hardware changes, especially when hysteresis is consistent and measurable. However, if behavior depends strongly on load, temperature, or wear, mechanical strategies (preload, limiting slack, improved homing/parking) reduce the root cause. A practical pattern is breakaway + fine trim: a larger move to cross static friction, then small steps to settle, with direction-aware LUT selection.

Mapping: H2-6 (drivers & mechanics), H2-7 (hysteresis modeling)

7) What should temperature compensation cover first: PD, TIA, reference, or the actuator?

Start with the measurement chain—PD responsivity tempco, TIA gain/offset drift, and Vref drift—because any measurement bias pushes the closed loop to the wrong solution. Then compensate the actuator command→attenuation curve (LUT(T)) to keep control sensitivity consistent across temperature. Apply compensation as a slow loop: keep the fast loop stable and update temperature terms at bounded intervals with clear validity gating.

Mapping: H2-7 (temp compensation priorities)

8) Why are ramp and blanking needed during power-up or reconfiguration, and how do you pick values?

Ramp limits control dP/dt so transient overshoot does not trip alarms or stress optics, and blanking prevents the system from interpreting known transition behavior as a fault. Choose ramp and blanking based on the slowest element in the loop: actuator settling time, measurement filter group delay, and any temperature loop lag, then add margin. Re-enable alarms only after sensor-valid is true and measured power stays within a stability band for N samples.

Mapping: H2-4 (transients), H2-8 (sequencing), H2-11 (false alarm case)
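
The "slowest element plus margin" rule and a rate-limited setpoint ramp can be sketched as follows. The 1.5× margin, the function names, and the timing values in the usage note are assumptions, not recommended settings.

```python
import math


def blanking_time_s(actuator_settle_s, filter_delay_s, thermal_lag_s=0.0, margin=1.5):
    """Blanking window sized from the slowest loop element, scaled by a margin."""
    return margin * max(actuator_settle_s, filter_delay_s, thermal_lag_s)


def ramp_setpoints(start_db, target_db, max_rate_db_per_s, dt_s):
    """Intermediate setpoints so that |dP/dt| never exceeds max_rate_db_per_s.

    Returns one setpoint per control period, ending exactly on target_db.
    """
    max_step = max_rate_db_per_s * dt_s
    span = target_db - start_db
    n = math.ceil(abs(span) / max_step)
    if n == 0:
        return [target_db]
    return [start_db + span * k / n for k in range(1, n + 1)]
```

For instance, `ramp_setpoints(-10.0, -7.0, 1.0, 0.5)` yields six setpoints 0.5 dB apart, and `blanking_time_s(0.2, 0.05, 1.0)` gives 1.5 s when a 1 s thermal lag is the slowest element.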

9) How can re-calibration triggers be designed without frequently disturbing live traffic?

Use trend triggers rather than time-only schedules: monitor command drift rate (cmd/day), inter-channel spread (ΔdB), and closed-loop error RMS. First apply “soft corrections” (small offset updates or bounded LUT adjustments) when drift is mild. Trigger full recalibration only when slopes exceed thresholds or consistency breaks across channels, and execute it within a controlled window with rollback to the last-known-good parameter set.

Mapping: H2-7 (aging + triggers), H2-9 (telemetry), H2-10 (field checklist)

10) How can production calibration be repeatable without a complex optical lab setup?

Production should focus on a minimal, repeatable calibration set: dark/zero handling, a small number of reference points for gain/offset, LUT programming with CRC verification, and actuator self-tests (limits, current, temperature). The goal is not perfect absolute metrology but traceability and consistency—every unit records the same calibration fields and passes the same stability/repeatability checks under controlled conditions.

Mapping: H2-10 (production checklist)

11) Which telemetry fields are the minimum set to localize a field issue in one pass?

At minimum, log time, state, setpoint, measured power, and actuator command, plus temperature, driver current, and a structured alarm code. Add a sensor-valid flag, calibration version/CRC, and key limit/saturation flags to prevent misdiagnosis. With those fields, most issues can be split into measurement (readings), actuation (response), thermal (drift), or firmware sequencing (state/validity) within a single review.

Mapping: H2-9 (telemetry & evidence)

12) When an actuator or sensor fails, what fail-safe states should the system enter?

A good fail-safe is deterministic and recoverable. If sensing is invalid, freeze control (open-loop hold) and gate alarms until validity returns. If an actuator faults, transition to a defined state such as park / block / bypass / hold-last depending on the node’s safe optical policy, and prevent repeated retries that increase drift. Thermal faults should enforce output limits and move to a safe state with explicit re-entry conditions in the state machine.

Mapping: H2-8 (interlocks & sequencing), H2-11 (playbook)