
Liquid Cooling Manifold & Pump Control


A rack/server liquid-cooling manifold & pump controller keeps coolant circulation stable by combining a BLDC drive, clean Flow/ΔP/Temp sensing, and a fault-aware state machine. It prevents dry-run/leak events and maintains continuity through redundant power-domain switchover, while recording the minimum telemetry and fault logs needed for fast root-cause analysis.

H2-1 | Page Boundary & System Role: What Manifold & Pump Control Owns

A liquid-cooling manifold and pump control module sits inside the server/rack coolant loop and owns the execution layer: driving the BLDC pump, conditioning flow/ΔP/temperature inputs, enforcing safety interlocks (including leak response), managing redundant power domains for continuity, and emitting actionable telemetry plus time-ordered fault events for diagnosis.

Execution-layer control · Flow / ΔP / Temp sensing · Leak & interlock safety · Power domain A/B continuity · Telemetry & event logs

What this page covers

  • Pump drive + protection: BLDC/PMSM drive chain, startup/priming ownership, derating and safe shutdown triggers.
  • Signal front-end contract: where flow, differential pressure (ΔP), and coolant temperatures enter, and what “control-grade” means.
  • Safety enforcement: leak and interlock latching, graded response (warn → derate → shutdown) and anti-oscillation rules.
  • Redundant power domains: A/B feed, OR-ing behavior, continuity of control rail vs power rail, and switchover evidence in logs.
  • Diagnosability: minimal telemetry fields and event codes that make field failures reproducible.

Explicit non-goals (to avoid cross-page overlap)

  • No facility CDU / chiller plant design: only the local loop and its execution controls are in scope.
  • No BMC protocol deep dive: transport (IPMI/Redfish details) is out-of-scope; only “what must be observable” is defined.
  • No PSU topology (PFC/LLC) and no rack PDU metering: this page stays at the pump module domain.
  • No general thermal policy: system-level fan curves and workload scheduling are handled elsewhere; this page provides reliable actuation + evidence.
System contract in one sentence: The module must deliver stable coolant transport, enforce deterministic safety actions, and preserve root-cause visibility with consistent telemetry and event logs—without owning facility cooling strategy or management-plane protocol stacks.
Figure L1 — Where the manifold & pump controller sits, and the interfaces it exposes
ALT: Block diagram of a server liquid-cooling loop showing the manifold and pump control module interfaces (Flow, ΔP, Temp, Leak, Power A/B, Interlock, Telemetry/Logs).

H2-2 | What Must Be Measured: Flow, ΔP, Temperature, Bubbles & Leaks

Reliable pump control is not “one-sensor control.” The execution layer needs a minimal observability set that survives real coolant loops: flow (delivery), differential pressure ΔP (loop impedance / blockage proxy), and temperature gradient (heat transport result). A leak signal is safety-critical and must be treated with graded certainty to prevent false shutdowns.

The engineering meaning of each signal

  • Flow (delivery capacity): confirms coolant is actually reaching the load. Most useful to confirm “flow established” after priming and to detect intermittent dropouts.
  • ΔP (impedance proxy): reacts strongly to blockage, filter loading, kinked lines, or cold-plate restriction changes; often more control-stable than raw flow in noisy regimes.
  • Temperature (Tin/Tout, ΔT): validates heat transport and provides hard safety limits; it can detect “pump spinning but not transporting heat” when paired with flow/ΔP.
  • Leak (safety latch): must support warning/derate/shutdown tiers; condensation and service events demand anti-false-positive logic.

Common “false reality” traps (and how to avoid being fooled)

  • Bubbles / poor priming: flow readings can oscillate; ΔP can look inconsistent; current ripple often increases. Use cross-checks rather than a single threshold.
  • Low-flow quantization: many flow sensors become jumpy near minimum measurable flow. Treat flow as diagnostic in that region, not as a tight control target.
  • ΔP zero drift: temperature and mounting stress shift the baseline. A control loop must include drift tolerance (deadband/hysteresis) and plausibility checks.
  • Slow temperature response: Tin/Tout can lag fast transients. Use temperature primarily for protection and slow validation, not for fast torque decisions.
  • Condensation vs leak: moisture sensors may trigger during cold starts. Grade the leak signal (suspect vs confirmed) and require persistence or multi-sensor corroboration.
Control-grade vs diagnostic-grade rule: When a signal cannot be trusted as “control-grade” (noise, quantization, drift), the controller should degrade to a safer mode (e.g., speed/ΔP control with limits) while using the unreliable signal only for cross-check and alarms.

A practical cross-check triangle (minimal but powerful)

  • Low flow + high ΔP → likely restriction/blockage (filter/cold plate) rather than a drive failure.
  • Low flow + normal/low ΔP → likely bubbles, dry-run, or loss of prime (not enough head built).
  • Flow looks “OK” but ΔT stays high → sensor placement error, bypass/short-circuit path, or insufficient contact at the load.
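The triangle above can be sketched as a small triage function. This is an illustrative sketch only: the thresholds (`FLOW_MIN_LPM`, `DP_HIGH_KPA`, `DT_HIGH_C`) and the function/enum names are assumptions, not values defined by this page.

```c
#include <assert.h>

typedef enum {
    DIAG_OK,
    DIAG_RESTRICTION,    /* low flow + high ΔP  -> blockage / filter loading */
    DIAG_LOSS_OF_PRIME,  /* low flow + low ΔP   -> bubbles / dry-run / no prime */
    DIAG_TRANSPORT_FAULT /* flow OK but ΔT high -> placement / bypass / contact */
} diag_t;

#define FLOW_MIN_LPM  2.0f   /* assumed minimum credible flow        */
#define DP_HIGH_KPA  60.0f   /* assumed "high impedance" ΔP          */
#define DT_HIGH_C    15.0f   /* assumed abnormal inlet/outlet delta  */

static diag_t triage(float flow_lpm, float dp_kpa, float dT_c)
{
    if (flow_lpm < FLOW_MIN_LPM)
        return (dp_kpa > DP_HIGH_KPA) ? DIAG_RESTRICTION : DIAG_LOSS_OF_PRIME;
    if (dT_c > DT_HIGH_C)
        return DIAG_TRANSPORT_FAULT;  /* pump moves coolant, heat not removed */
    return DIAG_OK;
}
```

In practice these comparisons would run on filtered (control-grade) values, not raw samples, and the result would feed the event log rather than trigger a shutdown on its own.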
Figure L2 — Sensor placement on a simplified manifold (supply/return)
ALT: Simplified manifold diagram showing supply/return rails with flow, ΔP, Tin/Tout temperature points, and graded leak detection placement.

H2-3 | BLDC Pump Drive Architecture: 6-Step vs FOC (Why Servers Often Prefer FOC)

Data center liquid-cooling pumps commonly use 3-phase BLDC/PMSM motors driven by a three-phase half-bridge (integrated or discrete MOSFET stages). The practical decision between 6-step trapezoidal commutation and field-oriented control (FOC) should be made using execution-layer constraints: low-speed torque for priming, acoustic/vibration limits, and diagnostic visibility under bubbles or partial dry-run conditions.

Power path vs control path · Low-speed torque control · Noise & vibration · Current-sense observability · Abnormal signature detection
6-step trapezoidal commutation (what it buys, what it costs)
  • Strength: simpler implementation and fewer tuning dependencies; robust for fixed-speed or moderate dynamic requirements.
  • Trade-off: commutation torque ripple can amplify mechanical resonance (pump + tubing), raising acoustic noise in some operating points.
  • Low-speed caveat: sensorless operation becomes less stable when back-EMF is weak; open-loop alignment/forced commutation can reduce priming success rate under bubbles.
  • Diagnosability: fewer internal observables; abnormal conditions are often detected later (e.g., “no-flow” rather than root-cause signatures).
FOC (what it buys, what it costs)
  • Strength: smoother torque (lower ripple) and better low-speed authority; improves startup reliability and reduces vibration-sensitive behavior.
  • Strength: current-loop control makes “torque intent vs outcome” measurable, enabling stronger abnormal detection (bubbles, partial dry-run, restriction).
  • Trade-off: requires reliable current measurement and parameter/tuning management; control complexity increases validation effort.
  • Failure containment: when sensors degrade (noisy flow, drifting ΔP), the controller can degrade to safer modes while preserving evidence in logs.
Practical server reason for FOC: Priming success and acoustic/vibration constraints are often tighter than raw peak flow targets. FOC’s low-speed torque control and current-based observability typically improve “start → establish flow → stay quiet → remain diagnosable” behavior.
Execution-layer decision checklist
  • Priming sensitivity: frequent cold starts, trapped air risk, or low NPSH conditions favor smoother low-speed torque control.
  • Noise budget: if torque ripple excites tubing/manifold resonance, prioritize torque smoothness over simplest commutation.
  • Fault evidence: if root-cause isolation matters (bubble vs restriction vs dry-run), prioritize current-sense observability and event logs.
  • Complexity risk: if calibration/tuning cannot be controlled across units, keep control architecture conservative and enforce strict degradation rules.
Figure L3 — Pump drive chain: power path vs sensing/control path
ALT: Block diagram of a BLDC/PMSM pump drive showing the power path (DC in, protection/OR-ing, DC-link, inverter, pump) and the sensing/control path (MCU, gate driver, current sense, flow/ΔP/temp) with event logging.

H2-4 | Startup & Priming Are the Hard Part: Low-Speed Torque, Bubbles, Dry-Run, Cavitation

A pump “spinning” does not guarantee coolant transport. Priming is the execution-layer reliability bottleneck because trapped air can prevent head buildup, and abnormal regimes (bubbles, partial dry-run, incipient cavitation) can look acceptable if only one signal is trusted. A robust implementation treats priming as a bounded state machine with graded evidence, power limits, and explicit fault codes.

Priming success criteria · Soft-start limits · Bubble / dry-run signatures · ΔP & flow cross-check · State machine + logs
Why “RPM ≠ flow established” (execution-layer view)
  • Air lock: the impeller moves air rather than liquid; speed may rise while ΔP/flow stay low or unstable.
  • Partial dry-run: rotation exists but coolant contact is insufficient; current/estimator stability may change before flow confirms failure.
  • Incipient cavitation: pressure fluctuations reduce effective flow and can cause ΔP oscillation; prolonged operation risks damage and noise spikes.
Priming design rules (written as enforceable constraints)
  • Soft-start: limit acceleration (dRPM/dt) and/or electrical power (Pmax) during START/PRIME to prevent supply dips and mechanical shock.
  • Low-speed torque margin: during PRIME, allow controlled torque boost within a strict time window; avoid infinite retries that become destructive dry-run.
  • Evidence-based detection: treat “no-flow” as a conclusion from multi-signal mismatch (flow vs ΔP trend vs electrical signatures), not a single threshold.
  • Anti-oscillation: enforce minimum dwell time and hysteresis between states to prevent rapid start/stop cycling.
Minimal priming evidence set: Flow rising above a minimum threshold, ΔP building with a credible slope, and stable electrical behavior (no extreme ripple/instability) within a bounded prime window. If any element is unreliable, degrade to a conservative mode with explicit flags and event records.
What must be logged (to make field failures reproducible)
  • Prime outcome: success / timeout / unstable signals / safety trigger.
  • Key snapshots: RPM target, current proxy (bus/phase), flow, ΔP, and a brief window statistic (mean + ripple).
  • Counts & timing: prime duration, retry count, time since last successful prime.
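The bounded PRIME window and finite-retry rule above can be sketched as a tiny state machine fragment. The constants (`PRIME_WINDOW_MS`, `MAX_PRIME_RETRIES`) and the boolean evidence inputs are illustrative assumptions; a real implementation would derive them from the cross-check criteria and emit the event codes listed above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

typedef enum { ST_START, ST_PRIME, ST_RUN, ST_DEGRADED, ST_FAULT } pump_state_t;

#define PRIME_WINDOW_MS   5000u  /* assumed bounded prime window        */
#define MAX_PRIME_RETRIES 3u     /* finite retries, never infinite      */

typedef struct {
    pump_state_t state;
    uint32_t     prime_elapsed_ms;
    uint32_t     retries;        /* logged: retry count per the list above */
} primer_t;

/* One control tick. Evidence = flow above minimum AND ΔP building credibly. */
static void prime_step(primer_t *p, bool flow_ok, bool dp_building, uint32_t dt_ms)
{
    if (p->state != ST_PRIME) return;
    p->prime_elapsed_ms += dt_ms;
    if (flow_ok && dp_building) {                 /* evidence-based RUN entry */
        p->state = ST_RUN;
    } else if (p->prime_elapsed_ms >= PRIME_WINDOW_MS) {
        if (++p->retries >= MAX_PRIME_RETRIES)
            p->state = ST_FAULT;                  /* latch: no destructive dry-run */
        else
            p->prime_elapsed_ms = 0;              /* bounded retry */
    }
}
```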
Figure L4 — Startup state machine with entry/exit rules, limits, and log points
ALT: State machine diagram for pump startup and priming showing START, PRIME, RUN, DEGRADED, and FAULT states with flow/ΔP evidence rules, power limits, and event log points.

H2-5 | Sensor AFE & Sampling: Turning “Dirty Signals” into Controllable Metrics

Pump control quality is limited by signal quality. Flow, differential pressure (ΔP), and coolant temperature are “dirty” in the field: drift, noise coupling, quantization at low flow, and transient artifacts during priming can destabilize control and inflate false alarms. A production-grade implementation separates control-grade signals (stable, low bandwidth) from diagnostic features (ripple, slope, plausibility), and records the evidence that explains state transitions.

ΔP drift & offset · Low-flow quantization · Sampling & filters · Feature extraction · Consistency checks
ΔP (differential pressure): drift, offset, and supply coupling
  • Dominant failure mode: slow offset/temperature drift that masks real restriction changes or fabricates apparent ΔP.
  • Execution-layer rule: allow zero/offset re-baselining only in an explicitly safe window (stable low activity), never as a continuous hidden correction.
  • Noise coupling: reference and supply disturbances can appear as ΔP movement; treat ΔP credibility as a scored signal, not a single number.
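The "re-baseline only in an explicit safe window" rule can be made concrete. A minimal sketch, assuming hypothetical gating inputs (pump stopped, bounded ripple); names and thresholds are illustrative, not from this page.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    float dp_offset;   /* stored zero offset for the ΔP channel */
} dp_cal_t;

/* Returns true only if re-zeroing was actually applied. Never called as a
   continuous hidden correction: the caller must be in a safe window. */
static bool dp_try_rebaseline(dp_cal_t *cal, float dp_raw,
                              bool pump_stopped, float ripple, float ripple_max)
{
    if (!pump_stopped || ripple > ripple_max)
        return false;              /* not a safe window: keep old baseline */
    cal->dp_offset = dp_raw;       /* at zero flow, raw reading is pure offset */
    return true;
}

static float dp_corrected(const dp_cal_t *cal, float dp_raw)
{
    return dp_raw - cal->dp_offset;
}
```

Logging each applied re-baseline (old offset, new offset, timestamp) keeps later ΔP anomalies explainable.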
Flow sensing: low-flow quantization, jitter, and air artifacts
  • Low-flow limit: quantization and jitter dominate near the minimum measurable range; avoid using raw flow as a fast control variable in that region.
  • Air/bubbles: transient spikes or dropouts are common during priming; classify flow as “unstable” when ripple and persistence criteria fail.
  • Engineering output: publish both Flow and a Flow stability flag so the state machine can degrade rather than oscillate.
Coolant temperature: pump-control use only
  • Safety: enforce over-temperature and abnormal temperature rise rate (dT/dt) limits.
  • Transport check: use inlet/outlet trends as evidence that coolant transport is effective, especially when flow sensing is unstable.
  • Placement awareness: interpret temperature in context (inlet vs outlet) to avoid false conclusions during transients.
Sampling & filtering (engineering usage, not theory)
  • Window averaging: best for slow, stable control metrics (temperature trends, long-term ΔP).
  • IIR filters: common choice for control loops—stable output with adjustable responsiveness via time constant.
  • Kalman-style estimators: use for fusion and credibility (e.g., combining noisy flow with electrical/ΔP evidence), not as a default filter everywhere.
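As a sketch of the filtering split above: a first-order IIR stage produces the control-grade output, while a ripple feature (deviation of the raw sample from the filtered mean) serves diagnostics. The `alpha` value and the feature definition are illustrative choices, not prescribed by this page.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    float y;      /* filtered (control-grade) output          */
    float alpha;  /* responsiveness: higher = faster tracking */
    bool  init;
} iir_t;

static float iir_update(iir_t *f, float x)
{
    if (!f->init) { f->y = x; f->init = true; }   /* seed on first sample */
    else          { f->y += f->alpha * (x - f->y); }
    return f->y;
}

/* Diagnostic feature: instantaneous ripple around the filtered mean. */
static float ripple(const iir_t *f, float x)
{
    float d = x - f->y;
    return d < 0 ? -d : d;
}
```

Publishing both the filtered value and the ripple feature lets the state machine flag a signal as "unstable" without polluting the control loop.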
Consistency checks prevent false alarms: Compare flow vs ΔP trend, electrical effort vs hydraulic response, and temperature rise vs transport evidence. When signals disagree, downgrade credibility, prefer conservative limits, and record the mismatch as an explicit event reason.
Figure L5 — Signal chain: sensor → AFE/ADC → filters → features → control/diagnostics + plausibility
ALT: Block diagram of the sensor signal chain for pump control showing sensors (ΔP, flow, temperature), AFE/ADC, filtering, feature extraction (mean/ripple/slope), split outputs for control-grade signals and diagnostics, plus plausibility/consistency checks and event logging.

H2-6 | Leak Detection: Conductivity, Humidity, Optical, Inference — and False-Alarm Control

Leak handling must be evidence-based. The difficult part is not “detecting something,” but preventing condensation and service artifacts from causing disruptive shutdowns. A robust pump-control implementation uses graded evidence and tiered actions: suspect → confirmed → emergency, with explicit debounce/persistence rules and a clear lock-and-log policy.

Conductivity rope / point · Condensation false alarms · Optical / level · Inference as auxiliary · Tiered response
Detection methods (execution-layer applicability)
  • Conductivity (rope/point): fast local detection near manifold fittings and pump area; requires fluid-compatibility and aging-aware thresholds.
  • Humidity / moisture: useful as a suspect indicator; high condensation risk demands strong debounce and environmental gating.
  • Optical / local level: provides direct “fluid present” evidence when mechanically feasible; strong candidate for confirmation.
  • Inference: flow drop + ΔP anomaly + temperature pattern is auxiliary only; do not confirm a leak from inference alone.
False-alarm control (the three-layer rule set)
  • Time persistence: require a minimum duration above threshold (debounce/persistence) before changing severity state.
  • Context gating: during priming, maintenance windows, or known condensation risk, prevent moisture-only inputs from jumping directly to confirmation.
  • Cross-check: confirm using at least one “strong” sensor class (conductivity/optical), plus an optional inference score for confidence.
Tiered actions stay within pump-control scope: Suspect triggers derating and intensified sampling; Confirmed triggers stop-or-switch and latching; Emergency triggers immediate shutdown and locked evidence capture. Actions are limited to pump enable/interlock, redundancy domain switching (if available), and event logging.
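The three-layer rule set can be sketched as a small evidence ladder. The persistence time, the condensation gate, and the single "strong sensor" flag (conductivity/optical) are illustrative assumptions; emergency escalation and latching are omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

typedef enum { LEAK_NONE, LEAK_SUSPECT, LEAK_CONFIRMED, LEAK_EMERGENCY } leak_t;

#define SUSPECT_PERSIST_MS 2000u  /* assumed debounce before raising suspect */

typedef struct {
    leak_t   level;
    uint32_t wet_ms;  /* time a moisture-class input has been asserted */
} leak_fsm_t;

static void leak_step(leak_fsm_t *l, bool moisture, bool strong_sensor,
                      bool condensation_risk, uint32_t dt_ms)
{
    l->wet_ms = moisture ? l->wet_ms + dt_ms : 0;

    if (strong_sensor)                      /* conductivity/optical confirms */
        l->level = LEAK_CONFIRMED;
    else if (l->wet_ms >= SUSPECT_PERSIST_MS && !condensation_risk &&
             l->level < LEAK_SUSPECT)       /* moisture alone never confirms */
        l->level = LEAK_SUSPECT;
}
```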
What must be logged (so leaks are diagnosable)
  • Severity transitions: Suspect → Confirmed → Emergency, including the exact trigger class (conductivity / humidity / optical / combined).
  • Snapshot: flow, ΔP, temperature, electrical effort proxy, and current pump state (START/PRIME/RUN/DEGRADED) at confirmation time.
  • Latch policy: whether the condition is self-clearing or requires explicit clearing after inspection.
Figure L6 — Leak handling ladder: suspect → confirmed → emergency with actions and logging
ALT: Diagram of leak detection inputs (conductivity, humidity, optical/level, inference) feeding an evidence engine with debounce and context gating, producing severity states (Leak_Suspect, Leak_Confirmed, Emergency_Shutdown) with derate/stop actions and event logging.

H2-7 | Redundant Power Domains & Fault Tolerance: A/B Rails, OR-ing, Switchover Without Speed Drop

“Redundancy” in a pump control board is not a single backup wire—it is a power-domain architecture that keeps the control domain alive while the high-current motor domain can switch sources. The practical goal is to prevent a brownout reset, avoid reverse-current propagation, limit DC-link sag, and record a traceable event with timestamp and reason code.

Domain A / Domain B · Ideal-diode OR-ing · DC-link sag control · Always-on control rail · Timestamped power-fail log
What “redundant domains” mean in pump control
  • Power-stage domain: the high-current path feeding the inverter DC-link (A/B → OR-ing → Vbus).
  • Control always-on domain: MCU, gate-driver logic, and critical sensor rails remain stable through switchover.
  • Sense integrity: ΔP/flow/temp credibility is preserved by keeping references and ADC rails out of brownout.
OR-ing and isolation (motor power domain only)
  • Reverse-current control: prevent backfeed from DC-link into a failing input rail so the fault does not propagate.
  • Low drop: ideal-diode OR-ing minimizes voltage loss and heat compared with diode drops in high-current paths.
  • Switchover transients: treat Vbus sag and dV/dt as first-class signals; avoid trip-chains caused by brief dips.
Switchover strategy: keep control, limit sag, keep torque
  • Control continuity: keep state machine and sampling stable; do not allow a reset to re-enter startup logic.
  • Vbus protection: enforce sag limits and UVLO margin by managing DC-link energy and short transient handling.
  • Soft transition: temporarily cap acceleration/power during switchover to prevent overcurrent trips and oscillation.
Degraded operation on a single surviving domain
  • Power cap: limit maximum electrical effort when only one domain is healthy.
  • Target reduction: reduce flow/ΔP targets to stay away from critical rail margins.
  • Anti-flap rule: avoid repeated A↔B bouncing; lock preference until rails are stable for a defined window.
What must be logged: record (1) trigger reason (UV/OV/missing/OR-ing fault), (2) snapshot (A/B rails, Vbus, pump state, power proxy), and (3) action taken (derate, switch, latch). Logs are part of the control design, not an afterthought.
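The anti-flap rule can be sketched as follows, assuming domain A is the preferred feed, hardware OR-ing handles the instantaneous failover, and `STABLE_DWELL_MS` is an illustrative lock window. Only the return to the preferred domain is software-gated; the switch count is kept as log evidence.

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

#define STABLE_DWELL_MS 10000u  /* assumed stability window before returning to A */

typedef struct {
    uint8_t  active;        /* 0 = domain A (preferred), 1 = domain B */
    uint32_t a_stable_ms;   /* how long A has been healthy while on B */
    uint32_t switch_count;  /* logged switchover evidence             */
} oring_t;

static void oring_step(oring_t *o, bool a_ok, bool b_ok, uint32_t dt_ms)
{
    if (o->active == 0) {
        /* Failover is immediate: the hardware OR-ing keeps Vbus alive. */
        if (!a_ok && b_ok) {
            o->active = 1;
            o->a_stable_ms = 0;
            o->switch_count++;
        }
    } else {
        /* Anti-flap: return to A only after it stays healthy for the dwell. */
        o->a_stable_ms = a_ok ? o->a_stable_ms + dt_ms : 0;
        if (o->a_stable_ms >= STABLE_DWELL_MS) {
            o->active = 0;
            o->switch_count++;
        }
    }
}
```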
Figure L7 — Redundant domains: A/B inputs, ideal-diode OR-ing, DC-link, always-on control rail, and logging points
ALT: Block diagram of redundant power domains for a pump control board showing Domain A and Domain B inputs with hot-swap and UV/OV sensing, ideal-diode OR-ing into a DC-link (Vbus) with hold-up and Vbus sensing, a separate always-on control rail (BOR, state machine, logging), and the inverter/motor power stage.

H2-8 | Protections & Fault Modes: Stall, Overcurrent, Overtemp, Dry-Run, Sensor Distortion

Effective protection is evidence-driven and stable. The goal is to map field symptoms (flow, ΔP, current/power proxy, temperature) to fault classes, apply tiered actions (warn → derate → shutdown), and avoid oscillation caused by repeated start-stop cycles. Sensor faults must be handled as credibility problems first, not as immediate shutdown triggers.

Symptom → fault mapping · Tiered actions · Retry & latch · Anti-oscillation · Fault logs
Symptom channels used for fault inference
  • Electrical effort (current/power proxy): the drive is “pushing” vs limited.
  • Hydraulic result (flow + ΔP): whether pressure head and transport are established.
  • Thermal outcome (temperature + dT/dt): whether heat is being removed as expected.
Typical fault signatures (execution-layer inference)
  • Stall / blockage: electrical effort rises while flow remains near zero and ΔP does not build as expected.
  • Overcurrent: current exceeds limit (transient or sustained); Vbus sag can amplify trips if not handled with margin.
  • Overtemperature: device/winding temperature rises; derating is preferred before escalation to shutdown.
  • Dry-run: flow fails to establish while temperature rise pattern becomes abnormal; electrical effort may not be extreme.
  • Sensor distortion: stuck-at flow, drifting ΔP, open/short temperature—treat as credibility loss and degrade conservatively.
Tiered actions (avoid unnecessary shutdowns)
  • Warn: log evidence, tighten sampling, and maintain safe limits.
  • Derate / Degraded: cap power/acceleration, reduce targets, and continue operation when safe.
  • Shutdown / Latched: stop and latch only when safety thresholds or repeated failures demand it.
Anti-oscillation rules: apply hysteresis (clear threshold < trip threshold), enforce minimum run/stop windows, use backoff on retries, and latch after repeated failures. This prevents protect-restart loops that create instability.
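A sketch of these rules for one channel (an overcurrent-style limit), with illustrative thresholds (`TRIP_A` / `CLEAR_A`), a base stop window, shift-based backoff, and a latch after `MAX_TRIPS`; none of the numbers come from this page.

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     tripped;
    bool     latched;      /* requires explicit clear after inspection */
    uint32_t stop_ms;      /* time spent in the tripped state          */
    uint32_t backoff_ms;   /* current minimum stop window              */
    uint8_t  trip_count;
} prot_t;

#define TRIP_A       12.0f   /* trip threshold (illustrative)        */
#define CLEAR_A      10.0f   /* clear threshold < trip (hysteresis)  */
#define BASE_STOP_MS 1000u
#define MAX_TRIPS    4u

static void prot_step(prot_t *p, float current_a, uint32_t dt_ms)
{
    if (p->latched) return;
    if (!p->tripped) {
        if (current_a > TRIP_A) {
            p->tripped    = true;
            p->stop_ms    = 0;
            p->backoff_ms = BASE_STOP_MS << p->trip_count;   /* backoff grows */
            if (++p->trip_count >= MAX_TRIPS)
                p->latched = true;                           /* latch + log */
        }
    } else {
        p->stop_ms += dt_ms;
        /* Restart only below the clear threshold AND after the stop window. */
        if (current_a < CLEAR_A && p->stop_ms >= p->backoff_ms)
            p->tripped = false;
    }
}
```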
Figure L8 — Symptom-to-fault matrix with action hints (graphical, not text-heavy)
ALT: Graphical matrix mapping pump fault types (stall, overcurrent, overtemperature, dry-run, sensor fault) to symptom channels (flow, ΔP, current, temperature) using trend symbols, plus action hints (derate/limit/retry/degrade) and anti-oscillation rules (hysteresis, minimum time, backoff, latch).

H2-9 | Control Strategy: From Speed Control to ΔP / Flow Control (Practical Boundaries)

Pump control targets are chosen by signal credibility and operating phase, not by algorithm complexity. Speed control is the most robust baseline, while ΔP control tracks loop impedance changes more directly. Flow control can be effective only when flow sensing is stable and trustworthy; otherwise, flow is better used for diagnostics and cross-checks.

Speed-control baseline · ΔP-control with clamps · Flow credibility · Mode switching · Hysteresis + dwell
Three closed-loop targets and what they guarantee
  • Speed control: simplest and most stable; does not guarantee flow or ΔP under changing impedance.
  • Flow control: targets transport directly; depends heavily on flow sensor quality and bubble sensitivity.
  • ΔP control: targets pressure head / impedance behavior; requires a stable ΔP signal (offset + drift managed).
Selection rules (engineering-first)
  • Reliable ΔP available: prefer ΔP-control and apply speed / power clamps to protect margins during transients.
  • Flow is noisy or bubble-prone: keep flow as diagnostic-grade (credibility scoring, cross-check, alarms), not the primary loop target.
  • Low temperature / degas / bubble period: fall back to a conservative mode (typically speed control) until signals stabilize.
Credibility gating: control-grade vs diagnostic-grade signals
  • Control-grade: bounded noise, stable offset, reasonable rate-of-change, and consistent with physics.
  • Diagnostic-grade: helpful for triage but unsafe to close the loop directly when it can chase noise.
  • Cross-check: compare trends across {ΔP, flow, electrical effort, temperature} before enabling aggressive control modes.
Mode switching must be stable: use hysteresis (enter/exit thresholds differ), enforce a minimum dwell time in each mode, and log each mode transition with the reason code and a snapshot. This prevents “mode flapping” that can mimic a fault.
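A sketch of the switching rule, assuming a hypothetical ΔP credibility score in [0, 1]; the enter/exit thresholds and dwell time are illustrative. Note the hysteresis: the score must be clearly high to enter ΔP-control but clearly low to leave it.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { MODE_SPEED, MODE_DP } run_mode_t;

#define ENTER_DP_SCORE 0.8f   /* ΔP must be clearly control-grade to enter */
#define EXIT_DP_SCORE  0.5f   /* and clearly bad to leave (hysteresis)     */
#define MIN_DWELL_MS   5000u

typedef struct {
    run_mode_t mode;
    uint32_t   dwell_ms;  /* time since last mode transition */
} ctrl_t;

static void mode_step(ctrl_t *c, float dp_credibility, uint32_t dt_ms)
{
    c->dwell_ms += dt_ms;
    if (c->dwell_ms < MIN_DWELL_MS) return;        /* minimum dwell: no flapping */

    if (c->mode == MODE_SPEED && dp_credibility > ENTER_DP_SCORE) {
        c->mode = MODE_DP;    c->dwell_ms = 0;     /* log transition + snapshot */
    } else if (c->mode == MODE_DP && dp_credibility < EXIT_DP_SCORE) {
        c->mode = MODE_SPEED; c->dwell_ms = 0;     /* conservative fallback     */
    }
}
```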
Figure L9 — Control mode switching inside RUN: ΔP-control ↔ Speed-control with hysteresis and dwell
ALT: Block diagram showing RUN-mode control target selection for a liquid cooling pump: ΔP-control with speed/power clamps and cross-check gating, switching to speed-control when ΔP is unstable or during bubble/low-temperature phases, with hysteresis, minimum dwell time, and logging of each mode change.

H2-10 | Telemetry & Fault Logs: What to Record to Diagnose Problems

This section focuses on what the pump control board / manifold controller should record locally. Continuous telemetry provides trends, while event logs provide evidence. The diagnostic minimum is an event code plus a snapshot of key rails, hydraulic signals, and active limits—captured with a consistent relative timestamp.

Telemetry vs event log · Snapshot evidence · MVP field set · Nice-to-have · Relative timestamp
Telemetry and logs serve different purposes
  • Telemetry: periodic values (rails, RPM, flow, ΔP, temperatures) used for trending and cross-check.
  • Event log: discrete records (start, prime failure, dry-run, leak, domain switch, derate, sensor fault).
  • Snapshot: a “freeze frame” of key telemetry taken at the event moment; the snapshot is what makes logs actionable.

MVP field set (minimum to localize root cause)

  • Time: uptime_ms (relative time), monotonic counter
  • Power rails: Vin_A / Vin_B, Vbus, active_domain, brownout_flag
  • Motor effort: Idc (or power proxy), RPM, speed_cmd, limit_active
  • Hydraulic: Flow, ΔP, Tin/Tout (or the closest available equivalents)
  • State & validity: state_machine_state, sensor_valid_flags, fault_code, action_taken
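One way to carry this field set is a fixed-size event record plus a last-N ring buffer. The field widths, scalings (mV, mA, 0.1 °C), and buffer depth below are illustrative assumptions; only the field names mirror the list above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t uptime_ms;            /* relative time                      */
    uint32_t seq;                  /* monotonic counter                  */
    uint16_t vin_a_mv, vin_b_mv, vbus_mv;
    uint8_t  active_domain;        /* 0 = A, 1 = B                       */
    uint8_t  brownout_flag;
    uint16_t idc_ma;               /* motor effort proxy                 */
    uint16_t rpm, speed_cmd;
    uint8_t  limit_active;
    uint16_t flow_mlpm, dp_pa;
    int16_t  t_in_c10, t_out_c10;  /* 0.1 degC resolution                */
    uint8_t  state, sensor_valid_flags, fault_code, action_taken;
} pump_event_t;

/* Fixed-size ring buffer: the last-N history survives for correlation. */
#define EVT_RING_N 32u

typedef struct {
    pump_event_t evt[EVT_RING_N];
    uint32_t     head;  /* total events pushed; index = head % EVT_RING_N */
} evt_ring_t;

static void evt_push(evt_ring_t *r, const pump_event_t *e)
{
    r->evt[r->head % EVT_RING_N] = *e;
    r->head++;
}
```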

Advanced field set (faster diagnosis, fewer reproductions)

  • Trends: Vbus sag rate, ΔP slope, flow ripple (feature stats)
  • Counters: retry_counter, backoff_level, domain_switch_count
  • Thermal detail: driver_temp, winding_temp (if available)
  • History: last-N event ring buffer for correlation
Event template (recommended): EventCode + Reason + Snapshot + Action. A log without the active limits/state and a consistent timestamp is rarely diagnosable.
Figure L10 — Example event timeline: codes + snapshots to reconstruct the fault chain
ALT: Timeline diagram illustrating pump-controller event logging using relative time and event codes (start, flow established, ΔP anomaly, derate, domain switch, recovery), plus a snapshot field set (rails, current, RPM, flow, ΔP, temperatures, state, limits, sensor validity, action).

H2-11 | Validation & Production Test: Proving Reliability, No False Trips, and No-Drop Switchover

This section defines a practical evidence plan for a rack/server liquid-cooling manifold & pump controller: development validation (design correctness), production test (unit-to-unit consistency), and in-field self-test (diagnosability). Scope is limited to pump-control behaviors (priming, dry-run, leak discrimination, redundant power-domain switchover, and sensor fault handling).

Detection time targets (T_detect) · False-positive control (FP_rate) · Continuity metric (ΔRPM / ΔFlow) · Event code + snapshot evidence
Engineering principle: every protection decision must leave an audit trail (event code + a small data snapshot) so that field issues can be reproduced and traced without relying on platform/BMC protocol details.

1) Three-layer evidence model (DVT/EVT → Production → In-field)

Development validation (DVT/EVT)
  • Goal: prove control/protection/state-machine logic is correct under worst-case boundary conditions.
  • Method: repeatable fault injection + scripted runs + time-aligned evidence (event + snapshot + trend).
  • Coverage: bubbles/unprimed start, dry-run windows, condensation vs leak, power-domain switchover profiles, sensor drift & disconnections.
Production test
  • Goal: guarantee unit-level consistency in minutes, with low fixture cost.
  • Method: testable “proxies” (open/short injection, threshold checks, controlled brownout pulses, logging integrity).
  • Output: pass/fail + traceable calibration constants + key counters (e.g., switchover count).
In-field self-test
  • Goal: detect degradation early without disrupting service (prefer derate/alert over unnecessary shutdown).
  • Method: lightweight plausibility checks (sensor sanity, domain health, log space, anomaly counters).
  • Output: diagnostic flags + a minimal snapshot for remote triage.
Non-negotiable rule: if a control-grade sensor becomes invalid, the loop must fall back to a conservative mode (e.g., speed-control with clamps) instead of using stale/false measurements.
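The non-negotiable fallback rule can be sketched as a small mode selector: any invalid control-grade sensor forces clamped speed control instead of acting on stale measurements. A minimal illustrative sketch; the function name, mode strings, and clamp value are assumptions, not a real firmware API:

```python
def select_mode(dp_valid, flow_valid, requested_mode, speed_clamp_rpm=3000):
    """Return (mode, rpm_clamp). Any invalid control-grade sensor
    forces conservative clamped speed control (illustrative values)."""
    if requested_mode == "dp_control" and not dp_valid:
        return ("speed_control", speed_clamp_rpm)   # never chase a stale dP
    if requested_mode == "flow_control" and not flow_valid:
        return ("speed_control", speed_clamp_rpm)   # never chase a stale Flow
    return (requested_mode, None)                   # sensors credible: honor request

# dP sensor flagged invalid while dP control was requested:
assert select_mode(dp_valid=False, flow_valid=True,
                   requested_mode="dp_control") == ("speed_control", 3000)
```

The clamp value would be derived from the pump's safe operating envelope, not a constant.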

2) Definition of Done (DoD): measurable acceptance criteria

Protection correctness (safety + detection):

  • Dry-run detection time: T_detect_dryrun ≤ X s, followed by safe stop or safe derate within a defined window.
  • Leak confirmation time: T_confirm_leak ≤ Y s with graded actions (Warn → Derate → Emergency shutdown).
  • Sensor fault detection: open/short and stuck readings detected within T_detect_sensor ≤ Z s, entering a defined fallback mode.
  • No “protection oscillation”: bounded retry count and cooldown windows to prevent repeated start/stop instability.
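The last bullet (bounded retry count plus cooldown windows) can be sketched as a small retry governor. Illustrative only; the class name, limits, and latching policy are placeholders under the assumptions of this template:

```python
class RetryGovernor:
    """Bounded retries with a cooldown window to prevent
    protection-induced start/stop oscillation (illustrative sketch)."""
    def __init__(self, max_retries=3, cooldown_s=30.0):
        self.max_retries = max_retries
        self.cooldown_s = cooldown_s
        self.retries = 0
        self.last_trip = None

    def on_trip(self, now):
        """Record a protection trip; latch FAULT once the budget is spent."""
        self.retries += 1
        self.last_trip = now
        return "FAULT_LATCHED" if self.retries > self.max_retries else "COOLDOWN"

    def may_restart(self, now):
        """Restart only if not latched and the cooldown has elapsed."""
        if self.retries > self.max_retries:
            return False                # latched: requires safe reset condition
        return self.last_trip is None or (now - self.last_trip) >= self.cooldown_s

g = RetryGovernor(max_retries=2, cooldown_s=10.0)
assert g.on_trip(0.0) == "COOLDOWN"
assert g.may_restart(5.0) is False          # still inside cooldown window
assert g.may_restart(10.0) is True          # cooldown elapsed, budget remains
assert g.on_trip(12.0) == "COOLDOWN"
assert g.on_trip(25.0) == "FAULT_LATCHED"   # retry budget exhausted: latch
```

A successful, stable RUN period would normally also reset the retry counter; that decay rule is omitted here for brevity.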

Service continuity (no-drop switchover):

  • Power-domain switchover continuity: worst-case speed dip ΔRPM_max ≤ A% and/or retained flow Flow_min ≥ B% of setpoint during A→B transfer profiles.
  • Control-rail survival: no controller reset during micro-interruptions within the specified tolerance.
  • Mode switching stability: hysteresis and rate limits prevent “mode flapping” in marginal sensor conditions.
Values X/Y/Z/A/B are design targets to be filled per pump type and rack loop constraints. The structure above ensures the page remains a reusable engineering spec template.
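One way to reduce a recorded switchover trace to the continuity targets above is a worst-case-dip calculation over the transfer window. A sketch, assuming sampled RPM/flow traces and fixed setpoints (names and units are illustrative):

```python
def continuity_metrics(rpm_trace, flow_trace, rpm_setpoint, flow_setpoint):
    """Worst-case dip during an A->B transfer window, as percentages
    of setpoint, for comparison against the A%/B% targets."""
    drpm_max_pct = max(0.0, (rpm_setpoint - min(rpm_trace)) / rpm_setpoint * 100.0)
    flow_min_pct = min(flow_trace) / flow_setpoint * 100.0
    return drpm_max_pct, flow_min_pct

# Synthetic switchover trace: brief dip, then recovery
rpm = [4000, 3950, 3800, 3900, 4000]
flow = [10.0, 9.8, 9.2, 9.7, 10.0]   # L/min
drpm, fmin = continuity_metrics(rpm, flow, rpm_setpoint=4000, flow_setpoint=10.0)
assert round(drpm, 1) == 5.0    # worst RPM dip: 5% of setpoint
assert round(fmin, 1) == 92.0   # flow never fell below 92% of setpoint
```

Pass/fail is then `drpm <= A` and `fmin >= B` against the filled-in design targets.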

3) Must-run validation use-cases (fault injection → expected action → evidence)

Each use-case below should be written and executed with a consistent template: purpose → injection method → observables → expected actions → logging evidence → pass/fail criteria.

  • Bubbles/priming variability: multiple orientations + different initial fill levels; verify RUN entry requires credible Flow/ΔP establishment.
  • Dry-run window: run without liquid for a bounded time; verify detection time and safe stop/derate without false leak triggers.
  • Leak vs condensation discrimination: compare true leak injection vs controlled humidity/condensation; quantify false-positive rate.
  • Power-domain switchover: A hard drop, B micro-interruption, and brownout ramps; verify continuity metrics and event stamping.
  • Sensor disconnect/short injection: Flow/ΔP/Temp open/short/stuck; verify fallback mode and stable limits (no uncontrolled acceleration).
Evidence requirement (minimum): each fault must generate an event code and capture a snapshot (key signals around the trigger) so that root-cause can be determined from logs alone.
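The event-code-plus-snapshot requirement can be prototyped as a ring-buffered logger: a rolling pre-trigger history is kept continuously, and each trigger captures that history plus a fixed post-trigger window. A minimal sketch; class name, field names, and window sizes are illustrative assumptions:

```python
from collections import deque

class EventLogger:
    """Minimal event-code + snapshot-window logger: rolling pre-trigger
    history, fixed-length post-trigger capture, bounded event buffer."""
    def __init__(self, pre_len=4, post_len=4, capacity=64):
        self.history = deque(maxlen=pre_len)   # continuously sampled history
        self.post_len = post_len
        self.events = deque(maxlen=capacity)   # bounded: oldest evicted
        self._open = None                      # (code, pre_copy, post_list)

    def trigger(self, code):
        """Freeze the pre-window and start collecting post samples."""
        self._open = (code, list(self.history), [])

    def sample(self, snap):
        """Feed every control-loop snapshot through here."""
        if self._open:
            code, pre, post = self._open
            post.append(snap)
            if len(post) == self.post_len:
                self.events.append({"code": code, "pre": pre, "post": post})
                self._open = None
        self.history.append(snap)

log = EventLogger(pre_len=2, post_len=2)
for t in range(3):
    log.sample({"t": t, "rpm": 4000})
log.trigger("EVT_DP_ANOMALY")
log.sample({"t": 3, "rpm": 3800})
log.sample({"t": 4, "rpm": 3900})
ev = log.events[0]
assert ev["code"] == "EVT_DP_ANOMALY"
assert [s["t"] for s in ev["pre"]] == [1, 2]    # rolling pre-window
assert [s["t"] for s in ev["post"]] == [3, 4]   # fixed post-window
```

In firmware the same pattern would sit in a static ring buffer with CRC-protected records rather than Python deques.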

4) Production test: fast proxies that still catch real failures

100% test (typical)
  • Power rails: always-on rail, gate-driver rail, ADC references within limits.
  • Sensor integrity: open/short detection paths + stuck-at plausibility checks.
  • Switchover detection: controlled brownout pulse and verify domain status + event logging.
  • Logging integrity: event buffer write/read, monotonic counters, CRC if applicable.
Sample / type test (typical)
  • Full priming & bubble matrix across orientations.
  • Extended dry-run robustness with controlled thermal rise.
  • Condensation chamber correlation vs leak sensor thresholds.
Production philosophy: complex loop behaviors are verified in DVT/EVT; production focuses on fast-to-measure proxies that strongly correlate with field reliability and false-trip risk.
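A stuck-at plausibility proxy from the 100% test list can be as simple as a spread check over a short stimulus window: a live sensor responds to a small pump step, a stuck ADC code does not. Sketch only; threshold values are placeholders to be set per sensor:

```python
def stuck_at_check(samples, min_spread):
    """Production proxy: a live sensor shows at least min_spread of
    variation across a short stimulus window; a stuck/flat line does not."""
    return (max(samples) - min(samples)) >= min_spread

# Flow sensor during a small pump step: reading should move
assert stuck_at_check([2.0, 2.1, 2.4, 2.6], min_spread=0.2) is True
# Stuck ADC code: flat reading fails the proxy
assert stuck_at_check([2.0, 2.0, 2.0, 2.0], min_spread=0.2) is False
```

The same window can feed a sign/polarity check (did the reading move in the expected direction for the applied step), which also catches swapped sensors.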

5) Example material part numbers (MPNs) that support testability

The following are example MPNs commonly used as building blocks for pump-control boards. Selection must be verified against electrical ratings, thermal design, and wetted-material requirements (for sensors).

  • 3-phase smart gate driver (BLDC/PMSM): DRV8323RS / DRV8323RH (SPI or HW interface, diagnostics).
  • Current-sense amplifier (PWM rejection): INA240A3 / INA240A4 (shunt sensing).
  • Hot-swap / inrush & power limiting (domain input): LM5069 (9–80V class hot-swap controller).
  • Ideal-diode OR-ing (A/B domain combine): LTC4359 (ideal diode controller with reverse protection).
  • ΔP sensor (digital differential pressure example): SSCDRRN002ND2A3 (Honeywell TruStability SSC-series differential pressure sensor).
  • Liquid flow sensor (I²C example): SLF3S-1300F (liquid flow sensor example).
  • Coolant/board temperature sensor (digital): TMP117 (0.1°C class digital temperature sensor).
  • Humidity sensor (condensation context): HDC3022 (RH sensor example for condensation correlation).
  • Capacitive sensing for leak / liquid presence (front-end): FDC1004 (capacitance-to-digital converter).
  • EEPROM (calibration constants / identifiers): 24AA02 (2Kb I²C serial EEPROM).
  • Supervisor / watchdog (anti-hang & reset discipline): TPS3828 (voltage supervisor with watchdog variants).
Why MPNs appear in a validation chapter: the listed parts provide typical built-in hooks (fault pins, digital registers, stable references, and predictable failure modes) that make production proxies and field diagnosability practical.

6) Figure V11 — Fault-injection validation matrix (visual checklist)

Rows: injected faults / boundary conditions. Columns: expected actions, continuity constraints, and logging evidence. Text in the figure is kept minimal (mobile-readable).

Figure V11 — Test matrix for “reliable, no false trips, no-drop switchover”
Validation Matrix (Fault Injection → Expected Outcomes) · Legend: MUST / OPT / N/A · Evidence: EVT (event code) + SNAP (snapshot)

  Fault injection / boundary condition | Expected action | Continuity | Logging
  Bubbles / unprimed start (orientation + low-fill variants) | MUST: PRIME | Clamp ΔRPM | EVT + SNAP
  Dry-run, no liquid (bounded window, safe stop/derate) | MUST: STOP | No surge | EVT + SNAP
  Leak vs condensation (quantify FP_rate and grading) | MUST: GRADE | Derate ok | EVT + SNAP
  Domain A hard drop (A→B transfer, no-drop target) | MUST: SW | ΔRPM ≤ A% | EVT: SW + SNAP
  Brownout ramp / micro-cut (threshold + hysteresis, no reset) | OPT: HOLD | No reset | EVT + counter

DoD keywords: T_detect · FP_rate · ΔRPM / ΔFlow · EVT + SNAP
Practical acceptance matrix: define injection profiles, expected actions, continuity constraints, and required evidence for each row.


H2-12 | FAQs (Liquid Cooling Manifold & Pump Control)

Each FAQ stays within the pump-controller scope: BLDC drive behavior, Flow/ΔP/Temp signal conditioning, priming/dry-run/leak logic, redundant power-domain switchover, and local telemetry & fault logs (no CDU/plant, no BMC protocol stack, no PSU topology).

Why can the pump spin but flow never establishes?
→ H2-4 / H2-2

Separate “motor rotation” from “hydraulic circulation.” If RPM rises but Flow≈0 and ΔP≈0, priming/air pockets or dry-run is likely. If ΔP rises while flow stays low, loop restriction is more likely. Use a PRIME state with limited acceleration, time windows, and retries with cooldown. Log PRIME failures with snapshots (RPM/Flow/ΔP/Idc/Tin/Tout) for repeatable diagnosis.

Related: priming state machine, sensor placement and trustworthiness.
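The PRIME exit rule described above (RUN entry only on credible Flow/ΔP within a bounded window) can be sketched as a gate function. Thresholds, units, and names are illustrative placeholders, not loop-specific values:

```python
def prime_gate(flow_lpm, dp_kpa, elapsed_s,
               flow_ok=1.0, dp_ok=5.0, timeout_s=20.0):
    """PRIME exit rule: enter RUN only when both Flow and dP are
    credible; a timeout ends the attempt instead of spinning dry."""
    if flow_lpm >= flow_ok and dp_kpa >= dp_ok:
        return "RUN"          # hydraulic circulation established
    if elapsed_s >= timeout_s:
        return "PRIME_FAIL"   # log EVT + snapshot, retry later with cooldown
    return "PRIME"            # keep limited-acceleration priming

assert prime_gate(flow_lpm=0.1, dp_kpa=0.5, elapsed_s=2.0) == "PRIME"
assert prime_gate(flow_lpm=1.5, dp_kpa=6.0, elapsed_s=8.0) == "RUN"
assert prime_gate(flow_lpm=0.1, dp_kpa=0.5, elapsed_s=25.0) == "PRIME_FAIL"
```

The PRIME_FAIL path would feed the bounded-retry/cooldown logic rather than restarting immediately.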
If Flow readings jump, is it bubbles or sensor noise—and how to tell?
→ H2-5 / H2-4

Use cross-checks instead of Flow alone. Bubble-driven jitter often aligns with ΔP oscillations and priming phases, while pure electrical noise may not correlate with ΔP, RPM, or motor power. Apply a control-grade filter (window/IIR) plus a “confidence flag.” When confidence drops, treat Flow as diagnostic-only and fall back to conservative control (speed or ΔP) until stability returns.

Related: filtering + plausibility, PRIME/RUN transitions.
Speed control or ΔP control—which is more stable, and what are the common traps?
→ H2-9

Speed control is the safest baseline but does not guarantee flow under changing loop impedance. ΔP control can track impedance changes better, but only if ΔP is control-grade (low drift, low noise, correct placement). Common traps: speed control under-delivers flow after restrictions change; ΔP control “chases noise” when bubbles or drift corrupt ΔP. Use mode switching with hysteresis and minimum dwell time to prevent flapping.

Related: mode switching logic, hysteresis + dwell.
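The hysteresis-plus-dwell rule from the answer above can be sketched as a small mode switcher: asymmetric confidence thresholds plus a minimum dwell time keep marginal ΔP quality from causing mode flapping. All names and values are illustrative assumptions:

```python
class ModeSwitch:
    """Speed vs dP control selection with hysteresis and minimum dwell
    so marginal dP confidence cannot cause mode flapping (sketch)."""
    def __init__(self, enter_dp_conf=0.8, exit_dp_conf=0.5, min_dwell_s=10.0):
        self.enter, self.exit = enter_dp_conf, exit_dp_conf
        self.min_dwell_s = min_dwell_s
        self.mode, self.since = "speed", 0.0   # safe baseline by default

    def update(self, dp_confidence, now):
        if now - self.since < self.min_dwell_s:
            return self.mode                      # dwell: hold current mode
        if self.mode == "speed" and dp_confidence >= self.enter:
            self.mode, self.since = "dp", now     # confident dP: track impedance
        elif self.mode == "dp" and dp_confidence <= self.exit:
            self.mode, self.since = "speed", now  # fall back to safe baseline
        return self.mode

m = ModeSwitch()
assert m.update(0.9, now=5.0) == "speed"   # still inside dwell window
assert m.update(0.9, now=12.0) == "dp"     # dwell elapsed, confident dP
assert m.update(0.6, now=15.0) == "dp"     # new dwell window: hold
assert m.update(0.4, now=30.0) == "speed"  # low confidence: fall back
```

The gap between `enter` and `exit` thresholds is the hysteresis band; the dwell timer restarts on every transition.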
How does ΔP sensor zero drift mislead control, and what online compensation works?
→ H2-5 / H2-9

A positive drift makes the controller believe pressure is “already high,” reducing pump output and starving flow; a negative drift can over-drive the pump, raising noise and cavitation risk. Practical online compensation: update a slow offset only in known low-energy windows (e.g., pump stopped or very low speed with stable conditions), clamp offset rate-of-change, and invalidate ΔP control when plausibility checks fail (Flow/Idc/Tin–Tout patterns disagree).

Related: drift handling, fallback to safe mode.
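The online compensation described above (slow offset learning only in low-energy windows, with a rate clamp) can be sketched as a single update step. Coefficients and thresholds are illustrative assumptions, not calibrated values:

```python
def update_dp_offset(offset_kpa, raw_dp_kpa, pump_stopped, stable,
                     alpha=0.02, max_step_kpa=0.01):
    """Slow zero-offset learning: only while the pump is stopped and
    conditions are stable, with the per-update step clamped so a
    transient can never yank the offset (illustrative sketch)."""
    if not (pump_stopped and stable):
        return offset_kpa                     # never learn while running
    step = alpha * (raw_dp_kpa - offset_kpa)  # move toward observed zero
    step = max(-max_step_kpa, min(max_step_kpa, step))
    return offset_kpa + step

off = 0.0
for _ in range(100):          # pump idle; true zero reads +0.5 kPa (drifted)
    off = update_dp_offset(off, 0.5, pump_stopped=True, stable=True)
assert 0.4 < off <= 0.5       # converges toward the drift, rate-limited
assert update_dp_offset(off, 5.0, pump_stopped=False, stable=True) == off
```

When plausibility checks on Flow/Idc/Tin-Tout disagree with the compensated ΔP, the value should be invalidated for control rather than corrected further.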
Why does dry-run detection false-trip, and when are delay + cross-check required?
→ H2-4 / H2-8 / H2-6

False trips often happen during priming (air, bubbles), at low temperature (viscosity changes), or when a single sensor glitches. Use a delay so PRIME transients are not treated as dry-run, then require cross-checks: persistent low Flow plus low/unstable ΔP plus abnormal power/temperature slope (at least two independent cues). Apply graded actions (Warn → Derate → Stop) and log the exact rule that triggered.

Related: PRIME vs DRY-RUN separation, graded protection.
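The delay-plus-cross-check rule above (mask PRIME transients, then require at least two independent cues, then grade the action) can be sketched as a small voter. Cue names and the delay value are illustrative assumptions:

```python
def dryrun_vote(low_flow, dp_unstable, power_anomaly, persist_s, delay_s=5.0):
    """Dry-run decision: a startup delay masks PRIME transients, then
    at least two independent cues must agree, and the response is
    graded (Warn -> Derate -> Stop) rather than an immediate trip."""
    if persist_s < delay_s:
        return "OK"            # PRIME transient window: no judgment yet
    cues = sum([low_flow, dp_unstable, power_anomaly])
    if cues >= 3:
        return "STOP"          # all cues agree: safe stop + EVT/SNAP
    if cues == 2:
        return "DERATE"        # two independent cues: graded action
    if cues == 1:
        return "WARN"          # never trip on a single sensor
    return "OK"

assert dryrun_vote(True, True, True, persist_s=2.0) == "OK"      # still delayed
assert dryrun_vote(True, False, False, persist_s=10.0) == "WARN"
assert dryrun_vote(True, True, False, persist_s=10.0) == "DERATE"
assert dryrun_vote(True, True, True, persist_s=10.0) == "STOP"
```

Logging the triggering rule (which cues voted) alongside the event code makes the decision reproducible from logs alone.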
How to distinguish stall vs cavitation using current, ΔP, and flow signatures?
→ H2-8 / H2-4

Stall tends to show high motor current/power with poor RPM and no flow establishment; driver temperature may rise quickly. Cavitation more often shows oscillatory ΔP and Flow, intermittent flow collapse, and unstable operating points; current may not spike as hard as a stall. Response differs: stall favors shutdown and bounded retries; cavitation favors derate, softer targets, and extended priming until stability returns.

Related: fault symptoms matrix, restart strategy.
Why can protections cause repeated start/stop oscillation, and how to fix it?
→ H2-8 / H2-4

Oscillation usually comes from tight thresholds without hysteresis, no minimum dwell time, and aggressive retries that re-enter PRIME before the loop stabilizes. Fix with a clear state machine (START → PRIME → RUN → DEGRADED → FAULT), hysteresis on key conditions, bounded retry budget, and cooldown windows. Severe faults should latch until a safe reset condition. Always log “reason code” and the last two snapshots to reveal the loop.

Related: state machine discipline, anti-flapping guardrails.
How to prevent leak-probe false alarms in condensation-heavy environments?
→ H2-6

Treat leak sensing as a graded decision, not a single threshold. Use debounce time and require spatial/temporal consistency. Condensation can be managed by correlating the leak signal with humidity/temperature context (dew-risk) and hydraulic plausibility (Flow/ΔP/temperature trends). For Leak_Suspect, derate and keep monitoring; for Leak_Confirmed, escalate actions. Example building blocks include humidity sensors (e.g., HDC3022) and liquid-presence front-ends (e.g., FDC1004) depending on the probe method.

Related: false-positive governance, graded response ladder.
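The graded decision described above can be sketched as a debounced state promoter: persistence moves Dry → Leak_Suspect → Leak_Confirmed, and dew-risk context stretches the required persistence instead of raising the raw threshold. Times and the scaling factor are illustrative assumptions:

```python
class LeakGrader:
    """Graded leak decision with debounce: persistence promotes
    DRY -> LEAK_SUSPECT -> LEAK_CONFIRMED; condensation context
    (dew risk) extends the required persistence (sketch)."""
    def __init__(self, suspect_s=2.0, confirm_s=10.0):
        self.suspect_s, self.confirm_s = suspect_s, confirm_s
        self.wet_since = None

    def update(self, probe_wet, dew_risk, now):
        if not probe_wet:
            self.wet_since = None
            return "DRY"
        if self.wet_since is None:
            self.wet_since = now              # start the debounce clock
        held = now - self.wet_since
        scale = 2.0 if dew_risk else 1.0      # condensation context: be patient
        if held >= self.confirm_s * scale:
            return "LEAK_CONFIRMED"           # escalate (shutdown path)
        if held >= self.suspect_s * scale:
            return "LEAK_SUSPECT"             # derate + keep monitoring
        return "DEBOUNCE"

g = LeakGrader()
assert g.update(True, dew_risk=False, now=0.0) == "DEBOUNCE"
assert g.update(True, dew_risk=False, now=3.0) == "LEAK_SUSPECT"
assert g.update(True, dew_risk=False, now=12.0) == "LEAK_CONFIRMED"
g2 = LeakGrader()
g2.update(True, dew_risk=True, now=0.0)
assert g2.update(True, dew_risk=True, now=3.0) == "DEBOUNCE"  # dew: 2x persistence
```

A production version would also require spatial consistency (multiple probe zones) and the hydraulic plausibility cues mentioned above before confirming.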
During redundant power-domain switchover, how to minimize speed/flow disturbance?
→ H2-7

Keep the control rail alive while switching only the power stage domain. Use OR-ing/ideal-diode paths and controlled hot-swap behavior, plus enough DC-link energy to bridge brief sags. During switchover, temporarily clamp torque/acceleration and hold integrators to avoid over-correction. Detect sag early (domain status + Vbus trend), then log the switchover event with a before/after snapshot. Example parts: ideal-diode controllers (e.g., LTC4359) and hot-swap controllers (e.g., LM5069).

Related: A/B domains, no-drop strategy + evidence logging.
What minimum log fields are needed to replay an intermittent flow drop?
→ H2-10

Minimum fields must explain “what the controller decided” and “what the loop did.” Log: state/mode, target (speed/ΔP), clamps (torque/power), RPM, Flow, ΔP, Tin/Tout, driver temperature, Vbus and domain A/B status, DC current (Idc), fault flags, and a relative timestamp. For intermittent issues, store a short snapshot window around the trigger (pre/post) plus counters (retry count, switchover count). Keep calibration IDs in EEPROM for traceability (e.g., 24AA02).

Related: MVP vs advanced telemetry, event + snapshot pattern.
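The minimum field set above can be pinned down as a per-sample record so nothing silently drops out of the snapshot window. A sketch; field names and types are illustrative, and counters/calibration IDs live outside the per-sample record (per-unit, in EEPROM):

```python
from dataclasses import dataclass, fields

@dataclass
class Snapshot:
    """One per-sample record in the pre/post window around a trigger;
    field set mirrors the minimum replay fields (illustrative names)."""
    t_rel_ms: int       # relative timestamp
    state: str          # state machine state / control mode
    target: float       # active speed or dP target
    clamp: float        # active torque/power clamp
    rpm: float
    flow_lpm: float
    dp_kpa: float
    t_in_c: float       # coolant inlet temperature
    t_out_c: float      # coolant outlet temperature
    t_driver_c: float   # gate-driver / power-stage temperature
    vbus_v: float
    domain: str         # power domain A/B status
    idc_a: float        # DC link current
    fault_flags: int    # bitfield of active fault flags

snap = Snapshot(1200, "RUN", 4000.0, 0.8, 3980.0, 9.6, 42.0,
                30.1, 38.4, 55.0, 47.9, "A", 1.9, 0)
assert len(fields(Snapshot)) == 14        # every replay field present
assert snap.domain == "A" and snap.fault_flags == 0
```

In firmware this maps to a packed struct of fixed-point fields; the point of the dataclass is only to make the field contract explicit and testable.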
How can production quickly verify sensors are not swapped or drifting?
→ H2-11 / H2-5

Use fast proxies that validate sign, scaling, and plausibility. For ΔP: apply a known differential (or a controlled electrical injection for the AFE path) and verify polarity and range. For Flow: verify direction/response consistency (expected pulse/I²C change) and confirm plausibility against a simple pump step. For temperature: apply a short thermal step and verify response time and slope. Store calibration constants and a pass/fail digest into the unit record for traceability.

Related: production proxies, calibration + consistency checks.
Trapezoidal or FOC drive—when can trapezoidal be more reliable?
→ H2-3 / H2-11

Trapezoidal commutation can be more reliable when sensing resources are limited, EMI is harsh, or deterministic protection behavior is prioritized over acoustic performance. FOC typically needs accurate current sensing, stable parameters, and robust tuning across temperature and aging; if validation coverage is incomplete, field behavior can be unpredictable. In contrast, a simpler trapezoidal drive with strong stall/priming logic and bounded retries can be easier to test and certify. Example gate-driver building blocks include DRV8323-family devices.

Related: drive architecture trade-offs, verification completeness.
Figure F12 — Quick triage map (signals → actions → logs)
Quick Triage Map: use Flow + ΔP + RPM/Power to choose PRIME, MODE SWITCH, DERATE, or SWITCHOVER, then log EVT + SNAP.

  • Signals (inputs): Flow · ΔP · RPM + Power · Tin / Tout
  • Decisions: trust grade? · PRIME vs RUN · stall vs cavitation? · leak grading
  • Actions + evidence: PRIME (bounded) · MODE SWITCH · DERATE / STOP · SWITCHOVER A↔B · LOG: EVT + SNAP

Rule of thumb: never trust a single sensor; use cross-check + persistence + graded actions. Every decisive protection must leave evidence: an event code plus a short snapshot window.
A compact map that aligns with H2-2 to H2-11: signals → decisions → actions → evidence.