Fan & Thermal Management for Server Racks

Fan and thermal management in servers is a fan-domain control and observability problem: multi-fan PWM drive is combined with tach, vibration, and temperature inputs to produce stable fan curves, redundancy actions, and derating states. A good design turns raw signals into deterministic decisions (pass/fail checks, clean alarms, and actionable logs) so thermals stay safe without RPM hunting, noise spikes, or false stall events.

H2-1 · Definition & Boundary

Intent

Define the fan-domain thermal control loop and its deliverables, so all later chapters remain inside a clear engineering boundary (no cross-topic drift).

Fan curve spec · Derating policy · Fault policy · Validation checklist

What this page covers

  • Inputs (measured): inlet/outlet/hotspot temperatures, tach/FG RPM feedback, fan present/fault, (optional) vibration indicators, and fan-channel electrical health.
  • Controller (computed): multi-zone demand fusion, fan curve (LUT/PID/hybrid), hysteresis, ramp limiting, acoustic caps, derating state machine, and redundancy compensation.
  • Actuators (controlled): PWM duty or voltage drive (2/3/4-wire fans), per-group control (A/B trays), and controlled spin-up/soft-start to avoid step transients.
The focus is not “thermal theory,” but how to build a stable, observable, fault-tolerant fan subsystem that behaves predictably on real hardware.

Typical I/O map (fan domain)

  • PWM outputs: per-channel duty commands; frequency/logic level must match the fan type (especially 4-wire PWM fans).
  • Tach/FG inputs: RPM truth source; they require windowing/filtering to avoid low-speed jitter and false stall flags.
  • Presence & fault pins: fan tray hot-swap detection, fault escalation, and safe fallback behavior.
  • Temperature inputs: inlet/outlet/hotspot points for zoning and safe-state logic (sensor placement + filtering matters).
  • SMBus/I²C/PMBus telemetry: minimal set (PWM target, RPM actual, status/fault, temperature inputs), plus optional vibration health metrics.

What this page does NOT cover (link-only topics)

  • Liquid cooling manifolds & pump control: flow/ΔP loops, pump BLDC drives, leak detection (separate page).
  • BMC / Redfish / IPMI internals: management-plane architecture and protocol details (separate page). This page only defines the interface points.
  • Rack PSU / PDU / facility HVAC: power conversion, metering, or room-level thermal (separate pages).

Only interface expectations are stated here (setpoints, readbacks, and alarms), not the full stack implementation.

Figure F0 — Fan-domain control loop boundary (inputs → control → actuators)
[Figure F0: block diagram. Measured inputs (temperatures: inlet / outlet / hotspot; tach / FG RPM feedback and status; presence / fault; optional vibration) feed the fan controller (control policy: curve, hysteresis, ramp; derating and safe states: Normal → Boost → Degraded; fault handling: stall detect, retry, escalate; telemetry and logs), which drives the actuators (fan group A: Fan1, Fan2; fan group B: Fan3, Fan4; per-channel PWM drive; soft-start / staggered spin-up). Tach feedback returns to the controller, and fan-domain telemetry (PWM target, RPM actual, status/fault, timestamps) is exposed over SMBus / I²C / PMBus.]

Diagram uses fan-domain signals only; management-plane and facility-level details are intentionally excluded.

H2-2 · System Architecture (Multi-zone + Multi-fan data flow)

Intent

Decompose the subsystem into implementable blocks: thermal zones (what to sense), control placement (where logic runs), closed-loop behavior (how stability is maintained), and redundancy (how failures are absorbed).

Thermal zones: three sensors, three purposes

  • Inlet temperature: baseline for ambient changes and intake restrictions (filters/doors). Typically stable, slower to reflect workload changes.
  • Outlet temperature: coarse indicator of total system heat; good for long-term control but can lag fast hotspots.
  • Hotspot temperature: risk-driven input with the fastest response. Requires filtering + hysteresis to avoid fan “hunting.”
Multi-zone fusion is an architectural choice: max-of-zones (safest, louder), weighted (balanced), or priority override (hotspot takes over on excursions).

Control placement: fan controller vs embedded controller (boundary-safe view)

  • Board-level fan controller: near fan connectors; strong real-time behavior, built-in tach capture and fault state machines.
  • Embedded MCU/EC: flexible policies and sensor fusion; must still guarantee deterministic fault response and safe fallback behavior.

Only the interface expectations are defined here (setpoints, readbacks, alarms). Protocol stacks belong to the dedicated management page.

Closed-loop control and real disturbances

  • Main loop: Temperature → demand (PWM/RPM target) → fan airflow → temperature.
  • Disturbances: clogged filter / backpressure reduces airflow at the same RPM; stall removes airflow entirely; tray removal changes both airflow and sensor context.
  • Stability knobs: hysteresis, minimum duty, ramp limiting, and zone fusion rules to avoid oscillation and acoustic spikes.

Redundancy: N+1 with grouping and guardrails

  • Grouping: A/B trays, front/rear partitions, or zone-bound fans depending on airflow topology.
  • Compensation: one fan fails → remaining fans boost (within acoustic/derating limits) using a pre-defined “degraded curve.”
  • Guardrails: staggered spin-up + ramp limits prevent simultaneous step increases that can cause noise bursts and unstable temperature readings.
Figure F1 — Multi-zone fusion + multi-fan control data flow
[Figure F1: data-flow diagram. Thermal zones (inlet: ambient baseline and intake restriction hint; outlet: total heat trend, slower response; hotspot: risk-driven input that needs filtering and hysteresis) → zone fusion (max / weighted / priority; output is an RPM or PWM target) → control policy (curve, hysteresis, ramp limit, acoustic cap) → fan actuation (group A: Fan1, Fan2; group B: Fan3, Fan4; PWM out, tach feedback) with N+1 redundancy compensation via a degraded curve. Disturbances: filter clog / backpressure, stall / tray removal, sensor noise / hotspot spikes. Mitigation: hysteresis, ramp limit, redundancy compensation.]

This chapter defines a practical architecture: zones → fusion → policy → multi-fan actuation with feedback and redundancy.

H2-3 · Fan Drive Fundamentals (2/3/4-wire fans and drive implementation)

Intent

Select a fan interface and drive topology that delivers stable low-speed behavior, predictable control response, and diagnosable faults in multi-fan server trays.

2-wire DC · 3-wire + tach · 4-wire PWM · Multi-channel drive · Fan-domain protection

2-wire DC: voltage control vs chopping (and what breaks first)

  • Voltage control: reduces the supply delivered to the fan. Simple, but low-speed stability is limited by the fan’s internal commutation and minimum operating point.
  • Chopping / switching: modulates the average voltage with a switch. More efficient, but higher EMI and acoustic risk if frequency/edges land in sensitive bands.
  • Low-speed dead zone: below a minimum voltage/duty, a fan may fail to start or oscillate. A practical drive profile uses spin-up boost followed by a controlled ramp-down to the target.
2-wire control is workable for simple trays, but multi-zone server platforms typically prefer 4-wire PWM for cleaner control and consistent telemetry.

3-wire fans: tach is only useful if measurement is engineered

  • Added value: tach/FG provides RPM truth and early health hints (intermittent drops, jitter, “almost stalled” behavior).
  • Key implication: RPM quality depends on the measurement method (windowing, debouncing, low-speed handling). Poor measurement creates false stalls and unstable fan curves.
  • Design rule: treat tach as a feedback signal that needs a quality gate (valid/invalid) rather than a single raw number.

4-wire PWM fans: control-supply decoupling (server default)

  • Decoupled power vs control: the fan keeps a stable supply while PWM acts as a speed request. This improves low-speed behavior and reduces control-side power artifacts.
  • PWM electrical expectations: many 4-wire inputs are designed for an open-drain style drive with a pull-up; mismatched push-pull signaling can increase noise or create contention.
  • Minimum controllable region: below a minimum duty/RPM, the response becomes non-linear. Fan curves should respect a minimum duty and use controlled ramps.
  • Spin-up profile: short boost on cold start improves start success, then a ramp back to target prevents acoustic spikes.
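
The spin-up profile described above (short boost, then a ramp back to the target) can be sketched as a per-tick duty sequence. This is only an illustration: the function name `spin_up_profile` and every numeric default (70 % boost, 3 boost ticks, 5 % per-tick ramp, 20 % minimum duty) are assumptions, not values from any fan datasheet.

```python
def spin_up_profile(target_pct, boost_pct=70.0, boost_ticks=3,
                    ramp_down_pct=5.0, min_pct=20.0):
    """Return the per-tick PWM duty sequence (percent) for a cold start.

    All defaults are illustrative assumptions, not datasheet values.
    """
    target = max(target_pct, min_pct)      # respect the minimum duty floor
    duty = max(boost_pct, target)          # never boost below the target itself
    seq = [duty] * boost_ticks             # short boost so the rotor starts reliably
    while duty - target > ramp_down_pct:   # then step down gently toward the target
        duty -= ramp_down_pct
        seq.append(duty)
    if not seq or seq[-1] != target:       # land exactly on the target
        seq.append(target)
    return seq


if __name__ == "__main__":
    print(spin_up_profile(40.0))
```

A request below the minimum duty is clamped to the floor, so the sequence always ends at a duty the fan family is assumed to hold reliably.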

Multi-channel drive: isolation, staggering, and failure containment

  • Integrated multi-channel controllers: consistent policy and unified telemetry; they require careful thermal design and single-point failure analysis.
  • Discrete high/low-side per channel: strong isolation and flexibility; larger BOM and more tuning across channels.
  • Staggered spin-up: avoid simultaneous current steps when many fans start or accelerate at once. Use channel sequencing + ramp limits within the fan domain.
  • Failure containment: a short or stalled fan should trip and isolate locally, not collapse the entire tray’s fan rail.

Fan-domain protection (actions and what to log)

  • Overcurrent / short: current limit → latch-off or timed retry. Log fault count and last trip timestamp.
  • Reverse connection risk (service / modular trays): block reverse polarity and raise a clear fault flag.
  • Hot-swap transients: soft-start and controlled ramp reduce connector stress and avoid tach false events during insertion.

Output — Quick selection matrix

| Fan type | Control method | Observability | Low-speed stability | EMI risk | Cost & complexity |
|---|---|---|---|---|---|
| 2-wire DC | Supply voltage or chopped drive | Limited (no native tach) | Higher dead-zone risk; needs spin-up profile | Medium–High (for chopping) | Low |
| 3-wire | Voltage/chop + tach feedback | RPM available; quality depends on measurement | Better than 2-wire if tach is engineered | Medium–High (if chopping) | Low–Medium |
| 4-wire PWM | PWM request + stable supply | Strong (PWM target + tach actual) | Best overall; respect minimum duty and ramp | Lower on supply side; PWM signal needs good routing | Medium |

The selection criteria prioritize predictable control and serviceability for multi-fan server trays.

Figure F3 — Fan interfaces and drive blocks (2/3/4-wire)
[Figure F3: three side-by-side block chains fed from the fan rail. 2-wire DC: drive (voltage or chop) controls the supply → spin-up boost, then ramp down → protection (overcurrent, short, transient) → fan (2 wires only). 3-wire + tach: drive (voltage or chop) → tach capture (window + debounce) → quality gate (valid vs invalid) → fan (power + tach). 4-wire PWM (server default): stable supply held constant → PWM input (open-drain preferred) → minimum duty + ramp to avoid hunting → fan (power + PWM + tach).]

The diagram highlights interface differences and the minimum drive blocks needed for predictable control and serviceability.

H2-4 · Tachometry & Stall Detection (measurement, filtering, and fault decisions)

Intent

Eliminate unstable RPM readings and false stall flags by engineering tach measurement as a quality-controlled signal, then applying a low-false-positive stall decision flow with bounded retries.

PPR · period vs count · adaptive window · debounce · confirm time · kick retry · fail-safe

Tach signal model: pulses, PPR, and two measurement modes

  • Pulses-per-revolution (PPR): determines how pulse frequency maps to RPM. Wrong PPR assumptions directly distort RPM.
  • Period capture: measures time between pulses. Better at low speed, but sensitive to glitches and missing edges.
  • Count-in-window: counts pulses in a time window. Responsive at high speed, but quantized and unstable at low speed if the window is too short.
Practical rule: use adaptive windows or a hybrid (period at low RPM, count at high RPM) so the control loop stays stable across the full speed range.
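
As a rough illustration of the two measurement modes and the hybrid rule, the sketch below converts tach data to RPM. The function names, the default of 2 pulses per revolution, and the 1500 RPM crossover are assumptions chosen for the example; the real PPR must come from the fan datasheet.

```python
def rpm_from_period(period_s, ppr=2):
    """RPM from the time between two consecutive tach pulses (period capture)."""
    if period_s <= 0 or ppr <= 0:
        raise ValueError("period_s and ppr must be positive")
    return 60.0 / (period_s * ppr)


def rpm_from_count(pulses, window_s, ppr=2):
    """RPM from the number of pulses counted inside a time window."""
    if window_s <= 0 or ppr <= 0:
        raise ValueError("window_s and ppr must be positive")
    return pulses * 60.0 / (window_s * ppr)


def hybrid_rpm(period_s, pulses, window_s, ppr=2, crossover_rpm=1500.0):
    """Period capture at low speed, pulse counting at high speed.

    crossover_rpm is an assumed tuning value, not a standard.
    """
    by_count = rpm_from_count(pulses, window_s, ppr)
    if period_s is None or by_count >= crossover_rpm:
        return by_count
    return rpm_from_period(period_s, ppr)


if __name__ == "__main__":
    # With a 0.1 s window and 2 PPR, one pulse more or less moves the
    # count-based reading by 300 RPM, which is why short windows are
    # unstable at low speed.
    print(rpm_from_count(1, 0.1), rpm_from_count(2, 0.1))
```

The quantization step of the count method is 60 / (window × PPR) RPM, so extending the window at low speed shrinks the step at the cost of slower updates.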

Low-speed stability: windowing + debouncing + confidence gating

  • Adaptive window length: extend the measurement window at low RPM to reduce quantization and avoid “0 count” swings.
  • Debounce / minimum pulse width: reject short glitches that would create phantom RPM and false “alive” status.
  • Quality gate: output both RPM and a tach_valid indicator. When tach_valid is low, stall logic should enter a suspect state rather than trip immediately.

Failure patterns (classify before acting)

  • Tach missing: no pulses for a timeout. Causes include connector issues, fan failure, or overly aggressive filtering.
  • “Alive but wrong” pulses: jittery or inconsistent periods, glitches, or noise coupling. Requires quality checks before using RPM.
  • Intermittent drops: RPM dips and recovers. Often indicates marginal mechanics, backpressure changes, or unstable low-speed command regions.

Stall / locked-rotor decision flow (bounded, serviceable)

  • Confirm time: require the tach fault to persist long enough to reject transient disturbances.
  • Kick retry: apply a short spin-up boost to recover from near-stall, limited by retry count and interval.
  • Fail-safe: isolate the channel (if supported), raise a clear fault, and allow redundancy compensation to take over.

The goal is predictable behavior: avoid oscillating between normal and fault states, and avoid endless retry loops that create acoustic and electrical disturbances.
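
One way to picture the bounded flow (suspect → confirm → kick retry → fail-safe) is a small tick-driven state machine. The class and field names below are hypothetical, timing is expressed in abstract ticks rather than real time, and the retry budget is deliberately not refilled on recovery; a production policy might refill it after a long healthy period.

```python
from dataclasses import dataclass
from enum import Enum


class StallState(Enum):
    NORMAL = "normal"
    SUSPECT = "suspect"
    KICK = "kick"
    FAILSAFE = "failsafe"


@dataclass
class StallDetector:
    confirm_ticks: int = 3   # ticks the tach must stay bad before acting
    kick_ticks: int = 2      # duration of each spin-up boost
    max_retries: int = 2     # bounded number of kicks
    state: StallState = StallState.NORMAL
    bad_ticks: int = 0
    kick_left: int = 0
    retries: int = 0

    def _start_kick(self):
        self.state = StallState.KICK
        self.kick_left = self.kick_ticks
        self.retries += 1
        self.bad_ticks = 0

    def _recover(self):
        self.state = StallState.NORMAL
        self.bad_ticks = 0

    def step(self, tach_ok):
        """Advance one tick; tach_ok is the tach quality-gate result."""
        if self.state is StallState.FAILSAFE:
            return self.state                    # latched until serviced
        if self.state is StallState.KICK:
            self.kick_left -= 1
            if self.kick_left > 0:
                return self.state                # boost still running
            if tach_ok:
                self._recover()
            elif self.retries >= self.max_retries:
                self.state = StallState.FAILSAFE  # retries exhausted
            else:
                self._start_kick()
            return self.state
        if tach_ok:
            self._recover()
            return self.state
        self.bad_ticks += 1
        self.state = StallState.SUSPECT
        if self.bad_ticks >= self.confirm_ticks:
            self._start_kick()
        return self.state
```

With these assumed defaults, a tach that never recovers reaches `FAILSAFE` on the seventh tick, while a single bad sample followed by a good one returns to `NORMAL` without spending a retry.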

Output — Stall detection parameter checklist (fan domain)

  • PPR (pulses per revolution)
  • Measurement mode (period / count / hybrid)
  • Window length policy (fixed vs adaptive)
  • Debounce / minimum pulse width
  • Confirm time (fault persistence threshold)
  • Kick PWM and kick duration
  • Retry count and retry interval
  • Escalation level (suspect → confirm → fail-safe)
  • Logging fields (fault count, last fault time, last valid RPM)
Figure F2 — PWM & tach timing, measurement windows, and stall decision states
[Figure F2: timing diagram with four rows. PWM duty (low / high / mid / high), tach pulses (including a span of missing pulses), measurement windows (count window, adaptive window, period capture), and a decision timeline: Normal → Suspect (confirm timer) → Kick retry → Fail-safe (alarm + isolate).]

The diagram emphasizes measurement quality gates and bounded retries to reduce false positives.

H2-5 · Vibration Sensing (from “measurable” to “actionable”)

Intent

Make vibration sensing operational: choose sensor placement, pick robust indicators, align spectra to RPM to reduce false positives, and convert signals into tiered maintenance actions that work with fan redundancy.

accelerometer / IMU · RMS · peak · 1× / 2× · RPM-aligned spectrum · watch / plan / replace

Sensors: what matters for fan-tray health

  • Accelerometer vs IMU: for fan imbalance and bearing wear, an accelerometer is typically sufficient; an IMU can help when multi-axis context is needed, but adds cost and integration overhead.
  • Bandwidth and noise floor: the useful band should cover RPM-related components (1×/2×) and nearby structural resonance bands; noise floor determines how early degradation can be detected.
  • Mounting and axis orientation: rigid mounting near the fan tray structure is critical; with a loose or compliant mount, the installation itself becomes the dominant signal source.
Vibration becomes actionable only after separating fan-related components from ambient chassis vibration and random shocks.

Indicators: RMS, peak, and RPM-aligned spectral peaks

  • RMS (band energy): best for trend monitoring and tiered thresholds; stable across windows when configured consistently.
  • Peak: sensitive to knocks, looseness, and rubbing events; requires confirmation time to avoid one-off false alarms.
  • Spectrum peaks (1×/2×): strongest for attribution. 1× often maps to imbalance; 2× can indicate misalignment or structural effects, depending on mechanics.

A single metric is rarely reliable; a minimal robust combination is RMS (trend) + 1× alignment (attribution) + persistence over time.

RPM alignment: using tach to reduce false positives

  • Step 1 — validate tach: use a tach quality gate (valid/invalid) so RPM is not derived from glitches.
  • Step 2: compute expected frequencies in Hz: f1 = RPM/60 and f2 = 2 × RPM/60 (for example, 6000 RPM gives f1 = 100 Hz and f2 = 200 Hz).
  • Step 3 — align the spectrum: search energy within narrow windows around f1 and f2. Peaks that do not track RPM are more likely resonance or ambient vibration.
  • Step 4 — require persistence: promote events only when alignment persists across multiple windows (prevents “one-frame” misclassification).
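
The four steps above can be sketched as follows, assuming the dominant spectral peak per window has already been extracted elsewhere. The function names, the ±2 Hz tolerance, and the three-window persistence count are illustrative assumptions.

```python
def expected_freqs(rpm):
    """Return (f1, f2) in Hz for a given RPM: f1 = RPM/60, f2 = 2 * RPM/60."""
    f1 = rpm / 60.0
    return f1, 2.0 * f1


def classify_peak(peak_hz, rpm, tol_hz=2.0):
    """Label a spectral peak as '1x', '2x', or 'unaligned' relative to RPM."""
    f1, f2 = expected_freqs(rpm)
    if abs(peak_hz - f1) <= tol_hz:
        return "1x"
    if abs(peak_hz - f2) <= tol_hz:
        return "2x"
    return "unaligned"


def persistent_alignment(windows, tol_hz=2.0, min_windows=3):
    """True if the peak tracks 1x RPM for min_windows consecutive windows.

    windows: iterable of (rpm, peak_hz, tach_valid) tuples, one per window.
    Windows with an invalid tach never count toward persistence.
    """
    run = 0
    for rpm, peak_hz, tach_valid in windows:
        if tach_valid and classify_peak(peak_hz, rpm, tol_hz) == "1x":
            run += 1
            if run >= min_windows:
                return True
        else:
            run = 0
    return False
```

A peak that stays at a fixed frequency while RPM changes only lines up with f1 by coincidence in an occasional window, so the persistence requirement filters it out.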

Maintenance tiers: convert signals into clear actions

  • Watch: mild RMS rise, occasional 1× alignment. Action: log trend, increase observation frequency, no immediate fan speed escalation.
  • Plan: persistent 1× alignment + upward RMS trend. Action: schedule replacement in a maintenance window; keep N+1 margin.
  • Replace now: strong peaks, repeated shocks, or vibration correlated with intermittent RPM drops. Action: fail-safe escalation and redundancy compensation.

Tiering avoids “alarm storms” and unnecessary full-speed operation while still protecting thermals.
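
A tier decision of this kind can be written as a small pure function. The sketch below is one possible reading of the three tiers; the input flags, the function name `maintenance_tier`, and the three-window persistence default are assumptions, and real thresholds would have to be derived from fleet data.

```python
def maintenance_tier(rms_rising, aligned_windows, severe_peak=False,
                     rpm_drops=False, persist_windows=3):
    """Map vibration evidence to 'ok', 'watch', 'plan', or 'replace'.

    rms_rising:      RMS trend is upward
    aligned_windows: consecutive windows with a 1x RPM-aligned peak
    severe_peak:     strong peaks or repeated shocks observed
    rpm_drops:       intermittent RPM drops seen alongside vibration
    """
    vibration_present = rms_rising or aligned_windows > 0
    if severe_peak or (rpm_drops and vibration_present):
        return "replace"
    if aligned_windows >= persist_windows and rms_rising:
        return "plan"
    if vibration_present:
        return "watch"
    return "ok"
```

Keeping the decision free of side effects makes it straightforward to replay logged windows through it when tuning thresholds.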

Output — Symptom → probable cause → recommended action

| Symptom | RPM-aligned? | Probable cause | Recommended action |
|---|---|---|---|
| Strong 1× peak + RMS slowly rising | Yes (tracks RPM) | Imbalance / debris on blades | Plan replacement; verify tray cleanliness; keep redundancy margin |
| Broadband RMS rise + weak/variable peaks | Sometimes | Bearing wear (early) / friction | Watch → Plan based on trend slope and persistence |
| Fixed-frequency peak not moving with RPM | No | Chassis resonance / external vibration | Check mounting points; evaluate structural damping; avoid misclassification as fan fault |
| High peak shocks + directional spikes | Mixed | Loose installation / tray rattle / rubbing | Inspect fastening; if recurring, Replace now to prevent secondary damage |

RPM alignment is the primary discriminator between fan-originated issues and ambient vibration/resonance.

Figure F4 — RPM-aligned spectrum and tiered thresholds
[Figure F4: left panel is a simplified spectrum (energy vs frequency) with vertical markers for the RPM-aligned 1×/2× windows and one fixed peak that does not move with RPM (resonance). Right panel shows tiered thresholds: Watch (RMS trend rising), Plan (1× aligned and persistent), Replace (high peak / RPM drops).]

Use RPM alignment to separate fan-originated peaks from ambient vibration and fixed resonances.

H2-6 · Thermal Inputs (sensor placement and trustworthiness)

Intent

Make temperature inputs reliable for fan control: place sensors with clear physical meaning (inlet/outlet/hotspot), manage time constants and filtering to avoid fan hunting, and explain why a cold inlet can still coexist with a hot hotspot.

inlet / outlet / hotspot · time constant · filter · step detect · zone fusion · airflow hints

Sensor types (high-level, fan-domain view)

  • Digital temperature sensors: consistent calibration and easy integration; placement and thermal coupling dominate accuracy in airflow paths.
  • NTC thermistors: fast and low cost; they require linearization and careful mounting to avoid reading “air” instead of structure.
  • Remote diode (hotspot source): can represent risk points, but this page treats it as a generic hotspot input without board-level deep dive.

Placement strategy: what each point is meant to answer

  • Inlet (baseline): tracks ambient intake changes and gross intake restrictions; typically slow to reflect workload spikes.
  • Outlet (system trend): correlates with total heat and long-term cooling sufficiency; may lag local risk.
  • Hotspot (risk): fastest indicator of local over-temperature; must be filtered and gated to prevent control oscillation.
“Inlet is cold but hotspot is hot” often indicates local airflow short-circuit, backpressure increase, or a fast hotspot rise that the inlet sensor cannot reflect in time.

Sampling & filtering: stabilize the control loop

  • Time-constant matching: sensors have different thermal lag; mixing raw values can cause unstable fan demand.
  • Filtering: moving average or first-order filters reduce noise but add delay; hotspot paths require careful balance.
  • Step detection: detect rapid rises and apply temporary override (boost) without permanently raising noise levels.
  • Validity checks: detect stuck-at, open/short, or implausible jumps; degrade gracefully instead of oscillating.
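
A first-order filter combined with step detection might look like the sketch below. The class name `HotspotInput`, the smoothing factor, and the 3 °C step limit are assumptions for illustration; the step test compares each raw sample against the filtered value so a fast rise is flagged before the filter has caught up.

```python
class HotspotInput:
    """First-order (exponential) filter with rapid-rise detection."""

    def __init__(self, alpha=0.2, step_limit_c=3.0):
        self.alpha = alpha                # 0..1; higher means less smoothing, less delay
        self.step_limit_c = step_limit_c  # raw-minus-filtered rise that counts as a step
        self.filtered = None

    def update(self, raw_c):
        """Return (filtered_temp_c, step_detected) for one new sample."""
        if self.filtered is None:
            self.filtered = raw_c         # seed the filter with the first sample
            return self.filtered, False
        step = (raw_c - self.filtered) >= self.step_limit_c
        self.filtered += self.alpha * (raw_c - self.filtered)
        return self.filtered, step
```

The `step_detected` flag is meant to drive a temporary boost elsewhere in the policy, while the filtered value continues to feed the normal curve.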

Airflow hints (fan-domain only)

  • RPM is not airflow truth: backpressure and filter clog can reduce airflow even when RPM is high.
  • Consistency checks: if RPM is high but outlet/hotspot continues rising, suspect restriction or recirculation in the airflow path.
  • Action tie-in: use these hints to select safe fan curves and avoid misattributing a thermal issue to tach or vibration alone.
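
The consistency check can be expressed as a simple comparison over a short history window. The function name and both thresholds (8000 RPM, 2 °C rise) are placeholder assumptions; appropriate values depend on the fan family and chassis.

```python
def restriction_suspected(rpm_history, outlet_history_c,
                          rpm_high=8000.0, rise_c=2.0):
    """True when RPM stayed high over the window yet outlet temperature kept rising.

    Both lists cover the same window, oldest sample first.
    """
    if len(rpm_history) < 2 or len(rpm_history) != len(outlet_history_c):
        return False                      # not enough data to judge
    rpm_high_throughout = min(rpm_history) >= rpm_high
    outlet_rise = outlet_history_c[-1] - outlet_history_c[0]
    return rpm_high_throughout and outlet_rise >= rise_c
```

A positive result is only a hint: it points toward restriction or recirculation but should be corroborated before it changes the fault classification.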

Output — Temperature placement checklist

  • Location: inlet near true intake, outlet near exhaust path, hotspot at the risk zone (not in a dead-air pocket).
  • Thermal coupling: ensure consistent contact to the intended medium (structure vs air); avoid floating sensors.
  • Shielding from direct blast: prevent inlet sensors from being artificially cooled by local jets that do not represent bulk intake.
  • Calibration consistency: account for sensor tolerances and mounting variance; verify cross-sensor plausibility.
  • Sampling & filter settings: choose update rate and filter strength to avoid hunting; use separate filters per zone if needed.
  • Fault handling: define behavior for open/short, stuck values, and sudden jumps (invalidate → substitute → alert).
Figure F5 — Inlet/outlet/hotspot placement and the “trust chain” into zone fusion
[Figure F5: top panel is a simplified airflow path through a server tray: inlet (baseline) → hotspot (risk zone, fast changes) → outlet (system trend), with a filter-clog marker and the note "cold inlet ≠ safe hotspot". Bottom panel is the trust chain into control: sensors (digital / NTC / hotspot source) → filter (time constant, noise reduction) → step detect (rapid rise, override boost) → zone fusion (max / weighted / priority).]

Separate “placement meaning” from “signal trust.” Stable fan control depends on both.

H2-7 · Control Logic (fan curves, zone control, and the stability–noise–lifetime triangle)

Intent

Turn a fan curve into an executable, verifiable policy: stable temperature regulation without audible hunting, while protecting fan mechanics and minimizing unnecessary high-speed operation.

LUT / PID / Hybrid · hysteresis · deadband · min PWM · ramp limits · zone fusion

Curve strategies: LUT vs PID vs hybrid

  • Lookup table (piecewise linear): widely used in servers because it is auditable and predictable. Requires hysteresis + ramp limits to avoid threshold hunting.
  • PID (closed-loop): useful when the controlled temperature is well-defined and measurements are trustworthy. Needs output limiting and anti-windup to prevent oscillation under delays.
  • Hybrid (feedforward + small closed-loop trim): uses a LUT for the main command and applies a limited trim for disturbances. This keeps behavior explainable while improving stability.
A “great curve” still hunts if thermal inputs are noisy or delayed. Stabilization mechanisms must be part of the policy, not an afterthought.

Anti-hunt toolkit: four mechanisms that make curves stable

  • Hysteresis band: separate “enter” and “exit” thresholds so small temperature wandering does not cause frequent curve segment switches.
  • Deadband: ignore tiny changes (ΔT) to avoid micro-updates that become audible PWM/RPM modulation.
  • Minimum PWM / minimum RPM: prevent low-speed stall and tach instability; define a stable operating floor for the fan family.
  • Ramp (slew-rate) limits: shape PWM changes over time to reduce acoustic spikes and mechanical stress. Use asymmetric ramps: faster up, slower down.

Ramp limits should cooperate with step detection: allow a time-bounded override for rapid hotspot rises, then return to smooth behavior.
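
To make the interplay concrete, here is a minimal sketch of a piecewise-linear LUT with a temperature deadband, a minimum-duty floor (the first LUT point), and asymmetric ramp limits. The names `lut_duty` and `FanCurve` and all numeric values are assumptions; threshold hysteresis and the step-detect override are left out to keep the example short.

```python
def lut_duty(temp_c, points):
    """Piecewise-linear interpolation over sorted (temp_c, duty_pct) points."""
    if temp_c <= points[0][0]:
        return points[0][1]               # below the table: hold the minimum duty
    for (t0, d0), (t1, d1) in zip(points, points[1:]):
        if temp_c <= t1:
            return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)
    return points[-1][1]                  # above the table: hold the maximum duty


class FanCurve:
    def __init__(self, points, deadband_c=1.0, ramp_up=10.0, ramp_down=2.0):
        self.points = points
        self.deadband_c = deadband_c      # ignore temperature moves smaller than this
        self.ramp_up = ramp_up            # max duty increase per tick (percent)
        self.ramp_down = ramp_down        # max duty decrease per tick (percent)
        self.held_temp = None
        self.duty = points[0][1]          # start at the minimum duty floor

    def update(self, temp_c):
        """Return the ramp-limited duty for one control tick."""
        # Deadband: accept a new temperature only if it moved far enough.
        if self.held_temp is None or abs(temp_c - self.held_temp) >= self.deadband_c:
            self.held_temp = temp_c
        target = lut_duty(self.held_temp, self.points)
        # Asymmetric ramp: rise quickly, fall slowly.
        delta = target - self.duty
        if delta > 0:
            delta = min(delta, self.ramp_up)
        else:
            delta = max(delta, -self.ramp_down)
        self.duty += delta
        return self.duty
```

Because the ramp acts on the output and the deadband on the input, the two can be tuned independently: one shapes audible transitions, the other suppresses micro-updates.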

Zone fusion: turning multi-sensor inputs into one command

  • Zone max: safest for hotspot protection but can be dominated by one noisy sensor; requires validity checks and confirmation time.
  • Weighted fusion: quieter and smoother, but can under-react to a true hotspot unless a priority override exists.
  • Priority (hotspot override): run weighted fusion in normal conditions, then promote hotspot to priority control beyond a defined threshold.
A practical pattern is “weighted in normal mode + hotspot priority in boost mode,” with a recovery hysteresis to avoid bouncing.
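
The "weighted in normal mode, hotspot priority in boost mode" pattern can be sketched as a single function that carries the override flag between calls. The zone names, weights, and the 85 °C / 80 °C on/off thresholds are illustrative assumptions, and the function expects a `"hotspot"` key in the demand map.

```python
def fuse_zones(demands, weights, hotspot_c, override_active,
               on_c=85.0, off_c=80.0):
    """Fuse per-zone duty demands into one command.

    demands / weights: dicts keyed by zone name; must include "hotspot".
    Returns (duty_pct, override_active) so the caller can keep the flag.
    """
    # Recovery hysteresis: enter at on_c, leave only at or below off_c.
    if override_active:
        override_active = hotspot_c > off_c
    else:
        override_active = hotspot_c >= on_c

    total_weight = sum(weights[z] for z in demands)
    weighted = sum(demands[z] * weights[z] for z in demands) / total_weight

    if override_active:
        # Hotspot takes priority, but never commands less than the weighted result.
        return max(weighted, demands["hotspot"]), True
    return weighted, False
```

The gap between `on_c` and `off_c` is what prevents bouncing: a hotspot hovering around the entry threshold keeps the override engaged until it has clearly cooled.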

Noise constraints (policy-level)

  • Night mode: cap maximum PWM or tighten downward ramps to minimize audible transitions.
  • Acoustic cap: flatten selected regions of the curve so normal workload variation stays within a predictable noise envelope.
  • Emergency exception: allow temporary cap bypass only in high-risk states (e.g., Boost/Critical), and log the duration.

Output — Curve design parameters (config-ready table)

| Parameter | What it controls | Why it matters | Typical pitfalls |
|---|---|---|---|
| T_low / T_high | Curve breakpoints and segment transitions | Defines how aggressively fan speed rises with temperature | Too steep → noise spikes; too flat → thermal margin loss |
| min PWM | Stable operating floor | Avoids stall, low-RPM tach noise, and “start/stop” wear | Too low → stalls; too high → idle noise increases |
| max PWM | Top-end cap (normal or night mode) | Noise ceiling management | Over-capping hides thermal insufficiency |
| hysteresis (ΔT) | Entry/exit band around thresholds | Suppresses curve hunting near breakpoints | Too small → oscillation; too large → slow recovery |
| deadband (ΔT) | Ignore small changes | Removes audible micro-modulation | Too large → sluggish response |
| ramp_up / ramp_down | Slew-rate of PWM changes | Noise and lifetime control; prevents mechanical stress | Too strict can delay hotspot protection without override |
| hotspot override threshold | When hotspot takes priority | Prevents weighted fusion from missing local risk | Too low causes frequent “boost” activation |
| override timeout | How long an emergency relaxation can last | Stops prolonged noisy behavior after the transient passes | Too short causes repeated toggling |
Figure F6 — Fan curve with hysteresis band and ramp limiting
[Figure F6: left panel is a piecewise fan curve (PWM vs temperature) showing the min PWM floor, a hysteresis band with separate T↑ and T↓ thresholds, and a deadband that ignores tiny ΔT. Right panel shows ramp limiting (PWM vs time): a raw step compared with the ramp-limited response, with a faster ramp up and a slower ramp down.]

Stability is created by constraints: hysteresis and ramp limits are as important as the curve itself.

H2-8 · Derating & Safe States (controlled degradation instead of collapse)

Intent

When cooling margin disappears, the system should move through explicit safe states—boost, degraded, critical, shutdown—using confirmation timers and recovery hysteresis to prevent thrashing. The goal is controlled behavior with clear logs and predictable outcomes.

Normal → Boost · Degraded · Critical · Shutdown · confirm time · recovery hold

Triggers (fan-domain view with confidence)

  • Thermal: hotspot over threshold, outlet trend rising, or rapid temperature slope detected (highest priority).
  • Fan health: tach missing, stall detected, fan absent/present mismatch, or repeated restart failures.
  • Vibration tier: persistent RPM-aligned peaks or severe shocks; treat as predictive risk that tightens policies rather than immediate shutdown.
Predictive signals should bias the system toward safer operating points and earlier maintenance tiers, but must not trigger shutdown without corroborating thermal or fan-health evidence.

Actions by state (what changes, and why)

  • Normal: follow H2-7 control policy (zone fusion + stabilizers).
  • Boost: temporarily raise fan demand; allow a time-bounded ramp override for fast hotspot recovery; enable hotspot priority control.
  • Degraded: keep service running with reduced margin; increase base airflow and issue an abstract throttle request upstream (no CPU mechanism details).
  • Critical: enforce stronger protection (more aggressive boost + throttle request); require stricter recovery conditions.
  • Shutdown (last resort): request an orderly stop when hardware risk is imminent and recovery is not possible within safe limits.

Anti-thrash safeguards (the difference between safe and noisy)

  • Confirm time: require triggers to persist (sensor-dependent) before state entry; prevents one-sample spikes from escalating.
  • Combined conditions: require corroborating conditions joined by AND (e.g., hotspot high AND outlet rising) before promoting to severe states, so a single noisy input cannot escalate on its own.
  • Recovery hysteresis + hold time: exit only after temperature is below a recovery threshold for a minimum duration; avoids bouncing.

A safe-state machine must be harder to exit than to enter; otherwise it oscillates under noise and delays.
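
The "harder to exit than to enter" rule can be illustrated with a single Normal ↔ Boost transition gate. The class name `BoostGate`, the thresholds, and the tick counts are assumptions; a full state machine would chain several such gates and add the combined-condition checks.

```python
class BoostGate:
    """Enter after confirm_ticks above enter_c; exit after hold_ticks below recover_c."""

    def __init__(self, enter_c=85.0, recover_c=80.0, confirm_ticks=2, hold_ticks=3):
        self.enter_c = enter_c
        self.recover_c = recover_c        # lower than enter_c: recovery hysteresis
        self.confirm_ticks = confirm_ticks
        self.hold_ticks = hold_ticks      # longer than confirm_ticks: harder to exit
        self.active = False
        self.count = 0

    def update(self, temp_c):
        """Advance one tick and return True while Boost is active."""
        if not self.active:
            # Count consecutive samples above the entry threshold.
            self.count = self.count + 1 if temp_c > self.enter_c else 0
            if self.count >= self.confirm_ticks:
                self.active, self.count = True, 0
        else:
            # Count consecutive samples below the recovery threshold.
            self.count = self.count + 1 if temp_c < self.recover_c else 0
            if self.count >= self.hold_ticks:
                self.active, self.count = False, 0
        return self.active
```

Any sample that breaks the streak resets the counter, so a one-sample spike cannot enter Boost and a brief dip cannot leave it.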

Output — Derating strategy card (Trigger → Action → Recovery)

| State | Trigger (with confirmation) | Action | Recovery (hysteresis + hold) |
|---|---|---|---|
| Boost | Hotspot > T_boost for t_confirm OR temperature slope exceeds limit | Raise target PWM; enable hotspot priority; allow bounded ramp override | Hotspot < T_boost_recover for t_hold; no active slope alarms |
| Degraded | Boost active too long OR fan redundancy lost + thermal margin shrinking | Increase base PWM; limit min airflow; issue abstract throttle request | Redundancy restored AND temps stable below recover thresholds for t_hold |
| Critical | Hotspot > T_crit for t_confirm AND outlet trending up | Maximize safe airflow; stronger throttle request; elevate alarms | Hotspot < T_crit_recover for longer t_hold; trend normalized |
| Shutdown | Critical persists OR temperature exceeds hard limit for t_confirm | Orderly shutdown request (last resort) + capture black-box logs | Manual intervention / maintenance required |

Keep trigger logic explicit and logs consistent so operational teams can reproduce and audit decisions.

Figure F7 — Safe-state machine and the derating policy card
[Figure F7: left panel is the fan-domain state machine: Normal (policy control) → Boost (fast recovery) → Degraded (reduced margin) → Critical (protect hardware) → Shutdown (last resort). Entry requires hotspot > threshold plus confirm time; promotion occurs if redundancy is lost OR the outlet trend is rising; recovery uses hysteresis plus hold time. Right panel is the policy card template: Trigger (condition + confirm), Action (fans + request), Recovery (hysteresis + hold).]

Keep state transitions explicit and harder to exit than to enter to avoid oscillation under noise and delays.

H2-9 · Redundancy & Hot-Swap (N+1, fan trays, and fault coupling)

Intent

Cover the core availability requirement in rack servers: keep thermals stable through a fan failure or hot-swap by combining grouped N+1 compensation, zone recomputation, and explicit fault-to-action mapping.

N+1 · fan groups · curve uplift · zone recompute · present debounce · tach_valid gating

N+1 and grouping: compensate without “everything to max”

  • Group-level uplift: when one fan in a group fails, uplift the remaining fans in that group first. This preserves acoustics better than a global uplift.
  • Zone recomputation: re-evaluate zone fusion (max/weighted/priority) under reduced airflow margin; bias toward hotspot protection only when needed.
  • State gating: if redundancy is lost and temperature trends worsen, escalate to a safer state (Boost/Degraded) rather than oscillating the curve.
Compensation is a three-layer action: fan-level uplift + zone-level recompute + state-level escalation.
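
The fan-level layer (group-first uplift) could be sketched like this. The function name, the flat 20 % uplift, and the choice to command a failed fan to 0 % are assumptions; a real design would typically use a pre-validated degraded curve rather than a fixed offset.

```python
def apply_group_uplift(targets, groups, failed, uplift_pct=20.0, max_pct=100.0):
    """Uplift only the surviving fans that share a group with a failed fan.

    targets: fan_id -> duty percent from the normal policy
    groups:  fan_id -> group id (e.g. "A" or "B")
    failed:  set of fan_ids currently marked failed
    """
    affected_groups = {groups[f] for f in failed}
    result = {}
    for fan_id, duty in targets.items():
        if fan_id in failed:
            result[fan_id] = 0.0                         # isolate the failed channel
        elif groups[fan_id] in affected_groups:
            result[fan_id] = min(duty + uplift_pct, max_pct)
        else:
            result[fan_id] = duty                        # other groups stay on policy
    return result
```

Zone recomputation and state escalation would then run on top of these adjusted targets, which keeps the three layers separable for testing.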

Hot-swap flow (fan domain only)

  • Present detection: debounce present transitions to avoid contact chatter creating false remove/insert events.
  • Swap transient window: during insertion/removal, treat tach as invalid and suppress stall classification until stability is proven.
  • Post-insert self-test: run a short spin-up phase, then require tach_valid across multiple windows before joining normal policy control.
  • Return to policy: ramp back to the computed target (avoid abrupt drops that create audible steps).

Key rule: do not enter closed-loop decisions on a newly inserted fan until tach validity is established.
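
The hot-swap sequence above can be sketched as a per-fan gate that debounces the present pin and only reports "joined" after tach has been good for several consecutive windows. The class name `HotSwapGate`, the phase names, and the window counts are hypothetical choices for illustration.

```python
class HotSwapGate:
    """Per-fan gate: absent -> self_test -> joined (sketch, assumed phase names)."""

    def __init__(self, debounce_n=3, valid_n=4):
        self.debounce_n = debounce_n  # consecutive samples needed to accept a present change
        self.valid_n = valid_n        # consecutive good tach windows needed to join policy
        self.present = False          # debounced presence
        self.phase = "absent"
        self._change_cnt = 0
        self._valid_cnt = 0

    def step(self, raw_present: bool, tach_ok: bool) -> str:
        # Present debounce: accept a change only after debounce_n agreeing samples.
        if raw_present != self.present:
            self._change_cnt += 1
            if self._change_cnt >= self.debounce_n:
                self.present = raw_present
                self._change_cnt = 0
                self._valid_cnt = 0
                self.phase = "self_test" if raw_present else "absent"
        else:
            self._change_cnt = 0

        # Self-test: tach must be valid for valid_n windows in a row.
        if self.phase == "self_test":
            self._valid_cnt = self._valid_cnt + 1 if tach_ok else 0
            if self._valid_cnt >= self.valid_n:
                self.phase = "joined"
        return self.phase
```

Stall classification and closed-loop control would be enabled only while the phase is "joined"; in "self_test" the fan runs its open-loop spin-up and tach is treated as untrusted.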

Fault coupling: consistent alarm level and deterministic actions

  • Tach missing / invalid: classify only after confirm time; correlate with present transitions and low-speed conditions to avoid false stalls.
  • Stall / lock: escalate after retries; apply a bounded re-kick (spin-kick) policy and log retry counters.
  • Current anomaly (fan domain): separate short startup surge from persistent overcurrent; persistent cases should promote severity quickly.
  • Vibration anomaly: treat as predictive risk; tighten policy and maintenance tiering, but do not jump to shutdown without thermal/health corroboration.

Output — Fault tolerance matrix (type × severity × action × log fields)

Fault type | Severity | Immediate action | Persistence rule | Recovery rule | Record fields (minimum)
Fan absent / present change | Warning → Critical (if redundancy lost) | Recompute groups/zones; uplift affected group; gate tach-based faults during swap window | present debounce + confirm window | present stable for hold time + tach_valid established | ts, fan_id, group_id, present_state, debounce_cnt, state
Tach missing / invalid | Warning | Hold current target; run validity gating; if persists, promote to stall handling | tach_invalid for N windows | tach_valid for M windows + RPM within expected band | ts, fan_id, pwm_cmd, rpm_meas, tach_valid, window_cnt
Stall / lock | Critical | Spin-kick retry (bounded); if fails, mark fan failed and apply N+1 compensation; possible state escalation | stall detected + confirm time; retry up to K times | rpm stable + no stall flags for hold time | ts, fan_id, stall_cnt, retry_cnt, retry_interval, state
Current anomaly (fan domain) | Warning (surge) / Critical (persistent) | Filter startup surge; persistent overcurrent triggers fan isolation (if supported) + compensation | I_anom duration > t_confirm | current normal for hold time + rpm stable | ts, fan_id, i_flag, i_duration, pwm_cmd, state
Vibration severe (RPM-aligned) | Info → Warning | Tighten curve / raise base airflow; promote maintenance tier; keep redundancy margin | aligned tier for N windows | tier drops below threshold for hold time | ts, fan_id, vib_tier, align_flag, rpm_meas, state

A good matrix separates transient behavior (swap/surge/noise) from persistent faults (stall/persistent overcurrent).
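
The stall row of the matrix (confirm time, then a bounded number of spin-kick retries, then mark the fan failed) can be sketched as a small counter-based policy. The class name `StallRetry`, the action strings, and the default counts are assumptions for illustration; a real policy would also space retries by `retry_interval` and reset `retry_cnt` after a healthy hold time.

```python
from dataclasses import dataclass


@dataclass
class StallRetry:
    max_retries: int = 3      # K in the matrix (assumed value)
    confirm_windows: int = 2  # consecutive stalled windows before acting
    stall_cnt: int = 0        # stalled windows since the last action
    retry_cnt: int = 0        # spin-kicks issued so far (kept for logging)
    failed: bool = False

    def step(self, stalled: bool) -> str:
        """Consume one tach window and return the action for this window."""
        if self.failed:
            return "failed"            # latched until the fan is serviced
        if not stalled:
            self.stall_cnt = 0
            return "ok"
        self.stall_cnt += 1
        if self.stall_cnt < self.confirm_windows:
            return "confirming"        # still inside the confirm time
        self.stall_cnt = 0
        if self.retry_cnt < self.max_retries:
            self.retry_cnt += 1
            return "spin_kick"         # bounded re-kick
        self.failed = True             # retries exhausted: hand over to N+1 compensation
        return "failed"
```

Logging `stall_cnt` and `retry_cnt` on every returned action gives the record fields listed in the matrix without extra bookkeeping.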

Figure F8 — N+1 group compensation and the hot-swap gating sequence
[Figure F8 diagram: N+1 grouped compensation (Group A with one failed fan, Group B healthy; compensation = group uplift, zone recompute, state gate) and the hot-swap gating sequence (Remove → Present debounce → Insert → Spin-up self-test → Tach valid → Join policy), with tach treated as invalid during the swap window.]

Debounce and tach_valid gating are essential to avoid false stall classification during hot-swap.

H2-10 · Interfaces & Telemetry (fields, alarms, and logs that actually debug issues)

Intent

Make observability practical: connect the fan-control fast path (PWM/tach/sideband) with the slow management path (I²C/SMBus/PMBus semantics), define minimal telemetry that debugs most failures, and prevent log storms with dedupe and throttling.

PWM / tach · present / fault · I²C / SMBus · PMBus · dedupe · throttle

Interface layers (what each path is responsible for)

  • Fast path: PWM command, tach/FG feedback, and sideband (present/fault). This is where control stability and validity gating happen.
  • Slow path: I²C/SMBus/PMBus-style registers for configuration, status, and telemetry snapshots. Use it for auditability and forensics, not for high-rate control loops.
  • Event path: timestamped alarms and state transitions with dedupe and rate limiting to keep logs readable under fault conditions.
A debug-friendly design keeps fast-path decisions simple and records the context on the slow/event path for later reconstruction.

Recommended fields (organized as a diagnostic loop)

Minimum viable telemetry set

  • pwm_cmd — target command (what the policy asked for)
  • rpm_meas — measured speed (what the fan delivered)
  • tach_valid — validity gate (quality of rpm_meas)
  • temp_snapshot — inlet/outlet/hotspot (or fused value)
  • state — Normal/Boost/Degraded/Critical
  • action_flags — ramp_override / hotspot_priority / cap_active

Enhanced telemetry set (when available)

  • stall_cnt, stall_reason
  • retry_cnt, retry_interval
  • present_state, present_flap_cnt
  • i_anom_flag, i_anom_duration (fan domain)
  • vib_tier, vib_align_flag (if vibration sensing exists)
  • state_entry_cnt (helps detect policy thrashing)
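
One way to carry both field sets is a single record type in which the minimum fields are required and the enhanced fields are optional. The sketch below is an assumption about layout, not a defined register map: the class name `FanSnapshot`, the field types, and the JSON serialization are illustrative, and only a subset of the enhanced fields is shown.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class FanSnapshot:
    # Minimum viable set (always present)
    ts: float
    fan_id: int
    pwm_cmd: float            # percent duty requested by policy
    rpm_meas: int
    tach_valid: bool
    temp_snapshot: dict       # e.g. {"inlet": 24.5, "outlet": 38.0, "hotspot": 61.0}
    state: str                # Normal / Boost / Degraded / Critical
    action_flags: list = field(default_factory=list)
    # Enhanced set (None when the platform does not provide the field)
    stall_cnt: Optional[int] = None
    retry_cnt: Optional[int] = None
    present_flap_cnt: Optional[int] = None
    vib_tier: Optional[int] = None
    state_entry_cnt: Optional[int] = None

    def to_record(self) -> str:
        """Serialize to one JSON line, omitting enhanced fields that are unset."""
        data = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(data, sort_keys=True)


snap = FanSnapshot(ts=0.0, fan_id=1, pwm_cmd=40.0, rpm_meas=5200,
                   tach_valid=True, temp_snapshot={"inlet": 24.5}, state="Normal")
line = snap.to_record()
```

Dropping unset fields keeps records compact on platforms that expose only the minimum set, while the same parser handles both cases.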

Alarm & event hygiene (avoid log storms)

  • Dedupe: merge repeated identical alarms for the same fan and state; keep a counter and first/last timestamps.
  • Rate limiting: cap events per time window; if exceeded, emit a summary record (count + window).
  • Correlation snapshot: attach pwm_cmd, rpm_meas, tach_valid, temps, state to each promotion/demotion event.
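
A minimal sketch of dedupe plus rate limiting follows. The class name `EventLog`, the record layout, and the choice to emit the summary record when the next window opens are assumptions made for illustration.

```python
class EventLog:
    """Dedupe identical alarms per (fan, alarm, state); cap new records per window."""

    def __init__(self, max_events=5, window_s=10.0):
        self.max_events = max_events
        self.window_s = window_s
        self.records = []       # emitted records, in order
        self._open = {}         # (fan_id, alarm, state) -> open record
        self._win_start = None
        self._win_count = 0
        self._suppressed = 0

    def emit(self, ts, fan_id, alarm, state, snapshot=None):
        key = (fan_id, alarm, state)
        open_rec = self._open.get(key)
        if open_rec is not None:
            # Dedupe: bump the counter and last timestamp instead of logging again.
            open_rec["count"] += 1
            open_rec["last_ts"] = ts
            return open_rec
        if self._win_start is None or ts - self._win_start >= self.window_s:
            # New window: report what the previous window suppressed, then reset.
            if self._suppressed:
                self.records.append({"alarm": "summary",
                                     "suppressed": self._suppressed,
                                     "window_s": self.window_s})
                self._suppressed = 0
            self._win_start, self._win_count = ts, 0
        if self._win_count >= self.max_events:
            self._suppressed += 1   # rate limit hit: count it, do not log it
            return None
        self._win_count += 1
        rec = {"fan_id": fan_id, "alarm": alarm, "state": state,
               "first_ts": ts, "last_ts": ts, "count": 1,
               "snapshot": snapshot or {}}   # correlation snapshot travels with the event
        self._open[key] = rec
        self.records.append(rec)
        return rec

    def clear(self, fan_id, alarm, state):
        """Close an alarm so a later recurrence opens a fresh record."""
        self._open.pop((fan_id, alarm, state), None)
```

In this sketch a repeated alarm never consumes rate-limit budget, so a single flapping fan cannot starve events from other fans.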

Debug order (fast isolation path)

  1. Start with pwm_cmd / rpm_meas / temp_snapshot: confirm the loop is coherent.
  2. Then check tach_valid, stall_cnt, retry_cnt: determine whether the system is fighting a mechanical/electrical fault or a validity gate.
  3. Finally evaluate vib_tier and alignment: only RPM-aligned persistent vibration should tighten policy or promote maintenance tiering.

This order isolates “command mismatch” before digging into deeper causes.

Figure F9 — Interfaces and the minimal telemetry loop
[Figure F9 diagram: interface stack (controller with policy + gating connected to the fan module with motor + driver via PWM, tach/FG, and present/fault; slow path over I²C / SMBus / PMBus; event path with timestamp, dedupe, and rate limit) and the minimal telemetry loop, debugged in this order: pwm_cmd → rpm_meas → tach_valid → temps → state → action_flags.]

Keep logs readable: dedupe and rate-limit repetitive alarms, but preserve correlated snapshots on state transitions.

H2-11 · Validation & Debug Playbook (thermal, acoustics, EMI, and field triage)

Intent

Provide an execution-ready workflow for engineering bring-up and field support. The playbook is organized as a bring-up ladder, a set of fan-domain validation cases (thermal/acoustic/EMI), and a 3-step triage method for the most common site issues.

bring-up ladder · fault injection · pass/fail · acoustic stability · PWM EMI · field triage

Reference BOM (example part numbers for fan-domain builds)

These are representative, commonly used devices for multi-fan control, temperature inputs, and vibration sensing. Exact choices depend on channel count, bus constraints, and telemetry needs.

Function | Example part numbers (material numbers) | Why used in this subsystem
Multi-fan I²C controller | MAX31790 (Analog Devices / Maxim); EMC2305, EMC2303 (Microchip) | Multi-channel PWM outputs with tach inputs; good for grouped control and consistent tachometry.
Temp monitor (local + remote diode) | TMP468 (Texas Instruments); ADT7461A (Analog Devices) | Multi-point thermal inputs for inlet/outlet/hotspot; remote diode supports “hotspot-like” sensing without deep CPU internals.
Fan controller + temp monitor combo | ADT7470, ADT7473 (Analog Devices) | Consolidates basic fan control and temperature monitoring for smaller designs or test fixtures.
3-axis accelerometer (vibration) | ADXL356 (Analog Devices); LIS2DW12 (STMicroelectronics) | Fan tray vibration / imbalance indicators; supports “RPM-aligned” checks to reduce false positives.
PWM buffering / open-drain driver | 74LVC1G07 (open-drain buffer family) | Clean PWM edges and correct logic style when the controller GPIO cannot directly meet PWM electrical expectations.
Fan-domain power switch (optional, keep fan-domain only) | TPS25947 (Texas Instruments eFuse, example) | Helps controlled insertion/short protection for a fan tray rail; mentioned here only as a validation/bring-up aid (no PSU/PDU deep dive).
EMI beads (examples) | BLM21 series (Murata), MPZ2012 series (TDK) | Fan-domain noise shaping on supply or signal lines when PWM edge energy couples into tach/sensors.

Part numbers are provided as examples to make the playbook actionable. Validate footprint, channel count, and bus voltage for the target platform.

Bring-up ladder (single fan → groups → zones → redundancy injection)

  • Stage A — Single-fan sanity: open-loop PWM steps → verify rpm_meas monotonicity and stable tach_valid. Then enable a minimal LUT curve with min_pwm and ramp.
  • Stage B — Multi-fan grouping: drive a group with identical PWM; quantify RPM dispersion and confirm group uplift does not create audible hunting. Keep the loop stable before adding more zones.
  • Stage C — Multi-zone fusion: introduce inlet/outlet/hotspot inputs; compare fusion rules (weighted vs max vs priority) under controlled disturbances. Confirm hotspot protection triggers only when required.
  • Stage D — Redundancy & fault injection: remove a fan, force tach_missing, and inject stall/retry scenarios. Verify the system executes group uplift → zone recompute → state escalation, with clean, deduped logs.
Always gate progress with a minimal telemetry snapshot: pwm_cmd, rpm_meas, tach_valid, temp_snapshot, state, action_flags.

Thermal validation (dynamic response, not only steady-state)

  • Step load: apply a temperature-driving disturbance (real load or controlled heater). Pass if hotspot remains within guard bands and the system avoids repeated state flips.
  • Airflow restriction: partial blockage / backpressure simulation. Pass if reduced airflow is handled as a trend (temps + RPM response) rather than being misclassified as stall.
  • Fan removal: remove one fan (or one group member). Pass if N+1 compensation triggers within the expected time and logs capture the correlation snapshot once (no storm).
  • Filter aging simulation: gradually reduce airflow (slow degradation). Pass if the policy tightens gracefully (no PWM thrash) and maintenance tiering can be raised without unsafe actions.

Keep this fan-domain: do not attribute failures to facility HVAC; classify them by local sensors and response coherence.

Acoustic stability validation (ramp & hysteresis prove-out)

  • Threshold dither test: hold temperature near a curve breakpoint and add small ±ΔT disturbances. Pass if PWM/RPM do not “ping-pong” across breakpoints.
  • Ramp comparison: compare “no ramp” vs “ramp-limited” profiles. Pass if ramp removes audible modulation while keeping thermal settling time acceptable.
  • Night cap behavior: if an acoustic cap exists, pass only if Boost/Degraded overrides are deterministic (with hold time and clean recovery), not oscillatory.
Acoustic issues are usually control-jitter problems: insufficient hysteresis, missing hold time, or aggressive fusion switching.
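
The threshold dither test is easy to reason about with a stepped curve that applies breakpoint hysteresis and a ramp limit. The following is a sketch under assumed values: the class name `SteppedCurve`, the breakpoints, the 2 °C hysteresis, and the 5 %/s ramp are illustrative, not recommended settings.

```python
class SteppedCurve:
    """Stepped LUT fan curve with breakpoint hysteresis and PWM ramp limiting."""

    def __init__(self, steps, hyst_c=2.0, ramp_pct_per_s=5.0):
        self.steps = sorted(steps)   # [(breakpoint_c, pwm_pct), ...]; first entry is the base
        self.hyst_c = hyst_c
        self.ramp = ramp_pct_per_s
        self.idx = 0                 # active step
        self.pwm = self.steps[0][1]  # current (ramped) output

    def step(self, temp_c, dt_s):
        # Step up as soon as the next breakpoint is reached.
        while self.idx + 1 < len(self.steps) and temp_c >= self.steps[self.idx + 1][0]:
            self.idx += 1
        # Step down only after falling hyst_c below the active breakpoint.
        while self.idx > 0 and temp_c < self.steps[self.idx][0] - self.hyst_c:
            self.idx -= 1
        target = self.steps[self.idx][1]
        max_delta = self.ramp * dt_s
        self.pwm += max(-max_delta, min(max_delta, target - self.pwm))
        return self.pwm


# Threshold dither: alternate 49.5 / 50.5 deg C around the 50 deg C breakpoint.
curve = SteppedCurve([(0.0, 30.0), (50.0, 60.0), (70.0, 100.0)])
trace = [curve.step(49.5 if i % 2 == 0 else 50.5, 1.0) for i in range(20)]
```

With hysteresis in place, the dither crosses the breakpoint once and the output ramps from 30 % to 60 % without reversing; setting `hyst_c=0.0` in the same test makes the target flip every sample, which is the ping-pong the pass criterion rejects.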

EMI validation (fan-domain PWM, edge control, routing, local filtering)

  • PWM frequency sweep: evaluate a small set of candidate PWM frequencies and check for tach corruption, sensor noise, or false stall triggers.
  • Edge-rate control A/B: adjust drive strength / series resistor / buffering. Pass if EMI improvements do not degrade tach validity or thermal response.
  • Routing & return path checks: keep high-current fan return loops separated from tach/sensor references; verify that moving the return reference changes false-alarm rates (a strong indicator of coupling).
  • Local filtering: apply fan-domain beads/RC only where coupling is observed. Pass if alarms stabilize and telemetry remains consistent.

This section stays at the “knobs you can turn” level; it does not expand into full-system EMC compliance strategy.

Field debug playbook (common symptoms → deterministic triage)

Use the same three questions every time to avoid chasing secondary artifacts.

3-step triage: (1) Temperature truth → (2) Fan response → (3) Tach/Vibration validity
  • RPM jitter / “hunting”: first confirm temperature inputs are stable and filtered; then confirm ramp + hysteresis + hold time are applied; finally confirm tach_valid is not flapping.
  • False stall alarms: check present transitions and swap windows; confirm tach validity gates are suppressing classification during insertion/removal; then tune stall confirm time and retry spacing.
  • Cold inlet but hotspot hot: validate hotspot sensor placement/credibility; verify zone priority and fusion rules; look for airflow restriction trends (not just instantaneous RPM).
  • Vibration alarm but thermals OK: require RPM-aligned persistence before escalation; otherwise treat as environment/resonance and keep actions maintenance-grade.
  • Log storm: validate dedupe and rate limiting; if state toggles frequently, return to input trust and hysteresis/hold design.

Output — Validation checklist (test → observe → pass/fail)

Test case | Stimulus | Observe (minimum) | Pass / Fail criteria
Single-fan PWM step | PWM: low → mid → high (open-loop) | pwm_cmd, rpm_meas, tach_valid | RPM monotonic, tach_valid stable across windows, no spurious stall.
Group dispersion | Same PWM to a fan group | rpm_meas per fan, tach_valid | Dispersion within target band; no outliers that trigger false policies.
Zone fusion stability | Small ±ΔT near a breakpoint | temp_snapshot, pwm_cmd, state | No repeated state flips; PWM changes are ramp-limited and stable.
Fan removal (N+1) | Remove one fan (or one group member) | state, action_flags, event snapshot | Compensation triggers on time; logs capture one correlated record (deduped).
Airflow restriction | Partial blockage / filter simulation | temp_snapshot, rpm_meas, pwm_cmd | Classified as thermal margin loss (trend), not as stall unless evidence exists.
PWM EMI sweep | Try candidate PWM frequencies | tach_valid, sensor noise indicators, false-alarm count | Chosen PWM setting maintains tach integrity and reduces false alarms.

A checklist is only useful when each line item specifies what to observe and how to decide pass/fail.
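
Several checklist rows reduce to simple checks over logged samples. The helpers below sketch three of them; the function names, the 50 RPM tolerance, and the peak-to-peak dispersion definition are assumptions chosen for illustration, and pass thresholds remain platform-specific.

```python
def rpm_monotonic(samples, tol_rpm=50):
    """Single-fan PWM step: RPM must not drop by more than tol_rpm as PWM rises.

    samples is a list of (pwm_cmd, rpm_meas) pairs in any order.
    """
    ordered = sorted(samples)
    return all(b[1] >= a[1] - tol_rpm for a, b in zip(ordered, ordered[1:]))


def dispersion_pct(rpms):
    """Group dispersion: peak-to-peak RPM spread as a percentage of the mean."""
    mean = sum(rpms) / len(rpms)
    return 100.0 * (max(rpms) - min(rpms)) / mean


def count_state_flips(states):
    """Zone fusion stability: number of state changes in a logged sequence."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


# Example evaluation against assumed limits.
step_ok = rpm_monotonic([(20, 2000), (50, 5000), (80, 7900)])
group_ok = dispersion_pct([5000, 5100, 4900]) <= 5.0
fusion_ok = count_state_flips(["Normal", "Normal", "Boost", "Boost"]) <= 1
```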

Figure F10 — Bring-up ladder + 3-step field triage (fan-domain)
[Figure F10 diagram: bring-up ladder gated by PASS at each stage (Stage A single fan: PWM steps, tach_valid; Stage B multi-fan: grouping, dispersion; Stage C multi-zone: fusion, stability; Stage D injection: N+1 / stall, clean logs), snapshot fields pwm_cmd, rpm_meas, tach_valid, temps, state, action_flags, and the 3-step field triage: 1) temperature truth (inlet/outlet/hotspot), 2) fan response (pwm_cmd → rpm_meas), 3) sensor validity (tach_valid / vib_align).]

Figure F10 is designed for mobile readability: minimal words, many blocks, and text sizes ≥ 18px.

H2-12 · FAQs (Fan & Thermal Management)

Each answer is fan-domain focused and debug-oriented: check fields → identify the likely class → apply a bounded mitigation → validate with a repeatable test.

1 Why is PWM duty high but RPM does not increase—what should be checked first?

Start by confirming the command path is real: pwm_cmd at the pin (and any caps/limits) matches the requested setpoint. Then validate measurement integrity: rpm_meas must be accompanied by a stable tach_valid. If both are true, suspect airflow/backpressure, mechanical load, or fan-domain power limiting; compare current draw and temperature trend under a controlled PWM step.

Related: H2-4 (Tachometry), H2-10 (Telemetry). Example parts: MAX31790 / EMC2305 (multi-fan controllers).
2 Tach readings jump badly at low speed—how to set windowing/filtering without false stalls?

At low RPM, prefer long-window period measurement or adaptive windowing so quantization noise does not dominate. Add deglitching, debounce, and a clear tach_valid rule (e.g., N consecutive windows consistent within a band). Stall logic should require both “tach invalid/missing” and a confirm time; avoid classifying stalls inside spin-up, hot-swap, or very-low-RPM regions.

Related: H2-4 (Tachometry & Stall Detection).
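
A sketch of period-based RPM measurement with a consistency-based tach_valid rule is shown below. The function names, the two-pulses-per-revolution default (common for 4-wire fans but an assumption for any given fan), and the band and window counts are illustrative.

```python
def rpm_from_edges(edge_times_s, pulses_per_rev=2, min_pulses=4):
    """Average the pulse period over all edges captured in the window.

    Returns None when fewer than min_pulses periods were captured, which
    signals the caller to widen the window rather than report a noisy value.
    """
    if len(edge_times_s) < min_pulses + 1:
        return None
    span_s = edge_times_s[-1] - edge_times_s[0]
    if span_s <= 0:
        return None
    pulses = len(edge_times_s) - 1
    return 60.0 * pulses / (pulses_per_rev * span_s)


def tach_valid(rpm_windows, n=3, band_pct=10.0):
    """Valid only if the last n windows all produced a reading within band_pct."""
    recent = rpm_windows[-n:]
    if len(recent) < n or any(r is None for r in recent):
        return False
    lo, hi = min(recent), max(recent)
    return (hi - lo) <= band_pct / 100.0 * hi


# Five edges spaced 30 ms apart: 4 pulses in 0.12 s at 2 pulses/rev is about 1000 RPM.
rpm = rpm_from_edges([i * 0.03 for i in range(5)])
```

Returning None for a sparse window keeps "not enough data" distinct from "zero RPM", so stall logic can require a confirm time instead of reacting to a single empty window.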
3 A fan sometimes “spins briefly then stops”—is it power protection or stall-retry policy?

Use the timing signature. Power protection issues often look like a clean cutoff after a short surge, with a repeatable restart pattern tied to the rail (fan-domain). Retry-policy issues show repeated “kick” attempts with retry_cnt/stall_cnt climbing while tach_valid never stabilizes. Separate them by logging one correlated snapshot per attempt and comparing behavior with a gentler spin-up ramp and longer confirm windows.

Related: H2-3 (Drive), H2-4 (Stall), H2-9 (Hot-swap/Redundancy). Example part: TPS25947 (fan-domain eFuse, optional).
4 Multiple fans accelerating together cause power transients/noise—how should starts be staggered or ramped?

Use a coordinated acceleration policy: apply a global ramp limit (dPWM/dt) and add staggered start offsets per fan or per group so inrush peaks do not stack. Keep thermal safety by allowing emergency override (Boost/Degraded) but still rate-limit transitions to avoid oscillation. Validate with A/B tests: “all-step” versus “stagger+ramp,” and compare event counts, tach validity, and temperature settling time.

Related: H2-7 (Control Logic), H2-11 (Validation).
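
The stagger-plus-ramp policy from question 4 can be sketched as a function of elapsed time. The function name, the 10 %/s ramp, and the 0.5 s per-group offset are assumed example values, and the sketch covers acceleration only (target above start).

```python
def staggered_targets(t_s, groups, target_pct, start_pct,
                      ramp_pct_per_s=10.0, stagger_s=0.5):
    """PWM command per group at time t_s after an acceleration request.

    Group i starts ramping i * stagger_s later than group 0, and every group
    is limited to ramp_pct_per_s, so inrush peaks do not stack.
    """
    out = {}
    for i, group in enumerate(groups):
        t_local = max(0.0, t_s - i * stagger_s)
        out[group] = min(target_pct, start_pct + ramp_pct_per_s * t_local)
    return out


# One second into a 30 % -> 80 % request across three groups.
cmds = staggered_targets(1.0, ["A", "B", "C"], target_pct=80.0, start_pct=30.0)
```

An emergency override can shorten `stagger_s` or raise the ramp limit, but keeping both nonzero preserves the rate limiting the answer calls for.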
5 How to design hysteresis and slope limiting to stop “tug-of-war” that hurts noise and lifetime?

Combine four stabilizers: hysteresis around breakpoints, a deadband for small input changes, a minimum hold time before switching modes, and a slope limit on PWM/RPM targets. If zones fuse by max/priority, lock the fusion decision for a hold interval to avoid rapid re-ranking. Prove it with a threshold-dither test (±ΔT) and confirm PWM does not ping-pong while hotspot remains protected.

Related: H2-7 (Control Logic).
6 Inlet is cold but the hotspot still overheats—sensor placement or airflow organization?

First validate sensor truth: inlet/outlet/hotspot should follow a physically consistent relationship and time constant; a “hotspot” sensor that reacts too slowly or too fast is often poorly coupled. If sensors are credible, treat it as airflow/pressure distribution: RPM may rise but effective airflow at the hotspot may not. Use trend-based diagnosis: correlate PWM/RPM changes to hotspot temperature slope rather than single-point readings.

Related: H2-2 (Architecture), H2-6 (Thermal Inputs). Example parts: TMP468 / ADT7461A (multi-channel temperature monitors).
7 In N+1, after removing one fan, how should curves be recomputed to stay stable?

Recompute in a fixed order: (1) uplift remaining fans in the affected group, (2) recompute zone fusion with reduced airflow margin, then (3) escalate state (Boost/Degraded) only if temperature trends demand it. Keep existing hysteresis/hold/ramp rules so the system does not thrash. Validate by fan-removal injection and require a single deduped event record capturing PWM, RPM, temps, and state.

Related: H2-9 (Redundancy), H2-7 (Control), H2-11 (Validation).
8 After hot-swap, RPM detection fails—how to debug present/tach/self-test sequencing?

Verify present is debounced and that tach-based faults are gated during the swap window. A newly inserted fan should run a short spin-up self-test, then only join closed-loop policy after tach_valid holds for multiple windows and RPM enters an expected band. If RPM remains invalid, separate “tach integrity” (signal/wiring/format) from “fan response” (mechanical/power) using open-loop PWM steps.

Related: H2-9 (Hot-swap), H2-4 (Tach validity).
9 How to avoid misclassifying chassis resonance or shipping shock as bearing degradation (vibration)?

Require persistence and RPM alignment. Transient shocks are short and broadband; treat them with a time gate and do not escalate actions. For bearing-related issues, the vibration signature should persist and show components aligned to RPM (1×/2×) or stable bands across operating points. Cross-check multiple sensors or trays: chassis resonance tends to appear coherently across locations, while a bad fan is localized.

Related: H2-5 (Vibration Sensing). Example parts: ADXL356 / LIS2DW12 (accelerometers).
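
An RPM-alignment check can be sketched with the Goertzel algorithm, which evaluates signal power at one chosen frequency without a full FFT. The comparison against an off-harmonic reference at 1.37× the rotation frequency and the power ratio of 5 are hypothetical heuristics for illustration; a fielded check would also require persistence across windows.

```python
import math


def goertzel_power(samples, fs_hz, f_hz):
    """Squared magnitude of the signal's spectrum at f_hz (Goertzel recurrence)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * f_hz / fs_hz)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def rpm_aligned(samples, fs_hz, rpm, ratio=5.0):
    """True if power at 1x rotation clearly exceeds an off-harmonic reference."""
    f1 = rpm / 60.0
    p_1x = goertzel_power(samples, fs_hz, f1)
    p_ref = goertzel_power(samples, fs_hz, 1.37 * f1)  # assumed reference frequency
    return p_1x > ratio * max(p_ref, 1e-12)


# Synthetic one-second capture at 1 kHz containing a 50 Hz component.
fs = 1000.0
vib = [math.sin(2.0 * math.pi * 50.0 * n / fs) for n in range(1000)]
aligned_at_3000 = rpm_aligned(vib, fs, rpm=3000)   # 3000 RPM -> 50 Hz
aligned_at_4200 = rpm_aligned(vib, fs, rpm=4200)   # 4200 RPM -> 70 Hz
```

The same capture is flagged as aligned only when the measured RPM matches the vibration frequency, which is the property that separates a fan-borne signature from chassis resonance at an unrelated frequency.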
10 For early bearing wear, does RMS rise matter more, or spectral peaks?

Use both as a tiered indicator. RMS captures broadband energy changes and is good for trend monitoring, but it is sensitive to environment and mounting. Narrow spectral peaks (especially RPM-aligned 1×/2× behavior) are more diagnostic when they persist across windows. Early detection should be trend-based at fixed or comparable RPM, with thresholds that promote “observe → schedule replacement → urgent” rather than immediate shutdown.

Related: H2-5 (Metrics and RPM alignment).
11 When should derating be triggered instead of “fans to max,” and how to avoid false triggers?

Derating is appropriate when airflow is no longer the limiting factor: PWM is near max, RPM is responsive, yet hotspot temperature keeps rising or exceeds guard bands. Avoid false triggers by combining conditions (threshold + duration + sensor credibility) and defining clear recovery rules (exit threshold + hold time). Keep actions staged: Boost → Degraded (request throttling) → Critical → Shutdown as last resort.

Related: H2-8 (Derating & Safe States).
12 With only RPM, temperature, and alarms available, what is the fastest debug order?

Follow a fixed sequence: (1) verify temperature truth (relationships and trends across inlet/outlet/hotspot), (2) verify fan response (does RPM move coherently with alarm/state changes), then (3) verify measurement validity (tach flapping, swap windows, low-speed behavior). If the loop is coherent, focus on airflow margin loss; if not, isolate whether the issue is command limiting, sensing validity, or fault gating.

Related: H2-10 (Telemetry), H2-11 (Validation workflow).