Rack Server Mainboard Power, Clock, and Telemetry Design
A rack server mainboard is the system integration layer that turns power rails, reset domains, reference clocks, telemetry, and logs into a single bootable and debuggable platform. It succeeds when the board provides a trustworthy evidence chain—first-fault capture, time-aligned snapshots, and power-fail survivability—so issues can be isolated quickly without guesswork.
H2-1 · What a Rack Server Mainboard Owns
One-page boundary: the mainboard is the integration layer that turns power, clocks, and evidence (telemetry + logs) into a bootable, verifiable, debuggable system.
Key takeaways (fast orientation)
- Power Tree names rails and domains, defines sequencing + PG/reset dependencies, and places measurement points so readings match reality.
- Clock Tree distributes ref-clocks/PLLs with board-level jitter hygiene (power/return/route isolation) before blaming link silicon.
- Evidence Tree makes failures explainable: consistent telemetry snapshots, event ordering, power-fail handling, and durable post-mortem logs.
- Stop line: when the question becomes “chip-internal control/SerDes/protocol,” switch to link-only sibling pages.
A “mainboard-level problem” usually looks normal in isolation (a rail voltage is in range, a status bit is set, a clock frequency exists) yet the system still fails (cold-boot miss, reset storm, link flaps, missing evidence after power loss). The correct first step is to map the symptom to one (or an intersection) of the three trees, then decide where deeper component-level analysis is justified.
| Topic | Here (mainboard owns) | Not here (link-only) | Go deeper |
|---|---|---|---|
| CPU VRM & multi-rail PoL | Rail taxonomy, distribution/return constraints, sequencing dependencies, measurement tiers, PMBus/SMBus visibility | Loop compensation, multiphase control, DrMOS selection, controller-internal tuning | CPU VRM |
| DDR power & DIMM logic | Domain boundaries, boot dependencies, presence/management interface roles in the system view | DDR5 PMIC / RCD / DB / SPD internals and register-level behavior | DDR5 PMIC |
| PCIe stability & ref-clock | Ref-clock/PLL fanout topology, board-level jitter hygiene, power/return coupling risks that cause link flaps | Retimer/switch SerDes equalization, training details, device-internal diagnostics | PCIe Switch/Retimer |
| OOB management | BMC as a consumer of sensors/events/logs: what must be observable and trustworthy on the board | IPMI/Redfish command sets, firmware architecture, update and policy internals | BMC |
| Telemetry platform | Board-side sensor placement, bus partitioning, sampling consistency, timestamps, deglitch and alarm storm control | System-level analytics pipelines, anomaly detection engines, fleet policy logic | In-band Telemetry |
Symptom index (symptom → first tree → decision point)
- Intermittent power-up / cold-boot miss → start with Power Tree (sequencing + PG/reset graph + measurement tiers) → then decide if VRM deep-dive is needed.
- Reset storm / watchdog-like reboot loop → start with Power Tree + Evidence Tree (PG deglitch, event ordering, last-fault preservation).
- PCIe link flaps while “freq is correct” → start with Clock Tree (jitter hygiene, coupling from power/return) → then decide whether to suspect retimers.
- No usable evidence after power loss → start with Evidence Tree (power-fail detect → hold-up intent → log flush) → only link to PSU/hot-swap pages if needed.
Scope guard (stop line)
Chip-internal control theory (VRM compensation), SerDes equalization, and BMC protocol/firmware details are link-only topics. This page stays at board-level topology, observability, and evidence integrity.
Block-diagram navigation: use it to decide between a system-first entry point and a component deep-dive.
Link-only siblings: CPU VRM · PCIe Switch/Retimer · BMC · In-band Telemetry · DDR pages.
H2-2 · Power Tree on the Mainboard: Rail Taxonomy & Distribution
This chapter defines a practical rail language (groups, domains, naming, and measurement tiers) so logs and telemetry map cleanly to real boot behavior.
Key takeaways (system-level, not VRM control theory)
- Rail groups must carry meaning (boot criticality, noise sensitivity, transient stress, isolation need), not just voltage labels.
- Distribution must be designed as a loop (forward path + return path). Return mistakes commonly surface as “clock/link instability.”
- Measurement must be tiered: regulator output vs distribution node vs load point—each answers a different question.
- Stop line: multiphase tuning/compensation details belong to the CPU VRM page (link-only).
A useful taxonomy separates rail (a regulated voltage line) from power domain (a functional boundary that may require multiple rails). When taxonomy is consistent, power-fail snapshots and boot-time logs become searchable evidence rather than isolated numbers.
Practical naming rule (for searchable logs)
Use a consistent pattern that encodes intent: DOMAIN + STATE + optional LEVEL. Example pattern: V_DOMAIN_STATE (e.g., V_IO_AON, V_AUX_STBY). Avoid pure numeric names that cannot be reasoned about from logs.
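As a minimal sketch, such a naming rule can be enforced mechanically so malformed rail names never reach the logs. The regex and the domain/state vocabularies below are illustrative assumptions, not a standard:

```python
import re

# Illustrative vocabularies; substitute the board's actual domain/state sets.
DOMAINS = {"CORE", "SOC", "IO", "AUX", "AON", "STBY"}
STATES = {"AON", "STBY", "MAIN"}

# V_DOMAIN_STATE pattern; an optional _LEVEL suffix could extend this.
RAIL_NAME = re.compile(r"^V_(?P<domain>[A-Z0-9]+)_(?P<state>[A-Z0-9]+)$")

def parse_rail_name(name):
    """Validate a rail name against the DOMAIN + STATE pattern and return its parts."""
    m = RAIL_NAME.match(name)
    if not m or m.group("domain") not in DOMAINS or m.group("state") not in STATES:
        raise ValueError(f"unsearchable rail name: {name!r}")
    return m.groupdict()
```

A linter like this can run over the schematic netlist export so that pure numeric names (the "cannot be reasoned about from logs" case) are rejected before they propagate into telemetry.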
| Rail group | What it powers (system view) | Board-level risk | Primary observability | Typical failure signature |
|---|---|---|---|---|
| CORE | Compute core rails (boot-critical) | Fast transients; local hotspots; PG sensitivity | V/I near load + event flags | Cold-boot miss, reset storm after load step |
| SoC | Uncore / fabric / management islands | Dependency ordering; partial-boot traps | PG graph + boot-time snapshots | Stalls during early init; “looks on” but not usable |
| IO | PCIe slots, PHY-adjacent support rails | Noise coupling into clocks/links via returns | V + thermal + clock health markers | Link flaps despite “clock present” |
| AUX | Board services, sideband logic, helpers | Brownouts trigger false faults | V + brownout logs | Spurious faults, intermittent boot peripherals |
| AON | Always-on logic, persistent monitors | Silent degradation; long-tail reliability | Low-rate telemetry + persistent counters | Evidence gaps, inconsistent last-known state |
| STBY | Standby pre-boot context | Sequencing corner cases | PG + time ordering | Boot loops, “works warm but fails cold” |
Distribution rules that prevent “measures fine, fails anyway”
- Design the return path deliberately: large di/dt loops inject noise into sensitive references through shared returns; treat return continuity as a first-class constraint.
- Partition only when it improves fault containment: excessive split planes create return discontinuities that hurt clocks/links more than they help isolation.
- Keep observability honest: place measurement points where they answer the intended question—regulator health vs distribution loss vs load reality.
Measurement tiers (what each tells)
Regulator output = power source health · Distribution node = copper/connector loss & hotspots · Load point = true supply at the consumer. Mixing tiers in logs without labeling produces false conclusions.
Scope guard (stop line)
This chapter stops at board-level taxonomy, distribution/returns, and measurement strategy. Multiphase control/compensation and controller tuning belong to the CPU VRM page (link-only).
Static power tree view: group rails by intent (CORE/SoC/IO/AUX/AON/STBY), then map to load zones and observability points.
Next chapter (H2-3) will turn this taxonomy into a sequencing/PG dependency graph without diving into VRM control theory.
H2-3 · Sequencing & Reset Domains: EN/PG Dependencies That Actually Boot
Goal: convert “it boots” into a reusable dependency model—reset domains, enable paths, PG conditioning, and the failure patterns that create cold-boot misses and reset storms.
Key takeaways
- Reset domains are the unit of truth (CPU / PCIe / BMC / AUX / AON), not individual rails.
- PG is analog → digital: deglitch/blanking, thresholds, and OR/AND composition decide whether the system is stable or trapped in a storm.
- Cold vs warm: always-on rails preserve context; some domains must fully drop before re-entry to avoid “partial-boot” deadlocks.
- Stop line: chip-internal mechanisms are link-only; this chapter stays at board-level graphs and evidence integrity.
A reusable dependency model
A boot sequence succeeds when each reset domain transitions exactly once into a valid state window: EN asserted → rails settle → PG becomes stable → reset released. Failures happen when any of those signals is unstable, mis-grouped, or evaluated at the wrong time scale.
Dependency patterns (when each is appropriate)
Chain is simple but fragile (a single PG glitch cascades). Tree improves resilience by isolating branches. Domain-based gating is the most robust: each consumer domain has explicit prerequisites and its own deglitch window.
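Domain-based gating reduces to a small prerequisite check. The domain names and PG signals below are illustrative placeholders, not a real sequencer configuration:

```python
# Hypothetical per-domain prerequisites (mirroring the dependency table in
# this chapter); each domain releases reset only when all of its
# prerequisite signals are stable.
PREREQS = {
    "CPU":  {"PG_CORE", "PG_SOC"},
    "PCIE": {"PG_IO", "REFCLK_VALID"},
    "BMC":  {"PG_AON", "PG_STBY"},
}

def releasable_domains(stable_signals):
    """Return reset domains whose prerequisites are all stable,
    i.e. the domains whose reset may now be released."""
    return {d for d, reqs in PREREQS.items() if reqs <= set(stable_signals)}
```

The point of the explicit map is auditability: a cold-boot miss becomes "which prerequisite was missing," not "which rail looked odd."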
PG conditioning: how reset storms are created (and prevented)
- Deglitch / blanking: filters transient dips during ramp and load steps; too short creates chatter, too long hides real faults.
- Threshold & hysteresis: a PG threshold too close to ripple turns normal noise into repeated resets.
- OR/AND composition: composition errors cause either false release (unsafe boot) or false hold (never boots).
Reset storm signature
PG toggles → reset toggles → rails re-enter transient → PG toggles again. This loop often looks like “random reboots” unless events are timestamped and PG inputs are labeled by domain.
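The deglitch behavior can be modeled offline to pick a window: a dip shorter than the stable window restarts the timer, which is exactly what breaks the storm loop above. A minimal sketch (times in ms; the input is assumed to be a sorted list of clean PG transitions starting from low):

```python
def pg_qualify_time(edges, stable_ms, end_ms):
    """Return the first time PG qualifies as 'stable for stable_ms', else None.

    edges: sorted (time_ms, level) transitions of the raw PG comparator output.
    A dip restarts the timer; a window that is too long merely delays
    reporting real faults, a window too short lets chatter through.
    """
    level, rise = 0, None
    for t, new_level in edges:
        if level and rise is not None and t - rise >= stable_ms:
            return rise + stable_ms
        level = new_level
        rise = t if new_level else None
    if level and rise is not None and end_ms - rise >= stable_ms:
        return rise + stable_ms
    return None
```

Replaying captured PG edge logs through a filter like this is a cheap way to tune per-domain windows before committing them to hardware.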
Cold boot vs warm reboot: what must drop, what may stay (AON)
Warm reboots often appear healthier because AON preserves monitoring context and avoids re-training of unrelated domains. However, some domains require a full drop to clear latched states, stale resets, or partial-initialization traps.
| Reset domain | EN source | PG prerequisites | Reset outputs | Must-capture evidence |
|---|---|---|---|---|
| CPU domain | System power controller / sequencer | CORE + SoC stable PG (with deglitch), temperature-safe window | CPU_RST#, power-good to CPU | PG timeline, V/I snapshot at load point, last reset reason timestamp |
| PCIe domain | Sequencer + slot enable policy | IO rails stable PG + ref-clock valid marker | PERST#, slot enable gates | Ref-clock health marker, IO rail snapshot, link flap counter timestamp |
| BMC domain | AON/Standby path | AON + STBY stable PG | BMC_RST#, sensor fabric reset | Event ordering, sensor bus health, durable fault log write status |
| AUX domain | Sequencer (policy-driven) | AUX PG conditioned, brownout counters stable | Sideband logic resets | Brownout counters, AUX rail snapshot, glitch counters |
| AON domain | Always-on supply | AON PG (tight deglitch, long-term stability) | Persistent monitor resets | Persistent counters, last-known-good snapshot, power-fail detect timestamp |
PG false-trigger Top 10 (diagnostic format)
- Blanking too short → PG drops during ramp/load step → align PG edge vs ramp/step timestamp → extend blanking and capture pre/post snapshots.
- Threshold too close to ripple → PG chatters at steady state → compare ripple margin to threshold → add hysteresis or move measurement tier.
- Wrong domain composition (OR/AND mix-up) → either never releases reset or releases unsafely → audit PG-to-domain map → split domains and gate locally.
- PG sourced from the wrong tier (regulator output only) → “reads OK” but load point sags → add load-point sense marker or distribution-node monitor.
- Shared return coupling → PG comparator sees ground bounce as droop → correlate PG drops with high di/dt rails → repair return continuity / segregation.
- Debounce not aligned to time constants → filters the wrong thing → measure settling time vs debounce window → tune windows by domain (CPU vs PCIe vs AUX).
- Reset release too early → partial init, later collapse → enforce “PG stable for N ms” rule → add stable-state timer per domain.
- Latch behavior misunderstood → one-shot faults persist across warm reboot → require full drop of specific domains → encode cold/warm rules explicitly.
- Telemetry lag masks causality → logs show rails after the event → take event-triggered snapshots → store “edge-time + snapshot bundle.”
- Power-fail detect too late → evidence lost on dropout → prioritize early detect + minimal write set → confirm flush completion marker.
Link-only deep dives
Chip-level VRM control details → CPU VRM · PCIe training/equalization → PCIe Switch/Retimer · OOB firmware/protocol → BMC
Block diagram: focus on dependencies and gating, not chip internals.
Practical next step: implement per-domain prerequisites + stable timers, then validate with event snapshots (pre/post) rather than single-point readings.
H2-4 · VRM Islands on a Mainboard: Placement, Sensing, and Telemetry Hooks
Goal: treat each VRM as a managed “power island” with clear physical boundaries (heat/current/noise), trustworthy sensing, and consistent telemetry access.
Key takeaways
- Placement is system engineering: heat density, current loops, and magnetic keepouts must be resolved together.
- Remote sense is credibility: Kelvin routing + return integrity decide whether telemetry reflects load reality.
- Telemetry hooks: V/I/T taps and event-triggered snapshots prevent “looks normal” misdiagnoses.
- PMBus/SMBus topology: segmentation and address planning prevent bus-level failures from becoming false power faults.
VRM island physical constraints (board view)
- Thermal: define hotspots and airflow boundaries; temperature gradients change losses and can bias sensing and protection thresholds.
- Current loops: keep high di/dt loops compact; long distribution loops convert load steps into droop, ground bounce, and false PG behavior.
- Magnetic keepout: inductor fields and switch-node regions must avoid ref-clock traces, sensitive sideband lines, and sense pairs.
Keepout rules (practical)
Reserve a “quiet corridor” for ref-clock + sensitive sideband. Route sense pairs away from inductor clusters and from shared high-current returns. Prefer explicit return continuity over excessive plane splits.
Sense / remote sense: measurement tiers that stay truthful
Remote sense is not a “long wire to the load.” It is a Kelvin measurement system that requires a matched return reference. Without return integrity, measurements become load-correlated noise rather than voltage truth.
Common mistakes → typical field symptoms
- Sense taken at regulator output only → load droop not visible; throttling without clear cause.
- Sense return shares high-current return → readings jump with load; false UV/OCP signatures.
Telemetry hooks: what must be visible (system-level)
- V/I/T minimum set: load-point voltage marker, distribution-node loss indicator, island temperature near hotspots.
- Event snapshots: capture a small bundle at each PG/reset edge (pre/post) with timestamps to preserve causality.
- Consistency: label measurement tier (output vs node vs load) so logs remain comparable across revisions.
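The event-snapshot idea can be sketched as a small bundle builder. The sensor names, tier labels, and the `read_sensor` callback are illustrative assumptions, not a real register map:

```python
import time

def edge_snapshot(event, read_sensor):
    """Capture a snapshot bundle at a PG/reset edge.

    read_sensor(name) is an assumed board-access callback. Every reading
    carries its measurement tier so logs stay comparable across revisions.
    """
    return {
        "event": event,
        "timestamp": time.time(),
        "readings": {
            name: {"value": read_sensor(name), "tier": tier}
            for name, tier in [
                ("V_CORE_MAIN", "load"),      # load-point marker
                ("V_IO_MAIN", "node"),        # distribution node
                ("VRM0_TEMP", "hotspot"),     # island temperature
            ]
        },
    }
```

A pre/post pair of such bundles around each PG edge preserves causality far better than a single reading taken after the fact.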
PMBus/SMBus access (topology only)
Reliable telemetry depends on bus topology that survives multi-card capacitance, hot environments, and multiple masters (host + BMC). Address planning and segmentation prevent “bus issues” from masquerading as power instability.
| Topology item | Board-level intent | Failure signature if ignored |
|---|---|---|
| Address plan | Stable map across revisions; predictable discovery; avoid collisions across VRM islands | “Missing VRM” telemetry, intermittent reads, false power-fault conclusions |
| Segmentation | Partition long/loaded segments; isolate noisy zones; reduce fault propagation | Random NACKs, bus lockups during boot, telemetry lag that breaks causality |
| Multi-master boundaries | Define ownership windows; avoid contention; protect snapshot collection during critical edges | Arbitration-like collisions, timeouts, “telemetry disappears during faults” |
| Level integrity | Ensure consistent logic levels across islands (topology placement of level domains) | Works in lab, fails in rack; intermittent reads tied to temperature/cabling |
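An address plan like this can be linted before layout. The segments and 7-bit addresses below are placeholders, not a real board map:

```python
def check_address_plan(plan):
    """Flag 7-bit address problems within each bus segment.

    plan: {segment: {device: addr7}}. Addresses outside 0x08-0x77 fall in
    I2C reserved ranges; duplicates within one segment collide at runtime.
    """
    issues = []
    for segment, devices in plan.items():
        seen = {}
        for dev, addr in devices.items():
            if not 0x08 <= addr <= 0x77:
                issues.append((segment, dev, addr, "reserved/invalid"))
            elif addr in seen:
                issues.append((segment, dev, addr, f"collides with {seen[addr]}"))
            else:
                seen[addr] = dev
    return issues
```

Running the same check across revisions keeps discovery predictable, which is the "stable map" intent in the table above.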
Scope guard (stop line)
Multiphase tuning, loop compensation, and DrMOS selection are link-only topics → CPU VRM.
Block diagram: board-level only; no chip-internal loops or protocol details.
Link-only deep dive for VRM internals: CPU VRM (VR13/VR12+).
H2-5 · Clock/PLL & PCIe Ref-Clk Distribution: Board-Level Jitter Hygiene
Goal: distribute ref-clk across CPU, PCIe slots, and accelerator zones while keeping jitter and crosstalk within a usable board-level envelope.
Key takeaways
- Topology first: zoned fanout reduces cross-domain coupling and limits the blast radius when something degrades.
- Jitter hygiene: power coupling, return integrity, thermal drift, and crosstalk are the dominant board-level risk paths.
- Routing is a system rule: differential-pair continuity and reference-plane consistency matter more than “pretty length matching.”
- Budget in 4 blocks: Source + Fanout + Power + Routing, enough to decide where to look without deep PLL theory.
Ref-clk distribution topologies (board view)
- Single-source → global fanout: simplest, but vulnerable to cross-domain coupling and single-point failure effects.
- Zoned fanout: separate branches for CPU / PCIe slots / accelerator zone; improves isolation and troubleshooting locality.
- Redundant / dual-source (concept): used for high availability; board-level concerns are switching boundary, isolation, and validation markers.
Dominant jitter risk paths (board-level)
Power noise coupling into clock buffers/cleaners, ground/return bounce near high di/dt loops, return crossing zones that defeats isolation, and thermal drift that creates temperature-correlated instability.
Routing & isolation rules that prevent “mystery link flaps”
- Reference continuity: keep the same reference plane through transitions; avoid layer changes across split returns.
- Pair integrity: maintain consistent impedance and coupling; treat stubs and uncontrolled “Y splits” as high-risk.
- Keepouts: avoid running ref-clk parallel to VRM islands and large current loops; prioritize return integrity over aggressive plane cuts.
- Zone boundaries: keep crossings explicit and minimal; label zone edges in layout and in validation evidence.
Simplified jitter budget (enough to diagnose)
| Budget block | Primary sources | Typical symptom | First evidence to capture |
|---|---|---|---|
| Source | Ref source health, clock module stability marker | Wide-area instability across zones | Source “valid” marker + temperature correlation |
| Fanout | Fanout buffer additive jitter, zone boundary integrity | One zone degrades while others remain stable | Zone A/B/C comparison, branch-by-branch failure map |
| Power | Supply ripple/ground bounce coupling into clock path | Errors correlate with load steps or rail events | Ref-clk errors aligned to V/I snapshots and PG edges |
| Routing | Crosstalk, reference discontinuity, stubs, bad crossings | Slot-specific flaps, revision-specific failures | Slot/location correlation, “same card different slot” test |
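When the block contributions are uncorrelated, they are commonly combined by root-sum-of-squares; correlated coupling (e.g. power noise riding on routing) adds more nearly linearly, so treat RSS as the optimistic bound. A minimal sketch with illustrative picosecond values:

```python
import math

def jitter_budget_rss(blocks_ps, limit_ps):
    """Combine per-block RMS jitter by root-sum-of-squares and report margin.

    blocks_ps: {block_name: rms_jitter_ps}; values and the limit are
    illustrative, real budgets come from the clock architecture and link spec.
    """
    total = math.sqrt(sum(j * j for j in blocks_ps.values()))
    return total, limit_ps - total
```

The useful output is not the number itself but which block dominates: that is where the "first evidence to capture" column points you next.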
Scope boundary
SerDes training/equalization and retimer internals are link-only → PCIe Switch / Retimer.
Block diagram emphasis: zoning, coupling paths, and budget blocks.
H2-6 · Telemetry Fabric: Thermal / Current / Voltage Signals into a Coherent Story
Goal: turn “many sensors” into a single evidence timeline—where placement, bus architecture, and time consistency explain real rack behavior.
Key takeaways
- Placement by physics: sensors must cover hotspots, airflow transitions, and current branches where failures originate.
- Domain buses: segment and isolate I²C/SMBus/PMBus/I³C so bus faults do not masquerade as power faults.
- One time axis: sampling windows, timestamps, and edge-trigger snapshots preserve causality.
- Storm control: debounce, threshold classes, and rate limiting prevent alarm avalanches without hiding true faults.
Sensor placement logic (what to cover)
- Thermal: VRM islands, DIMM field, PCIe slots/accelerator zone, airflow inlet and outlet for gradient context.
- Current: main rails plus critical branches (slot power branches, aux domains) to expose abnormal load steps and brownout precursors.
- Voltage: key rails with labeled measurement tier (output / distribution node / load marker) to avoid “looks OK” misreads.
Collection buses (architecture only)
Telemetry reliability depends on board-level bus zoning: segment long/loaded runs, isolate noisy regions, and define multi-master access boundaries. The goal is not protocol depth, but evidence continuity during boot edges and fault edges.
Bus zoning principles
Segment by physical zone (VRM / DIMM / PCIe), isolate when a fault would otherwise propagate, and reserve a protected window for edge-trigger snapshots.
Data consistency: sampling, timestamps, thresholds, and storm suppression
- Sampling cadence: temperature is slow, current can be fast, and PG/reset are edge events—capture them on a common timeline.
- Timestamps: edge events must carry timestamps; snapshots should store “pre/post” bundles.
- Threshold classes: map signals into Info / Warning / Critical classes so actions remain consistent.
- Debounce & rate limiting: suppress chatter and avalanche alarms while preserving first-cause evidence.
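Debounce plus dedup can be sketched as one pass that collapses chatter into the first occurrence plus a repeat count, so first-cause evidence survives the storm. Event codes and the window value are illustrative:

```python
def suppress_storm(events, dedup_ms):
    """Collapse repeated identical alarms within dedup_ms into one record.

    events: (time_ms, code) pairs. Returns (first_ms, code, count) records:
    the first occurrence is preserved, repeats only bump the counter.
    """
    records = []            # [(first_ms, code, count)] as mutable lists
    open_by_code = {}       # code -> index of the currently open record
    for t, code in sorted(events):
        i = open_by_code.get(code)
        if i is not None and t - records[i][0] <= dedup_ms:
            records[i][2] += 1                    # merge chatter
        else:
            records.append([t, code, 1])          # new first-cause record
            open_by_code[code] = len(records) - 1
    return [tuple(r) for r in records]
```

The same event recurring well outside the window opens a fresh record, so genuine repeated faults are not hidden, only the avalanche is.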
Evidence integrity warning
Telemetry that arrives after the event breaks causality. Edge-trigger snapshots (with timestamps) often diagnose issues faster than high-rate polling.
Telemetry Map (signal → location → purpose → threshold class)
The table below is a reusable template: label the physical zone, measurement tier, and evidence tag so logs remain comparable across board revisions and slot configurations.
| Signal | Physical zone | Tier | Primary purpose | Class | Action | Debounce / window | Evidence tag |
|---|---|---|---|---|---|---|---|
| VRM island temp | VRM zone | Hotspot | Explain throttling/derating and thermal drift correlation | Warning/Critical | Derate / Log | Slow window | ZONE:VRM |
| CPU load V marker | CPU zone | Load | Expose true droop vs “regulator looks OK” | Critical | Protect / Log | Edge snapshot | DOMAIN:CPU |
| Slot branch current | PCIe zone | Branch | Detect abnormal load steps and correlate to link flaps | Warning | Log / Limit | Fast window | SLOT:BRANCH |
| DIMM field temp | DIMM zone | Array | Identify airflow imbalance and thermal hotspots near memory | Warning | Fan policy / Log | Slow window | ZONE:DIMM |
| Inlet temp | Airflow inlet | Ambient | Normalize hotspot readings and explain rack-level excursions | Info/Warning | Log | Slow window | AIR:IN |
| Outlet temp | Airflow outlet | Ambient | Compute gradient and detect cooling degradation | Warning | Alert / Log | Slow window | AIR:OUT |
| AON voltage | AON/STBY | Domain | Preserve evidence chain; detect early power-fail risk | Critical | Snapshot / Log | Edge snapshot | DOMAIN:AON |
| PG edge marker | System edge | Event | Anchor causality for boot/fault sequences | Critical | Snapshot | Immediate | EVENT:PG |
| Reset reason | System edge | Event | Differentiate storm vs single-fault; guide next inspection | Critical | Log | Immediate | EVENT:RST |
| Bus health counter | Telemetry buses | Fabric | Separate bus failures from “power failures” | Warning | Alert / Log | Medium | FABRIC:BUS |
Link-only deep dives
Protocol and OOB details → BMC · In-band logging and analytics → In-band Telemetry & Power Log
H2-7 · Power-fail Detection & Hold-up Intent: Making Logs Survive Reality
Goal: explain why power-loss logs often become untrustworthy, and how a mainboard makes last-gasp evidence deterministic and verifiable.
Key takeaways
- Earliest detect: the “first to know” signal must trigger last-gasp actions before the voltage slope becomes unrecoverable.
- Hold-up intent: the target is finishing a minimal, verifiable write set, not extending runtime indefinitely.
- Deterministic commit: header → payload → CRC → VALID prevents “looks written” logs that are actually corrupted.
- One timebase: the timestamp source must be explicit; mismatched clocks break causality across BMC/host evidence.
Power-fail detect chain (who knows first, who finishes the job)
- Early warning sources: power-good loss, undervoltage trend, hot-swap fault, standby/AON droop markers.
- Last-gasp executor: the component responsible for finishing “must-write” actions (snapshot + minimal log + commit markers).
- Broadcast boundary: propagate a single “power-fail state” so domains stop starting new transactions mid-collapse.
Hold-up intent (board-level)
Hold-up is a completion window for a minimal set: freeze event order, capture a snapshot, and commit a durable summary with integrity markers. The win condition is verifiable completion, not long endurance.
Why power-loss logs fail (and what the board must do about it)
| Failure mode | What happens in reality | Mainboard countermeasure (no register-level detail) |
|---|---|---|
| Voltage drops too fast | Power collapses before “nice shutdown” paths finish; partial writes look like valid logs. | Trigger earlier + reduce must-write set + commit markers that clearly distinguish VALID vs INVALID. |
| Write time is variable | Flash/FRU write latency varies; last-gasp budget becomes nondeterministic. | Write a compact summary first; defer large data to next boot; use “commit_state” to indicate partial vs complete. |
| Bus congestion / contention | I²C/SMBus transactions stall or collide; snapshot reads become incomplete during collapse. | Reserve a last-gasp window; freeze nonessential bus traffic; snapshot from pre-latched values when possible. |
| Timestamp mismatch | BMC vs host vs RTC disagree; events cannot be aligned into one causal timeline. | Declare a primary timebase + record timestamp_source; other clocks are secondary tags, not authoritative ordering. |
Last-gasp action checklist (minimal, deterministic)
- Freeze order: latch event_id and lock “first-cause” so later noise cannot overwrite root cause.
- Snapshot: capture a minimal bundle (V/I/T + PG/Reset bitmap) tied to the same event_id.
- Commit protocol: write header → payload → CRC → VALID (VALID written last).
- Bus guard: rate-limit or block nonessential bus activity; prevent late transactions from delaying commits.
- Self-evidence: if completion fails, the log must explicitly mark INVALID (never “looks good”).
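The commit protocol can be sketched end to end. The byte layout below (length header, CRC-32 over the payload, one-byte VALID marker written last) is an illustrative convention, not a defined format:

```python
import json
import struct
import zlib

VALID_MARKER = 0xA5  # arbitrary illustrative sentinel

def commit_record(buf, payload):
    """Append a last-gasp record as header -> payload -> CRC -> VALID."""
    body = json.dumps(payload, separators=(",", ":")).encode()
    buf += struct.pack("<I", len(body))          # header: payload length
    buf += body                                  # payload
    buf += struct.pack("<I", zlib.crc32(body))   # CRC over payload
    buf.append(VALID_MARKER)                     # VALID written last

def read_record(buf):
    """Return the payload only if CRC and VALID check out, else None."""
    if len(buf) < 9:
        return None
    (length,) = struct.unpack_from("<I", buf, 0)
    body = buf[4:4 + length]
    if len(body) != length or len(buf) < length + 9:
        return None                              # truncated: treat as INVALID
    (crc,) = struct.unpack_from("<I", buf, 4 + length)
    if crc != zlib.crc32(body) or buf[8 + length] != VALID_MARKER:
        return None
    return json.loads(body)
```

Because VALID is written last, a dropout at any earlier point leaves a record the reader provably rejects, which is the "self-evidence" property above.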
Scope boundary
Energy sizing and PSU/hold-up math are link-only → CRPS / Server PSU · hot-swap deep details are link-only → 48 V / 12 V Bus & Hot-Swap.
H2-8 · Event Logs & Evidence Tree: From Raw Flags to a Post-mortem Narrative
Goal: organize fields into an evidence tree so post-mortems preserve first-cause, keep order, and produce a reliable “last shutdown” summary.
Key takeaways
- 3-layer logs: flags (fast) + snapshots (context) + durable records (survive power loss) form a complete narrative.
- First-cause latch: the root cause must be latched; later effects append, not overwrite.
- Stable IDs: event_id + snapshot_ptr make evidence comparable across boots, boards, and revisions.
- Boot-first reading: read a compact summary early; validate integrity before trusting details.
Log layers (what each layer contributes)
- Instant events (fault flags): fastest indicators; prone to chatter—must be deglitched and protected from overwrite.
- State snapshots (telemetry snapshot): V/I/T + PG/Reset context at the moment of change; anchors causality.
- Durable records (FRU/flash): survives power loss; must be compact and integrity-checked (CRC + VALID marker).
Overwrite rule
Preserve the first cause. Later events append “effects chain” fields; they must not replace the root-cause code or its evidence pointers.
Event numbering and causality protection (board-level policy)
- event_id: monotonic counter (preferably persistent across boots) to keep ordering stable.
- first_cause_code: latched once; subsequent entries only add effects and secondary tags.
- dedup window: merge chatter into one event within a short window; avoid “storm logs” that erase meaning.
- commit_state: written/partial/invalid so post-mortems never treat partial writes as true evidence.
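The first-cause latch and monotonic event_id policy reduce to a few lines. This is a behavioral sketch of the policy, not a firmware design:

```python
class EvidenceLog:
    """Monotonic event_id, latched first cause, append-only effects chain."""

    def __init__(self):
        self.event_id = 0
        self.first_cause = None
        self.effects = []

    def record(self, code):
        self.event_id += 1
        if self.first_cause is None:
            self.first_cause = (self.event_id, code)    # latched exactly once
        else:
            self.effects.append((self.event_id, code))  # effects append, never overwrite
```

Under this policy a reset storm produces a long effects chain, but the root-cause code written first is still the one a post-mortem reads.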
Boot-time read strategy (get the “last shutdown” story early)
- Step 1 — Read summary first: a one-line “Last Shutdown Summary” plus (event_id, snapshot_ptr).
- Step 2 — Validate integrity: verify CRC/sequence/VALID marker before trusting any detail.
- Step 3 — Expand selectively: fetch only the referenced snapshot fields needed to confirm causality.
- Step 4 — Classify outcome: clean shutdown vs power-fail vs reset storm, based on first-cause and effects chain.
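Steps 2 and 4 can be sketched as a classifier over the summary record. The field names follow the MVP schema in this chapter; the storm threshold is an illustrative assumption:

```python
def classify_last_shutdown(summary):
    """Trust nothing until integrity passes, then classify the outcome."""
    if summary.get("commit_state") != "written":
        return "untrusted"            # partial/invalid: do not build a story on it
    if summary.get("reset_count_window", 0) > 3:   # illustrative storm threshold
        return "reset_storm"
    if summary.get("first_cause_code", "").startswith("PWR_FAIL"):
        return "power_fail"
    return "clean_shutdown"
```

Note that the integrity gate comes first: a plausible-looking but partially committed summary must land in "untrusted," never in one of the substantive classes.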
MVP log schema (field-name level, register-agnostic)
| Field | Meaning | Why it matters |
|---|---|---|
| event_id | Monotonic event sequence number | Stable ordering across storms and reboots |
| severity_class | Info / Warning / Critical | Consistent actions and summaries |
| domain | CPU / PCIe / VRM / DIMM / AON / FABRIC | Localizes root cause and evidence ownership |
| first_cause_code | Latched root-cause category | Prevents “last error wins” failure mode |
| effects_bitmap | Reset/throttle/bus_fault/link_flap indicators | Captures the consequences chain without overwriting cause |
| pg_reset_bitmap | Key PG/Reset states at the event boundary | Explains boot loops and power domain dependencies |
| snapshot_ptr | Pointer/index to minimal telemetry snapshot | Links events to V/I/T context on the same timeline |
| timestamp_value + timestamp_source | Time and declared timebase (primary source) | Enables causality alignment across components |
| commit_state | written / partial / invalid | Makes reliability explicit (no fake “good” logs) |
| crc | Integrity marker for durable record | Detects corruption and partial writes |
| boot_count / reset_count_window | Boot and reset-storm counters | Identifies storm patterns and suppresses noisy repetition |
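The schema can also be pinned down as a typed record so producers and consumers stay in sync across revisions. Types and defaults below are illustrative choices, not part of the schema itself:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LogRecord:
    """MVP log schema as a typed record (field names from the table above)."""
    event_id: int
    severity_class: str                 # Info / Warning / Critical
    domain: str                         # CPU / PCIe / VRM / DIMM / AON / FABRIC
    first_cause_code: str
    effects_bitmap: int = 0
    pg_reset_bitmap: int = 0
    snapshot_ptr: Optional[int] = None
    timestamp_value: float = 0.0
    timestamp_source: str = "BMC_RTC"   # declared primary timebase (assumption)
    commit_state: str = "partial"       # written / partial / invalid; upgraded on commit
```

Defaulting `commit_state` to "partial" means a record is pessimistic until the commit path explicitly upgrades it, matching the no-fake-good-logs rule.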
Scope boundary
External presentation and protocol mapping are link-only → BMC · analytics and deeper log mining are link-only → In-band Telemetry & Power Log.
H2-9 · Board Interfaces that Matter: Sideband, Headers, and Debug Affordances
Goal: define what the mainboard must expose so bring-up, factory test, and field service can observe state, confirm intent, and localize faults—without fragile hacks.
Key takeaways
- Sideband: classify by intent (reset, presence/ID, power-good/health, clock request/control).
- Observability: critical state must be probe-able via test points, debug headers, LEDs/7-seg, and latched faults.
- Inject-ability: some states must be safely inject-able for validation (feeds H2-10).
- Scope: board-level semantics only (no PCIe protocol training, no BMC software feature list).
Sideband taxonomy (board-level semantics, not protocol deep-dives)
- Reset: platform reset, device reset, and domain reset fanout; deglitch and domain boundaries prevent reset storms.
- Presence / ID: slot/device presence and identity; stable defaults plus chatter control avoid phantom insert/remove behavior.
- PG / health: PG, FAULT, ALERT, standby-good; OR/AND policy and latching preserve first-cause.
- Clock request / control: CLKREQ/control gating and mux intent; isolation from high-current loops protects jitter hygiene.
Debug affordances (make failures localizable, repeatable, and low-damage)
- Test points strategy: group by domain (AON/AUX/IO/CORE), ensure ground reference quality, and keep probe access predictable.
- Straps / jumpers: controlled mode switches (safe disable/force paths) that allow deterministic diagnosis and rollback.
- Diagnostic LED / 7-seg: encode boot stage, domain readiness, and a minimal fault code (field-friendly).
- Fault latching: preserve first-cause across resets so storms do not erase meaning.
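The fault-latching idea above can be sketched in a few lines. This is a minimal illustration, not a real firmware API; the class and cause names are invented for the example:

```python
import time

class FirstFaultLatch:
    """Latch only the first fault cause; later faults (e.g. a reset
    storm re-asserting the same rails) must not overwrite it."""

    def __init__(self):
        self._cause = None
        self._t_mono = None

    def report(self, cause: str) -> None:
        # Only the first report is latched; repeats are ignored,
        # so a storm cannot erase the original meaning.
        if self._cause is None:
            self._cause = cause
            self._t_mono = time.monotonic()

    @property
    def first_cause(self):
        return self._cause

    def clear(self) -> None:
        # Explicit service action only -- never part of an automatic reset path.
        self._cause = None
        self._t_mono = None

latch = FirstFaultLatch()
latch.report("PG_AUX_DROP")   # first cause
latch.report("RESET_STORM")   # storm noise, ignored
print(latch.first_cause)      # → PG_AUX_DROP
```

In hardware this is typically a CPLD register or supervisor latch; the key property is the same: the clear path is deliberate, never implicit.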
Factory measurability: what must be probe-able vs inject-able
| Signal/affordance type | Must be probe-able (observe) | Should be inject-able (validate) — safely |
|---|---|---|
| Reset | Reset assertion/deassertion visibility per domain; fanout integrity. | Controlled reset injection to prove recovery paths and storm suppression. |
| Presence / ID | Presence line state; stable defaults when absent. | Simulated presence toggle for production screening (no damage, reversible). |
| PG / health | Key PG/FAULT visibility; latched first-cause access. | Fault injection at the policy boundary to prove fail-fast gating. |
| Clock control intent | Clock enable/intent states and gating visibility (board level). | Controlled gating injection to verify safe fallback/disable behavior. |
Scope boundary
No PCIe training/protocol deep-dive (link-only) → PCIe Switch / Retimer. No BMC software feature list (link-only) → BMC.
H2-10 · Bring-up & Validation Checklist: Proving the Board is “Done”
Goal: provide a staged, fail-fast validation path from pre-power checks to delivery readiness—focused on observable points, evidence, and stop criteria.
Key takeaways
- Stage gates Pre-power → first power → steady/load → deliver, with explicit stop criteria at each gate.
- Observe, don’t guess Each step defines what to observe and what it proves (no numeric limits required).
- Evidence pack Keep minimal artifacts: waveforms, summaries, event_id pointers, and integrity markers.
- Storm-safe Verify reset/power-fail logging remains trustworthy under brief interruptions.
Stage-gated validation flow (with fail-fast stop criteria)
| Stage | What to observe | What it proves | Fail-fast stop criteria |
|---|---|---|---|
| Pre-power | Rail shorts/impedance sanity; default strap/presence states; clock-source enable intent; PG dependency map consistency. | Board is safe to energize; defaults are predictable; dependencies match design intent. | Any critical rail shows abnormal behavior or defaults are inconsistent/unpredictable. |
| First power | Domain-by-domain enable/PG transitions; reset release order; initial refclk enable stability; early fault latches. | Power/reset sequencing is stable and repeatable; domains reach ready state without storms. | PG chatter, repeated resets, unexplained current behavior, or unstable refclk intent. |
| Steady / load | Thermal rise trends in hotspots; telemetry self-consistency; event logs under activity; brief interruption / power-fail behavior. | Thermal/telemetry/logs form a coherent story; evidence survives realistic disturbances. | Telemetry contradicts itself; logs lack integrity markers; intermittent faults cannot be localized. |
| Deliver | Checklist sign-off; minimal evidence pack; last-shutdown summary path; debug affordances confirmed usable in factory. | Board is serviceable and reproducible in production; post-mortems are actionable. | Evidence pack incomplete, or critical debug hooks are not accessible/repeatable. |
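The stage-gated flow maps naturally onto a small fail-fast runner. The sketch below assumes nothing about real test equipment; gate names and check results are illustrative placeholders:

```python
# Minimal fail-fast stage-gate runner: each gate returns (ok, evidence).
def run_gates(gates):
    """Run gates in order; stop at the first failure and return
    (passed_stages, failed_stage_or_None, evidence_log)."""
    passed, evidence = [], {}
    for name, check in gates:
        ok, info = check()
        evidence[name] = info          # keep evidence even for failures
        if not ok:
            return passed, name, evidence   # fail fast: do not continue
        passed.append(name)
    return passed, None, evidence

gates = [
    ("pre_power",   lambda: (True,  "rail impedance sane")),
    ("first_power", lambda: (False, "PG chatter on PG_AUX")),
    ("steady_load", lambda: (True,  "not reached")),
]
passed, failed, log = run_gates(gates)
print(failed)  # → first_power
```

The point of the structure is that a later gate is never allowed to run and mask an earlier failure, which mirrors the "fail-fast stop criteria" column above.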
Evidence pack (minimal artifacts that make bring-up repeatable)
- Power/reset snapshots: captures around domain enable/PG transitions and reset release.
- Clock intent proof: enable/control visibility and stable refclk presence at the board boundary.
- Telemetry consistency: a small set of hotspot trends (V/I/T) tied to the same observation window.
- Log integrity: last-shutdown summary with event_id, snapshot_ptr, and CRC/VALID indicators.
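The last-shutdown summary with event_id, snapshot pointer, and CRC can be modeled as a fixed-size record. This is a sketch of the idea, not a defined on-flash format; the field layout and magic value are assumptions:

```python
import struct, zlib

MAGIC = 0xB007
RECORD = struct.Struct("<HHIII")  # magic, seq, event_id, snapshot_ptr, crc32

def pack_summary(seq, event_id, snapshot_ptr):
    body = struct.pack("<HHII", MAGIC, seq, event_id, snapshot_ptr)
    return body + struct.pack("<I", zlib.crc32(body))

def read_summary(blob):
    """Return fields only if magic and CRC check out; otherwise treat the
    record as a torn/partial write rather than trusting it."""
    if len(blob) != RECORD.size:
        return None
    magic, seq, event_id, ptr, crc = RECORD.unpack(blob)
    if magic != MAGIC or zlib.crc32(blob[:-4]) != crc:
        return None
    return {"seq": seq, "event_id": event_id, "snapshot_ptr": ptr}

rec = pack_summary(seq=7, event_id=0x41, snapshot_ptr=0x1000)
print(read_summary(rec)["seq"])  # → 7
```

A single flipped byte (the signature of a write interrupted by rail collapse) makes `read_summary` return `None`, which is exactly the "detects corruption and partial writes" role the CRC plays in the record table.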
Fail-fast philosophy
If a step cannot be proven with observable evidence, stop the flow. Moving forward hides root causes behind later-domain activity.
Scope boundary
VRM control-loop details are link-only → CPU VRM (VR13/VR12+). PCIe training details are link-only → PCIe Switch / Retimer. EMC deep spec is link-only → Safety & EMC Subsystem.
H2-11 · Field Debug Playbook: fastest isolation using power/clock/logs
Field triage is fastest when evidence is reduced to three synchronized pillars: rail state (PG/EN/reset), reference-clock hygiene, and a survivable log snapshot. The following playbooks stay at the mainboard integration level and stop at the point where a handoff to a sibling page becomes justified.
Common capture discipline (applies to all three trees)
| What to freeze | Why it accelerates isolation |
|---|---|
| PG/RESET state vector | Distinguishes “never reached a domain” vs “oscillating in reset storm” vs “booted then crashed”. |
| Key rails (AON, AUX, PCIe, clock rails) | Separates undervoltage droop, sequencing dependency violations, and localized hot spots. |
| Ref-clk presence + enable chain | Turns “link flaps” into a deterministic board-level gating/jitter check before suspecting SerDes training. |
| Power-fail detect + last-gasp window markers | Explains “no logs” as either missing trigger, missing compute interrupt, or insufficient write window. |
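The capture discipline above amounts to freezing the four evidence classes under one timestamp. A minimal sketch, with field names invented for illustration:

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class TriageSnapshot:
    """One time-aligned capture of the four evidence pillars; frozen so
    later analysis cannot mutate the record."""
    t_mono: float
    pg_reset: dict    # e.g. {"PG_AON": 1, "RESET#": 1}
    rails_mv: dict    # e.g. {"AON": 3300, "AUX": 5000}
    refclk: dict      # e.g. {"present": True, "enables": ["OE0"]}
    powerfail: dict   # e.g. {"detected": False, "window_ms": None}

def capture(pg_reset, rails_mv, refclk, powerfail):
    # A single monotonic timestamp keeps the four pillars orderable
    # against each other -- the core of the capture discipline.
    return TriageSnapshot(time.monotonic(), pg_reset, rails_mv,
                          refclk, powerfail)
```

Whether the snapshot lives in a CPLD register file, BMC RAM, or FRAM, the invariant is the same: all four classes share one clock domain and the record is immutable once taken.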
Decision Tree A — Intermittent cold-boot failure (fastest path)
A · 6–8 steps to isolate “won’t boot” without guessing
- Confirm AON domain integrity first. Verify that always-on rails are stable and that the reset source/latch is readable (brownout vs watchdog vs external reset).
- Check for a reset storm signature. Look for repeated reset assertion with PG toggling; treat this as a dependency / deglitch problem until proven otherwise.
- Inspect the PG vector at the first failure edge. A single missing PG identifies the first broken dependency; multiple PG drops usually indicate a shared upstream rail or ground return disturbance.
- Validate PG deglitch assumptions. If a PG signal is “mostly high” but shows narrow low pulses, the system can still reset (blanking too small, threshold too close, or OR/AND logic too aggressive).
- Separate cold-start vs warm-reset dependencies. If warm reset works but cold boot fails, focus on rails/domains that require discharge, re-initialization, or are temperature sensitive.
- Audit enable chain ownership. Ensure each enable signal has a single authority (no “wired-OR ambiguity” between BMC, CPLD, and straps).
- Time-align telemetry around the boot window. AON, AUX, PCIe/clock rails, and VRM telemetry must share a consistent timestamp domain for reliable causality.
- Escalate only when the board-level state is proven stable. If all domains are stable and reset exits cleanly, hand off to CPU/DIMM/retimer-specific pages.
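Step 3 of the tree, finding the first broken dependency in the PG vector, can be expressed directly. The rail names and ordering below are illustrative, not a fixed platform convention:

```python
# Given the designed PG dependency order and a captured PG vector,
# find the first broken dependency.
PG_ORDER = ["PG_AON", "PG_AUX", "PG_CORE", "PG_PCIE", "PG_CLK"]

def first_broken_pg(pg_vector):
    """Walk the dependency chain; the first rail whose PG is low is the
    first broken dependency -- everything downstream is a symptom."""
    for rail in PG_ORDER:
        if not pg_vector.get(rail, False):
            return rail
    return None  # chain complete: look past power for the cause

vec = {"PG_AON": True, "PG_AUX": True, "PG_CORE": False, "PG_PCIE": False}
print(first_broken_pg(vec))  # → PG_CORE
```

On real hardware the "vector" is a latched CPLD register; the point of walking it in dependency order is that multiple low PGs almost never have multiple independent causes.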
| Focus signals | Practical meaning |
|---|---|
| PG_AON, RESET_CAUSE | Distinguishes power integrity vs functional reset. |
| PG_CHAIN vector | Finds the first broken dependency in a tree or chain. |
| EN_CPU, EN_PCIe, EN_AUX | Checks “authority collisions” and sequencing correctness. |
Example parts: TPS3899, LTC2937, MAX16054, TCA9548A, PCA9555, FM25V10, PCF85063A.
Decision Tree B — “PCIe link flaps” (ref-clk first, then handoff)
B · 6–8 steps to separate clock hygiene from SerDes internals
- Prove ref-clk is physically present at the consumer. Measure at the slot/device side (not only at the source) to catch mux/enable gating errors.
- Validate the ref-clk power rails and enables. A ref-clk buffer can be “configured correctly” but still degrade under rail noise or thermal drift.
- Check sideband stability around the flap. Ensure PERST#, CLKREQ#, and presence signals do not chatter due to marginal pull-ups or strap contention.
- Correlate flaps with temperature gradients. A consistent thermal trigger points to clock-domain drift or a localized PI issue near the clock tree.
- Validate jitter-cleaning assumptions at board level. Confirm the intended reference source, holdover state, and fanout topology (single point vs zoned fanout).
- Only then suspect retimer/switch behavior. If ref-clk amplitude/presence and power rails are clean during flaps, handoff to the retimer/switch page is justified.
- Freeze the “flap packet” evidence. Log: ref-clk enable state, clock-rail telemetry, and sideband vector at the first flap edge.
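Freezing the "flap packet" at the first edge can be modeled as a ring buffer that latches on the first trigger only. Sample keys and depths below are illustrative:

```python
from collections import deque

class FlapRecorder:
    """Keep a short pre-trigger history; on the first flap edge, freeze
    it into a 'flap packet' that later flaps cannot overwrite."""

    def __init__(self, depth=32):
        self._ring = deque(maxlen=depth)
        self.packet = None  # frozen on the FIRST trigger only

    def sample(self, clk_enable, clk_rail_mv, sideband):
        self._ring.append({"clk_enable": clk_enable,
                           "clk_rail_mv": clk_rail_mv,
                           "sideband": dict(sideband)})

    def trigger(self):
        if self.packet is None:          # preserve the first flap edge
            self.packet = list(self._ring)

rec = FlapRecorder(depth=4)
for mv in (3300, 3298, 3210):            # rail sags before the flap
    rec.sample(True, mv, {"PERST#": 1, "CLKREQ#": 0})
rec.trigger()
print(rec.packet[-1]["clk_rail_mv"])     # → 3210
```

The same first-cause discipline as in fault latching applies: subsequent triggers are ignored so the evidence at the original edge survives a flap storm.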
Example parts: 9DBV0641, LMK03328, 8T49N241, Si5341, LMK04828, TCA9548A.
Decision Tree C — “Power loss happened but no logs survived”
C · 6–8 steps to make last-gasp evidence reliable
- Prove the power-fail detect edge exists. Confirm the comparator/supervisor sees the correct rail and asserts power-fail early enough.
- Prove the interrupt reaches the owner. Verify that the last-gasp signal lands on the intended BMC/MCU/CPLD input without level-translation ambiguity.
- Measure the write window, not the intention. Determine the real time between power-fail assertion and rail collapse under worst-case load.
- Eliminate bus congestion as a silent killer. If I²C/SMBus is busy or wedged, last-gasp writes can fail even with “enough” hold-up energy.
- Use a survivable target for the first record. Write the first-fault token to a medium that tolerates abrupt loss and frequent writes (then mirror to slower storage later).
- Guarantee timestamp consistency. A stable RTC or monotonic counter prevents “logs exist but cannot be ordered”.
- Verify on injected brownouts. Validate the entire chain by repeatedly injecting short power interruptions and checking that the summary is preserved.
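Step 3, "measure the write window, not the intention," reduces to a simple computation over a captured rail waveform. Units, threshold, and trace values below are illustrative:

```python
def write_window_ms(samples, assert_t_ms, collapse_mv):
    """Measure the real last-gasp window: time from power-fail assertion
    to the first sample at/below the collapse threshold.
    `samples` is a list of (t_ms, rail_mv) pairs in time order."""
    for t_ms, mv in samples:
        if t_ms >= assert_t_ms and mv <= collapse_mv:
            return t_ms - assert_t_ms
    return None  # rail never collapsed within the capture

# A 12 V rail sagging after power-fail asserts at t = 2 ms:
trace = [(0, 12000), (2, 11800), (4, 11000), (6, 9800), (8, 8400)]
print(write_window_ms(trace, assert_t_ms=2, collapse_mv=10000))  # → 4
```

The measured window, under worst-case load, is what the last-gasp write budget must fit inside; a spec-sheet hold-up number is the intention, not the evidence.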
Example parts: TPS3899, LTC2937, FM25V10, W25Q128JV, 24AA02E48, PCF85063A, LTC3350, LTC4041, TPS61094.
Concrete reference parts (MPN examples) used by the playbooks
The following part numbers are examples commonly used to implement “detect/sequence/log/clock” building blocks on complex boards. Selection must match rail voltages, interfaces, validation rules, and platform constraints.
| Function | Example MPNs |
|---|---|
| Voltage supervisor / power-fail detect | TPS3899, MAX16054 |
| Multi-rail sequencing + fault logging | LTC2937 |
| I²C/SMBus segmentation | TCA9548A |
| GPIO expansion for straps/LEDs/interrupts | PCA9555 |
| “First-fault token” survivable NVM | FM25V10 (FRAM), W25Q128JV (SPI NOR) |
| FRU / identity EEPROM (example) | 24AA02E48 |
| RTC for consistent timestamps (example) | PCF85063A |
| PCIe ref-clk fanout buffer (example) | 9DBV0641 |
| Clock generation / conditioning (examples) | LMK03328, 8T49N241, Si5341, LMK04828 |
| Last-gasp / supercap backup controller (examples) | LTC3350, LTC4041, TPS61094 |
- CPU VRM (VR13/VR12+) — deep multiphase control & protections
- PCIe Switch / Retimer — equalization & training deep dive
- Baseboard Management Controller (BMC) — management plane ownership
- In-band Telemetry & Power Log — aggregation/analytics layer
H2-12 · FAQs (Mainboard-only) ×12
Core idea
These answers stay at the mainboard integration boundary: rail/PG/reset evidence, clock distribution hygiene, telemetry topology, and power-fail survivability. When deeper component internals are needed, the answer ends with a clear handoff point to a sibling page (CPU VRM / PCIe Switch-Retimer / BMC).
Q1 Why can rails look “normal” yet the board still intermittently refuses to power on?
Mainboard checks: capture the first-fail edge: PG vector, reset cause, and a time-aligned rail snapshot (AON/AUX/clock rails) before any reboot storm overwrites evidence.
Common causes: PG chatter that never appears in averaged telemetry, an enable-owner collision (two masters driving EN), or a missing “first-fault token” that turns every failure into the same symptom.
Fast proof: latch first-fault + snapshot on a supervisor interrupt; compare cold-boot vs warm-reset dependency behavior.
Escalate: if PG/reset exits cleanly and all rails are stable during the failed boot window, hand off to CPU VRM / DIMM bring-up pages.
Example parts: LTC2937, TPS3899, FM25V10.
Q2 PG is deglitched, yet a reset storm persists—what are the three most common dependency/logic mistakes?
Mainboard checks: review the reset-domain map and the exact PG combine logic (what is hard-gate vs soft-monitor).
Most common mistakes: (1) wrong AND/OR composition that promotes a non-critical PG into a global reset gate; (2) a circular dependency (EN_B depends on PG_A while PG_A depends on EN_B); (3) thresholds/blanking placed inside a droop band, so “clean” PG still toggles under load steps.
Fast proof: freeze the PG vector at the first low pulse; confirm whether the same PG is always first to drop.
Escalate: only after logic/thresholds are proven correct should rail dynamics be analyzed on the CPU VRM page.
Example parts: TPS3899, LTC2937, SN74LVC1G32.
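Mistake (2) in Q2, a circular EN/PG dependency, is a graph cycle and can be caught at design time. A minimal depth-first sketch; the signal names and dictionary encoding are invented for the example:

```python
def find_cycle(deps):
    """deps maps a signal to the signals it waits on, e.g.
    {"EN_B": ["PG_A"], "PG_A": ["EN_B"]}. Returns one cycle as a
    list of nodes, or None if the dependency graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color, stack = {}, []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for dep in deps.get(node, []):
            c = color.get(dep, WHITE)
            if c == GRAY:                 # back edge: cycle found
                return stack[stack.index(dep):] + [dep]
            if c == WHITE:
                found = visit(dep)
                if found:
                    return found
        color[node] = BLACK
        stack.pop()
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None

print(find_cycle({"EN_B": ["PG_A"], "PG_A": ["EN_B"]}))
# → ['EN_B', 'PG_A', 'EN_B']
```

Running this over the extracted EN/PG dependency netlist during design review catches the deadlock before it shows up on hardware as a board that never leaves reset.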
Q3 Remote sense is “placed correctly,” but readings still drift—what is the most common mainboard-level reason?
Mainboard checks: verify that the sense pair and its reference return do not share high di/dt ground segments; correlate drift with temperature gradients and bus activity.
Common causes: Kelvin sense is correct but the ground reference moves (return path contamination), connector/copper temperature coefficient changes the effective drop, or the telemetry chain (ADC reference/filtering/sample timing) aliases noise into a “slow drift.”
Fast proof: compare a direct DMM at the load with telemetry during a controlled load step and a fan-speed change.
Escalate: if regulation error is strongly load-transient dependent with stable references, hand off to CPU VRM control-loop analysis.
Example parts: INA238, INA229, PCA9517A.
Q4 PMBus polling often times out—how to tell topology issues from electrical noise first?
Mainboard checks: treat PMBus as an electrical network: segmentation, pull-ups, branch length, address plan, and multi-master ownership.
Topology signature: errors persist at idle and correlate with specific branches/addresses. Noise signature: timeouts cluster with load steps, fan PWM edges, or reset/PG activity.
Fast proof: isolate with a mux (one segment at a time); add a forced-bus-recovery step; then repeat under a controlled load step to test noise correlation.
Escalate: if only one VR domain misbehaves after segmentation proves the bus healthy, hand off to the CPU VRM page for device-side behavior.
Example parts: TCA9548A, PCA9517A, PCA9615.
Q5 Ref-clk frequency measures “correct,” but links are unstable—what should be checked at the mainboard level first?
Mainboard checks: prove the clock intent (enable/gating), then verify clock-buffer/jitter-cleaner power rails, thermal gradients, and sideband chatter (PERST#/CLKREQ#).
Common causes: frequency is right but jitter is not (supply noise coupling), duty-cycle/SSC assumptions differ across zones, or a marginal enable chain intermittently gates the clock under load/temperature.
Fast proof: correlate link flaps with clock-rail telemetry and sideband vector at the first flap edge.
Escalate: if ref-clk intent + rails are clean during flaps, hand off to PCIe Switch/Retimer internals.
Example parts: 9DBV0641, Si5341, LMK03328.
Q6 Is “more clock fanout” always better? When should zoning replace a single global fanout?
Mainboard checks: identify zones with very different noise/thermal environments (CPU socket vs PCIe/GPU slots) and evaluate whether independent enable/isolation is required.
When zoning wins: a noisy load zone couples supply/return noise into the clock rail; long cross-zone routing breaks reference-plane continuity; or fault containment is needed (one zone can be muted without collapsing the entire clock tree).
Fast proof: compare clock-rail noise and flap rate with a “zoned enable” experiment.
Escalate: if a specific endpoint’s tolerance is the question, hand off to the relevant PCIe Retimer/Switch page.
Example parts: LMK1C1104, 9DBV0641, Si5341.
Q7 Why are power-loss logs often untrustworthy—and how to tell “no time to write” vs “written but read wrong”?
Mainboard checks: verify the power-fail detect edge, the owner interrupt path, and a measurable “last-gasp window” marker (start/end) under worst-case load.
No time to write: detect is too late or the window collapses too fast. Written but read wrong: VALID/CRC/sequence mismatches or an outdated pointer selects the wrong record after reboot.
Fast proof: inject repeatable brownouts; confirm first-fault token and pointer advance every time.
Escalate: if detect/window are proven correct yet persistence still fails, hand off to the storage/firmware log-carrier page.
Example parts: TPS3899, FM25V10, LTC3350.
Q8 After a brief power interruption reboot, which logs/snapshots should be read first?
Mainboard checks: read in “cause-preserving” order to avoid chasing overwritten noise.
Best-first order: (1) first-fault token / last-shutdown summary, (2) reset cause + brownout marker, (3) PG vector at first-fail edge, (4) minimal telemetry snapshot (AON/AUX/clock/PCIe rails + hotspot temps), then (5) long-form ring logs.
Fast proof: validate ordering by repeating injected brownouts and confirming that the summary stays consistent.
Escalate: if summaries consistently implicate a single subsystem, hand off to that subsystem’s sibling page.
Example parts: FM25V10, PCF85063A, W25Q128JV.
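The cause-preserving read order in Q8 can be sketched as a small triage helper. The source names are illustrative; on a real platform they map to the FRAM token, supervisor registers, CPLD latches, and ring logs:

```python
# Cause-preserving read order for post-interruption triage: smallest and
# most survivable records first, long-form ring logs last.
READ_ORDER = [
    "first_fault_token",    # last-shutdown summary
    "reset_cause",          # plus brownout marker
    "pg_vector",            # state at the first-fail edge
    "telemetry_snapshot",   # AON/AUX/clock/PCIe rails + hotspot temps
    "ring_logs",            # long-form history, read last
]

def triage(sources):
    """Return, in cause-preserving order, the sources that were actually
    readable -- the gaps themselves are evidence (e.g. a missing token
    means the last-gasp path failed, not that nothing happened)."""
    return [name for name in READ_ORDER if sources.get(name) is not None]

avail = {"first_fault_token": b"...", "pg_vector": b"...", "ring_logs": None}
print(triage(avail))  # → ['first_fault_token', 'pg_vector']
```

Reading in this order means a conclusion is formed from first-cause records before the ring logs, which are the most likely to contain overwritten or unordered noise.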
Q9 Many sensors exist, but hotspots remain invisible—how should thermal zones and sampling points be changed to matter?
Mainboard checks: map sensors to the heat-flow chain: inlet/outlet air, VRM islands, DIMM banks, PCIe slot zones, and connector choke points.
Common causes: sensors sit on mechanically convenient but thermally “cold” copper, sampling is too slow to capture excursions, or alert thresholds are not aligned with the platform’s throttle/derate behavior.
Fast proof: perform a controlled workload ramp and compare inlet/outlet delta, VRM island temps, and slot-zone temps with event timing.
Escalate: if the question becomes fan/pump control policy, hand off to Fan & Thermal Management / Liquid Cooling pages.
Example parts: TMP468, TMP451, NCT7802Y.
Q10 For “minimal production test coverage,” which signal classes must be measurable to avoid future field gaps?
Mainboard checks: define a minimal set that proves the board can (1) power domains correctly, (2) exit reset deterministically, (3) distribute ref-clk to consumers, and (4) preserve a readable failure summary.
Must-measure classes: key rails presence (AON/AUX/clock/PCIe), PG/EN/RESET vector, ref-clk presence at consumer-side test points, and a readable first-fault token + snapshot pointer after an injected interruption.
Fast proof: adopt “fail-stop” gates: any missing class stops the line before deeper functional tests.
Escalate: if a failing gate points to one subsystem, hand off to that subsystem’s page for deep validation.
Example parts: ADS1115, TCA9548A, PCA9555.
Q11 With only logs and limited telemetry in the field, how to quickly separate power issues, clock issues, and management-plane false alarms?
Mainboard checks: classify by correlation and first-cause integrity.
Power signature: PG drops and rail anomalies time-align with the event. Clock signature: ref-clk intent/enable or clock-rail noise correlates with link flaps while rails/PG remain stable. Management false-alarm signature: reset cause and hardware vectors stay clean while software-origin markers dominate the summary.
Fast proof: require a single timestamp domain for snapshots; reject conclusions from unordered logs.
Escalate: clean rails + clean clock intent + persistent platform errors → hand off to BMC / PCIe Retimer / CPU VRM pages as indicated by the first-fault token.
Example parts: PCF85063A, FM25V10, Si5341.
Q12 When should debug stop at the mainboard page and dive into sibling pages (CPU VRM / PCIe Retimer / BMC)?
Mainboard checks: stop only after the board-level evidence is “closed”: rails stable during the failure window, PG/RESET dependencies proven correct, ref-clk intent stable at the consumer, and logs/snapshots are survivable and time-ordered.
Dive criteria: evidence points to a single subsystem (one VR telemetry domain, one slot zone, or management-plane markers) while board-level vectors remain clean.
Fast proof: repeatability under injected conditions (brownout, thermal ramp, load step) while board-level state stays stable.
Escalate: CPU VRM for regulation/control details; PCIe Switch/Retimer for SerDes training/equalization; BMC for management-plane behavior.
Example parts: LTC2937, 9DBV0641, FM25V10.
- CPU VRM (VR13/VR12+) — regulation control & protections
- PCIe Switch / Retimer — SerDes equalization & training internals
- Baseboard Management Controller (BMC) — management-plane ownership
- In-band Telemetry & Power Log — aggregation/anomaly analysis