
Rack Server Mainboard Power, Clock, and Telemetry Design


Central idea

A rack server mainboard is the system integration layer that turns power rails, reset domains, reference clocks, telemetry, and logs into a single bootable and debuggable platform. It succeeds when the board provides a trustworthy evidence chain—first-fault capture, time-aligned snapshots, and power-fail survivability—so issues can be isolated quickly without guesswork.

H2-1 · What a Rack Server Mainboard Owns

One-page boundary: the mainboard is the integration layer that turns power, clocks, and evidence (telemetry + logs) into a bootable, verifiable, debuggable system.

Key takeaways (fast orientation)

  • Power Tree names rails and domains, defines sequencing + PG/reset dependencies, and places measurement points so readings match reality.
  • Clock Tree distributes ref-clocks/PLLs with board-level jitter hygiene (power/return/route isolation) before blaming link silicon.
  • Evidence Tree makes failures explainable: consistent telemetry snapshots, event ordering, power-fail handling, and durable post-mortem logs.
  • Stop line: when the question becomes “chip-internal control/SerDes/protocol,” switch to link-only sibling pages.

A “mainboard-level problem” usually looks normal in isolation (a rail voltage is in range, a status bit is set, a clock frequency exists) yet the system still fails (cold-boot miss, reset storm, link flaps, missing evidence after power loss). The correct first step is to map the symptom to one (or an intersection) of the three trees, then decide where deeper component-level analysis is justified.

Topic | Here (mainboard owns) | Not here (link-only)
CPU VRM & multi-rail PoL | Rail taxonomy, distribution/return constraints, sequencing dependencies, measurement tiers, PMBus/SMBus visibility | Loop compensation, multiphase control, DrMOS selection, controller-internal tuning
DDR power & DIMM logic | Domain boundaries, boot dependencies, presence/management interface roles in the system view | DDR5 PMIC / RCD / DB / SPD internals and register-level behavior
PCIe stability & ref-clock | Ref-clock/PLL fanout topology, board-level jitter hygiene, power/return coupling risks that cause link flaps | Retimer/switch SerDes equalization, training details, device-internal diagnostics
OOB management | BMC as a consumer of sensors/events/logs: what must be observable and trustworthy on the board | IPMI/Redfish command sets, firmware architecture, update and policy internals
Telemetry platform | Board-side sensor placement, bus partitioning, sampling consistency, timestamps, deglitch and alarm storm control | System-level analytics pipelines, anomaly detection engines, fleet policy logic

Symptom index (symptom → first tree → decision point)

  • Intermittent power-up / cold-boot miss → start with Power Tree (sequencing + PG/reset graph + measurement tiers) → then decide if VRM deep-dive is needed.
  • Reset storm / watchdog-like reboot loop → start with Power Tree + Evidence Tree (PG deglitch, event ordering, last-fault preservation).
  • PCIe link flaps while “freq is correct” → start with Clock Tree (jitter hygiene, coupling from power/return) → then decide whether to suspect retimers.
  • No usable evidence after power loss → start with Evidence Tree (power-fail detect → hold-up intent → log flush) → only link to PSU/hot-swap pages if needed.

Scope guard (stop line)

Chip-internal control theory (VRM compensation), SerDes equalization, and BMC protocol/firmware details are link-only topics. This page stays at board-level topology, observability, and evidence integrity.

Figure A — Mainboard ownership map (Power Tree · Clock Tree · Evidence Tree)

Block-diagram navigation: use it to decide “system-first” vs “component deep-dive.”

Link-only siblings: CPU VRM · PCIe Switch/Retimer · BMC · In-band Telemetry · DDR pages.

H2-2 · Power Tree on the Mainboard: Rail Taxonomy & Distribution

This chapter defines a practical rail language (groups, domains, naming, and measurement tiers) so logs and telemetry map cleanly to real boot behavior.

Key takeaways (system-level, not VRM control theory)

  • Rail groups must carry meaning (boot criticality, noise sensitivity, transient stress, isolation need), not just voltage labels.
  • Distribution must be designed as a loop (forward path + return path). Return mistakes commonly surface as “clock/link instability.”
  • Measurement must be tiered: regulator output vs distribution node vs load point—each answers a different question.
  • Stop line: multiphase tuning/compensation details belong to the CPU VRM page (link-only).

A useful taxonomy separates rail (a regulated voltage line) from power domain (a functional boundary that may require multiple rails). When taxonomy is consistent, power-fail snapshots and boot-time logs become searchable evidence rather than isolated numbers.

Practical naming rule (for searchable logs)

Use a consistent pattern that encodes intent: DOMAIN + STATE + optional LEVEL. Example pattern: V_DOMAIN_STATE (e.g., V_IO_AON, V_AUX_STBY). Avoid pure numeric names that cannot be reasoned about from logs.
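As an illustration, the naming rule can be enforced mechanically at log-ingest time. A minimal Python sketch, with hypothetical domain and state vocabularies (the real lists come from the board’s power tree, not from this example):

```python
import re

# Hypothetical vocabularies -- substitute the board's actual lists.
DOMAINS = {"CORE", "SOC", "IO", "AUX", "AON", "STBY"}
STATES = {"AON", "STBY", "RUN", "MAIN"}

# V_<DOMAIN>_<STATE>[_<LEVEL>], e.g. V_IO_AON, V_AUX_STBY, V_CORE_RUN_0V8
RAIL_NAME = re.compile(r"^V_(?P<domain>[A-Z]+)_(?P<state>[A-Z]+)(?:_(?P<level>\d+V\d*))?$")

def validate_rail_name(name: str) -> bool:
    """True only if the rail name encodes a known domain and state."""
    m = RAIL_NAME.match(name)
    return bool(m) and m["domain"] in DOMAINS and m["state"] in STATES
```

Running such a check in CI against the schematic netlist keeps pure numeric names (which cannot be reasoned about from logs) from ever reaching telemetry.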

Rail group | What it powers (system view) | Board-level risk | Primary observability | Typical failure signature
CORE | Compute core rails (boot-critical) | Fast transients; local hotspots; PG sensitivity | V/I near load + event flags | Cold-boot miss, reset storm after load step
SoC | Uncore / fabric / management islands | Dependency ordering; partial-boot traps | PG graph + boot-time snapshots | Stalls during early init; “looks on” but not usable
IO | PCIe slots, PHY-adjacent support rails | Noise coupling into clocks/links via returns | V + thermal + clock health markers | Link flaps despite “clock present”
AUX | Board services, sideband logic, helpers | Brownouts trigger false faults | V + brownout logs | Spurious faults, intermittent boot peripherals
AON | Always-on logic, persistent monitors | Silent degradation; long-tail reliability | Low-rate telemetry + persistent counters | Evidence gaps, inconsistent last-known state
STBY | Standby pre-boot context | Sequencing corner cases | PG + time ordering | Boot loops, “works warm but fails cold”

Distribution rules that prevent “measures fine, fails anyway”

  • Design the return path deliberately: large di/dt loops inject noise into sensitive references through shared returns; treat return continuity as a first-class constraint.
  • Partition only when it improves fault containment: excessive split planes create return discontinuities that hurt clocks/links more than they help isolation.
  • Keep observability honest: place measurement points where they answer the intended question—regulator health vs distribution loss vs load reality.

Measurement tiers (what each tells)

Regulator output = power source health · Distribution node = copper/connector loss & hotspots · Load point = true supply at the consumer. Mixing tiers in logs without labeling produces false conclusions.
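One way to keep tiers from being mixed silently is to make the tier part of every reading’s data, not a log-message convention. A sketch; the `Tier` and `VoltageReading` names are illustrative, not from any real library:

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    REGULATOR_OUTPUT = "output"    # answers: is the power source healthy?
    DISTRIBUTION_NODE = "node"     # answers: copper/connector loss, hotspots?
    LOAD_POINT = "load"            # answers: true supply at the consumer?

@dataclass(frozen=True)
class VoltageReading:
    rail: str
    tier: Tier        # unlabeled tier mixing is what produces false conclusions
    volts: float

def droop(output: VoltageReading, load: VoltageReading) -> float:
    """Distribution loss between regulator output and load point (same rail)."""
    assert output.rail == load.rail
    assert output.tier is Tier.REGULATOR_OUTPUT and load.tier is Tier.LOAD_POINT
    return output.volts - load.volts
```

With tiers typed, a comparison between a regulator-output reading and a load-point reading is an explicit, reviewable operation instead of an accident.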

Scope guard (stop line)

This chapter stops at board-level taxonomy, distribution/returns, and measurement strategy. Multiphase control/compensation and controller tuning belong to the CPU VRM page (link-only).

Figure B — Rail groups and distribution (taxonomy-first)

Static power tree view: group rails by intent (CORE/SoC/IO/AUX/AON/STBY), then map to load zones and observability points.

Takeaway: return path continuity matters.

Next chapter (H2-3) will turn this taxonomy into a sequencing/PG dependency graph without diving into VRM control theory.

H2-3 · Sequencing & Reset Domains: EN/PG Dependencies That Actually Boot

Goal: convert “it boots” into a reusable dependency model—reset domains, enable paths, PG conditioning, and the failure patterns that create cold-boot misses and reset storms.

Key takeaways

  • Reset domains are the unit of truth (CPU / PCIe / BMC / AUX / AON), not individual rails.
  • PG is analog → digital: deglitch/blanking, thresholds, and OR/AND composition decide whether the system is stable or trapped in a storm.
  • Cold vs warm: always-on rails preserve context; some domains must fully drop before re-entry to avoid “partial-boot” deadlocks.
  • Stop line: chip-internal mechanisms are link-only; this chapter stays at board-level graphs and evidence integrity.

A reusable dependency model

A boot sequence succeeds when each reset domain transitions exactly once into a valid state window: EN asserted → rails settle → PG becomes stable → reset released. Failures happen when any of those signals is unstable, mis-grouped, or evaluated at the wrong time scale.

Dependency patterns (when each is appropriate)

Chain is simple but fragile (a single PG glitch cascades). Tree improves resilience by isolating branches. Domain-based gating is the most robust: each consumer domain has explicit prerequisites and its own deglitch window.
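Domain-based gating can be sketched as a small dependency check. The domain names and PG-signal names below are illustrative placeholders, assuming each domain lists its stable-PG prerequisites and upstream domains explicitly:

```python
# Each reset domain declares explicit prerequisites: the PG signals that
# must be stable and the upstream domains that must already be released.
RESET_DOMAINS = {
    "AON":  {"pg": {"PG_AON"}, "after": set()},
    "BMC":  {"pg": {"PG_AON", "PG_STBY"}, "after": {"AON"}},
    "CPU":  {"pg": {"PG_CORE", "PG_SOC"}, "after": {"AON"}},
    "PCIE": {"pg": {"PG_IO", "REFCLK_VALID"}, "after": {"CPU"}},
}

def releasable(domain: str, stable_pg: set, released: set) -> bool:
    """A domain's reset may be released only when every PG prerequisite is
    stable AND every upstream domain is already released."""
    d = RESET_DOMAINS[domain]
    return d["pg"] <= stable_pg and d["after"] <= released

def release_order(stable_pg: set) -> list:
    """Greedy release: keep releasing until no further domain qualifies.
    Domains whose prerequisites never stabilize simply stay in reset."""
    released, order = set(), []
    progress = True
    while progress:
        progress = False
        for name in RESET_DOMAINS:
            if name not in released and releasable(name, stable_pg, released):
                released.add(name)
                order.append(name)
                progress = True
    return order
```

The useful property of this shape is containment: a missing PG holds exactly the dependent branch in reset, rather than cascading as a chain topology would.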

PG conditioning: how reset storms are created (and prevented)

  • Deglitch / blanking: filters transient dips during ramp and load steps; too short creates chatter, too long hides real faults.
  • Threshold & hysteresis: a PG threshold too close to ripple turns normal noise into repeated resets.
  • OR/AND composition: composition errors cause either false release (unsafe boot) or false hold (never boots).
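The deglitch and hysteresis behavior above can be sketched in a few lines. A sketch only: the thresholds and sample counts are illustrative placeholders, not recommended values:

```python
class PgConditioner:
    """Board-level PG conditioning sketch: hysteresis thresholds plus a
    deglitch window expressed as N consecutive consistent samples."""

    def __init__(self, v_assert=0.95, v_deassert=0.90, deglitch_samples=3):
        self.v_assert = v_assert          # raw PG asserts above this (rising)
        self.v_deassert = v_deassert      # raw PG deasserts below this (falling)
        self.deglitch = deglitch_samples  # consecutive samples required to flip
        self.raw = False                  # comparator state
        self.pg = False                   # conditioned (deglitched) output
        self._count = 0

    def sample(self, volts: float) -> bool:
        # Hysteresis: the comparator flips only past the opposite threshold,
        # so ripple inside the band cannot chatter.
        if self.raw and volts < self.v_deassert:
            self.raw = False
        elif not self.raw and volts > self.v_assert:
            self.raw = True
        # Deglitch: the output follows raw only after N consistent samples.
        if self.raw != self.pg:
            self._count += 1
            if self._count >= self.deglitch:
                self.pg = self.raw
                self._count = 0
        else:
            self._count = 0
        return self.pg
```

Note the trade-off named in the bullets: a smaller `deglitch_samples` creates chatter, a larger one hides real faults, so the window must be tuned per domain.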

Reset storm signature

PG toggles → reset toggles → rails re-enter transient → PG toggles again. This loop often looks like “random reboots” unless events are timestamped and PG inputs are labeled by domain.

Cold boot vs warm reboot: what must drop, what may stay (AON)

Warm reboots often appear healthier because AON preserves monitoring context and avoids re-training of unrelated domains. However, some domains require a full drop to clear latched states, stale resets, or partial-initialization traps.

Reset domain | EN source | PG prerequisites | Reset outputs | Must-capture evidence
CPU domain | System power controller / sequencer | CORE + SoC stable PG (with deglitch), temperature-safe window | CPU_RST#, power-good to CPU | PG timeline, V/I snapshot at load point, last reset reason timestamp
PCIe domain | Sequencer + slot enable policy | IO rails stable PG + ref-clock valid marker | PERST#, slot enable gates | Ref-clock health marker, IO rail snapshot, link flap counter timestamp
BMC domain | AON/Standby path | AON + STBY stable PG | BMC_RST#, sensor fabric reset | Event ordering, sensor bus health, durable fault log write status
AUX domain | Sequencer (policy-driven) | AUX PG conditioned, brownout counters stable | Sideband logic resets | Brownout counters, AUX rail snapshot, glitch counters
AON domain | Always-on supply | AON PG (tight deglitch, long-term stability) | Persistent monitor resets | Persistent counters, last-known-good snapshot, power-fail detect timestamp

PG false-trigger Top 10 (diagnostic format)

  1. Blanking too short → PG drops during ramp/load step → align PG edge vs ramp/step timestamp → extend blanking and capture pre/post snapshots.
  2. Threshold too close to ripple → PG chatters at steady state → compare ripple margin to threshold → add hysteresis or move measurement tier.
  3. Wrong domain composition (OR/AND mix-up) → either never releases reset or releases unsafely → audit PG-to-domain map → split domains and gate locally.
  4. PG sourced from the wrong tier (regulator output only) → “reads OK” but load point sags → add load-point sense marker or distribution-node monitor.
  5. Shared return coupling → PG comparator sees ground bounce as droop → correlate PG drops with high di/dt rails → repair return continuity / segregation.
  6. Debounce not aligned to time constants → filters the wrong thing → measure settling time vs debounce window → tune windows by domain (CPU vs PCIe vs AUX).
  7. Reset release too early → partial init, later collapse → enforce “PG stable for N ms” rule → add stable-state timer per domain.
  8. Latch behavior misunderstood → one-shot faults persist across warm reboot → require full drop of specific domains → encode cold/warm rules explicitly.
  9. Telemetry lag masks causality → logs show rails after the event → take event-triggered snapshots → store “edge-time + snapshot bundle.”
  10. Power-fail detect too late → evidence lost on dropout → prioritize early detect + minimal write set → confirm flush completion marker.

Link-only deep dives

Figure C — Sequencing dependency graph (rails → PG conditioning → reset domains)

Block diagram: dependencies and gating only, not chip internals. Takeaway: stable boot = domain gating + PG conditioning + timestamped evidence, not “rails look OK.”

Practical next step: implement per-domain prerequisites + stable timers, then validate with event snapshots (pre/post) rather than single-point readings.

H2-4 · VRM Islands on a Mainboard: Placement, Sensing, and Telemetry Hooks

Goal: treat each VRM as a managed “power island” with clear physical boundaries (heat/current/noise), trustworthy sensing, and consistent telemetry access.

Key takeaways

  • Placement is system engineering: heat density, current loops, and magnetic keepouts must be resolved together.
  • Remote sense is credibility: Kelvin routing + return integrity decide whether telemetry reflects load reality.
  • Telemetry hooks: V/I/T taps and event-triggered snapshots prevent “looks normal” misdiagnoses.
  • PMBus/SMBus topology: segmentation and address planning prevent bus-level failures from becoming false power faults.

VRM island physical constraints (board view)

  • Thermal: define hotspots and airflow boundaries; temperature gradients change losses and can bias sensing and protection thresholds.
  • Current loops: keep high di/dt loops compact; long distribution loops convert load steps into droop, ground bounce, and false PG behavior.
  • Magnetic keepout: inductor fields and switch-node regions must avoid ref-clock traces, sensitive sideband lines, and sense pairs.

Keepout rules (practical)

Reserve a “quiet corridor” for ref-clock + sensitive sideband. Route sense pairs away from inductor clusters and from shared high-current returns. Prefer explicit return continuity over excessive plane splits.

Sense / remote sense: measurement tiers that stay truthful

Remote sense is not a “long wire to the load.” It is a Kelvin measurement system that requires a matched return reference. Without return integrity, measurements become load-correlated noise rather than voltage truth.

Common mistakes → typical field symptoms

  • Sense taken at regulator output only → load droop not visible; throttling without clear cause.
  • Sense return shares the high-current return → readings jump with load; false UV/OCP signatures.

Telemetry hooks: what must be visible (system-level)

  • V/I/T minimum set: load-point voltage marker, distribution-node loss indicator, island temperature near hotspots.
  • Event snapshots: capture a small bundle at each PG/reset edge (pre/post) with timestamps to preserve causality.
  • Consistency: label measurement tier (output vs node vs load) so logs remain comparable across revisions.
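An event-triggered snapshot is essentially a short pre-history ring that gets frozen at the edge, plus a bounded post-window. A sketch with hypothetical field names (`vit` stands for a V/I/T reading bundle):

```python
from collections import deque

class EdgeSnapshotter:
    """Keeps a small pre-history ring; on a PG/reset edge it freezes the
    pre-window and then collects a post-window, all tied to one timestamp."""

    def __init__(self, pre=4, post=4):
        self.pre = deque(maxlen=pre)   # rolling pre-history
        self.post_left = 0
        self.post_n = post
        self.bundle = None             # last captured pre/post bundle

    def poll(self, t_ms, vit):
        """Called on the normal sampling cadence with a V/I/T dict."""
        if self.post_left:             # still filling the post-window
            self.bundle["post"].append((t_ms, vit))
            self.post_left -= 1
        else:
            self.pre.append((t_ms, vit))

    def on_edge(self, t_ms, edge):
        """Called at a PG/reset edge, e.g. edge='PG_CORE falling'."""
        self.bundle = {"edge": edge, "t_ms": t_ms,
                       "pre": list(self.pre), "post": []}
        self.post_left = self.post_n
```

Because the pre-window exists before the edge fires, the bundle preserves causality that a poll-after-the-fact telemetry read cannot recover.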

PMBus/SMBus access (topology only)

Reliable telemetry depends on bus topology that survives multi-card capacitance, hot environments, and multiple masters (host + BMC). Address planning and segmentation prevent “bus issues” from masquerading as power instability.

Topology item | Board-level intent | Failure signature if ignored
Address plan | Stable map across revisions; predictable discovery; avoid collisions across VRM islands | “Missing VRM” telemetry, intermittent reads, false power-fault conclusions
Segmentation | Partition long/loaded segments; isolate noisy zones; reduce fault propagation | Random NACKs, bus lockups during boot, telemetry lag that breaks causality
Multi-master boundaries | Define ownership windows; avoid contention; protect snapshot collection during critical edges | Arbitration-like collisions, timeouts, “telemetry disappears during faults”
Level integrity | Ensure consistent logic levels across islands (topology placement of level domains) | Works in lab, fails in rack; intermittent reads tied to temperature/cabling

Scope guard (stop line)

Figure D — VRM islands, keepouts, sense routing, and telemetry taps (board view)

Board view only; no chip-internal loops or protocol details. Takeaway: VRM island success = placement + truthful sense + segmented telemetry access + event snapshots.

Link-only deep dive for VRM internals: CPU VRM (VR13/VR12+).

H2-5 · Clock/PLL & PCIe Ref-Clk Distribution: Board-Level Jitter Hygiene

Goal: distribute ref-clk across CPU, PCIe slots, and accelerator zones while keeping jitter and crosstalk within a usable board-level envelope.

Key takeaways

  • Topology first: zoned fanout reduces cross-domain coupling and limits blast radius when something degrades.
  • Jitter hygiene: power coupling, return integrity, thermal drift, and crosstalk are the dominant board-level risk paths.
  • Routing is a system rule: differential-pair continuity and reference-plane consistency matter more than “pretty length matching.”
  • Budget in 4 blocks: Source + Fanout + Power + Routing is enough to decide where to look without deep PLL theory.

Ref-clk distribution topologies (board view)

  • Single-source → global fanout: simplest, but vulnerable to cross-domain coupling and single-point failure effects.
  • Zoned fanout: separate branches for CPU / PCIe slots / accelerator zone; improves isolation and troubleshooting locality.
  • Redundant / dual-source (concept): used for high availability; board-level concerns are switching boundary, isolation, and validation markers.

Dominant jitter risk paths (board-level)

Power noise coupling into clock buffers/cleaners, ground/return bounce near high di/dt loops, return crossing zones that defeats isolation, and thermal drift that creates temperature-correlated instability.

Routing & isolation rules that prevent “mystery link flaps”

  • Reference continuity: keep the same reference plane through transitions; avoid layer changes across split returns.
  • Pair integrity: maintain consistent impedance and coupling; treat stubs and uncontrolled “Y splits” as high-risk.
  • Keepouts: avoid running ref-clk parallel to VRM islands and large current loops; prioritize return integrity over aggressive plane cuts.
  • Zone boundaries: keep crossings explicit and minimal; label zone edges in layout and in validation evidence.

Simplified jitter budget (enough to diagnose)

Budget block | Primary sources | Typical symptom | First evidence to capture
Source | Ref source health, clock module stability marker | Wide-area instability across zones | Source “valid” marker + temperature correlation
Fanout | Fanout buffer additive jitter, zone boundary integrity | One zone degrades while others remain stable | Zone A/B/C comparison, branch-by-branch failure map
Power | Supply ripple/ground bounce coupling into clock path | Errors correlate with load steps or rail events | Ref-clk errors aligned to V/I snapshots and PG edges
Routing | Crosstalk, reference discontinuity, stubs, bad crossings | Slot-specific flaps, revision-specific failures | Slot/location correlation, “same card different slot” test
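For the arithmetic of the budget itself, uncorrelated random-jitter contributions are conventionally combined root-sum-square, while correlated or deterministic terms add linearly. A sketch with made-up numbers; real values come from datasheets and bench correlation:

```python
import math

def rss_jitter(contributions_ps: dict) -> float:
    """Root-sum-square of uncorrelated RMS jitter contributions (ps RMS).
    Deterministic/correlated terms should be added linearly instead."""
    return math.sqrt(sum(j * j for j in contributions_ps.values()))

# Illustrative budget blocks (ps RMS) -- placeholders, not recommendations.
budget = {"source": 0.15, "fanout": 0.10, "power": 0.20, "routing": 0.05}
```

The diagnostic value is comparative: if the measured total greatly exceeds the RSS of the known blocks, one block’s assumption (usually power coupling or routing) is wrong, and the table above says what evidence to capture next.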

Scope boundary

Figure E — Ref-clk clock tree + simplified jitter budget (board view)

Emphasis on zoning, coupling paths, and budget blocks. Takeaway: board success = zoned fanout + clean power/return + low-crosstalk routing + comparable evidence across zones.

H2-6 · Telemetry Fabric: Thermal / Current / Voltage Signals into a Coherent Story

Goal: turn “many sensors” into a single evidence timeline—where placement, bus architecture, and time consistency explain real rack behavior.

Key takeaways

  • Placement by physics: sensors must cover hotspots, airflow transitions, and current branches where failures originate.
  • Domain buses: segment and isolate I²C/SMBus/PMBus/I³C so bus faults do not masquerade as power faults.
  • One time axis: sampling windows, timestamps, and edge-trigger snapshots preserve causality.
  • Storm control: debounce + threshold classes + rate limiting prevents alarm avalanches without hiding true faults.

Sensor placement logic (what to cover)

  • Thermal: VRM islands, DIMM field, PCIe slots/accelerator zone, airflow inlet and outlet for gradient context.
  • Current: main rails plus critical branches (slot power branches, aux domains) to expose abnormal load steps and brownout precursors.
  • Voltage: key rails with labeled measurement tier (output / distribution node / load marker) to avoid “looks OK” misreads.

Collection buses (architecture only)

Telemetry reliability depends on board-level bus zoning: segment long/loaded runs, isolate noisy regions, and define multi-master access boundaries. The goal is not protocol depth, but evidence continuity during boot edges and fault edges.

Bus zoning principles

Segment by physical zone (VRM / DIMM / PCIe), isolate when a fault would otherwise propagate, and reserve a protected window for edge-trigger snapshots.

Data consistency: sampling, timestamps, thresholds, and storm suppression

  • Sampling cadence: temperature is slow, current can be fast, and PG/reset are edge events—capture them on a common timeline.
  • Timestamps: edge events must carry timestamps; snapshots should store “pre/post” bundles.
  • Threshold classes: map signals into Info / Warning / Critical classes so actions remain consistent.
  • Debounce & rate limiting: suppress chatter and avalanche alarms while preserving first-cause evidence.
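The above can be combined into one small gate per signal: classify, debounce the class change, and rate-limit repeats. A sketch; the thresholds, hold count, and gap are illustrative placeholders:

```python
class AlarmGate:
    """Storm-control sketch: threshold classes + debounce + rate limiting.
    The first qualifying alarm always passes (first-cause is preserved)."""

    def __init__(self, warn, crit, hold=2, min_gap_ms=1000):
        self.warn, self.crit = warn, crit
        self.hold = hold                   # samples before a class change sticks
        self.min_gap_ms = min_gap_ms       # minimum gap between emitted alarms
        self._cls, self._cand, self._n = "Info", "Info", 0
        self._last_emit = None

    def classify(self, value):
        if value >= self.crit:
            return "Critical"
        return "Warning" if value >= self.warn else "Info"

    def feed(self, t_ms, value):
        """Returns an alarm string to emit, or None (suppressed)."""
        cls = self.classify(value)
        if cls != self._cand:              # debounce: candidate must persist
            self._cand, self._n = cls, 1
        else:
            self._n += 1
        if self._n >= self.hold and cls != self._cls:
            self._cls = cls
            if cls != "Info" and (self._last_emit is None
                                  or t_ms - self._last_emit >= self.min_gap_ms):
                self._last_emit = t_ms
                return f"{cls}@{t_ms}"
        return None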

Evidence integrity warning

Telemetry that arrives after the event breaks causality. Edge-trigger snapshots (with timestamps) often diagnose issues faster than high-rate polling.

Telemetry Map (signal → location → purpose → threshold class)

The table below is a reusable template: label the physical zone, measurement tier, and evidence tag so logs remain comparable across board revisions and slot configurations.

Signal | Physical zone | Tier | Primary purpose | Class | Action | Debounce / window | Evidence tag
VRM island temp | VRM zone | Hotspot | Explain throttling/derating and thermal drift correlation | Warning/Critical | Derate / Log | Slow window | ZONE:VRM
CPU load V marker | CPU zone | Load | Expose true droop vs “regulator looks OK” | Critical | Protect / Log | Edge snapshot | DOMAIN:CPU
Slot branch current | PCIe zone | Branch | Detect abnormal load steps and correlate to link flaps | Warning | Log / Limit | Fast window | SLOT:BRANCH
DIMM field temp | DIMM zone | Array | Identify airflow imbalance and thermal hotspots near memory | Warning | Fan policy / Log | Slow window | ZONE:DIMM
Inlet temp | Airflow inlet | Ambient | Normalize hotspot readings and explain rack-level excursions | Info/Warning | Log | Slow window | AIR:IN
Outlet temp | Airflow outlet | Ambient | Compute gradient and detect cooling degradation | Warning | Alert / Log | Slow window | AIR:OUT
AON voltage | AON/STBY | Domain | Preserve evidence chain; detect early power-fail risk | Critical | Snapshot / Log | Edge snapshot | DOMAIN:AON
PG edge marker | System edge | Event | Anchor causality for boot/fault sequences | Critical | Snapshot | Immediate | EVENT:PG
Reset reason | System edge | Event | Differentiate storm vs single-fault; guide next inspection | Critical | Log | Immediate | EVENT:RST
Bus health counter | Telemetry buses | Fabric | Separate bus failures from “power failures” | Warning | Alert / Log | Medium | FABRIC:BUS

Link-only deep dives

Figure F — Telemetry fabric ring: zones, bus segments, and edge-trigger snapshots

Takeaway: a coherent story = zoned sensors + segmented buses + one timestamped evidence timeline.

H2-7 · Power-fail Detection & Hold-up Intent: Making Logs Survive Reality

Goal: explain why power-loss logs often become untrustworthy, and how a mainboard makes last-gasp evidence deterministic and verifiable.

Key takeaways

  • Earliest detect: “first to know” must trigger last-gasp actions before the voltage slope becomes unrecoverable.
  • Hold-up intent: the target is finishing a minimal, verifiable write set, not extending runtime indefinitely.
  • Deterministic commit: header → payload → CRC → VALID prevents “looks written” logs that are actually corrupted.
  • One timebase: the timestamp source must be explicit; mismatched clocks break causality across BMC/host evidence.

Power-fail detect chain (who knows first, who finishes the job)

  • Early warning sources: power-good loss, undervoltage trend, hot-swap fault, standby/AON droop markers.
  • Last-gasp executor: the component responsible for finishing “must-write” actions (snapshot + minimal log + commit markers).
  • Broadcast boundary: propagate a single “power-fail state” so domains stop starting new transactions mid-collapse.

Hold-up intent (board-level)

Hold-up is a completion window for a minimal set: freeze event order, capture a snapshot, and commit a durable summary with integrity markers. The win condition is verifiable completion, not long endurance.

Why power-loss logs fail (and what the board must do about it)

Failure mode | What happens in reality | Mainboard countermeasure (no register-level detail)
Voltage drops too fast | Power collapses before “nice shutdown” paths finish; partial writes look like valid logs. | Trigger earlier + reduce must-write set + commit markers that clearly distinguish VALID vs INVALID.
Write time is variable | Flash/FRU write latency varies; last-gasp budget becomes nondeterministic. | Write a compact summary first; defer large data to next boot; use “commit_state” to indicate partial vs complete.
Bus congestion / contention | I²C/SMBus transactions stall or collide; snapshot reads become incomplete during collapse. | Reserve a last-gasp window; freeze nonessential bus traffic; snapshot from pre-latched values when possible.
Timestamp mismatch | BMC vs host vs RTC disagree; events cannot be aligned into one causal timeline. | Declare a primary timebase + record timestamp_source; other clocks are secondary tags, not authoritative ordering.

Last-gasp action checklist (minimal, deterministic)

  • Freeze order: latch event_id and lock “first-cause” so later noise cannot overwrite root cause.
  • Snapshot: capture a minimal bundle (V/I/T + PG/Reset bitmap) tied to the same event_id.
  • Commit protocol: write header → payload → CRC → VALID (VALID written last).
  • Bus guard: rate-limit or block nonessential bus activity; prevent late transactions from delaying commits.
  • Self-evidence: if completion fails, the log must explicitly mark INVALID (never “looks good”).
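The commit protocol can be sketched end to end. The byte layout, CRC choice, and VALID marker below are illustrative assumptions, not a specified format; `storage` stands in for the durable region:

```python
import zlib

VALID_MARK = b"\xA5"   # illustrative marker byte; written last

def commit(storage: bytearray, payload: bytes) -> None:
    """Last-gasp commit sketch: header, payload, CRC, then VALID last.
    If power dies before the final byte, the record self-marks as invalid."""
    header = len(payload).to_bytes(2, "big")
    crc = zlib.crc32(header + payload).to_bytes(4, "big")
    for chunk in (header, payload, crc):   # ordered writes
        storage.extend(chunk)
    storage.extend(VALID_MARK)             # VALID is the final write

def read_back(storage: bytes):
    """Returns the payload, or None if the record cannot be trusted."""
    if not storage or storage[-1:] != VALID_MARK:
        return None                        # power died before VALID
    header, rest = storage[:2], storage[2:-1]
    payload, crc = rest[:-4], rest[-4:]
    if zlib.crc32(header + payload).to_bytes(4, "big") != crc:
        return None                        # corruption despite VALID
    return payload
```

The point of the ordering is that every truncation point yields a record that fails either the VALID check or the CRC check, so a partial write can never masquerade as evidence.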

Scope boundary

Figure G — Power-fail evidence chain (detect → last-gasp → commit → trust mark)

Takeaway: deterministic last-gasp = minimal write set + explicit integrity markers + declared timebase.

H2-8 · Event Logs & Evidence Tree: From Raw Flags to a Post-mortem Narrative

Goal: organize fields into an evidence tree so post-mortems preserve first-cause, keep order, and produce a reliable “last shutdown” summary.

Key takeaways

  • 3-layer logs: flags (fast) + snapshots (context) + durable records (survive power loss) form a complete narrative.
  • First-cause latch: the root cause must be latched; later effects append, not overwrite.
  • Stable IDs: event_id + snapshot_ptr make evidence comparable across boots, boards, and revisions.
  • Boot-first reading: read a compact summary early; validate integrity before trusting details.

Log layers (what each layer contributes)

  • Instant events (fault flags): fastest indicators; prone to chatter—must be deglitched and protected from overwrite.
  • State snapshots (telemetry snapshot): V/I/T + PG/Reset context at the moment of change; anchors causality.
  • Durable records (FRU/flash): survives power loss; must be compact and integrity-checked (CRC + VALID marker).

Overwrite rule

Preserve the first cause. Later events append “effects chain” fields; they must not replace the root-cause code or its evidence pointers.

Event numbering and causality protection (board-level policy)

  • event_id: monotonic counter (preferably persistent across boots) to keep ordering stable.
  • first_cause_code: latched once; subsequent entries only add effects and secondary tags.
  • dedup window: merge chatter into one event within a short window; avoid “storm logs” that erase meaning.
  • commit_state: written/partial/invalid so post-mortems never treat partial writes as true evidence.
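The latch-and-dedup policy reads naturally as a small state machine. A minimal sketch, with illustrative event codes and window length:

```python
class EvidenceLog:
    """Board-level policy sketch: monotonic event_id, first-cause latched
    exactly once, later events appended as effects, chatter deduplicated."""

    def __init__(self, dedup_window_ms=50):
        self.event_id = 0            # monotonic; persist across boots in practice
        self.first_cause = None      # latched once, never overwritten
        self.effects = []            # effects chain (append-only)
        self.dedup_window_ms = dedup_window_ms
        self._last = None            # (code, t_ms) of last accepted event

    def record(self, t_ms, code):
        # Dedup: merge identical codes arriving within the window.
        if (self._last and self._last[0] == code
                and t_ms - self._last[1] < self.dedup_window_ms):
            self._last = (code, t_ms)
            return self.event_id     # merged into the previous entry
        self.event_id += 1
        self._last = (code, t_ms)
        if self.first_cause is None:
            self.first_cause = code  # first-cause latch
        else:
            self.effects.append(code)
        return self.event_id
```

Note what the latch buys: even in a reset storm, the root-cause field reflects the first transition, and the storm itself shows up as a short effects chain rather than a log that erased its own beginning.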

Boot-time read strategy (get the “last shutdown” story early)

  • Step 1 — Read summary first: a one-line “Last Shutdown Summary” plus (event_id, snapshot_ptr).
  • Step 2 — Validate integrity: verify CRC/sequence/VALID marker before trusting any detail.
  • Step 3 — Expand selectively: fetch only the referenced snapshot fields needed to confirm causality.
  • Step 4 — Classify outcome: clean shutdown vs power-fail vs reset storm, based on first-cause and effects chain.
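Steps 1–4 compress into a small classifier. A sketch: the field names echo the schema fields used in this chapter, and the storm threshold is an illustrative assumption:

```python
def classify_shutdown(summary: dict) -> str:
    """Boot-time read sketch: validate integrity first, then classify.
    `summary` is the compact last-shutdown record read at boot."""
    # Step 2 gate: never trust details from a partial/invalid record.
    if summary.get("commit_state") != "written" or not summary.get("crc_ok"):
        return "untrusted-record"
    # Step 4: classify from first-cause and counters.
    if summary.get("reset_count_window", 0) >= 3:   # illustrative threshold
        return "reset-storm"
    if summary.get("first_cause_code") == "POWER_FAIL":
        return "power-fail"
    return "clean-shutdown"
```

Expansion (step 3) then fetches only the snapshot referenced by (event_id, snapshot_ptr) for the returned class, instead of replaying the whole log.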

MVP log schema (field-name level, register-agnostic)

Field | Meaning | Why it matters
event_id | Monotonic event sequence number | Stable ordering across storms and reboots
severity_class | Info / Warning / Critical | Consistent actions and summaries
domain | CPU / PCIe / VRM / DIMM / AON / FABRIC | Localizes root cause and evidence ownership
first_cause_code | Latched root-cause category | Prevents “last error wins” failure mode
effects_bitmap | Reset/throttle/bus_fault/link_flap indicators | Captures the consequences chain without overwriting cause
pg_reset_bitmap | Key PG/Reset states at the event boundary | Explains boot loops and power domain dependencies
snapshot_ptr | Pointer/index to minimal telemetry snapshot | Links events to V/I/T context on the same timeline
timestamp_value + timestamp_source | Time and declared timebase (primary source) | Enables causality alignment across components
commit_state | written / partial / invalid | Makes reliability explicit (no fake “good” logs)
crc | Integrity marker for durable record | Detects corruption and partial writes
boot_count / reset_count_window | Boot and reset-storm counters | Identifies storm patterns and suppresses noisy repetition

Scope boundary

Figure H — Evidence tree: preserve first-cause and build a post-mortem narrative

Takeaway: evidence tree = first-cause latch + stable IDs + integrity markers + boot-first summary.

H2-9 · Board Interfaces that Matter: Sideband, Headers, and Debug Affordances

Goal: define what the mainboard must expose so bring-up, factory test, and field service can observe state, confirm intent, and localize faults—without fragile hacks.

Key takeaways

  • Sideband Classify by intent: reset, presence/ID, power-good/health, and clock request/control.
  • Observability Critical state must be probe-able: test points, debug headers, LEDs/7-seg, and latched faults.
  • Inject-ability Some states must be safely inject-able for validation (feeds H2-10).
  • Scope Board-level semantics only (no PCIe protocol training, no BMC software feature list).

Sideband taxonomy (board-level semantics, not protocol deep-dives)

Reset family

Platform reset, device reset, domain reset fanout; deglitch and domain boundaries prevent reset storms.

Presence / ID family

Slot/device presence and identity; stable defaults + chatter control avoid phantom insert/remove behavior.

Power-good / health family

PG, FAULT, ALERT, standby-good; OR/AND policy and latching preserve first-cause.

Clock request / control family

CLKREQ/control gating and mux intent; isolation from high-current loops protects jitter hygiene.

Debug affordances (make failures localizable, repeatable, and low-damage)

  • Test points strategy: group by domain (AON/AUX/IO/CORE), ensure ground reference quality, and keep probe access predictable.
  • Straps / jumpers: controlled mode switches (safe disable/force paths) that allow deterministic diagnosis and rollback.
  • Diagnostic LED / 7-seg: encode boot stage, domain readiness, and a minimal fault code (field-friendly).
  • Fault latching: preserve first-cause across resets so storms do not erase meaning.
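The fault-latching rule above (preserve first cause, append effects) can be sketched as follows; the bit positions are illustrative, not a register map:

```python
class FirstCauseLatch:
    """Preserve first cause across a storm; later faults only append effects."""
    RESET, THROTTLE, BUS_FAULT, LINK_FLAP = 1, 2, 4, 8

    def __init__(self):
        self.first_cause = None
        self.effects = 0

    def report(self, cause: str, effect_bit: int):
        if self.first_cause is None:
            self.first_cause = cause     # latched once; storms cannot overwrite it
        self.effects |= effect_bit       # consequences accumulate, never replace cause
```

A VRM PG drop followed by a link flap keeps `first_cause == "PG_VRM_DROP"` while both effect bits are set, which is exactly the "last error wins" failure mode the latch prevents.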

Factory measurability: what must be probe-able vs inject-able

Signal/affordance type Must be probe-able (observe) Should be inject-able (validate) — safely
Reset Reset assertion/deassertion visibility per domain; fanout integrity. Controlled reset injection to prove recovery paths and storm suppression.
Presence / ID Presence line state; stable defaults when absent. Simulated presence toggle for production screening (no damage, reversible).
PG / health Key PG/FAULT visibility; latched first-cause access. Fault injection at the policy boundary to prove fail-fast gating.
Clock control intent Clock enable/intent states and gating visibility (board level). Controlled gating injection to verify safe fallback/disable behavior.

Scope boundary

Figure I — Sideband & debug affordance map (bring-up → factory → field)

Affordance map: sideband categories (RESET, PRESENCE/ID, PG/FAULT, CLKREQ/CTRL) and debug access (test points, straps, LED/7-seg, debug header) mapped onto the power, clock, and fault-latch trees across the bring-up → factory → field lifecycle.

H2-10 · Bring-up & Validation Checklist: Proving the Board is “Done”

Goal: provide a staged, fail-fast validation path from pre-power checks to delivery readiness—focused on observable points, evidence, and stop criteria.

Key takeaways

  • Stage gates Pre-power → first power → steady/load → deliver, with explicit stop criteria at each gate.
  • Observe, don’t guess Each step defines what to observe and what it proves (no numeric limits required).
  • Evidence pack Keep minimal artifacts: waveforms, summaries, event_id pointers, and integrity markers.
  • Storm-safe Verify reset/power-fail logging remains trustworthy under brief interruptions.

Stage-gated validation flow (with fail-fast stop criteria)

Stage What to observe What it proves Fail-fast stop criteria
Pre-power Rail shorts/impedance sanity; default strap/presence states; clock-source enable intent; PG dependency map consistency. Board is safe to energize; defaults are predictable; dependencies match design intent. Any critical rail shows abnormal behavior or defaults are inconsistent/unpredictable.
First power Domain-by-domain enable/PG transitions; reset release order; initial refclk enable stability; early fault latches. Power/reset sequencing is stable and repeatable; domains reach ready state without storms. PG chatter, repeated resets, unexplained current behavior, or unstable refclk intent.
Steady / load Thermal rise trends in hotspots; telemetry self-consistency; event logs under activity; brief interruption / power-fail behavior. Thermal/telemetry/logs form a coherent story; evidence survives realistic disturbances. Telemetry contradicts itself; logs lack integrity markers; intermittent faults cannot be localized.
Deliver Checklist sign-off; minimal evidence pack; last-shutdown summary path; debug affordances confirmed usable in factory. Board is serviceable and reproducible in production; post-mortems are actionable. Evidence pack incomplete, or critical debug hooks are not accessible/repeatable.
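The fail-fast gating above can be sketched as a stage runner, where each check is a hypothetical predicate over observable evidence:

```python
def run_gates(checks):
    """Fail-fast stage gates: stop at the first stage whose check fails.
    `checks` maps stage name -> zero-argument predicate (observable evidence)."""
    for stage in ("pre-power", "first-power", "steady-load", "deliver"):
        if not checks[stage]():
            return ("STOP", stage)       # never proceed past a failed gate
    return ("DELIVER", None)
```

The ordering is the point: a failed pre-power check must prevent the board from ever being energized, and a failed first-power check must prevent load testing from hiding a sequencing fault.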

Evidence pack (minimal artifacts that make bring-up repeatable)

  • Power/reset snapshots: captures around domain enable/PG transitions and reset release.
  • Clock intent proof: enable/control visibility and stable refclk presence at the board boundary.
  • Telemetry consistency: a small set of hotspot trends (V/I/T) tied to the same observation window.
  • Log integrity: last-shutdown summary with event_id, snapshot_ptr, and CRC/VALID indicators.

Fail-fast philosophy

If a step cannot be proven with observable evidence, stop the flow. Moving forward hides root causes behind later-domain activity.

Scope boundary

Figure J — Bring-up → validate → deliver (stage gates + stop criteria + evidence)

Gate flow: PRE-POWER → FIRST POWER → STEADY/LOAD → DELIVER, each gate pairing an observable proof with an explicit stop criterion, feeding a minimal evidence pack (waveforms, summary, event_id, CRC/VALID). Stage gates + observable proofs + explicit stop criteria make bring-up repeatable and deliverable.

H2-11 · Field Debug Playbook: fastest isolation using power/clock/logs

Field triage is fastest when evidence is reduced to three synchronized pillars: rail state (PG/EN/reset), reference-clock hygiene, and a survivable log snapshot. The following playbooks stay at the mainboard integration level and stop at the point where a handoff to a sibling page becomes justified.

Common capture discipline (applies to all three trees)

  • Step 1 — Capture a pre-event baseline.
  • Step 2 — Latch a fault token at first failure.
  • Step 3 — Freeze a time-stamped telemetry snapshot.
  • Step 4 — Read out in boot-stage order.
What to freeze Why it accelerates isolation
PG/RESET state vector Distinguishes “never reached a domain” vs “oscillating in reset storm” vs “booted then crashed”.
Key rails (AON, AUX, PCIe, clock rails) Separates undervoltage droop, sequencing dependency violations, and localized hot spots.
Ref-clk presence + enable chain Turns “link flaps” into a deterministic board-level gating/jitter check before suspecting SerDes training.
Power-fail detect + last-gasp window markers Explains “no logs” as either missing trigger, missing compute interrupt, or insufficient write window.
Boundary rule: if the evidence shows stable ref-clk, stable rails, and correct resets, yet link training still fails, a handoff to the PCIe Switch / Retimer page is justified; do not chase SerDes internals inside the mainboard page.
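The PG/RESET state-vector distinctions in the freeze table can be sketched as a classifier over a time-ordered PG history; the storm threshold is a hypothetical placeholder:

```python
def classify_boot_state(pg_samples):
    """Classify a frozen PG history: never reached a domain, oscillating in a
    reset storm, booted, or booted then crashed.
    `pg_samples` is a time-ordered list of booleans: True = all PG asserted."""
    if not any(pg_samples):
        return "never-reached-domain"        # PG never came up at all
    transitions = sum(a != b for a, b in zip(pg_samples, pg_samples[1:]))
    if transitions > 2:                      # hypothetical storm threshold
        return "reset-storm"                 # PG toggling: dependency/deglitch issue
    if pg_samples[-1]:
        return "booted"
    return "booted-then-crashed"
```

Even this coarse split is enough to route triage: "never reached" points at sequencing, "storm" at deglitch/dependency logic, "booted then crashed" at load-dependent rails or thermal drift.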

Decision Tree A — Intermittent cold-boot failure (fastest path)

Six to eight steps to isolate “won’t boot” without guessing
  1. Confirm AON domain integrity first. Verify that always-on rails are stable and that the reset source/latch is readable (brownout vs watchdog vs external reset).
  2. Check for a reset storm signature. Look for repeated reset assertion with PG toggling; treat this as a dependency / deglitch problem until proven otherwise.
  3. Inspect the PG vector at the first failure edge. A single missing PG identifies the first broken dependency; multiple PG drops usually indicate a shared upstream rail or ground return disturbance.
  4. Validate PG deglitch assumptions. If a PG signal is “mostly high” but shows narrow low pulses, the system can still reset (blanking too small, threshold too close, or OR/AND logic too aggressive).
  5. Separate cold-start vs warm-reset dependencies. If warm reset works but cold boot fails, focus on rails/domains that require discharge, re-initialization, or are temperature sensitive.
  6. Audit enable chain ownership. Ensure each enable signal has a single authority (no “wired-OR ambiguity” between BMC, CPLD, and straps).
  7. Time-align telemetry around the boot window. AON, AUX, PCIe/clock rails, and VRM telemetry must share a consistent timestamp domain for reliable causality.
  8. Escalate only when the board-level state is proven stable. If all domains are stable and reset exits cleanly, hand off to CPU/DIMM/retimer-specific pages.
Focus signals Practical meaning
PG_AON, RESET_CAUSE Distinguishes power integrity vs functional reset.
PG_CHAIN vector Finds the first broken dependency in a tree or chain.
EN_CPU, EN_PCIe, EN_AUX Checks “authority collisions” and sequencing correctness.
Typical board-level fix patterns: increase PG deglitch/blanking, move thresholds away from droop zones, add “first-fault” latching, and avoid multi-master enable contention.
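The first-missing-PG scan in step 3 can be sketched as a walk over the dependency order; the rail names are illustrative:

```python
def first_missing_pg(pg_vector, dependency_order):
    """Walk the PG vector in dependency order and return the first rail whose
    power-good is low; everything downstream of it is a consequence, not a cause."""
    for rail in dependency_order:
        if not pg_vector.get(rail, False):
            return rail
    return None  # all PGs high: the failure is not a broken PG dependency
```

If the same rail is always first to drop across repeated failures, the dependency (or its deglitch/threshold) is the suspect; if the first-missing rail varies, look upstream for a shared rail or ground-return disturbance.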
Example MPNs: TPS3899, LTC2937, MAX16054, TCA9548A, PCA9555, FM25V10, PCF85063A.
Figure A11 — Cold-boot isolation flow (rail/PG/reset first)
Isolation order: AON rails → reset cause (BOR/WDT/EXT) → PG vector (first missing PG) → PG deglitch/thresholds → enable ownership → time-aligned telemetry snapshot (AON/AUX/clock/PCIe rails) → escalate to CPU VRM / DIMM / Retimer pages only after the board-level state is proven stable.

Decision Tree B — “PCIe link flaps” (ref-clk first, then handoff)

Six to eight steps to separate clock hygiene from SerDes internals
  1. Prove ref-clk is physically present at the consumer. Measure at the slot/device side (not only at the source) to catch mux/enable gating errors.
  2. Validate the ref-clk power rails and enables. A ref-clk buffer can be “configured correctly” but still degrade under rail noise or thermal drift.
  3. Check sideband stability around the flap. Ensure PERST#, CLKREQ#, and presence signals do not chatter due to marginal pull-ups or strap contention.
  4. Correlate flaps with temperature gradients. A consistent thermal trigger points to clock-domain drift or a localized PI issue near the clock tree.
  5. Validate jitter-cleaning assumptions at board level. Confirm the intended reference source, holdover state, and fanout topology (single point vs zoned fanout).
  6. Only then suspect retimer/switch behavior. If ref-clk amplitude/presence and power rails are clean during flaps, handoff to the retimer/switch page is justified.
  7. Freeze the “flap packet” evidence. Log: ref-clk enable state, clock-rail telemetry, and sideband vector at the first flap edge.
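The “flap packet” in step 7 can be sketched as a snapshot on a single timebase; the three reader callbacks are hypothetical board-access hooks:

```python
import time

def freeze_flap_packet(read_clk_enable, read_clock_rail_mv, read_sideband):
    """Capture the minimal evidence set at the first flap edge: clock intent,
    clock-rail telemetry, and the sideband vector, all on one monotonic timebase."""
    return {
        "t_mono_ns": time.monotonic_ns(),      # single timebase for causality
        "refclk_enable": read_clk_enable(),    # gating/mux intent at the consumer
        "clock_rail_mv": read_clock_rail_mv(), # rail telemetry for the clock buffer
        "sideband": read_sideband(),           # PERST#/CLKREQ#/presence vector
    }
```

Capturing all three fields atomically at the first flap edge is what makes the later "clean clocks and rails → hand off to retimer/switch" decision defensible.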
Example MPNs: 9DBV0641, LMK03328, 8T49N241, Si5341, LMK04828, TCA9548A.
Figure B11 — Ref-clk & sideband gating checks before suspecting retimers
Triage order: capture the first flap edge (freeze clock enable + clock-rail telemetry) → confirm ref-clk at the consumer → check buffer rails and enables (noise/droop/thermal drift) → validate sideband stability (PERST#/CLKREQ#/presence) → verify jitter-cleaning topology (source select, holdover, fanout zoning) → hand off to the PCIe Switch/Retimer page only if clocks and rails stay clean during flaps.

Decision Tree C — “Power loss happened but no logs survived”

Six to eight steps to make last-gasp evidence reliable
  1. Prove the power-fail detect edge exists. Confirm the comparator/supervisor sees the correct rail and asserts power-fail early enough.
  2. Prove the interrupt reaches the owner. Verify that the last-gasp signal lands on the intended BMC/MCU/CPLD input without level-translation ambiguity.
  3. Measure the write window, not the intention. Determine the real time between power-fail assertion and rail collapse under worst-case load.
  4. Eliminate bus congestion as a silent killer. If I²C/SMBus is busy or wedged, last-gasp writes can fail even with “enough” hold-up energy.
  5. Use a survivable target for the first record. Write the first-fault token to a medium that tolerates abrupt loss and frequent writes (then mirror to slower storage later).
  6. Guarantee timestamp consistency. A stable RTC or monotonic counter prevents “logs exist but cannot be ordered”.
  7. Verify on injected brownouts. Validate the entire chain by repeatedly injecting short power interruptions and checking that the summary is preserved.
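The repetition check in step 7 can be sketched as a harness loop; `inject_brownout` and `read_summary` are hypothetical test-fixture hooks:

```python
def validate_last_gasp(inject_brownout, read_summary, runs=100):
    """Repeat injected interruptions and require a committed, consistent
    'last cause' summary after every single run."""
    causes = set()
    for _ in range(runs):
        inject_brownout()
        s = read_summary()
        if s["commit_state"] != "written":
            return (False, "partial-or-invalid record")   # write window too small?
        causes.add(s["first_cause_code"])
    if len(causes) != 1:
        return (False, "inconsistent causes across runs")  # ordering/pointer bug
    return (True, "consistent last-cause summary")
```

A single partial record out of a hundred runs is a real field escape, not noise, which is why the loop fails hard on the first one.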
Example MPNs: TPS3899, LTC2937, FM25V10, W25Q128JV, 24AA02E48, PCF85063A, LTC3350, LTC4041, TPS61094.
Figure C11 — Power-fail evidence chain (detect → interrupt → write → survive)
Evidence chain: detect edge (supervisor/comparator) → interrupt owner (BMC/MCU/CPLD input) → measured worst-case write window → survivable first record (FRAM / protected flash region) → timestamp coherence (RTC / monotonic counter) + boot read order; validate by repetition: 100+ outage injections must yield a consistent “last cause” summary.

Concrete reference parts (MPN examples) used by the playbooks

The following part numbers are examples commonly used to implement “detect/sequence/log/clock” building blocks on complex boards. Selection must match rail voltages, interfaces, validation rules, and platform constraints.

Function Example MPNs
Voltage supervisor / power-fail detect TPS3899, MAX16054
Multi-rail sequencing + fault logging LTC2937
I²C/SMBus segmentation TCA9548A
GPIO expansion for straps/LEDs/interrupts PCA9555
“First-fault token” survivable NVM FM25V10 (FRAM), W25Q128JV (SPI NOR)
FRU / identity EEPROM (example) 24AA02E48
RTC for consistent timestamps (example) PCF85063A
PCIe ref-clk fanout buffer (example) 9DBV0641
Clock generation / conditioning (examples) LMK03328, 8T49N241, Si5341, LMK04828
Last-gasp / supercap backup controller (examples) LTC3350, LTC4041, TPS61094
Practical rule: write the first record to a survivable medium (FRAM or a protected flash region), then mirror/serialize into the long-form event log after reboot. This prevents “the most important failure” from being overwritten by late-stage noise.
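The write-first-to-survivable-medium rule can be sketched with dict/list stand-ins for the FRAM region and the long-form log:

```python
def record_first_fault(fram, flash_log, token):
    """Commit the first-fault token to the survivable medium first; later faults
    only append to its effects list, and the long-form log is a best-effort mirror.
    `fram` (dict) and `flash_log` (list) are stand-ins for the real media."""
    if "token" in fram:
        fram["effects"].append(token)     # first cause already latched: append only
    else:
        fram["token"] = token             # tiny write, survives abrupt power loss
        fram["effects"] = []
    flash_log.append(token)               # mirror may be lost on outage; that is acceptable
```

The asymmetry is deliberate: the FRAM record is small and write-once per incident, while the flash mirror carries volume and can be reconciled after reboot.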


H2-12 · FAQs (Mainboard-only) ×12

Core idea

These answers stay at the mainboard integration boundary: rail/PG/reset evidence, clock distribution hygiene, telemetry topology, and power-fail survivability. When deeper component internals are needed, the answer ends with a clear handoff point to a sibling page (CPU VRM / PCIe Switch-Retimer / BMC).

Each answer follows the same four-part structure: (1) Mainboard checks, (2) Common board-level causes, (3) Fast proof steps, (4) Escalate rule.
Example MPNs listed below are typical mainboard building blocks (supervisors, muxes, clock buffers, FRAM/RTC) and are not a mandate for any platform.
Q1 Why can rails look “normal” yet the board still intermittently refuses to power on?

Mainboard checks: capture the first-fail edge: PG vector, reset cause, and a time-aligned rail snapshot (AON/AUX/clock rails) before any reboot storm overwrites evidence.

Common causes: PG chatter that never appears in averaged telemetry, an enable-owner collision (two masters driving EN), or a missing “first-fault token” that turns every failure into the same symptom.

Fast proof: latch first-fault + snapshot on a supervisor interrupt; compare cold-boot vs warm-reset dependency behavior.

Escalate: if PG/reset exits cleanly and all rails are stable during the failed boot window, hand off to CPU VRM / DIMM bring-up pages.

Example MPNs: LTC2937, TPS3899, FM25V10.
Q2 PG is deglitched, yet a reset storm persists—what are the three most common dependency/logic mistakes?

Mainboard checks: review the reset-domain map and the exact PG combine logic (what is hard-gate vs soft-monitor).

Most common mistakes: (1) wrong AND/OR composition that promotes a non-critical PG into a global reset gate; (2) a circular dependency (EN_B depends on PG_A while PG_A depends on EN_B); (3) thresholds/blanking placed inside a droop band, so “clean” PG still toggles under load steps.

Fast proof: freeze the PG vector at the first low pulse; confirm whether the same PG is always first to drop.

Escalate: only after logic/thresholds are proven correct should rail dynamics be analyzed on the CPU VRM page.

Example MPNs: TPS3899, LTC2937, SN74LVC1G32.
Q3 Remote sense is “placed correctly,” but readings still drift—what is the most common mainboard-level reason?

Mainboard checks: verify that the sense pair and its reference return do not share high di/dt ground segments; correlate drift with temperature gradients and bus activity.

Common causes: Kelvin sense is correct but the ground reference moves (return path contamination), connector/copper temperature coefficient changes the effective drop, or the telemetry chain (ADC reference/filtering/sample timing) aliases noise into a “slow drift.”

Fast proof: compare a direct DMM at the load with telemetry during a controlled load step and a fan-speed change.

Escalate: if regulation error is strongly load-transient dependent with stable references, hand off to CPU VRM control-loop analysis.

Example MPNs: INA238, INA229, PCA9517A.
Q4 PMBus polling often times out—how to tell topology issues from electrical noise first?

Mainboard checks: treat PMBus as an electrical network: segmentation, pull-ups, branch length, address plan, and multi-master ownership.

Topology signature: errors persist at idle and correlate with specific branches/addresses. Noise signature: timeouts cluster with load steps, fan PWM edges, or reset/PG activity.

Fast proof: isolate with a mux (one segment at a time); add a forced-bus-recovery step; then repeat under a controlled load step to test noise correlation.

Escalate: if only one VR domain misbehaves after segmentation proves the bus healthy, hand off to the CPU VRM page for device-side behavior.

Example MPNs: TCA9548A, PCA9517A, PCA9615.
Q5 Ref-clk frequency measures “correct,” but links are unstable—what should be checked at the mainboard level first?

Mainboard checks: prove the clock intent (enable/gating), then verify clock-buffer/jitter-cleaner power rails, thermal gradients, and sideband chatter (PERST#/CLKREQ#).

Common causes: frequency is right but jitter is not (supply noise coupling), duty-cycle/SSC assumptions differ across zones, or a marginal enable chain intermittently gates the clock under load/temperature.

Fast proof: correlate link flaps with clock-rail telemetry and sideband vector at the first flap edge.

Escalate: if ref-clk intent + rails are clean during flaps, hand off to PCIe Switch/Retimer internals.

Example MPNs: 9DBV0641, Si5341, LMK03328.
Q6 Is “more clock fanout” always better? When should zoning replace a single global fanout?

Mainboard checks: identify zones with very different noise/thermal environments (CPU socket vs PCIe/GPU slots) and evaluate whether independent enable/isolation is required.

When zoning wins: a noisy load zone couples supply/return noise into the clock rail; long cross-zone routing breaks reference-plane continuity; or fault containment is needed (one zone can be muted without collapsing the entire clock tree).

Fast proof: compare clock-rail noise and flap rate with a “zoned enable” experiment.

Escalate: if a specific endpoint’s tolerance is the question, hand off to the relevant PCIe Retimer/Switch page.

Example MPNs: LMK1C1104, 9DBV0641, Si5341.
Q7 Why are power-loss logs often untrustworthy—and how to tell “no time to write” vs “written but read wrong”?

Mainboard checks: verify the power-fail detect edge, the owner interrupt path, and a measurable “last-gasp window” marker (start/end) under worst-case load.

No time to write: detect is too late or the window collapses too fast. Written but read wrong: VALID/CRC/sequence mismatches or an outdated pointer selects the wrong record after reboot.

Fast proof: inject repeatable brownouts; confirm first-fault token and pointer advance every time.

Escalate: if detect/window are proven correct yet persistence still fails, hand off to the storage/firmware log-carrier page.

Example MPNs: TPS3899, FM25V10, LTC3350.
Q8 After a brief power interruption reboot, which logs/snapshots should be read first?

Mainboard checks: read in “cause-preserving” order to avoid chasing overwritten noise.

Best-first order: (1) first-fault token / last-shutdown summary, (2) reset cause + brownout marker, (3) PG vector at first-fail edge, (4) minimal telemetry snapshot (AON/AUX/clock/PCIe rails + hotspot temps), then (5) long-form ring logs.

Fast proof: validate ordering by repeating injected brownouts and confirming that the summary stays consistent.

Escalate: if summaries consistently implicate a single subsystem, hand off to that subsystem’s sibling page.

Example MPNs: FM25V10, PCF85063A, W25Q128JV.
Q9 Many sensors exist, but hotspots remain invisible—how should thermal zones and sampling points be changed to matter?

Mainboard checks: map sensors to the heat-flow chain: inlet/outlet air, VRM islands, DIMM banks, PCIe slot zones, and connector choke points.

Common causes: sensors sit on mechanically convenient but thermally “cold” copper, sampling is too slow to capture excursions, or alert thresholds are not aligned with the platform’s throttle/derate behavior.

Fast proof: perform a controlled workload ramp and compare inlet/outlet delta, VRM island temps, and slot-zone temps with event timing.

Escalate: if the question becomes fan/pump control policy, hand off to Fan & Thermal Management / Liquid Cooling pages.

Example MPNs: TMP468, TMP451, NCT7802Y.
Q10 For “minimal production test coverage,” which signal classes must be measurable to avoid future field gaps?

Mainboard checks: define a minimal set that proves the board can (1) power domains correctly, (2) exit reset deterministically, (3) distribute ref-clk to consumers, and (4) preserve a readable failure summary.

Must-measure classes: key rails presence (AON/AUX/clock/PCIe), PG/EN/RESET vector, ref-clk presence at consumer-side test points, and a readable first-fault token + snapshot pointer after an injected interruption.

Fast proof: adopt “fail-stop” gates: any missing class stops the line before deeper functional tests.

Escalate: if a failing gate points to one subsystem, hand off to that subsystem’s page for deep validation.

Example MPNs: ADS1115, TCA9548A, PCA9555.
Q11 With only logs and limited telemetry in the field, how to quickly separate power issues, clock issues, and management-plane false alarms?

Mainboard checks: classify by correlation and first-cause integrity.

Power signature: PG drops and rail anomalies time-align with the event. Clock signature: ref-clk intent/enable or clock-rail noise correlates with link flaps while rails/PG remain stable. Management false-alarm signature: reset cause and hardware vectors stay clean while software-origin markers dominate the summary.

Fast proof: require a single timestamp domain for snapshots; reject conclusions from unordered logs.

Escalate: clean rails + clean clock intent + persistent platform errors → hand off to BMC / PCIe Retimer / CPU VRM pages as indicated by the first-fault token.

Example MPNs: PCF85063A, FM25V10, Si5341.
Q12 When should debug stop at the mainboard page and dive into sibling pages (CPU VRM / PCIe Retimer / BMC)?

Mainboard checks: stop only after the board-level evidence is “closed”: rails stable during the failure window, PG/RESET dependencies proven correct, ref-clk intent stable at the consumer, and logs/snapshots are survivable and time-ordered.

Dive criteria: evidence points to a single subsystem (one VR telemetry domain, one slot zone, or management-plane markers) while board-level vectors remain clean.

Fast proof: repeatability under injected conditions (brownout, thermal ramp, load step) while board-level state stays stable.

Escalate: CPU VRM for regulation/control details; PCIe Switch/Retimer for SerDes training/equalization; BMC for management-plane behavior.

Example MPNs: LTC2937, 9DBV0641, FM25V10.
Figure F12 — FAQ coverage map (Q1–Q12 → mainboard evidence domains)
Coverage map: sequencing & reset domains (PG/EN dependencies, deglitch, reset cause) · clock / PCIe ref-clk hygiene (fanout zones, enables, jitter rail coupling) · VRM islands, integration view (placement, remote-sense integrity, telemetry taps) · telemetry fabric (PMBus/SMBus/I²C segmentation, timestamps) · power-fail & event logs (detect → interrupt → write window → VALID/CRC/pointers) · interfaces & test (sideband, test points, fail-stop gates), with each FAQ (Q1–Q12) mapped to the evidence domain it exercises.
Q12 is the boundary guard: once rails/PG/resets, ref-clk intent, and survivable logs are proven stable, deeper analysis belongs to sibling pages.