
OCP OpenRack Baseboard for I3C/I2C Bus Management


An OpenRack baseboard turns scattered I3C/I2C devices into an operable, segmentable, and recoverable management fabric by enforcing discovery/addressing, multi-domain telemetry context (domain + timestamp + identity), and tiered bus recovery, so observability survives hot-plug and power-domain transitions.

H2-1 — What “OpenRack Baseboard” Covers

An OpenRack baseboard is the physical layer for sideband connectivity and sensor/FRU visibility. Its engineering core is I3C/I2C bus governance (topology, segmentation, isolation), device discovery & addressing (DAA/address plan), and multi-domain telemetry aggregation (power/thermal + domain state) into the management plane.

Deliverables — What this page is expected to make operational
  • A bus architecture that stays stable at scale: segmented backbone + isolated islands so one fault does not take down the entire rack-side visibility.
  • A repeatable discovery/addressing scheme: deterministic inventory mapping across hot-join events and multi-domain power states.
  • A telemetry model that is actionable: raw readings → thresholds/alerts → event evidence (timestamps + affected segment/domain).
Boundary — What is in scope vs out of scope

Topic | In scope (this page) | Out of scope (linked elsewhere)
Management | Bus master placement, sideband signal paths, alert wiring, inventory evidence fields | Redfish/IPMI software stack architecture, firmware workflows, UI/telemetry dashboards
Power | Telemetry ingestion (voltage/current/power), domain states (AON/MAIN/HOTPLUG), back-power prevention at the bus layer | PFC/LLC/CRPS conversion topology, VRM control-loop stability/compensation details
Compute/IO | Sideband presence/FRU, sensor islands near connectors, segment isolation strategy | PCIe/NVMe/IB/Ethernet protocol stacks, retimer equalization, dataplane acceleration

Practical rule: if a paragraph starts describing how power is converted or how a management protocol is implemented, it belongs to a sibling page. This page only defines the bus/device layer that makes those systems observable and recoverable.

Figure F1 — Baseboard position: I3C backbone + I2C legacy islands + telemetry nodes
[Diagram] The I3C backbone (bus master, discovery/address plan, telemetry aggregation) fans out through switch/mux/buffer elements to isolated I2C islands: Island A (legacy: FRU/EEPROM, temp array, GPIO/presence) and Island B (power: V/I/P telemetry, domain state AON/MAIN/HOTPLUG), plus an isolated hot-plug join/leave segment with in-band/INT alerts. Design intent: one backbone for discovery + telemetry, isolated islands for fault containment and mixed legacy support.

H2-2 — System Context & Interfaces

A baseboard must be describable as a bus graph plus power/reset domains. The objective is to make every sideband endpoint (FRU, temperature, power monitors, presence) discoverable, addressable, and recoverable under real rack conditions: standby states, partial power loss, and hot-join events.

Interface inventory — Minimum fields to document (bus + domain + recovery)

Device class | Bus segment | Addressing | Power domain | Reset / enable | Alert path | Failure impact
FRU / EEPROM | I2C Island-A | Static / behind mux | AON | Always enabled | Polling | Low (segmentable)
Temp array | I2C Island-A | Static | AON or MAIN | GPIO enable | INT (optional) | Medium (can flood bus)
Power monitor | I2C Island-B | Static | MAIN | Domain PGOOD gated | ALERT#/INT | High (stuck-low risk)
Presence / GPIO expander | I2C Island-A/C | Static | HOTPLUG | Hot-swap domain | INT | High (hot-join noise)
I3C hub/bridge | I3C Backbone | DAA / dynamic | AON | Always enabled | In-band | Critical (backbone)
Power & reset domains — Rules that prevent “invisible” or “hung” systems
  • Define an AON minimum observable chain: a small set of endpoints that must remain readable in standby or partial failure (e.g., FRU + environment temp + backbone health).
  • Assign pull-ups to the correct domain: pull-ups that remain powered while a segment endpoint is off must not create back-power paths through I/O structures.
  • Segment by domain boundaries: any domain that can be powered off or hot-plugged should sit behind a switch/mux/buffer so a fault cannot clamp the entire bus.
  • Reset/enable must be coherent with discovery: if an endpoint can reboot independently, discovery should be re-entrant (re-scan or re-DAA) without destabilizing the backbone.

Minimum-observable does not mean “monitor everything”. It means “keep enough visibility to diagnose and recover”: domain presence, backbone health, and the most critical thermal/power indicators—without relying on full system power.
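The minimum-observable idea above can be sketched as a check over a small endpoint-to-domain map. This is a minimal illustration, not a real inventory schema: the endpoint names, the domain map, and both helper functions are hypothetical.

```python
# Sketch: verify the AON "minimum observable chain" stays readable when only
# AON is powered. Names and the reachability model are illustrative.

ENDPOINT_DOMAIN = {
    "fru.baseboard": "AON",
    "temp.ambient": "AON",
    "bridge.backbone.01": "AON",
    "power.main.zoneA": "MAIN",        # full telemetry, not part of the minimum
}

AON_MIN_CHAIN = ["fru.baseboard", "temp.ambient", "bridge.backbone.01"]

def reachable(endpoint, powered_domains):
    """An endpoint answers a basic read only if its domain is powered."""
    return ENDPOINT_DOMAIN[endpoint] in powered_domains

def min_chain_visible(powered_domains):
    """Diagnosable-and-recoverable floor: every minimum-chain endpoint reads back."""
    return all(reachable(e, powered_domains) for e in AON_MIN_CHAIN)

# Standby (AON only): the diagnostic floor must survive even though MAIN is off.
assert min_chain_visible({"AON"})
assert not reachable("power.main.zoneA", {"AON"})
```

The design review question this encodes: does every endpoint you rely on for diagnosis actually live on AON?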

Figure F2 — Domains & interfaces: AON backbone with segmented MAIN/HOTPLUG islands
[Diagram] The AON domain hosts the I3C master/aggregator and the backbone (discovery + inventory + alerts), which stays alive for discovery and minimum telemetry. MAIN islands (power telemetry V/I/P + domain state; thermal/FRU) and the HOTPLUG island (presence + alerts + join/leave) attach through segment gates, so faults are contained and an off or hot-plug domain cannot back-power the backbone. Design intent: treat the baseboard as a bus graph with explicit domain boundaries, so discovery and telemetry remain available under partial power states.

H2-3 — Bus Topology Patterns

The baseboard should be treated as a bus graph with explicit segments. The objective is not simply connectivity, but fault containment: long traces, high capacitive loads, and hot-join domains must not clamp the backbone or destabilize discovery.

1) Single-master vs multi-master (engineering boundary)

Single-master (default)

One backbone master simplifies arbitration and makes recovery deterministic. Segments isolate failures so a single stuck line becomes a local event rather than a rack-wide visibility loss.

Multi-master (only with strong justification)

Adds arbitration, timing edges, and new failure modes. If introduced, segment boundaries and isolation must be stricter so a misbehaving master cannot flood retries or hold the bus in a degraded state.

2) Segmentation rules (when and why to split)
  • Physical distance: long runs amplify edge distortion and noise pickup; buffers/switches help restore electrical margin and limit the affected length.
  • Capacitive loading: many endpoints slow edges and increase glitch sensitivity; islands cap per-segment load so timing stays stable under worst-case conditions.
  • Power domains: any domain that can be off or partially powered must be behind a gate to prevent back-power and “stuck-low” propagation.
  • Noise sources: hot-plug connector zones and high-current regions deserve separate islands to keep backbone discovery stable.
3) MUX vs SWITCH vs BUFFER (choose by failure domain)
Component | Best for | Not a fix for | Typical placement
MUX | Address conflicts, branch selection, reducing visible endpoints per scan | Domain back-power, hot-join disturbance, hard fault isolation | Between backbone and legacy branches; upstream of small islands
SWITCH | Segment-level isolation, fault containment, controlled attach/detach of a domain | Edge restoration on long lines if used alone; protocol-level inventory logic | At domain boundaries (MAIN/HOTPLUG), at connector entry points
BUFFER | Electrical margin: rise-time help, fanout, restoring edges after long runs | Back-power prevention, isolating a stuck-low endpoint across domains | On long trunks, before high-fanout branches, near noisy zones
4) Hot-plug maintainability (bus-layer rules)
  • Hot-plug domains must be gated: attach/detach should affect only the island, not the backbone.
  • Pull-up ownership must be explicit: avoid powering an unpowered island through bus pull-ups or I/O structures.
  • Discovery must be scoped: hot-join triggers a limited rescan/attach sequence for that island, not a disruptive global churn.

Rule of thumb: if a failure can pull SDA/SCL low, it must be possible to isolate that failure within one segment using a gate (switch/isolator), regardless of whether addressing conflicts also exist.
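The rule of thumb can be checked mechanically if the topology is written down as a graph. A minimal sketch, with hypothetical segment names and fields: each segment records its parent and whether a gate sits on its upstream link.

```python
# Sketch: model the baseboard as a bus graph and apply the rule of thumb above:
# any segment whose failure can pull SDA/SCL low must have a gate between it
# and the backbone. Segment names and attributes are illustrative.

SEGMENTS = {
    "B0":      {"parent": None, "gate": None,     "can_fault_low": False},  # backbone
    "I2C-1":   {"parent": "B0", "gate": "switch", "can_fault_low": True},
    "I2C-HP":  {"parent": "B0", "gate": "switch", "can_fault_low": True},
    "I2C-BAD": {"parent": "B0", "gate": None,     "can_fault_low": True},   # violation
}

def isolation_violations(segments):
    """Segments that can clamp the bus but have no gate toward the backbone."""
    return [name for name, s in segments.items()
            if s["can_fault_low"] and s["parent"] is not None and s["gate"] is None]

# A stuck-low on I2C-BAD would reach the backbone, so it is flagged.
assert isolation_violations(SEGMENTS) == ["I2C-BAD"]
```

Running this style of check in a design review turns the rule of thumb into a pass/fail gate on the netlist-level topology description.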

Figure F2 — Three topology patterns used on OpenRack baseboards
[Diagram] Three patterns: (A) I3C backbone with a bridge into I2C islands (FRU, temp, power); (B) an I2C trunk segmented through a mux tree into islands; (C) multi-domain isolation, with a switch and isolator separating the AON backbone from MAIN sensors and a hot-plug join/leave segment. Goal: isolate fault domains.

H2-4 — I2C vs I3C in Baseboards

The practical question is not “new vs old”, but what becomes easier to operate. I3C strengthens discovery semantics and hot-join handling, while I2C islands remain valuable for legacy endpoints and simple devices that do not require dynamic identity or in-band alert behavior.

1) I3C value (baseboard-relevant capabilities)
  • Dynamic Address Assignment (DAA): reduces static address collisions and supports deterministic inventory mapping when endpoints change.
  • Hot-join semantics: enables controlled attach workflows so insertion events can be scoped to an island rather than destabilizing the backbone.
  • In-band interrupt / alerting: improves time-to-detect for threshold events without relying on heavy polling.
  • Stronger discovery vocabulary: supports a cleaner “device identity → inventory record” pipeline at the bus/device layer.
2) Why keep I2C islands (practical boundary)
  • Legacy endpoints: FRU/EEPROM and many simple sensor classes remain cost-effective and widely available on I2C.
  • Operational simplicity in islands: stable static addressing is acceptable for small, tightly bounded segments.
  • Isolation-first design: keeping legacy on islands can reduce the blast radius of electrical faults or partial-power behavior.
3) Migration pattern (backbone-first)

Step 1 — I3C backbone (AON)

Upgrade the backbone that must remain visible in standby. Make discovery and minimum telemetry deterministic and recoverable.

Step 2 — Bridge + segment gates

Bring legacy islands under the same inventory model via bridges and gates, so hot-plug or power-off behavior stays local.

Step 3 — Replace only where it pays

Move endpoints that benefit from dynamic identity or in-band alerts, while leaving stable low-complexity devices on I2C islands.

4) Engineering comparison (only what matters on baseboards)
Dimension | I2C (islands) | I3C (backbone)
Addressing | Static planning; collisions handled via straps or mux segmentation | DAA supports dynamic identity management and reduces static collision pressure
Discovery | Scan-based visibility; works best for small bounded segments | Stronger discovery semantics; better for inventory mapping across churn
Alerts | Often sideband INT/ALERT or polling-driven detection | In-band alerting reduces dependency on heavy polling and speeds reaction
Scale behavior | Large fanout stresses edges and timing; segmentation becomes mandatory | Backbone-first helps centralize discovery while islands limit capacitive blast radius
Recovery hooks | Segment isolation + rescan; deterministic if islands are small | Scoped attach/detach + re-DAA patterns support controlled churn handling
Migration pitfalls | Address maps fragment easily if mux topology is undocumented | Bridges must preserve isolation and identity mapping under power-domain changes

A mixed design is often the most robust: I3C for backbone-level discovery and alert semantics, I2C islands for bounded legacy endpoints. The design objective is stable observability, not maximal protocol uniformity.

Figure F3 — Layered mixed bus: I3C backbone → bridges → I2C islands
[Diagram] Layer 1: I3C backbone (discovery, inventory, alerts). Layer 2: bridges plus segment gates. Layer 3: bounded legacy I2C islands (FRU + temp, power + GPIO, presence + alert, legacy sensors). Every link is a controlled segment.

H2-5 — Device Discovery & Addressing

A baseboard onboarding flow must be repeatable: plug in → detect → assign an address → verify reachability → register into inventory. Discovery becomes “operational” only when it records segment, power-domain state, alert path, and a verification result alongside the observed address.

1) Static I2C address planning (make collisions predictable)

Collision avoidance toolkit

Prefer segmentation (islands) to keep conflicting endpoints from sharing the same visible bus. Use straps/config pins when available. Use mux selection only when segmentation boundaries are documented and enforced.

Rule

The same numeric address can repeat across different islands, but must not appear in the same visible segment at the same time. Document every segment boundary and its gating element.
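The rule above is easy to enforce automatically once the address plan is machine-readable. A minimal sketch, with an illustrative plan: a 7-bit address may repeat across islands, but a (segment, address) pair must be unique.

```python
from collections import Counter

# Sketch: detect static-address collisions per visible segment.
# Plan entries (segment, address, inventory key) are illustrative.

ADDRESS_PLAN = [
    ("I2C-1",  0x4A, "temp.array.zoneA"),
    ("I2C-2",  0x4A, "temp.array.zoneB"),   # same address, different island: OK
    ("I2C-HP", 0x50, "fru.drawer.slot3"),
]

def segment_collisions(plan):
    """Return (segment, address) pairs that appear more than once."""
    counts = Counter((seg, addr) for seg, addr, _key in plan)
    return sorted(pair for pair, n in counts.items() if n > 1)

assert segment_collisions(ADDRESS_PLAN) == []
# Adding a second 0x4A on I2C-1 makes the collision visible:
assert segment_collisions(ADDRESS_PLAN + [("I2C-1", 0x4A, "temp.extra")]) == [("I2C-1", 0x4A)]
```

A check like this belongs in the review step for Template A below, so a collision is caught before deployment rather than during a scan.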

2) I3C DAA onboarding (dynamic address is not the identity)
  • When to run DAA: after segment attach, after power-domain transitions that change device visibility, and on hot-join events.
  • Scope it: run onboarding on the affected segment/island, not the entire backbone, to avoid unnecessary address churn.
  • Map to inventory: the inventory key must be device identity + segment context; the dynamic address is the current reachability handle.
  • Record changes: log address transitions (before → after) and the reason (join, rescan, recovery).
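The scoped onboarding steps above can be sketched as one small function per segment. This is a hedged model, not a vendor API: `assign` and `verify` are stubs standing in for the real DAA or static-address transactions, and the evidence record mirrors the log templates below.

```python
# Sketch: scoped onboarding of one endpoint on one segment.
# assign/verify are illustrative stubs; only this segment is ever touched.

def onboard(segment, identity, assign, verify, log):
    """Assign an address, verify with a basic read, record evidence."""
    addr = assign(segment, identity)      # DAA result or expected static address
    ok = verify(segment, addr)            # basic read / health check
    log.append({"segment": segment, "key": identity,
                "addr_before": None, "addr_after": addr,
                "verify": "PASS" if ok else "FAIL"})
    return ok

log = []
ok = onboard("I2C-HP", "fru.drawer.slot3",
             assign=lambda seg, key: 0x50,
             verify=lambda seg, addr: True,
             log=log)
assert ok and log[0]["verify"] == "PASS" and log[0]["addr_after"] == 0x50
```

The key design point: the inventory key (`identity`) is the stable handle, and the address is only the current reachability result, recorded before/after.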
3) Operational discovery (discover ≠ usable)
Evidence item | What to record | Why it matters operationally
Segment / island | Segment ID, upstream gate/switch, bridge path | Limits blast radius; enables scoped rescan and targeted isolation
Power domain | AON / MAIN / HOTPLUG state | Prevents false alarms when a domain is intentionally off; explains “missing” devices
Alert path | In-band alert / INT line / none | Separates “device unreachable” from “event signal broken”; improves MTTR
Address state | Static expected, or dynamic assigned; before → after on changes | Supports audit trails, avoids identity drift, and enables stable inventory linking
Verification | Basic read/health check: pass/fail + failure category | Discovery without verification causes inventory pollution and noisy support tickets
4) Template A — Address Plan (single source of truth)

This plan ties addressing to topology and domains. Use it to review changes before deployment.

Domain | Segment / Island | Device class | Inventory key | Addr type | Addr (expected/current) | Alert path | Isolation element | Notes
AON | B0 (backbone) | Bridge | bridge.backbone.01 | DAA | dyn / dyn | in-band | switch gate | Scoped onboarding after attach
MAIN | I2C-1 | TEMP | temp.array.zoneA | Static | 0x4A / 0x4A | INT | mux branch | Address shared only within island
HOTPLUG | I2C-HP | FRU | fru.drawer.slot3 | Static | 0x50 / 0x50 | none | switch gate | Gate open only during service window
5) Template B — Discovery Log (audit-friendly)

This log turns “it disappeared” into a timed, explainable event with scope and recovery actions.

Timestamp | Event | Domain / segment state | Inventory key | Address before → after | Verify | Failure category | Action taken
YYYY-MM-DD hh:mm:ss | HOT-JOIN | HOTPLUG / gate=open | fru.drawer.slot3 | — → 0x50 | PASS | — | Inventory update
YYYY-MM-DD hh:mm:ss | RESCAN | MAIN / stable | temp.array.zoneA | 0x4A → 0x4A | FAIL | NACK storm | Isolate island, retry later
YYYY-MM-DD hh:mm:ss | RECOVERY | AON / backbone | bridge.backbone.01 | dyn → dyn | PASS | — | Scoped DAA on segment

Boundary: this section defines bus/device-layer onboarding evidence (scope, address state, verification, logs). It does not define management protocol stacks or backend database implementations.

Figure F4 — DAA + Hot-Join onboarding state machine (scoped to a segment)
[Diagram] Onboarding state machine, scoped to one segment/island: IDLE/MONITOR → JOIN DETECTED (hot-join/attach) → ASSIGN (DAA or static) → VERIFY (basic read) → INVENTORY UPDATE + evidence. Assign or verify failures enter failure handling: isolate (gate off), log timestamp, then retry policy (scoped rescan / re-DAA / degrade). Evidence fields: segment + domain, alert path, address before/after, verify result.

H2-6 — Bus Electrical Integrity

Most field failures begin as marginal edges: slow rise-time, excessive segment capacitance, glitches, or insufficient low-level margin. A bus that “works once” can still be unstable under temperature, hot-plug, or domain transitions. Evidence should be captured as waveforms and retry behavior, not assumptions.

1) Pull-ups and rise-time (engineering method, not a spec recital)
  • Estimate per-segment load: include trace length, connector parasitics, and endpoint input capacitance. Treat each island as a separate RC problem.
  • Choose pull-up by segment: keep pull-up ownership aligned with the segment’s power domain to prevent back-power paths.
  • Validate at the far end: measure rise-time and noise margin at the worst-case point (end of the longest branch), not only near the master.
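Treating each island as a lumped RC makes the pull-up window computable from two standard I2C constraints: the low-level sink current and the 30%-to-70% rise time (t_r = RC·ln(0.7/0.3) ≈ 0.8473·RC for an RC charge). The numbers below (3.3 V rail, 3 mA sink, 0.4 V VOL, 300 ns Fast-mode rise, 200 pF segment) are typical example values, not a recommendation for a specific design.

```python
import math

# Sketch: per-segment pull-up bounds from sink-current and rise-time limits.

def pullup_bounds(vdd, vol, iol, t_r_max, c_bus):
    """(Rp_min, Rp_max) in ohms for one segment treated as a lumped RC."""
    rp_min = (vdd - vol) / iol                       # must still reach VOL at IOL
    # 30%..70% rise through an RC charge: t_r = RC * ln(0.7/0.3)
    rp_max = t_r_max / (c_bus * math.log(0.7 / 0.3))
    return rp_min, rp_max

rp_min, rp_max = pullup_bounds(vdd=3.3, vol=0.4, iol=3e-3,
                               t_r_max=300e-9, c_bus=200e-12)
print(f"Rp window: {rp_min:.0f} ohm .. {rp_max:.0f} ohm")
# An empty window (rp_min > rp_max) means the segment must be split or buffered.
```

Running the bound per segment also gives an objective trigger for the segmentation rules in H2-3: when added capacitance closes the window, split the island.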
2) Glitch sensitivity (keep it bus-layer)

What typically goes wrong

Short spikes can be interpreted as edges when rise-times are slow or thresholds are marginal. This can produce NACK storms or address-mapping drift during hot-join windows.

What to do (within scope)

Use segmentation and electrical buffering where long runs and noisy connector zones exist. Gate hot-plug islands so disturbances do not propagate into the backbone.

3) Clock stretching and timing edges (why “can run” ≠ stable)
  • Stretching amplifies marginality: one slow or partially-powered endpoint can elongate cycles and trigger timeouts.
  • Marginal edges increase retries: a stable system shows low retry frequency; rising retries are an early warning long before “bus hang”.
  • Contain the failure domain: a segment gate allows recovery actions (retry/rescan) without collapsing overall observability.
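The "rising retries as early warning" point lends itself to a trivial trend check on firmware retry counters. A minimal sketch; the window size and factor are illustrative thresholds, not tuned values.

```python
# Sketch: flag a shrinking-margin trend by comparing a recent retry-rate window
# against the preceding baseline window. Thresholds are illustrative.

def retry_trend_alert(samples, window=4, factor=2.0):
    """True when the recent retry rate exceeds `factor` x the earlier baseline."""
    if len(samples) < 2 * window:
        return False                      # not enough history to judge a trend
    baseline = sum(samples[-2 * window:-window]) / window
    recent = sum(samples[-window:]) / window
    return recent > factor * max(baseline, 1e-9)

assert not retry_trend_alert([1, 1, 2, 1, 1, 2, 1, 1])   # flat: healthy margin
assert retry_trend_alert([1, 1, 1, 1, 4, 6, 8, 9])       # climbing: pre-failure
```

The point is operational: the alert fires per segment, so the worst segment is scoped first, long before a full bus hang.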
4) Oscilloscope checklist (actionable)
Checkpoint | What to look at | Interpretation / next action
Rise-time | SCL/SDA edge speed at the segment end; compare near-master vs far-end | Slow edges suggest high capacitance or a weak pull-up; segment, buffer, or adjust pull-up ownership per domain
Low-level margin | Stability of the low-level “floor” under traffic; look for lifted lows | Insufficient margin can cause false reads; isolate partial-power endpoints and review segment gating
Glitches | Short spikes on SCL/SDA; correlate with missing devices or NACK storms | Glitches plus slow edges are a common pair; shorten or noise-isolate the segment and add electrical buffering where needed
Retry / repeated START | Protocol analyzer or firmware counters: retry rate over time | Rising retries indicate shrinking margin; treat them as a pre-failure signal and scope the worst segment first
Hot-plug window | Waveforms before/during/after attach; check whether backbone edges degrade | If the backbone degrades, the hot-plug domain is not contained; strengthen gating/isolation and rescan only that segment

Boundary: this section focuses on bus-edge integrity and measurement evidence. It does not prescribe full EMC design or chassis grounding rules.

Figure F5 — Simplified waveform signatures (good vs high-C vs weak/strong pull-up)
[Diagram] Four simplified waveform panels against a threshold line: good (fast edge), high capacitance (slow rise), weak pull-up (noise-sensitive), and over-strong pull-up (low-level lift, margin risk). Each panel marks the sampling window.

H2-7 — Multi-Domain Power & Isolation

In baseboards, the most damaging failures come from cross-domain coupling: a powered-off island can back-power, drag the bus, or spread hot-plug disturbances into always-on visibility. The goal is to keep each domain electrically and operationally scoped with clear pull-up ownership and segment gating.

1) Where back-powering sneaks in (conceptual paths)

Typical paths

Cross-domain pull-ups, IO protection/clamp structures, and “half-powered” endpoints can create unintended current paths. The symptom is often unstable bus levels, stuck lows, or devices that never fully reset.

Design intent (within bus/domain scope)

Enforce segment visibility with gates/switches, keep pull-ups owned by the segment’s intended domain, and ensure powered-off islands become electrically quiet and logically invisible.

2) Make powered-off devices “non-blocking”
  • Segment gates: isolate islands so a fault or off-domain endpoint cannot pull SCL/SDA low globally.
  • Pull-up ownership: assign pull-ups to the domain that remains valid for that segment (avoid cross-domain pull-ups by default).
  • Scoped recovery: isolate → log → rescan only the affected island (avoid global churn).
3) Hot-plug containment (keep disturbances local)
  • Hot-plug = separate segment: attach islands behind a gate so plug-in transients do not degrade the backbone.
  • Order of operations: power stable (PG) → open gate → discovery/addressing → verification → inventory update.
  • Evidence: capture “attach/open/close” events with timestamps and segment IDs to explain reachability changes.
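The order of operations above can be expressed as one sequencing function with evidence emitted at each step. This is a sketch of the sequencing and logging only: every callable is a hypothetical stub, not a BMC or driver API.

```python
import time

# Sketch of the attach sequence: PG stable -> open gate -> scoped discovery ->
# verify -> inventory/evidence. All callables are illustrative stubs.

def attach_island(segment, pgood, open_gate, discover, verify, log):
    """Attach one island; abort early if its power is not stable."""
    if not pgood(segment):
        log.append((time.time(), segment, "attach", "abort: PG not stable"))
        return False
    open_gate(segment)
    devices = discover(segment)           # scoped: only this island is scanned
    ok = all(verify(segment, d) for d in devices)
    log.append((time.time(), segment, "attach", "ok" if ok else "verify-fail"))
    return ok

log = []
ok = attach_island("HP", pgood=lambda s: True, open_gate=lambda s: None,
                   discover=lambda s: [0x50], verify=lambda s, d: True, log=log)
assert ok and log[-1][3] == "ok"
```

Because the gate only opens after PGOOD, a half-powered island never becomes bus-visible, which is exactly the containment property the section requires.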
4) Output: Domain–Bus ownership matrix

Use this matrix in design reviews to prove that off-domains cannot back-power or block the always-on view.

Domain | Bus visibility | Pull-up owner | Gate element | When domain OFF | Recovery action
AON | Backbone B0 | AON only | Core gate | Must remain reachable; backbone stays stable | Scoped rescan of attached islands only
MAIN | I2C-1 / I2C-2 islands | MAIN (per island) | Island switch | Island should be invisible (gate closed) | Close gate → log → reopen after PG stable
HOTPLUG | HP island | HOTPLUG (local) | HP gate | Invisible during service / unpowered periods | Attach → verify → inventory; detach → mark absent
5) Output: Allowed cross-domain signals (whitelist)

Treat cross-domain connectivity as “allowed by design,” not accidental. If a signal is not listed here, it should not cross domains.

Signal class | Allowed direction | Required properties | Notes (scope control)
I3C / I2C | Backbone → island (through gate) | Segment gate; pull-up ownership defined; OFF domain becomes invisible | Prefer “attach/detach” semantics over always-connected wiring
ALERT / INT | Island → AON (minimal dependency) | Defined pull-up; known OFF-state behavior; debounced semantics | Use for “wake/attention” when in-band is unavailable
PRESENT / PGOOD | Island → AON | Stable level when OFF; no back-power path | Explains reachability changes without scanning
RESET (domain) | AON → island | Only valid when island domain is powered; no reverse feeding | Scope reset to the island to avoid global churn
Key principles: scoped segments, explicit pull-up ownership, OFF = invisible, hot-plug contained.

Boundary: this section defines domain visibility, segment gating, and pull-up ownership for bus stability. It does not define BMC protocol stacks, PSU/VRM power conversion, or full EMC grounding rules.

Figure F6 — Three-domain model: AON backbone + hot-plug segment + powered-off segment (isolation points)
[Diagram] AON domain: master, owned pull-ups, and the always-visible I3C backbone (B0). HOTPLUG domain: HP segment (FRU, temp) behind an open gate, going through attach + verify. Powered-off domain: OFF segment behind a closed gate, invisible, with back-power blocked at the gate. Key risk: back-power across domains; rule: any off or hot-plug domain sits behind a gate.

H2-8 — Telemetry Aggregation Model

A baseboard telemetry system is not a pile of sensors. It is a structured pipeline that binds every measurement to domain, segment, and an inventory key, with a clear path to alerts and event logs. The same measurement should be explainable over time (trend) and under transitions (attach/off/reset).

1) A simple four-layer model (data model, not software)

Layer 0 → 1

Raw reads become physical units (V/A/W/°C) with a domain + segment context.

Layer 2 → 3

Thresholds and alert policy produce events with timestamps and traceability back to inventory.
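The four layers can be sketched as two small transformations, one per layer boundary. This is a data-model illustration only; the field and key names echo the dictionary later on this page, while the scale factor and limit are made-up example values.

```python
# Sketch of the four-layer model: a raw reading is bound to units and context
# (layers 0-1), then a threshold turns it into a timestamped event (layers 2-3).

def to_measurement(raw_code, scale, unit, domain, segment, key):
    """Layer 0 -> 1: raw register code becomes a unit value with full context."""
    return {"value": raw_code * scale, "unit": unit,
            "domain": domain, "segment": segment, "key": key}

def to_event(meas, limit, ts):
    """Layer 2 -> 3: a threshold crossing becomes a traceable event, else None."""
    if meas["value"] <= limit:
        return None
    return {"ts": ts, "kind": "threshold", "limit": limit, **meas}

m = to_measurement(raw_code=170, scale=0.5, unit="degC",
                   domain="MAIN", segment="I2C-1", key="tmp.zoneA.hotspot_c")
evt = to_event(m, limit=80.0, ts="2024-01-01T00:00:00Z")
assert evt is not None and evt["key"] == "tmp.zoneA.hotspot_c"
```

Note that the domain/segment/key context is attached at layer 1 and simply carried forward, so every event is traceable back to inventory without a join against another system.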

2) Key telemetry classes (organized for root-cause)
  • Electrical: voltage/current/power per domain and segment (pair with domain state to avoid false alarms).
  • Thermal arrays: hotspot/max/min and gradient cues (useful for localization, not only average temperature).
  • Domain state: present/pgood/reset/attach (explains why data is missing or why a device is unreachable).
  • Reachability evidence: segment gate state and retry level (links electrical issues to bus integrity symptoms).
3) Alert path selection: in-band vs out-of-band
Alert path | Best fit | Design note (scope control)
In-band (I3C) | Structured events tied to discovery/inventory; when the bus view is stable and the segment is attached | Bind alerts to domain + segment + key fields
Out-of-band (ALERT/INT) | Minimal-dependency signaling; “attention” when in-band is unavailable (OFF-domain transitions, early fault flags) | Define OFF-state behavior and debounce semantics; treat it as a trigger to fetch structured data later
4) Output: Telemetry field dictionary (copy-ready)

Standardize field names so logs and alerts remain comparable across platforms and generations.

Field name | Unit | Source class | Sampling suggestion | Threshold type | Domain | Segment | Inventory key link | Alert path | Log policy
dom.aon.vbus_v | V | power | normal | absolute + persistence | AON | B0 | power.aon.entry | in-band | periodic + on-alert
dom.main.pwr_w | W | power | normal | absolute + delta | MAIN | I2C-1 | power.main.zoneA | in-band | periodic
dom.hp.present | bool | domain_state | fast | edge-trigger | HOTPLUG | HP | fru.drawer.slotN | INT | on-change + on-alert
seg.hp.gate_state | bool | domain_state | fast | edge-trigger | HOTPLUG | HP | gate.hp.01 | in-band | on-change
tmp.zoneA.hotspot_c | °C | temp | normal | absolute + rate | MAIN | I2C-1 | temp.array.zoneA | in-band | periodic + on-alert
bus.i2c1.retry_level | count | reachability | normal | delta + persistence | MAIN | I2C-1 | bridge.i2c1 | in-band | periodic
evt.alert.ts | ms | event | on-event | — | any | any | inventory.key | in-band / INT | on-event

Boundary: this section defines telemetry organization (fields, units, context binding, alert paths). It does not define storage backends, APIs, or BMC service implementations.

Figure F7 — Telemetry pipeline: Sensors → Segment → Aggregator → Log / Alert (with timestamp + domain context)
[Diagram] Sensors (power V/I/W, temp array, domain present/PG, reachability retry) feed a segment (domain AON/MAIN/HP; segment ID B0/I2C-1/HP), then an aggregator that stamps a timestamp and inventory key, branching into a structured event log and I3C/INT alerts. Context carried throughout: domain + segment + key.

H2-9 — Fault Modes & Bus Recovery

The recovery objective is restoring observability without a full system reboot. Treat the baseboard bus as segmented infrastructure: recover the AON backbone first, then isolate and reattach affected islands. Every recovery step should emit evidence (timestamp, segment, action, result).

1) Common fault signatures (what is observable on-site)
  • SDA stuck-low: the bus cannot return high; scans hang or collapse into repeated timeouts.
  • Timeout bursts: intermittent read/write failures that correlate with attach/off transitions or noisy edges.
  • Address conflict: two devices respond; identity becomes ambiguous; inventory mismatches grow.
  • Segment short / global drag: a single island pulls the whole network down.
  • Hot-plug half-attach: the segment becomes unstable during service events, causing sporadic losses.
2) Recovery ladder (escalate only when evidence demands it)
Level | Action | Trigger | Steps | Exit criteria
L1 | Retry + timeout | Intermittent failures; single-device errors | Scoped retry for the current segment/device | Error rate drops below threshold; no segment-wide impact
L2 | Clock pulses | SDA stuck-low signature on a segment | Issue unstick pulses on the affected segment | SDA returns high; segment becomes reachable again
L3 | Isolate segment | Global drag or suspected short/half-attach | Close the gate for the suspected island; protect the backbone | Backbone stable; other segments recover and scan reliably
L4 | Rediscover / readdress | Inventory mismatch after attach/detach | Reattach → rediscover → (I3C) DAA reassign → verify | Inventory key ↔ address mapping converges; loss count returns to baseline
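The ladder's "escalate only on evidence" discipline can be sketched as a loop that stops at the first level whose exit criteria pass, logging every step. The level names and stub actions are illustrative, not a firmware interface.

```python
# Sketch: walk the recovery ladder for one segment, logging each attempt and
# stopping at the first success. Actions are stubs that return True on recovery.

LADDER = ["L1-retry", "L2-clock-pulses", "L3-isolate-segment", "L4-rediscover"]

def recover(segment, actions, log):
    """Try each level in order; emit evidence per step; stop on first success."""
    for level in LADDER:
        ok = actions[level](segment)
        log.append({"segment": segment, "action": level,
                    "result": "recovered" if ok else "escalate"})
        if ok:
            return level
    return None  # ladder exhausted: backbone-level intervention needed

log = []
# A stuck-low that clears after unstick pulses: L1 fails, L2 succeeds, no L3/L4.
actions = {"L1-retry": lambda s: False, "L2-clock-pulses": lambda s: True,
           "L3-isolate-segment": lambda s: True, "L4-rediscover": lambda s: True}
assert recover("I2C-1", actions, log) == "L2-clock-pulses"
assert [e["action"] for e in log] == ["L1-retry", "L2-clock-pulses"]
```

The log list doubles as the evidence trail: timestamping each entry and adding the symptom/impact fields from the list below would complete the record.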
3) Evidence-first recovery logging (copy-ready fields)

Minimum evidence fields

ts, domain, segment_id, symptom, action, result, impact, retry_level, inventory_delta

4) Output: Bus Recovery Playbook (step triggers)
Observed symptom | Evidence check | Primary action | If not recovered
SDA stuck-low | Which segment_id? Gate state? Recent attach/off events? | L2 clock pulses on that segment | L3 isolate segment; keep backbone stable
Timeout bursts | Retry-level trend; temperature/power transitions; gate flaps | L1 scoped retry with bounded timeout | L3 isolate if bursts expand to other segments
Missing devices | present/pg status; inventory_delta; segment attach state | Rescan the segment (scoped) | L4 reattach → rediscover/DAA → verify
Address conflict | Multiple responders; inventory-key ambiguity | Scope to the conflicting island; isolate if needed | L4 rediscover + readdress; record mapping changes
Global drag | Backbone health vs islands; which gate change preceded the collapse | L3 isolate the most recently attached or suspected island | Iterate isolation by segment until the backbone recovers
Principles: backbone first, segment-scoped actions, escalation by evidence, a log entry for every step.

Boundary: recovery actions here are bus/segment level (retry, unstick pulses, isolate, rediscover/readdress). This section does not define OS drivers, BMC service logic, or database persistence.

Figure F8 — Fault tree: Symptom → Likely cause → Evidence → Action (segment-scoped)
[Diagram] Fault tree, segment-scoped: SDA low (stuck endpoint, half-attach; evidence: segment_id, gate_state; action: clock pulses, isolate segment). Timeout (edge noise, domain flap; evidence: retry_level, ts; action: retry, escalate). Missing device (off/detach, mapping drift; evidence: present/pg, inventory_delta; action: rescan, rediscover). Address conflict (clashed island; evidence: segment_id, key ambiguity; action: isolate, readdress). Priority: keep the AON backbone stable; isolate islands.

H2-10 — Validation & Bring-up Checklist

Validation should eliminate “lab passes, rack fails” by proving stability at the worst corners: maximum nodes and length, extreme temperature, and domain/power disturbances. The bring-up sequence must establish an AON baseline first, then expand segments, then qualify hot-plug behavior.

1) Bring-up order (observability-driven)
  • Stage 1 — AON baseline: backbone stable; minimum telemetry chain readable; evidence logs working.
  • Stage 2 — Island expansion: add one segment at a time; prove faults remain scoped to that segment.
  • Stage 3 — Hot-plug readiness: attach/detach cycles do not perturb the backbone; recovery is bounded in time and loss rate.
2) Stress dimensions (drive toward worst corners)
Dimension | Progression | What to record (evidence)
Node count | Typical → upper bound | Scan success rate, inventory_delta, retry_level trend
Length / load | Typical → worst cable/trace length and capacitance | Timeout bursts, stuck-low incidence, segment isolation events
Temperature | Ambient → hot / cold corners | Error rate vs temperature, hotspot indicators, recovery time
Domain / power disturb | Stable → off/on transitions + hot-plug events | Time-to-visibility, loss rate, mapping drift count
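The stress dimensions above form a small combinatorial space, and enumerating it mechanically avoids missed corners. A sketch using illustrative dimension labels (the real values come from the deployed topology, not this list):

```python
# Sketch: enumerate stress-corner combinations from the dimensions above.
# Labels are illustrative placeholders, not measured limits.
from itertools import product

DIMS = {
    "nodes":   ["typical", "max"],
    "length":  ["typical", "worst"],
    "temp":    ["ambient", "hot", "cold"],
    "disturb": ["stable", "off_on", "hot_plug"],
}

def corner_cases():
    """Yield every dimension combination as a dict; callers filter as needed."""
    keys = list(DIMS)
    for values in product(*DIMS.values()):
        yield dict(zip(keys, values))

# "Must-test" corners per the matrix: max nodes AND worst length, with
# temperature and power disturbance applied as modifiers on top.
must_test = [c for c in corner_cases()
             if c["nodes"] == "max" and c["length"] == "worst"]
```

This keeps the validation plan auditable: the full space and the must-test subset are derived from one table instead of being hand-picked.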
3) Output: Validation case table (copy-ready)
Test ID | Setup | Steps | Pass criteria | Required logs
T01 | AON only, backbone | Boot → scan backbone → read minimum telemetry chain | No timeouts; stable scan rate | ts, segment_id=B0, retry_level, result
T02 | Max nodes, typical length | Scan loops for N cycles; record errors per cycle | Success rate ≥ target; bounded retries | ts, segment_id, retry_level, inventory_delta
T03 | Worst length/load | Repeat scans + induced attach events | No global drag; isolation works | gate_state, impact, action, result
T04 | Hot corner | Hold at high temp; scan; watch drift and timeouts | Error rate remains below threshold | temp hotspot, retry_level, timeout count
T05 | Cold corner | Repeat T04 at low temp | Recovery times remain bounded | time-to-visibility, loss rate
T06 | Domain off/on | Power off island → verify invisible → restore → rescan | Backbone stable; island recovery bounded | present/pg, gate_state, inventory_delta
T07 | Hot-plug cycles | Attach/detach for M cycles; verify mapping stability | Low loss; minimal mapping drift | ts, segment_id=HP, action, result, mapping drift
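The "Required logs" column is easiest to enforce when every test step emits one structured record. A minimal sketch, with field names mirroring the table (ts, segment_id, retry_level, inventory_delta); the class itself is hypothetical, not part of any tooling named here:

```python
# Sketch: a structured evidence record matching the "Required logs" fields.
# All names are illustrative; adapt to the actual log sink in use.
from dataclasses import asdict, dataclass, field
import time

@dataclass
class EvidenceRecord:
    test_id: str                 # e.g. "T02"
    segment_id: str              # e.g. "B0", "HP"
    action: str                  # step taken (scan, isolate, rescan, ...)
    result: str                  # pass / fail / retried
    retry_level: int = 0
    inventory_delta: int = 0     # devices appearing/disappearing vs baseline
    ts: float = field(default_factory=time.time)

rec = EvidenceRecord("T01", "B0", "scan_backbone", "pass")
row = asdict(rec)                # dict form, ready for a CSV writer or log sink
```

Making the record a dataclass means a missing evidence field fails at construction time, during bring-up, instead of being discovered as a hole in the logs during a field incident.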

Boundary: this checklist specifies bring-up order, stress corners, and evidence fields. It does not specify automation frameworks, OS tooling, or rack deployment procedures.

Figure F9 — Test matrix: nodes × length (with temperature + power disturbance as corner modifiers)
[Figure: a 2D grid of node count (X, low → high) vs length/load (Y, short → long), with must-test corners highlighted at the high-node/long-length extremes and a baseline cell at typical values; temperature (hot/cold) and power disturbance (off/on, hot-plug) are applied as modifiers on the must-test corners.]

H2-11 — Parts / IC Selection Pointers (Bus & Telemetry Only)

This section lists common IC categories used on an OCP/OpenRack baseboard to make the sideband (I³C/I²C) fabric scalable, segmentable, and observable. Focus stays on the bus layer and telemetry plumbing—no BMC firmware stack, no power-conversion deep dive.

11.1 What to place where (one-glance placement rules)

Backbone fan-out: I³C hub · Segment entry: I²C switch / hot-swap buffer · Domain boundary: isolator / translator · Inventory: FRU/ID EEPROM · Observability: power + temp sensors · Presence/LED: GPIO expander
Backbone (always-on) | I³C host + hub(s). Keep pull-ups and ALERT/IBI handling stable in AON.
Each removable segment | Switch/buffer at the segment "mouth" so one fault does not drag the whole bus.
Cross-voltage / cross-ground | Translator or isolator where domains meet, not in the middle of a noisy segment.
Telemetry cluster | Group power/temp sensors per domain and record the domain + timestamp alongside values.

Note: MPNs below are examples to speed up RFQs and param checks. Always validate I/O behavior during power-off, hot-join/hot-plug expectations, and bus-capacitance budgets against the current datasheet.

11.2 Example IC categories and MPNs (copy/paste shortlist)

The goal is not “one perfect part,” but the right function blocks: hub/bridge → segmentation → hot-swap friendliness → domain translation/isolation → telemetry & inventory.

Category | Example MPNs | Typical use on baseboard | Selection pointers (what to ask)
I³C hub / fan-out | NXP P3H2840HN, Renesas RG3M88B12 | Scale one I³C backbone into multiple downstream segments (AON plane), while keeping segments individually controllable for recovery and maintenance. | Downstream port count & control model; I³C vs mixed I²C support; hot-join behavior; reset semantics; fail-isolation (one bad segment's impact); max bus speed and bus-cap budget.
I³C/I²C translators (for I³C rates) | TI TCA39416, NXP P3A9606 | Cross-voltage translation for sideband where I³C speed/edges matter (e.g., 1.2 V ↔ 1.8 V/3.3 V), keeping bidirectional open-drain semantics. | Confirm I³C compatibility; auto-direction/bidirectional behavior; rise-time impact; power-off leakage; EN/disable behavior (does it isolate or "half-connect"); ESD robustness for field swaps.
I²C mux / switch | TI TCA9548A, NXP PCA9548A | Segment legacy I²C islands to solve address conflicts, reduce effective bus capacitance, and localize stuck-low faults to one branch. | Channel count; reset/power-up default; leakage in off channels; level-translation needs; switch Ron vs edge integrity; software model (single vs multiple channels enabled).
Hot-swap I²C buffers | TI TCA4311A, TI TCA4307 | Protect the backbone from "half-inserted" cards/segments, precharge lines, and support stuck-bus recovery without rebooting the whole management plane. | Live-insertion behavior; precharge voltage; automatic stuck-bus recovery; handling of clock stretching & arbitration; connection criteria (STOP/idle detection); capacitance isolation strength.
I²C bus repeater / segment buffer | NXP PCA9515A | Split a heavy I²C bus into two segments with buffered SDA/SCL to extend practical load/cap limits and isolate noisy branches. | Multi-master friendliness; contention behavior; level-translation needs; propagation delay; power-off behavior; recommended pull-up placement per segment.
Long/noisy run extender (dI²C) | NXP PCA9615, NXP P82B96 | When baseboard-to-remote-panel runs are long or noisy: convert to differential signaling (PCA9615) or use buffered extension (P82B96) to improve robustness. | Distance/noise target; required cabling; speed limits; common-mode tolerance; EMC implications; how failures localize (does one short kill both ends?).
I²C isolation (ground/domain isolation) | ADI ADuM1250, TI ISO1540 | Isolate sideband across different ground references or sensitive domains while retaining bidirectional I²C signaling semantics. | Isolation rating; bidirectional support; speed ceiling; power-off behavior; fail-safe IO; CMTI/noise immunity; whether it is truly non-latching during hot events.
Digital power monitors | TI INA229, TI INA238, ADI LTC2947 | Per-domain V/I/P telemetry (and sometimes energy) tied to the same inventory context: domain ID, segment, and timestamp. | Common-mode range; shunt vs integrated-sense options; conversion time/averaging; alert pins and thresholds; logging needs (min/max, energy); calibration strategy and drift expectations.
Temperature sensors (I³C/I²C) | NXP P3T1085UK | Clean temperature telemetry with I³C features (e.g., in-band interrupt capability), suitable for dense sensor deployments with discovery semantics. | Accuracy & response; I³C features used (IBI vs polling); alert mechanism; placement strategy (hotspots vs gradients); sampling cadence vs noise.
GPIO expanders (presence/LED/sideband pins) | NXP PCA9555, TI TCA9535, Microchip MCP23017 | Add low-speed IO for presence, latch signals, LEDs, and simple discrete telemetry where routing dedicated SoC pins is costly. | Interrupt output type; power-up default states; I/O drive & pull features; input glitch sensitivity; addressing options (how many can coexist); power-off leakage/back-power risk.
FRU / identity EEPROM | Microchip 24AA02E64 | Store a globally unique ID (and optional inventory fields) used by discovery pipelines to bind a dynamic bus address to a stable asset identity. | Pre-programmed ID needs; write endurance; power-loss behavior during writes; address-conflict planning; data model (which fields are mandatory for operations).

Practical rule: pick parts that make segmentation and recovery cheap. If a segment can be isolated, re-scanned, and re-inventoried in seconds, field uptime and debug time improve dramatically.

11.3 RFQ-ready checklist — 10 questions that prevent wrong parts

These questions align with real baseboard failure modes: stuck-low, address conflict, hot-plug glitches, cross-domain back-power, and “discovery without operability.”

  1. Bus role: hub, mux, buffer, translator, isolator, or sensor—what exact layer is being solved?
  2. Speed target: I²C (100/400/1MHz) vs I³C (up to 12.5MHz). Is the part truly compatible at the required mode?
  3. Power-off behavior: will any pin back-power another domain through protection structures or pull-ups?
  4. Isolation/segmentation: when disabled/reset, does it become a clean high-Z barrier or a partial path?
  5. Capacitance budget: what bus cap is assumed per segment, and how does the part help enforce it?
  6. Hot-plug events: precharge, connect criteria (idle/STOP), and behavior during “half insertion.”
  7. Fault containment: can one short/stuck device be localized to a single branch with minimal blast radius?
  8. Recovery hooks: reset pin semantics, stuck-bus recovery, and whether software can force re-discovery/re-addressing.
  9. Alert strategy: in-band (I³C/IBI) vs out-of-band (ALERT/INT). How are alerts latched and cleared?
  10. Ops mapping: how does the design bind dynamic addresses to stable identity (FRU/ID EEPROM) with timestamps?

Figure F10 — Reference placement example (hub → segments → isolation → telemetry)

A single-board view showing where hubs, switches/buffers, translators/isolators, and telemetry ICs are typically placed to keep the management plane resilient.

Figure F10 — Bus & telemetry IC placement on a baseboard
[Figure: an always-on I³C backbone (host controller + hub fan-out + stable pull-ups/alerts) feeds three segments. Segment A (hot-plug/serviceable) sits behind a switch / hot-swap buffer with a legacy island, FRU EEPROM, GPIO expander, and temp sensors/alerts. Segment B (cross-domain) sits behind a translator/isolator with domain telemetry: a power monitor (V/I/P) and temp sensors with thresholds. Segment C (long/noisy run) uses a dI²C bus extender to a remote island with environmental sensors and presence/status pins. Callouts: place control points at segment mouths to isolate and recover fast; bind telemetry to domain + timestamp + stable identity; use isolation/translation at boundaries, not mid-segment.]

Reading the figure: each segment has a “control chokepoint” (switch/buffer/isolator) so recovery can be performed per-branch without taking down the entire management plane.


H2-12 — FAQs (I3C/I2C Governance, Telemetry, Recovery)

These FAQs target field issues on an OCP/OpenRack baseboard: bus stability, discovery/addressing, multi-domain behavior, telemetry organization, recovery without full reboot, and validation. Content stays at the bus/telemetry layer.

Figure F11 — Evidence-first troubleshooting ladder (FAQ map)

A compact visual for how to move from symptoms to evidence, then to a scoped action: electrical integrity → segmentation/isolation → re-discovery/inventory consistency → validation coverage.

Figure F11 — Symptom → Evidence → Containment → Restore → Verify
1) Symptom (field): device drop · read errors · SDA stuck-low · hot-join jitter · inventory drift
2) Evidence (3 buckets): waveform/edges · segment-level counters & isolation state · the validation corner that reproduces it
3) Contain the blast radius: segment switch/buffer · domain-boundary isolation · keep the AON backbone stable
4) Restore & verify: retry → release → isolate → re-discover → re-inventory

The FAQ answers below follow this ladder so each issue produces actionable evidence and a scoped recovery plan.

FAQs (12)

Each answer stays within: I3C/I2C governance, discovery/addressing, multi-domain telemetry, bus recovery, and validation/bring-up.

1) Why can a scan find devices, yet they randomly disappear after running for a while? What 3 evidence buckets come first?
Start with three evidence buckets. (1) Electrical: compare SCL/SDA edges, spikes, and low-level margin under worst load/temperature; look for rising-edge slowdowns and retry bursts. (2) Segment-level: identify which branch first shows errors, and whether isolating that segment stabilizes the backbone. (3) Validation corner: reproduce with max nodes/capacitance and power-domain disturbances, then confirm recovery time and reappearance rate.
Jump to: H2-6 · H2-9 · H2-10
2) If the I2C bus is short, why can SDA still get stuck-low? What is the most common power-off back-powering path?
Short wiring does not prevent stuck-low when a powered-off device unintentionally back-powers or clamps the line. A common path is pull-ups referenced to an always-on domain feeding into a powered-off device through I/O protection structures, which can partially power the device and latch the bus. The fix is architectural: ensure power-off pins are high-Z, move pull-ups to the correct domain, and isolate removable/off domains so one segment cannot hold the backbone low.
Jump to: H2-7 · H2-9
3) Does an address conflict always require changing parts? How do muxing and segmentation resolve conflicts in an engineering-friendly way?
A conflict rarely requires replacing devices. Engineering-friendly fixes are (1) segmentation: split the bus into islands so identical-address parts never share a segment, and (2) muxing: explicitly select one branch at a time when scanning or operating. The key is operability: keep an address plan table (segment/domain/device/addr/alert) and a discovery log (timestamp, branch, address changes, failures). This makes conflicts predictable, testable, and maintainable across hot-plug and field service.
Jump to: H2-3 · H2-5 · H2-11
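The address plan table mentioned above can be checked mechanically: a 7-bit address may repeat across segments, but never within one. A minimal sketch with hypothetical plan entries (segment, domain, device, address):

```python
# Sketch: detect 7-bit address clashes within one segment of the address plan.
# Plan entries are (segment, domain, device, addr); all names hypothetical.
from collections import defaultdict

PLAN = [
    ("SEG_A", "MAIN", "temp0",   0x48),
    ("SEG_A", "MAIN", "temp1",   0x49),
    ("SEG_B", "MAIN", "temp2",   0x48),  # same addr as temp0, different segment: OK
    ("SEG_B", "MAIN", "eeprom0", 0x48),  # clashes with temp2 on SEG_B
]

def conflicts(plan):
    """Return {(segment, addr): [devices]} for every address used twice
    on the same segment. Cross-segment reuse is allowed by design."""
    seen = defaultdict(list)
    for seg, _dom, dev, addr in plan:
        seen[(seg, addr)].append(dev)
    return {key: devs for key, devs in seen.items() if len(devs) > 1}
```

Running this check whenever the plan table changes turns address conflicts into a review-time error rather than a field symptom.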
4) After I3C DAA, why can inventory show “device drift” or duplicates?
DAA assigns dynamic addresses, but inventory needs stable identity. Drift/duplicates appear when the system records “address = identity” without binding to a stable identifier (FRU/ID EEPROM) plus context (segment and power domain). Hot-join, resets, or recovery can reshuffle dynamic addresses, and the inventory layer must treat these as events: record timestamp, old/new address, segment, and verification result. When identity is stable, address movement becomes traceable rather than confusing.
Jump to: H2-5 · H2-8
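The identity-binding idea above can be sketched in a few lines: the stable UID (e.g., from a FRU/ID EEPROM) is the key, and every dynamic-address move becomes an auditable event. All class and field names here are illustrative:

```python
# Sketch: bind dynamic (DAA) addresses to stable identity; log moves as events.
# The EEPROM UID stands in for FRU/ID identity; names are illustrative.
import time

class Inventory:
    def __init__(self):
        self.by_uid = {}   # uid -> (segment, dynamic_addr)
        self.events = []   # audit trail: (ts, uid, old_binding, new_binding)

    def observe(self, uid, segment, dyn_addr):
        """Record a discovery result; emit an event whenever the binding moves."""
        old = self.by_uid.get(uid)
        new = (segment, dyn_addr)
        if old != new:
            self.events.append((time.time(), uid, old, new))
        self.by_uid[uid] = new

inv = Inventory()
inv.observe("EE-0001", "SEG_A", 0x30)   # first sighting: old binding is None
inv.observe("EE-0001", "SEG_A", 0x32)   # DAA reshuffle: traceable, not confusing
```

Because the key is the UID rather than the address, the second observation shows up as one device that moved, not as a disappearance plus a duplicate.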
5) Hot-Join triggers jitter and global read errors—how can the impact be contained to the plug-in segment?
Containment requires a “control point” at the segment mouth: a switch/buffer that can keep the backbone stable while the plug-in segment settles. Treat the design as three domains: always-on backbone, plug-in segment, and any off domain. Hot-join should first be confined to the plug-in segment, then promoted to the backbone only after verification (stable edges, no stuck-low, consistent discovery). If errors occur, isolate the segment immediately and restore backbone observability before re-trying join.
Jump to: H2-3 · H2-7 · H2-9
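The confine → verify → promote flow above is small enough to sketch directly. The verify/isolate/promote hooks are placeholders for whatever the platform layer actually drives (a switch channel enable, a buffer EN pin, a hub port control):

```python
# Sketch: hot-join handling confined to the plug-in segment (hooks are
# hypothetical callables supplied by the platform layer).

def handle_hot_join(segment, verify, isolate, promote):
    """Settle a hot-join on its own segment; only promote after verification."""
    # 1) The join is handled on the plug-in segment only; backbone untouched.
    if verify(segment):      # stable edges, no stuck-low, consistent discovery
        promote(segment)     # 2) attach the verified segment to the backbone
        return "promoted"
    isolate(segment)         # 3) contain: restore backbone observability first
    return "isolated"

calls = []
ok = handle_hot_join("SEG_A", lambda s: True,
                     lambda s: calls.append(("isolate", s)),
                     lambda s: calls.append(("promote", s)))
```

The key design point is the asymmetry: promotion requires positive evidence, while isolation is the default on any doubt, so the backbone never pays for a misbehaving join.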
6) How should pull-ups be chosen so it works in the lab and in the field? Which waveform metrics must be measured?
Pull-ups must be selected against the worst-case effective bus capacitance of the deployed topology, not a lab bench setup. Measure: (1) rise-time and any edge “shoulder” indicating too much capacitance, (2) low-level margin under sink current (devices must pull low cleanly), (3) spikes/glitches near transitions, and (4) retry and timeout patterns that correlate with temperature or power disturbances. Validate at max nodes, longest wiring, and worst thermal corners to avoid field-only failures.
Jump to: H2-6 · H2-10
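The pull-up window can be bounded numerically. Per the NXP I²C-bus specification (UM10204), the 30%→70% rise through an RC gives t_r ≈ 0.8473·Rp·Cb, and the minimum pull-up comes from the output-low spec (VOL ≤ 0.4 V at 3 mA sink). A sketch using the 400 kHz fast-mode limit (t_r ≤ 300 ns); the 200 pF bus capacitance is a placeholder you must replace with the measured worst case of the deployed topology:

```python
# Sketch: pull-up resistor bounds from I2C-bus spec relations (UM10204):
#   t_r ~= 0.8473 * Rp * Cb        (30% -> 70% rise through an RC)
#   Rp_min = (VDD - VOL_max) / IOL (device must still pull the line low cleanly)
# Defaults are fast-mode spec limits; Cb below is a placeholder, not a measurement.

def pullup_bounds(vdd, cb_farads, tr_max=300e-9, vol=0.4, iol=3e-3):
    """Return (rp_min, rp_max) in ohms for the given supply and bus capacitance."""
    rp_max = tr_max / (0.8473 * cb_farads)  # weaker pull-up than this violates t_r
    rp_min = (vdd - vol) / iol              # stronger pull-up than this breaks VOL
    return rp_min, rp_max

lo, hi = pullup_bounds(vdd=3.3, cb_farads=200e-12)
# At 3.3 V and 200 pF the window is roughly 0.97 kohm .. 1.77 kohm: a heavily
# loaded fast-mode segment leaves little margin, which is why worst-case Cb
# (max nodes, longest wiring) must drive the choice rather than a bench setup.
```

If rp_min exceeds rp_max for the measured Cb, no resistor satisfies both limits, which is precisely the evidence that the segment needs a buffer/switch or splitting rather than a pull-up tweak.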
7) When is a buffer/switch necessary, and when is it just a pull-up and routing problem?
Use evidence first. If waveforms show slow edges, glitches, or low-level margin loss that scales with node count or cable length, start with capacitance budget, pull-ups, and routing. A buffer/switch becomes necessary when the system requirements exceed what passive fixes can guarantee: too many devices on one segment, serviceable/hot-plug segments, cross-domain power behavior, or the need to contain faults to one branch. A switch/buffer is also justified when recovery must isolate segments without rebooting the backbone.
Jump to: H2-3 · H2-6 · H2-11
8) In multi-domain designs, which signals must live in the always-on (AON) domain, and which must power down with a domain?
AON should keep the minimum observability chain alive: the backbone controller/hub, the pull-up reference for the backbone, and the sensors/alerts required to explain power events. Signals that cross into a domain that can power off (or be removed) must be able to become harmless when that domain is off—ideally high-Z and segment-isolated—so they cannot back-power or hold the bus low. If a signal cannot be made fail-safe, it should shut down with its domain and be separated by an isolator or switch.
Jump to: H2-2 · H2-7
9) Why does polling miss transient over-temperature or brownout events? How should in-band vs out-of-band alerts be split?
Polling samples at discrete intervals, so short events can occur and clear between reads. For operational telemetry, polling is fine for trends and steady-state limits, but critical transients need alerts. In-band alerts (e.g., I3C in-band mechanisms) are useful when the same bus context should carry identity and segment association. Out-of-band alerts (ALERT/INT) are better for fast, guaranteed wake/flag behavior even when the bus is congested. The best split ties every alert to domain + timestamp + stable identity for post-event reconstruction.
Jump to: H2-8
10) How to design a “tiered recovery” (retry → release → isolate → re-discover) to avoid full system reboot?
Use a tiered recovery ladder with clear triggers. Tier 1: bounded retry with timeouts to avoid infinite hangs. Tier 2: bus release actions (e.g., clock pulses) if stuck-low is suspected. Tier 3: isolate the suspected segment using a switch/buffer so the backbone becomes observable again. Tier 4: re-discover and re-address (I3C DAA if used), then re-bind to stable identity and update inventory. Each tier must emit a structured event record: timestamp, segment, action, and outcome.
Jump to: H2-9
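The four-tier ladder above can be sketched as a driver loop: each tier is a hook, each attempt emits the structured event record the answer calls for, and the ladder stops at the first tier that succeeds. Tier implementations here are stand-in lambdas, not real bus operations:

```python
# Sketch: tiered recovery ladder (retry -> release -> isolate -> rediscover).
# Tier callables are hypothetical platform hooks; each emits a structured event.
import time

def recover(segment, tiers, log):
    """Try each (name, action) tier in order; return the first tier that
    succeeds, logging every attempt with timestamp, segment, and outcome."""
    for name, action in tiers:
        ok = action(segment)
        log.append({"ts": time.time(), "segment": segment,
                    "action": name, "outcome": "ok" if ok else "fail"})
        if ok:
            return name
    return "unrecovered"

log = []
tiers = [
    ("retry",      lambda s: False),  # Tier 1: bounded retry timed out
    ("release",    lambda s: False),  # Tier 2: clock pulses did not free SDA
    ("isolate",    lambda s: True),   # Tier 3: segment cut off; backbone back
    ("rediscover", lambda s: True),   # Tier 4: re-run DAA + re-bind identity
]
result = recover("SEG_B", tiers, log)
```

Because every attempt is logged whether or not it succeeds, the event trail shows exactly how far the ladder climbed, which is the evidence later triage depends on.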
11) During bring-up, how to quickly decide whether the issue is electrical integrity or discovery/addressing logic?
Electrical issues typically correlate with scaling and corners: failures worsen with more nodes, longer wiring, higher temperature, or power disturbances; waveforms show edge degradation, spikes, or low-level margin loss, and errors look random. Discovery/addressing issues are more deterministic: consistent conflicts, repeated duplicates, missing devices tied to one segment, or inventory drift after resets/recovery. The fastest method is a controlled matrix: hold topology constant while varying address/discovery steps, then hold discovery constant while stressing capacitance/pull-ups and domains. Log segment context for every failure.
Jump to: H2-5 · H2-6 · H2-10
12) When selecting an I3C hub/bridge, what 3 pitfalls are most often missed (power-off behavior / isolation / compatibility)?
First, power-off behavior: confirm off-state pins are high-Z and do not back-power other domains via pull-ups or protection paths. Second, isolation: check whether a faulty downstream segment can be cleanly cut off so the backbone remains usable; avoid parts that “half-connect” during reset or fault. Third, compatibility: verify mixed I3C + legacy I2C operation boundaries, hot-join expectations, and recovery semantics (reset/re-discovery hooks). A hub/bridge should reduce blast radius and improve observability, not just add ports.
Jump to: H2-11 · H2-7