
OCP OpenRack Baseboard for I3C/I2C Bus Management


An OpenRack baseboard turns scattered I3C/I2C devices into an operable, segmentable, and recoverable management fabric by enforcing discovery/addressing, multi-domain telemetry context (domain + timestamp + identity), and tiered bus recovery, so observability survives hot-plug and power-domain transitions.

H2-1 — What “OpenRack Baseboard” Covers

An OpenRack baseboard is the physical layer for sideband connectivity and sensor/FRU visibility. Its engineering core is I3C/I2C bus governance (topology, segmentation, isolation), device discovery & addressing (DAA/address plan), and multi-domain telemetry aggregation (power/thermal + domain state) into the management plane.

Deliverables — What this page is expected to make operational
  • A bus architecture that stays stable at scale: segmented backbone + isolated islands so one fault does not take down the entire rack-side visibility.
  • A repeatable discovery/addressing scheme: deterministic inventory mapping across hot-join events and multi-domain power states.
  • A telemetry model that is actionable: raw readings → thresholds/alerts → event evidence (timestamps + affected segment/domain).
Boundary — What is in scope vs out of scope

Topic | In scope (this page) | Out of scope (linked elsewhere)
Management | Bus master placement, sideband signal paths, alert wiring, inventory evidence fields | Redfish/IPMI software stack architecture, firmware workflows, UI/telemetry dashboards
Power | Telemetry ingestion (voltage/current/power), domain states (AON/MAIN/HOTPLUG), back-power prevention at the bus layer | PFC/LLC/CRPS conversion topology, VRM control-loop stability/compensation details
Compute/IO | Sideband presence/FRU, sensor islands near connectors, segment isolation strategy | PCIe/NVMe/IB/Ethernet protocol stacks, retimer equalization, dataplane acceleration

Practical rule: if a paragraph starts describing how power is converted or how a management protocol is implemented, it belongs to a sibling page. This page only defines the bus/device layer that makes those systems observable and recoverable.

Figure F1 — Baseboard position: I3C backbone + I2C legacy islands + telemetry nodes
[Diagram] The I3C backbone (bus master, discovery/address plan, telemetry aggregation) fans out through switch/mux/buffer elements to isolated I2C islands: Island A (legacy: FRU/EEPROM, temp array, GPIO/presence) and Island B (power: V/I/P telemetry, domain state AON/MAIN/HOTPLUG), plus an isolated hot-plug join/leave segment with in-band/INT alerts. Design intent: one backbone for discovery + telemetry, isolated islands for fault containment and mixed legacy support.

H2-2 — System Context & Interfaces

A baseboard must be describable as a bus graph plus power/reset domains. The objective is to make every sideband endpoint (FRU, temperature, power monitors, presence) discoverable, addressable, and recoverable under real rack conditions: standby states, partial power loss, and hot-join events.

Interface inventory — Minimum fields to document (bus + domain + recovery)

Device class | Bus segment | Addressing | Power domain | Reset / enable | Alert path | Failure impact
FRU / EEPROM | I2C Island-A | Static / behind mux | AON | Always enabled | Polling | Low (segmentable)
Temp array | I2C Island-A | Static | AON or MAIN | GPIO enable | INT (optional) | Medium (can flood bus)
Power monitor | I2C Island-B | Static | MAIN | Domain PGOOD gated | ALERT#/INT | High (stuck-low risk)
Presence / GPIO expander | I2C Island-A/C | Static | HOTPLUG | Hot-swap domain | INT | High (hot-join noise)
I3C hub/bridge | I3C Backbone | DAA / dynamic | AON | Always enabled | In-band | Critical (backbone)
Power & reset domains — Rules that prevent “invisible” or “hung” systems
  • Define an AON minimum observable chain: a small set of endpoints that must remain readable in standby or partial failure (e.g., FRU + environment temp + backbone health).
  • Assign pull-ups to the correct domain: pull-ups that remain powered while a segment endpoint is off must not create back-power paths through I/O structures.
  • Segment by domain boundaries: any domain that can be powered off or hot-plugged should sit behind a switch/mux/buffer so a fault cannot clamp the entire bus.
  • Reset/enable must be coherent with discovery: if an endpoint can reboot independently, discovery should be re-entrant (re-scan or re-DAA) without destabilizing the backbone.

Minimum-observable does not mean “monitor everything”. It means “keep enough visibility to diagnose and recover”: domain presence, backbone health, and the most critical thermal/power indicators—without relying on full system power.
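The minimum-observable idea above can be sketched as a check over a small endpoint-to-domain map. This is a minimal illustration, not a real inventory schema: the endpoint names, the domain map, and both helper functions are hypothetical.

```python
# Sketch: verify the AON "minimum observable chain" stays readable when only
# AON is powered. Names and the reachability model are illustrative.

ENDPOINT_DOMAIN = {
    "fru.baseboard": "AON",
    "temp.ambient": "AON",
    "bridge.backbone.01": "AON",
    "power.main.zoneA": "MAIN",        # full telemetry, not part of the minimum
}

AON_MIN_CHAIN = ["fru.baseboard", "temp.ambient", "bridge.backbone.01"]

def reachable(endpoint, powered_domains):
    """An endpoint answers a basic read only if its domain is powered."""
    return ENDPOINT_DOMAIN[endpoint] in powered_domains

def min_chain_visible(powered_domains):
    """Diagnosable-and-recoverable floor: every minimum-chain endpoint reads back."""
    return all(reachable(e, powered_domains) for e in AON_MIN_CHAIN)

# Standby (AON only): the diagnostic floor must survive even though MAIN is off.
assert min_chain_visible({"AON"})
assert not reachable("power.main.zoneA", {"AON"})
```

The design review question this encodes: does every endpoint you rely on for diagnosis actually live on AON?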

Figure F2 — Domains & interfaces: AON backbone with segmented MAIN/HOTPLUG islands
[Diagram] The AON domain hosts the I3C master/aggregator and the backbone (discovery + inventory + alerts), which stays alive for discovery and minimum telemetry. MAIN islands (power telemetry V/I/P + domain state; thermal/FRU) and the HOTPLUG island (presence + alerts + join/leave) attach through segment gates, so faults are contained and an off or hot-plug domain cannot back-power the backbone. Design intent: treat the baseboard as a bus graph with explicit domain boundaries, so discovery and telemetry remain available under partial power states.

H2-3 — Bus Topology Patterns

The baseboard should be treated as a bus graph with explicit segments. The objective is not simply connectivity, but fault containment: long traces, high capacitive loads, and hot-join domains must not clamp the backbone or destabilize discovery.

1) Single-master vs multi-master (engineering boundary)

Single-master (default)

One backbone master simplifies arbitration and makes recovery deterministic. Segments isolate failures so a single stuck line becomes a local event rather than a rack-wide visibility loss.

Multi-master (only with strong justification)

Adds arbitration, timing edges, and new failure modes. If introduced, segment boundaries and isolation must be stricter so a misbehaving master cannot flood retries or hold the bus in a degraded state.

2) Segmentation rules (when and why to split)
  • Physical distance: long runs amplify edge distortion and noise pickup; buffers/switches help restore electrical margin and limit the affected length.
  • Capacitive loading: many endpoints slow edges and increase glitch sensitivity; islands cap per-segment load so timing stays stable under worst-case conditions.
  • Power domains: any domain that can be off or partially powered must be behind a gate to prevent back-power and “stuck-low” propagation.
  • Noise sources: hot-plug connector zones and high-current regions deserve separate islands to keep backbone discovery stable.
3) MUX vs SWITCH vs BUFFER (choose by failure domain)
Component | Best for | Not a fix for | Typical placement
MUX | Address conflicts, branch selection, reducing visible endpoints per scan | Domain back-power, hot-join disturbance, hard fault isolation | Between backbone and legacy branches; upstream of small islands
SWITCH | Segment-level isolation, fault containment, controlled attach/detach of a domain | Edge restoration on long lines if used alone; protocol-level inventory logic | At domain boundaries (MAIN/HOTPLUG), at connector entry points
BUFFER | Electrical margin: rise-time help, fanout, restoring edges after long runs | Back-power prevention, isolating a stuck-low endpoint across domains | On long trunks, before high-fanout branches, near noisy zones
4) Hot-plug maintainability (bus-layer rules)
  • Hot-plug domains must be gated: attach/detach should affect only the island, not the backbone.
  • Pull-up ownership must be explicit: avoid powering an unpowered island through bus pull-ups or I/O structures.
  • Discovery must be scoped: hot-join triggers a limited rescan/attach sequence for that island, not a disruptive global churn.

Rule of thumb: if a failure can pull SDA/SCL low, it must be possible to isolate that failure within one segment using a gate (switch/isolator), regardless of whether addressing conflicts also exist.
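The rule of thumb can be checked mechanically if the topology is written down as a graph. A minimal sketch, with hypothetical segment names and fields: each segment records its parent and whether a gate sits on its upstream link.

```python
# Sketch: model the baseboard as a bus graph and apply the rule of thumb above:
# any segment whose failure can pull SDA/SCL low must have a gate between it
# and the backbone. Segment names and attributes are illustrative.

SEGMENTS = {
    "B0":      {"parent": None, "gate": None,     "can_fault_low": False},  # backbone
    "I2C-1":   {"parent": "B0", "gate": "switch", "can_fault_low": True},
    "I2C-HP":  {"parent": "B0", "gate": "switch", "can_fault_low": True},
    "I2C-BAD": {"parent": "B0", "gate": None,     "can_fault_low": True},   # violation
}

def isolation_violations(segments):
    """Segments that can clamp the bus but have no gate toward the backbone."""
    return [name for name, s in segments.items()
            if s["can_fault_low"] and s["parent"] is not None and s["gate"] is None]

# A stuck-low on I2C-BAD would reach the backbone, so it is flagged.
assert isolation_violations(SEGMENTS) == ["I2C-BAD"]
```

Running this style of check in a design review turns the rule of thumb into a pass/fail gate on the netlist-level topology description.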

Figure F2 — Three topology patterns used on OpenRack baseboards
[Diagram] Three patterns: (A) I3C backbone with a bridge into I2C islands (FRU, temp, power); (B) an I2C trunk segmented through a mux tree into islands; (C) multi-domain isolation, with a switch and isolator separating the AON backbone from MAIN sensors and a hot-plug join/leave segment. Goal: isolate fault domains.

H2-4 — I2C vs I3C in Baseboards

The practical question is not “new vs old”, but what becomes easier to operate. I3C strengthens discovery semantics and hot-join handling, while I2C islands remain valuable for legacy endpoints and simple devices that do not require dynamic identity or in-band alert behavior.

1) I3C value (baseboard-relevant capabilities)
  • Dynamic Address Assignment (DAA): reduces static address collisions and supports deterministic inventory mapping when endpoints change.
  • Hot-join semantics: enables controlled attach workflows so insertion events can be scoped to an island rather than destabilizing the backbone.
  • In-band interrupt / alerting: improves time-to-detect for threshold events without relying on heavy polling.
  • Stronger discovery vocabulary: supports a cleaner “device identity → inventory record” pipeline at the bus/device layer.
2) Why keep I2C islands (practical boundary)
  • Legacy endpoints: FRU/EEPROM and many simple sensor classes remain cost-effective and widely available on I2C.
  • Operational simplicity in islands: stable static addressing is acceptable for small, tightly bounded segments.
  • Isolation-first design: keeping legacy on islands can reduce the blast radius of electrical faults or partial-power behavior.
3) Migration pattern (backbone-first)

Step 1 — I3C backbone (AON)

Upgrade the backbone that must remain visible in standby. Make discovery and minimum telemetry deterministic and recoverable.

Step 2 — Bridge + segment gates

Bring legacy islands under the same inventory model via bridges and gates, so hot-plug or power-off behavior stays local.

Step 3 — Replace only where it pays

Move endpoints that benefit from dynamic identity or in-band alerts, while leaving stable low-complexity devices on I2C islands.

4) Engineering comparison (only what matters on baseboards)
Dimension | I2C (islands) | I3C (backbone)
Addressing | Static planning; collisions handled via straps or mux segmentation | DAA supports dynamic identity management and reduces static collision pressure
Discovery | Scan-based visibility; works best for small bounded segments | Stronger discovery semantics; better for inventory mapping across churn
Alerts | Often sideband INT/ALERT or polling-driven detection | In-band alerting reduces dependency on heavy polling and speeds reaction
Scale behavior | Large fanout stresses edges and timing; segmentation becomes mandatory | Backbone-first helps centralize discovery while islands limit capacitive blast radius
Recovery hooks | Segment isolation + rescan; deterministic if islands are small | Scoped attach/detach + re-DAA patterns support controlled churn handling
Migration pitfalls | Address maps fragment easily if mux topology is undocumented | Bridges must preserve isolation and identity mapping under power-domain changes

A mixed design is often the most robust: I3C for backbone-level discovery and alert semantics, I2C islands for bounded legacy endpoints. The design objective is stable observability, not maximal protocol uniformity.

Figure F3 — Layered mixed bus: I3C backbone → bridges → I2C islands
[Diagram] Layer 1: I3C backbone (discovery, inventory, alerts). Layer 2: bridges plus segment gates. Layer 3: bounded legacy I2C islands (FRU + temp, power + GPIO, presence + alert, legacy sensors). Every link is a controlled segment.

H2-5 — Device Discovery & Addressing

A baseboard onboarding flow must be repeatable: plug in → detect → assign an address → verify reachability → register into inventory. Discovery becomes “operational” only when it records segment, power-domain state, alert path, and a verification result alongside the observed address.

1) Static I2C address planning (make collisions predictable)

Collision avoidance toolkit

Prefer segmentation (islands) to keep conflicting endpoints from sharing the same visible bus. Use straps/config pins when available. Use mux selection only when segmentation boundaries are documented and enforced.

Rule

The same numeric address can repeat across different islands, but must not appear in the same visible segment at the same time. Document every segment boundary and its gating element.
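The rule above is easy to enforce automatically once the address plan is machine-readable. A minimal sketch, with an illustrative plan: a 7-bit address may repeat across islands, but a (segment, address) pair must be unique.

```python
from collections import Counter

# Sketch: detect static-address collisions per visible segment.
# Plan entries (segment, address, inventory key) are illustrative.

ADDRESS_PLAN = [
    ("I2C-1",  0x4A, "temp.array.zoneA"),
    ("I2C-2",  0x4A, "temp.array.zoneB"),   # same address, different island: OK
    ("I2C-HP", 0x50, "fru.drawer.slot3"),
]

def segment_collisions(plan):
    """Return (segment, address) pairs that appear more than once."""
    counts = Counter((seg, addr) for seg, addr, _key in plan)
    return sorted(pair for pair, n in counts.items() if n > 1)

assert segment_collisions(ADDRESS_PLAN) == []
# Adding a second 0x4A on I2C-1 makes the collision visible:
assert segment_collisions(ADDRESS_PLAN + [("I2C-1", 0x4A, "temp.extra")]) == [("I2C-1", 0x4A)]
```

A check like this belongs in the review step for Template A below, so a collision is caught before deployment rather than during a scan.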

2) I3C DAA onboarding (dynamic address is not the identity)
  • When to run DAA: after segment attach, after power-domain transitions that change device visibility, and on hot-join events.
  • Scope it: run onboarding on the affected segment/island, not the entire backbone, to avoid unnecessary address churn.
  • Map to inventory: the inventory key must be device identity + segment context; the dynamic address is the current reachability handle.
  • Record changes: log address transitions (before → after) and the reason (join, rescan, recovery).
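The scoped onboarding steps above can be sketched as one small function per segment. This is a hedged model, not a vendor API: `assign` and `verify` are stubs standing in for the real DAA or static-address transactions, and the evidence record mirrors the log templates below.

```python
# Sketch: scoped onboarding of one endpoint on one segment.
# assign/verify are illustrative stubs; only this segment is ever touched.

def onboard(segment, identity, assign, verify, log):
    """Assign an address, verify with a basic read, record evidence."""
    addr = assign(segment, identity)      # DAA result or expected static address
    ok = verify(segment, addr)            # basic read / health check
    log.append({"segment": segment, "key": identity,
                "addr_before": None, "addr_after": addr,
                "verify": "PASS" if ok else "FAIL"})
    return ok

log = []
ok = onboard("I2C-HP", "fru.drawer.slot3",
             assign=lambda seg, key: 0x50,
             verify=lambda seg, addr: True,
             log=log)
assert ok and log[0]["verify"] == "PASS" and log[0]["addr_after"] == 0x50
```

The key design point: the inventory key (`identity`) is the stable handle, and the address is only the current reachability result, recorded before/after.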
3) Operational discovery (discover ≠ usable)
Evidence item | What to record | Why it matters operationally
Segment / island | Segment ID, upstream gate/switch, bridge path | Limits blast radius; enables scoped rescan and targeted isolation
Power domain | AON / MAIN / HOTPLUG state | Prevents false alarms when a domain is intentionally off; explains “missing” devices
Alert path | In-band alert / INT line / none | Separates “device unreachable” from “event signal broken”; improves MTTR
Address state | Static expected, or dynamic assigned; before → after on changes | Supports audit trails, avoids identity drift, and enables stable inventory linking
Verification | Basic read/health check: pass/fail + failure category | Discovery without verification causes inventory pollution and noisy support tickets
4) Template A — Address Plan (single source of truth)

This plan ties addressing to topology and domains. Use it to review changes before deployment.

Domain | Segment / Island | Device class | Inventory key | Addr type | Addr (expected/current) | Alert path | Isolation element | Notes
AON | B0 (backbone) | Bridge | bridge.backbone.01 | DAA | dyn / dyn | in-band | switch gate | Scoped onboarding after attach
MAIN | I2C-1 | TEMP | temp.array.zoneA | Static | 0x4A / 0x4A | INT | mux branch | Address shared only within island
HOTPLUG | I2C-HP | FRU | fru.drawer.slot3 | Static | 0x50 / 0x50 | none | switch gate | Gate open only during service window
5) Template B — Discovery Log (audit-friendly)

This log turns “it disappeared” into a timed, explainable event with scope and recovery actions.

Timestamp | Event | Domain / segment state | Inventory key | Address before → after | Verify | Failure category | Action taken
YYYY-MM-DD hh:mm:ss | HOT-JOIN | HOTPLUG / gate=open | fru.drawer.slot3 | — → 0x50 | PASS | — | Inventory update
YYYY-MM-DD hh:mm:ss | RESCAN | MAIN / stable | temp.array.zoneA | 0x4A → 0x4A | FAIL | NACK storm | Isolate island, retry later
YYYY-MM-DD hh:mm:ss | RECOVERY | AON / backbone | bridge.backbone.01 | dyn → dyn | PASS | — | Scoped DAA on segment

Boundary: this section defines bus/device-layer onboarding evidence (scope, address state, verification, logs). It does not define management protocol stacks or backend database implementations.

Figure F4 — DAA + Hot-Join onboarding state machine (scoped to a segment)
[Diagram] Onboarding state machine, scoped to one segment/island: IDLE/MONITOR → JOIN DETECTED (hot-join/attach) → ASSIGN (DAA or static) → VERIFY (basic read) → INVENTORY UPDATE + evidence. Assign or verify failures enter failure handling: isolate (gate off), log timestamp, then retry policy (scoped rescan / re-DAA / degrade). Evidence fields: segment + domain, alert path, address before/after, verify result.

H2-6 — Bus Electrical Integrity

Most field failures begin as marginal edges: slow rise-time, excessive segment capacitance, glitches, or insufficient low-level margin. A bus that “works once” can still be unstable under temperature, hot-plug, or domain transitions. Evidence should be captured as waveforms and retry behavior, not assumptions.

1) Pull-ups and rise-time (engineering method, not a spec recital)
  • Estimate per-segment load: include trace length, connector parasitics, and endpoint input capacitance. Treat each island as a separate RC problem.
  • Choose pull-up by segment: keep pull-up ownership aligned with the segment’s power domain to prevent back-power paths.
  • Validate at the far end: measure rise-time and noise margin at the worst-case point (end of the longest branch), not only near the master.
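Treating each island as a lumped RC makes the pull-up window computable from two standard I2C constraints: the low-level sink current and the 30%-to-70% rise time (t_r = RC·ln(0.7/0.3) ≈ 0.8473·RC for an RC charge). The numbers below (3.3 V rail, 3 mA sink, 0.4 V VOL, 300 ns Fast-mode rise, 200 pF segment) are typical example values, not a recommendation for a specific design.

```python
import math

# Sketch: per-segment pull-up bounds from sink-current and rise-time limits.

def pullup_bounds(vdd, vol, iol, t_r_max, c_bus):
    """(Rp_min, Rp_max) in ohms for one segment treated as a lumped RC."""
    rp_min = (vdd - vol) / iol                       # must still reach VOL at IOL
    # 30%..70% rise through an RC charge: t_r = RC * ln(0.7/0.3)
    rp_max = t_r_max / (c_bus * math.log(0.7 / 0.3))
    return rp_min, rp_max

rp_min, rp_max = pullup_bounds(vdd=3.3, vol=0.4, iol=3e-3,
                               t_r_max=300e-9, c_bus=200e-12)
print(f"Rp window: {rp_min:.0f} ohm .. {rp_max:.0f} ohm")
# An empty window (rp_min > rp_max) means the segment must be split or buffered.
```

Running the bound per segment also gives an objective trigger for the segmentation rules in H2-3: when added capacitance closes the window, split the island.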
2) Glitch sensitivity (keep it bus-layer)

What typically goes wrong

Short spikes can be interpreted as edges when rise-times are slow or thresholds are marginal. This can produce NACK storms or address-mapping drift during hot-join windows.

What to do (within scope)

Use segmentation and electrical buffering where long runs and noisy connector zones exist. Gate hot-plug islands so disturbances do not propagate into the backbone.

3) Clock stretching and timing edges (why “can run” ≠ stable)
  • Stretching amplifies marginality: one slow or partially-powered endpoint can elongate cycles and trigger timeouts.
  • Marginal edges increase retries: a stable system shows low retry frequency; rising retries are an early warning long before “bus hang”.
  • Contain the failure domain: a segment gate allows recovery actions (retry/rescan) without collapsing overall observability.
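The "rising retries as early warning" point lends itself to a trivial trend check on firmware retry counters. A minimal sketch; the window size and factor are illustrative thresholds, not tuned values.

```python
# Sketch: flag a shrinking-margin trend by comparing a recent retry-rate window
# against the preceding baseline window. Thresholds are illustrative.

def retry_trend_alert(samples, window=4, factor=2.0):
    """True when the recent retry rate exceeds `factor` x the earlier baseline."""
    if len(samples) < 2 * window:
        return False                      # not enough history to judge a trend
    baseline = sum(samples[-2 * window:-window]) / window
    recent = sum(samples[-window:]) / window
    return recent > factor * max(baseline, 1e-9)

assert not retry_trend_alert([1, 1, 2, 1, 1, 2, 1, 1])   # flat: healthy margin
assert retry_trend_alert([1, 1, 1, 1, 4, 6, 8, 9])       # climbing: pre-failure
```

The point is operational: the alert fires per segment, so the worst segment is scoped first, long before a full bus hang.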
4) Oscilloscope checklist (actionable)
Checkpoint | What to look at | Interpretation / next action
Rise-time | SCL/SDA edge speed at the segment end; compare near-master vs far-end | Slow edges suggest high capacitance or a weak pull-up; segment, buffer, or adjust pull-up ownership per domain
Low-level margin | Stability of the low-level “floor” under traffic; look for lifted lows | Insufficient margin can cause false reads; isolate partial-power endpoints and review segment gating
Glitches | Short spikes on SCL/SDA; correlate with missing devices or NACK storms | Glitches plus slow edges are a common pair; shorten or noise-isolate the segment and add electrical buffering where needed
Retry / repeated START | Protocol analyzer or firmware counters: retry rate over time | Rising retries indicate shrinking margin; treat them as a pre-failure signal and scope the worst segment first
Hot-plug window | Waveforms before/during/after attach; check whether backbone edges degrade | If the backbone degrades, the hot-plug domain is not contained; strengthen gating/isolation and rescan only that segment

Boundary: this section focuses on bus-edge integrity and measurement evidence. It does not prescribe full EMC design or chassis grounding rules.

Figure F5 — Simplified waveform signatures (good vs high-C vs weak/strong pull-up)
[Diagram] Four simplified waveform panels against a threshold line: good (fast edge), high capacitance (slow rise), weak pull-up (noise-sensitive), and over-strong pull-up (low-level lift, margin risk). Each panel marks the sampling window.

H2-7 — Multi-Domain Power & Isolation

In baseboards, the most damaging failures come from cross-domain coupling: a powered-off island can back-power, drag the bus, or spread hot-plug disturbances into always-on visibility. The goal is to keep each domain electrically and operationally scoped with clear pull-up ownership and segment gating.

1) Where back-powering sneaks in (conceptual paths)

Typical paths

Cross-domain pull-ups, IO protection/clamp structures, and “half-powered” endpoints can create unintended current paths. The symptom is often unstable bus levels, stuck lows, or devices that never fully reset.

Design intent (within bus/domain scope)

Enforce segment visibility with gates/switches, keep pull-ups owned by the segment’s intended domain, and ensure powered-off islands become electrically quiet and logically invisible.

2) Make powered-off devices “non-blocking”
  • Segment gates: isolate islands so a fault or off-domain endpoint cannot pull SCL/SDA low globally.
  • Pull-up ownership: assign pull-ups to the domain that remains valid for that segment (avoid cross-domain pull-ups by default).
  • Scoped recovery: isolate → log → rescan only the affected island (avoid global churn).
3) Hot-plug containment (keep disturbances local)
  • Hot-plug = separate segment: attach islands behind a gate so plug-in transients do not degrade the backbone.
  • Order of operations: power stable (PG) → open gate → discovery/addressing → verification → inventory update.
  • Evidence: capture “attach/open/close” events with timestamps and segment IDs to explain reachability changes.
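The order of operations above can be expressed as one sequencing function with evidence emitted at each step. This is a sketch of the sequencing and logging only: every callable is a hypothetical stub, not a BMC or driver API.

```python
import time

# Sketch of the attach sequence: PG stable -> open gate -> scoped discovery ->
# verify -> inventory/evidence. All callables are illustrative stubs.

def attach_island(segment, pgood, open_gate, discover, verify, log):
    """Attach one island; abort early if its power is not stable."""
    if not pgood(segment):
        log.append((time.time(), segment, "attach", "abort: PG not stable"))
        return False
    open_gate(segment)
    devices = discover(segment)           # scoped: only this island is scanned
    ok = all(verify(segment, d) for d in devices)
    log.append((time.time(), segment, "attach", "ok" if ok else "verify-fail"))
    return ok

log = []
ok = attach_island("HP", pgood=lambda s: True, open_gate=lambda s: None,
                   discover=lambda s: [0x50], verify=lambda s, d: True, log=log)
assert ok and log[-1][3] == "ok"
```

Because the gate only opens after PGOOD, a half-powered island never becomes bus-visible, which is exactly the containment property the section requires.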
4) Output: Domain–Bus ownership matrix

Use this matrix in design reviews to prove that off-domains cannot back-power or block the always-on view.

Domain | Bus visibility | Pull-up owner | Gate element | When domain OFF | Recovery action
AON | Backbone B0 | AON only | Core gate | Must remain reachable; backbone stays stable | Scoped rescan of attached islands only
MAIN | I2C-1 / I2C-2 islands | MAIN (per island) | Island switch | Island should be invisible (gate closed) | Close gate → log → reopen after PG stable
HOTPLUG | HP island | HOTPLUG (local) | HP gate | Invisible during service / unpowered periods | Attach → verify → inventory; detach → mark absent
5) Output: Allowed cross-domain signals (whitelist)

Treat cross-domain connectivity as “allowed by design,” not accidental. If a signal is not listed here, it should not cross domains.

Signal class | Allowed direction | Required properties | Notes (scope control)
I3C / I2C | Backbone → island (through gate) | Segment gate; pull-up ownership defined; OFF domain becomes invisible | Prefer “attach/detach” semantics over always-connected wiring
ALERT / INT | Island → AON (minimal dependency) | Defined pull-up; known OFF-state behavior; debounced semantics | Use for “wake/attention” when in-band is unavailable
PRESENT / PGOOD | Island → AON | Stable level when OFF; no back-power path | Explains reachability changes without scanning
RESET (domain) | AON → island | Only valid when island domain is powered; no reverse feeding | Scope reset to the island to avoid global churn
Key principles: scoped segments, explicit pull-up ownership, OFF = invisible, hot-plug contained.

Boundary: this section defines domain visibility, segment gating, and pull-up ownership for bus stability. It does not define BMC protocol stacks, PSU/VRM power conversion, or full EMC grounding rules.

Figure F6 — Three-domain model: AON backbone + hot-plug segment + powered-off segment (isolation points)
[Diagram] AON domain: master, owned pull-ups, and the always-visible I3C backbone (B0). HOTPLUG domain: HP segment (FRU, temp) behind an open gate, going through attach + verify. Powered-off domain: OFF segment behind a closed gate, invisible, with back-power blocked at the gate. Key risk: back-power across domains; rule: any off or hot-plug domain sits behind a gate.

H2-8 — Telemetry Aggregation Model

A baseboard telemetry system is not a pile of sensors. It is a structured pipeline that binds every measurement to domain, segment, and an inventory key, with a clear path to alerts and event logs. The same measurement should be explainable over time (trend) and under transitions (attach/off/reset).

1) A simple four-layer model (data model, not software)

Layer 0 → 1

Raw reads become physical units (V/A/W/°C) with a domain + segment context.

Layer 2 → 3

Thresholds and alert policy produce events with timestamps and traceability back to inventory.
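The four layers can be sketched as two small transformations, one per layer boundary. This is a data-model illustration only; the field and key names echo the dictionary later on this page, while the scale factor and limit are made-up example values.

```python
# Sketch of the four-layer model: a raw reading is bound to units and context
# (layers 0-1), then a threshold turns it into a timestamped event (layers 2-3).

def to_measurement(raw_code, scale, unit, domain, segment, key):
    """Layer 0 -> 1: raw register code becomes a unit value with full context."""
    return {"value": raw_code * scale, "unit": unit,
            "domain": domain, "segment": segment, "key": key}

def to_event(meas, limit, ts):
    """Layer 2 -> 3: a threshold crossing becomes a traceable event, else None."""
    if meas["value"] <= limit:
        return None
    return {"ts": ts, "kind": "threshold", "limit": limit, **meas}

m = to_measurement(raw_code=170, scale=0.5, unit="degC",
                   domain="MAIN", segment="I2C-1", key="tmp.zoneA.hotspot_c")
evt = to_event(m, limit=80.0, ts="2024-01-01T00:00:00Z")
assert evt is not None and evt["key"] == "tmp.zoneA.hotspot_c"
```

Note that the domain/segment/key context is attached at layer 1 and simply carried forward, so every event is traceable back to inventory without a join against another system.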

2) Key telemetry classes (organized for root-cause)
  • Electrical: voltage/current/power per domain and segment (pair with domain state to avoid false alarms).
  • Thermal arrays: hotspot/max/min and gradient cues (useful for localization, not only average temperature).
  • Domain state: present/pgood/reset/attach (explains why data is missing or why a device is unreachable).
  • Reachability evidence: segment gate state and retry level (links electrical issues to bus integrity symptoms).
3) Alert path selection: in-band vs out-of-band
Alert path | Best fit | Design note (scope control)
In-band (I3C) | Structured events tied to discovery/inventory; when the bus view is stable and the segment is attached | Bind alerts to domain + segment + key fields
Out-of-band (ALERT/INT) | Minimal-dependency signaling; “attention” when in-band is unavailable (OFF-domain transitions, early fault flags) | Define OFF-state behavior and debounce semantics; treat it as a trigger to fetch structured data later
4) Output: Telemetry field dictionary (copy-ready)

Standardize field names so logs and alerts remain comparable across platforms and generations.

Field name | Unit | Source class | Sampling suggestion | Threshold type | Domain | Segment | Inventory key link | Alert path | Log policy
dom.aon.vbus_v | V | power | normal | absolute + persistence | AON | B0 | power.aon.entry | in-band | periodic + on-alert
dom.main.pwr_w | W | power | normal | absolute + delta | MAIN | I2C-1 | power.main.zoneA | in-band | periodic
dom.hp.present | bool | domain_state | fast | edge-trigger | HOTPLUG | HP | fru.drawer.slotN | INT | on-change + on-alert
seg.hp.gate_state | bool | domain_state | fast | edge-trigger | HOTPLUG | HP | gate.hp.01 | in-band | on-change
tmp.zoneA.hotspot_c | °C | temp | normal | absolute + rate | MAIN | I2C-1 | temp.array.zoneA | in-band | periodic + on-alert
bus.i2c1.retry_level | count | reachability | normal | delta + persistence | MAIN | I2C-1 | bridge.i2c1 | in-band | periodic
evt.alert.ts | ms | event | on-event | — | any | any | inventory.key | in-band / INT | on-event

Boundary: this section defines telemetry organization (fields, units, context binding, alert paths). It does not define storage backends, APIs, or BMC service implementations.

Figure F7 — Telemetry pipeline: Sensors → Segment → Aggregator → Log / Alert (with timestamp + domain context)
[Diagram] Sensors (power V/I/W, temp array, domain present/PG, reachability retry) feed a segment (domain AON/MAIN/HP; segment ID B0/I2C-1/HP), then an aggregator that stamps a timestamp and inventory key, branching into a structured event log and I3C/INT alerts. Context carried throughout: domain + segment + key.

H2-9 — Fault Modes & Bus Recovery

The recovery objective is restoring observability without a full system reboot. Treat the baseboard bus as segmented infrastructure: recover the AON backbone first, then isolate and reattach affected islands. Every recovery step should emit evidence (timestamp, segment, action, result).

1) Common fault signatures (what is observable on-site)
  • SDA stuck-low: the bus cannot return high; scans hang or collapse into repeated timeouts.
  • Timeout bursts: intermittent read/write failures that correlate with attach/off transitions or noisy edges.
  • Address conflict: two devices respond; identity becomes ambiguous; inventory mismatches grow.
  • Segment short / global drag: a single island pulls the whole network down.
  • Hot-plug half-attach: the segment becomes unstable during service events, causing sporadic losses.
2) Recovery ladder (escalate only when evidence demands it)
Level | Action | Trigger | Steps | Exit criteria
L1 | Retry + timeout | Intermittent failures; single-device errors | Scoped retry for the current segment/device | Error rate drops below threshold; no segment-wide impact
L2 | Clock pulses | SDA stuck-low signature on a segment | Issue unstick pulses on the affected segment | SDA returns high; segment becomes reachable again
L3 | Isolate segment | Global drag or suspected short/half-attach | Close the gate for the suspected island; protect the backbone | Backbone stable; other segments recover and scan reliably
L4 | Rediscover / readdress | Inventory mismatch after attach/detach | Reattach → rediscover → (I3C) DAA reassign → verify | Inventory key ↔ address mapping converges; loss count returns to baseline
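The ladder's "escalate only on evidence" discipline can be sketched as a loop that stops at the first level whose exit criteria pass, logging every step. The level names and stub actions are illustrative, not a firmware interface.

```python
# Sketch: walk the recovery ladder for one segment, logging each attempt and
# stopping at the first success. Actions are stubs that return True on recovery.

LADDER = ["L1-retry", "L2-clock-pulses", "L3-isolate-segment", "L4-rediscover"]

def recover(segment, actions, log):
    """Try each level in order; emit evidence per step; stop on first success."""
    for level in LADDER:
        ok = actions[level](segment)
        log.append({"segment": segment, "action": level,
                    "result": "recovered" if ok else "escalate"})
        if ok:
            return level
    return None  # ladder exhausted: backbone-level intervention needed

log = []
# A stuck-low that clears after unstick pulses: L1 fails, L2 succeeds, no L3/L4.
actions = {"L1-retry": lambda s: False, "L2-clock-pulses": lambda s: True,
           "L3-isolate-segment": lambda s: True, "L4-rediscover": lambda s: True}
assert recover("I2C-1", actions, log) == "L2-clock-pulses"
assert [e["action"] for e in log] == ["L1-retry", "L2-clock-pulses"]
```

The log list doubles as the evidence trail: timestamping each entry and adding the symptom/impact fields from the list below would complete the record.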
3) Evidence-first recovery logging (copy-ready fields)

Minimum evidence fields

ts, domain, segment_id, symptom, action, result, impact, retry_level, inventory_delta

4) Output: Bus Recovery Playbook (step triggers)
Observed symptom | Evidence check | Primary action | If not recovered
SDA stuck-low | Which segment_id? Gate state? Recent attach/off events? | L2 clock pulses on that segment | L3 isolate segment; keep backbone stable
Timeout bursts | Retry-level trend; temperature/power transitions; gate flaps | L1 scoped retry with bounded timeout | L3 isolate if bursts expand to other segments
Missing devices | present/pg status; inventory_delta; segment attach state | Rescan the segment (scoped) | L4 reattach → rediscover/DAA → verify
Address conflict | Multiple responders; inventory-key ambiguity | Scope to the conflicting island; isolate if needed | L4 rediscover + readdress; record mapping changes
Global drag | Backbone health vs islands; which gate change preceded the collapse | L3 isolate the most recently attached or suspected island | Iterate isolation by segment until the backbone recovers
Principles: backbone first, segment-scoped actions, escalation by evidence, a log entry for every step.

Boundary: recovery actions here are bus/segment level (retry, unstick pulses, isolate, rediscover/readdress). This section does not define OS drivers, BMC service logic, or database persistence.

Figure F8 — Fault tree: Symptom → Likely cause → Evidence → Action (segment-scoped)
[Diagram] Fault tree, segment-scoped: SDA low (stuck endpoint, half-attach; evidence: segment_id, gate_state; action: clock pulses, isolate segment). Timeout (edge noise, domain flap; evidence: retry_level, ts; action: retry, escalate). Missing device (off/detach, mapping drift; evidence: present/pg, inventory_delta; action: rescan, rediscover). Address conflict (clashed island; evidence: segment_id, key ambiguity; action: isolate, readdress). Priority: keep the AON backbone stable; isolate islands.

H2-10 — Validation & Bring-up Checklist

Validation should eliminate “lab passes, rack fails” by proving stability at the worst corners: maximum nodes and length, extreme temperature, and domain/power disturbances. The bring-up sequence must establish an AON baseline first, then expand segments, then qualify hot-plug behavior.

1) Bring-up order (observability-driven)
  • Stage 1 — AON baseline: backbone stable; minimum telemetry chain readable; evidence logs working.
  • Stage 2 — Island expansion: add one segment at a time; prove faults remain scoped to that segment.
  • Stage 3 — Hot-plug readiness: attach/detach cycles do not perturb the backbone; recovery is bounded in time and loss rate.
2) Stress dimensions (drive toward worst corners)
Dimension | Progression | What to record (evidence)
Node count | Typical → upper bound | Scan success rate, inventory_delta, retry_level trend
Length / load | Typical → worst cable/trace length and capacitance | Timeout bursts, stuck-low incidence, segment isolation events
Temperature | Ambient → hot / cold corners | Error rate vs temperature, hotspot indicators, recovery time
Domain / power disturb | Stable → off/on transitions + hot-plug events | Time-to-visibility, loss rate, mapping drift count
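The stress dimensions above form a small combinatorial space, and enumerating it mechanically avoids missed corners. A sketch using illustrative dimension labels (the real values come from the deployed topology, not this list):

```python
# Sketch: enumerate stress-corner combinations from the dimensions above.
# Labels are illustrative placeholders, not measured limits.
from itertools import product

DIMS = {
    "nodes":   ["typical", "max"],
    "length":  ["typical", "worst"],
    "temp":    ["ambient", "hot", "cold"],
    "disturb": ["stable", "off_on", "hot_plug"],
}

def corner_cases():
    """Yield every dimension combination as a dict; callers filter as needed."""
    keys = list(DIMS)
    for values in product(*DIMS.values()):
        yield dict(zip(keys, values))

# "Must-test" corners per the matrix: max nodes AND worst length, with
# temperature and power disturbance applied as modifiers on top.
must_test = [c for c in corner_cases()
             if c["nodes"] == "max" and c["length"] == "worst"]
```

This keeps the validation plan auditable: the full space and the must-test subset are derived from one table instead of being hand-picked.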
3) Output: Validation case table (copy-ready)
Test ID | Setup | Steps | Pass criteria | Required logs
T01 | AON only, backbone | Boot → scan backbone → read minimum telemetry chain | No timeouts; stable scan rate | ts, segment_id=B0, retry_level, result
T02 | Max nodes, typical length | Scan loops for N cycles; record errors per cycle | Success rate ≥ target; bounded retries | ts, segment_id, retry_level, inventory_delta
T03 | Worst length/load | Repeat scans + induced attach events | No global drag; isolation works | gate_state, impact, action, result
T04 | Hot corner | Hold at high temp; scan; watch drift and timeouts | Error rate remains below threshold | temp hotspot, retry_level, timeout count
T05 | Cold corner | Repeat T04 at low temp | Recovery times remain bounded | time-to-visibility, loss rate
T06 | Domain off/on | Power off island → verify invisible → restore → rescan | Backbone stable; island recovery bounded | present/pg, gate_state, inventory_delta
T07 | Hot-plug cycles | Attach/detach for M cycles; verify mapping stability | Low loss; minimal mapping drift | ts, segment_id=HP, action, result, mapping drift
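The "Required logs" column is easiest to enforce when every test step emits one structured record. A minimal sketch, with field names mirroring the table (ts, segment_id, retry_level, inventory_delta); the class itself is hypothetical, not part of any tooling named here:

```python
# Sketch: a structured evidence record matching the "Required logs" fields.
# All names are illustrative; adapt to the actual log sink in use.
from dataclasses import asdict, dataclass, field
import time

@dataclass
class EvidenceRecord:
    test_id: str                 # e.g. "T02"
    segment_id: str              # e.g. "B0", "HP"
    action: str                  # step taken (scan, isolate, rescan, ...)
    result: str                  # pass / fail / retried
    retry_level: int = 0
    inventory_delta: int = 0     # devices appearing/disappearing vs baseline
    ts: float = field(default_factory=time.time)

rec = EvidenceRecord("T01", "B0", "scan_backbone", "pass")
row = asdict(rec)                # dict form, ready for a CSV writer or log sink
```

Making the record a dataclass means a missing evidence field fails at construction time, during bring-up, instead of being discovered as a hole in the logs during a field incident.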

Boundary: this checklist specifies bring-up order, stress corners, and evidence fields. It does not specify automation frameworks, OS tooling, or rack deployment procedures.

Figure F9 — Test matrix: nodes × length (with temperature + power disturbance as corner modifiers)
[Figure: a 2D grid of node count (X, low → high) vs length/load (Y, short → long), with must-test corners highlighted at the high-node/long-length extremes and a baseline cell at typical values; temperature (hot/cold) and power disturbance (off/on, hot-plug) are applied as modifiers on the must-test corners.]

H2-11 — Parts / IC Selection Pointers (Bus & Telemetry Only)

This section lists common IC categories used on an OCP/OpenRack baseboard to make the sideband (I³C/I²C) fabric scalable, segmentable, and observable. Focus stays on the bus layer and telemetry plumbing—no BMC firmware stack, no power-conversion deep dive.

11.1 What to place where (one-glance placement rules)

Backbone fan-out: I³C hub · Segment entry: I²C switch / hot-swap buffer · Domain boundary: isolator / translator · Inventory: FRU/ID EEPROM · Observability: power + temp sensors · Presence/LED: GPIO expander
Backbone (always-on) | I³C host + hub(s). Keep pull-ups and ALERT/IBI handling stable in AON.
Each removable segment | Switch/buffer at the segment "mouth" so one fault does not drag the whole bus.
Cross-voltage / cross-ground | Translator or isolator where domains meet, not in the middle of a noisy segment.
Telemetry cluster | Group power/temp sensors per domain and record the domain + timestamp alongside values.

Note: MPNs below are examples to speed up RFQs and param checks. Always validate I/O behavior during power-off, hot-join/hot-plug expectations, and bus-capacitance budgets against the current datasheet.

11.2 Example IC categories and MPNs (copy/paste shortlist)

The goal is not “one perfect part,” but the right function blocks: hub/bridge → segmentation → hot-swap friendliness → domain translation/isolation → telemetry & inventory.

Category | Example MPNs | Typical use on baseboard | Selection pointers (what to ask)
I³C hub / fan-out | NXP P3H2840HN, Renesas RG3M88B12 | Scale one I³C backbone into multiple downstream segments (AON plane), while keeping segments individually controllable for recovery and maintenance. | Downstream port count & control model; I³C vs mixed I²C support; hot-join behavior; reset semantics; fail-isolation (one bad segment's impact); max bus speed and bus-cap budget.
I³C/I²C translators (for I³C rates) | TI TCA39416, NXP P3A9606 | Cross-voltage translation for sideband where I³C speed/edges matter (e.g., 1.2 V ↔ 1.8 V/3.3 V), keeping bidirectional open-drain semantics. | Confirm I³C compatibility; auto-direction/bidirectional behavior; rise-time impact; power-off leakage; EN/disable behavior (does it isolate or "half-connect"); ESD robustness for field swaps.
I²C mux / switch | TI TCA9548A, NXP PCA9548A | Segment legacy I²C islands to solve address conflicts, reduce effective bus capacitance, and localize stuck-low faults to one branch. | Channel count; reset/power-up default; leakage in off channels; level-translation needs; switch Ron vs edge integrity; software model (single vs multiple channels enabled).
Hot-swap I²C buffers | TI TCA4311A, TI TCA4307 | Protect the backbone from "half-inserted" cards/segments, precharge lines, and support stuck-bus recovery without rebooting the whole management plane. | Live-insertion behavior; precharge voltage; automatic stuck-bus recovery; handling of clock stretching & arbitration; connection criteria (STOP/idle detection); capacitance isolation strength.
I²C bus repeater / segment buffer | NXP PCA9515A | Split a heavy I²C bus into two segments with buffered SDA/SCL to extend practical load/cap limits and isolate noisy branches. | Multi-master friendliness; contention behavior; level-translation needs; propagation delay; power-off behavior; recommended pull-up placement per segment.
Long/noisy run extender (dI²C) | NXP PCA9615, NXP P82B96 | When baseboard-to-remote-panel runs are long or noisy: convert to differential signaling (PCA9615) or use buffered extension (P82B96) to improve robustness. | Distance/noise target; required cabling; speed limits; common-mode tolerance; EMC implications; how failures localize (does one short kill both ends?).
I²C isolation (ground/domain isolation) | ADI ADuM1250, TI ISO1540 | Isolate sideband across different ground references or sensitive domains while retaining bidirectional I²C signaling semantics. | Isolation rating; bidirectional support; speed ceiling; power-off behavior; fail-safe IO; CMTI/noise immunity; whether it is truly non-latching during hot events.
Digital power monitors | TI INA229, TI INA238, ADI LTC2947 | Per-domain V/I/P telemetry (and sometimes energy) tied to the same inventory context: domain ID, segment, and timestamp. | Common-mode range; shunt vs integrated-sense options; conversion time/averaging; alert pins and thresholds; logging needs (min/max, energy); calibration strategy and drift expectations.
Temperature sensors (I³C/I²C) | NXP P3T1085UK | Clean temperature telemetry with I³C features (e.g., in-band interrupt capability), suitable for dense sensor deployments with discovery semantics. | Accuracy & response; I³C features used (IBI vs polling); alert mechanism; placement strategy (hotspots vs gradients); sampling cadence vs noise.
GPIO expanders (presence/LED/sideband pins) | NXP PCA9555, TI TCA9535, Microchip MCP23017 | Add low-speed IO for presence, latch signals, LEDs, and simple discrete telemetry where routing dedicated SoC pins is costly. | Interrupt output type; power-up default states; I/O drive & pull features; input glitch sensitivity; addressing options (how many can coexist); power-off leakage/back-power risk.
FRU / identity EEPROM | Microchip 24AA02E64 | Store a globally unique ID (and optional inventory fields) used by discovery pipelines to bind a dynamic bus address to a stable asset identity. | Pre-programmed ID needs; write endurance; power-loss behavior during writes; address-conflict planning; data model (which fields are mandatory for operations).

Practical rule: pick parts that make segmentation and recovery cheap. If a segment can be isolated, re-scanned, and re-inventoried in seconds, field uptime and debug time improve dramatically.

11.3 RFQ-ready checklist — 10 questions that prevent wrong parts

These questions align with real baseboard failure modes: stuck-low, address conflict, hot-plug glitches, cross-domain back-power, and “discovery without operability.”

  1. Bus role: hub, mux, buffer, translator, isolator, or sensor—what exact layer is being solved?
  2. Speed target: I²C (100/400/1MHz) vs I³C (up to 12.5MHz). Is the part truly compatible at the required mode?
  3. Power-off behavior: will any pin back-power another domain through protection structures or pull-ups?
  4. Isolation/segmentation: when disabled/reset, does it become a clean high-Z barrier or a partial path?
  5. Capacitance budget: what bus cap is assumed per segment, and how does the part help enforce it?
  6. Hot-plug events: precharge, connect criteria (idle/STOP), and behavior during “half insertion.”
  7. Fault containment: can one short/stuck device be localized to a single branch with minimal blast radius?
  8. Recovery hooks: reset pin semantics, stuck-bus recovery, and whether software can force re-discovery/re-addressing.
  9. Alert strategy: in-band (I³C/IBI) vs out-of-band (ALERT/INT). How are alerts latched and cleared?
  10. Ops mapping: how does the design bind dynamic addresses to stable identity (FRU/ID EEPROM) with timestamps?

Figure F10 — Reference placement example (hub → segments → isolation → telemetry)

A single-board view showing where hubs, switches/buffers, translators/isolators, and telemetry ICs are typically placed to keep the management plane resilient.

Figure F10 — Bus & telemetry IC placement on a baseboard
[Figure: an always-on I³C backbone (host controller + hub fan-out + stable pull-ups/alerts) feeds three segments. Segment A (hot-plug/serviceable) sits behind a switch / hot-swap buffer with a legacy island, FRU EEPROM, GPIO expander, and temp sensors/alerts. Segment B (cross-domain) sits behind a translator/isolator with domain telemetry: a power monitor (V/I/P) and temp sensors with thresholds. Segment C (long/noisy run) uses a dI²C bus extender to a remote island with environmental sensors and presence/status pins. Callouts: place control points at segment mouths to isolate and recover fast; bind telemetry to domain + timestamp + stable identity; use isolation/translation at boundaries, not mid-segment.]

Reading the figure: each segment has a “control chokepoint” (switch/buffer/isolator) so recovery can be performed per-branch without taking down the entire management plane.


H2-12 — FAQs (I3C/I2C Governance, Telemetry, Recovery)

These FAQs target field issues on an OCP/OpenRack baseboard: bus stability, discovery/addressing, multi-domain behavior, telemetry organization, recovery without full reboot, and validation. Content stays at the bus/telemetry layer.

Figure F11 — Evidence-first troubleshooting ladder (FAQ map)

A compact visual for how to move from symptoms to evidence, then to a scoped action: electrical integrity → segmentation/isolation → re-discovery/inventory consistency → validation coverage.

Figure F11 — Symptom → Evidence → Containment → Restore → Verify
1) Symptom (field): device drop · read errors · SDA stuck-low · hot-join jitter · inventory drift
2) Evidence (3 buckets): waveform/edges · segment-level counters & isolation state · the validation corner that reproduces it
3) Contain the blast radius: segment switch/buffer · domain-boundary isolation · keep the AON backbone stable
4) Restore & verify: retry → release → isolate → re-discover → re-inventory

The FAQ answers below follow this ladder so each issue produces actionable evidence and a scoped recovery plan.

FAQs (12)

Each answer stays within: I3C/I2C governance, discovery/addressing, multi-domain telemetry, bus recovery, and validation/bring-up.

1) Why can a scan find devices, yet they randomly disappear after running for a while? What 3 evidence buckets come first?
Start with three evidence buckets. (1) Electrical: compare SCL/SDA edges, spikes, and low-level margin under worst load/temperature; look for rising-edge slowdowns and retry bursts. (2) Segment-level: identify which branch first shows errors, and whether isolating that segment stabilizes the backbone. (3) Validation corner: reproduce with max nodes/capacitance and power-domain disturbances, then confirm recovery time and reappearance rate.
Jump to: H2-6 · H2-9 · H2-10
2) If the I2C bus is short, why can SDA still get stuck-low? What is the most common power-off back-powering path?
Short wiring does not prevent stuck-low when a powered-off device unintentionally back-powers or clamps the line. A common path is pull-ups referenced to an always-on domain feeding into a powered-off device through I/O protection structures, which can partially power the device and latch the bus. The fix is architectural: ensure power-off pins are high-Z, move pull-ups to the correct domain, and isolate removable/off domains so one segment cannot hold the backbone low.
Jump to: H2-7 · H2-9
3) Does an address conflict always require changing parts? How do muxing and segmentation resolve conflicts in an engineering-friendly way?
A conflict rarely requires replacing devices. Engineering-friendly fixes are (1) segmentation: split the bus into islands so identical-address parts never share a segment, and (2) muxing: explicitly select one branch at a time when scanning or operating. The key is operability: keep an address plan table (segment/domain/device/addr/alert) and a discovery log (timestamp, branch, address changes, failures). This makes conflicts predictable, testable, and maintainable across hot-plug and field service.
Jump to: H2-3 · H2-5 · H2-11
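The address plan table mentioned above can be checked mechanically: a 7-bit address may repeat across segments, but never within one. A minimal sketch with hypothetical plan entries (segment, domain, device, address):

```python
# Sketch: detect 7-bit address clashes within one segment of the address plan.
# Plan entries are (segment, domain, device, addr); all names hypothetical.
from collections import defaultdict

PLAN = [
    ("SEG_A", "MAIN", "temp0",   0x48),
    ("SEG_A", "MAIN", "temp1",   0x49),
    ("SEG_B", "MAIN", "temp2",   0x48),  # same addr as temp0, different segment: OK
    ("SEG_B", "MAIN", "eeprom0", 0x48),  # clashes with temp2 on SEG_B
]

def conflicts(plan):
    """Return {(segment, addr): [devices]} for every address used twice
    on the same segment. Cross-segment reuse is allowed by design."""
    seen = defaultdict(list)
    for seg, _dom, dev, addr in plan:
        seen[(seg, addr)].append(dev)
    return {key: devs for key, devs in seen.items() if len(devs) > 1}
```

Running this check whenever the plan table changes turns address conflicts into a review-time error rather than a field symptom.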
4) After I3C DAA, why can inventory show “device drift” or duplicates?
DAA assigns dynamic addresses, but inventory needs stable identity. Drift/duplicates appear when the system records “address = identity” without binding to a stable identifier (FRU/ID EEPROM) plus context (segment and power domain). Hot-join, resets, or recovery can reshuffle dynamic addresses, and the inventory layer must treat these as events: record timestamp, old/new address, segment, and verification result. When identity is stable, address movement becomes traceable rather than confusing.
Jump to: H2-5 · H2-8
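The identity-binding idea above can be sketched in a few lines: the stable UID (e.g., from a FRU/ID EEPROM) is the key, and every dynamic-address move becomes an auditable event. All class and field names here are illustrative:

```python
# Sketch: bind dynamic (DAA) addresses to stable identity; log moves as events.
# The EEPROM UID stands in for FRU/ID identity; names are illustrative.
import time

class Inventory:
    def __init__(self):
        self.by_uid = {}   # uid -> (segment, dynamic_addr)
        self.events = []   # audit trail: (ts, uid, old_binding, new_binding)

    def observe(self, uid, segment, dyn_addr):
        """Record a discovery result; emit an event whenever the binding moves."""
        old = self.by_uid.get(uid)
        new = (segment, dyn_addr)
        if old != new:
            self.events.append((time.time(), uid, old, new))
        self.by_uid[uid] = new

inv = Inventory()
inv.observe("EE-0001", "SEG_A", 0x30)   # first sighting: old binding is None
inv.observe("EE-0001", "SEG_A", 0x32)   # DAA reshuffle: traceable, not confusing
```

Because the key is the UID rather than the address, the second observation shows up as one device that moved, not as a disappearance plus a duplicate.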
5) Hot-Join triggers jitter and global read errors—how can the impact be contained to the plug-in segment?
Containment requires a “control point” at the segment mouth: a switch/buffer that can keep the backbone stable while the plug-in segment settles. Treat the design as three domains: always-on backbone, plug-in segment, and any off domain. Hot-join should first be confined to the plug-in segment, then promoted to the backbone only after verification (stable edges, no stuck-low, consistent discovery). If errors occur, isolate the segment immediately and restore backbone observability before re-trying join.
Jump to: H2-3 · H2-7 · H2-9
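The confine → verify → promote flow above is small enough to sketch directly. The verify/isolate/promote hooks are placeholders for whatever the platform layer actually drives (a switch channel enable, a buffer EN pin, a hub port control):

```python
# Sketch: hot-join handling confined to the plug-in segment (hooks are
# hypothetical callables supplied by the platform layer).

def handle_hot_join(segment, verify, isolate, promote):
    """Settle a hot-join on its own segment; only promote after verification."""
    # 1) The join is handled on the plug-in segment only; backbone untouched.
    if verify(segment):      # stable edges, no stuck-low, consistent discovery
        promote(segment)     # 2) attach the verified segment to the backbone
        return "promoted"
    isolate(segment)         # 3) contain: restore backbone observability first
    return "isolated"

calls = []
ok = handle_hot_join("SEG_A", lambda s: True,
                     lambda s: calls.append(("isolate", s)),
                     lambda s: calls.append(("promote", s)))
```

The key design point is the asymmetry: promotion requires positive evidence, while isolation is the default on any doubt, so the backbone never pays for a misbehaving join.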
6) How should pull-ups be chosen so it works in the lab and in the field? Which waveform metrics must be measured?
Pull-ups must be selected against the worst-case effective bus capacitance of the deployed topology, not a lab bench setup. Measure: (1) rise-time and any edge “shoulder” indicating too much capacitance, (2) low-level margin under sink current (devices must pull low cleanly), (3) spikes/glitches near transitions, and (4) retry and timeout patterns that correlate with temperature or power disturbances. Validate at max nodes, longest wiring, and worst thermal corners to avoid field-only failures.
Jump to: H2-6 · H2-10
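The pull-up window can be bounded numerically. Per the NXP I²C-bus specification (UM10204), the 30%→70% rise through an RC gives t_r ≈ 0.8473·Rp·Cb, and the minimum pull-up comes from the output-low spec (VOL ≤ 0.4 V at 3 mA sink). A sketch using the 400 kHz fast-mode limit (t_r ≤ 300 ns); the 200 pF bus capacitance is a placeholder you must replace with the measured worst case of the deployed topology:

```python
# Sketch: pull-up resistor bounds from I2C-bus spec relations (UM10204):
#   t_r ~= 0.8473 * Rp * Cb        (30% -> 70% rise through an RC)
#   Rp_min = (VDD - VOL_max) / IOL (device must still pull the line low cleanly)
# Defaults are fast-mode spec limits; Cb below is a placeholder, not a measurement.

def pullup_bounds(vdd, cb_farads, tr_max=300e-9, vol=0.4, iol=3e-3):
    """Return (rp_min, rp_max) in ohms for the given supply and bus capacitance."""
    rp_max = tr_max / (0.8473 * cb_farads)  # weaker pull-up than this violates t_r
    rp_min = (vdd - vol) / iol              # stronger pull-up than this breaks VOL
    return rp_min, rp_max

lo, hi = pullup_bounds(vdd=3.3, cb_farads=200e-12)
# At 3.3 V and 200 pF the window is roughly 0.97 kohm .. 1.77 kohm: a heavily
# loaded fast-mode segment leaves little margin, which is why worst-case Cb
# (max nodes, longest wiring) must drive the choice rather than a bench setup.
```

If rp_min exceeds rp_max for the measured Cb, no resistor satisfies both limits, which is precisely the evidence that the segment needs a buffer/switch or splitting rather than a pull-up tweak.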
7) When is a buffer/switch necessary, and when is it just a pull-up and routing problem?
Use evidence first. If waveforms show slow edges, glitches, or low-level margin loss that scales with node count or cable length, start with capacitance budget, pull-ups, and routing. A buffer/switch becomes necessary when the system requirements exceed what passive fixes can guarantee: too many devices on one segment, serviceable/hot-plug segments, cross-domain power behavior, or the need to contain faults to one branch. A switch/buffer is also justified when recovery must isolate segments without rebooting the backbone.
Jump to: H2-3 · H2-6 · H2-11
8) In multi-domain designs, which signals must live in the always-on (AON) domain, and which must power down with a domain?
AON should keep the minimum observability chain alive: the backbone controller/hub, the pull-up reference for the backbone, and the sensors/alerts required to explain power events. Signals that cross into a domain that can power off (or be removed) must be able to become harmless when that domain is off—ideally high-Z and segment-isolated—so they cannot back-power or hold the bus low. If a signal cannot be made fail-safe, it should shut down with its domain and be separated by an isolator or switch.
Jump to: H2-2 · H2-7
9) Why does polling miss transient over-temperature or brownout events? How should in-band vs out-of-band alerts be split?
Polling samples at discrete intervals, so short events can occur and clear between reads. For operational telemetry, polling is fine for trends and steady-state limits, but critical transients need alerts. In-band alerts (e.g., I3C in-band mechanisms) are useful when the same bus context should carry identity and segment association. Out-of-band alerts (ALERT/INT) are better for fast, guaranteed wake/flag behavior even when the bus is congested. The best split ties every alert to domain + timestamp + stable identity for post-event reconstruction.
Jump to: H2-8
10) How to design a “tiered recovery” (retry → release → isolate → re-discover) to avoid full system reboot?
Use a tiered recovery ladder with clear triggers. Tier 1: bounded retry with timeouts to avoid infinite hangs. Tier 2: bus release actions (e.g., clock pulses) if stuck-low is suspected. Tier 3: isolate the suspected segment using a switch/buffer so the backbone becomes observable again. Tier 4: re-discover and re-address (I3C DAA if used), then re-bind to stable identity and update inventory. Each tier must emit a structured event record: timestamp, segment, action, and outcome.
Jump to: H2-9
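The four-tier ladder above can be sketched as a driver loop: each tier is a hook, each attempt emits the structured event record the answer calls for, and the ladder stops at the first tier that succeeds. Tier implementations here are stand-in lambdas, not real bus operations:

```python
# Sketch: tiered recovery ladder (retry -> release -> isolate -> rediscover).
# Tier callables are hypothetical platform hooks; each emits a structured event.
import time

def recover(segment, tiers, log):
    """Try each (name, action) tier in order; return the first tier that
    succeeds, logging every attempt with timestamp, segment, and outcome."""
    for name, action in tiers:
        ok = action(segment)
        log.append({"ts": time.time(), "segment": segment,
                    "action": name, "outcome": "ok" if ok else "fail"})
        if ok:
            return name
    return "unrecovered"

log = []
tiers = [
    ("retry",      lambda s: False),  # Tier 1: bounded retry timed out
    ("release",    lambda s: False),  # Tier 2: clock pulses did not free SDA
    ("isolate",    lambda s: True),   # Tier 3: segment cut off; backbone back
    ("rediscover", lambda s: True),   # Tier 4: re-run DAA + re-bind identity
]
result = recover("SEG_B", tiers, log)
```

Because every attempt is logged whether or not it succeeds, the event trail shows exactly how far the ladder climbed, which is the evidence later triage depends on.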
11) During bring-up, how to quickly decide whether the issue is electrical integrity or discovery/addressing logic?
Electrical issues typically correlate with scaling and corners: failures worsen with more nodes, longer wiring, higher temperature, or power disturbances; waveforms show edge degradation, spikes, or low-level margin loss, and errors look random. Discovery/addressing issues are more deterministic: consistent conflicts, repeated duplicates, missing devices tied to one segment, or inventory drift after resets/recovery. The fastest method is a controlled matrix: hold topology constant while varying address/discovery steps, then hold discovery constant while stressing capacitance/pull-ups and domains. Log segment context for every failure.
Jump to: H2-5 · H2-6 · H2-10
12) When selecting an I3C hub/bridge, what 3 pitfalls are most often missed (power-off behavior / isolation / compatibility)?
First, power-off behavior: confirm off-state pins are high-Z and do not back-power other domains via pull-ups or protection paths. Second, isolation: check whether a faulty downstream segment can be cleanly cut off so the backbone remains usable; avoid parts that “half-connect” during reset or fault. Third, compatibility: verify mixed I3C + legacy I2C operation boundaries, hot-join expectations, and recovery semantics (reset/re-discovery hooks). A hub/bridge should reduce blast radius and improve observability, not just add ports.
Jump to: H2-11 · H2-7