
Space Memory & Interconnect for Spacecraft Electronics


Space Memory & Interconnect is about keeping spacecraft data correct and links deterministic under radiation and long mission life. It combines EDAC/ECC with scrubbing, proven link margins (BER/CRC), and logged, testable degrade/switch policies so “reliability” can be verified and replayed.

What this block owns: spacecraft data memory + deterministic links

This chapter sets the scope boundary and the engineering “proof loop” for reliable spacecraft data storage and transport under radiation-driven soft errors and long mission lifetimes.

Ownership: What this page covers

Rad-hard memory control (SDRAM/DDR) with EDAC/ECC, background scrubbing, and deterministic interconnects (SpaceWire / SpaceFibre / LVDS) including physical-layer margining.

Core deliverable

A design that stays correct, observable, and provable

Reality check: “Average bandwidth” is not a guarantee

Real-time loss events are typically triggered by worst-case service gaps: arbitration queues, refresh windows, and scrub reads/writes can create long tail latency even when average throughput looks sufficient.

  • Design to a maximum inter-service gap, not just MB/s.
  • Budget and verify worst-case service time (WCST).

Proof loop: Reliability triad + evidence

Flight readiness is built from three cooperating mechanisms: ECC (local correction), scrubbing (lifetime maintenance), and link integrity (BER/CRC/FEC), all closed by telemetry counters and event logs that prove behavior in test and in flight.

  • Correctable vs uncorrectable errors must be counted and acted on.
  • Links must meet BER targets with margin across temperature and aging.

Typical data path (where reliability lives)

A common spacecraft payload path is: Sensor/FPGA → Memory controller (EDAC/ECC) → Interconnect PHY (SpaceWire/SpaceFibre/LVDS) → Payload computer / recorder. The critical engineering question is not “does it pass data once,” but whether the system remains correct and diagnosable after years of accumulated soft-error exposure and environmental drift.

This page owns Why it matters How it is proven
EDAC/ECC datapath (syndrome, correction, flags) Prevents silent corruption during reads; classifies error severity Error injection + counter/log validation; UE handling checks
Scrubbing policy (quota, priority, escalation) Prevents multi-bit accumulation beyond ECC capability CE-rate sweep; WCST under load; long-run stability tests
PHY integrity (BER/CRC/FEC, margin) Ensures correct delivery over temperature, harness, aging BER tests (PRBS), voltage/temp sweep, stress patterns
Observability (counters, events, timestamps) Makes reliability auditable and actionable in flight Log completeness review; reproducible fault playbooks

Out-of-scope details (not expanded here): TT&C waveform/protocol stacks, spacecraft bus power conversion, and system-level crypto/key lifecycle.

Figure F1 — Memory + Interconnect ownership map (data / control / telemetry)

Requirements decomposition: bandwidth, latency, determinism, lifetime, radiation

This chapter turns “it should be reliable” into measurable targets: throughput, worst-case service gaps, error budgets, margins, and acceptance evidence.

Step 1 — Separate sustained rate from burst behavior

Two numbers are required for each producer: sustained average rate (long-run MB/s) and burst demand (bytes that must be serviced within a bounded window). Burst behavior drives FIFO depth and arbitration pressure; sustained rate drives steady-state utilization and thermal/aging stress.

Producer Average rate Burst size Burst period Max allowed service gap
Payload stream A _____ MB/s _____ MB _____ ms _____ ms
Payload stream B _____ MB/s _____ MB _____ ms _____ ms
Housekeeping / logs _____ KB/s _____ KB _____ s _____ ms

The “max allowed service gap” is the key determinism target: it bounds how long a stream can wait between successful memory service windows without loss.

Step 2 — Budget the worst-case service time (WCST), not just average bandwidth

A memory system can meet average MB/s while still violating real-time constraints when it periodically stops servicing a flow. The worst-case gap typically stacks from: (a) arbitration queueing, (b) refresh windows, and (c) scrub insertion. A flight-oriented requirement should explicitly cap the combined effect.

Determinism requirement (example form)

Max inter-service gap ≤ Stream requirement

  • Arbitration: competing sources can create long tail waiting even if mean service is fast.
  • Refresh: periodic command slots reduce available service windows and create structured gaps.
  • Scrub: lifetime maintenance consumes bandwidth; policy must guarantee bounded interference.
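The stacking argument above can be expressed as a tiny budget check. This is a sketch with illustrative numbers (not flight data), assuming the three interference sources can stack back-to-back in the worst case; a real analysis would use measured distributions per source.

```python
# Sketch: worst-case inter-service gap budget (illustrative numbers, not flight data).
# Assumes arbitration, refresh, and scrub interference stack additively in the
# worst case -- a conservative simplifying model.

def worst_case_gap_us(arbitration_us, refresh_us, scrub_us):
    """Conservative upper bound: all interference sources stack back-to-back."""
    return arbitration_us + refresh_us + scrub_us

# Hypothetical budget entries
arbitration_us = 12.0   # max queueing delay behind competing masters
refresh_us = 7.8        # longest refresh window blocking this flow
scrub_us = 4.0          # max scrub burst permitted by quota

gap = worst_case_gap_us(arbitration_us, refresh_us, scrub_us)
requirement_us = 30.0   # stream's max allowed inter-service gap

assert gap <= requirement_us, f"WCST budget violated: {gap} us > {requirement_us} us"
print(f"worst-case gap = {gap:.1f} us (requirement {requirement_us:.1f} us)")
```

The point of writing the budget this way is that each contributor becomes a line item that a measurement (WCST run with refresh + scrub enabled) can confirm or refute.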

Step 3 — Define reliability targets as observable metrics (CE/UE + link errors)

Reliability requirements must be written in terms of counters, thresholds, and actions—so they can be verified in test and audited in flight. This keeps ECC and link-integrity features from becoming “checkbox” items without operational meaning.

Metric What it proves Suggested evidence
Correctable errors (CE) rate Soft-error environment + scrub effectiveness CE histogram vs temperature; scrub-rate sweep; stable long-run trend
Uncorrectable errors (UE) Fault containment works and is observable UE injection tests; region retire / degrade path; event logs with context
CRC / FEC / link event counts Interconnect integrity under drift BER tests (PRBS); temp/voltage sweep; margin reports
Worst-case latency / gap Determinism holds under interference Stress run with refresh+scrub enabled at max load; WCST measurement

Step 4 — Constrain temperature/aging effects via explicit margins

Environmental drift should be treated as a margin requirement, not a post-hoc surprise. For memory, this is a timing margin problem; for interconnect, this is an eye/BER margin problem. Requirements should define how margin is demonstrated (sweeps, long runs, and stress patterns).

  • Memory margin: timing closure must remain valid across temperature corners and supply variation.
  • Link margin: BER targets must hold with measured margin over harness/connector variability and drift.

Radiation is treated here as a soft-error driver affecting observable CE/UE behavior and required scrub policy, without expanding into detector physics.

Figure F2 — Throughput vs determinism budget (WCST / service gaps)

Memory choices in space: SDRAM/DDR in a rad-hard context (what changes)

“SDRAM/DDR” in spacecraft design is not just a part number choice. The flight difference is how the memory behaves across temperature and aging, how errors are contained over mission life, and how the design proves correctness with counters, logs, and repeatable tests.

Engineering outcomes: What “space-grade” changes in practice

Selection criteria are dominated by predictability and traceable evidence, not peak throughput. Device and lot variation can shift timing margin and soft-error behavior, so the system must remain stable with guardbands and observable error metrics.

  • Timing margin must hold at temperature corners and supply variation.
  • Error behavior must be bounded using ECC + scrubbing + containment.
  • Traceability (lot, screening, configuration) supports repeatable qualification.

Do not confuse: Component “hardness” vs system “proof”

Flight reliability is rarely achieved by a single “more robust” memory device. It comes from a controller architecture that (1) corrects data on read, (2) prevents multi-bit accumulation via scrubbing, and (3) exposes telemetry and events so behavior is provable.

A practical rule

If it cannot be measured in counters and logs, it cannot be qualified.

Selection checklist (device + integration) — focus on consequences

The items below are written as engineering consequences rather than marketing categories. Each item should map to an acceptance test, a margin report, or a configuration control record.

Decision point Why it changes in space What to prove
Temperature range & drift Timing closure and read/write windows shrink at corners; long mission life magnifies drift sensitivity. Corner characterization; guardbanded timing; stable operation with worst-case patterns.
Lot/batch consistency Variation shifts timing and soft-error behavior; qualification must remain repeatable across procurement lots. Lot traceability; re-validation plan; configurable scrub and thresholds.
Packaging & interconnect parasitics Parasitics alter edge rate and signal integrity; stability depends on real margins, not nominal models. Board SI checks; margin tests; stable read training/guardbands (if used).
ECC support model ECC must align with the datapath width and system fault model; it is part of the architecture. Fault injection; CE/UE classification; no silent corruption under injected faults.
Scrubbing capability Without scrubbing, correctable errors can accumulate into uncorrectable events over long exposures. Scrub-rate sweep; bounded interference (WCST); stable long-run CE trend.

This chapter stays at the memory-control level. It does not expand into spacecraft power architecture, TT&C modulation, or system-level crypto lifecycle.

Why controller strategy matters: writes, refresh, self-refresh (risk points)

Refresh is not the main problem by itself. The risk comes from stacked interference—refresh windows combined with arbitration queueing and scrubbing insertion can create long service gaps and fragile corner behavior if the policy is not explicitly bounded and tested.

  • Write/read turn-around can create tail latency spikes under mixed traffic.
  • Self-refresh entry/exit requires deterministic criteria and recovery validation.
  • Refresh + scrub overlap must be scheduled to avoid unbounded inter-service gaps.
  • Evidence requirement: a measured WCST and logged error trends under stress runs.
Figure F3 — Rad-hard DDR controller deltas (Commodity vs Rad-hard)

Controller micro-architecture: scheduler, timing closure, refresh, buffering

A stable flight memory controller is designed around bounded latency and provable behavior. The goal is to prevent unbounded queueing delays while maintaining data integrity (ECC) and lifetime maintenance (scrubbing) without breaking real-time streams.

Scheduling: Throughput vs determinism (the real trade)

Micro-architecture choices such as open-page vs close-page, bank interleaving, and read/write turn-around have a direct impact on tail latency. A flight-oriented scheduler prioritizes bounded per-flow delay over peak benchmarks.

  • Open-page: higher peak, but can amplify latency spikes on mixed/random traffic.
  • Close-page: more predictable service time under uncertainty.
  • Turn-around: treat read↔write switches as a first-class latency budget item.

Timing closure: Margin as a requirement, not an assumption

DDR timing is only “closed” if it remains valid across corners. Stability comes from guardbanded constraints, measurable margins, and a verification plan that exercises worst-case patterns under temperature and supply variation.

  • Guardband critical timings (tRCD/tRP/tRAS) for corners and drift.
  • Validate with stress patterns and long-duration runs, not only nominal vectors.
  • Link errors and memory errors must be correlated to conditions in logs.

Buffering & backpressure: preventing overflow without hiding failures

FIFO depth and backpressure protocols are not cosmetic implementation details—they define which stream fails first under contention. The design should explicitly size buffers from burst demand and enforce backpressure semantics that preserve determinism for critical flows.

Design element What can go wrong What to enforce
Request queue depth Bursts overflow; arbitration never catches up; dropped frames appear “random”. Derive from burst bytes and max inter-service gap; measure occupancy under stress.
Per-flow isolation Critical streams starve under best-effort traffic; tail latency explodes. QoS classes, weights, or per-flow queues; cap worst-case delay per class.
Backpressure protocol Hidden congestion causes silent loss; “works in lab” but fails at corner loads. Explicit ready/valid or credit semantics; log drops, retries, and saturation.

The acceptance target is not “no congestion ever,” but predictable congestion with bounded impact and observable signals.
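A minimal sketch of the queue-depth derivation noted in the table (bytes accumulated during the worst-case service gap, plus a guardband). The burst rate, gap, and the 1.5x guardband are hypothetical placeholders, not recommended values.

```python
# Sketch: size a request FIFO from burst demand and the max inter-service gap.
# depth = bytes that can arrive while the flow goes unserved, plus a guardband.
# All figures are hypothetical.

def fifo_depth_bytes(burst_rate_mb_s, max_gap_ms, guardband=1.5):
    """Bytes accumulated during the worst-case service gap, with margin."""
    bytes_during_gap = burst_rate_mb_s * 1e6 * (max_gap_ms * 1e-3)
    return int(bytes_during_gap * guardband)

# 400 MB/s burst stream, 0.5 ms max allowed inter-service gap
depth = fifo_depth_bytes(burst_rate_mb_s=400, max_gap_ms=0.5)
print(f"required FIFO depth ~ {depth} bytes")
```

Measured queue occupancy under stress (per the table) is then compared against this derived depth, not against a lab-tuned value.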

Refresh vs scrub arbitration: guaranteeing WCST while maintaining integrity

Refresh and scrubbing both consume memory service slots, but they exist for different reasons: refresh preserves electrical correctness while scrub preserves long-term correctness in the presence of soft errors. A flight controller must implement a policy that bounds interference and provides telemetry that proves the policy is working.

  • Policy knobs: scrub quota, priority levels, escalation triggers based on CE rate.
  • Determinism constraint: cap the maximum injected work between service opportunities for real-time flows.
  • Evidence: measured WCST under max load with refresh+scrub enabled; logs that show counters and state transitions.

Acceptance evidence (what proves “stable”) — minimal, testable set

  • WCST measurement: worst-case inter-service gap remains below stream requirement under stress conditions.
  • Fault injection: injected single-bit and multi-bit scenarios are classified correctly (CE/UE) with no silent corruption.
  • Corner stability: temperature/supply sweeps with worst-case patterns do not produce unexplained resets or drift.
  • Telemetry completeness: counters, timestamps, and events allow post-test and in-flight auditing.
Figure F4 — DDR command pipeline + arbitration (refresh/scrub injection + logging)

ECC/EDAC fundamentals for flight memory: what to implement and why

ECC selection should be written as an engineering contract: what fault patterns are corrected or detected, what latency and bandwidth cost is paid, and what evidence proves “no silent corruption” under stress and fault injection.

Code choice: When SECDED is enough vs when to go stronger

SECDED (single-error correct, double-error detect) is the default baseline for many flight designs because it provides strong protection against common single-bit soft errors with manageable overhead. Stronger schemes become worth paying for when errors show correlation or when the mission cannot tolerate frequent recovery actions triggered by detections.

  • SECDED: best default when single-bit errors dominate and recovery on rare detections is acceptable.
  • DECTED / stronger detection: consider when double-bit events become non-negligible or when deterministic recovery is required.
  • Chipkill-class: worth considering when a single device fault must be contained without turning into a system-level failure (kept concept-level here).

Cost model: What ECC “costs” in a flight budget

ECC cost is not only parity bits. It includes encode/decode latency, correction mux timing, bandwidth overhead, and the observability infrastructure needed for qualification. The budget should include both steady-state and worst-case paths (including correction events).

Cost item Why it matters How to bound it
Parity overhead Reduces usable bandwidth and impacts bus width planning. Document data/parity mapping; verify throughput under worst-case traffic.
Encode/decode latency Changes worst-case service time and tail latency. Measure correction path timing; include in WCST budget.
Correction events Can perturb service slots under heavy CE rates. Stress test at elevated CE rates; confirm bounded behavior.
Telemetry/logging Required for qualification and in-flight auditing. Define counters + event schema; verify completeness.

Datapath placement: where EDAC must live

A flight EDAC implementation is defined by placement in the datapath: encode on writes and decode/correct on reads, with a syndrome computation that feeds both the correction mux and the observability plane (flags, counters, and event logs).

  • Write path: data → encode → (data + parity) → memory
  • Read path: (data + parity) → syndrome → classify → correct mux → data out
  • Observability: per-type counters + timestamped events with address context

The goal is to ensure that every corrected or detected anomaly becomes a measurable, auditable record rather than an invisible internal detail.
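The encode → syndrome → classify → correct flow can be made concrete with a toy extended Hamming(8,4) SECDED code. Real flight datapaths use wider codes (e.g., 72/64), but the classification logic has the same shape. This is an illustrative sketch, not a flight implementation.

```python
# Toy extended Hamming(8,4) SECDED: single-error correct, double-error detect.
# Codeword layout: [p0, p1, p2, d1, p4, d2, d3, d4], where p0 is overall parity.

def encode(data4):
    """data4: [d1, d2, d3, d4] bits -> 8-bit codeword."""
    d1, d2, d3, d4 = data4
    p1 = d1 ^ d2 ^ d4          # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    word7 = [p1, p2, d1, p4, d2, d3, d4]   # positions 1..7
    p0 = 0
    for b in word7:
        p0 ^= b                # overall parity enables double-error detection
    return [p0] + word7

def decode(code8):
    """Returns (classification, data bits); classification in {'OK','CE','DE'}."""
    p0, w = code8[0], list(code8[1:])      # w[i] is codeword position i+1
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s4        # points at the flipped position (1..7)
    overall = p0
    for b in w:
        overall ^= b                       # 0 if overall parity still holds
    if syndrome == 0 and overall == 0:
        cls = "OK"                         # clean read
    elif overall == 1:
        cls = "CE"                         # single-bit error: correctable
        if syndrome != 0:
            w[syndrome - 1] ^= 1           # correct the flipped bit
    else:
        cls = "DE"                         # syndrome set, parity intact: double error
    data = [w[2], w[4], w[5], w[6]]        # extract d1..d4
    return cls, data
```

In a hardware datapath the same syndrome feeds both the correction mux and the flag/counter plane, so a "CE" here corresponds to a corrected read plus a counter increment, and a "DE" to a blocked data path plus an event record.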

Handling strategy: Correctable vs Detectable vs Uncorrectable (minimal action set)

Error classes must map to explicit actions. The “minimal action set” below keeps behavior deterministic, enables qualification, and avoids silent corruption.

Class Minimum action Must log
CE Correct on read; increment counters; track hotspots; optionally raise scrub priority for the region. Address/region ID, timestamp, syndrome summary, corrected-bit count, channel/port, running CE rate.
DE Detect-only: block unsafe data use; trigger recovery action (re-read / re-fetch / isolate) per system policy. Address/region ID, timestamp, syndrome summary, action taken, retry outcome, associated stream/context.
UE Escalate immediately: isolate or retire the fault domain (page/bank/region); enter a known degraded-safe path. Fault domain ID, timestamp, UE count, trigger context, retire/remap result, post-action verification status.

Verification checklist (flight-oriented)

  • Fault injection: single-bit and multi-bit injections are classified correctly; correction never produces silent wrong output.
  • “No silent corruption”: detected errors always create events; corrected reads always increment the correct counters.
  • Budget impact: encode/decode + correction path is included in the WCST and throughput measurements under stress loads.
  • Log completeness: address context + timestamps exist so qualification and in-flight auditing are reproducible.
Figure F5 — EDAC datapath (encode, syndrome, correction mux, flags & counters)

Scrubbing & fault containment: keeping ECC meaningful over mission life

ECC corrects errors when data is accessed. Scrubbing prevents correctable errors from accumulating into uncorrectable events over long exposure. The flight challenge is to control scrub intensity so integrity improves without breaking deterministic latency.

Why scrub exists: Prevent multi-bit accumulation beyond ECC capability

Without scrubbing, latent correctable errors can remain in memory until multiple upsets land in the same codeword, turning what would have been corrected reads into uncorrectable events. Scrubbing periodically reads, corrects, and rewrites memory to keep the latent error inventory low.

  • ECC: fixes the accessed word.
  • Scrub: maintains the whole population of stored words.
  • Containment: isolates regions that repeatedly misbehave.
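A rough Poisson model shows why the scrub interval drives the latent-error inventory: the expected number of codewords collecting two upsets between scrubs grows roughly with the square of the interval. The upset rate below is a placeholder, not a mission number.

```python
# Sketch: expected double-upset codewords between scrubs under a simple Poisson
# upset model. Upset rate is a hypothetical placeholder, not an environment spec.
import math

def expected_double_upsets(n_words, bits_per_word, upset_rate_per_bit_day,
                           scrub_interval_days):
    """Expected number of codewords collecting >= 2 upsets in one scrub interval."""
    lam = bits_per_word * upset_rate_per_bit_day * scrub_interval_days
    # P[k >= 2] = 1 - e^-lam - lam*e^-lam, computed in a numerically stable form
    p_ge2 = -math.expm1(-lam) - lam * math.exp(-lam)
    return n_words * p_ge2

# 1 Gbit array organized as 72-bit codewords; 1e-10 upsets/bit/day (placeholder)
words = (1 << 30) // 72
daily = expected_double_upsets(words, 72, 1e-10, scrub_interval_days=1.0)
weekly = expected_double_upsets(words, 72, 1e-10, scrub_interval_days=7.0)
print(f"expected double-upset words per interval: daily={daily:.2e}, weekly={weekly:.2e}")
```

Because the per-interval expectation scales as the interval squared, scrubbing 7x less often raises the double-upset exposure roughly 49x; this is the quantitative reason "scrub maintains the whole population of stored words".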

Do not break RT: Scrub must be a controlled background workload

Scrubbing consumes service slots and can create latency gaps if unmanaged. A flight implementation should cap interference using explicit quota and gating, and adjust intensity based on measured CE rate trends.

  • Quota: limit scrub work per time window.
  • Timeslice: run only in defined windows or when queue occupancy is safe.
  • Threshold: increase intensity when CE rate rises; back off when stable.
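The quota and threshold knobs above can be sketched as a token bucket whose refill rate escalates with the measured CE rate. Class names, the multiplier, and thresholds are illustrative, not taken from any flight standard.

```python
# Sketch: quota-gated scrub scheduling. A token budget caps scrub reads per
# window (bounding interference), and a CE-rate threshold escalates intensity.
# All names and numbers are illustrative.

class ScrubQuota:
    def __init__(self, tokens_per_window, window_us, elevated_multiplier=4):
        self.base_rate = tokens_per_window   # scrub reads allowed per window
        self.window_us = window_us           # refill period (config only, here)
        self.multiplier = elevated_multiplier
        self.tokens = tokens_per_window
        self.elevated = False

    def on_window(self, ce_rate, ce_threshold):
        """Refill once per window; escalate while the CE-rate trend is high."""
        self.elevated = ce_rate > ce_threshold
        self.tokens = self.base_rate * (self.multiplier if self.elevated else 1)

    def try_scrub(self):
        """Issue one scrub read only if quota remains; otherwise defer."""
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False
```

The determinism link is direct: `tokens_per_window` is the "max injected scrub work between service opportunities" term in the WCST budget, so the escalation path raises integrity maintenance without exceeding a provable interference cap.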

Scrub types (composable policies)

Policy When it helps What to verify
Background / patrol Baseline maintenance across all regions at a low, bounded rate. CE trend stays controlled; WCST remains within budget under load.
On-idle Accelerates maintenance during low traffic to reduce interference risk. Gating logic is correct; no starvation of required patrol coverage.
Region-priority Targets code/data/buffer regions differently based on criticality and error history. Hotspots are detected; elevated regions are treated without global disruption.

Fault containment: when to retire a page/bank/region (concept-level)

Fault containment turns repeated anomalies into bounded, diagnosable actions. The design should define a fault domain granularity and a trigger set that is measurable from counters and events.

  • Hotspot trigger: CE clustering at specific addresses/regions exceeds a threshold over a window.
  • Escalation trigger: repeated detectable events or any uncorrectable event in the same region.
  • Action: retire or remap the region; verify post-action behavior with follow-up reads and counters.
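The hotspot trigger above can be expressed as a sliding-window counter per fault domain. The window length, threshold, and region naming below are hypothetical choices for illustration.

```python
# Sketch: sliding-window CE hotspot detection per region (fault domain).
# Window length and threshold are hypothetical; a flight design would tie them
# to the qualified CE-rate budget.
from collections import defaultdict, deque

class HotspotDetector:
    def __init__(self, window_s=3600.0, threshold=5):
        self.window_s = window_s
        self.threshold = threshold
        self.events = defaultdict(deque)   # region_id -> CE timestamps

    def record_ce(self, region_id, t):
        """Log a correctable error; return True if the region is now a hotspot."""
        q = self.events[region_id]
        q.append(t)
        while q and t - q[0] > self.window_s:
            q.popleft()                    # age out CEs beyond the window
        return len(q) >= self.threshold
```

A True return would drive the escalation path described above (priority scrub, then retire/remap on persistence or any UE), with each decision emitted as a logged event.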

Containment is kept at the memory-controller level here. System-level data recovery policy is intentionally not expanded.

Acceptance evidence (minimal but sufficient)

  • WCST under scrub: worst-case inter-service gap stays below real-time requirements with scrub enabled.
  • CE-controlled trend: long-run tests show scrub reduces accumulated CE inventory and prevents UE growth.
  • Hotspot behavior: induced hotspots trigger priority scrub and/or retire events with correct logs.
  • State audit: state transitions and actions are visible in event records.
Figure F6 — Scrub policy state machine (quota-controlled escalation + retire)
Every state transition (Normal → Elevated → Aggressive → Retire) should emit an event record: state, reason, and region.

Interconnect selection: SpaceWire vs SpaceFibre vs LVDS (engineering tradeoffs)

Interconnect selection should be expressed as a contract: the determinism model, the topology that must be supported, the diagnostic visibility that will be needed in flight, and the complexity cost of implementing redundancy without hidden failure modes.

SpaceWire: Mature networks with predictable behavior and good diagnostics

SpaceWire is often chosen for its well-understood network behavior and operational visibility. It supports routing and multi-node topologies that are practical for distributed avionics and payload subsystems, but sustained high-rate payload streaming can expose throughput ceilings and tail-latency growth under contention.

  • Strength: deterministic-by-design network patterns; straightforward fault localization.
  • Constraint: bandwidth ceiling and contention-driven tail latency near saturation.
  • Reliability hooks: CRC and retry concepts support traceable error accounting (kept concept-level).

SpaceFibre: High-rate trunks with QoS-style isolation for payload streams

SpaceFibre targets high throughput while providing mechanisms that help isolate critical flows from background traffic. For payload recording and high-rate sensor streams, the key value is not only speed, but the ability to maintain service expectations when multiple streams compete.

  • Strength: high-rate backbone; flow isolation concepts (virtual channels / QoS) for stream control.
  • Constraint: higher implementation and qualification complexity; multi-lane alignment can be a risk.
  • Operational focus: budget and prove margin with BER data and logged events.

LVDS: Simple point-to-point links with low protocol overhead (but hidden system costs)

LVDS is a good fit when a design needs simple point-to-point paths with minimal protocol overhead. The tradeoff is that scaling and diagnosing a larger system becomes an engineering task: additional wiring, explicit test modes, and instrumentation are required to keep field support and fault isolation practical.

  • Strength: low overhead, fixed topology, clean timing assumptions for point-to-point paths.
  • Constraint: expansion requires more ports/wires; diagnostics must be designed in.
  • Recommended hooks: loopback, counters, link-health registers, and defined maintenance scripts.

Selection axes: A decision table that can be used in reviews

Axis What is being optimized Typical fit (concept-level)
Bandwidth High-rate payload streams vs moderate-rate distributed traffic. SpaceFibre (high-rate trunk) / SpaceWire (moderate, shared network) / LVDS (fixed P2P)
Determinism Worst-case service time and tail latency under contention. SpaceWire patterns are predictable; SpaceFibre needs QoS discipline; LVDS relies on fixed topology
Topology Multi-node networks, routing, backbone distribution, or pure fanout. SpaceWire (network) / SpaceFibre (trunk + endpoints) / LVDS (P2P fanout)
Diagnostics Fault localization, counters, event logs, and operational scripts. SpaceWire/SpaceFibre usually provide stronger link-level visibility; LVDS needs explicit hooks
Redundancy How hard it is to implement and validate dual paths and failover. Networks require careful failover rules; P2P requires duplicated wiring and clear switchover logic

The “best” choice is the one that makes determinism, diagnostics, and redundancy easiest to prove for the intended data path.

Figure F7 — Topology patterns: SpaceWire network vs SpaceFibre trunk vs LVDS fanout

PHY/link integrity: encoding, CDR, skew, termination, BER and margining

“Works in the lab” often fails in flight due to temperature drift, harness variation, reflections, crosstalk, and aging. Link integrity engineering converts these risks into a measurable margin stack and a qualification script with clear pass/fail evidence.

Physical quantities: What eats margin in real harnesses

Link robustness is governed by a small set of measurable quantities. These should be tracked as a margin stack rather than treated as isolated “signal integrity notes”.

  • Jitter: timing uncertainty that closes sampling/decision windows.
  • Eye opening: combined result of noise, loss, reflections, and crosstalk.
  • Channel loss: attenuation and frequency-dependent behavior of harness + connectors.
  • Reflections: termination and impedance discontinuities that distort edges.
  • Crosstalk: coupling that becomes pattern-sensitive in bundled pairs.

SerDes/CDR: Lock range, jitter transfer, and multi-lane skew

For SerDes-style links, robustness depends on the receiver’s ability to maintain lock and on how jitter is transferred through the clock-data recovery path. For multi-lane operation, lane-to-lane skew and alignment windows become a primary qualification risk.

  • CDR lock: confirm lock acquisition and stability across temperature and voltage.
  • Jitter transfer: ensure the combined Tx + channel spectrum does not overwhelm Rx tolerance.
  • Lane skew: verify alignment margin and alarm behavior under worst harness variation.

Termination: Reflections and coupling boundaries (SI-only)

Termination and impedance control decide whether energy is absorbed or reflected. The engineering goal is not to “follow a rule”, but to ensure reflections do not collapse eye margin under the worst pattern, temperature, and harness tolerance.

  • Differential termination: treat placement and value as part of the margin stack.
  • AC/DC coupling boundary: verify baseline behavior does not trigger pattern-dependent failures.
  • Harness variation: connectors and layout discontinuities must be included in qualification samples.

BER proof: PRBS + stress patterns + environment sweeps

BER evidence should be collected with controlled patterns and stress conditions, and recorded with enough context to reproduce results during qualification and in-flight investigations.

  • Patterns: PRBS + stress patterns that maximize transitions and coupling sensitivity.
  • Sweeps: temperature and voltage sweeps; long-run soak to expose rare events.
  • Records: BER counters, lock-loss events, retrain counts, temperature/voltage tags, timestamps.
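For planning BER campaigns, the standard zero-error confidence bound (the "rule of three" generalized to arbitrary confidence) estimates how many error-free bits must be observed to claim a BER target. This assumes independent errors and is offered as a test-planning aid, not a mandated procedure.

```python
# Sketch: error-free bits required to claim BER <= target at a given confidence.
# Zero-error case only; assumes independent, identically distributed bit errors.
import math

def bits_needed(ber_target, confidence=0.95):
    """Bits to transfer with ZERO observed errors to claim BER <= ber_target.
    Derivation: (1 - ber)^n <= 1 - confidence  =>  n >= -ln(1-confidence)/ber
    (small-ber approximation). confidence=0.95 gives the familiar 'rule of 3'."""
    return -math.log(1.0 - confidence) / ber_target

n = bits_needed(1e-12)            # bits for a 1e-12 BER claim at 95% confidence
seconds = n / 2.5e9               # test duration at a hypothetical 2.5 Gb/s
print(f"{n:.2e} error-free bits -> {seconds / 3600:.2f} h at 2.5 Gb/s")
```

The practical consequence is that tight BER targets force long soak times per corner, which is why sweep plans and long-run soaks need to be scheduled explicitly rather than squeezed in at the end of qualification.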

Engineering closure: turn integrity into a margin contract

A robust link is defined by a bounded margin stack and a qualification script that proves the remaining margin stays positive under worst-case temperature, harness tolerance, and aging assumptions.

Step What is measured Evidence output
Build a margin stack Tx jitter, channel loss/reflect/crosstalk, Rx tolerance/sensitivity, temp/aging derates. “Remaining margin” summary for each harness/temperature corner.
Run BER campaigns PRBS + stress patterns with sweeps and long-run soak. BER counters + lock-loss + event logs with timestamps and conditions.
Define pass/fail Target BER and allowed recovery events (lock-loss/retrain) per mission needs. Clear pass/fail gates suitable for review sign-off.
Figure F8 — Link budget & margin stack (plus BER proof loop)

Board & harness implementation: layout rules that keep margins real

Robust links are built on repeatable physical margins, not optimistic lab setups. This section provides board-and-harness rules that are audit-ready (review checklists) and test-ready (BERT/loopback without breaking flight configuration).

Differential pairs: Minimal rules that prevent most margin loss

  • Reference plane continuity: route on a continuous return plane; avoid crossing plane splits or voids.
  • Impedance discipline: keep pair geometry stable; avoid sudden neck-downs and uncontrolled stubs.
  • Length + skew: match within each lane and between lanes only after the return path is correct.
  • Via strategy: minimize vias; keep transitions symmetric; avoid via stubs and unused layer transitions.
  • Return path: ensure a nearby, uninterrupted return path; do not force long return detours.

Priority order for reviews: return path → termination/stubs → vias → length matching.

Connectors & harness: Where “works in the lab” often breaks

  • Stub control: avoid long branch stubs; treat every branch as a potential reflection source.
  • Connector variation: include connector + harness tolerance in qualification samples.
  • Pair integrity: preserve pairing through the connector (no accidental pair reshuffling).
  • Shield/ground (pointed only): changes in shield bonding can change coupling and eye margin; qualify representative builds.

Harness and connector choices should be treated as part of the link budget, not as “mechanical details”.

Test & observability: Insert BERT/loopback without breaking flight mode

Qualification data must reflect the flight path. Test insertion should be designed as a controlled mode, not an ad-hoc lab hack.

  • Loopback modes: define digital and PHY loopbacks (concept-level) with clear enable/disable controls.
  • Non-invasive points: avoid test fixtures that alter termination or create extra stubs.
  • Lockout for flight: ensure test modes are disabled and verifiable in flight configuration.
  • Logged evidence: record counters, lock-loss events, temperature/voltage tags, timestamps.
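As a concrete illustration of the “lockout for flight” rule, the sketch below gates hypothetical loopback modes behind a flight-configuration flag. The class and mode names are assumptions — real controls are device-specific registers — but the invariant (test modes unreachable and verifiably off in flight configuration) is the point.

```python
from enum import Enum

class LinkMode(Enum):
    FLIGHT = "flight"
    DIGITAL_LOOPBACK = "digital_loopback"
    PHY_LOOPBACK = "phy_loopback"

class LinkController:
    """Sketch of a controlled test-mode gate with a flight lockout (illustrative names)."""

    def __init__(self, flight_config: bool):
        # flight_config would be asserted by a flight strap/fuse, not software alone.
        self.flight_config = flight_config
        self.mode = LinkMode.FLIGHT

    def enter_loopback(self, mode: LinkMode) -> bool:
        # Test modes are refused when the flight lockout is set.
        if self.flight_config:
            return False
        self.mode = mode
        return True

    def verify_lockout(self) -> bool:
        # Evidence hook: in flight config, only FLIGHT mode is reachable.
        return (not self.flight_config) or self.mode == LinkMode.FLIGHT
```

The `verify_lockout` readback is what makes the lockout auditable rather than assumed.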

Common pitfalls Failure patterns that are easy to miss in reviews

Pitfall | Typical symptom | How to catch it
Termination misplaced | Pattern-sensitive errors; BER rises after harness changes or temperature sweeps. | Check the termination location; verify with margin/BER runs using a representative harness.
Plane split crossing | Intermittent lock-loss; “one board works, another fails”. | Review return-path continuity per segment; forbid split crossings.
Long stub / branch | Short tests pass; long-run soak fails; failures cluster on certain patterns. | Enforce stub limits; audit branches at connectors and test headers.
Lane swap not documented | Debug mismatch; logs and scope points do not correlate to lane indices. | Require lane-map documentation, silkscreen labels, and a bring-up checklist.
Figure F9 — Good vs bad differential routing (margin-preserving rules)
The figure contrasts margin-preserving routing (continuous reference plane, matched pair geometry and spacing, short symmetric vias, no long branch stubs, termination at the receiver) with margin-killing routing (plane split crossings, long branch stubs, uncontrolled transitions, misplaced termination).

Redundancy & determinism: dual links, cold/warm spare, graceful degradation

Redundancy should reduce mission risk without creating uncontrolled switching or unpredictable tail latency. The recommended approach is to define (1) redundancy form, (2) trigger windows and debouncing, (3) action priorities, and (4) determinism evidence.

Redundancy forms Pick the form that is easiest to prove

Form | Best fit | Key proof focus
1+1 warm/hot spare | Fast switchover with bounded interruption for critical streams. | Switch-time bound, false-switch rate, post-switch stability.
A/B cold spare | Simpler isolation and lower steady-state complexity/power. | Re-initialization sequence, repeatable recovery, stable BER at corners.
Dual active (parallel) | Throughput scaling when aggregation is supported and verifiable. | Ordering/buffering bounds, congestion behavior, tail-latency proof.

The most reliable design is often the one with the simplest, testable failover behavior.

Triggers Use windows + debouncing to avoid flapping

Switching should not be triggered by single spikes. Triggers should be windowed and tied to logged evidence.

  • UE event: immediate escalation to protective action (policy-defined).
  • Link down / loss of lock: switch path with controlled re-sync and verification.
  • BER over threshold: windowed decision; attempt graceful degradation first when permitted.
  • Continuous CRC fail: debounce window; treat bursts differently from sustained failures.

Every trigger should carry: window length, threshold, cooldown, and log fields.
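A minimal sketch of such a windowed, debounced trigger — every parameter named above (window length, threshold, cooldown) appears explicitly, and each firing writes a log entry. Parameter names and the log schema are illustrative assumptions, not a flight standard.

```python
from collections import deque

class WindowedTrigger:
    """Fires only when `threshold` events fall inside a sliding window of
    `window_s` seconds, and respects a cooldown after each firing."""

    def __init__(self, window_s: float, threshold: int, cooldown_s: float):
        self.window_s = window_s
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.events = deque()          # timestamps inside the current window
        self.last_fire = float("-inf")
        self.log = []                  # auditable evidence of every firing

    def record(self, t: float, kind: str) -> bool:
        """Record one error event at time t; return True if the trigger fires."""
        self.events.append(t)
        while self.events and t - self.events[0] > self.window_s:
            self.events.popleft()      # drop events outside the window
        in_cooldown = (t - self.last_fire) < self.cooldown_s
        if len(self.events) >= self.threshold and not in_cooldown:
            self.last_fire = t
            self.log.append({"t": t, "kind": kind, "count": len(self.events),
                             "window_s": self.window_s, "action": "fire"})
            self.events.clear()
            return True
        return False
```

A single spike can never fire the trigger, and the cooldown prevents back-to-back firings (flapping) even under a sustained error burst.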

Graceful degradation Reduce load before switching, when safe

When mission objectives allow, a controlled degradation path can recover margin without immediate topology changes.

  • Reduce rate: lower link rate, frame rate, or stream resolution to restore margin.
  • Switch path: change to the redundant link/network after recovery checks.
  • Retire path: lock out unstable paths to prevent oscillation; require explicit re-qualification.
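The ladder above (reduce rate → switch path → retire path) can be expressed as a small decision function. The state flags here are hypothetical telemetry summaries, not a defined interface:

```python
def next_action(state: dict) -> str:
    """Degrade-before-switch action ladder (illustrative sketch).

    `state` flags are assumed telemetry summaries:
      margin_low, rate_reduction_available, redundant_path_healthy.
    """
    if not state.get("margin_low"):
        return "hold"                  # no action while margin is healthy
    if state.get("rate_reduction_available"):
        return "reduce_rate"           # recover margin without topology change
    if state.get("redundant_path_healthy"):
        return "switch_path"           # change topology only after recovery checks
    return "retire_path"               # lock out the path; require re-qualification
```

Keeping the ladder a pure function of logged state makes the policy replayable: the same telemetry always yields the same action.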

Determinism What must stay bounded under congestion

  • Worst-case service time: bounded “maximum wait” for critical streams.
  • Tail latency: control of worst-percentile latency under contention.
  • Switch interruption: maximum acceptable outage during failover/re-sync.
  • Evidence: timestamped logs, counters, and repeatable stress tests.

Determinism is proven by bounded results at worst corners, not by average throughput numbers.
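As an example of turning logged events into these bounds, the helper below computes the maximum inter-service gap and a tail-latency percentile from timestamped service records. Field names and the percentile choice are assumptions:

```python
def service_metrics(timestamps, latencies, pct=0.999):
    """Worst-case inter-service gap and tail-latency percentile from logs.

    timestamps: service-completion times for one critical stream.
    latencies:  per-request service latencies from the same run.
    """
    ts = sorted(timestamps)
    # The determinism claim rests on the *maximum* gap, not the average rate.
    worst_gap = max((b - a for a, b in zip(ts, ts[1:])), default=0.0)
    ordered = sorted(latencies)
    idx = min(len(ordered) - 1, int(pct * len(ordered)))
    return {"worst_gap": worst_gap, "tail_latency": ordered[idx]}
```

Run this over the combined-stress traces and compare the results against the budgeted bounds; a passing average with a failing worst gap is exactly the case the section warns about.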

Figure F10 — Redundancy decision tree (triggers → actions → logged evidence)
The failover logic is windowed, debounced, and auditable: inputs (error counters, BER window, link status, CRC window) pass through debounce + cooldown into a decision tree (UE event? link down / lock lost? BER over threshold? continuous CRC fails?) that selects an action (switch path, reduce rate, retire path) and writes an event-log entry with reason, action, and outcome.

Validation plan: proving memory + link robustness (fault injection & radiation campaigns)

A credible acceptance plan turns “robust” into auditable evidence: repeatable stimuli, observable counters/logs, and pass/fail gates that hold at corners. This section is written as a Definition-of-Done checklist plus a test matrix that can be signed off by systems, QA, and program teams.

Definition of Done What “done” must prove (not just “runs once”)

  • No silent corruption: injected and natural faults must be surfaced by EDAC/logging paths; data integrity is demonstrably preserved or safely degraded.
  • ECC remains meaningful over mission life: scrubbing meets quota under realistic load; CE/UE behavior matches policy; retirement/lockout is deterministic.
  • Link stability is corner-proof: BER/CRC/lock metrics meet targets across temperature, voltage disturbance, and harness variation.
  • Worst-case latency is bounded: combined stress (load + refresh + scrub + link stress) preserves an upper bound on tail latency for critical traffic.
  • Evidence is replayable: every run produces timestamped logs with condition tags, counter snapshots, actions taken, and outcomes.
Required condition tags: temp · voltage · load · fault-inject · radiation-run-id · counter-snapshot · action/outcome
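One way to make “replayable evidence” concrete is a fixed record schema that is checked before a run is accepted. The field names below mirror the tag list but are illustrative, not a project standard:

```python
from dataclasses import dataclass, asdict

@dataclass
class EvidenceRecord:
    """Minimal replayable-evidence schema (illustrative field names)."""
    timestamp: float
    temp_c: float
    voltage_v: float
    load_phase: str
    fault_inject: str
    radiation_run_id: str
    counter_snapshot: dict
    action: str
    outcome: str

REQUIRED_TAGS = {"timestamp", "temp_c", "voltage_v", "load_phase",
                 "fault_inject", "radiation_run_id", "counter_snapshot",
                 "action", "outcome"}

def is_replayable(rec: EvidenceRecord) -> bool:
    # A run counts as evidence only if every required tag is present and non-empty.
    d = asdict(rec)
    return REQUIRED_TAGS <= set(d) and all(v not in (None, "") for v in d.values())
```

Rejecting incomplete records at ingest time is cheaper than discovering a missing condition tag during post-campaign analysis.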

Layered coverage From unit tests to worst-case mission scheduling

  • L1 — Unit: EDAC encode/decode paths, CE/UE classification, scrub engine accounting, PHY loopback/PRBS/CRC counters.
  • L2 — Subsystem: memory↔link interaction under real traffic; verify that observability stays intact when load varies.
  • L3 — Scenario: scripted worst-case schedule with simultaneous refresh + scrub + peak traffic + link stress; verify tail latency bound and policy behavior.

Each layer must specify: stimulusobservablespass gatelog fields.
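A trivial completeness check for such layer specifications (the keys are illustrative) might look like:

```python
def validate_layer_spec(spec: dict) -> list:
    """Return the missing elements of a test-layer entry.

    Each layer must carry: stimulus, observable, pass gate, log fields.
    Key names are assumptions, not a defined schema.
    """
    required = ("stimulus", "observable", "pass_gate", "log_fields")
    return [k for k in required if not spec.get(k)]
```

An empty return means the layer entry is sign-off ready; anything else names the gap directly in review minutes.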

Outputs What must be recorded and reviewed

  • Memory: CE/UE counters, syndrome classes, affected region/address (policy-defined granularity), scrub progress, retire/lockout events.
  • Link: BER, CRC window stats, lock-loss/re-lock events, recovery time, (optional) eye/jitter snapshots for debug correlation.
  • System: actions taken (reduce rate / switch path / retire path), cooldown/debounce behavior, and post-action stability window.
  • Latency: tail percentiles and worst-case service time under combined stress conditions.

Memory acceptance EDAC injection, scrub quota, UE handling, log integrity

Test item | Method | Pass gate / evidence
Single-bit injection | Inject controlled single-bit faults in representative regions; execute read/write patterns under load. | Correct data, CE counter increments, logged fields complete.
Multi-bit injection | Inject multi-bit faults; verify detect/uncorrectable behavior per policy (isolate/retire/degrade). | No silent failure; UE event logged; policy action is deterministic.
Scrub quota | Run background/on-idle/region-priority scrub modes under multiple load levels. | Quota met; scrub progress logged; latency impact remains bounded.
UE policy consistency | Force UE conditions and verify consistent action chains and recovery windows. | Action/outcome logged; no oscillation; lockout rules respected.
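The scrub-quota gate (“coverage per time window”) can be checked offline from logged scrub progress. This sketch assumes the scrub engine logs cumulative bytes scrubbed with timestamps; in a real engine the accounting is internal:

```python
def scrub_quota_met(progress_log, window_s, quota_bytes):
    """Verify that every full window covers at least quota_bytes.

    progress_log: sorted list of (timestamp_s, cumulative_bytes_scrubbed).
    """
    last_t = max(t for t, _ in progress_log)
    for (t0, b0) in progress_log:
        if t0 + window_s > last_t:
            break  # partial trailing window: not judged
        # coverage inside [t0, t0 + window_s]
        in_win = [b for (t, b) in progress_log if t0 <= t <= t0 + window_s]
        if max(in_win) - b0 < quota_bytes:
            return False
    return True
```

Running the same check under several load levels is what turns “quota met” from a claim into logged evidence.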

Link acceptance BER sweeps, temperature cycles, supply disturbance, PRBS/loopback

Test item | Method | Pass gate / evidence
BER sweep | PRBS/stress patterns across temperature corners and harness samples; include long-run soak tests. | BER/CRC windows meet targets; artifacts attached per run.
Temperature cycling | Cycle across the expected flight thermal envelope; track lock, BER, and CRC windows continuously. | No uncontrolled lock-loss; recovery time within bound; logs tagged.
Supply disturbance | Apply controlled ripple/droop profiles; verify CDR stability and counter behavior. | Bounded lock-loss behavior; deterministic actions; stable post-window.
Loopback coverage | Verify PHY/digital loopbacks and flight lockout; confirm test mode does not alter termination conditions. | Repeatable results; test mode off in flight config; evidence logged.
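BER pass gates need a statistical footing: with zero observed errors over N bits, the upper confidence bound is BER ≤ -ln(1 - CL)/N (≈ 3/N at 95% confidence, the “rule of 3”); with k observed errors the Poisson upper limit applies. A sketch of that computation, solved by bisection on the Poisson CDF:

```python
import math

def ber_upper_bound(bits_tested: int, errors: int = 0, confidence: float = 0.95) -> float:
    """Poisson-based upper confidence bound on BER from a PRBS run.

    With zero errors this reduces to -ln(1 - CL) / N. Illustrative helper,
    not a lab standard; valid for small error counts.
    """
    def poisson_cdf(k, mu):
        return sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k + 1))

    # Find the smallest mean mu with P(X <= errors | mu) <= 1 - CL.
    lo, hi = 0.0, 10.0 * (errors + 1) + 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if poisson_cdf(errors, mid) > 1 - confidence:
            lo = mid   # mu too small: still too likely to see so few errors
        else:
            hi = mid
    return hi / bits_tested
```

This also answers the soak-duration question: to claim BER < 1e-12 at 95% confidence with zero errors, roughly 3e12 error-free bits must be observed, which fixes the minimum run length at a given line rate.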

Combined worst case Load + refresh + scrub + link stress (tail-latency proof)

The combined scenario must intentionally align “busy moments” to expose the true worst-case service time. The objective is not a high average throughput, but a provable upper bound on tail latency when refresh and scrubbing compete with peak traffic while the link is stressed.

  • Traffic profile: sustained stream + burst overlays (scripted phases with timestamps).
  • Memory activity: refresh operating normally; scrub operating with quota and/or accelerated modes.
  • Link stress: PRBS/stress pattern and corner condition (temperature or supply disturbance).
  • Policy behavior: verify that degrade/switch/retire actions (if enabled) do not cause flapping; cooldown/debounce must be observable.
  • Proof: record worst-case service time, tail percentiles, action triggers, and stable recovery windows.

Radiation campaigns (SEE) What to observe (no detector details)

Radiation runs should validate that resilience mechanisms remain testable and deterministic under real upset conditions. The focus is on counters, logs, actions, and recovery bounds.

  • Memory: CE/UE rate vs run segment; syndrome distribution; scrub effectiveness; region retire/lockout events.
  • Link: CRC window anomalies; BER excursions; lock-loss/re-lock frequency; recovery time distribution.
  • System: policy actions taken (reduce/switch/retire), debounce/cooldown compliance, and post-action stability.
  • Logging: every event must include timestamp, run-id tags, condition tags, and counter snapshots for post-analysis replay.

Example equipment (materials list) Typical part numbers used in acceptance labs

Use equivalent instruments if lab standards differ. The list below is included as concrete examples for documentation and procurement checklists.

Category | Example part number | Use in this plan
BERT / PRBS generator & analyzer | M8040A (Keysight) / MP1800A (Anritsu) | PRBS/stress patterns, BER sweeps, long-run soak statistics.
Power disturbance + profiling | N6705C (Keysight) | Controlled droop/ripple profiles and correlated rail telemetry.
Thermal cycling chamber | SE-Series (Thermotron) | Corner temperatures, thermal cycling, long dwell soak tests.
High-speed scope / eye debug (optional) | 86100D (Keysight) / equivalent | Debug correlation (eye/jitter snapshots) when BER anomalies occur.
Figure F11 — Test matrix (stimuli × evidence outputs)
The matrix crosses stimuli (temperature sweep, voltage disturbance, load profile, fault injection, SEE run) against evidence outputs (UE handling, rate actions, BER/CRC windows, tail latency, log integrity, recovery time); each cell defines a repeatable action plus its required log. Pass gate: no silent corruption, bounded tail latency, stable BER/lock, replayable logs.


FAQs (Space Memory & Interconnect)

Practical answers with engineering evidence: counters, logs, margins, and acceptance gates (no cross-topic expansion).

1) Why is ECC alone insufficient without scrubbing over long missions?
ECC corrects isolated faults, but mission-time exposure allows errors to accumulate in the same word/page until they exceed correction strength. Scrubbing prevents “multi-bit build-up” by periodically reading, correcting, and rewriting data before accumulation turns into UE. A robust plan defines scrub modes (background/on-idle/region-priority), a quota, and a proof trail (CE-rate trend, scrub progress, retire events).
Maps to: H2-6 · Intent: ECC vs scrub
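Under a rough independent-Poisson upset model (an illustrative sizing aid, not a radiation analysis), the probability of multi-bit build-up in one word before the next scrub pass can be estimated as follows:

```python
import math

def p_multibit_per_word(lambda_bit_per_s: float, word_bits: int, scrub_period_s: float) -> float:
    """P(>= 2 upsets hit the same word within one scrub period).

    Assumes independent Poisson upsets at lambda_bit_per_s per bit;
    word_bits includes check bits (e.g. 72 for SECDED over 64 data bits).
    """
    mu = lambda_bit_per_s * word_bits * scrub_period_s  # expected upsets per word per period
    return 1.0 - math.exp(-mu) - mu * math.exp(-mu)      # 1 - P(0) - P(1)
```

For small expected counts this is ≈ (λ·n·T)²/2, so halving the scrub period roughly quarters the multi-bit probability — the quantitative reason scrubbing keeps SECDED meaningful over mission life.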
2) How to pick SECDED vs stronger ECC without overpaying bandwidth/latency?
Start from the acceptable UE probability and the maximum allowable tail latency. SECDED is often sufficient when scrubbing is effective and fault rates stay within a controlled envelope. Stronger ECC is justified when CE rates threaten accumulation, when certain regions are mission-critical, or when recovery actions are costly. Compare overhead sources: extra check bits (bandwidth), encode/decode depth (latency), and verification burden (test time and evidence completeness).
Maps to: H2-5 · Intent: ECC selection boundary
3) What counters/telemetry prove memory health in flight?
The minimum proof set is rate-based, not single snapshots: CE count and CE rate (windowed), UE events with region/address tagging (policy granularity), syndrome class distribution, scrub progress (bytes/regions per window), and retire/lockout events. Add condition tags (temperature, voltage state, load phase) and periodic counter snapshots. Health is proven by stable/controlled trends plus deterministic actions when thresholds are crossed.
Maps to: H2-2, H2-5, H2-6 · Intent: health monitoring
4) How does refresh interact with real-time determinism in DDR controllers?
Refresh creates service holes where commands cannot progress, and arbitration/queueing can amplify this into long tail latency even when average bandwidth looks adequate. Determinism requires bounding “worst-case service time” by controlling scheduling (priorities, maximum wait), documenting when refresh can preempt traffic, and proving the bound under stress. The plan must include traces or counters that capture refresh occupancy and worst-case queue delay during peak phases.
Maps to: H2-4 · Intent: refresh latency
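A deliberately pessimistic sketch of the refresh contribution to worst-case service time, assuming a controller that may issue postponed refreshes back-to-back (DDR4 permits postponing up to eight REF commands; tRFC values are device-dependent, e.g. roughly 350 ns for an 8 Gb part):

```python
def worst_refresh_gap_ns(t_rfc_ns: float, postponed_max: int = 8) -> float:
    """Pessimistic refresh-induced service gap: postponed refreshes issued
    back-to-back block the rank for roughly postponed_max * tRFC.
    Sketch only; real controllers may interleave per-bank refresh."""
    return postponed_max * t_rfc_ns

def service_bound_ns(queue_wait_ns: float, t_rfc_ns: float,
                     scrub_burst_ns: float, postponed_max: int = 8) -> float:
    # Simple additive worst-case budget (assumption: no overlap credit between terms).
    return queue_wait_ns + worst_refresh_gap_ns(t_rfc_ns, postponed_max) + scrub_burst_ns
```

Even this crude additive budget makes the section's point visible: a few microseconds of stacked refresh can dominate the tail while average bandwidth still looks healthy.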
5) What is the practical definition of “uncorrectable error” handling?
UE handling is a deterministic chain: detect → classify → log with counter snapshot → execute policy action → verify recovery window. Practical actions include isolating/retiring the affected region, switching to a redundant path, or applying a bounded degrade mode. Success is not “no crash,” but no silent corruption, no uncontrolled oscillation, and replayable evidence: time-stamped UE event, region tag, action/outcome, and post-action stability metrics.
Maps to: H2-5, H2-6, H2-10 · Intent: UE policy
6) How to set scrub rate targets from observed correctable error rate?
Use the observed CE rate as an input to a tiered policy: Normal scrub (baseline quota), Elevated scrub (higher quota when CE rate rises), and Aggressive scrub (temporary escalation when CE bursts appear). Targets should be defined as “coverage per time window” (regions/hour or bytes/hour) and proven under realistic load. Validation requires logging scrub progress, CE-rate windows, and transition events, then confirming tail-latency bounds remain satisfied.
Maps to: H2-6 · Intent: scrub sizing
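The tiered policy in this answer reduces to a threshold map over the windowed CE rate. The numeric thresholds below are placeholders to be derived from the mission error budget:

```python
def scrub_tier(ce_rate_per_hr: float, normal_max: float = 1.0, elevated_max: float = 10.0) -> str:
    """Map a windowed CE rate onto Normal/Elevated/Aggressive scrub tiers.

    Threshold values are illustrative placeholders, not flight numbers.
    """
    if ce_rate_per_hr <= normal_max:
        return "normal"        # baseline quota
    if ce_rate_per_hr <= elevated_max:
        return "elevated"      # higher quota while CE rate is raised
    return "aggressive"        # temporary escalation during CE bursts
```

Tier transitions should themselves be logged events, so the validation run can confirm both the escalation and that tail-latency bounds survive the higher scrub quota.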
7) SpaceWire vs SpaceFibre vs LVDS—what is the simplest decision flow?
Decide by four questions: required sustained/burst throughput, topology (network vs point-to-point), determinism/QoS needs, and diagnostics + redundancy cost. SpaceWire fits mature, deterministic networks with moderate bandwidth. SpaceFibre fits higher-rate trunks and payload streams where virtual channels/QoS matter. LVDS fits simple point-to-point fanout, but scaling and troubleshooting typically require stronger margining and explicit test hooks.
Maps to: H2-7 · Intent: interconnect choice
8) What are the top 5 physical-layer mistakes that kill BER margin?
The most common killers are: wrong termination location, long stubs, broken reference planes, via/stub asymmetry, and poor skew/lane management. Each has a measurable symptom: BER or CRC windows worsen at temperature corners, lock-loss increases under supply ripple, or margin collapses with harness variation. A strong process ties each rule to a verification step (PRBS/BER sweep + tagged logs + bounded recovery behavior).
Maps to: H2-8, H2-9 · Intent: SI pitfalls
9) How to validate a link budget without a full flight harness early on?
Build a conservative equivalent channel that approximates worst-case loss/reflection and validate with margining: PRBS patterns, temperature sweep, and controlled supply disturbance. Add loopback points that preserve electrical conditions (termination and routing) as much as possible. The goal is a replayable evidence set: BER/CRC windows versus conditions, lock-loss/re-lock statistics, and a clear delta plan for when the flight harness arrives (compare channels, re-baseline margins, close gaps).
Maps to: H2-8, H2-9, H2-11 · Intent: early validation
10) When does redundancy increase failure probability due to complexity?
Redundancy can harm reliability when it introduces false switching, oscillation (flapping), hidden priority inversions, or ambiguous fault ownership. A safe design uses windowed triggers (rate + persistence), cooldown/debounce, and a deterministic action ladder (reduce rate → switch path → retire region) with auditable logs. Reliability improves only when the failover policy is provable: bounded recovery time, stable post-action windows, and repeatable evidence across corners.
Maps to: H2-10 · Intent: redundancy tradeoff
11) How to run fault injection for ECC/link errors in FPGA/SoC prototypes?
Use controlled injections that exercise the full evidence chain. For memory: inject single-/multi-bit faults, verify CE/UE classification, counters, and policy actions (isolate/retire/degrade) with tagged logs. For links: induce CRC error windows via PRBS/stress patterns and controlled disturbances, then verify lock-loss handling and recovery timing. The acceptance artifact is a run report: stimulus, condition tags, counter snapshots, action/outcome, and pass/fail gates.
Maps to: H2-11 · Intent: fault injection
12) What is a “done” checklist for Space Memory & Interconnect?
“Done” means: (1) no silent corruption under injection and stress, (2) ECC + scrubbing effectiveness proven by rates and logs, (3) link BER/lock stability proven across corners, (4) combined worst-case tail latency has an upper bound, and (5) evidence is replayable (run-id, condition tags, counter snapshots, actions, outcomes). If any gate fails, the plan must specify the corrective loop and re-test scope.
Maps to: H2-11 · Intent: acceptance criteria