CAN-over-UDP Firmware Architecture Redesign
Redesigned multi-robot firmware from robot-specific CAN translation to unified raw CAN forwarding, reducing firmware binaries from N to 1 across 4-6 robot types
Context
- System: Multi-robot fleet with 4-6 different robot types (exact count varied as models were added/retired)
- Communication stack: ROS PC <-> UDP <-> STM32 Firmware <-> CAN bus <-> Peripheral modules
- Original design: Firmware translated CAN messages differently for each robot type, then sent the result to the PC in a custom UDP format
- Driver implementation: Disinfection module ROS driver (~1026 lines)
Core Problem
The firmware was doing two jobs simultaneously:
- Transport: Moving data between CAN bus and network
- Interpretation: Translating CAN messages according to robot-specific protocols
Each new robot type required firmware modifications, and deployment required flashing different binaries per robot type. The firmware grew with every variant.
Why this was hard: CAN protocols differ per robot type, and it is tempting to handle translation close to the hardware. Doing so produces firmware that must understand every robot variant.
Key Insight
Only interpretation varies per robot type. Transport is universal.
The firmware can be robot-agnostic if it forwards raw CAN frames without interpretation. Robot-specific parsing moves to ROS nodes on the PC, where software updates are easier to deploy and better debugging tools are available via ROS.
Approach
Phase 1: Standardized Transport Layer
Defined a minimal, robot-agnostic packet structure:
    #include <stdint.h>

    typedef struct CAN_PACKET_t {
        uint32_t id;      // CAN arbitration ID
        uint8_t  ide;     // Extended ID flag
        uint8_t  rtr;     // Remote transmission request
        uint8_t  dlc;     // Data length code (0-8)
        uint8_t  data[8]; // Raw payload
    } CAN_PACKET;
Firmware responsibility: forward CAN frames over UDP in this standardized format. No robot-specific logic.
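The source does not state how CAN_PACKET is laid out inside a UDP datagram. As a minimal sketch, assume a fixed 15-byte layout with a little-endian id followed by ide, rtr, dlc, and the 8 data bytes. Writing the bytes explicitly avoids depending on compiler padding or host byte order. The names `kWireSize`, `SerializeCanPacket`, and `DeserializeCanPacket` are hypothetical and used only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Mirrors the CAN_PACKET fields from the text.
struct CAN_PACKET {
    uint32_t id;
    uint8_t ide;
    uint8_t rtr;
    uint8_t dlc;
    uint8_t data[8];
};

// Assumed wire layout: id (4 bytes, little-endian), ide, rtr, dlc, data[8].
constexpr std::size_t kWireSize = 15;

std::array<uint8_t, kWireSize> SerializeCanPacket(const CAN_PACKET& p) {
    assert(p.dlc <= 8);  // the struct carries at most 8 data bytes
    std::array<uint8_t, kWireSize> out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = static_cast<uint8_t>((p.id >> (8 * i)) & 0xFF);
    }
    out[4] = p.ide;
    out[5] = p.rtr;
    out[6] = p.dlc;
    std::memcpy(&out[7], p.data, sizeof(p.data));
    return out;
}

// Returns false for a wrong-sized datagram or an out-of-range dlc.
bool DeserializeCanPacket(const uint8_t* buf, std::size_t len, CAN_PACKET* out) {
    if (buf == nullptr || out == nullptr || len != kWireSize || buf[6] > 8) {
        return false;
    }
    out->id = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        out->id |= static_cast<uint32_t>(buf[i]) << (8 * i);
    }
    out->ide = buf[4];
    out->rtr = buf[5];
    out->dlc = buf[6];
    std::memcpy(out->data, &buf[7], sizeof(out->data));
    return true;
}
```

Under this assumed layout the same two functions could be compiled into both the firmware and the ROS node, so neither side needs robot-specific knowledge to move a frame.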
Phase 2: Robot-Specific ROS Drivers
Robot-specific interpretation moved to ROS nodes on the PC. Example from the disinfection driver:
    // CAN ID range 0x200-0x211 parsed by disinfection ROS node
    void ParseCobiMessage(const CAN_PACKET& packet) {
        switch (packet.id) {
            case 0x201: ParseDustSensor(packet.data); break;
            case 0x202: ParseAirQuality(packet.data); break;
            case 0x211: ParseStatusFlags(packet.data); break;
            // ...
        }
    }
Phase 3: Reliability Patterns
Rate-limited command dispatch: Commands are dispatched at 30 ms intervals to prevent CAN buffer overflow in the firmware. The interval was found empirically by shortening it step by step until the buffer overflowed, then adding a safety margin.
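One way to enforce such an interval is a queue that releases at most one command per period. The sketch below is an assumption about structure, not the driver's actual code: `RateLimitedDispatcher`, `Enqueue`, and `Poll` are hypothetical names, commands are reduced to a bare CAN ID, and the caller passes the current time explicitly so the logic can be tested without a clock.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <deque>

using Millis = std::chrono::milliseconds;

class RateLimitedDispatcher {
public:
    explicit RateLimitedDispatcher(Millis interval) : interval_(interval) {
        assert(interval.count() > 0);
    }

    // Queue a command; here a command is represented only by its CAN ID.
    void Enqueue(uint32_t can_id) { queue_.push_back(can_id); }

    // Pops the next command into *out if the interval has elapsed since the
    // previous send. Returns false when nothing may be sent yet.
    bool Poll(Millis now, uint32_t* out) {
        if (queue_.empty()) return false;
        if (has_sent_ && now - last_sent_ < interval_) return false;
        *out = queue_.front();
        queue_.pop_front();
        last_sent_ = now;
        has_sent_ = true;
        return true;
    }

    std::size_t Pending() const { return queue_.size(); }

private:
    Millis interval_;
    Millis last_sent_{0};
    bool has_sent_ = false;
    std::deque<uint32_t> queue_;
};
```

With a 30 ms interval, a burst of N queued commands drains in roughly (N - 1) x 30 ms, which is the execution delay listed in the tradeoffs.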
Reconciler pattern (Kubernetes-inspired): Three independent reconcilers handle the Fan, Plasma, and UVC subsystems. Each reconciler compares the desired state with the actual state and retries automatically on a mismatch, which replaces retry logic that would otherwise be scattered across the driver.
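The reconciler idea can be sketched as a small class that holds a desired and an actual state and tells the caller when to resend a command. This is a sketch under assumptions: the `Reconciler` template, its method names, and the `FanLevel` enum are hypothetical, and the retry interval is a parameter because the source does not give one.

```cpp
#include <cassert>
#include <chrono>

using Millis = std::chrono::milliseconds;

// Hypothetical state type for the fan subsystem.
enum class FanLevel { Off, Low, Mid, High };

template <typename State>
class Reconciler {
public:
    Reconciler(State initial, Millis retry_interval)
        : desired_(initial), actual_(initial), retry_interval_(retry_interval) {
        assert(retry_interval.count() > 0);
    }

    void SetDesired(State s) {
        desired_ = s;
        attempted_ = false;  // a new target is commanded without waiting
    }

    // Called whenever a status frame reports the subsystem's real state.
    void ReportActual(State s) { actual_ = s; }

    // Returns true when the caller should (re)send the command for desired().
    bool Tick(Millis now) {
        if (actual_ == desired_) {
            attempted_ = false;  // converged; the next mismatch acts at once
            return false;
        }
        if (attempted_ && now - last_attempt_ < retry_interval_) {
            return false;  // give the last command time to take effect
        }
        last_attempt_ = now;
        attempted_ = true;
        return true;
    }

    State desired() const { return desired_; }

private:
    State desired_;
    State actual_;
    Millis retry_interval_;
    Millis last_attempt_{0};
    bool attempted_ = false;
};
```

A driver built this way would create one instance per subsystem, for example `Reconciler<FanLevel> fan(FanLevel::Off, Millis(1000));`, and call `Tick` from its main loop. The 1000 ms value is an assumption, picked only because the tradeoff table mentions latency of a second or more on failures.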
Safety interlocks:
- UVC only activates when not tilted
- Plasma only activates when fan is stable
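The two interlocks above can be expressed as a single filter applied to every actuator request before it is sent. The sketch below assumes hypothetical names (`ModuleStatus`, `ActuatorRequest`, `ApplyInterlocks`) and assumes that tilt and fan stability are already available as booleans; how the driver derives them is not described in the source.

```cpp
#include <cassert>

// Hypothetical snapshot of the inputs the interlocks depend on.
struct ModuleStatus {
    bool tilted;
    bool fan_stable;
};

// Hypothetical on/off request for the two interlocked actuators.
struct ActuatorRequest {
    bool uvc_on;
    bool plasma_on;
};

// Returns the subset of the request that the interlocks permit.
ActuatorRequest ApplyInterlocks(ActuatorRequest requested,
                                const ModuleStatus& status) {
    ActuatorRequest allowed = requested;
    if (status.tilted) {
        allowed.uvc_on = false;  // UVC only activates when not tilted
    }
    if (!status.fan_stable) {
        allowed.plasma_on = false;  // Plasma only activates when fan is stable
    }
    // Defensive checks: the returned request must never violate an interlock.
    assert(!(allowed.uvc_on && status.tilted));
    assert(!(allowed.plasma_on && !status.fan_stable));
    return allowed;
}
```

Keeping the rules in one pure function means each interlock can be unit-tested without hardware and cannot be bypassed by a code path that forgets to check it.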
Connection recovery escalation:
- 10 seconds without connection: stop mission
- 30 seconds without connection: power cycle
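The escalation steps above could be tracked with a small monitor that measures silence since the last received message and reports each step once. The 10 and 30 second thresholds come from the text; everything else (`ConnectionMonitor`, `RecoveryAction`, the fire-once behavior, and the explicit time argument) is an assumption made for illustration.

```cpp
#include <cassert>
#include <chrono>

using Seconds = std::chrono::seconds;

// Ordered so that a higher value means a more severe step.
enum class RecoveryAction { None = 0, StopMission = 1, PowerCycle = 2 };

class ConnectionMonitor {
public:
    // Call whenever any message arrives; resets the escalation.
    void OnMessage(Seconds now) {
        last_seen_ = now;
        level_ = RecoveryAction::None;
    }

    // Returns the action to take now. Each step is reported only once per
    // outage; if polling is sparse, the monitor jumps to the highest step due.
    RecoveryAction Poll(Seconds now) {
        assert(now >= last_seen_);
        const Seconds silent = now - last_seen_;
        RecoveryAction target = RecoveryAction::None;
        if (silent >= Seconds(30)) {
            target = RecoveryAction::PowerCycle;
        } else if (silent >= Seconds(10)) {
            target = RecoveryAction::StopMission;
        }
        if (target > level_) {
            level_ = target;
            return target;
        }
        return RecoveryAction::None;
    }

private:
    Seconds last_seen_{0};
    RecoveryAction level_ = RecoveryAction::None;
};
```

Reporting each step once keeps the caller from issuing repeated power cycles while the link is still down.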
Fan watchdog monitoring:
- RPM targets and tolerances: Low = 900 ± 150, Mid = 1300 ± 150, High = 1600 ± 150
- Warning timeout: 60 seconds
- Error timeout: 120 seconds
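A watchdog using these numbers might look like the sketch below. The RPM targets, the tolerance of 150, and the 60 and 120 second timeouts are taken from the list above. The rest is assumed: the timeouts are interpreted as continuous time spent outside tolerance, the tolerance bounds are treated as inclusive, and the names `FanWatchdog`, `FanHealth`, `TargetRpm`, and `InTolerance` are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>

using Seconds = std::chrono::seconds;

enum class FanLevel { Low, Mid, High };
enum class FanHealth { Ok, Warning, Error };

constexpr int kToleranceRpm = 150;

int TargetRpm(FanLevel level) {
    switch (level) {
        case FanLevel::Low:  return 900;
        case FanLevel::Mid:  return 1300;
        case FanLevel::High: return 1600;
    }
    return 0;  // not reached for valid enum values
}

bool InTolerance(FanLevel level, int rpm) {
    return std::abs(rpm - TargetRpm(level)) <= kToleranceRpm;
}

class FanWatchdog {
public:
    // Feed each RPM reading; returns the health derived from how long the
    // fan has been continuously outside its tolerance band.
    FanHealth Update(FanLevel level, int rpm, Seconds now) {
        assert(rpm >= 0);
        if (InTolerance(level, rpm)) {
            out_of_band_ = false;
            return FanHealth::Ok;
        }
        if (!out_of_band_) {
            out_of_band_ = true;
            out_since_ = now;  // start timing this excursion
        }
        const Seconds bad = now - out_since_;
        if (bad >= Seconds(120)) return FanHealth::Error;
        if (bad >= Seconds(60)) return FanHealth::Warning;
        return FanHealth::Ok;
    }

private:
    bool out_of_band_ = false;
    Seconds out_since_{0};
};
```

A short excursion, such as the ramp while the fan changes speed, stays at `Ok` under this interpretation because the timer resets as soon as a reading falls back inside the band.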
Tradeoffs
| Decision | Rationale | Tradeoff |
|---|---|---|
| Raw CAN forwarding in firmware | Firmware no longer grows per robot type | More complexity on PC side |
| CAN_PACKET as minimal structure | Robot-agnostic transport layer | None significant |
| 30ms command dispatch interval | Prevents CAN message loss from buffer overflow | Commands take longer to execute |
| Reconciler pattern | Auto-retry without scattered retry logic | 1+ second latency on failures |
| Three independent reconcilers | Plasma depends on fan state and UVC depends on tilt, so each subsystem keeps its own safety logic | More code than a single loop |
| Safety interlocks | UVC only when not tilted, Plasma only when fan stable | Reduced operational flexibility |
| Escalating connection recovery | Balance user experience with hardware protection | Aggressive power cycling may disrupt other modules |
Results
- Firmware binaries reduced: 4-6 (one per robot type) to 1 shared binary
- Robot types supported: 4-6 with single firmware
- Hardware-free testing enabled: Unit tests with FakeCobi module validate disinfection driver without physical hardware. FakeCobi simulates CAN responses for all disinfection subsystems (fan status, sensor data, error codes). Test coverage includes: state transitions (off → low → mid → high), error injection (timeout, invalid RPM), and safety interlock triggers (tilt during UVC, fan failure during plasma)
- Command dispatch tuned: 30ms interval prevents message loss in production
Key Takeaway
Separate transport from interpretation. Transport belongs in firmware (close to hardware, resource-constrained). Interpretation belongs in ROS drivers (easier software updates, better debugging tools via ROS). The result: one firmware binary serves all robot types, and robot-specific logic develops independently in software.