Remote Telemetry System

Remote Telemetry Dashboard — Motion state, position, velocity, torque analysis with predicted vs actual torque overlay

Context

System: ESP32-S3 based robotic devices (Ceily ceiling robot, Wally wall robot)
Scope: Full-stack telemetry — firmware data collection, cloud ingestion, time-series storage, real-time dashboard
Solo project: Designed and implemented the entire pipeline end-to-end

Core Problem

Devices installed at customer sites had no remote visibility:

When customers reported issues, there was no way to diagnose remotely
Installation support required on-site visits for every problem
Torque anomalies (e.g., furniture friction) went undetected until motor damage occurred

What was needed: A system that streams device telemetry in real time so engineers can monitor, diagnose, and support installations remotely.

Key Insight

Split telemetry across multiple MQTT topics, each with its own sampling rate and retention policy. Motion data needs frequent sampling (every 0.5 s) but only short retention (7 days), while system health needs infrequent sampling (every 30 s) but long retention (1 year). Batch samples on-device before publishing to reduce per-message MQTT overhead.

Approach

1) Architecture

ESP32-S3 (TLS 8883)
    │
    ▼
AWS IoT Core ──── IoT Rule
    │
    ▼
Lambda (influxdb-writer)
    │
    ▼
InfluxDB (Docker, EC2)
    │
    ▼
Grafana Dashboard

The device publishes over MQTT with TLS mutual authentication. AWS IoT Rules route messages to a Lambda function that converts JSON to InfluxDB Line Protocol and writes to the appropriate bucket.

2) Telemetry Topic Design

Topic	Sampling interval	Samples per publish	Retention	Data
`devices/{id}/motion`	0.5 s	10 (one publish every 5 s)	7 days	Position, velocity, torque, model params
`devices/{id}/status`	5 s	1	30 days	Heap, CPU temp, servo state, ToF sensors
`devices/{id}/system`	30 s	1	1 year	Firmware version, uptime, boot count
`devices/{id}/events`	On event	1	90 days	State changes, errors, alerts

Why separate topics? Each data type has a different write frequency, query pattern, and storage cost. InfluxDB sets retention per bucket, so separating topics maps directly onto separate buckets, each with its own retention period.
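
The topic-to-bucket mapping can be sketched as a small lookup. This is an illustration rather than the deployed code: the bucket names and the `route_topic` helper are assumptions, and only the topic layout and retention periods come from the table above.

```python
# Hypothetical mapping: bucket names are illustrative, retention values are from the table.
STREAMS = {
    "motion": {"bucket": "telemetry_motion", "retention_days": 7},
    "status": {"bucket": "telemetry_status", "retention_days": 30},
    "system": {"bucket": "telemetry_system", "retention_days": 365},
    "events": {"bucket": "telemetry_events", "retention_days": 90},
}


def route_topic(topic):
    """Split 'devices/{id}/{stream}' into (device_id, bucket, retention_seconds)."""
    parts = topic.split("/")
    if len(parts) != 3 or parts[0] != "devices" or not parts[1] or parts[2] not in STREAMS:
        raise ValueError(f"unexpected topic: {topic!r}")
    device_id, stream = parts[1], parts[2]
    cfg = STREAMS[stream]
    return device_id, cfg["bucket"], cfg["retention_days"] * 24 * 60 * 60


print(route_topic("devices/ceily-001/motion"))
# ('ceily-001', 'telemetry_motion', 604800)
```

Keeping the mapping in one table means adding a fifth stream is a one-line change, and an unknown topic fails loudly instead of landing in the wrong bucket.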

3) On-Device Data Collection

Motion telemetry is sampled every 0.5 seconds and batched into groups of 10 before publishing:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t timestamp_ms;
    uint8_t  state;           // Motion state enum
    int16_t  position_mm;     // Position in millimeters
    uint8_t  position_pct;    // 0-100%
    int16_t  velocity_x10;    // velocity × 10 (fixed-point)
    int16_t  torque_mnm;      // Torque in milli-Nm
    int16_t  remain_distance; // Remaining distance
    int16_t  command_vel;     // Current command velocity
    uint8_t  speed_control;   // Speed control state
    // Torque model parameters
    uint8_t  torque_index;
    float    theta[4];        // Model coefficients
    float    p[4];            // Estimator covariance terms (uncertainty)
    int16_t  predicted_total; // Predicted torque (mNm)
    int16_t  threshold;       // Applied sensitivity
    bool     is_learning;     // True while the model is still converging
} motion_sample_t;            // ~56 bytes per sample

Batching 10 samples into one MQTT publish cuts per-message overhead by 10× while adding at most about 5 seconds of delay before a sample is sent.
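
The batching logic can be sketched as follows. The firmware itself is C; this Python version only illustrates the buffering behavior, and `SampleBatcher` and its method names are hypothetical.

```python
class SampleBatcher:
    """Collect samples and hand back a full batch once batch_size is reached."""

    def __init__(self, batch_size=10):
        self.batch_size = batch_size
        self._pending = []

    def add(self, sample):
        """Buffer one sample; return the batch to publish, or None if not full yet."""
        self._pending.append(sample)
        if len(self._pending) < self.batch_size:
            return None
        batch, self._pending = self._pending, []
        return batch

    @property
    def pending(self):
        """Number of samples buffered but not yet published."""
        return len(self._pending)


batcher = SampleBatcher(batch_size=10)
published = []
for i in range(25):                       # 25 samples at 0.5 s spacing
    sample = {"timestamp_ms": i * 500, "position_mm": i}
    batch = batcher.add(sample)
    if batch is not None:
        published.append(batch)           # stand-in for one MQTT publish

print(len(published), batcher.pending)    # 2 5
first = published[0]
print(first[-1]["timestamp_ms"] - first[0]["timestamp_ms"])  # 4500
```

The last line shows where the delay bound comes from: the oldest sample in a batch waits 4.5 s for the tenth sample to arrive before anything is published.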

The struct also carries the online-learning parameters of the torque prediction model. The device runs a recursive estimator that fits the theta[4] coefficients to predict expected torque from position and velocity, while p[4] tracks the estimator covariance (uncertainty). During initial operation, is_learning is true and the theta values update rapidly. As the model converges, the p values decay toward zero and predictions stabilize. Streaming these parameters lets engineers verify convergence on the Grafana Learning panel: if theta keeps drifting or p does not decay, the model has not stabilized and the device needs recalibration.
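
To make the convergence behavior concrete, here is a minimal sketch that uses recursive least squares as a stand-in. The source does not name the estimator or its inputs, so the RLS form, the `features` vector, the forgetting factor, and the synthetic data are all assumptions, and a full 4×4 covariance matrix is used where the firmware reports only four p values.

```python
import math


def features(position, velocity):
    """Hypothetical regressor: bias, position, velocity, and their product."""
    return [1.0, position, velocity, position * velocity]


def rls_update(theta, P, x, y, forgetting=1.0):
    """One recursive least squares step. Returns (theta, P, prediction_error)."""
    n = len(theta)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]   # P @ x
    denom = forgetting + sum(x[i] * Px[i] for i in range(n))          # lambda + x' P x
    gain = [v / denom for v in Px]
    error = y - sum(theta[i] * x[i] for i in range(n))                # measured - predicted
    theta = [theta[i] + gain[i] * error for i in range(n)]
    # P <- (P - K x' P) / lambda; P is symmetric, so x' P equals (P x)'
    P = [[(P[i][j] - gain[i] * Px[j]) / forgetting for j in range(n)] for i in range(n)]
    return theta, P, error


true_theta = [2.0, 1.5, 0.8, -0.5]        # synthetic "real" torque model
theta = [0.0] * 4
P = [[1000.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # high initial uncertainty

for k in range(200):
    pos = (k % 50) / 50.0                 # repeated travel from 0 to 1
    vel = math.sin(0.37 * k)              # varying velocity to excite the model
    x = features(pos, vel)
    y = sum(t * xi for t, xi in zip(true_theta, x))   # noise-free measurement
    theta, P, _ = rls_update(theta, P, x, y)

p_diag = [P[i][i] for i in range(4)]
print([round(t, 3) for t in theta])       # approaches true_theta
print([round(d, 4) for d in p_diag])      # shrinks from the initial 1000
```

In this sketch the diagonal of P plays the role of the p values on the Learning panel: it collapses as informative samples arrive. A diagonal that stays large would indicate that the motion is not exciting the model enough to pin the coefficients down.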

4) Cloud Ingestion (Lambda)

The Lambda function supports two protocol versions and routes each message to the appropriate InfluxDB bucket:

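import json

def lambda_handler(event, context):
    # The IoT Rule typically delivers the payload as a dict; accept a JSON string as well.
    message = event if isinstance(event, dict) else json.loads(event)
    ver = message.get('ver', 1)  # messages without a version field are treated as v1

    # parse_protocol_v1/v2 and write_to_influxdb are defined elsewhere in the module.
    if ver >= 2:
        lines, bucket = parse_protocol_v2(message)
    else:
        lines, bucket = parse_protocol_v1(message)

    return write_to_influxdb(lines, bucket)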

Each parser converts JSON samples to InfluxDB Line Protocol with proper tags (serial_number, device_type) and nanosecond timestamps. Protocol v2 adds hardware version tagging and array-based servo/sensor flattening.
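
The conversion step can be sketched as below. This is not the actual parser: the helper names, the example tag values, and the choice of fields are assumptions. Only the Line Protocol conventions (escaped tags, `i` suffix for integers, nanosecond timestamps) and the tag names `serial_number` and `device_type` come from the description above.

```python
def escape_key(value):
    """Escape commas, equals signs and spaces in tag keys, tag values and field keys."""
    return str(value).replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def format_field(value):
    """Render a field value in Line Protocol syntax."""
    if isinstance(value, bool):            # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"                 # trailing 'i' marks an integer field
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'                  # strings are double-quoted


def to_line(measurement, tags, fields, timestamp_ms):
    """Build one Line Protocol line; the device's millisecond timestamp becomes nanoseconds."""
    tag_part = ",".join(f"{escape_key(k)}={escape_key(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{escape_key(k)}={format_field(v)}" for k, v in fields.items())
    return f"{measurement},{tag_part} {field_part} {timestamp_ms * 1_000_000}"


line = to_line(
    "motion",
    {"serial_number": "CL-0001", "device_type": "ceily"},   # example values only
    {"position_mm": 1200, "velocity": 12.5, "is_learning": True},
    1_700_000_000_000,
)
print(line)
# motion,device_type=ceily,serial_number=CL-0001 position_mm=1200i,velocity=12.5,is_learning=true 1700000000000000000
```

Sorting tags by key is optional but matches InfluxDB's recommendation for write performance. The measurement name is assumed to contain no characters that need escaping.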

5) Grafana Dashboard

The dashboard provides four monitoring layers:

Row	Panels	Purpose
Overview	Motion State, Position %, Position mm, Velocity, Remain Distance, Speed Control	At-a-glance device status
Motion Graphs	Position Over Time, Velocity Over Time	Motion profile analysis
Torque Analysis	Torque vs Predicted (with warning/critical thresholds), Current Torque, Deviation	Anomaly detection
Learning	Theta parameters, P parameters, Learning status	Torque model convergence

Torque analysis panels overlay actual torque against predicted values with configurable thresholds (warning: 15 Nm, critical: 20 Nm), enabling immediate visual detection of mechanical anomalies.
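
The same thresholds can be expressed in code. This sketch assumes the warning and critical levels apply to the magnitude of the actual torque and that samples arrive in milli-Nm as in `motion_sample_t`; the function name and return shape are hypothetical.

```python
WARNING_NM = 15.0    # warning threshold from the dashboard configuration
CRITICAL_NM = 20.0   # critical threshold from the dashboard configuration


def classify_torque(torque_mnm, predicted_mnm):
    """Return (severity, deviation_nm) for one sample given in milli-Nm."""
    torque_nm = torque_mnm / 1000.0
    deviation_nm = (torque_mnm - predicted_mnm) / 1000.0   # actual minus predicted
    magnitude = abs(torque_nm)
    if magnitude >= CRITICAL_NM:
        severity = "critical"
    elif magnitude >= WARNING_NM:
        severity = "warning"
    else:
        severity = "ok"
    return severity, deviation_nm


print(classify_torque(16000, 10000))   # ('warning', 6.0)
print(classify_torque(9000, 9500))     # ('ok', -0.5)
```

Returning the deviation alongside the severity mirrors the panel layout, where the Deviation panel sits next to the thresholded torque overlay.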

Tradeoffs

Decision	Rationale	Tradeoff
MQTT over HTTP	Persistent connection, lower overhead for frequent small messages	Requires connection management on ESP32
InfluxDB over DynamoDB	Purpose-built for time-series queries, native Grafana integration	Self-managed EC2 instance instead of serverless
Lambda for ingestion	Serverless scaling, no server to manage for ingestion layer	Cold start latency (acceptable for monitoring)
0.5s motion sampling	Captures full motion profile for torque analysis	~56 bytes × 2/sec ≈ 112 B/s ≈ 6.7 KB/min per device (raw sample size, before JSON encoding)
Batch publish (10 samples)	Reduces MQTT publishes by 10×	Up to 5s delay before data appears
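
A quick back-of-envelope check of the sampling and batching figures, using only numbers quoted in this document. The result covers raw sample bytes only; JSON encoding plus MQTT and TLS framing make the on-wire rate higher.

```python
SAMPLE_BYTES = 56            # approximate per-sample size quoted above
SAMPLE_INTERVAL_S = 0.5      # motion sampling interval
BATCH_SIZE = 10              # samples per MQTT publish

samples_per_min = 60 / SAMPLE_INTERVAL_S               # 120 samples
raw_bytes_per_min = SAMPLE_BYTES * samples_per_min     # 6720 bytes, about 6.7 KB
publishes_per_min = samples_per_min / BATCH_SIZE       # 12, versus 120 unbatched
max_batch_delay_s = BATCH_SIZE * SAMPLE_INTERVAL_S     # 5.0 seconds

print(raw_bytes_per_min, publishes_per_min, max_batch_delay_s)
# 6720.0 12.0 5.0
```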

Results

Remote Diagnostics

Identified furniture-floor friction on a customer device: the Torque Analysis panel showed actual torque exceeding predicted torque by >5 Nm during specific position ranges, which matched the furniture location on the travel path. The pattern was consistent across multiple motion cycles, confirming a mechanical obstruction rather than a transient spike.
Resolved the issue without an on-site visit; previously this would have required dispatching an engineer.
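
The reasoning used in that diagnosis (a deviation that recurs at the same position in every cycle) can be sketched as a small check. The diagnosis itself was made visually in Grafana; the function below, its 100 mm bin width, and the synthetic cycles are assumptions for illustration. Field names follow `motion_sample_t`.

```python
from collections import defaultdict


def recurring_deviation_bins(cycles, limit_mnm=5000, bin_mm=100):
    """Return position bins (start, in mm) where torque exceeds prediction by more
    than limit_mnm in every cycle."""
    hits = defaultdict(set)                       # bin start -> cycles that exceeded there
    for cycle_idx, samples in enumerate(cycles):
        for s in samples:
            deviation = s["torque_mnm"] - s["predicted_total"]
            if deviation > limit_mnm:
                bin_start = (s["position_mm"] // bin_mm) * bin_mm
                hits[bin_start].add(cycle_idx)
    return sorted(b for b, seen in hits.items() if len(seen) == len(cycles))


def make_cycle(transient_at_mm=None):
    """Synthetic travel from 0 to 950 mm with a persistent obstruction at 400-499 mm."""
    samples = []
    for pos in range(0, 1000, 50):
        excess = 6000 if 400 <= pos < 500 else 0  # obstruction: +6 Nm every cycle
        if pos == transient_at_mm:
            excess = 7000                         # one-off spike in this cycle only
        samples.append({"position_mm": pos, "predicted_total": 8000,
                        "torque_mnm": 8000 + excess})
    return samples


cycles = [make_cycle(), make_cycle(transient_at_mm=750), make_cycle()]
print(recurring_deviation_bins(cycles))           # [400]
```

The transient at 750 mm exceeds the limit but appears in only one cycle, so it is filtered out. Only the 400 mm bin recurs in all three, which is the signature of an obstruction on the travel path.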

Operational Coverage

10+ devices monitored in production, scaling as deployment grows
4 telemetry channels with differentiated retention (7 days to 1 year)

System Performance

Motion telemetry: 2 samples/sec per device with 5s batch latency
Device status: continuous monitoring of CPU, memory, servo health, sensor state
Torque model: online learning parameters visible in real time for model convergence verification

Pipeline Validation

End-to-end latency verified by comparing device-side timestamp_ms with InfluxDB write timestamps — confirmed <6s from sampling to dashboard display
Data integrity checked by matching device-side sample counts against InfluxDB point counts over 24-hour windows; no data loss observed under normal MQTT connectivity
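
Both checks reduce to simple arithmetic, sketched here with hypothetical helper names and example timestamps. The latency comparison assumes the device clock is synchronized closely enough that device and server timestamps are comparable.

```python
def latency_seconds(sample_timestamp_ms, write_timestamp_ns):
    """Delay between device-side sampling (ms) and the InfluxDB write (ns)."""
    return (write_timestamp_ns - sample_timestamp_ms * 1_000_000) / 1e9


def expected_samples(window_s, interval_s=0.5):
    """Number of samples a device should produce in a window at a fixed interval."""
    return int(window_s / interval_s)


def loss_ratio(device_count, stored_count):
    """Fraction of device-side samples that never reached the database."""
    if device_count == 0:
        return 0.0
    return (device_count - stored_count) / device_count


# Example values only: a sample written 5.2 s after it was taken.
print(latency_seconds(1_700_000_000_000, 1_700_000_005_200_000_000))  # 5.2
print(expected_samples(24 * 3600))        # 172800 motion samples per device per day
print(loss_ratio(172800, 172800))         # 0.0
```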

Key Takeaway

The critical design choice was separating topics by data characteristics rather than using a single telemetry stream. This maps directly onto InfluxDB retention policies and Grafana query patterns, making the system both cost-efficient (short retention for high-frequency data) and operationally useful (long retention for system health trends). The torque monitoring capability, comparing actual against predicted torque in real time, turned out to be the highest-value feature, catching mechanical issues that would otherwise require physical inspection.