The business problem
Unplanned downtime costs manufacturers an estimated $50 billion annually. A single hour of downtime on an automotive production line can cost $1-2 million. Predictive maintenance aims to detect equipment degradation before failure, enabling planned repairs during scheduled maintenance windows. The challenge: anomalies often manifest as subtle multi-sensor patterns long before any individual sensor exceeds its threshold.
Why flat ML fails
- Independent monitoring: Threshold-based systems monitor each sensor independently. A temperature slightly above average, combined with vibration slightly below average, might indicate bearing wear — yet neither reading alone trips its threshold.
- No causal chains: A pressure drop in Sensor A causes flow changes in Sensor B 30 seconds later. Flat models cannot capture these causal propagation patterns.
- Late detection: By the time a single sensor breaches its threshold, the fault has often progressed to where unplanned downtime is unavoidable. Early detection requires multi-sensor pattern analysis.
- No process topology: Sensors on the same machine or same process stage have stronger correlations. The physical topology matters for anomaly interpretation.
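The "independent monitoring" failure mode can be made concrete with a toy two-sensor example. This is an illustrative sketch (the sensor values and correlation are invented): each reading sits well inside its own 3-sigma band, but the pair violates the learned joint distribution, which a Mahalanobis-distance check catches.

```python
# Hypothetical illustration: two sensors each within their individual
# 3-sigma bands, yet jointly anomalous. Values and correlation are invented.
import numpy as np

rng = np.random.default_rng(0)

# Normal operation: temperature and vibration are positively correlated.
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
normal = rng.multivariate_normal([0.0, 0.0], cov, size=5000)

# Bearing-wear-like reading: temperature slightly high, vibration slightly
# low -- each inside its own band, but against the learned correlation.
reading = np.array([1.5, -1.5])

# Per-sensor thresholding (flat monitoring) sees nothing.
per_sensor_alarm = bool(np.any(np.abs(reading) > 3.0))

# Mahalanobis distance accounts for the joint distribution.
inv_cov = np.linalg.inv(np.cov(normal.T))
mahal = float(np.sqrt(reading @ inv_cov @ reading))

print(per_sensor_alarm)  # False: no individual threshold breached
print(mahal > 3.0)       # True: jointly far outside normal behaviour
```

A GNN generalizes this idea from one hand-specified sensor pair to learned patterns across the whole sensor graph.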
The relational schema
Node types:

```
Sensor  (id, type, unit, normal_range, machine_id)
Machine (id, type, age, maintenance_history)
Process (id, stage, product_type, cycle_time)
```

Edge types:

```
Sensor  --[on_machine]--> Machine
Sensor  --[correlated]--> Sensor   (pearson_r, lag_seconds)
Sensor  --[upstream_of]--> Sensor  (process_flow_order)
Machine --[in_process]--> Process
```

Sensors are connected by physical proximity (same machine), statistical correlation, and process flow order.
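To make the schema concrete, here is a minimal sketch of the `x_dict` / `edge_index_dict` inputs that a PyG hetero model consumes, built from plain tensors (in a real pipeline these would come from a `HeteroData` object). The sizes — 4 sensors on 2 machines, 8 features per sensor — are toy assumptions:

```python
# Sketch: the dict-of-tensors inputs a PyG heterogeneous model's forward()
# expects, mirroring the schema above. All shapes and ids are toy values.
import torch

x_dict = {
    'sensor': torch.randn(4, 8),    # 4 sensors, 8 windowed features each
    'machine': torch.randn(2, 3),   # 2 machines, 3 metadata features each
}

edge_index_dict = {
    # sensors 0,1 sit on machine 0; sensors 2,3 on machine 1
    ('sensor', 'on_machine', 'machine'):
        torch.tensor([[0, 1, 2, 3],
                      [0, 0, 1, 1]]),
    # a statistically correlated pair, stored in both directions
    ('sensor', 'correlated', 'sensor'):
        torch.tensor([[0, 1],
                      [1, 0]]),
    # process flow: sensor 0 is upstream of sensor 2
    ('sensor', 'upstream_of', 'sensor'):
        torch.tensor([[0],
                      [2]]),
}
```

Edge attributes such as `(pearson_r, lag_seconds)` would be stored alongside each edge type in the same fashion.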
PyG architecture: GNN autoencoder for anomaly scoring
```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, HeteroConv, Linear

class AnomalyGNN(torch.nn.Module):
    def __init__(self, sensor_dim, hidden_dim=64, heads=4):
        super().__init__()
        self.sensor_lin = Linear(sensor_dim, hidden_dim)
        self.machine_lin = Linear(-1, hidden_dim)  # lazy: infer machine dim

        # Encoder: learn normal patterns from graph context.
        # add_self_loops=False is required for the bipartite
        # sensor -> machine edge type (and is safest under HeteroConv).
        self.conv1 = HeteroConv({
            ('sensor', 'correlated', 'sensor'): GATConv(
                hidden_dim, hidden_dim // heads, heads=heads,
                add_self_loops=False),
            ('sensor', 'upstream_of', 'sensor'): GATConv(
                hidden_dim, hidden_dim // heads, heads=heads,
                add_self_loops=False),
            ('sensor', 'on_machine', 'machine'): GATConv(
                (hidden_dim, hidden_dim), hidden_dim // heads, heads=heads,
                add_self_loops=False),
        }, aggr='sum')
        self.conv2 = HeteroConv({
            ('sensor', 'correlated', 'sensor'): GATConv(
                hidden_dim, hidden_dim // heads, heads=heads,
                add_self_loops=False),
            ('sensor', 'upstream_of', 'sensor'): GATConv(
                hidden_dim, hidden_dim // heads, heads=heads,
                add_self_loops=False),
        }, aggr='sum')

        # Decoder: predict expected sensor readings
        self.decoder = torch.nn.Sequential(
            Linear(hidden_dim, 32),
            torch.nn.ReLU(),
            Linear(32, sensor_dim),
        )

    def forward(self, x_dict, edge_index_dict):
        x_dict['sensor'] = self.sensor_lin(x_dict['sensor'])
        x_dict['machine'] = self.machine_lin(x_dict['machine'])
        x_dict = {k: F.relu(v) for k, v in
                  self.conv1(x_dict, edge_index_dict).items()}
        x_dict = self.conv2(x_dict, edge_index_dict)
        # Reconstruct expected sensor readings
        return self.decoder(x_dict['sensor'])

    def anomaly_score(self, actual, predicted):
        # Per-sensor reconstruction error = anomaly score
        return (actual - predicted).pow(2).mean(dim=-1)
```

The GNN autoencoder is trained on normal operating data only: it learns to predict each sensor's readings from its graph context. At inference, high reconstruction error signals an anomaly. No labeled anomaly data is needed.
Expected performance
Anomaly detection is measured by precision at a fixed recall and by detection lead time, not AUROC:
- Threshold monitoring: ~60% precision at 90% recall, detection at threshold breach
- Isolation Forest (flat features): ~70% precision at 90% recall
- GNN autoencoder: ~85% precision at 90% recall, 2-6 hours early detection
- KumoRFM (supervised failures): ~87% precision at 90% recall
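The precision-at-fixed-recall numbers above can be computed by sweeping the score threshold and reading off precision at the first operating point that reaches the target recall. A self-contained sketch on synthetic scores and labels (the fault rate and score distributions are invented):

```python
# Sketch: precision at a fixed recall (here 90%) from anomaly scores.
# Labels and scores below are synthetic, for illustration only.
import numpy as np

def precision_at_recall(scores, labels, target_recall=0.9):
    """Sort by descending score, then return precision at the first
    cutoff whose recall reaches target_recall."""
    order = np.argsort(-scores)
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                      # true positives per cutoff
    recall = tp / labels.sum()
    precision = tp / (np.arange(len(labels)) + 1)
    idx = np.searchsorted(recall, target_recall)  # first recall >= target
    return float(precision[idx])

rng = np.random.default_rng(1)
labels = (rng.random(2000) < 0.05).astype(int)   # ~5% fault windows
scores = labels * rng.normal(2.0, 1.0, 2000) + rng.normal(0.0, 1.0, 2000)
print(round(precision_at_recall(scores, labels), 2))
```

A perfect scorer yields precision 1.0 at any recall; the comparison in the list above holds recall fixed at 90% so the methods differ only in false alarms.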
Or use KumoRFM in one line
```
PREDICT failure_risk FOR machine
USING sensor, machine, process, reading_history
```

One PQL query. KumoRFM captures inter-sensor dependencies and temporal patterns for predictive maintenance.