
Anomaly Detection: GNN on Manufacturing Sensor Graphs

Unplanned downtime costs manufacturers $50B annually. Threshold-based monitoring catches obvious failures. Here is how to build a GNN that detects subtle, multi-sensor anomalies hours before they escalate.


TL;DR

  • Manufacturing anomaly detection is a graph problem. Sensors are interconnected: anomalies propagate through physical and causal sensor dependencies before primary sensors breach thresholds.
  • A GNN autoencoder learns to predict normal sensor readings from graph context. High reconstruction error signals anomalies; no labeled data is required.
  • On predictive maintenance benchmarks, GNN autoencoders detect anomalies 2-6 hours earlier than threshold-based systems, with 30-50% fewer false alarms. Inter-sensor dependencies provide early warning signals.
  • The GNN operates on temporal sensor graph snapshots. Anomaly scores update every measurement cycle (seconds to minutes).
  • KumoRFM predicts failure risk with one PQL query, capturing inter-sensor dependencies from your data automatically, without autoencoder training or anomaly threshold tuning.

The business problem

Unplanned downtime costs manufacturers an estimated $50 billion annually. A single hour of downtime on an automotive production line can cost $1-2 million. Predictive maintenance aims to detect equipment degradation before failure, enabling planned repairs during scheduled maintenance windows. The challenge: anomalies often manifest as subtle multi-sensor patterns long before any individual sensor exceeds its threshold.

Why flat ML fails

  • Independent monitoring: Threshold-based systems monitor each sensor independently. A temperature slightly above average, combined with vibration slightly below average, might indicate bearing wear. Neither sensor triggers alone.
  • No causal chains: A pressure drop in Sensor A causes flow changes in Sensor B 30 seconds later. Flat models cannot capture these causal propagation patterns.
  • Late detection: By the time a single sensor breaches its threshold, the fault has often progressed to where unplanned downtime is unavoidable. Early detection requires multi-sensor pattern analysis.
  • No process topology: Sensors on the same machine or same process stage have stronger correlations. The physical topology matters for anomaly interpretation.
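The first two failure modes can be made concrete with a toy two-sensor example (all numbers here are illustrative, not from the article): each reading stays inside its own 3-sigma band, yet the pair jointly violates the learned correlation. A simple joint score such as Mahalanobis distance already separates the cases that per-sensor thresholds miss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated normal operation: temperature and vibration are strongly correlated.
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
normal = rng.multivariate_normal(mean=[70.0, 5.0], cov=cov, size=5000)

# A reading where each sensor is individually in range,
# but the pair violates the correlation (temp up, vibration down).
reading = np.array([71.5, 3.5])

# Per-sensor thresholds (mean +/- 3 std) do not fire:
lo = normal.mean(0) - 3 * normal.std(0)
hi = normal.mean(0) + 3 * normal.std(0)
per_sensor_alarm = bool(np.any((reading < lo) | (reading > hi)))

# A joint (Mahalanobis) score catches it:
mu, inv_cov = normal.mean(0), np.linalg.inv(np.cov(normal.T))
d = reading - mu
mahalanobis = float(np.sqrt(d @ inv_cov @ d))

print(per_sensor_alarm)   # no single sensor breaches its band
print(mahalanobis > 3.0)  # but the joint pattern is far from normal
```

A GNN generalizes this idea from one fixed covariance to learned, topology-aware dependencies across many sensors.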

The relational schema

schema.txt
Node types:
  Sensor   (id, type, unit, normal_range, machine_id)
  Machine  (id, type, age, maintenance_history)
  Process  (id, stage, product_type, cycle_time)

Edge types:
  Sensor  --[on_machine]-->   Machine
  Sensor  --[correlated]-->   Sensor  (pearson_r, lag_seconds)
  Sensor  --[upstream_of]-->  Sensor  (process_flow_order)
  Machine --[in_process]-->   Process

Sensors are connected by physical proximity (same machine), statistical correlation, and process flow order.

PyG architecture: GNN autoencoder for anomaly scoring

anomaly_model.py
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, HeteroConv, Linear

class AnomalyGNN(torch.nn.Module):
    def __init__(self, sensor_dim, hidden_dim=64, heads=4):
        super().__init__()
        self.sensor_lin = Linear(sensor_dim, hidden_dim)
        self.machine_lin = Linear(-1, hidden_dim)

        # Encoder: learn normal patterns from graph context
        self.conv1 = HeteroConv({
            ('sensor', 'correlated', 'sensor'): GATConv(
                hidden_dim, hidden_dim // heads, heads=heads),
            ('sensor', 'upstream_of', 'sensor'): GATConv(
                hidden_dim, hidden_dim // heads, heads=heads),
            ('sensor', 'on_machine', 'machine'): GATConv(
                hidden_dim, hidden_dim // heads, heads=heads,
                add_self_loops=False),  # bipartite edge type: no self-loops
        }, aggr='sum')

        self.conv2 = HeteroConv({
            ('sensor', 'correlated', 'sensor'): GATConv(
                hidden_dim, hidden_dim // heads, heads=heads),
            ('sensor', 'upstream_of', 'sensor'): GATConv(
                hidden_dim, hidden_dim // heads, heads=heads),
        }, aggr='sum')

        # Decoder: predict expected sensor readings
        self.decoder = torch.nn.Sequential(
            Linear(hidden_dim, 32),
            torch.nn.ReLU(),
            Linear(32, sensor_dim),
        )

    def forward(self, x_dict, edge_index_dict):
        x_dict['sensor'] = self.sensor_lin(x_dict['sensor'])
        x_dict['machine'] = self.machine_lin(x_dict['machine'])

        x_dict = {k: F.relu(v) for k, v in
                  self.conv1(x_dict, edge_index_dict).items()}
        x_dict = self.conv2(x_dict, edge_index_dict)

        # Reconstruct expected sensor readings
        predicted = self.decoder(x_dict['sensor'])
        return predicted

    def anomaly_score(self, actual, predicted):
        # Reconstruction error = anomaly score
        return (actual - predicted).pow(2).mean(dim=-1)

GNN autoencoder: train on normal data to predict sensor readings from graph context. At inference, high reconstruction error = anomaly. No labeled anomaly data needed.
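The training loop is a standard reconstruction objective. The sketch below shows the shape of it; a stand-in `torch.nn.Linear` and random tensors replace the `AnomalyGNN` and real sensor snapshots (both assumed, not shown) so the loop is self-contained:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(8, 8)                       # stand-in for AnomalyGNN
normal_snapshots = [torch.randn(4, 8) for _ in range(32)]  # known-good windows

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for epoch in range(5):
    for x in normal_snapshots:
        optimizer.zero_grad()
        predicted = model(x)                # reconstruct sensor readings
        loss = F.mse_loss(predicted, x)     # reconstruction error on normal data
        loss.backward()
        optimizer.step()

# Calibrate the alert threshold on held-out normal data, e.g. the
# 99th percentile of per-sensor reconstruction error:
model.eval()
with torch.no_grad():
    errors = torch.cat([
        (model(x) - x).pow(2).mean(dim=-1) for x in normal_snapshots])
threshold = torch.quantile(errors, 0.99).item()
```

At inference, any sensor whose `anomaly_score` exceeds `threshold` is flagged; the percentile choice trades off false alarms against detection lead time.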

Expected performance

Anomaly detection is evaluated by precision at a fixed recall and by detection lead time, not AUROC:

  • Threshold monitoring: ~60% precision at 90% recall, detection at threshold breach
  • Isolation Forest (flat features): ~70% precision at 90% recall
  • GNN autoencoder: ~85% precision at 90% recall, 2-6 hours early detection
  • KumoRFM (supervised failures): ~87% precision at 90% recall
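Precision at a fixed recall can be read off a precision-recall curve. A sketch with hypothetical scores and labels (the separation between faults and normals is made up for illustration):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical evaluation set: 1000 windows, 50 true faults.
rng = np.random.default_rng(7)
labels = np.zeros(1000, dtype=int)
labels[:50] = 1
scores = rng.normal(size=1000) + 2.5 * labels  # faults score higher

precision, recall, _ = precision_recall_curve(labels, scores)

# Precision at the operating point closest to 90% recall:
idx = np.argmin(np.abs(recall - 0.90))
print(f"precision @ ~90% recall: {precision[idx]:.2f}")
```

The same procedure applied to each model's anomaly scores yields the comparison above.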

Or use KumoRFM in one line

KumoRFM PQL
PREDICT failure_risk FOR machine
USING sensor, machine, process, reading_history

One PQL query. KumoRFM captures inter-sensor dependencies and temporal patterns for predictive maintenance.

Frequently asked questions

Why use GNNs for manufacturing anomaly detection?

Manufacturing equipment has interconnected sensors: temperature, vibration, pressure, flow rate. An anomaly in one sensor often manifests as subtle changes across correlated sensors before the primary sensor crosses its threshold. GNNs model these inter-sensor dependencies, detecting anomalies earlier than independent sensor monitoring.

What does a sensor graph look like?

Sensors are nodes with time-series features. Edges connect physically related sensors (on the same machine), causally related sensors (upstream/downstream in the process), and statistically correlated sensors. The graph encodes the physical topology of the manufacturing process.
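The `correlated` edges can be derived directly from reading histories: compute pairwise Pearson correlation over a shared window and keep pairs above a chosen cutoff. A minimal sketch on synthetic series (the 0.8 cutoff and three-sensor setup are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=500)
readings = np.stack([
    base + 0.1 * rng.normal(size=500),   # sensor 0
    base + 0.1 * rng.normal(size=500),   # sensor 1 (tracks sensor 0)
    rng.normal(size=500),                # sensor 2 (independent)
])

corr = np.corrcoef(readings)             # pairwise Pearson correlation

# Keep sensor pairs whose |pearson_r| clears the cutoff (self-pairs excluded).
src, dst = np.where((np.abs(corr) > 0.8) & ~np.eye(3, dtype=bool))
edge_index = np.stack([src, dst])        # here: sensors 0 <-> 1 only
pearson_r = corr[src, dst]               # stored as the edge attribute
```

In practice the cutoff, window length, and any lag adjustment (`lag_seconds` in the schema) are tuned per deployment.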

How do you detect anomalies with GNNs (no explicit labels)?

Train the GNN to predict normal sensor readings given the graph context (autoencoder approach). At inference, high reconstruction error indicates an anomaly: the actual reading deviates from what the graph context predicts. This unsupervised approach works without labeled anomaly data.

What is the advantage over threshold-based anomaly detection?

Threshold-based systems flag individual sensors that exceed fixed limits. GNNs detect contextual anomalies: a reading that is within normal range for the sensor but anomalous given what other related sensors are showing. This catches subtle, multi-sensor anomalies hours or days before a threshold breach.

Can KumoRFM detect manufacturing anomalies?

KumoRFM takes your sensor data tables (sensors, readings, machines, maintenance records) and predicts anomaly or failure risk with one PQL query. It captures inter-sensor dependencies and temporal patterns automatically.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.