Schema-Agnostic Encoding: Processing Any Database Schema Without Architecture Changes

A relational foundation model must work on any database: 3 tables or 50, numerical or categorical, e-commerce or healthcare. Schema-agnostic encoding makes this possible by mapping any schema to a universal representation through column-type encoders and graph topology.


TL;DR

  • Schema-agnostic encoding converts any relational database into a fixed representation format. Numerical columns are normalized, categoricals are embedded, text is tokenized, datetimes are decomposed. All project to the same hidden dimension.
  • The relational structure (tables and foreign keys) maps to graph topology: node types and edge types. A single GNN architecture processes any resulting graph regardless of its specific schema.
  • This enables relational foundation models: pre-train on many databases, fine-tune on new ones. The model learns transferable patterns (recency effects, frequency patterns, monetary correlations) that apply across industries.
  • New schemas require zero architecture changes. New tables become new node types. New columns become new features. New foreign keys become new edges. The same model processes them all.
  • KumoRFM uses schema-agnostic encoding to achieve 76.71 zero-shot AUROC on RelBench databases it has never seen, demonstrating that relational patterns transfer across schemas.

Schema-agnostic encoding lets a single model architecture process any relational database without modification. An e-commerce database with 5 tables and 20 columns uses the same model as a healthcare database with 30 tables and 200 columns. The encoding layer handles the translation: column values are converted to a universal format, and the relational structure becomes graph topology that the GNN processes natively.

The problem: one schema, one model

Traditional ML pipelines are schema-specific. A churn model for an e-commerce database with 7 tables requires a custom feature engineering pipeline: specific SQL queries, specific aggregations, specific column selections. This pipeline does not transfer to a different e-commerce database (different table names, different column sets), let alone a healthcare database.

Schema-agnostic encoding breaks this coupling. The model sees column types and relational structure, not specific table names or column names.

Column-type encoders

Each column type has a universal encoder that maps values to a common hidden dimension:

  • Numerical: normalize (zero mean, unit variance), then linear projection to hidden dimension. Handles prices, amounts, ages, counts.
  • Categorical: learned embedding lookup. Each unique category value maps to a dense vector. Handles status codes, country names, product types.
  • Text: tokenize and encode with a language model. The resulting sentence embedding projects to hidden dimension. Handles descriptions, names, comments.
  • Datetime: decompose into cyclical components (sin/cos of hour, day-of-week, month, day-of-year) plus relative time features (days since epoch, days since row creation). Handles timestamps, dates.
  • Boolean: binary 0/1 projected through a linear layer. Handles flags, indicators.
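The encoders above can be sketched in plain Python. This is a toy illustration, not KumoRFM's implementation: the `project` step stands in for a learned linear layer, the one-hot lookup for a learned embedding table, and `HIDDEN` is an arbitrarily small hidden dimension.

```python
import math

HIDDEN = 4  # toy hidden dimension; real models use something like 128-512

def project(features, dim=HIDDEN):
    # stand-in for a learned nn.Linear: pad/trim raw features to the
    # shared hidden dimension so every column type lands in the same space
    return (features + [0.0] * dim)[:dim]

def encode_numerical(value, mean, std):
    # normalize to zero mean / unit variance, then project
    return project([(value - mean) / std])

def encode_categorical(value, vocab):
    # learned embedding lookup, sketched here as a one-hot over a tiny vocab
    return project([1.0 if value == v else 0.0 for v in vocab])

def encode_datetime(hour, day_of_week):
    # cyclical sin/cos components, so hour 23 sits next to hour 0
    return project([
        math.sin(2 * math.pi * hour / 24), math.cos(2 * math.pi * hour / 24),
        math.sin(2 * math.pi * day_of_week / 7), math.cos(2 * math.pi * day_of_week / 7),
    ])

def encode_boolean(flag):
    # binary 0/1 through the shared projection
    return project([1.0 if flag else 0.0])

# every column type ends up as a vector of the same hidden dimension
row = [
    encode_numerical(49.99, mean=30.0, std=10.0),               # price
    encode_categorical("shipped", ["pending", "shipped", "returned"]),
    encode_datetime(hour=23, day_of_week=6),
    encode_boolean(True),
]
assert all(len(v) == HIDDEN for v in row)
```

The key property is the last assertion: whatever the column type, the encoder output has the same width, so downstream layers never need to know which schema produced it.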

Relational structure as graph topology

The database-to-graph mapping is the second component of schema-agnostic encoding:

  • Each table becomes a node type
  • Each foreign key becomes an edge type
  • The specific number of tables and FKs does not matter: the heterogeneous GNN adapts

A 5-table database produces a graph with 5 node types and one edge type per foreign key. A 30-table database produces 30 node types and correspondingly more edge types. The same GNN architecture processes both because heterogeneous message passing operates per edge type with shared foundational weights.
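The tables-to-node-types and foreign-keys-to-edge-types mapping can be sketched in a few lines of plain Python. The schema below is hypothetical; in PyTorch Geometric, the resulting type lists would correspond to the node types and `(src, relation, dst)` edge type triples of a `HeteroData` object.

```python
# hypothetical 5-table e-commerce schema; the mapping logic itself
# never inspects specific names, so it works for any schema
ecommerce = {
    "tables": ["users", "orders", "items", "products", "reviews"],
    "foreign_keys": [
        ("orders", "user_id", "users"),
        ("items", "order_id", "orders"),
        ("items", "product_id", "products"),
        ("reviews", "user_id", "users"),
        ("reviews", "product_id", "products"),
    ],
}

def schema_to_graph_types(schema):
    # each table becomes a node type
    node_types = list(schema["tables"])
    # each foreign key becomes a typed edge, plus a reverse edge so
    # messages can flow in both directions during message passing
    edge_types = []
    for src, fk_col, dst in schema["foreign_keys"]:
        edge_types.append((src, fk_col, dst))
        edge_types.append((dst, f"rev_{fk_col}", src))
    return node_types, edge_types

node_types, edge_types = schema_to_graph_types(ecommerce)
assert len(node_types) == 5          # 5 tables -> 5 node types
assert len(edge_types) == 10         # 5 FKs -> 5 edge types + 5 reverses
```

A 30-table schema fed through the same function simply yields longer type lists; nothing in the mapping (or in the heterogeneous GNN that consumes it) changes.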

Why transferable patterns exist

Schema-agnostic encoding works because relational patterns are universal across databases:

  • Recency patterns: recent activity predicts future behavior regardless of whether it is purchase recency, login recency, or treatment recency
  • Frequency patterns: high-frequency entities behave differently from low-frequency ones across all domains
  • Monetary/magnitude patterns: value distribution patterns predict outcomes in e-commerce, banking, and healthcare
  • Graph topology patterns: hub nodes, clusters, and bridges have similar semantic meaning across domains (key accounts, fraud rings, connector roles)

A model pre-trained on diverse databases learns these universal patterns. When applied to a new database, it already understands that “a node with decreasing temporal frequency of connected edges” is a signal, regardless of whether those edges are purchases, logins, or hospital visits.
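As a toy illustration of why these patterns transfer (the dates below are invented), the same recency/frequency computation applies unchanged whether the events are purchases, logins, or hospital visits:

```python
from datetime import date

def activity_signals(event_dates, today):
    # domain-agnostic signals: the events could be purchases,
    # logins, or hospital visits -- the computation is identical
    recency = (today - max(event_dates)).days   # days since last event
    frequency = len(event_dates)                # how many events overall
    return recency, frequency

purchases = [date(2024, 1, 5), date(2024, 3, 1)]
logins = [date(2024, 2, 28), date(2024, 3, 10), date(2024, 3, 11)]
today = date(2024, 3, 15)

assert activity_signals(purchases, today) == (14, 2)
assert activity_signals(logins, today) == (4, 3)
```

A foundation model learns the predictive value of such signals once, from many databases, rather than re-deriving them per schema via hand-written SQL.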

KumoRFM: schema-agnostic in practice

KumoRFM is a relational foundation model built on schema-agnostic encoding. Pre-trained on diverse relational databases, it achieves:

  • 76.71 AUROC zero-shot on RelBench databases it has never seen (vs 62.44 for expert-engineered flat-table LightGBM)
  • 81.14 AUROC fine-tuned with minimal adaptation to the target database

The zero-shot result is particularly significant: it means the model's understanding of relational patterns transfers across schemas without any training on the target database.

Frequently asked questions

What is schema-agnostic encoding?

Schema-agnostic encoding is a method for converting any relational database into a format that a single, fixed GNN architecture can process. Regardless of how many tables, columns, or foreign keys the database has, the same model architecture handles it. Numerical columns are normalized, categorical columns are embedded, text columns are tokenized, and the relational structure becomes graph topology. No architecture changes needed.

Why is schema-agnostic encoding important for foundation models?

A relational foundation model (like KumoRFM) must be pre-trained on many databases and fine-tuned on new ones. This requires a single architecture that works across all schemas. Schema-agnostic encoding enables this by providing a universal mapping from any schema to a common representation space. The model learns transferable patterns that apply across databases, industries, and prediction tasks.

How are different column types encoded?

Numerical columns: normalized to zero mean, unit variance, then projected through a linear layer. Categorical columns: mapped to learned embeddings (each unique value gets a vector). Text columns: tokenized and encoded with a language model. Datetime columns: decomposed into components (day-of-week, hour, month) and encoded as cyclical features. Boolean columns: binary 0/1 values. All encodings project to the same hidden dimension.

How does a schema-agnostic model handle a schema it has never seen?

The model has never seen the specific table names, column names, or values. But it has seen the patterns: 'a numerical column with high cardinality connected to a categorical column via a foreign key, with a temporal ordering.' These structural patterns are universal across databases. The column-type encoders handle the values; the GNN handles the relational structure. Together, they process any new schema without modification.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.