Schema-Agnostic Encoding: Processing Any Database Schema Without Architecture Changes

A relational foundation model must work on any database: 3 tables or 50, numerical or categorical, e-commerce or healthcare. Schema-agnostic encoding makes this possible by mapping any schema to a universal representation through column-type encoders and graph topology.


TL;DR

  • Schema-agnostic encoding converts any relational database into a fixed representation format. Numerical columns are normalized, categoricals are embedded, text is tokenized, datetimes are decomposed. All project to the same hidden dimension.
  • The relational structure (tables and foreign keys) maps to graph topology: node types and edge types. A single GNN architecture processes any resulting graph regardless of its specific schema.
  • This enables relational foundation models: pre-train on many databases, fine-tune on new ones. The model learns transferable patterns (recency effects, frequency patterns, monetary correlations) that apply across industries.
  • New schemas require zero architecture changes. New tables become new node types. New columns become new features. New foreign keys become new edges. The same model processes them all.
  • KumoRFM uses schema-agnostic encoding to achieve 76.71 zero-shot AUROC on RelBench databases it has never seen, demonstrating that relational patterns transfer across schemas.

Schema-agnostic encoding lets a single model architecture process any relational database without modification. An e-commerce database with 5 tables and 20 columns uses the same model as a healthcare database with 30 tables and 200 columns. The encoding layer handles the translation: column values are converted to a universal format, and the relational structure becomes graph topology that the GNN processes natively.

The problem: one schema, one model

Traditional ML pipelines are schema-specific. A churn model for an e-commerce database with 7 tables requires a custom feature engineering pipeline: specific SQL queries, specific aggregations, specific column selections. This pipeline does not transfer to a different e-commerce database (different table names, different column sets), let alone a healthcare database.

Schema-agnostic encoding breaks this coupling. The model sees column types and relational structure, not specific table names or column names.

Column-type encoders

Each column type has a universal encoder that maps values to a common hidden dimension:

  • Numerical: normalize (zero mean, unit variance), then linear projection to hidden dimension. Handles prices, amounts, ages, counts.
  • Categorical: learned embedding lookup. Each unique category value maps to a dense vector. Handles status codes, country names, product types.
  • Text: tokenize and encode with a language model. The resulting sentence embedding projects to hidden dimension. Handles descriptions, names, comments.
  • Datetime: decompose into cyclical components (sin/cos of hour, day-of-week, month, day-of-year) plus relative time features (days since epoch, days since row creation). Handles timestamps, dates.
  • Boolean: binary 0/1 projected through a linear layer. Handles flags, indicators.
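The encoders above can be sketched in plain Python. This is a toy illustration, not KumoRFM's implementation: the `project` step stands in for a learned linear layer, the one-hot lookup for a learned embedding table, and `HIDDEN` is an arbitrarily small hidden dimension.

```python
import math

HIDDEN = 4  # toy hidden dimension; real models use something like 128-512

def project(features, dim=HIDDEN):
    # stand-in for a learned nn.Linear: pad/trim raw features to the
    # shared hidden dimension so every column type lands in the same space
    return (features + [0.0] * dim)[:dim]

def encode_numerical(value, mean, std):
    # normalize to zero mean / unit variance, then project
    return project([(value - mean) / std])

def encode_categorical(value, vocab):
    # learned embedding lookup, sketched here as a one-hot over a tiny vocab
    return project([1.0 if value == v else 0.0 for v in vocab])

def encode_datetime(hour, day_of_week):
    # cyclical sin/cos components, so hour 23 sits next to hour 0
    return project([
        math.sin(2 * math.pi * hour / 24), math.cos(2 * math.pi * hour / 24),
        math.sin(2 * math.pi * day_of_week / 7), math.cos(2 * math.pi * day_of_week / 7),
    ])

def encode_boolean(flag):
    # binary 0/1 through the shared projection
    return project([1.0 if flag else 0.0])

# every column type ends up as a vector of the same hidden dimension
row = [
    encode_numerical(49.99, mean=30.0, std=10.0),               # price
    encode_categorical("shipped", ["pending", "shipped", "returned"]),
    encode_datetime(hour=23, day_of_week=6),
    encode_boolean(True),
]
assert all(len(v) == HIDDEN for v in row)
```

The key property is the last assertion: whatever the column type, the encoder output has the same width, so downstream layers never need to know which schema produced it.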

Relational structure as graph topology

The database-to-graph mapping is the second component of schema-agnostic encoding:

  • Each table becomes a node type
  • Each foreign key becomes an edge type
  • The specific number of tables and FKs does not matter: the heterogeneous GNN adapts

A 5-table database produces a graph with 5 node types and one edge type per foreign key. A 30-table database produces 30 node types and correspondingly more edge types. The same GNN architecture processes both because heterogeneous message passing operates per edge type with shared foundational weights.
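The tables-to-node-types and foreign-keys-to-edge-types mapping can be sketched in a few lines of plain Python. The schema below is hypothetical; in PyTorch Geometric, the resulting type lists would correspond to the node types and `(src, relation, dst)` edge type triples of a `HeteroData` object.

```python
# hypothetical 5-table e-commerce schema; the mapping logic itself
# never inspects specific names, so it works for any schema
ecommerce = {
    "tables": ["users", "orders", "items", "products", "reviews"],
    "foreign_keys": [
        ("orders", "user_id", "users"),
        ("items", "order_id", "orders"),
        ("items", "product_id", "products"),
        ("reviews", "user_id", "users"),
        ("reviews", "product_id", "products"),
    ],
}

def schema_to_graph_types(schema):
    # each table becomes a node type
    node_types = list(schema["tables"])
    # each foreign key becomes a typed edge, plus a reverse edge so
    # messages can flow in both directions during message passing
    edge_types = []
    for src, fk_col, dst in schema["foreign_keys"]:
        edge_types.append((src, fk_col, dst))
        edge_types.append((dst, f"rev_{fk_col}", src))
    return node_types, edge_types

node_types, edge_types = schema_to_graph_types(ecommerce)
assert len(node_types) == 5          # 5 tables -> 5 node types
assert len(edge_types) == 10         # 5 FKs -> 5 edge types + 5 reverses
```

A 30-table schema fed through the same function simply yields longer type lists; nothing in the mapping (or in the heterogeneous GNN that consumes it) changes.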

Why transferable patterns exist

Schema-agnostic encoding works because relational patterns are universal across databases:

  • Recency patterns: recent activity predicts future behavior regardless of whether it is purchase recency, login recency, or treatment recency
  • Frequency patterns: high-frequency entities behave differently from low-frequency ones across all domains
  • Monetary/magnitude patterns: value distribution patterns predict outcomes in e-commerce, banking, and healthcare
  • Graph topology patterns: hub nodes, clusters, and bridges have similar semantic meaning across domains (key accounts, fraud rings, connector roles)

A model pre-trained on diverse databases learns these universal patterns. When applied to a new database, it already understands that “a node with decreasing temporal frequency of connected edges” is a signal, regardless of whether those edges are purchases, logins, or hospital visits.
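As a toy illustration of why these patterns transfer (the dates below are invented), the same recency/frequency computation applies unchanged whether the events are purchases, logins, or hospital visits:

```python
from datetime import date

def activity_signals(event_dates, today):
    # domain-agnostic signals: the events could be purchases,
    # logins, or hospital visits -- the computation is identical
    recency = (today - max(event_dates)).days   # days since last event
    frequency = len(event_dates)                # how many events overall
    return recency, frequency

purchases = [date(2024, 1, 5), date(2024, 3, 1)]
logins = [date(2024, 2, 28), date(2024, 3, 10), date(2024, 3, 11)]
today = date(2024, 3, 15)

assert activity_signals(purchases, today) == (14, 2)
assert activity_signals(logins, today) == (4, 3)
```

A foundation model learns the predictive value of such signals once, from many databases, rather than re-deriving them per schema via hand-written SQL.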

KumoRFM: schema-agnostic in practice

KumoRFM is a relational foundation model built on schema-agnostic encoding. Pre-trained on diverse relational databases, it achieves:

  • 76.71 AUROC zero-shot on RelBench databases it has never seen (vs 62.44 for expert-engineered flat-table LightGBM)
  • 81.14 AUROC fine-tuned with minimal adaptation to the target database

The zero-shot result is particularly significant: it means the model's understanding of relational patterns transfers across schemas without any training on the target database.

Frequently asked questions

What is schema-agnostic encoding?

Schema-agnostic encoding is a method for converting any relational database into a format that a single, fixed GNN architecture can process. Regardless of how many tables, columns, or foreign keys the database has, the same model architecture handles it. Numerical columns are normalized, categorical columns are embedded, text columns are tokenized, and the relational structure becomes graph topology. No architecture changes needed.

Why is schema-agnostic encoding important for foundation models?

A relational foundation model (like KumoRFM) must be pre-trained on many databases and fine-tuned on new ones. This requires a single architecture that works across all schemas. Schema-agnostic encoding enables this by providing a universal mapping from any schema to a common representation space. The model learns transferable patterns that apply across databases, industries, and prediction tasks.

How are different column types encoded?

Numerical columns: normalized to zero mean, unit variance, then projected through a linear layer. Categorical columns: mapped to learned embeddings (each unique value gets a vector). Text columns: tokenized and encoded with a language model. Datetime columns: decomposed into components (day-of-week, hour, month) and encoded as cyclical features. Boolean columns: binary 0/1 values. All encodings project to the same hidden dimension.

How does a schema-agnostic model handle a schema it has never seen?

The model has never seen the specific table names, column names, or values. But it has seen the patterns: 'a numerical column with high cardinality connected to a categorical column via a foreign key, with a temporal ordering.' These structural patterns are universal across databases. The column-type encoders handle the values; the GNN handles the relational structure. Together, they process any new schema without modification.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.