
Tabular Data vs Graph Data: When Flat Tables Work and When You Need Graph Structure

Tabular data treats every row as independent. Graph data captures the connections between rows. The choice depends on where the prediction signal lives: in the entity's own features or in its relationships.


TL;DR

  1. Tabular data: each row is an independent observation with features as columns. Works well for single-table predictions where the target depends only on the entity's own attributes (house price from square footage).
  2. Graph data: entities are connected by relationships. The prediction signal is in the connections: fraud rings, social influence, supply chain cascades. These signals are invisible in flat tables.
  3. The decision criterion: does the prediction depend on relationships between entities? If yes, graph. If only on the entity's own features, tabular may suffice.
  4. On single-table data, XGBoost/LightGBM compete with or beat GNNs. On multi-table relational data, GNNs outperform by 13+ AUROC points. The data type determines the model advantage, not any universal superiority.
  5. Engineering graph features (degree, PageRank) into flat tables captures some graph signal but cannot replace full GNN message passing, which learns relevant multi-hop patterns automatically.

The distinction between tabular and graph data is about where the prediction signal lives. If the signal is in each entity's own features (predict house price from square footage and bedrooms), a flat table is the right representation. If the signal is in the relationships between entities (predict fraud from transaction network topology), you need a graph.

Tabular data: when rows are independent

A tabular dataset treats each row as an independent observation:

  • Each row has a fixed set of features (columns)
  • Rows do not depend on each other
  • The order of rows does not matter
  • Prediction uses only the features of the target row

Classic tabular problems: predicting house prices (features: square footage, bedrooms, location), classifying iris species (features: petal length, width), credit scoring from application data (features: income, employment, credit history).

Gradient-boosted trees (XGBoost, LightGBM, CatBoost) are the state of the art for tabular data. They handle heterogeneous feature types, missing values, and non-linear relationships efficiently. For genuinely single-table problems, they remain extremely competitive.
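The defining property above can be made concrete: a tabular prediction reads only the target row's own columns and never consults another row. A minimal pure-Python sketch (the feature names, rows, and the $200-per-square-foot rule are illustrative, not a real model):

```python
# Each row is an independent observation: a dict of column -> value.
houses = [
    {"sqft": 1200, "bedrooms": 2, "price": 250_000},
    {"sqft": 2400, "bedrooms": 4, "price": 480_000},
    {"sqft": 1800, "bedrooms": 3, "price": 360_000},
]

def predict_price(row):
    """Toy per-square-foot model: uses ONLY the target row's features.

    No other row is consulted -- the defining property of tabular data.
    """
    return row["sqft"] * 200  # assumed $200/sqft, purely illustrative

predictions = [predict_price(row) for row in houses]
```

A gradient-boosted tree learns a far richer function of the row's columns, but the access pattern is the same: one row in, one prediction out.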

Graph data: when relationships carry signal

Graph data explicitly represents relationships between entities:

  • Entities are nodes with features
  • Relationships are edges connecting nodes
  • The prediction depends on the entity's connections, not just its own features
  • Structural patterns (clusters, paths, hubs) carry signal

Graph problems: fraud detection (transaction networks), recommendation (user-item interactions), drug discovery (molecular structure), social analysis (influence networks), supply chain (supplier-manufacturer relationships).
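A minimal way to hold such a graph in plain Python, before reaching for a library (the node IDs, features, and edges are illustrative):

```python
from collections import defaultdict

# Nodes with features (e.g. accounts in a transaction network).
node_features = {
    "a": {"balance": 500.0},
    "b": {"balance": 120.0},
    "c": {"balance": 980.0},
    "d": {"balance": 45.0},
}

# Edges as (source, target) pairs -- relationships between entities.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]

# Undirected adjacency list: the structure a flat table has no column for.
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

neighbours_of_c = sorted(adj["c"])
```

In PyTorch Geometric the same information lives in a `Data` object (node feature matrix plus an edge index), but the conceptual split is identical: features on nodes, signal in edges.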

The hidden graph in enterprise data

Most enterprise data looks tabular but is actually relational. A “customer churn” table with 200 features was derived from 10+ source tables through JOINs and aggregations. The flat table is the result of flattening a graph.

The question is not “is my data tabular or graph?” but “am I losing information by flattening my naturally relational data into a flat table?” When the source data has foreign key relationships between tables, the answer is usually yes.
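The information loss is easy to demonstrate. In the hypothetical sketch below, two accounts produce identical per-account aggregates (three transactions each), yet one sits in a ring whose counterparties also trade with each other, a classic fraud pattern the flat table cannot see (all account names and edges are made up):

```python
# Two accounts, each with three outgoing transactions.
edges = {
    "ring_member": [("ring_member", "r1"), ("ring_member", "r2"),
                    ("ring_member", "r3"),
                    # counterparties also trade with each other:
                    ("r1", "r2"), ("r2", "r3"), ("r3", "r1")],
    "normal_user": [("normal_user", "n1"), ("normal_user", "n2"),
                    ("normal_user", "n3")],
}

def flat_features(account, account_edges):
    """What a GROUP BY + COUNT aggregation would keep after flattening."""
    return {"n_transactions": sum(1 for u, _ in account_edges if u == account)}

def counterparty_links(account, account_edges):
    """Edges among the account's counterparties -- visible only in the graph."""
    partners = {v for u, v in account_edges if u == account}
    return sum(1 for u, v in account_edges if u in partners and v in partners)

# Flat view: the two accounts are indistinguishable.
# Graph view: the ring member's counterparties form a closed loop.
```

Here `flat_features` returns the same value for both accounts, while `counterparty_links` is 3 for the ring member and 0 for the normal user.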

Quantitative comparison

On RelBench (enterprise relational databases):

  • Flat-table LightGBM (expert features): 62.44 avg AUROC
  • GNN on relational graph: 75.83 avg AUROC (+13.4 points)

On Kaggle-style single-table competitions:

  • Gradient-boosted trees win most competitions
  • GNNs provide minimal improvement over tabular models

The pattern is clear: GNNs excel when relational structure exists. On flat single-table data, tabular models are sufficient and often preferable (faster training, simpler deployment, well-understood).

The hybrid path

In practice, the best approach combines both:

  • Use graph representation for the relational source data (preserve multi-table structure)
  • Use tabular-style feature handling for individual node features (numerical normalization, categorical embeddings)
  • Let GNN message passing discover cross-entity patterns while tabular features capture entity-level patterns

This is what relational deep learning does: treat the database as a graph (relational structure) while encoding each node's features with tabular best practices (normalization, embeddings).
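The database-to-graph step can be sketched in a few lines. This is a toy illustration, not the RelBench or PyG pipeline: the table names, columns, and rows are invented, and the min-max scaling stands in for proper tabular feature encoding:

```python
# Two related tables, as a database would store them.
customers = [
    {"customer_id": 1, "region": "EU"},
    {"customer_id": 2, "region": "US"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 30.0},
    {"order_id": 11, "customer_id": 1, "amount": 55.0},
    {"order_id": 12, "customer_id": 2, "amount": 12.5},
]

# Relational view as a graph: each row becomes a node, each
# foreign key becomes an edge between the two node types.
customer_nodes = {c["customer_id"]: c for c in customers}
order_nodes = {o["order_id"]: o for o in orders}
fk_edges = [(o["customer_id"], o["order_id"]) for o in orders]

# Per-node features still get tabular-style treatment; here a
# min-max normalisation of order amounts stands in for that step.
amounts = [o["amount"] for o in orders]
lo, hi = min(amounts), max(amounts)
for o in order_nodes.values():
    o["amount_norm"] = (o["amount"] - lo) / (hi - lo)
```

No JOIN, no aggregation: the multi-table structure survives as typed nodes and foreign-key edges, and message passing can then traverse it.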

Frequently asked questions

When is tabular data sufficient?

Tabular data works well when: (1) the prediction depends only on features of the target entity (predict house price from square footage, bedrooms, location), (2) there is no meaningful relational structure between entities, (3) the data fits naturally in a single table, and (4) gradient-boosted trees (XGBoost, LightGBM) already achieve strong performance. Adding graph structure in these cases provides marginal improvement.

When is graph data essential?

Graph data is essential when: (1) the prediction depends on relationships (fraud rings, social influence, supply chain cascades), (2) data spans multiple related tables (enterprise relational databases), (3) structural patterns carry signal (network topology, community membership), and (4) multi-hop dependencies matter (a customer's risk depends on their counterparties' counterparties). These signals are invisible in flat tables.
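The "counterparties' counterparties" point is a statement about k-hop neighbourhoods. A minimal sketch with an invented four-node transaction chain, where the risk signal only appears at two hops:

```python
from collections import defaultdict

# Illustrative transaction edges: risk sits two hops away from "a".
edges = [("a", "b"), ("b", "c"), ("c", "d")]
risky = {"c"}  # nodes flagged as risky

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def k_hop_neighbours(node, k):
    """All nodes reachable within k hops, excluding the node itself."""
    frontier, seen = {node}, {node}
    for _ in range(k):
        frontier = {w for v in frontier for w in adj[v]} - seen
        seen |= frontier
    return seen - {node}

# "a"'s direct counterparty ("b") is clean, but its counterparty's
# counterparty ("c") is risky -- only the 2-hop view sees this.
one_hop_risk = risky & k_hop_neighbours("a", 1)   # empty
two_hop_risk = risky & k_hop_neighbours("a", 2)   # contains "c"
```

A GNN with two message-passing layers aggregates exactly this 2-hop neighbourhood, which is why the signal is reachable for it and absent from a per-row flat table.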

Can tabular models use graph features?

Yes, you can engineer graph features (degree, PageRank, clustering coefficient) and add them as columns to a flat table. This captures some graph signal but is limited: (1) you must know which graph features matter in advance, (2) fixed features cannot capture complex multi-hop patterns, and (3) the number of possible graph features is infinite. GNNs learn the relevant graph features automatically.
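Splicing one such hand-picked feature into a flat table looks like this (accounts, edges, and the choice of degree are illustrative; real pipelines would use a graph library for PageRank or clustering coefficients):

```python
from collections import Counter

# Flat table of accounts plus a separate edge list.
rows = [{"account": "a", "balance": 500.0},
        {"account": "b", "balance": 120.0},
        {"account": "c", "balance": 980.0}]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "b")]

# Hand-engineered graph feature: node degree, added back as a column
# that a tabular model (XGBoost, LightGBM) can then consume.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

for row in rows:
    row["degree"] = degree[row["account"]]
```

The limitation is visible in the code: this bakes in exactly one fixed structural pattern chosen up front, whereas message passing composes multi-hop patterns during training.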

What about the XGBoost vs GNN debate?

On single-table data, XGBoost/LightGBM often matches or beats GNNs because tabular models are optimized for independent rows with heterogeneous features. On multi-table relational data, GNNs significantly outperform tabular models (13+ AUROC points on RelBench). The debate is about which data type you have, not which model is universally better.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.