
Real Estate Valuation: GNN on Location + Transaction Graphs

The US residential real estate market is worth $47 trillion. Automated valuation models (AVMs) still rely on comparable sales selected by distance. Here is how to build a GNN that values properties using the full spatial and transactional context.

TL;DR

  • Property valuation is a spatial graph regression problem. A property's value depends on its neighborhood: nearby sales, schools, transit, and amenities form a location graph.
  • SAGEConv on the spatial graph aggregates comparable sale prices, neighborhood features, and amenity proximity into property-level value predictions.
  • On property valuation benchmarks, GNNs reduce median absolute error from ~12% (hedonic regression) to ~6-8%. Spatial context from comparable sales and amenities provides the largest lift.
  • Temporal transaction features capture market dynamics: appreciation trends, days-on-market signals, and seasonal patterns.
  • KumoRFM predicts property values with one PQL query, automatically discovering spatial relationships and comparable sale patterns from your real estate data.

The business problem

The US residential real estate market is worth $47 trillion. Accurate property valuation underpins mortgage lending ($2.5T annually), insurance pricing, property tax assessment, and investment decisions. A 5% valuation error on a $500K home is $25K, enough to cause lending losses or mispriced insurance.

Traditional AVMs use hedonic regression: predict price from property features (bedrooms, bathrooms, square footage, lot size, age). They incorporate “comparable sales” as manually selected features. But the selection of comparables is itself the hard problem, and the spatial relationships between properties, schools, and amenities are complex and multidimensional.
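To make the baseline concrete: a hedonic model is just ordinary least squares on property features. A minimal sketch on synthetic data (every coefficient and number below is illustrative, not from a real market):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic property features: beds, baths, sqft, age
n = 500
X = np.column_stack([
    rng.integers(1, 6, n),         # beds
    rng.integers(1, 4, n),         # baths
    rng.uniform(600, 4000, n),     # sqft
    rng.uniform(0, 80, n),         # age in years
])
# Illustrative price: mostly driven by sqft, discounted by age
price = 50_000 + 150 * X[:, 2] - 800 * X[:, 3] + rng.normal(0, 20_000, n)

# Hedonic regression = least squares on the feature matrix plus intercept
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, price, rcond=None)
pred = A @ coef

mdape = np.median(np.abs((pred - price) / price))
print(f"hedonic MdAPE on synthetic data: {mdape:.1%}")
```

Note what is missing: nothing about a property's neighbors enters the fit. That independence assumption is exactly the gap the graph approach closes.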

Why flat ML fails

  • Comparable selection: Flat models use distance-based comparable selection as a preprocessing step. The “right” comparables depend on property type, condition, and market conditions, making rule-based selection suboptimal.
  • No spatial context: A property 0.1 miles from a top-rated school is worth more than one 1.5 miles away. Flat models encode this as “distance_to_school = 0.1” but miss the school quality propagated through the spatial graph.
  • No neighborhood dynamics: Gentrifying neighborhoods show rapid appreciation. The spatial graph captures this through recent comparable sales at increasing prices.
  • Amenity interactions: Near a park is good. Near a park and a transit station is better. The spatial graph naturally captures multi-amenity interactions through message passing.

The relational schema

schema.txt
Node types:
  Property    (id, beds, baths, sqft, lot, year_built, type)
  Neighborhood (id, median_income, crime_rate, walkability)
  School      (id, rating, type, enrollment)
  Amenity     (id, type, quality_score)

Edge types:
  Property --[near]--> Property    (distance_m)
  Property --[in]-->   Neighborhood
  Property --[zoned]--> School     (distance_m)
  Property --[close_to]--> Amenity (distance_m, walk_min)
  Property --[sold]-->  Property   (price, date)  # self-loop with txn

Properties connected by proximity, with neighborhood, school, and amenity context. Transaction edges carry sale prices for comparable-based valuation.

PyG architecture: SAGEConv for spatial valuation

valuation_model.py
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv, HeteroConv, Linear

class ValuationGNN(torch.nn.Module):
    def __init__(self, hidden_dim=128):
        super().__init__()
        # Project every node type into a shared hidden space
        self.property_lin = Linear(-1, hidden_dim)
        self.neighborhood_lin = Linear(-1, hidden_dim)
        self.school_lin = Linear(-1, hidden_dim)
        self.amenity_lin = Linear(-1, hidden_dim)

        # Context relations are reversed so messages flow INTO properties.
        # With the forward direction ('property', 'in', 'neighborhood'),
        # HeteroConv would update neighborhood nodes, and the property
        # embeddings fed to the regressor would never see neighborhood,
        # school, or amenity features. Add the reversed edges to the data
        # with T.ToUndirected() or by hand.
        self.conv1 = HeteroConv({
            ('property', 'near', 'property'): SAGEConv(
                hidden_dim, hidden_dim),
            ('neighborhood', 'rev_in', 'property'): SAGEConv(
                hidden_dim, hidden_dim),
            ('school', 'rev_zoned', 'property'): SAGEConv(
                hidden_dim, hidden_dim),
            ('amenity', 'rev_close_to', 'property'): SAGEConv(
                hidden_dim, hidden_dim),
        }, aggr='mean')

        # Second hop over property proximity: the neighborhood
        # of comparables
        self.conv2 = HeteroConv({
            ('property', 'near', 'property'): SAGEConv(
                hidden_dim, hidden_dim),
        }, aggr='mean')

        self.regressor = torch.nn.Sequential(
            Linear(hidden_dim, 64),
            torch.nn.ReLU(),
            Linear(64, 1),
        )

    def forward(self, x_dict, edge_index_dict):
        x_dict = {
            'property': self.property_lin(x_dict['property']),
            'neighborhood': self.neighborhood_lin(x_dict['neighborhood']),
            'school': self.school_lin(x_dict['school']),
            'amenity': self.amenity_lin(x_dict['amenity']),
        }

        # conv1 pulls context into properties; conv2 spreads it
        # among nearby properties
        x_dict = {k: F.relu(v) for k, v in
                  self.conv1(x_dict, edge_index_dict).items()}
        x_dict = self.conv2(x_dict, edge_index_dict)

        return self.regressor(x_dict['property']).squeeze(-1)

SAGEConv aggregates comparable sales, school quality, and amenity proximity. Two hops capture neighborhood-level context: not just direct comparables but the neighborhood of comparables.

Expected performance

Property valuation is a regression task. The standard metric is Median Absolute Percentage Error (MdAPE):

  • Hedonic regression: ~12% MdAPE
  • LightGBM (flat features): ~9% MdAPE
  • GNN (spatial graph): ~6-7% MdAPE
  • KumoRFM (zero-shot): ~6% MdAPE
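MdAPE is the median of |predicted − actual| / actual; unlike mean APE, it is robust to a handful of badly mispriced outliers. A quick numpy sketch with made-up prices:

```python
import numpy as np

def mdape(pred, actual):
    """Median Absolute Percentage Error."""
    pred, actual = np.asarray(pred), np.asarray(actual)
    return np.median(np.abs((pred - actual) / actual))

actual = np.array([500_000, 350_000, 720_000, 410_000])
pred   = np.array([540_000, 330_000, 700_000, 450_000])
print(f"MdAPE: {mdape(pred, actual):.1%}")  # → MdAPE: 6.9%
```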

Or use KumoRFM in one line

KumoRFM PQL
PREDICT sale_price FOR property
USING property, neighborhood, school, amenity, transaction

One PQL query. KumoRFM discovers spatial relationships and comparable patterns for property valuation.

Frequently asked questions

Why are GNNs better than hedonic pricing models for real estate?

Hedonic models (linear regression on property features) treat each property independently. GNNs model the spatial context: nearby comparable sales, neighborhood amenities, school quality, and transit access. A property's value depends heavily on its graph neighborhood, not just its bedrooms and square footage.

What graph structure represents real estate valuation?

Properties are nodes connected to nearby properties (geographic proximity), neighborhoods, schools, transit stations, and amenities. Transaction edges carry sale prices and dates. The graph captures both the property itself and the spatial context that drives value.

How do you handle spatial relationships in a GNN?

Connect properties within a radius (e.g., 0.5 miles) with edges weighted by distance. This creates a spatial graph where the GNN aggregates information from nearby properties. K-nearest-neighbor graphs (connect each property to its K closest) work well when density varies across areas.
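If torch-cluster is installed, both constructions are one call each with `torch_geometric.nn.radius_graph` and `knn_graph`. A dependency-free sketch in plain PyTorch, on random placeholder coordinates (assumed to be projected x, y positions in meters):

```python
import torch

torch.manual_seed(0)
coords = torch.rand(200, 2) * 5000        # toy (x, y) positions in meters
pairwise = torch.cdist(coords, coords)    # [200, 200] distance matrix

# Radius graph: connect properties within ~800 m (~0.5 miles)
mask = (pairwise < 800.0) & ~torch.eye(200, dtype=torch.bool)
edge_index_radius = mask.nonzero().t()    # shape [2, num_edges]

# KNN graph: each property connects to its 8 nearest neighbors
# (self is always the closest point, so take k+1 and drop column 0)
knn = pairwise.topk(9, largest=False).indices[:, 1:]
src = torch.arange(200).repeat_interleave(8)
edge_index_knn = torch.stack([src, knn.reshape(-1)])

# Distance as an edge weight the GNN can use
edge_dist = pairwise[edge_index_knn[0], edge_index_knn[1]]
```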

Can GNN valuations handle market dynamics?

Yes. Temporal features on transaction edges (sale date, days on market) capture market trends. The GNN learns that recent comparable sales in the neighborhood are more relevant than older ones, and can model appreciation/depreciation trends from the temporal graph structure.
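A sketch of turning a raw transaction into temporal edge features; the 180-day recency half-life is an illustrative choice, not a fitted value:

```python
from datetime import date

def transaction_features(sale_price, sale_date, as_of, half_life_days=180):
    """Edge features for a comparable-sale edge, relative to valuation date."""
    age_days = (as_of - sale_date).days
    # Exponential recency weight: a sale half_life_days old counts half
    recency = 0.5 ** (age_days / half_life_days)
    return {
        'price': sale_price,
        'age_days': age_days,
        'recency_weight': recency,
        'sale_month': sale_date.month,  # seasonal signal
    }

feat = transaction_features(480_000, date(2024, 1, 15), as_of=date(2024, 7, 13))
print(feat['age_days'], round(feat['recency_weight'], 3))  # → 180 0.5
```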

How does KumoRFM handle property valuation?

KumoRFM takes your real estate database (properties, transactions, neighborhoods, amenities) and predicts property values with one PQL query. It automatically discovers spatial relationships and comparable sale patterns.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.