Elliptic Bitcoin: Real Financial Fraud Detection on a Transaction Graph

Elliptic Bitcoin is a temporal graph of 203,769 Bitcoin transactions labeled as licit or illicit. It is the benchmark that brought financial fraud detection into the GNN community, demonstrating that transaction network patterns catch fraud that feature-based models miss.

TL;DR

  • Elliptic Bitcoin has 203,769 transaction nodes, 234,355 payment edges, and 165 features. The binary task identifies illicit (fraudulent) transactions under severe class imbalance: illicit transactions are ~2% of all nodes and ~10% of labeled nodes.
  • It is one of the few public, real-world financial transaction graphs. Its 49 timesteps enable temporal fraud detection experiments that static benchmarks cannot support.
  • GCN achieves ~95% AUROC, but the metric that matters in practice is recall at a low false positive rate. Catching 80% of fraud at a false positive rate below 5% is the practical target.
  • Elliptic demonstrates why graph structure matters for fraud: illicit transactions cluster in the payment network, forming patterns invisible to per-transaction feature models.
  • KumoRFM detects fraud at enterprise scale on heterogeneous financial graphs orders of magnitude larger than Elliptic.

Nodes: 203,769
Edges: 234,355
Features: 165
Task: Fraud (binary)

What Elliptic Bitcoin contains

Elliptic Bitcoin is a real Bitcoin transaction graph provided by Elliptic, a blockchain analytics company. The 203,769 nodes represent Bitcoin transactions. The 234,355 directed edges represent payment flows (the output of one transaction is the input to another). Each transaction has 165 features in the PyG version: 93 local features (such as amounts and fees) and 72 aggregated features computed from 1-hop neighbors. The original release counts 166 features because it also includes the time step among the local features.

The graph spans 49 timesteps. Only ~46,000 transactions are labeled: ~4,500 as illicit (associated with ransomware, darknet markets, etc.) and ~42,000 as licit. The remaining ~157,000 are unlabeled. The class imbalance (illicit transactions are ~2% of all nodes and ~10% of labeled nodes) and the label scarcity mirror real fraud detection challenges.
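The two ways of quoting the imbalance can be reconciled with a little arithmetic. The sketch below uses the approximate counts quoted in this section (not exact dataset statistics) and also derives inverse-frequency class weights, one common way to compensate for imbalance in a loss function. The variable names are illustrative.

```python
# Approximate counts quoted in the text; treat them as rounded figures.
n_total = 203_769
n_illicit = 4_500
n_licit = 42_000

n_labeled = n_illicit + n_licit
share_of_all = n_illicit / n_total        # illicit as a share of every node
share_of_labeled = n_illicit / n_labeled  # illicit as a share of labeled nodes

# Inverse-frequency class weights: weight_c = n_labeled / (n_classes * n_c).
weights = {
    "licit": n_labeled / (2 * n_licit),
    "illicit": n_labeled / (2 * n_illicit),
}

print(f"illicit share of all nodes:     {share_of_all:.1%}")
print(f"illicit share of labeled nodes: {share_of_labeled:.1%}")
print(f"class weights: {weights}")
```

Weights like these could be passed to a weighted cross-entropy loss so that each illicit example counts for more than each licit one during training.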

Why Elliptic Bitcoin matters

Before Elliptic, fraud detection research relied on synthetic or tabular datasets. Elliptic provided the first public real-world financial graph, enabling researchers to study how fraud patterns manifest in transaction network topology. The key finding: illicit transactions form clusters in the payment graph. Fraudsters send funds through chains of transactions to launder money, creating distinctive subgraph patterns that per-transaction feature models cannot detect.
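One way to check the clustering claim on any labeled transaction graph is to compare how often the neighbors of illicit nodes are themselves illicit against the same rate for licit nodes. The function below is a minimal sketch, not a method from the Elliptic paper. It assumes labels encoded as 0 = licit, 1 = illicit, 2 = unknown and an edge list in `[sources, targets]` form; the toy graph is invented for illustration.

```python
def illicit_neighbor_rate(edge_index, labels, node_class, illicit=1, unknown=2):
    """Among labeled neighbors of nodes with class `node_class`,
    return the fraction that are illicit."""
    src, dst = edge_index
    illicit_count = total = 0
    # Treat each payment edge as undirected by visiting it in both directions.
    for u, v in list(zip(src, dst)) + list(zip(dst, src)):
        if labels[u] != node_class or labels[v] == unknown:
            continue
        total += 1
        illicit_count += labels[v] == illicit
    return illicit_count / total if total else float("nan")


# Toy chain: three illicit nodes, three licit nodes, one unlabeled node.
labels = [1, 1, 1, 0, 0, 0, 2]
edge_index = [[0, 1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6]]

rate_for_illicit = illicit_neighbor_rate(edge_index, labels, node_class=1)
rate_for_licit = illicit_neighbor_rate(edge_index, labels, node_class=0)
print(rate_for_illicit, rate_for_licit)
```

If illicit transactions cluster, the first rate should sit well above the second, which is the signal a message-passing model can exploit and a per-transaction model cannot see.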

The temporal dimension adds realism. Fraud patterns evolve: as enforcement targets one laundering method, criminals switch to another. Models trained on early timesteps must generalize to new fraud patterns in later timesteps. This temporal shift is the hardest challenge in production fraud detection.
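A temporal split makes this shift measurable: train on early timesteps, evaluate on later ones, and never shuffle across time. The helper below is a minimal sketch under stated assumptions: labels use 2 for unknown, and the default cutoff of 35 (timesteps 1 to 34 for training, 35 to 49 for testing) is an assumed convention that should be treated as a parameter rather than a fixed rule.

```python
def temporal_split(time_steps, labels, first_test_step=35, unknown_label=2):
    """Return (train_idx, test_idx) for labeled nodes, split by timestep."""
    train_idx, test_idx = [], []
    for i, (t, y) in enumerate(zip(time_steps, labels)):
        if y == unknown_label:
            continue  # unlabeled nodes stay in the graph but are never scored
        if t < first_test_step:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx


# Toy example: six nodes spread over early and late timesteps.
time_steps = [1, 10, 34, 35, 40, 49]
labels = [0, 1, 2, 0, 1, 2]
train_idx, test_idx = temporal_split(time_steps, labels)
print(train_idx, test_idx)
```

Unlabeled nodes are excluded from both index lists but can remain in the graph, where they still contribute to message passing.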

Loading Elliptic Bitcoin in PyG

load_elliptic.py
from torch_geometric.datasets import EllipticBitcoinDataset

# The raw files are fetched and processed on first use.
dataset = EllipticBitcoinDataset(root='/tmp/Elliptic')
data = dataset[0]  # the dataset holds a single graph

print(f"Nodes: {data.num_nodes}")       # 203769
print(f"Edges: {data.num_edges}")       # 234355
print(f"Features: {data.num_features}") # 165

# PyG encodes labels as 0 = licit, 1 = illicit, 2 = unknown,
# so only ~46K nodes carry a usable label.
labeled = int((data.y != 2).sum())
print(f"Labeled nodes: {labeled}")

Check your PyG version for EllipticBitcoinDataset availability. Manual loading from CSVs is an alternative.
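For the manual route, a sketch of the CSV parsing step might look like the following. It assumes a classes file with columns `txId` and `class` (values `1` for illicit, `2` for licit, `unknown` for unlabeled) and an edge list with columns `txId1` and `txId2`; verify these against the files you download. The function name `load_graph` and the in-memory sample data are hypothetical.

```python
import csv
import io

# Assumed raw label strings mapped to 0 = licit, 1 = illicit, 2 = unknown.
LABEL_MAP = {"2": 0, "1": 1, "unknown": 2}


def load_graph(classes_file, edges_file):
    """Return (tx_ids, labels, edge_index) from two open CSV file objects."""
    tx_ids, labels = [], []
    for row in csv.DictReader(classes_file):
        tx_ids.append(row["txId"])
        labels.append(LABEL_MAP[row["class"]])

    # Map raw transaction ids to contiguous node indices 0..N-1.
    index_of = {tx: i for i, tx in enumerate(tx_ids)}

    src, dst = [], []
    for row in csv.DictReader(edges_file):
        src.append(index_of[row["txId1"]])
        dst.append(index_of[row["txId2"]])
    return tx_ids, labels, [src, dst]


# Tiny in-memory stand-ins for the real files.
classes_csv = io.StringIO("txId,class\n100,unknown\n200,1\n300,2\n")
edges_csv = io.StringIO("txId1,txId2\n100,200\n200,300\n")
tx_ids, labels, edge_index = load_graph(classes_csv, edges_csv)
print(labels, edge_index)
```

The resulting `[sources, targets]` pair has the same layout as a PyG `edge_index` and can be converted to a tensor once the feature matrix is loaded in the same node order.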

Common tasks and benchmarks

Binary node classification (illicit vs licit) with temporal splits. Standard evaluation uses AUROC and F1. Approximate AUROC: GCN ~95%, GAT ~96%, temporal GNNs such as EvolveGCN ~97%+. The practical benchmark is the operating point: how many illicit transactions can you catch while keeping the false positive rate below 5%? This mirrors the real-world constraint that investigators can only review a limited number of flagged transactions.
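The operating-point question can be computed directly from model scores. The function below is a minimal pure-Python sketch (the name `recall_at_fpr` is illustrative): it sweeps thresholds from the highest score downward and reports the best recall reached while the false positive rate stays within budget. In practice, `sklearn.metrics.roc_curve` provides the same curve for larger datasets.

```python
def recall_at_fpr(labels, scores, max_fpr=0.05):
    """Highest recall on the positive class (label 1) with FPR <= max_fpr."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best_recall = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score so the threshold is well defined.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows as the threshold drops
        best_recall = max(best_recall, tp / n_pos)
    return best_recall


# Toy scores: two illicit (1) and three licit (0) transactions.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
strict = recall_at_fpr(labels, scores, max_fpr=0.0)
loose = recall_at_fpr(labels, scores, max_fpr=0.34)
print(strict, loose)
```

Reporting recall at a fixed false positive budget ties the metric to investigator capacity, which a single AUROC number does not.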

Example: anti-money laundering

Banks spend $25+ billion annually on AML compliance. Each suspicious activity report requires manual investigation costing $50-500. Current rule-based systems generate 95%+ false positives, wasting investigator time. GNN-based detection on transaction graphs reduces false positives by identifying genuine fraud patterns (chain-like fund flows through shell accounts) while filtering out routine transactions that happen to trigger rules. Elliptic demonstrates this approach on real Bitcoin data.

Published benchmark results

Illicit transaction detection on Elliptic Bitcoin. Metric is AUROC (area under the ROC curve). Higher is better.

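| Method | AUROC (%) | Year | Paper |
|---|---|---|---|
| Random Forest | ~97.7 | 2019 | Weber et al. |
| Logistic Regression | ~93.2 | 2019 | Weber et al. |
| GCN | ~95.0 | 2019 | Weber et al. |
| GAT | ~96.0 | 2020 | Pareja et al. |
| EvolveGCN | ~97.2 | 2020 | Pareja et al. |
| Skip-GCN | ~96.5 | 2019 | Weber et al. |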

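Note: Random Forest on the 165 features (which already include 1-hop aggregates) is a strong baseline and scores at or above the GNNs in this table. GNN advantages are more pronounced in precision-recall at high-precision operating points. All values are approximate; Weber et al. report precision, recall, and F1 rather than AUROC, so check the cited papers before quoting a number.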

Original Paper

Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics

Mark Weber, Giacomo Domeniconi, Jie Chen, Daniel Karl I. Weidele, Claudio Bellei, Tom Robinson, Charles E. Leiserson (2019). KDD Workshop on Anomaly Detection in Finance

Original data source

The Elliptic Bitcoin dataset is provided by Elliptic and is available on Kaggle.

cite_elliptic.bib
@inproceedings{weber2019anti,
  title={Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics},
  author={Weber, Mark and Domeniconi, Giacomo and Chen, Jie and Weidele, Daniel Karl I and Bellei, Claudio and Robinson, Tom and Leiserson, Charles E},
  booktitle={KDD Workshop on Anomaly Detection in Finance},
  year={2019}
}

BibTeX citation for the Elliptic Bitcoin dataset.

Which dataset should I use?

Elliptic vs DGraphFin: Elliptic is a Bitcoin transaction graph (203K nodes, 165 features). DGraphFin is a fintech social graph (3.7M nodes, 17 features). Use Elliptic for transaction-level fraud detection; use DGraphFin for social-network fraud at scale.

Elliptic vs IEEE-CIS Fraud: IEEE-CIS is tabular (no graph structure). Elliptic provides the transaction graph. Use Elliptic to study graph-based fraud detection specifically.

Elliptic vs OGB-Products: Both are large single graphs. OGB-Products is product co-purchase (no fraud). Use Elliptic for fraud domain; OGB-Products for scalability benchmarks.

From benchmark to production

Elliptic has ~200K nodes in total. A major bank processes 100M+ transactions daily, so a single day of production traffic is roughly 500x the entire benchmark. Production fraud detection also requires real-time scoring (millisecond latency), heterogeneous graphs (accounts, merchants, devices, locations), and continuous model updates as fraud patterns evolve. The temporal and class imbalance challenges in Elliptic are real, but production scale is orders of magnitude larger.

Frequently asked questions

What is the Elliptic Bitcoin dataset?

Elliptic Bitcoin is a temporal transaction graph of 203,769 Bitcoin transactions (nodes) and 234,355 payment flows (edges). Each transaction has 165 features. The binary classification task is to identify illicit (fraudulent) transactions. Only 2% of nodes are labeled as illicit, creating severe class imbalance.

How do I load Elliptic Bitcoin in PyTorch Geometric?

Use `from torch_geometric.datasets import EllipticBitcoinDataset; dataset = EllipticBitcoinDataset(root='/tmp/Elliptic')`. The dataset contains a single graph with 203,769 nodes and 234,355 edges. Only ~46K nodes are labeled.

Why is Elliptic Bitcoin important for fraud detection?

It is one of the few publicly available, real-world financial fraud datasets with graph structure. Most fraud detection benchmarks use tabular data (features per transaction). Elliptic provides the transaction graph, enabling GNN-based fraud detection that leverages payment network patterns.

What makes fraud detection on graphs difficult?

Three challenges: extreme class imbalance (illicit transactions are ~2% of all nodes and ~10% of labeled nodes), temporal dynamics (fraud patterns evolve over the 49 timesteps), and label scarcity (only ~46K of 203K transactions are labeled). These challenges mirror real-world fraud detection at banks.

What accuracy should I expect on Elliptic Bitcoin?

Accuracy is misleading due to class imbalance: predicting all-licit gets ~90% accuracy on the labeled nodes while catching no fraud. Use AUROC or precision-recall metrics. GCN achieves ~95% AUROC, GAT ~96%, and methods with temporal modeling can exceed 97%. The practical metric is recall at high precision: catching fraud without too many false positives.
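The accuracy trap is easy to reproduce. The sketch below scores a trivial "never flag anything" predictor on toy labels in roughly the labeled-set ratio given earlier (~42,000 licit to ~4,500 illicit, scaled down by 100). The helper `binary_metrics` is illustrative and, by convention here, returns 0.0 for precision when nothing is flagged.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for the positive (illicit) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy labels: 420 licit (0) and 45 illicit (1) transactions.
y_true = [0] * 420 + [1] * 45
all_licit = [0] * len(y_true)  # a "model" that never flags anything

metrics = binary_metrics(y_true, all_licit)
print(metrics)
```

Accuracy looks respectable while recall on the illicit class is zero, which is why the ranking-based and operating-point metrics above are preferred.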
