A traditional credit scorecard uses 10 to 20 variables: outstanding balance, payment history, credit utilization, number of accounts, account age, recent inquiries. These variables are derived from credit bureau data and compressed into a single score. The model is logistic regression. Each variable is assumed to have a linear effect on the log-odds of default. The scorecard is recalibrated annually.
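To make the mechanics concrete, here is a minimal sketch of how a logistic-regression scorecard turns a few bureau variables into a probability of default. The intercept, the weights, and the variable names are hypothetical placeholders chosen for illustration; they are not taken from any real scorecard.

```python
import math

# Hypothetical coefficients for illustration only; a real scorecard is fit on bureau data.
INTERCEPT = -4.0
WEIGHTS = {
    "utilization": 3.0,          # fraction of credit limit in use, 0.0 to 1.0
    "missed_payments_12m": 0.8,  # count of missed payments in the last 12 months
    "recent_inquiries": 0.2,     # count of recent hard inquiries
}

def scorecard_pd(features: dict[str, float]) -> float:
    """Probability of default under a logistic model: sigmoid of a linear score."""
    log_odds = INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-log_odds))

low = scorecard_pd({"utilization": 0.38, "missed_payments_12m": 0, "recent_inquiries": 1})
high = scorecard_pd({"utilization": 0.67, "missed_payments_12m": 2, "recent_inquiries": 1})
```

Each variable adds a fixed amount to the log-odds. Nothing in this form can express an interaction between variables or the direction a variable is moving.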
This framework was built for an era when a credit bureau file was the only data available. Today, a bank has access to transaction-level data, account behavior, payment velocity, counterparty information, employment data, and real-time financial flows. The scorecard uses none of it.
The result: traditional models misclassify 15-25% of borrowers. Some are scored too high and default unexpectedly. Others are scored too low and are denied credit they would have repaid. Both errors have direct financial consequences.
borrower_scorecard_view
| borrower_id | FICO | utilization | payment_history | accounts | score_risk |
|---|---|---|---|---|---|
| B-3301 | 742 | 38% | 12/12 on-time | 4 | Low |
| B-3302 | 718 | 52% | 11/12 on-time | 6 | Medium |
| B-3303 | 695 | 67% | 10/12 on-time | 3 | Medium |
| B-3304 | 731 | 41% | 12/12 on-time | 5 | Low |
| B-3305 | 709 | 45% | 12/12 on-time | 4 | Low-Medium |
Five borrowers scored by a traditional 20-variable scorecard. B-3301 and B-3304 both appear low risk. But the scorecard cannot see what is happening inside their transaction accounts.
transaction_behavior (last 90 days)
| borrower_id | spend_trend | cash_advances | payment_timing | paycheck_status | merchant_shift |
|---|---|---|---|---|---|
| B-3301 | Stable | 0 | Day 5 of cycle | Regular bi-weekly | None |
| B-3302 | +35% in 30 days | 3 ($4,200) | Day 5 > Day 22 drift | Missed last deposit | Discretionary > essentials |
| B-3303 | Declining | 0 | Day 10 stable | Regular weekly | None |
| B-3304 | Stable | 0 | Day 8 of cycle | Regular bi-weekly | None |
| B-3305 | +60% in 14 days | 5 ($8,900) | Day 3 > Day 28 drift | Irregular since Feb | Restaurants > cash advances |
Highlighted: B-3302 and B-3305 show classic pre-default behavioral patterns: cash advance spikes, payment timing drift, spending composition shifts. The scorecard rates B-3302 as medium risk and B-3305 as low-medium risk.
What scorecards miss
The limitations of traditional credit scorecards are not in the math. Logistic regression is a fine algorithm. The limitations are in the data it can consume and the patterns it can represent.
Transaction-level behavior
A scorecard sees "credit utilization: 45%." It does not see that utilization spiked from 20% to 45% in two weeks, driven by a series of cash advances rather than retail purchases. It does not see that the borrower's paycheck deposits stopped three weeks ago. It does not see that spending shifted from groceries and utilities to cash advances and overdraft-protected transfers.
Transaction-level data contains behavioral signals that aggregates destroy. The velocity, composition, and timing of transactions are strong predictors of financial stress. A borrower making minimum payments on the due date every month for 12 months has a different risk profile than one making the same minimum payments but progressively later in the grace period. The scorecard sees both as "12 on-time payments."
Counterparty and network risk
Credit risk is not independent. A borrower whose primary employer is experiencing financial distress has elevated risk, regardless of their personal payment history. A borrower whose major business clients are defaulting on their own obligations faces cascading risk. A small business whose suppliers are experiencing liquidity problems may face inventory disruptions that affect revenue.
These are relational patterns: the borrower's risk depends on the risk of entities they are connected to. Scorecards treat each borrower in isolation. Relational models propagate risk through the network.
Temporal dynamics
A borrower with a 720 FICO score and deteriorating transaction patterns has a different risk profile than a borrower with a 720 FICO score and stable patterns. The score is a point-in-time snapshot. It does not capture the trajectory. By the time the deterioration shows up in the scorecard variables (missed payments, increased utilization), the risk event may already be underway.
How ML improves credit risk models
ML-based credit risk models address the limitations of scorecards in three ways: more variables, non-linear patterns, and transaction-level granularity.
Gradient boosted trees on engineered features
The most common ML approach today: data scientists extract hundreds of features from transaction data, account history, and bureau files, then train XGBoost or LightGBM. This typically reduces default prediction error by 20-30% compared to logistic regression scorecards, according to multiple published studies including research from the Bank of England and the European Central Bank.
The limitation is the feature engineering bottleneck. Someone has to decide which transaction aggregates to compute, which time windows to use, and which cross-table features to construct. For a bank with transaction data, account data, customer data, product data, and external data, the possible feature space is enormous. Data science teams typically build 200-500 features and iterate for 3-6 months before production deployment.
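A minimal sketch of what that hand-built feature work looks like, using only the standard library. The transaction records, the window lengths, and the feature names are hypothetical; the point is that every window and every aggregate is a decision someone has to make and maintain.

```python
from datetime import date, timedelta

# Hypothetical transaction records: (posting date, amount in dollars, category).
transactions = [
    (date(2024, 3, 1), 120.0, "groceries"),
    (date(2024, 3, 10), 500.0, "cash_advance"),
    (date(2024, 3, 20), 80.0, "restaurants"),
    (date(2024, 3, 28), 900.0, "cash_advance"),
]

def window_features(txns, as_of, days):
    """Aggregate one time window ending at as_of into a flat feature dict."""
    start = as_of - timedelta(days=days)
    in_window = [t for t in txns if start < t[0] <= as_of]
    total = sum(amount for _, amount, _ in in_window)
    cash = sum(amount for _, amount, cat in in_window if cat == "cash_advance")
    return {
        f"spend_{days}d": total,
        f"txn_count_{days}d": len(in_window),
        f"cash_advance_share_{days}d": cash / total if total else 0.0,
    }

as_of = date(2024, 3, 31)
features = {}
for days in (7, 30):  # every extra window adds another set of columns
    features.update(window_features(transactions, as_of, days))
```

Two windows and three aggregates already produce six columns from a single table. Adding more windows, more aggregates, and joins across tables is how teams arrive at hundreds of features.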
Deep learning on transaction sequences
Some banks use LSTMs or Transformers on raw transaction sequences. Instead of aggregating transactions into features, the model reads the full sequence: amount, merchant category, timestamp, channel. It learns temporal patterns that aggregation destroys: spending velocity changes, category shifts, and payment timing drift.
This approach adds a 5-10% accuracy improvement over feature-based ML when both are trained on transaction data alone. But it only sees one data source. Account relationships, counterparty risk, and cross-product behavior are outside its view.
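As a rough sketch of the input side of such a model, the snippet below turns raw transactions into per-event tokens that an LSTM or Transformer could consume. The records, the order-of-magnitude bucketing, and the field names are assumptions made for illustration, and no model is trained here.

```python
import math
from datetime import date

# Hypothetical raw transactions, ordered by time: (date, amount, merchant category, channel).
raw = [
    (date(2024, 3, 1), 42.50, "groceries", "card"),
    (date(2024, 3, 3), 500.00, "cash_advance", "atm"),
    (date(2024, 3, 10), 18.00, "restaurants", "card"),
]

def encode_sequence(txns):
    """Map each transaction to integer tokens plus the gap since the previous one."""
    category_ids, channel_ids = {}, {}
    encoded, previous_day = [], None
    for day, amount, category, channel in txns:
        encoded.append({
            "amount_bucket": int(math.log10(max(amount, 1.0))),  # order of magnitude
            "category_id": category_ids.setdefault(category, len(category_ids)),
            "channel_id": channel_ids.setdefault(channel, len(channel_ids)),
            "days_since_prev": 0 if previous_day is None else (day - previous_day).days,
        })
        previous_day = day
    return encoded

sequence = encode_sequence(raw)
```

The order of events and the gaps between them are kept, which is exactly the information a 30-day aggregate throws away.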
Traditional scorecard
- 10-20 variables from credit bureau data
- Logistic regression with assumed linear relationships
- Point-in-time snapshot, no temporal dynamics
- Each borrower treated as an independent entity
- Annual recalibration cycle
Relational ML model
- Hundreds of signals from transactions, accounts, and network
- Non-linear patterns and interaction effects captured automatically
- Temporal sequences reveal deterioration 3-6 months early
- Counterparty and network risk propagated through the graph
- Continuous learning from new transaction data
The relational approach to credit risk
Relational deep learning, published at ICML 2024, showed that representing a relational database as a temporal graph enables ML models to learn directly from multi-table data without feature engineering. For credit risk, this means the model sees the borrower not as a row of aggregated features but as a node in a graph connected to their transactions, accounts, counterparties, products, and historical events.
The graph neural network propagates information along these connections. A borrower's risk assessment incorporates the financial health of their employers, the behavior of their co-borrowers, the performance of similar borrowers who share transaction patterns, and the temporal trajectory of their own financial behavior. All of this happens automatically, without a data scientist specifying which features to extract.
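The propagation idea can be sketched with a single round of neighbor averaging on a toy graph. This is a simplification: a real graph neural network learns its aggregation weights from data, whereas the fixed `self_weight`, the node names, and the initial stress scores below are all hypothetical.

```python
# Hypothetical graph: each node starts with a stress score in [0, 1].
stress = {"borrower_A": 0.10, "borrower_B": 0.10, "employer_X": 0.80, "employer_Y": 0.05}

# Undirected edges: the borrower is paid by the employer.
edges = [("borrower_A", "employer_X"), ("borrower_B", "employer_Y")]

def propagate(scores, edge_list, self_weight=0.5):
    """One round of mean-neighbor aggregation, mixed with the node's own score."""
    neighbors = {node: [] for node in scores}
    for u, v in edge_list:
        neighbors[u].append(v)
        neighbors[v].append(u)
    updated = {}
    for node, own in scores.items():
        if neighbors[node]:
            neighbor_mean = sum(scores[n] for n in neighbors[node]) / len(neighbors[node])
            updated[node] = self_weight * own + (1 - self_weight) * neighbor_mean
        else:
            updated[node] = own  # isolated nodes keep their score
    return updated

after_one_round = propagate(stress, edges)
```

The two borrowers start with identical scores. After one round they differ, and the only reason is the employer each one is connected to.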
What the model discovers
When trained on a bank's full relational data, graph models discover credit risk signals that no scorecard captures.
Spending composition shifts. A borrower whose transaction mix shifts from discretionary spending (restaurants, travel) to essential spending (groceries, utilities) over a 4-week period has elevated risk, even if total spend is unchanged. The model learns this from the transaction-merchant category graph.
spending_composition: Borrower B-3305
| week | restaurants | travel | groceries | cash_advances | total_spend |
|---|---|---|---|---|---|
| Week 1 | $420 | $280 | $180 | $0 | $880 |
| Week 2 | $310 | $0 | $240 | $500 | $1,050 |
| Week 3 | $85 | $0 | $290 | $1,800 | $2,175 |
| Week 4 | $0 | $0 | $310 | $3,600 | $3,910 |
Highlighted: discretionary spending (restaurants, travel) collapsed from $700 to $0 over 4 weeks. Cash advances spiked from $0 to $3,600. Total spend actually increased, so a scorecard that tracks only aggregate utilization registers higher usage but not where the money went. The composition shift is invisible.
flat_scorecard_view (what the scorecard sees)
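The shift is easy to quantify once the categories are kept separate. The sketch below recomputes weekly totals and category shares from the B-3305 table; the grouping of restaurants and travel as "discretionary" follows the text, and the function and field names are illustrative.

```python
# Weekly spend for borrower B-3305, copied from the table above.
weeks = [
    {"restaurants": 420, "travel": 280, "groceries": 180, "cash_advances": 0},
    {"restaurants": 310, "travel": 0, "groceries": 240, "cash_advances": 500},
    {"restaurants": 85, "travel": 0, "groceries": 290, "cash_advances": 1800},
    {"restaurants": 0, "travel": 0, "groceries": 310, "cash_advances": 3600},
]

DISCRETIONARY = ("restaurants", "travel")

def composition(week):
    """Total spend and the share going to discretionary categories and cash advances."""
    total = sum(week.values())
    return {
        "total": total,
        "discretionary_share": sum(week[c] for c in DISCRETIONARY) / total,
        "cash_advance_share": week["cash_advances"] / total,
    }

shares = [composition(w) for w in weeks]
```

The totals reproduce the `total_spend` column. The shares show what the total hides: the discretionary share falls to zero while the cash advance share rises every week.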
| borrower_id | avg_monthly_spend | utilization | on_time_payments | scorecard_risk |
|---|---|---|---|---|
| B-3305 | $2,004 | 45% | 12/12 | Low-Medium |
| B-3301 | $1,890 | 38% | 12/12 | Low |
B-3305 and B-3301 look similar in the scorecard. Both have on-time payments and moderate utilization. The scorecard has no column for 'cash advance ratio' or 'spending category trajectory'. B-3305 is 4 weeks from default.
Payment timing drift. A borrower who pays on day 5 of the billing cycle for 8 months and then gradually shifts to day 25 is exhibiting a pattern that precedes missed payments. The temporal model captures this drift; a scorecard sees "on-time" until the first late payment.
payment_timing: Borrower B-3302
| month | payment_day | grace_period_end | days_remaining | scorecard_status |
|---|---|---|---|---|
| Sep | Day 5 | Day 28 | 23 days | On-time |
| Oct | Day 8 | Day 28 | 20 days | On-time |
| Nov | Day 14 | Day 28 | 14 days | On-time |
| Dec | Day 19 | Day 28 | 9 days | On-time |
| Jan | Day 24 | Day 28 | 4 days | On-time |
| Feb | Day 31 | Day 28 | -3 days | LATE |
Highlighted: payment day drifted from day 5 to day 31 over 6 months. The scorecard recorded 'on-time' for 5 months because the grace period was not exceeded. By the time the scorecard detects the first late payment, the borrower has been deteriorating for half a year.
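A simple way to surface this drift before the first late payment is to fit a trend line to the payment day. The sketch below uses an ordinary least-squares slope on the five on-time months from the B-3302 table. It is a hand-rolled illustration, not the method a temporal model would use internally.

```python
# Payment day within the billing cycle for borrower B-3302, Sep through Feb (table above).
payment_days = [5, 8, 14, 19, 24, 31]
GRACE_PERIOD_END = 28

def drift_per_month(days):
    """Ordinary least-squares slope of payment day against month index."""
    n = len(days)
    mean_x = (n - 1) / 2
    mean_y = sum(days) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(days))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance

# Fit only on the five months the scorecard still records as on-time.
slope = drift_per_month(payment_days[:5])
months_until_late = (GRACE_PERIOD_END - payment_days[4]) / slope
```

On the Sep to Jan data the payment day moves later by almost five days per month, and the projected remaining margin is less than one month. That matches the late payment recorded in February, while the scorecard status was still "On-time" at the moment of the projection.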
Network contagion. When a borrower's primary employer's payroll account shows reduced activity, employees at that company face elevated risk. The model propagates this signal through the employer-employee-account graph before individual borrower behavior changes.
employer_payroll_health
| employer | payroll_deposits_q3 | payroll_deposits_q4 | change | employees_affected |
|---|---|---|---|---|
| TechStartup Inc | 24 (bi-weekly) | 18 (irregular) | -25% | B-3302, B-3307, B-3312 |
| MegaCorp LLC | 24 (bi-weekly) | 24 (bi-weekly) | Stable | B-3301, B-3304 |
TechStartup Inc's payroll deposits dropped 25% and became irregular, a sign of financial distress. B-3302 is employed there. The relational model propagates this risk signal through the employer-employee graph. The scorecard knows nothing about the employer's financial health.
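A rule-based approximation of this signal, using the counts from the table: compute the quarter-over-quarter change in payroll deposits and flag the employees of any employer below a cutoff. The `DROP_THRESHOLD` value is a hypothetical choice for illustration, and a relational model would learn the relationship rather than apply a fixed rule.

```python
# Payroll deposit counts per quarter, copied from the table above.
employers = {
    "TechStartup Inc": {"q3": 24, "q4": 18, "employees": ["B-3302", "B-3307", "B-3312"]},
    "MegaCorp LLC": {"q3": 24, "q4": 24, "employees": ["B-3301", "B-3304"]},
}

DROP_THRESHOLD = -0.20  # hypothetical cutoff for flagging an employer

def payroll_change(record):
    """Relative change in payroll deposit count from Q3 to Q4."""
    return (record["q4"] - record["q3"]) / record["q3"]

flagged_borrowers = sorted(
    borrower
    for record in employers.values()
    if payroll_change(record) <= DROP_THRESHOLD
    for borrower in record["employees"]
)
```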
model_comparison (90-day default prediction)
| borrower_id | scorecard_PD | ML_PD | actual_outcome | early_warning |
|---|---|---|---|---|
| B-3301 | 2.1% | 1.8% | No default | — |
| B-3302 | 8.4% | 34.2% | Default (Day 68) | ML flagged 52 days early |
| B-3303 | 12.7% | 9.1% | No default | — |
| B-3304 | 2.3% | 2.0% | No default | — |
| B-3305 | 5.9% | 41.8% | Default (Day 45) | ML flagged 38 days early |
Highlighted: the scorecard gave B-3302 an 8.4% PD and B-3305 a 5.9% PD. Both defaulted. Relational ML saw the transaction behavior deterioration and flagged them at 34.2% and 41.8% respectively.
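One common way to compare probability forecasts is the Brier score, the mean squared gap between the predicted probability and the 0/1 outcome, where lower is better. The sketch below applies it to the five rows of the table. Five borrowers are far too few for real validation, so treat this as an illustration of the metric rather than evidence of model quality.

```python
# Rows from the model_comparison table: (borrower, scorecard PD, ML PD, defaulted).
rows = [
    ("B-3301", 0.021, 0.018, 0),
    ("B-3302", 0.084, 0.342, 1),
    ("B-3303", 0.127, 0.091, 0),
    ("B-3304", 0.023, 0.020, 0),
    ("B-3305", 0.059, 0.418, 1),
]

def brier_score(predictions, outcomes):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(outcomes)

outcomes = [r[3] for r in rows]
scorecard_brier = brier_score([r[1] for r in rows], outcomes)
ml_brier = brier_score([r[2] for r in rows], outcomes)
```

On these rows the ML score is roughly half the scorecard's, and almost all of the difference comes from the two borrowers who defaulted.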
PQL Query
PREDICT default_90d FOR EACH borrowers.borrower_id WHERE borrowers.account_status = 'active'
Predict 90-day probability of default for every active borrower. The model incorporates transaction-level behavior (spending velocity, cash advance frequency, payment timing drift), counterparty health (employer payroll activity), and network risk (co-borrower and guarantor default rates).
Output
| borrower_id | PD_90d | risk_tier | top_signal | early_warning_days |
|---|---|---|---|---|
| B-3305 | 41.8% | High | Cash advance spike + paycheck irregular | 38 |
| B-3302 | 34.2% | High | Payment drift + spending shift | 52 |
| B-3303 | 9.1% | Medium | Utilization elevated but stable | — |
| B-3301 | 1.8% | Low | All indicators stable | — |
| B-3304 | 2.0% | Low | All indicators stable | — |
Regulatory considerations
Model risk management guidance (SR 11-7, SS1/23) requires that models be validated, documented, and governed regardless of methodology. ML models are held to the same standards as scorecards. The additional requirements for ML are explainability and fairness testing.
In practice, many banks use a dual approach: ML models for risk screening, portfolio monitoring, and early warning systems, where the regulatory bar for explainability is lower; and traditional scorecards for final credit decisioning, where explainability requirements are highest. The ML model identifies borrowers whose risk profile is changing. The scorecard makes the final approve/decline decision.
This is evolving. The OCC's 2023 guidance explicitly acknowledges that ML models can improve risk management and does not prohibit their use in decisioning, provided adequate governance is in place. The EU AI Act classifies credit scoring as "high risk" but does not prohibit ML, requiring instead transparency, human oversight, and bias testing.
The foundation model path
KumoRFM brings the foundation model approach to credit risk. The model is pre-trained on relational patterns across thousands of databases, including financial transaction patterns, temporal behavioral dynamics, and network effects. It has already learned the universal signals of financial stress: spending composition changes, payment timing drift, utilization velocity, and counterparty risk propagation.
A bank connects its relational database and writes a predictive query:
PREDICT default_90d FOR borrowers
The model returns a probability of default for every borrower, incorporating the full relational context. No feature engineering, no 6-month development cycle, no annual recalibration schedule. The model updates as new data arrives, capturing deteriorating patterns in real time.
For a bank with $50 billion in consumer credit exposure, a 20% reduction in default prediction error translates to tens of millions in reduced credit losses annually. The cost of achieving this with traditional ML is a team of data scientists working for months. The cost with a foundation model is a database connection and a query.
Credit risk modeling has been constrained by 1950s data and 1990s methods. The data has caught up. The methods are catching up now. The banks that use the full relational signal will price risk more accurately, detect deterioration earlier, and extend credit more broadly and more safely than those still relying on 20-variable scorecards.