09/12/2024
Uncovering Anomalies: How Kumo AI uses Graph Machine Learning for Advanced Fraud Detection
By Naisha Agarwal and Blaz Stojanovic
Introduction
Fraud is an ever-present problem in the financial services industry. Building reliable machine learning (ML) systems that catch fraud fast is crucial for banks, both to remove fraudulent users from their platforms and to retain the money they would otherwise lose to fraud. As security measures improve, fraudsters adapt, developing new methods and techniques to circumvent detection systems and algorithms. Fraud is closely related to anomaly detection, where rare entities (such as fraudsters) must be found within data and systems. Kumo, with its graph neural network approach, provides a robust, scalable model that delivers outstanding performance on these types of problems in real-world production settings. In this blog post we will explore how Kumo can be used to achieve state-of-the-art performance on a financial anomaly detection dataset – DGraphFin.
Anomaly Detection
Anomaly detection is concerned with finding entities that are sufficiently different from all other entities. The field has a rich history, encompassing supervised approaches, where we manually label some anomalous examples and aim to find similar anomalies in an almost semi-supervised fashion, as well as unsupervised approaches, where data analysis is performed to establish what "anomalous" means in the first place. Most of the time both types of approaches are used in tandem to effectively combat fraud. Anomaly detection methods and workflows are also of utmost importance in other high-impact domains, including e-commerce, social networks, cybersecurity, and finance. As in other fields of applied machine learning, anomaly detection has seen rapid adoption of new and more powerful ML approaches in recent years, one of the most promising being graph machine learning.
Why Graphs?
In predictive models, data is king, and how we choose to represent and model the available data determines our success. In anomaly detection, much like in other subfields of applied machine learning, we can contrast traditional approaches with a newer, more data-centric paradigm. In traditional data science, we flatten many relational tables and perform careful feature engineering to generate a single training table, then train a tabular model on it. More modern "data-centric" approaches put the emphasis on high-quality data and its most natural representation. An effort is made not to "throw away" any useful signal, as one is inevitably forced to do when performing feature engineering, and powerful ML models that can learn over data in this natural form are used for greater predictive power. Over the past few years we have already seen this paradigm shift happen in computer vision, moving from hand-designed filters to CNNs and now vision transformers, as well as in natural language processing with the dawn of LLMs. Kumo's graph-centric approach embodies this paradigm shift for tabular/relational datasets. Our graph transformers operate on the relational database directly, without any feature engineering, and provide the same performance edge over traditional approaches.
The subfield of anomaly detection that explores graph machine learning approaches is commonly referred to as Graph Anomaly Detection (GAD). Graphs have two basic components: nodes and edges. Nodes represent entities while edges represent relationships between those entities (e.g., nodes can be bank accounts and edges the transactions between accounts). Furthermore, graphs can be either static or dynamic. A static graph is fixed and does not change over time, with all entities remaining immutable. Dynamic graphs (or temporal graphs) change over time; users joining, leaving, and interacting in a social network are a classic example of a case that requires a dynamic graph.
Why use graphs and graph-based approaches for anomaly detection? By structuring our data as graphs, we can reframe the problem as detecting anomalies within these graphs – more specifically, abnormal nodes (e.g. users), edges (e.g. user transactions), or subgraphs (e.g. a small group of users and their transactions). Graphs are particularly well-suited for dynamic data due to their inherent ability to model relationships and patterns between different entities without any simplification of the underlying data. Additionally, graph machine learning approaches adapt to topological structure, scale to very large networks, incorporate multimodal attribute data, are highly interpretable, and can be extended in many ways, e.g. with self-attention or transformer components. Graph anomaly detection has massive impact and benefits in the real world. Let us now look at this problem in the context of the DGraphFin dataset.
Dataset
The DGraphFin dataset is a publicly accessible dataset provided by Finvolution Group, a Chinese consumer finance company with over 140 million consumers; the Finvolution platform connects underserved borrowers with financial institutions. To use the Finvolution platform, a borrower must register an account and complete a basic personal profile with age, gender, and other factors used to determine their loan limit. One of the required elements of this personal profile is an emergency contact – the person in the user's life who should be contacted in the event of an emergency. Before each new loan application, users must give one contact's name; the platform then evaluates loan requests and determines whether to grant loans to users.
Some users on this platform borrow money but do not pay it back; the platform labels them as fraudulent. These fraudsters are anomalies on the platform, and the machine learning task Finvolution is trying to solve is detecting these anomalies at industrial scale. Fraudulent users often fill in false information on their profiles, which can itself be used to detect them. In particular, the emergency contact that all users are required to provide is a highly predictive signal of fraud for this task. In the graph-centric approach, the emergency contact is treated as part of the overall network, providing crucial topological/structural signals that turn out to be very useful for classification on this task.
In essence, the DGraphFin dataset is structured as a directed multigraph, with nodes representing users and eleven different types of edges connecting them (the edges are different interactions between users, one of them being the emergency contact relation). Each edge is attributed with a timestamp signifying when the interaction happened. In total, the graph contains 3,700,550 nodes and 4,300,999 directed edges. Each user node carries a 17-dimensional feature vector whose dimensions correspond to a portion of the personal profile that users fill out when joining the platform – i.e. user attributes.
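For readers who want to inspect the raw data, DGraphFin is distributed as a single compressed NumPy archive. Below is a minimal loading sketch; the key names follow the publicly released file and should be treated as assumptions to verify against your local copy.

```python
import numpy as np

# Load the raw DGraphFin archive (key names assumed from the public release).
data = np.load("dgraphfin.npz")

x = data["x"]                     # (3,700,550, 17) user feature vectors
y = data["y"]                     # node labels (fraud vs. normal vs. background)
edge_index = data["edge_index"]   # (4,300,999, 2) directed (source, target) pairs
edge_type = data["edge_type"]     # interaction type, one of 11 categories
edge_ts = data["edge_timestamp"]  # when each interaction happened

print("nodes:", x.shape, "edges:", edge_index.shape)
print("fraud nodes:", (y == 1).sum())
```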
Visualization of the DGraphFin dataset. The dataset is modeled as a directed graph between users, where one user is the emergency contact of another. The different colors indicate the different types of interactions in the graph, and each interaction is labeled with a timestamp.
32.2% of the nodes in DGraph (1,225,601) have related borrowing records and are labeled as such; of those, 15,509 are classified as anomalous and the rest as normal. DGraph is a challenging anomaly detection dataset. Not only is the classification problem heavily imbalanced, but the many relationships between users need to be considered, and there is a large variety between user neighborhoods – traditionally this would require careful feature construction and evaluation. Additionally, the graph is very sparse: some users do not fill in emergency contact information, and emergency contacts who are not in the Finvolution network are filtered out. As a result, DGraph contains 49.9% missing values, making the problem of identifying fraudsters in the graph even more difficult.
Solution with Kumo
The original dataset contains two tables:
- users table: includes user_id, a unique identifier of each user; the anomaly label; plus several features from their personal profile.
- interactions table: contains four columns: two user_id columns, one for each user taking part in the interaction; a timestamp column indicating when the interaction happened; and a type column recording the interaction type.
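As a rough illustration (not Kumo's ingestion code), the two tables can be materialized from the raw arrays with pandas, assuming the archive keys from the loading sketch above:

```python
import numpy as np
import pandas as pd

data = np.load("dgraphfin.npz")  # keys as in the loading sketch above

# users table: one row per user, with profile features and the anomaly label
users = pd.DataFrame(data["x"], columns=[f"feat_{i}" for i in range(17)])
users.insert(0, "user_id", np.arange(len(users)))
users["label"] = data["y"]

# interactions table: source user, target user, timestamp, interaction type
interactions = pd.DataFrame({
    "user_source_id": data["edge_index"][:, 0],
    "user_target_id": data["edge_index"][:, 1],
    "timestamp": data["edge_timestamp"],
    "type": data["edge_type"],
})
```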
We can structure the dataset/graph in two different ways:
- Maintain the original structure of the users and interactions tables, and connect the tables using the primary<>foreign user_id keys in the respective tables (see Graph 1).
- Alternatively, split the interactions table into a separate table for each interaction type and drop the interaction type column (see the sketch after this list). We do this for illustrative purposes here, but this step can be beneficial if each interaction type has many different features associated with it, or if the interaction types are very different. We connect these interaction tables to the users table in the same way as in approach 1 (see Graph 2).
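A minimal sketch of the second approach in pandas, assuming the `interactions` DataFrame from the previous sketch:

```python
# Approach 2: split the interactions table into one table per interaction type.
interaction_tables = {
    f"interactions_type_{t}": group.drop(columns="type")
    for t, group in interactions.groupby("type")
}
# Each resulting table keeps user_source_id, user_target_id, and timestamp,
# and connects to the users table through the same user_id foreign keys.
```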
Graphs
The first graph is composed of two tables: the primary key is user_id in the users table, and the foreign keys in the interactions table are user_source_id and user_target_id. Graph 1 looks as follows:
Graph 1: no restructuring was done.
In the second graph, there are twelve tables in total: the users table and the eleven different interaction tables, each corresponding to a different type of interaction captured in the dataset. Graph 2 looks as follows:
Graph 2: the interactions table is split into a separate table for each interaction type.
The Predictive Query
Machine learning models in Kumo are defined with a declarative Predictive Query Language (PQL). The predictive query for a binary classification model, where we predict labels at the user level (the label column is available in that table), is very simple:
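A minimal sketch of what this query could look like, assuming the label column in the users table is named label:

```
PREDICT users.label
FOR EACH users.user_id
```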
PQL is very expressive and can be used to define a wide variety of ML tasks; see the Predictive Query Tutorial for reference. After we define the ML problem with PQL, we set the model plan, which configures the encoders and model search space for this particular task; the search over these parameters was performed by running Kumo in BEST run mode. Once the model is defined and we run training, Kumo takes care of many crucial data science steps:
- Ingests the data from individual tables in the graph, handles missing values, and encodes/vectorizes the data in individual columns according to our encoder settings.
- Constructs the temporal graph based on primary and foreign key connections.
- Generates the training table according to our specifications, automatically taking care of point-in-time correctness for all entities in the graph with timestamps (sketched after this list).
- Performs a model architecture search over the model space we defined in the model plan.
- Provides relevant evaluation and explanations of the best model found in the search.
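As a rough illustration of the point-in-time correctness idea (not Kumo's implementation), restricting each entity's feature history to events strictly before a cutoff looks like this in pandas:

```python
import pandas as pd

def history_before(interactions: pd.DataFrame, cutoff) -> pd.DataFrame:
    """Keep only interactions observed strictly before the cutoff time,
    so no information from the future leaks into the features."""
    return interactions[interactions["timestamp"] < cutoff]

# Illustrative usage: build features from events before time t, then derive
# labels only from what happens after t (the cutoff value is arbitrary here).
train_history = history_before(interactions, cutoff=700)
```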
Kumo’s Graph Transformers
Kumo employs an advanced graph transformer model that captures the nuanced patterns and relationships within the dataset, outperforming all other existing graph machine learning models. Our model makes predictions based on the aggregation of information in a node’s local neighborhood. An example computational graph would look like this:
Computational graph for the DGraphFin dataset, given a target node and input graph.
Kumo’s graph transformer model architecture builds on many recent graph neural network and LLM advancements and cutting-edge academic research. This includes models such as GraphSAGE, which enables inductive representation learning, making it effective even with limited interaction data. ID-GNN is used to discern patterns such as repeated transactions, while PNA introduces diverse aggregation operators. GCN uses mean-pooling aggregation to capture user similarities, GIN focuses on frequency signals to understand complex behaviors, NBF efficiently captures paths between nodes, and GraphMixer employs temporal representation learning to analyze sequences of user actions. LLM techniques are used carefully to ensure proper learning of temporal sequences.
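Kumo's production architecture is more sophisticated than any single one of these models, but the core neighborhood-aggregation idea can be sketched with a minimal GraphSAGE-style network in PyTorch Geometric. This is an illustrative stand-in, not Kumo's actual model:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class FraudGNN(torch.nn.Module):
    """Two rounds of neighborhood aggregation followed by a per-node fraud score."""
    def __init__(self, in_dim: int = 17, hidden_dim: int = 64):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, hidden_dim)
        self.out = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))  # aggregate 1-hop neighbor features
        h = F.relu(self.conv2(h, edge_index))  # pull in 2-hop information
        return self.out(h).squeeze(-1)         # fraud logit for every node
```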
Results
Kumo significantly outperforms other benchmarks, particularly with the two-table graph setup, achieving an AUROC of 0.841 and an AUPRC of 0.063. The entire training and prediction run for the Kumo model took approximately 4 hours, with no feature engineering required. Achieving this level of performance using traditional ML methods may not be possible for this particular dataset, and even getting close would take an experienced data scientist many days of iteration and careful feature engineering. These metrics indicate a robust capability to detect fraudulent activities, making Kumo an excellent choice for real-world fraud detection applications.
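On a dataset this imbalanced, threshold-free metrics such as AUROC and AUPRC are the right lens. They can be computed with scikit-learn; the arrays below are placeholders for real held-out labels and model scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Illustrative only: replace with real held-out labels and model scores.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)   # 0/1 fraud labels
y_score = rng.random(1000)               # model fraud scores (higher = riskier)

print("AUROC:", roc_auc_score(y_true, y_score))
print("AUPRC:", average_precision_score(y_true, y_score))
```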
Conclusion
In this blog, we introduced the field of anomaly detection, focusing on its importance in identifying fraud. We explored the massive benefits of restructuring these types of problems as graphs and using graph machine learning approaches to detect anomalies. Not only is graph ML adaptable to the topological structure of the data, it is scalable to very large networks, can incorporate multimodal attribute data, is highly interpretable, and can be extended in many ways. Kumo, with its advanced graph transformer model, shows the power of machine learning in handling anomaly detection tasks, significantly outperforming traditional machine learning approaches on the DGraphFin dataset.
Kumo’s approach offers several key advantages: it seamlessly integrates various types of data, scales to large datasets, and continuously adapts to patterns in the data. This makes the product highly effective for real-world fraud detection and other anomaly detection tasks. The use of GNNs allows complex relationships and patterns in the data to be learned automatically, and the incorporation of LLM techniques resolves complex temporal behaviors, all without manual feature engineering, enabling more accurate predictions.
To get started with Kumo for fraud or other anomaly detection tasks, check out this guided walkthrough of Kumo for a fraud detection task. Kumo offers a scalable, efficient solution for any organization looking to use graph AI to enhance their data analysis and prediction capabilities.
Next step: request a demo.