Solution Background and Business Value
Entity resolution enables companies to identify and merge records that refer to the same real-world entity (such as customers, products, or businesses) across different data sources. Consolidating these records gives a more accurate and holistic view of each entity, which in turn improves data quality and decision-making. Entity resolution also matters for predictive tasks, because duplicate entries introduce noise and generally reduce model accuracy.
Despite its importance, entity resolution is notoriously difficult to perform accurately with traditional rule-based approaches. These methods often rely on manually crafted heuristics that become difficult to maintain at scale and struggle to capture the complex relationships between data fields. Kumo AI's feature-based learning and graph neural network approach creates context-aware embeddings that let it identify subtle, non-obvious links between records, which makes it well suited to entity resolution tasks.
Entity resolution problems come in many forms. This example shows how to use Kumo AI to create a link prediction model that identifies accounts created by the same user on two different platforms.
Data Requirements and Schema
To develop an effective entity resolution model, we need a structured set of tables that captures the relevant user data for both platforms and represents the signals used to resolve entities. Only a minimal set of tables is required to generate predictions, but adding relevant information to the graph generally improves model accuracy.
One of the most critical parts of setting up training is the labels table. Kumo trains supervised models, which require high-quality labels to be accurate. In this example, each row of the labels table represents an established link between two accounts on different platforms. This table must be generated before training, either from prior data or by treating the highest-confidence signal as ground truth. As the model finds more pairs, the labels table can be updated with new entries, which can improve overall accuracy.
For this example, the highest-confidence signal is email: we assume that two users sharing the same email is ground truth for the existence of a link. Device ID is a medium-confidence signal: if two users access the platforms through the same device, there is a strong likelihood of a link. To add other signals, such as IP addresses or content links, follow the same structure used for device signals: a shared table with connections to users from both platforms.
Core Tables
- Platform A User Data:
- Stores data about each user from platform A, using email as an identifier
- Note: Emails are omitted from the table to prevent data leakage during training
- Key attributes:
  - platform_a_user_id: unique user identifier for platform A
  - first_seen: user creation date
  - last_seen: last time a user was seen
- Optional: other user attributes (age, gender, location, etc.)
- Platform B User Data:
- Stores data about each user from platform B, using email as an identifier
- Contains similar information to the platform A user data table
- Key attributes:
  - platform_b_user_id: unique user identifier for platform B
  - first_seen: user creation date
  - last_seen: last time a user was seen
- Optional: other user attributes (age, gender, location, etc.)
- Platform A User Sessions:
- Stores data about each user session from platform A
- Key attributes:
  - platform_a_session_id: unique session identifier for platform A
  - platform_a_user_id: the user from platform A this session belongs to
  - create_date: creation date of the session
  - device_id: device used for this session
- Optional: IP address, duration, location, etc.
- Platform B User Sessions:
- Stores data about each user session from platform B
- Contains similar information about user sessions as those from platform A
- Key attributes:
  - platform_b_session_id: unique session identifier for platform B
  - platform_b_user_id: the user from platform B this session belongs to
  - create_date: creation date of the session
  - device_id: device used for this session
- Optional: IP address, duration, location, etc.
- Device Data:
- Stores data about each device used by users from both platforms A and B
- Key attributes:
  - device_id: unique device identifier
  - device_type: device type
- Optional: device brand, device model, etc.
- Labels Table:
- Stores each established link between an account on platform A and an account on platform B
- Key attributes:
  - link_id: unique identifier for each link
  - platform_a_user_id: identifier for a user from platform A
  - platform_b_user_id: identifier for a user from platform B
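As a rough illustration of how the labels table could be bootstrapped from the email signal, the sketch below pairs accounts whose normalized emails match and then drops the email column from the user tables. This is a minimal plain-Python sketch, not Kumo's API: the sample records, the `email` field on the raw rows, and the helper names `normalize_email` and `build_labels` are all assumptions made for illustration, and a real pipeline would more likely run as a SQL or dataframe join over the source tables.

```python
from itertools import count

# Hypothetical raw account records, before emails are dropped from the user tables.
platform_a_raw = [
    {"platform_a_user_id": "a1", "email": "Ana@example.com"},
    {"platform_a_user_id": "a2", "email": "bo@example.com"},
    {"platform_a_user_id": "a3", "email": None},
]
platform_b_raw = [
    {"platform_b_user_id": "b7", "email": "ana@example.com "},
    {"platform_b_user_id": "b8", "email": "cy@example.com"},
]


def normalize_email(email):
    """Lowercase and trim an email; return None when it is missing or empty."""
    if not email:
        return None
    cleaned = email.strip().lower()
    return cleaned or None


def build_labels(a_rows, b_rows):
    """Return one labels row per (platform A, platform B) pair sharing an email."""
    # Index platform B users by normalized email; rows without an email are skipped.
    b_by_email = {}
    for row in b_rows:
        key = normalize_email(row["email"])
        if key is not None:
            b_by_email.setdefault(key, []).append(row["platform_b_user_id"])

    link_ids = count(1)  # simple sequential link_id (an assumption)
    labels = []
    for row in a_rows:
        key = normalize_email(row["email"])
        for b_id in b_by_email.get(key, []):
            labels.append({
                "link_id": next(link_ids),
                "platform_a_user_id": row["platform_a_user_id"],
                "platform_b_user_id": b_id,
            })
    return labels


labels = build_labels(platform_a_raw, platform_b_raw)

# Drop emails from the user tables so the label signal does not leak into training.
platform_a_users = [{k: v for k, v in r.items() if k != "email"} for r in platform_a_raw]
platform_b_users = [{k: v for k, v in r.items() if k != "email"} for r in platform_b_raw]

print(labels)
```

Normalizing before matching (trimming whitespace, lowercasing) is a design choice here: it avoids missing links over trivial formatting differences, at the cost of assuming email addresses are case-insensitive in your data.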