Match Tuning General Considerations
Before developing a matching strategy, it is important to consider the client organization's data and the potential challenges the algorithm will have to account for. The following general considerations and challenges are commonly encountered when implementing matching.
Start Small
Tuning with the full data set from the start may not be advisable when volumes are large. Instead, tune in iterations of increasing volume. For example, start with 1,000 records, then move to 100,000, and finish with at least 20 percent of the full data set. Make sure each sample is representative, including data from all sources and data captured through different means. If the volume is small enough, consider tuning using 20, 30, 60, and 100 percent of the total data set.
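The iteration plan above can be sketched in a few lines of Python. This is only an illustration: the function name `tuning_sample_sizes` and the `small_threshold` cutoff are assumptions, since the guidance does not define when a volume counts as "small enough".

```python
def tuning_sample_sizes(total_records, small_threshold=10_000):
    """Return increasing sample sizes for successive tuning iterations.

    small_threshold is a hypothetical cutoff for a "small" data set;
    adjust it to the project at hand.
    """
    if total_records <= small_threshold:
        # Small data set: tune on 20, 30, 60, and 100 percent of the records.
        return [total_records * pct // 100 for pct in (20, 30, 60, 100)]
    # Large data set: fixed early steps, then 20 percent of the total.
    final = total_records * 20 // 100
    steps = [size for size in (1_000, 100_000) if size < final]
    return steps + [final]


if __name__ == "__main__":
    print(tuning_sample_sizes(5_000_000))
    print(tuning_sample_sizes(5_000))
```

For a data set of 5,000,000 records this yields iterations of 1,000, 100,000, and 1,000,000 records. The final step is the 20 percent minimum; a larger final sample is equally valid.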
Obtaining Data
Solution consultants should expect delays in receiving customer/entity data. They should work with the client organization early in the process to define the data to be delivered and push to receive it as early as possible. Data delivery can be held up for both technical and process-related reasons. For example, the Extract, Transform, Load (ETL) team may have issues staging data from the source systems, and the legal or security team may introduce other delays.
Solution consultants should establish a delivery date that the client agrees on and should emphasize that delays to that date will also delay critical-path tasks.
Note: Having access to real production data is a critical dependency to starting the algorithm tuning tasks.
Data from all sources must be included. Data quality and characteristics can vary from source system to source system, so getting samples from every source is critical. This includes samples of all object types in scope as well as data captured through different means (call center, web, mobile, etc.).
Other considerations for sampling data:
- Data that crosses regions
- The age of records (recently created vs. created 20 years ago)
- Records' last update (recently updated vs. updated 20 years ago)
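One way to honor these sampling considerations is a stratified sample, which draws from every combination of source, region, and record age rather than from the data set as a whole. The sketch below uses only the Python standard library. The field names (`source`, `region`, `age_bucket`) are hypothetical and would need to be mapped to the attributes of the client's data.

```python
import random
from collections import defaultdict


def stratified_sample(records, key_fields, fraction, seed=0):
    """Sample roughly `fraction` of the records in every stratum.

    A stratum is one combination of the values in key_fields.
    At least one record is kept per stratum so that no source,
    region, or age group disappears from the sample.
    """
    rng = random.Random(seed)  # fixed seed keeps samples repeatable
    strata = defaultdict(list)
    for record in records:
        strata[tuple(record[f] for f in key_fields)].append(record)

    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


if __name__ == "__main__":
    # Hypothetical records: ten from a CRM source, one from a web source.
    data = [{"id": i, "source": "crm", "region": "EU", "age_bucket": "recent"}
            for i in range(10)]
    data.append({"id": 10, "source": "web", "region": "US", "age_bucket": "old"})
    picked = stratified_sample(data, ("source", "region", "age_bucket"), 0.2)
    print(len(picked), sorted({r["source"] for r in picked}))
```

Keeping at least one record per stratum is a design choice made for this sketch. It slightly over-represents rare groups, which is usually acceptable when the goal is to expose the algorithm to every kind of data.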
Data Security
Solution consultants should work with the client organization to determine the level of security needed around the data they provide. Note that the client may be held to a higher degree of security due to their industry's regulations.
Using Real Data
Algorithm tuning is highly data-dependent, so real production data from each source system must be made available for analysis. Entity data is required at key points during the implementation:
- 10-100 records for data modeling
- 20 percent of total data volume for algorithm tuning
- 100 percent of total data volume for go-live
Stakeholder Input
It is important to have both the data steward and the data owner present during sample pair review sessions. For more information about sample pair review, refer to the Process section in the Matching Algorithm Tuning topic.
In this context, the data steward is the business user who has been tasked with the formation and execution of policies for the management of data and metadata. The data owner is the business user who typically has a direct line of responsibility for a functional area.
There are several personnel considerations to make during the sample pair review process:
- Stibo Systems recommends that the consultant be present on-site for the sample pair review sessions. Due to the highly interactive nature of these discussions, being on-site helps facilitate the process.
- Data owners from different functional areas within the organization may have differing opinions on matching requirements. It may be difficult to get consensus among them.
- Some clients may introduce a data governance board that can assist with reaching consensus among the data owners.
Note: It is important to set expectations with the client organization that, while those implementing the initial configuration can provide guidance, it is the organization's responsibility to determine which entity records should match and which should not.
However, solution consultants should not expect a client organization to be able to articulate their matching requirements. To arrive at a baseline algorithm configuration, a discussion with the client should occur, focusing on what their priorities are for generating golden records. The sample pair review sessions will help facilitate the discussion around finalizing matching requirements.
Clerical Review
Keep in mind that any records below the auto-merge threshold and above the clerical review threshold are placed in clerical review for manual review. It is important to discuss with the client organization what types of potential duplicates are evaluated as part of clerical review and what volume of records is acceptable. Client organizations with a low volume, such as B2B organizations, may want a looser algorithm where most or sometimes all records are reviewed manually. Client organizations with a high volume, such as B2C organizations, may want few or no records reviewed manually.
This discussion becomes increasingly important as volumes grow, potentially leading to thousands of tasks that need manual review. Additionally, if it is not possible to articulate rules that define a match, it likely will not be possible for a human to determine if records are a match using the data provided.
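To make the volume discussion concrete, the two thresholds can be applied to a set of sample pair scores to estimate how many pairs would land in clerical review. The following is a minimal sketch, not the product's implementation. The threshold values, the 0-100 score scale, and the treatment of scores that fall exactly on a threshold are all assumptions.

```python
from collections import Counter


def classify(score, auto_merge_threshold, clerical_review_threshold):
    """Place a match score in one of three outcome bands.

    Boundary handling is an assumption: a score equal to a threshold
    is treated as belonging to the higher band.
    """
    if score >= auto_merge_threshold:
        return "auto_merge"
    if score >= clerical_review_threshold:
        return "clerical_review"
    return "no_match"


def review_workload(scores, auto_merge_threshold, clerical_review_threshold):
    """Count how many scored pairs fall into each outcome band."""
    return Counter(
        classify(s, auto_merge_threshold, clerical_review_threshold)
        for s in scores
    )


if __name__ == "__main__":
    # Hypothetical scores and thresholds, for illustration only.
    sample_scores = [99, 95, 90, 80, 79, 50]
    print(review_workload(sample_scores, 95, 80))
```

Running the count against each tuning sample gives the client a rough projection of the manual review workload before the thresholds are finalized.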
Iterations of Review
Each client has a unique data set, so the matching algorithm must be tuned specifically to identify matches in that data. Expect three or more match tuning iterations. It is not uncommon to have five or more.
Rule Tips
Rules are a set of criteria that must be true for the result score to be assigned. The rules can quickly become complex as varying use cases are identified in the data. The best way to handle this complexity is to keep each rule simple and easy to understand, especially since there is no restriction on the number of rules that can be created. Additionally, the rules should be well documented so that changes to the algorithm are easier both during tuning and after going live.
The result should be calculated as a weighted sum of the matchers in play. Conditions should be used to limit the combinations of matchers in the weighted sum to no more than two matchers; otherwise, it is not possible to tune to a threshold.
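The idea of simple, documented rules that each combine at most two matchers can be illustrated as follows. This is a sketch in plain Python, not the product's configuration syntax. The matcher names (`name`, `email`, `address`), the weights, and the 0-100 score scale are hypothetical.

```python
def weighted_sum(matcher_scores, weights):
    """Weighted average of the matchers named in `weights`."""
    if len(weights) > 2:
        # More than two matchers per rule makes threshold tuning impractical.
        raise ValueError("limit each rule to at most two matchers")
    total_weight = sum(weights.values())
    return sum(matcher_scores[m] * w for m, w in weights.items()) / total_weight


# Each rule: (documentation, condition, weights). The first rule whose
# condition is true assigns the result score.
RULES = [
    ("Email available: score on name and email",
     lambda s: s.get("email") is not None,
     {"name": 40, "email": 60}),
    ("No email, address available: score on name and address",
     lambda s: s.get("address") is not None,
     {"name": 50, "address": 50}),
]


def score_pair(matcher_scores):
    """Return the documentation of the rule that fired and its score."""
    for description, condition, weights in RULES:
        if condition(matcher_scores):
            return description, weighted_sum(matcher_scores, weights)
    return "No rule applied", 0.0


if __name__ == "__main__":
    print(score_pair({"name": 90, "email": 100, "address": 20}))
    print(score_pair({"name": 80, "email": None, "address": 60}))
```

Because each rule carries its own description and touches only two matchers, a change in one rule's weights affects a narrow, predictable set of pairs, which keeps the thresholds tunable.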