Matching Algorithm Tuning

Algorithm tuning begins during the build phase of the implementation. Early in this process, it is common to find numerous invalid matches scoring above the auto-merge threshold while valid matches fall short of the clerical review threshold. The goal of the algorithm tuning sessions is therefore to improve the accuracy of the matching logic so that matches between golden records score within the appropriate thresholds.
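
To make the two thresholds concrete, the following minimal sketch shows how a pair's score could fall into one of three bands. It is an illustration only: the `Thresholds` class, the `classify` function, the 0-100 score scale, and the values 80 and 95 are all assumptions made for this example, not product settings or a product API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical threshold pair; the values used below are illustrative only."""
    clerical_review: float
    auto_merge: float


def classify(score: float, t: Thresholds) -> str:
    """Place a pair's match score into one of three bands (assumed behavior)."""
    if score >= t.auto_merge:
        return "auto-merge"
    if score >= t.clerical_review:
        return "clerical review"
    return "no match"


# Assumed example values on an assumed 0-100 scale.
thresholds = Thresholds(clerical_review=80.0, auto_merge=95.0)
for score in (97.5, 88.0, 62.0):
    print(score, classify(score, thresholds))
```

An invalid match that lands in the "auto-merge" band, or a valid match that lands in the "no match" band, is the kind of outcome that tuning aims to reduce.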

Considerations

Anyone working with matching algorithm tuning is expected to be familiar with how to create a match tuning configuration. For more information on match tuning and creating a match tuning configuration, refer to the Match Tuning topic in the Matching, Linking, and Merging documentation.

The following considerations can improve the output of the match tuning process:

Match Tuning General Considerations - Broad, conceptual factors to consider before the match tuning process.
Match Tuning Pair Export Considerations - Specific to pair exports used for manual or offline confirmation and rejection of matched pairs, before and during the match tuning process.

Process

The matching algorithm tuning process is as follows:

Configure: Use a match tuning configuration to generate a data profile. Using this data profile, identify key data points to consider when configuring a baseline algorithm (matching algorithm and match codes).

For more information on match tuning and creating a match tuning configuration, refer to the Match Tuning topic in the Matching, Linking, and Merging documentation.
Generate Sample Pair: Once the baseline algorithm is configured, generate the random sample pair spreadsheet via a match tuning configuration. This baseline configuration is a ‘best-guess’ configuration based on the analysis of the data so far.

Before the sample pair review can begin, the raw data from the output file should be formatted for readability. The sample pair formatter Excel sheet can optionally be used on the output file.
Review Sample Pair: Review the sample pairs with the client. Each pair is marked ‘Yes’, ‘No’, or ‘Not Sure’ to indicate whether the two records should be considered the same entity by the algorithm and linked together.

The sample pair review process can be time-consuming, but it is critical to getting the algorithm tuned to meet requirements. Typically, review 1,000+ sample pairs with the stakeholders in each cycle. For some iterations, a pair export may be as large as 1,000 records per percentage point of interest.

Once the random sample pair spreadsheet is generated and formatted, it is vital to review the sample pairs to determine how the algorithm evaluates them. The primary purpose of the review is to assess the confidence of each merge and modify the thresholds if the scores appear inaccurate. During the review process, it is important to consider the following:
- Each pair should be marked with a decision as to whether the records are considered the same entity (based on the data available).
- It is best to approach this task from a ‘human’ standpoint as opposed to creating logic to help achieve a certain score.
- This is not a data cleaning task.
The goal by the end of each sample pair review session is to improve the quality of the matches found. It is much easier to identify false positives than false negatives in the pair export. Therefore, it is recommended to start with a broadly defined algorithm and narrow the match criteria during tuning. For more information on false positives and false negatives, refer to the Match Tuning Pair Export Considerations topic.
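As a rough illustration of how review decisions can be compared against the current thresholds, the sketch below tallies reviewed pairs. The `ReviewedPair` type, the `tally` function, the 0-100 score scale, and the sample data are hypothetical and do not reflect the actual layout of the pair export. Here a false positive is taken to be a ‘No’ pair scoring at or above the auto-merge threshold, and a false negative a ‘Yes’ pair scoring below the clerical review threshold.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class ReviewedPair(NamedTuple):
    score: float    # match score of the pair (assumed 0-100 scale)
    decision: str   # reviewer decision: "Yes", "No", or "Not Sure"


def tally(pairs: Iterable[ReviewedPair],
          clerical_review: float,
          auto_merge: float) -> Counter:
    """Count reviewed pairs that contradict the current thresholds."""
    counts: Counter = Counter()
    for pair in pairs:
        if pair.decision == "Not Sure":
            counts["not sure"] += 1
        elif pair.decision == "No" and pair.score >= auto_merge:
            counts["false positive"] += 1   # would have been merged automatically
        elif pair.decision == "Yes" and pair.score < clerical_review:
            counts["false negative"] += 1   # would never reach a reviewer
        else:
            counts["other"] += 1
    return counts


# Made-up review results for illustration.
reviewed = [
    ReviewedPair(98.0, "No"),
    ReviewedPair(96.0, "Yes"),
    ReviewedPair(85.0, "Not Sure"),
    ReviewedPair(70.0, "Yes"),
    ReviewedPair(60.0, "No"),
]
result = tally(reviewed, clerical_review=80.0, auto_merge=95.0)
print(dict(result))
```

Tracking these counts from one review cycle to the next gives a simple way to see whether a tuning change moved the algorithm in the intended direction.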
Tune Algorithm: Tune the algorithm based on feedback from the sample pair review, and generate a new set of sample pairs based on the updated algorithm. Tuning can involve:
- Adjusting the scoring method and weight of each scored attribute.
- Adjusting the relative weight of scoring across all the scored attributes.
- Adjusting the auto-merge and clerical review thresholds.
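The effect of adjusting relative weights can be sketched with a simple weighted average. This is not the product's scoring formula: the `weighted_score` function, the attribute names, the per-attribute similarities in the range 0 to 1, and the weights are all assumptions chosen to show how shifting weight between attributes moves a pair's overall score.

```python
from typing import Mapping


def weighted_score(similarities: Mapping[str, float],
                   weights: Mapping[str, float]) -> float:
    """Combine per-attribute similarities (0-1) into a 0-100 score.

    Hypothetical formula: a weight-normalized average, scaled to 100.
    """
    total_weight = sum(weights[name] for name in similarities)
    if total_weight == 0:
        raise ValueError("at least one scored attribute needs a non-zero weight")
    weighted = sum(similarities[name] * weights[name] for name in similarities)
    return 100.0 * weighted / total_weight


# One made-up pair: strong name and email agreement, weak address agreement.
pair = {"name": 0.9, "address": 0.4, "email": 1.0}

baseline_weights = {"name": 50, "address": 30, "email": 20}
tuned_weights = {"name": 50, "address": 10, "email": 40}

baseline = weighted_score(pair, baseline_weights)
tuned = weighted_score(pair, tuned_weights)
print(f"baseline={baseline:.1f} tuned={tuned:.1f}")
```

In this made-up case, moving weight from the address to the email raises the pair's score, which could move it from below a clerical review threshold into the review band without changing either threshold.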
Repeat steps 2 and 3 (Generate Sample Pair and Review Sample Pair) for two more cycles, or more as needed.
Finalize: Decide on the final auto-merge threshold and clerical review threshold.
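One possible way to support the final decision with review data is sketched below. It is a hypothetical aid, not a prescribed method: the functions `precision_at` and `lowest_threshold_meeting`, the candidate thresholds, the target share, and the sample data are assumptions for illustration. For each candidate threshold, it computes the share of ‘Yes’ decisions among the decided pairs scoring at or above that threshold, and returns the lowest candidate that meets the target.

```python
from typing import Iterable, Optional, Sequence, Tuple

Pair = Tuple[float, str]  # (score, decision), decision is "Yes", "No", or "Not Sure"


def precision_at(pairs: Sequence[Pair], threshold: float) -> Optional[float]:
    """Share of 'Yes' among decided pairs scoring at or above the threshold."""
    decided = [d for score, d in pairs if score >= threshold and d in ("Yes", "No")]
    if not decided:
        return None  # no reviewed evidence at this threshold
    return decided.count("Yes") / len(decided)


def lowest_threshold_meeting(pairs: Sequence[Pair],
                             candidates: Iterable[float],
                             target: float) -> Optional[float]:
    """Return the lowest candidate whose precision meets the target, if any."""
    for candidate in sorted(candidates):
        precision = precision_at(pairs, candidate)
        if precision is not None and precision >= target:
            return candidate
    return None


# Made-up review results for illustration.
reviewed = [
    (99.0, "Yes"), (97.0, "Yes"), (94.0, "No"), (92.0, "Yes"),
    (90.0, "Yes"), (85.0, "No"), (82.0, "Yes"), (75.0, "No"),
]
candidates = range(70, 100, 5)  # 70, 75, ..., 95

strict = lowest_threshold_meeting(reviewed, candidates, target=1.0)
relaxed = lowest_threshold_meeting(reviewed, candidates, target=0.75)
print("strict:", strict, "relaxed:", relaxed)
```

A result like this is only a starting point for discussion with the stakeholders. The share of valid matches does not necessarily rise steadily with the threshold, so candidates above the suggested value are worth checking as well.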