Matching Algorithm - Match Result Tab

When a matching algorithm is applied, the identified matches are displayed on the 'Match Result' tab of the matching algorithm. This tool can be used along with the duplicates tabs, as defined in the Matching Algorithm - Duplicates Tabs topic here.

For Match and Merge, it is recommended to do the first rounds of tuning using the match tuning option, as defined in the Match Tuning topic here. Match and Merge cannot be reapplied in the same way that Match and Link can (as described below). Match tuning using Match Result does not override earlier merge decisions by the match algorithm. While you can tune a running system using the Match Result tab, you may have to manually unmerge erroneously merged records.

For Match and Link, you can bypass the match tuning step since the algorithm is non-invasive towards the source records and you can rerun to fully recalculate the golden records.

Truth Table

Determining how well different versions of a matching algorithm work requires a 'truth table': a set of known data that includes verified duplicates and non-duplicates. A truth table consists of pairs of objects that a user has inspected and determined to be duplicates or non-duplicates. A truth table can be built from the Match Result tab using either the information in the tab or the ‘Pair Export’ option.

Using the ‘Pair Import Confirmed’ and ‘Pair Export Confirmed’ features, a Match and Link or Identify Duplicates solution can continuously evaluate the results of the algorithm against the truth table. This import is less valuable in Match and Merge because Match and Merge does not use Confirmed Duplicate references. Instead, it merges the information directly into the golden records.

Note: For Match and Merge solutions, the Pair Import and Export tools are not applicable for early evaluations. Instead, use the Match Tuning functionality to adjust matching algorithms. For more information, refer to the Match Tuning documentation here.

Pair Export

The Pair Export option generates a CSV file that can be used for manual, offline confirmation and rejection of matched pairs. Use this option to export match scores.

The file has a header and the following standard columns:

  • <Pair> - The file has one row per source object, and the 'Pair' value indicates which objects belong together. The first two rows have the value '1', the next two rows have '2', and so on.

  • <Match y n> - Indicates whether pairs are matches or not. A value is only required for the first object in a pair.

  • <Equality> - The calculated equality percentage between the two objects.

  • <ID> - ID of the object in the current row.

  • <Name> - Name of the object in the current row.

  • <URL> - STEP URL of the object in the current row.
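
To illustrate how the rows relate to pairs, the following is a minimal sketch of reading an exported file offline. The exact header strings, the semicolon delimiter, and the sample IDs and names are assumptions made for illustration; check them against a real export before relying on this.

```python
import csv
import io

# Hypothetical sample data. Header names and the ';' delimiter are assumptions.
SAMPLE = """Pair;Match y n;Equality;ID;Name;URL
1;y;97.5;CUST-1;Acme Corp;
1;;97.5;CUST-2;ACME Corporation;
2;n;61.0;CUST-3;Bolt Ltd;
2;;61.0;CUST-4;Volt Ltd;
"""

def read_pairs(text, delimiter=";"):
    """Group the two rows of each pair; the decision is read from the first row."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        grouped.setdefault(row["Pair"], []).append(row)

    pairs = []
    for pair_no, rows in grouped.items():
        if len(rows) != 2:
            raise ValueError(f"pair {pair_no} has {len(rows)} rows, expected 2")
        first, second = rows
        decision = first["Match y n"].strip().lower()
        pairs.append({
            "pair": pair_no,
            "ids": (first["ID"], second["ID"]),
            "equality": float(first["Equality"]),
            # True for 'y', False for 'n', None when no decision was entered
            "is_match": {"y": True, "n": False}.get(decision),
        })
    return pairs

pairs = read_pairs(SAMPLE)
```

Each entry in `pairs` then holds both object IDs, the shared equality score, and the manual decision taken from the first row of the pair.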

No template is required for the initial export. To work with the data offline, however, use a template file to include attribute values in the exported file.

Prerequisite

Create a basic text document template file to be selected in the dialog as follows:

  • Attribute IDs separated by semicolons (;)
  • Save as CSV format
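
As an illustration, a template with this layout could be generated with a short script. This is a sketch only: the attribute IDs are hypothetical, and it assumes that all IDs go on a single line.

```python
def template_line(attribute_ids):
    """Join attribute IDs with semicolons (assumed: all IDs on one line)."""
    ids = [attr.strip() for attr in attribute_ids if attr.strip()]
    if not ids:
        raise ValueError("at least one attribute ID is required")
    for attr in ids:
        if ";" in attr:
            raise ValueError(f"attribute ID must not contain ';': {attr!r}")
    return ";".join(ids)

# Hypothetical attribute IDs, for illustration only.
line = template_line(["FirstName", "LastName", "Street", "City"])

# To save the template, write the line to a file with a .csv extension, e.g.:
# with open("pair_export_template.csv", "w", encoding="utf-8") as handle:
#     handle.write(line + "\n")
```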

Configuration

Use the following steps to perform Pair Export.

  1. Click the Pair Export button.

  2. In the Pair Export dialog, specify the following:

  • Interval to export: Specify an equality percentage interval that includes pairs expected to be both matches and non-matches, as well as pairs that are not clearly matches or non-matches. Only pairs with equality scores within this interval are exported.

  • Pairs per percent: Specify the maximum number of pairs to be exported for each percentage point.

  • Template file: Select the template file that contains the required attribute values.

  • Process description: Provide a description for the background process found under the Background Process tab.

  • Export Match Details: When checked, columns with part scores from decision table comparators and sub decision tables are included.

  3. Click the Export button to start the background process.

  4. From the BGP, open the exported file in Excel, and enter the decisions in the <Match y n> column for the first object in each pair.

  5. Save your changes.

  6. Use the Pair Import Confirmed option defined below to apply the manual matches added to the file.
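
The combined effect of the 'Interval to export' and 'Pairs per percent' settings can be sketched as follows. This is not the product's implementation: it assumes inclusive interval bounds, that each pair is assigned to the whole percentage point at or below its score, and that pairs are kept in the order they are encountered. The pair IDs and scores are invented.

```python
import math
from collections import defaultdict

def select_pairs(scored_pairs, low, high, pairs_per_percent):
    """Keep pairs inside [low, high], at most N per whole percentage point."""
    kept = []
    per_point = defaultdict(int)
    for pair_id, score in scored_pairs:
        if not (low <= score <= high):
            continue  # outside the interval to export
        point = math.floor(score)  # assumed bucketing rule
        if per_point[point] < pairs_per_percent:
            per_point[point] += 1
            kept.append((pair_id, score))
    return kept

# Hypothetical scored pairs.
scored = [("p1", 95.2), ("p2", 95.7), ("p3", 95.9),
          ("p4", 80.1), ("p5", 59.0), ("p6", 99.5)]
selected = select_pairs(scored, low=60, high=98, pairs_per_percent=2)
```

Here `p3` is dropped because the 95 percent point already holds two pairs, while `p5` and `p6` fall outside the interval. A wide interval with a small per-percent limit gives a sample that covers clear matches, clear non-matches, and the uncertain middle without producing an unmanageably large file.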

Pair Import Confirmed

After the file exported via the Pair Export option has been populated with match decisions, it can be imported via the 'Pair Import Confirmed' option. The 'Pair Import Confirmed' process uses the data in the file for identification purposes but does not import anything other than the confirmation data. This avoids reverting values that have been updated elsewhere since the pair export was performed.

Configuration

Use the following steps to perform Pair Import Confirmed.

  1. Click the Pair Import Confirmed button.

  2. In the Pair Import Confirmed dialog, specify the following:

  • Import File: Select the CSV file to import. This file must have been produced by the Pair Export process, use a semicolon delimiter, and include the header row.

  • Confirmed Relation Reason: Provide a reason for confirming the objects as duplicates or non-duplicates. This reason is saved on each confirmed relation as a metadata attribute and can be viewed on the matching tab of the relevant objects.

  • Process description: Provide a description for the background process.

  3. Click the OK button to start the background process.

  4. Review the BGP Execution Report for a count of the matches and the Confirmed Duplicates and Confirmed Non Duplicates tabs for the modified records.

Pair Export Confirmed

The Pair Export Confirmed option allows you to compare two versions of a matching algorithm against the confirmed duplicates / non duplicates truth table constructed manually or via the steps described above. A background process generates a CSV file with the comparison results and enables the Match Distribution tool. This tool allows the user to view the differences between the match algorithms and compare their accuracy.

Prerequisites

  1. Duplicate your matching algorithm and edit the copy as desired. You will compare the original and the copy which has been fine-tuned.

  2. Create a basic text document template file to be selected in the dialog as follows:

    • Attribute IDs separated by semicolons (;)
    • Save as CSV format

Configuration

Use the following steps to perform Pair Export Confirmed.

  1. Click the Pair Export Confirmed button.

  2. In the 'Pair Export Confirmed' dialog, specify the following:

  • Comparison Algorithm: Select the fine-tuned matching algorithm that you want to compare with the selected algorithm (the original).

  • Template File: Select the template file created in the Prerequisites section, which lists the attribute IDs to include in the export.

  • Process description: Provide a description for the background process.

  3. Click the OK button to start the background process.

  4. Click the Go to process button, or on the BG Processes tab, expand the 'Matching Pair Export' node and select the relevant confirmed export process.

  5. On the BGP, open the Result flipper and on the Match Distribution row click the Show Distribution button.

  6. On the Confirmed Matches Distribution dialog:
  • In the table, select a row to view the algorithm data in a chart. Each column is defined in a section following these steps.
  • From the dropdown, select Bar Chart or Accumulated Chart. Each is defined in a section following these steps.

  7. Review the data and determine possible next steps to improve the algorithm. Click the OK button to close the dialog.

  8. Repeat this process as required.

  9. When the fine-tuned version of the matching algorithm produces fewer or zero 'False Positives' and 'False Negatives', choose an option to update the algorithm in use:

  • Copy the logic to the original matching algorithm

  • Replace the original algorithm with the fine-tuned version

Algorithm Data

When reviewing the results, false negatives and false positives are the errors produced by the algorithm when compared to the manually reviewed pairs. While the goal of fine-tuning an algorithm is to achieve 0 false results, having a count of 0 does not mean that the algorithm is perfect. The reliability of the result depends on the amount of data in the testing data set and how well the test data set represents the full data.

On the Confirmed Matches Distribution dialog, the table shows the following information about each algorithm:

  • Algorithm: The ID of the algorithm.
  • Threshold: The threshold used to distinguish between positives and negatives.
  • True Negative: Count of comparisons that were classified as a non-match both manually and by the algorithm.
  • False Negative: Count of comparisons that were manually classified as a match, but the algorithm classified as a non-match because the scores were below the threshold.
  • False Positive: Count of comparisons that were manually classified as a non-match, but the algorithm classified as a match because the scores were above the threshold.
  • True Positive: Count of comparisons that were classified as a match both manually and by the algorithm.
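
The four counts above can be reproduced offline from a truth table and a threshold. The following is a minimal sketch with invented scores. The source only states that scores below the threshold are non-matches and scores above it are matches, so treating a score exactly equal to the threshold as a match is an assumption.

```python
def classify(score, confirmed_match, threshold):
    """Label one comparison using the column names from the table above."""
    # Assumption: a score equal to the threshold counts as a match.
    predicted_match = score >= threshold
    if confirmed_match:
        return "True Positive" if predicted_match else "False Negative"
    return "False Positive" if predicted_match else "True Negative"

def tally(truth_table, threshold):
    """Count each outcome for (score, confirmed_match) entries."""
    counts = {"True Negative": 0, "False Negative": 0,
              "False Positive": 0, "True Positive": 0}
    for score, confirmed_match in truth_table:
        counts[classify(score, confirmed_match, threshold)] += 1
    return counts

# Hypothetical truth table: (equality score, manually confirmed as a match?)
truth_table = [(98.0, True), (91.0, True), (72.0, True),
               (88.0, False), (40.0, False), (55.0, False)]
counts = tally(truth_table, threshold=85)
```

With a threshold of 85, the pair scored 72 is a false negative and the pair scored 88 is a false positive. Running the same tally with the scores from the fine-tuned algorithm gives a like-for-like comparison of the two versions.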

Data Charts

For both the Bar Chart and the Accumulated Chart, the colors are identified in the chart legend shown below the chart, and generally:

  • Green represents relations that have been manually confirmed as duplicates

  • Red represents relations that have been manually confirmed as non-duplicates.

The threshold of the algorithm is shown as a vertical line.

Bar Chart

The bars in the chart show the frequency of the scores of the selected algorithm. The bar chart can either show a single algorithm or two algorithms in a special compare mode that enables a detailed comparison of the two algorithms.

Red bars are usually displayed to the left of the threshold indicator and green bars to the right.

  • Green bars displayed to the left of the threshold represent false negatives.

  • Red bars displayed to the right of the threshold indicator represent false positives.

For exact numbers of false positives and false negatives, review the table. Because the bars have a resolution of 1 percentage point, the exact numbers of false positives and false negatives cannot be read from the chart.

  • Click a colored bar to display the Match Pair List dialog. This includes an extract of the corresponding data from the CSV file to allow inspection of the attribute values of the pairs.

  • In the Match Pair List dialog, click the binocular button to open the matching algorithm editor with the relevant pair selected in the System Setup tab. This allows investigation of the algorithm behavior for a given pair.
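
The effect of the 1 percentage point bar resolution can be sketched as follows. The bucketing rule (rounding each score down to a whole percentage point) and the sample scores are assumptions made for illustration, not the chart's documented behavior.

```python
import math
from collections import Counter

def bucket_scores(truth_table):
    """Count confirmed duplicates (green) and non-duplicates (red) per point."""
    green = Counter()  # manually confirmed duplicates
    red = Counter()    # manually confirmed non-duplicates
    for score, confirmed_match in truth_table:
        point = math.floor(score)  # assumed: one bar per whole percentage point
        (green if confirmed_match else red)[point] += 1
    return green, red

# Hypothetical truth table: (equality score, manually confirmed as a match?)
truth_table = [(85.2, True), (85.8, True), (85.6, False), (42.3, False)]
green, red = bucket_scores(truth_table)
```

In this example the scores 85.2 and 85.8 both land in the green bar at 85. If a threshold fell between them, one pair would be a false negative and the other a true positive, but the bar alone could not show that, which is why the table remains the source for exact counts.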

Accumulated Chart

The chart shows the accumulated score frequency for the algorithms. Manually classified matches are green and accumulate to the right of the threshold line. Manually classified non-matches are red and accumulate to the left. The accumulated chart is useful for comparing the matching abilities of two algorithms because it is easy to evaluate the number of scores up to a certain point. The chart is also useful for identifying a good threshold value.
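
One way to reason about a good threshold offline is to sweep candidate values and count the errors each would produce. The sketch below is an illustration, not the tool's method. It assumes that false positives and false negatives are weighted equally, that a score equal to the threshold counts as a match, and that ties are resolved in favor of the lowest threshold. The scores are invented.

```python
def errors_at(truth_table, threshold):
    """Return (false_negatives, false_positives) for one candidate threshold."""
    false_negatives = sum(1 for score, is_match in truth_table
                          if is_match and score < threshold)
    false_positives = sum(1 for score, is_match in truth_table
                          if not is_match and score >= threshold)
    return false_negatives, false_positives

def best_threshold(truth_table, candidates):
    """Pick the candidate with the fewest total errors (lowest value on ties)."""
    return min(candidates, key=lambda t: (sum(errors_at(truth_table, t)), t))

# Hypothetical truth table: (equality score, manually confirmed as a match?)
truth_table = [(98, True), (91, True), (72, True),
               (88, False), (40, False), (55, False)]
threshold = best_threshold(truth_table, range(0, 101))
```

In practice the two error types rarely cost the same. A solution that merges records may prefer a higher threshold that trades extra false negatives for fewer false positives, so treat the result as a starting point for review rather than a final value.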