Matcher: Machine Learning Matcher

The Machine Learning Matcher employs a pre-trained machine learning model to match individual party data elements such as person names, emails, phone numbers, and addresses. The first pre-trained model version released with update 2024.1 supports person name matching exclusively.

The Machine Learning Matcher simplifies the matching process by improving the ability to create accurate and efficient matching algorithms for comparing person names. Additionally, the Machine Learning Matcher supports groups of nicknames, further simplifying the process of defining the nickname aliases.

Important: The option to utilize the Machine Learning Matcher is exclusive to STEP SaaS systems. On-premises systems are not supported for implementing this matcher. Additionally, the use of the Machine Learning Matcher is exclusive to matching algorithms using embedded match codes. For more information, refer to the Match Codes topic in the Matching, Linking, and Merging documentation

Model version

The Machine Learning Matcher has a model version concept that allows versioning of the matcher. New model versions are released outside of the normal STEP update cycle. In the Machine Learning Matcher editor dialog, there is a link to open 'Release Notes' that explain the changes done in each model version.

Every version of the model has a different underlying pre-trained machine learning model and will therefore produce different scores. Additionally, each model version can have different capabilities, supporting different data elements and producing different score output elements.

The versioning system consists of a major version number (first digit), a minor version number (second digit), and the date it was released. The rules governing these are:

Major versions: Involves a change in supported input data elements and / or a change in output score elements.
Minor versions: Involves a change to the scores, but the supported input data elements and output score elements remain the same.

Data Elements

The Machine Learning Matcher takes input from the Data Elements that are selected in the matcher.

It is possible to configure only part of the data elements. The output scores corresponding to unconfigured data elements will always be 0.

Note: Subsequent model versions will be released outside of the normal STEP update cycle to support additional party data object types.

Nickname CSV Asset

Some model versions support a ‘Nickname CSV Asset’ which is a CSV file containing nickname alias groups that will be used in the person name matching. The file enables STEP to provide additional information to the matching process, facilitating the identification of names that are nicknames or shorthand versions of longer names.

For example, if a person is registered under different names like ‘Bill’ or ‘William’, the matcher might return a low name match score. By providing the Nickname CSV Asset, the matching service can recognize the match between the two names, resulting in a higher score. Data stewards can maintain and adjust the Nickname CSV Asset to suit their company’s unique data requirements.

All names on each row of the file are considered part of a nickname group, and all names are handled equally, meaning that the order of the names has no significance.

Additionally, the nickname groups can be utilized when generating match codes. For more information, refer to the Match Code Generator: Person Name and Address topic in the Matching, Linking, and Merging documentation.

When creating the CSV file, Stibo Systems recommends that users adhere to the following guidance to prevent errors:

Semicolon is a reserved character. Avoid using semicolons within names, as they serve as separators between names.
Avoid line breaks in nickname values. Nickname values should not contain line breaks, as this will be interpreted as the start of a new nickname group.
Support for ‘Newline’ formats. The system supports ‘Newline’ formats, including CR LF, LF, and CR.
UTF-8 file format is required. The CSV file must be in the UTF-8 format to ensure compatibility with the system.
Lines without a semicolon are ignored. Lines lacking a semicolon are disregarded. Ensure that a semicolon is included as a name separator to have the line included in the output.
Tabs and spaces are trimmed. White-space characters at the beginning or end of a line are removed, so avoid using them.
Multiple tokens separated with white-space(s) will be ignored. Nicknames should consist of a single name without internal white-space.

A default Nickname CSV Asset, containing typical US nicknames, can be acquired by contacting Stibo Systems at cmdm@stibo.com.

Output scores

The Machine Learning Matcher produces individual scores for each of the configured elements in the Model version, e.g., Names, Addresses, Emails, and Phone numbers. Some Model versions for person name matching also facilitate the generation of name subscores, such as ‘name.firstname’ and ‘name.lastname’. These subscores are derived from the overall ‘Name’ score, providing more detailed information about the first name and the last name match scores.The score hierarchy is displayed with the use of dot (.) notation.

In the example below, the matcher returns an overall match score of 95.3 when comparing the two selected names.

All individual scores and subscores can be used in Rules in the Match Criteria as well as in Function and JavaScript Function matchers, using the same dot notation. For more information, refer to the Matcher: JavaScript Function topic in the Matching, Linking, and Merging documentation.

Configuring a Machine Learning Matcher

The Machine Learning Matcher can be added in the ‘Matchers’ flipper of the Decision Table dialog by clicking the ‘Add Matcher’ link (as defined in the Match Criteria topic of the Matching, Linking, and Merging documentation).

After adding the Machine Learning Matcher, configure it as follows:

Click into the ‘Matcher’ column and click the click the ellipsis button () to access the configuration dialog.
The configuration dialog for the Machine Learning Matcher opens.
- To choose the Model version, click the dropdown to select the desired pre-trained model. The dropdown provides a list of available model versions. By default, the latest version is selected.
- Clicking the Release Notes link will display a table showing the release notes for all available model versions. The table includes information such as the version number, the release date, and the release note information itself.
- In Data Elements, a table is available with the ‘Type’ and ‘Data Element’ fields. The ‘Type’ field is pre-populated with the supported types for the model version selected. To specify the data elements from which the Machine Learning Matcher should obtain input, click into the ‘Data Element’ field and make a selection.
- Model versions that support person name matching often also support nickname groups. To provide a Nickname CSV Asset containing nickname alias groups, click the ellipsis button () and browse to select the file. Before supplying the CSV asset, a new object type for Person Name Alias Groups must be added.
To evaluate the configuration of the data model for the Select Nodes parameter:
- Click the ellipsis button ()for each field and select two objects for comparison.
- Click the Evaluate button.
  
  When evaluating the two nodes, the Machine Learning Matcher produces individual scores for each of the configured elements in the Data Element field, e.g., Names, Addresses, Emails, and Phone numbers. Additional subscores are also displayed if the selected model version supports it.
Click OK to save and display the configuration in the 'Matchers' flipper.

Configuring new object type for Person Name Alias Groups

Before providing a Nickname CSV Asset, it is necessary to add a new object type for Person Name Alias Groups. Subsequently, configure the Matching Component Model with this newly added object type.

For details, refer to the Configuring Matching Component Model topic in the Matching, Linking, and Merging documentation.

Ensure that the object type is configured with the MIME Type 'Text/plain; charset=UTF-8' to allow the matching service to recognize and read the content of the file. For details, refer to the ‘Object Types and Structures’ section of the System Setup documentation.

Now, it is possible to add the Nickname CSV file to a new asset using the newly created object type and select the asset when configuring the Machine Learning Matcher.