Running the Image Deduplication Process

Important: Image Analytics Package / Image Deduplication: This functionality has been deprecated and is no longer supported and/or available for new installations. This documentation is retained as a reference only for customers already using the functionality and for whom it remains available in the current state. The functionality will be removed in the future so customers using this should make plans to transition away from their implementation of it.

The image deduplication process can include the following parts:

  • Preparing images for deduplication evaluates all images and assigns a pHash.
  • Clearing stored values allows you to remove unnecessary pHash values.
  • Running Image Deduplication allows you to verify that the auto-handling and/or clerical review settings meet your expectations for identifying a duplicate.

Changes made by the deduplication process are recorded on the asset object's Status tab under the Revisions flipper. For updates made during auto-handling, the user who executed the deduplication process is written in the User parameter. For changes made during the clerical review workflow, the user doing the workflow task is written in the User parameter. To write the same user for all image deduplication processing, create a STEP user specifically for image deduplication processing and log in as that user when doing any deduplication work.

Prerequisites

Before you can evaluate the results of the image deduplication process, you must:

  1. Set up the Web UI for managing images sent to the clerical review workflow, as defined in the Configuring Web UI for the Image Deduplication Clerical Review Workflow topic.
  2. Create an Image Deduplication Configuration to define what constitutes a duplicate, as defined in the Creating an Image Deduplication Configuration topic.
  3. Consider the window size for comparisons during the image deduplication runs. The default is 20, but it can be adjusted to allow greater accuracy in identifying potential duplicates. While a larger window size increases accuracy, a smaller number optimizes performance. Subsequent runs at a smaller window size will likely return additional potential duplicates.

For best results, test window size with a known set of duplicates to determine your acceptable level of accuracy compared to the performance level required.

To adjust the window size, in the sharedconfig.properties file on the STEP application server, add the case-sensitive ImageDeduplication.ImageDeduplicationWindowSize property and provide an integer. Changes to the properties file are implemented when the server is restarted. For example:

ImageDeduplication.ImageDeduplicationWindowSize=50

  4. Assign the 'STEP Workflow Administrator' privilege to users who will run the image deduplication configuration right-click options in System Setup. This privilege allows removing tasks from a workflow. Each time the image deduplication process is run, tasks that are already in the workflow must first be removed. For more information, refer to the Workflows section of the Setup Actions and Error Descriptions topic in the System Setup documentation.
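
The window-size trade-off described in the window size prerequisite can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that hashes are sorted and each one is compared only with the next few entries in that order. This documentation does not describe how STEP performs comparisons internally, so the function names, the 8-bit sample hashes, and the max_distance threshold below are all hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def find_candidate_pairs(hashes, window_size, max_distance):
    """Return ID pairs whose hashes differ by at most max_distance bits.

    Assumption: entries are sorted by hash value, and each entry is compared
    only with the next `window_size` entries. This is a generic windowed
    comparison, not a description of STEP's implementation.
    """
    ordered = sorted(hashes.items(), key=lambda item: item[1])
    pairs = set()
    for i, (id_a, hash_a) in enumerate(ordered):
        # A larger window means more comparisons per image.
        for id_b, hash_b in ordered[i + 1 : i + 1 + window_size]:
            if hamming_distance(hash_a, hash_b) <= max_distance:
                pairs.add((id_a, id_b))
    return pairs


# Hypothetical 8-bit hashes. img-A and img-B differ in a single bit,
# but that bit is the highest one, so sorting places them far apart.
hashes = {
    "img-A": 0b00001111,
    "img-B": 0b10001111,
    "img-C": 0b00110000,
    "img-D": 0b01010101,
}

print(find_candidate_pairs(hashes, window_size=1, max_distance=1))
print(find_candidate_pairs(hashes, window_size=3, max_distance=1))
```

With a window of 1, the sketch finds no candidate pairs. With a window of 3, it finds the img-A and img-B pair, at the cost of more comparisons per image.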

Important: As with any deduplication task that aims to delete redundant data, it is vital to first thoroughly test the process on a non-production system, such as a test environment. Metadata can and intentionally will be lost as a result of the deduplication handling process. There is no undo option, nor is there a recovery function. While restoring from a backup can be acceptable in a test environment, it is likely to cause an unacceptable amount of lost data in a production system.

Preparing Images for Deduplication

The 'Prepare images for deduplication' option is a manual way to run the deduplication algorithm and ensure that a pHash is assigned to each image in the selected classification. This option is expected to be used when you first activate image deduplication so that all existing images can be evaluated and have a pHash assigned. Assigning a pHash value is also included in the 'Run Image Deduplication' process, but this increases the overall process time if a pHash value must be generated for many images. For details, refer to the Preparing for Deduplication section of the Handling Duplicate Images topic.

Note: To decrease the time required for the initial 'Run Image Deduplication' process, run 'Prepare images for deduplication' when system use is low, for example, overnight.
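
To give a rough idea of what a stored image hash represents, the following Python sketch computes an average hash, which is a simpler relative of perceptual hashing. It is a stand-in for illustration only and is not the pHash algorithm that STEP uses. The function name and the 4x4 grayscale sample grids are hypothetical.

```python
def average_hash(pixels):
    """Return a bit string with one bit per pixel of a small grayscale grid.

    A bit is "1" when the pixel is brighter than the mean of the grid.
    Integer arithmetic (p * n > total) avoids floating-point rounding.
    """
    flat = [p for row in pixels for p in row]
    total = sum(flat)
    n = len(flat)
    return "".join("1" if p * n > total else "0" for p in flat)


# Hypothetical 4x4 grayscale image with alternating dark and bright columns.
base = [
    [10, 200, 30, 220],
    [15, 210, 25, 230],
    [12, 205, 35, 215],
    [11, 190, 40, 225],
]

# The same image with every pixel made uniformly brighter.
brighter = [[p + 20 for p in row] for row in base]

# A visibly different image: each row of the original is reversed.
mirrored = [list(reversed(row)) for row in base]

print(average_hash(base) == average_hash(brighter))  # same hash
print(average_hash(base) == average_hash(mirrored))  # different hash
```

The uniformly brightened copy produces the same hash as the original, while the mirrored image produces a different one. This is the general property that lets a hash of this kind flag near matches as well as exact copies.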

Use the following steps to prepare images for deduplication.

  1. Select the context that includes images to be deduplicated. For more information, refer to the Contexts topic.

  2. In System Setup, select an image deduplication configuration, right-click to display the deduplication options and click the Prepare images for deduplication option.

  3. On the background process status dialog that displays:
  • Click Go to process to display the BG Processes tab with the 'Image Deduplication Preparation' process under the 'DeduplicationPreparation' node.
  • Click Close to dismiss the status dialog.

Clearing Stored Values

The 'Clear stored values' option removes all stored pHash values. This can be used when the classification selected in an image deduplication configuration changes, since the stored pHash values for the original classification are no longer required.

This option can also be used if the server crashes or there is some unexpected server error while storing pHash values, since the cache can be corrupted.

Once the values are cleared, use the 'Prepare images for deduplication' option to create new pHash values prior to running the image deduplication process.

Running Image Deduplication

Initially, running image deduplication should include testing to verify that the auto-handling and clerical review settings on the configuration correctly identify the expected duplicate images. Once the configuration is verified to meet the requirements, you will review the background process execution report to determine if images were auto-handled and/or sent to clerical review.

Note: When testing, it is a good idea to set the configuration for a single classification folder that contains a known set of images, for example, a predetermined number of actual duplicates or near matches. Evaluating the accuracy of the results is easier when you know what is expected. For more information, refer to the Deduplication Strategy outlined in the Handling Duplicate Images topic.

Use the following steps to run an image deduplication configuration.

  1. In System Setup, select the image deduplication configuration, right-click, and click the Run Image Deduplication option. For details, refer to the Handling Duplicate Images topic.

  2. On the background process status dialog that displays, click Go to process to display the BG Processes tab with the 'Image Deduplication Run' process under the 'DeduplicationRun' node.

  3. Review the Execution Report to determine whether images were handled as you expected: auto-handled, sent to clerical review for manual handling, or both.

For example, in the image below:

  • Box 1 shows that a group is being auto-handled. An image has been selected as the master, and the others have been marked for deletion, as noted by the IDs shown.
  • Box 2 shows that no images are being sent to clerical review.

  4. Take action based on the Execution Report results:
  • If there are images to be handled manually in clerical review, continue with the Using Image Deduplication Clerical Review topic.
  • If you want to modify the configuration and retest the same images again, continue with the Clearing Image Deduplication Metadata Attribute Values section below.

Clearing Image Deduplication Metadata Attribute Values

While testing your image deduplication configuration, you may need to run deduplication multiple times on the same images to determine the settings that meet your requirements. Completing a deduplication run includes writing values to metadata attributes on images, and these values can prevent the image from being considered in a future deduplication run. Clearing the metadata values allows the images to be evaluated again.

To clear the image deduplication attribute values, repeat the steps below for the following metadata attributes:

  • Confirmed Duplicates
  • Confirmed Non-Duplicates
  • Deduplication Delete Flag

  1. On the Advanced Search tab, in the Search parameter, select one of the deduplication attributes listed above. For more information, refer to the Search Functionality topic.
  2. Type ' = *' after the attribute to indicate a wildcard search and click the Search button.
  3. On the Warning dialog, click the Search anyway button.
  4. Click the Bulk Update button to configure a bulk update for this attribute.
  5. On the Operations step, from the dropdown select the Attribute Values group, and choose the Set Value operation. Leave the value parameter blank to clear the attribute value. For more information, refer to the Attribute Values Set Value Operation topic in the Bulk Updates documentation.
  6. Complete the bulk update as defined in the Creating a Bulk Update topic in the Bulk Updates documentation.
  7. Repeat these steps for each of the remaining image deduplication metadata attributes.