Handling Duplicate Images
Handling duplicate images involves marking them for deletion and transferring their references to a master image that is retained in the system. The stages for handling duplicates are the same when using the auto-handling method and when using the clerical review workflow method. The difference is that for auto-handling, all action is taken without user interaction; for clerical review, a user can manually override the system actions.
A combination of the two methods provides the most effective means of identifying and removing duplicate images, as defined in the Deduplication Strategy section below.
For both methods, the complete process is defined below, and involves the following stages:
- Preparing for deduplication
- Identifying duplicates
- Selecting the master
- Processing images
- Troubleshooting errors
Deduplication Strategy
The most effective means of identifying and removing duplicate images involves using both the auto-handling and the clerical review methods. Using this strategy, pixel-to-pixel matches are identified and automatically handled first, leaving less obvious potential duplicates to be handled manually by a user.
Important: As with any deduplication task aimed at deleting redundant data, it is vital to first thoroughly test the process on a non-production system, such as a test environment. Metadata can and intentionally will be lost as a result of the deduplication handling process. There is no undo option, nor is there a recovery function. While restoring from a backup can be acceptable in a test environment, it is likely to cause an unacceptable amount of lost data in a production system.
For the initial deduplication run, set the configuration 'Auto-Handling Threshold' parameter to 'Yes' and the 'Clerical Review Threshold' parameter to 'No Clerical Review.' Because auto-handling only considers pixel-to-pixel matches, with this configuration the system selects a master image from the set of potential duplicates and compares every other image in the group to that master. If all images match the master pixel-to-pixel, then all images are auto-handled. If one or more images do not match the master pixel-to-pixel, then all are sent to clerical review. This configuration is intended to handle the bulk of the pixel-to-pixel matches up front, reducing the number of images for an end user to process in clerical review. However, as pixel-to-pixel matches are only identified relative to the selected master of the group, it is possible that some subsets of identical images will not be found by this method (for example, two identical images in a larger group will not be auto-handled if neither is a match to the master).
Modify the configuration for subsequent runs with the 'Auto-Handling Threshold' parameter set to 'Yes' and the 'Clerical Review Threshold' parameter set to 'Near Matches.' The process described above will still take place, but with this configuration, the group of images that are determined to be very close to the master will be sent to clerical review. In subsequent runs, the master from the auto-handled group will likely be grouped with the master from the clerical review group for further comparison.
When the configuration no longer produces groups of potential duplicates, consider modifying the 'Clerical Review Threshold' parameter to consider less than near matches and further reduce potential duplicate images.
Important: Once an image is marked as a duplicate (its 'Deduplication Delete Flag' metadata attribute is set to 'true') it is ignored by the deduplication functionality, and the final processing should be performed manually. That may include using a workflow to verify and then delete it from STEP, or move it to a hierarchy node outside of the one selected in the configuration, or searching to find all images marked for deletion and then deleting them from STEP as a group. The final processing should also include removing the IDs of the deleted images from the 'Confirmed Duplicates' metadata attribute.
Example
To illustrate this strategy, consider that images 1-3 are identified as a potential duplicate group, and image 1 is selected as the master. Image 1 is a pixel-to-pixel match to images 2 and 3, so images 2 and 3 will be automatically confirmed as duplicates, marked for deletion, and have their references moved to the master. Next, images 4-6 are identified as a potential duplicate group, and image 4 is selected as the master. Images 5 and 6 are not pixel-to-pixel matches with image 4, so the three images will be sent as a group to clerical review and a master within the group will be selected, for example image 4. Images 5-6 will be marked as duplicate or non-duplicate based on the user selections, and confirmed duplicates will be handled the same as described for the auto-handling scenario. In a subsequent deduplication run, confirmed duplicates are not considered, but the two masters from a previously split group (images 1 and 4 in this example) may be presented for clerical review against one another.
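The routing decision in this example can be sketched in a few lines. This is only an illustration, assuming that decoded pixel data is available as bytes and that any image failing the pixel-to-pixel comparison sends the whole group to clerical review. The function name `route_group` and the sample data are hypothetical and not part of STEP.

```python
from typing import Dict


def route_group(master_id: str, pixels: Dict[str, bytes]) -> str:
    """Return 'auto-handle' when every image equals the master pixel-to-pixel,
    otherwise 'clerical review' for the whole group."""
    master_pixels = pixels[master_id]
    others = [image_id for image_id in pixels if image_id != master_id]
    if all(pixels[image_id] == master_pixels for image_id in others):
        return "auto-handle"
    return "clerical review"


# Hypothetical pixel data standing in for decoded image content.
group_a = {"image1": b"\x01\x02", "image2": b"\x01\x02", "image3": b"\x01\x02"}
group_b = {"image4": b"\x05\x06", "image5": b"\x05\x07", "image6": b"\x09\x06"}

print(route_group("image1", group_a))  # auto-handle
print(route_group("image4", group_b))  # clerical review
```

Note that the comparison is always made against the master, which is why identical images that both differ from the master are not auto-handled.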
Preparing for Deduplication
The deduplication process is built on perceptual hashing, which produces a numeric string representing each image, known as the pHash. The pHash values of images are compared to determine their Hamming distance, which is the number of positions in the string at which the numbers differ. A Hamming distance of zero does not necessarily mean that two images are identical, but it does indicate that they are likely quite similar. Before duplicates can be identified, a pHash value must be assigned to the images that will be evaluated. For more information on pHash, search the web.
- For initial setup, manually run this process to assign a pHash to all images in the classification selected in the image deduplication configuration.
- For subsequent deduplication processing, as additional images are added to the classification, or existing images are modified, a pHash value is calculated when the deduplication process is run. However, manually invoking the prepare images option when a large number of images have been added may reduce the overall time required for deduplication.
For more information, refer to the Preparing Images for Deduplication section of the Running the Image Deduplication Process topic.
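As a minimal sketch of the Hamming distance comparison described in this section, assuming each pHash is available as a string of equal length (the exact pHash format used by STEP is not specified here, and `hamming_distance` is a hypothetical helper name):

```python
def hamming_distance(phash_a: str, phash_b: str) -> int:
    """Count the positions at which two equal-length hash strings differ."""
    if len(phash_a) != len(phash_b):
        raise ValueError("pHash values must have the same length")
    return sum(a != b for a, b in zip(phash_a, phash_b))


# Illustrative values only, not real pHash output.
print(hamming_distance("10110100", "10110100"))  # 0
print(hamming_distance("10110100", "10011100"))  # 2
```

A distance of 0 here only means the hashes agree. As noted above, the images themselves may still differ.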
Identifying Duplicates
The premise of the deduplication algorithm is 'when images look the same, they are the same.' This definition allows you to determine a level of variation that is acceptable, while potentially sending variations outside that range to the clerical review workflow.
Only elements that can be visually observed affect the outcome of the algorithm. Non-observable ways to compare images do not affect the outcome of the algorithm, such as STEP metadata on the asset object (description attributes), keywords, EXIF, or other embedded data (like photographer or location). Images that appear identical but use different color models (CMYK and RGB) will likely be sent to clerical review (if enabled).
When setting up an image deduplication configuration, the Hamming Distance is taken into account by both the 'Auto-Handling Threshold' and the 'Clerical Review Threshold' parameters. These parameters work together to determine how duplicates are identified and processed. The possible settings are defined in the Threshold Settings section of the Creating an Image Deduplication Configuration topic. For more information on Hamming Distance, search the web.
For the clerical review process, the user manually selects duplicate images as defined in the Managing Duplicates section of the Using Image Deduplication Clerical Review topic.
For the auto-handling process, duplicates are images that have a pHash and that match the master pixel-to-pixel.
Results
When the image deduplication process completes successfully the following updates are made to a duplicate image:
- The duplicate image displays the ID of the master image in the 'Confirmed Duplicates (ImageDeduplicationConfirmedDuplicates)' metadata attribute. Confirming additional duplicates does not overwrite existing non-duplicate IDs.
- The duplicate image displays 'true' for the metadata attribute 'Deduplication Delete Flag (ImageDeduplicationDeleteFlag)'. This indicates that references have been moved to the master image, and the duplicate is ready to be deleted. As long as this value is 'true,' the image is ignored by the image deduplication functionality, regardless of changes to the image or its metadata.
- Classification links on the duplicate images are moved from the duplicate to the master image.
- Product references on the duplicate images are moved from the duplicate to the master image.
Selecting the Master
The system selects a 'master' image based on the evaluation criteria defined below. The master is the image that is kept and updated with classification and product references from the duplicates. If a single image cannot be determined as the master (because multiple images meet the criteria), one is selected at random from the images that remain after the last criterion is evaluated. For details, refer to the Managing Duplicates section of the Using Image Deduplication Clerical Review topic.
When possible, the auto-handling process selects a single master image based on the following evaluation criteria. When no single image can be selected, the image set is sent to clerical review so the user can manually confirm or override the selected master.
Evaluation Criteria for Auto-Handling Master Selection
The evaluation criteria use the following checkpoints, in the order defined, in an attempt to find the image where the most information is retained.
For reference, 'lossy' = JPEG and 'non-lossy' = TIFF, PNG, EPS (assuming the TIFF images are not stored using JPEG compression).
For example, generally the most information is indicated by the largest image in terms of pixels. But if there is a non-lossy image format that is greater than 80% as large as a lossy image format, the non-lossy is prioritized over an absolute pixel size. If that fails to lead to a unique master image, the color depth is considered, with a preference for keeping the larger depth. Finally, if that fails to lead to a master image, the color space is considered, knowing that RGB is a larger space than CMYK, so the RGB image has priority.
1. Find the subset of assets in the set that have the highest pixel count (height x width).
   - If the subset includes ONLY non-lossy images:
     - If the subset size = 1, keep this asset and do no further evaluation.
     - If the subset size > 1, keep evaluating subsequent criteria (beginning with number 2 below) until a single asset is found, or the evaluation criteria run out.
   - If the subset includes ONLY lossy images, AND one or more non-lossy images exist outside of the subset but within the duplicate set at greater than 80% of the highest pixel count, discard the lossy images as candidates and restart the evaluation from the first bullet under number 1 with the non-lossy images.
   - If the subset includes ONLY lossy images and there are no non-lossy images outside of the subset at greater than 80% of the highest pixel count, keep evaluating criteria (beginning with number 2 below).
   - If the subset includes lossy and non-lossy images, discard the lossy images and restart the evaluation from the first bullet under number 1.
2. From the set of candidate assets remaining after criterion number 1 is evaluated, find the subset of assets with the highest color depth.
   - If the subset size = 1, keep this asset and do no further evaluation.
   - If the subset size > 1, keep evaluating subsequent criteria (beginning with number 3 below) until a single asset is found, or the evaluation criteria run out.
3. Sort the set of assets remaining after criterion number 2 is evaluated by color space, with RGB > CMYK.
   - If the subset size = 1, keep this asset and do no further evaluation.
   - If the subset size > 1, select a random asset from the resulting set as the master. (They are not sent to clerical review.)
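The checkpoints above can be sketched as follows. This is an illustration of the described order of evaluation, not the STEP implementation: the `Asset` class, its field names, and the functions `criterion_1` and `select_master` are hypothetical, and the sketch assumes that the file format alone decides whether an image is lossy (so a TIFF stored with JPEG compression is not detected).

```python
import random
from dataclasses import dataclass
from typing import List, Sequence

LOSSY_FORMATS = {"JPEG"}  # per the text: TIFF, PNG, and EPS count as non-lossy


@dataclass(frozen=True)
class Asset:
    asset_id: str
    width: int
    height: int
    image_format: str  # for example "JPEG", "TIFF", "PNG", "EPS"
    color_depth: int   # bits per pixel
    color_space: str   # "RGB" or "CMYK"

    @property
    def pixel_count(self) -> int:
        return self.width * self.height

    @property
    def is_lossy(self) -> bool:
        return self.image_format.upper() in LOSSY_FORMATS


def criterion_1(assets: Sequence[Asset]) -> List[Asset]:
    """Highest pixel count, with non-lossy images preferred when they exist
    at more than 80% of the highest pixel count."""
    candidates = list(assets)
    while True:
        highest = max(a.pixel_count for a in candidates)
        subset = [a for a in candidates if a.pixel_count == highest]
        lossy = [a for a in subset if a.is_lossy]
        if not lossy:
            return subset  # only non-lossy images at the top size
        if len(lossy) < len(subset):
            # Mixed subset: discard the lossy images and restart.
            candidates = [a for a in candidates if not a.is_lossy]
            continue
        near = [a for a in candidates
                if not a.is_lossy and a.pixel_count > 0.8 * highest]
        if not near:
            return subset  # only lossy images, no close non-lossy alternative
        # Non-lossy images exist above the 80% mark: drop lossy and restart.
        candidates = [a for a in candidates if not a.is_lossy]


def select_master(assets: Sequence[Asset], rng=random) -> Asset:
    candidates = criterion_1(assets)
    if len(candidates) > 1:  # criterion 2: highest color depth
        deepest = max(a.color_depth for a in candidates)
        candidates = [a for a in candidates if a.color_depth == deepest]
    if len(candidates) > 1:  # criterion 3: RGB ranks above CMYK
        rgb = [a for a in candidates if a.color_space.upper() == "RGB"]
        if rgb:
            candidates = rgb
    # Random pick when the criteria run out without a single winner.
    return candidates[0] if len(candidates) == 1 else rng.choice(candidates)


jpeg = Asset("A", 1000, 1000, "JPEG", 24, "RGB")
png = Asset("B", 950, 950, "PNG", 24, "RGB")
print(select_master([jpeg, png]).asset_id)  # B
```

In the usage example, the PNG has 902,500 pixels, which is more than 80% of the JPEG's 1,000,000 pixels, so the non-lossy image is selected even though it is smaller.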
Results
When the image deduplication process completes successfully, the master image is updated as follows:
- The IDs of all duplicates are written in the 'Confirmed Duplicates (ImageDeduplicationConfirmedDuplicates)' metadata attribute.
- The IDs of all non-duplicates manually marked in clerical review are written in the 'Confirmed Non-Duplicates (ImageDeduplicationConfirmedNonDuplicates)' metadata attribute.
- Classification links are moved from the duplicate(s) to the master image.
- Product references are moved from the duplicate(s) to the master image.
Processing Images
Once a master image and the duplicates are identified, and the image deduplication process completes successfully, the system updates the metadata attributes on the images and moves product-to-asset and product-to-classification references from the duplicates to the master. Moving references / links allows the duplicates to be deleted without losing reference / link data.
Important: Metadata attributes on images hold IDs of confirmed duplicates and confirmed non-duplicates. Modifying these attribute values will cause errors with future image deduplication comparisons.
If images being processed by image deduplication are in more than one classification, or if an image is moved while included in an image deduplication workflow task, there can be impacts outside of the selected classification. When deduplication is run, any workflow tasks in which the system-selected master is a child of the classification selected in the image deduplication configuration are removed from the workflow.
Configuration
To ensure the best performance when writing values to the confirmed duplicate metadata attribute, the maximum number of values that will be written is limited to 3,000 by default. When the number of values exceeds the limit, the image is filtered out of future processing. For example, with the default limit, an image that already displays 3,000 confirmed duplicate IDs is no longer evaluated during image deduplication.
Increasing the maximum number of values decreases performance. However, the default can be changed via the sharedconfig.properties file on the STEP application server using the case-sensitive ImageDeduplication.ImageDeduplicationDuplicateAttributesValuesMax property, up to a maximum size of 30,000. When this property is absent from the file, the default is used. Any number entered above 30,000 is ignored and the 3,000 max is used.
For example, you could use the following text to increase the limit to 4,000:
ImageDeduplication.ImageDeduplicationDuplicateAttributesValuesMax = 4000
When an image is filtered out due to the number of values being exceeded, a message is included in the execution report and in the logs with the following text:
The image with ID [Asset ID] has been excluded from the deduplication process as it has exceeded the max number of values set by the ImageDeduplication.ImageDeduplicationDuplicateAttributesValuesMax property for the number of confirmed duplicates. Resolve confirmed duplicate data by removing the IDs of previously handled confirmed duplicates or increase the maximum values allowed for the confirmed duplicates attribute.
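The limit rules in this section can be summarized with a short sketch. It is illustrative only: the function names are hypothetical, the properties are assumed to be already parsed into a dictionary of strings, and the filter assumes (following the example above) that an image is excluded once it holds as many values as the limit. How STEP treats non-numeric or negative values is not covered here.

```python
from typing import Dict, Sequence

PROPERTY_NAME = "ImageDeduplication.ImageDeduplicationDuplicateAttributesValuesMax"
DEFAULT_MAX_VALUES = 3000
HIGHEST_ALLOWED = 30000


def effective_limit(properties: Dict[str, str]) -> int:
    """Resolve the limit: default when absent, default when above 30,000."""
    raw = properties.get(PROPERTY_NAME)
    if raw is None:
        return DEFAULT_MAX_VALUES
    value = int(raw)
    if value > HIGHEST_ALLOWED:
        return DEFAULT_MAX_VALUES  # values above the ceiling are ignored
    return value


def is_filtered_out(confirmed_duplicate_ids: Sequence[str], limit: int) -> bool:
    """Assumption: an image holding `limit` or more IDs is no longer evaluated."""
    return len(confirmed_duplicate_ids) >= limit


print(effective_limit({}))                       # 3000
print(effective_limit({PROPERTY_NAME: "4000"}))  # 4000
print(effective_limit({PROPERTY_NAME: "50000"})) # 3000
```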
Results
When the process completes successfully, the user will notice that the metadata and references have been updated.
Important: This handling may result in loss of data from duplicate asset objects, for example, metadata on the asset, or metadata on references to or from a duplicate asset.
Images identified as duplicates are handled as follows:
- Attribute values on the images are only retained on the master image. This means that if the master image has empty values, they are not updated with data from duplicate images.
- The STEP Name value is retained on the master image.
- If they do not already exist on the master image, classification references / links on the images, and metadata on the references / links, are moved from the duplicate images to the master image. If the reference / link does already exist on the master, the master values and metadata are not modified.
- For any references where the source of the reference (the product) is different between the master and the duplicate(s), the target of the reference (and any metadata on the reference) is moved to be the master.
- For any references where the source of the reference (the product) is the same between the master and a duplicate, but the references are of different types, the target of the reference is not changed. An error is displayed and logged, and processing continues for the set of images.
- For any references where the source of the reference (the product) is the same between the master and a duplicate, and the references are of the same type, the reference to the duplicate is removed since this reference type already exists on the master. This allows the duplicate image to be manually deleted.
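The product reference rules above can be expressed as a small classification sketch. It assumes each reference is modeled as a (product ID, reference type) pair pointing at an image. The function name `plan_reference_handling` and the reference type names in the example are hypothetical.

```python
from typing import List, Set, Tuple

Reference = Tuple[str, str]  # (product ID, reference type)


def plan_reference_handling(
    master_refs: Set[Reference], duplicate_refs: Set[Reference]
) -> Tuple[List[Reference], List[Reference], List[Reference]]:
    """Classify each reference to a duplicate as moved, removed, or in error."""
    master_products = {product for product, _ in master_refs}
    moved, removed, errors = [], [], []
    for ref in sorted(duplicate_refs):
        product, _ = ref
        if ref in master_refs:
            removed.append(ref)   # same product, same type: already on master
        elif product in master_products:
            errors.append(ref)    # same product, different type: conflict
        else:
            moved.append(ref)     # different product: retarget to the master
    return moved, removed, errors


master = {("P1", "Primary Image"), ("P2", "Primary Image")}
duplicate = {("P1", "Primary Image"), ("P2", "Thumbnail"), ("P3", "Primary Image")}
moved, removed, errors = plan_reference_handling(master, duplicate)
print(moved)    # [('P3', 'Primary Image')]
print(removed)  # [('P1', 'Primary Image')]
print(errors)   # [('P2', 'Thumbnail')]
```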
Important: All changes made by the handling process are auto-approved, resulting in partial approval for products and images. Depending on the settings in relevant OIEPs, these partial approvals can generate a large number of events.
All images handled are recorded in the step.0 log, which can be accessed via the STEP System Administration link on the Start Page. The log includes errors due to conflicts that cause the deduplication process to fail and allows a user to identify issues so that a manual resolution can be provided. For more information, refer to the Administration Portal documentation.
Troubleshooting Errors
The first error encountered by the deduplication process causes the processing to stop for the group, while the overall process continues. Within the group that includes an error, all handling is rolled back and the group is sent to clerical review (or remains in clerical review if that is where the error occurred).
Errors are stored in the workflow variable 'ImageDeduplicationErrors' and are reported differently, based on their location:
- During auto-handling, errors are reported in the background process execution report, and the group is sent to the clerical review workflow (even if the 'Clerical Review Threshold' parameter was set to 'No clerical review'). Errors are then displayed on the screen within the clerical review task.
- During clerical review, errors are displayed on the screen and must be addressed before the image deduplication process can be completed. For example, an error is displayed when a product has a reference to a master image and a duplicate image, but they are of different reference types. In this case, manual action is required to remove one of the references, or remove the image as a duplicate since the existing references would cause a conflict and cannot both exist at the same time.
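The per-group error behavior described in this section (stop at the first error, roll back that group, continue with the others) can be sketched as below. This is a simplified model, not STEP code: `HandlingError`, `run_groups`, and the step functions are hypothetical, and rollback is modeled by applying changes to a working copy that is only committed when the whole group succeeds.

```python
from typing import Callable, Dict, List

State = Dict[str, str]


class HandlingError(Exception):
    """Hypothetical error raised when a handling step hits a conflict."""


def run_groups(groups: Dict[str, List[Callable[[State], None]]],
               state: State) -> Dict[str, str]:
    outcome = {}
    for group_id, steps in groups.items():
        working = dict(state)  # provisional changes for this group only
        try:
            for step in steps:
                step(working)
        except HandlingError as error:
            # Discarding the working copy rolls back the whole group.
            outcome[group_id] = f"clerical review: {error}"
            continue
        state.clear()
        state.update(working)  # commit only when every step succeeded
        outcome[group_id] = "handled"
    return outcome


def flag(image_id: str) -> Callable[[State], None]:
    def step(state: State) -> None:
        state[image_id] = "marked for deletion"
    return step


def conflict(state: State) -> None:
    raise HandlingError("conflicting reference types")


state: State = {}
result = run_groups(
    {"group1": [flag("image2"), flag("image3")],
     "group2": [flag("image5"), conflict, flag("image6")]},
    state,
)
print(result)
print(sorted(state))  # ['image2', 'image3']
```

In the usage example, the change to image5 is discarded because a later step in the same group fails, while group1 is unaffected.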