Deduplicating Images

Important: Image Analytics Package / Image Deduplication: This functionality has been deprecated and is no longer supported and/or available for new installations. This documentation is retained as a reference only for customers already using the functionality and for whom it remains available in the current state. The functionality will be removed in the future so customers using this should make plans to transition away from their implementation of it.

The Image Deduplication functionality identifies and manages duplicate images to ensure that only one version of a particular image is maintained in the system. This provides a single source of truth which ensures consistent and accurate image data, regardless of the number of objects using the image.

Image Deduplication compares and evaluates all images within a classification, regardless of encoding differences (such as file type or color model) and regardless of whether they are referenced by products. Essentially, if images look the same, they are considered duplicates. The process considers images that use the CMYK or RGB color model and have one of the extensions listed below.

Image Deduplication File Types

  • .BMP
  • .GIF
  • .JPEG
  • .JPG
  • .MSL
  • .MVG
  • .P7
  • .PBM
  • .PNG
  • .PNM
  • .PPM
  • .PSD
  • .TIF
  • .TIFF
  • .XWD

The comparison is scoped to the selected classification. For example, selecting a single parent classification node recursively compares all images of the identified types within that node to determine potential duplicates, but does not consider images in other nodes.
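The scoping rule can be illustrated with a short Python sketch. The `Node` class and the sample file names are hypothetical stand-ins for a classification hierarchy and are not part of the STEP API; only the extension list is taken from this topic.

```python
from dataclasses import dataclass, field
from pathlib import PurePath

# Extensions from the Image Deduplication File Types list in this topic.
SUPPORTED_EXTENSIONS = {
    ".bmp", ".gif", ".jpeg", ".jpg", ".msl", ".mvg", ".p7", ".pbm",
    ".png", ".pnm", ".ppm", ".psd", ".tif", ".tiff", ".xwd",
}


@dataclass
class Node:
    """Hypothetical stand-in for a classification node (not a STEP class)."""
    name: str
    assets: list = field(default_factory=list)    # asset file names
    children: list = field(default_factory=list)  # child Node objects


def is_candidate(filename: str) -> bool:
    """Return True if the extension is in the supported list (case-insensitive)."""
    return PurePath(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def collect_candidates(node: Node) -> list:
    """Recursively gather candidate images under the selected node only."""
    found = [name for name in node.assets if is_candidate(name)]
    for child in node.children:
        found.extend(collect_candidates(child))
    return found


# Illustrative hierarchy: 'other' is a separate node and is never visited.
packshots = Node("Packshots", assets=["shoe.PNG", "shoe.tif", "manual.pdf"])
root = Node("Product Images", assets=["logo.jpg", "intro.mp4"], children=[packshots])
other = Node("Other", assets=["shoe_copy.png"])

candidates = collect_candidates(root)
print(candidates)
```

In this sketch, non-image assets such as `manual.pdf` and `intro.mp4` are skipped, and `shoe_copy.png` is never compared because it sits in a node outside the selected one.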

Running the image deduplication process includes:

  • Generating a pHash (perceptual hash) for each image in the classification. Similar images have a similar pHash, which provides a way to identify potential duplicate images.
  • Identifying and grouping duplicates based on pHash comparison and pixel-to-pixel comparison if auto-handling is enabled.
  • Handling duplicates by marking them for deletion and transferring their references to a master image that is retained in the system.
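The three steps above can be sketched in Python under several loud assumptions. This topic does not specify the hashing algorithm, the similarity threshold, or how the master image is chosen, so the sketch uses an average hash (a simpler relative of pHash) on tiny grayscale grids, a hypothetical `max_distance` threshold, and treats the first image in each group as the master. The pixel-to-pixel comparison used by auto-handling is omitted. None of the names below come from STEP.

```python
def average_hash(pixels):
    """Hash a small grayscale grid: one bit per pixel, set if pixel >= mean.

    Stand-in for a perceptual hash; a real pHash typically uses a DCT.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def group_duplicates(hashes, max_distance):
    """Greedy grouping: an image joins the first group whose first member
    (assumed to be the master) is within max_distance bits."""
    groups = []
    for asset_id, h in hashes.items():
        for group in groups:
            if hamming(hashes[group[0]], h) <= max_distance:
                group.append(asset_id)
                break
        else:
            groups.append([asset_id])
    return groups


def merge_group(group, references):
    """Move references from duplicates to the master; return IDs to delete."""
    master, *duplicates = group
    references.setdefault(master, set())
    for dup in duplicates:
        references[master].update(references.pop(dup, set()))
    return duplicates


# Illustrative 4x4 grayscale images (hypothetical data).
img_a = [[10, 10, 200, 200]] * 4   # original
img_b = [[12, 12, 198, 198]] * 4   # re-encoded copy, slightly different values
img_c = [[200, 200, 10, 10]] * 4   # visually different image

hashes = {"A": average_hash(img_a), "B": average_hash(img_b), "C": average_hash(img_c)}
groups = group_duplicates(hashes, max_distance=2)

references = {"A": {"prod-1"}, "B": {"prod-2", "prod-3"}, "C": {"prod-4"}}
marked_for_deletion = []
for group in groups:
    marked_for_deletion.extend(merge_group(group, references))
```

Here images A and B produce the same hash despite slightly different pixel values, so B is marked for deletion and its product references move to A, while C stays in its own group.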

To access the Image Deduplication functionality, the 'asset-deduplication' component must be activated on your system. Contact Stibo Systems for details.

Limitations

The following limitations should be considered when evaluating the Image Deduplication functionality:

  • Image deduplication runs on the current context. In auto-handling, a master with content in multiple contexts can be selected since no data is deleted from the master. However, potential duplicate images with content in multiple contexts are ignored by auto-handling and are not presented to the user in the clerical review workflow. Attempting to deduplicate images with multi-context content can cause unexpected results.
  • Undo is not possible once a file has been processed by image deduplication and action has been taken to move references / links on duplicates.
  • Non-image assets, such as videos, PDFs, and documents, are excluded by image deduplication.
  • Multi-sequence images (images that contain a sequence within a single image, common with TIF images) are excluded by image deduplication.
  • Images stored outside of STEP cannot be processed by image deduplication.
  • Only the file types in the Image Deduplication File Types list above are considered by image deduplication.
  • If the STEP application server is stopped while image deduplication is running, image deduplication must be run again manually once the server is started.

Additional Information

Image Deduplication can be configured and run as defined in the following topics:

  • Initial Setup for Image Deduplication
  • Creating an Image Deduplication Configuration
  • Configuring Web UI for the Image Deduplication Clerical Review Workflow
  • Image Deduplication Clerical Review Screen
  • Running the Image Deduplication Process
  • Using Image Deduplication Clerical Review

The following topics provide an explanation of how image deduplication works and an example of image deduplication: