Uses for near-duplicate identification (aka. Near-Deduplication)

by Conor Looney | CEO | Aug 21, 2018 | Analytics, eDisclosure, eDiscovery

Analytics in eDiscovery can help you gain efficiency in your review and locate key documents more quickly so you are able to make informed case strategy decisions earlier in the case. 

Today’s section focuses on near-duplicate identification:

If you’ve ever reviewed financial statements or contracts at a financial services company, you know that some of these documents can be extremely similar. Sometimes the only difference is the date, account number or company name, for example.

Near-duplicate identification is an analytical tool that uses extracted or Optical Character Recognition (OCR) generated text to calculate a similarity percentage between documents. You set a custom similarity threshold that tells the tool how similar documents need to be to be considered near duplicates. The output can be used to make document review more efficient, which often reduces time and cost.
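To make the idea of a similarity percentage and a custom threshold concrete, here is a minimal Python sketch. It is an illustration only: review platforms use their own comparison algorithms, and the word-level matching from Python's standard difflib module, the function names and the sample statements are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity_percentage(text_a, text_b):
    """Return a 0-100 score based on word-level matching (illustrative only)."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return 100.0 * SequenceMatcher(None, words_a, words_b).ratio()


def is_near_duplicate(text_a, text_b, threshold=90.0):
    """Apply the custom similarity threshold chosen for the matter."""
    return similarity_percentage(text_a, text_b) >= threshold


# Hypothetical statements that differ only in the date.
statement_jan = ("Account 1001 statement for Acme Ltd dated 31 January 2018 "
                 "shows a closing balance of 500 pounds")
statement_feb = ("Account 1001 statement for Acme Ltd dated 28 February 2018 "
                 "shows a closing balance of 500 pounds")

score = similarity_percentage(statement_jan, statement_feb)
print(f"Similarity: {score:.1f}%")
print("Near duplicate at 80%:", is_near_duplicate(statement_jan, statement_feb, 80.0))
print("Near duplicate at 90%:", is_near_duplicate(statement_jan, statement_feb, 90.0))
```

The same pair of documents is a near duplicate at one threshold and not at another, which is why the threshold should be chosen with the document types in your matter in mind.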

The system generates two outputs: a group ID and a similarity percentage. Within each group, a single document is chosen as the base document, and every other document identified as textually similar is assigned a percentage showing how similar it is to the base document. All documents that have been assigned a group ID are pulled into a viewing panel, where you can view the base document and a similar document side by side. Differences between the two documents are highlighted, allowing you to make a judgment quickly.
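The two outputs described above can be sketched roughly as follows. This is a simplified illustration rather than how any particular platform works: choosing the first ungrouped document as the base, the "ND-0001" ID format and the sample documents are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(text_a, text_b):
    """Word-level similarity as a percentage (illustrative only)."""
    matcher = SequenceMatcher(None, text_a.lower().split(), text_b.lower().split())
    return 100.0 * matcher.ratio()


def group_near_duplicates(docs, threshold=80.0):
    """Return {doc_id: (group_id, similarity_to_base)} for grouped documents.

    Assumption: the first ungrouped document becomes the base of a new group.
    Documents with no near duplicates are left without a group ID.
    """
    assignments = {}
    next_group = 1
    for base_id, base_text in docs.items():
        if base_id in assignments:
            continue
        members = {}
        for other_id, other_text in docs.items():
            if other_id == base_id or other_id in assignments:
                continue
            score = similarity(base_text, other_text)
            if score >= threshold:
                members[other_id] = round(score, 1)
        if not members:
            continue
        group_id = f"ND-{next_group:04d}"
        next_group += 1
        assignments[base_id] = (group_id, 100.0)  # the base document itself
        for other_id, score in members.items():
            assignments[other_id] = (group_id, score)
    return assignments


# Hypothetical document set.
docs = {
    "DOC-1": "Account 1001 statement for Acme Ltd dated 31 January 2018 "
             "shows a closing balance of 500 pounds",
    "DOC-2": "Account 1001 statement for Acme Ltd dated 28 February 2018 "
             "shows a closing balance of 500 pounds",
    "DOC-3": "Lunch menu for the office party on Friday",
}
groups = group_near_duplicates(docs)
for doc_id, (group_id, pct) in groups.items():
    print(doc_id, group_id, f"{pct}%")
```

In this sketch DOC-1 becomes the base document, DOC-2 joins its group with a percentage relative to DOC-1, and DOC-3 receives no group ID because nothing resembles it.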

This feature can be a valuable tool to identify documents similar to a document that has been determined to be a key or hot document to your matter.

There are also occasions where documents are essentially duplicates but weren’t de-duped during preprocessing because they don’t have the same hash values. Maybe the documents were saved on different dates, for example. This tool will catch those. (A hash value is a numeric code calculated from a file's contents that identifies the data; think of it as a digital fingerprint.) Even the same document saved as a Microsoft Word file and then converted to a PDF would not have the same hash value as the original Microsoft Word file, so it would not be deduplicated during preprocessing. The near-duplicate tool would find this document because the textual content is identical.
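A short Python sketch shows why hash-based deduplication misses these documents. The byte strings below are simplified stand-ins for real Word and PDF files, and extract_text is a hypothetical placeholder for a real text-extraction step; only the hashlib call is a real standard-library function.

```python
import hashlib


def file_hash(raw_bytes):
    """Hash the raw file bytes, as exact deduplication would."""
    return hashlib.sha256(raw_bytes).hexdigest()


def extract_text(raw_bytes):
    """Hypothetical stand-in for text extraction from a real file format."""
    return raw_bytes.split(b"|", 1)[1].decode("utf-8")


text = "This agreement is made between Acme Ltd and Example plc."

# Simplified stand-ins: same text, different container bytes.
word_file = b"WORD-CONTAINER|" + text.encode("utf-8")
pdf_file = b"PDF-CONTAINER|" + text.encode("utf-8")

print("Hashes match:", file_hash(word_file) == file_hash(pdf_file))
print("Extracted text matches:", extract_text(word_file) == extract_text(pdf_file))
```

The file hashes differ because the bytes on disk differ, while the extracted text is the same, and the extracted text is what near-duplicate identification compares.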

Near-duplicate analysis is especially useful for quality control when used to find tagging conflicts within near-duplicate groups. If you notice that some extremely similar documents are tagged as responsive and others are tagged as non-responsive, you can take a closer look to determine why. One explanation could be that one of the reviewers needs more training.
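As a rough illustration of this quality-control check, the Python sketch below flags any group whose documents carry more than one tag. The record layout (doc_id, group_id, tag) and the sample values are assumptions for the example, not the export format of any specific platform.

```python
from collections import defaultdict


def find_tagging_conflicts(records):
    """Return the group IDs whose documents carry more than one distinct tag."""
    tags_by_group = defaultdict(set)
    for record in records:
        tags_by_group[record["group_id"]].add(record["tag"])
    return sorted(group for group, tags in tags_by_group.items() if len(tags) > 1)


# Hypothetical coding export.
records = [
    {"doc_id": "DOC-1", "group_id": "ND-0001", "tag": "responsive"},
    {"doc_id": "DOC-2", "group_id": "ND-0001", "tag": "non-responsive"},
    {"doc_id": "DOC-3", "group_id": "ND-0002", "tag": "responsive"},
    {"doc_id": "DOC-4", "group_id": "ND-0002", "tag": "responsive"},
]

conflicts = find_tagging_conflicts(records)
print("Groups to re-check:", conflicts)
```

A flagged group is a prompt for a second look, not proof of an error: a small textual difference can legitimately change whether a document is responsive.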

Another use case for the near-duplicate tool is running it across productions you have received. By running the tool across the incoming documents and your own data set, you can put the documents into three buckets: those unique to the opposing production, those unique to your document set, and those included in both sets.
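The three buckets can be sketched in Python along these lines. The pairwise comparison, the 90% threshold, the function names and the sample documents are assumptions for illustration; comparing every pair is only practical for small sets, and production tools use far more scalable methods.

```python
from difflib import SequenceMatcher


def similarity(text_a, text_b):
    """Word-level similarity as a percentage (illustrative only)."""
    matcher = SequenceMatcher(None, text_a.lower().split(), text_b.lower().split())
    return 100.0 * matcher.ratio()


def bucket_productions(ours, theirs, threshold=90.0):
    """Sort document IDs into the three buckets described above."""
    matched_ours, matched_theirs = set(), set()
    for their_id, their_text in theirs.items():
        for our_id, our_text in ours.items():
            if similarity(our_text, their_text) >= threshold:
                matched_ours.add(our_id)
                matched_theirs.add(their_id)
    return {
        "unique_to_theirs": sorted(set(theirs) - matched_theirs),
        "unique_to_ours": sorted(set(ours) - matched_ours),
        "in_both": sorted(matched_ours | matched_theirs),
    }


# Hypothetical document sets.
ours = {
    "OUR-1": "Master services agreement between Acme Ltd and Example plc signed in London",
    "OUR-2": "Internal memo about the quarterly budget review",
}
theirs = {
    "OPP-1": "MASTER SERVICES AGREEMENT between Acme Ltd and Example plc signed in London",
    "OPP-2": "Email regarding delivery schedule for warehouse stock",
}

buckets = bucket_productions(ours, theirs)
for name, doc_ids in buckets.items():
    print(name, doc_ids)
```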

A further use case for the near-duplicate tool is running it across the data set that needs to be reviewed, allowing reviewers to make bulk coding decisions. For example, if there are 20 documents with the same group ID and they show 98% – 99% similarity, you may be able to safely assume that all of them are either responsive or non-responsive. You can then tag all of the documents at once rather than reviewing and tagging them individually. This can have a major impact on the pace and consistency of the review.
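A bulk coding rule of this kind might look like the following Python sketch. The 98% cut-off mirrors the example above, but the function name, the record layout and the decision to hold back less similar documents for individual review are assumptions for illustration.

```python
def bulk_tag(group_members, decision, min_similarity=98.0):
    """Apply one coding decision to every sufficiently similar group member.

    Returns the tags applied and the IDs held back for individual review.
    """
    tags = {}
    needs_review = []
    for member in group_members:
        if member["similarity"] >= min_similarity:
            tags[member["doc_id"]] = decision
        else:
            needs_review.append(member["doc_id"])
    return tags, needs_review


# Hypothetical members of one near-duplicate group.
group = [
    {"doc_id": "DOC-1", "similarity": 100.0},  # base document
    {"doc_id": "DOC-2", "similarity": 99.0},
    {"doc_id": "DOC-3", "similarity": 98.4},
    {"doc_id": "DOC-4", "similarity": 91.0},
]

tags, needs_review = bulk_tag(group, "responsive")
print("Bulk tagged:", tags)
print("Review individually:", needs_review)
```

Holding back the lower-similarity documents keeps the speed benefit of bulk coding while leaving the riskier judgments to a reviewer.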

Keep in mind that near-duplicate identification is based on extracted text, so documents without a text layer, such as TIFFs and scanned PDFs, need to be converted to text first using OCR. The tool also does not remove any of the near-duplicate documents; it simply organizes them for comparison to help you achieve your objectives efficiently.

As with any analytics tool, the functionality is impressive, but near-duplicate analysis is only valuable when it is applied in line with your case objectives and strategy. LDM Global’s consultants work with these tools every day and can advise you on what may work best in your matter and what has worked successfully on similar cases in the past.

If you’re interested in learning more about near-duplicate identification, other analytics tools or eDiscovery strategies, please contact us at info