3 De-duplication Options: How Do You Choose?

Written by Thought Leadership Team | March 19, 2012

When collecting raw ESI from multiple individuals, there are bound to be tremendous amounts of duplicative documents. In company-wide e-mail chains, for example, a message is sent to multiple recipients and stored within each individual's mailbox. Depending on your organization's data retention policies, copies of the same file might also be found on the employee's hard drive, file server, or company backup tape.

For the attorney tasked with identifying, collecting and reviewing ESI, an exhaustive review of a document set rife with duplicates threatens the timeliness, cost effectiveness and efficiency of a project. The risks intensify during review, where duplicate documents increase the potential for inconsistent privilege and responsiveness decisions on identical documents.

To mitigate these concerns, many practitioners turn to de-duplication technologies, which identify and manage duplicate documents during e-discovery processing to minimize redundant review. De-duplication can reduce the number of documents to be reviewed by as much as 90 percent and, on average, by 30 or 40 percent.

With de-duplication, a hashing algorithm creates an electronic "fingerprint" for each document at the bit level. The resulting fingerprints are compared against one another to determine which documents are exact duplicates. A fingerprint changes with nearly any modification to the file, such as an extra space or a formatting change, so a modified file stands out when compared against the existing document universe.
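As a rough illustration of the fingerprinting idea, the sketch below hashes the raw bytes of three small documents and compares the results. It assumes SHA-256 from Python's standard hashlib module; the article does not say which hashing algorithm a given processing engine uses, and the fingerprint function and sample texts are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a hex digest of the raw bytes to serve as a fingerprint.

    SHA-256 is an assumption made for this sketch, not a statement
    about any particular e-discovery product.
    """
    return hashlib.sha256(data).hexdigest()


# Hypothetical sample documents.
original = b"Quarterly results attached."
exact_copy = b"Quarterly results attached."
extra_space = b"Quarterly results  attached."  # one additional space

# Identical bytes produce identical fingerprints.
print(fingerprint(original) == fingerprint(exact_copy))   # True

# A single extra space produces a different fingerprint.
print(fingerprint(original) == fingerprint(extra_space))  # False
```

Because the comparison is made on bytes, two files that look the same on screen but are stored differently would not match under this approach.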

However, identifying duplicates is only the first step. Simply removing all duplicate documents robs the reviewing attorney of potentially important contextual information—such as who maintained or had access to an important e-mail or document. Sophisticated e-discovery technologies have evolved to allow several options for discovery teams to examine these associated details.

With the KLDiscovery e-discovery processing engine, case teams have several de-duplication options. When choosing a de-duplication method, weigh the needs of the case carefully against the following options:

No de-duplication: All duplicate documents are provided for review and categorization, producing the largest number of documents for review. This method is strongly discouraged for cases involving large volumes of data from backup tapes or data collected on multiple occasions.
Global or horizontal de-duplication: As each file is uploaded, it is compared to the entire data set for the e-discovery project. Only the first instance of each unique document is provided for review and categorization, resulting in the smallest number of documents for review. However, care should be taken when employing this method, as only one copy of each document remains, and it is chosen without any consideration of its relevance to the case compared with the other duplicates.
Per custodian or vertical de-duplication: Each file is uploaded and compared to a limited set of documents from the same document custodian, time period, or other data slice. Only the first instance of each unique document per custodian or data slice is provided for review. However, the same document may exist for other custodians or in other data slices and may then be provided for independent review. This type of de-duplication is particularly useful when processing multiple tapes for the same custodians over time or when discerning the context of a specific document in relation to its custodian.

The de-duplication options above are applied to documents as they are processed. Additionally, as documents are reviewed, they can be assessed for relative similarity. This process, called near-duplicate identification, finds similar documents that differ in simple formatting, document type or other semantic details. These documents are often identified and grouped around one document, the "core" of the group. All related near-duplicate documents are compared to this core document. Near-duplicate identification can help the reviewer better understand the relationship between the documents, allowing for mass actions on groups with similarities.

Regardless of the method chosen, de-duplication can result in tremendous savings when properly leveraged to meet the needs of a project. However, it can also be fraught with complexity and pitfalls if improperly utilized. To avoid these risks and increase your efficiencies, contact your KLDiscovery Case Manager.
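To make the difference between the global and per-custodian options concrete, here is a minimal sketch that applies both scopes to the same small collection. It is an illustration under stated assumptions, not a description of how any particular processing engine works: the deduplicate function, the sample custodians and documents, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
from collections import defaultdict

# Hypothetical sample data: (custodian, document bytes) in processing order.
documents = [
    ("alice", b"Budget memo v1"),
    ("bob", b"Budget memo v1"),
    ("alice", b"Budget memo v1"),
    ("bob", b"Lunch on Friday?"),
]


def deduplicate(docs, per_custodian=False):
    """Keep the first instance of each fingerprint within the chosen scope.

    Returns the kept (custodian, data) pairs and a map from fingerprint
    to every custodian that held a copy, so that context is preserved
    even when the duplicate itself is suppressed.
    """
    seen = set()
    kept = []
    holders = defaultdict(list)
    for custodian, data in docs:
        digest = hashlib.sha256(data).hexdigest()
        if custodian not in holders[digest]:
            holders[digest].append(custodian)
        # Global scope keys on the hash alone; per-custodian scope keys
        # on the (custodian, hash) pair.
        key = (custodian, digest) if per_custodian else digest
        if key not in seen:
            seen.add(key)
            kept.append((custodian, data))
    return kept, dict(holders)


global_kept, holders = deduplicate(documents)
custodian_kept, _ = deduplicate(documents, per_custodian=True)

# Document counts: no de-duplication, per custodian, global.
print(len(documents), len(custodian_kept), len(global_kept))  # 4 3 2
```

In this sample, global de-duplication leaves two documents for review and per-custodian de-duplication leaves three, while the holders map still records that both custodians held the budget memo.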
