eDiscovery Daily Blog

eDiscovery Best Practices: Cluster Documents for More Effective Review


With document review estimated to account for up to 80% of the total cost of the eDiscovery process, and the amount of data in the world growing at an exponential rate, it’s no wonder that many firms are turning to technology to make the review process more efficient.  Whether firms use the sophisticated searching capabilities of early case assessment (ECA) tools such as FirstPass®, powered by Venio FPR™, to filter collections more effectively, or predictive coding techniques (as discussed in these two recent blog posts) to make the coding process more efficient, technology is playing an important role in reducing review costs.  And, of course, review tools that manage the review process (like OnDemand®) make review more efficient simply by delivering documents efficiently and tracking review progress.

How the documents are organized for review can also make a big difference in the efficiency of review, not only saving costs, but also improving accuracy by assigning similar documents to the same reviewer.  This process of organizing documents with similar content into “clusters” (also known as “concepts”) helps each reviewer make quicker review decisions.  If a reviewer looks at one document to determine responsiveness and the next few documents are duplicates or mere variations of that first document, he or she can quickly “tag” most of those variations in the same manner or identify the duplicates.  It also promotes consistency by enabling the same reviewer to review all similar documents in a cluster.  For example, you don’t get one reviewer marking a document as privileged while another reviewer fails to mark a copy of that same document as such, leading to inconsistencies and potential inadvertent disclosures.  Reviewers are human and do make mistakes.

Clustering software such as Hot Neuron’s Clustify™ examines the text in your documents, determines which documents are related to each other, and groups them into clusters.  Clustering organizes the documents according to the structure that arises naturally, without preconceptions or query terms.  It labels each cluster with a set of keywords, providing a quick overview of the cluster.  It also identifies a “representative document” that can be used as a proxy for the cluster.
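
To make the idea concrete, here is a minimal sketch of text-based clustering in Python.  It is not how Clustify™ or any commercial product works internally; it is a simple greedy approach that assumes word-overlap (Jaccard) similarity is a reasonable stand-in for “related” documents.  The function names, the stopword list, the 0.5 threshold and the sample documents are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list; real tools use far richer language handling.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "for", "is", "in", "on"}

def tokens(text):
    """Lowercase word tokens with common stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]

def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.5):
    """Greedy single-pass clustering.

    Each document joins the first cluster whose representative (the cluster's
    first member) is at least `threshold` similar; otherwise it starts a new
    cluster and becomes that cluster's representative.
    """
    token_sets = {doc_id: set(tokens(text)) for doc_id, text in docs.items()}
    clusters = []
    for doc_id in docs:
        for c in clusters:
            if jaccard(token_sets[doc_id], token_sets[c["rep"]]) >= threshold:
                c["members"].append(doc_id)
                break
        else:
            clusters.append({"rep": doc_id, "members": [doc_id]})
    return clusters

def label(c, docs, n=3):
    """Label a cluster with the n words that appear in the most member documents."""
    counts = Counter()
    for doc_id in c["members"]:
        counts.update(set(tokens(docs[doc_id])))
    # Sort by document count (descending), then alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

# Made-up sample collection: three drafts of one report and an unrelated email.
docs = {
    "draft_v1": "Quarterly sales report for the northeast region draft",
    "draft_v2": "Quarterly sales report for the northeast region final",
    "lunch": "Team lunch on Friday at noon",
    "draft_v3": "Quarterly sales report northeast region final revised",
}

clusters = cluster(docs)
for c in clusters:
    print(c["rep"], c["members"], label(c, docs))
```

With this sample, the three report drafts land in one cluster with “draft_v1” as the representative document, and the lunch email sits in a cluster of its own.  A production tool would choose the representative and the keywords far more carefully, but the grouping, labeling and proxy-document steps follow the same outline.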

Examples of types of documents that can be organized into clusters:

  • Email Message Threads: Each message in the thread contains the conversation up to that point, so the ability to group those messages into a cluster enables the reviewer to quickly identify the email(s) containing the entire conversation, categorize those and possibly dismiss the rest as duplicative (if so instructed).
  • Document Versions: As “drafts” of documents are prepared, the content of each draft is similar to the previous version, so a review decision made on one version could be quickly applied to the rest of the versions.
  • Routine Reports: Sometimes, periodic reports are generated that may or may not be responsive – grouping those reports together in a cluster can enable a single reviewer to make that determination and quickly apply it to all documents in the cluster.
  • Published Documents: Have you ever published a file to Adobe PDF format?  Many of you have.  What you end up with is a copy of the original file (from Word, Excel or another application) that is identical in content but different in format.  Because the files themselves differ, these documents won’t be identified as “dupes” based on a hash value.  Clustering puts those documents together in a group so that the dupes can still be quickly identified and addressed.
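
The published-documents case can be illustrated with a short Python sketch.  It assumes the text has already been extracted from each file (a real workflow needs a separate extraction step for Word and PDF files), and the byte strings below are made-up stand-ins for the two file formats, not real file contents.  The helper names are hypothetical.

```python
import hashlib
import re

def file_hash(data: bytes) -> str:
    """Hash of the raw file bytes, as used for exact-duplicate detection."""
    return hashlib.sha256(data).hexdigest()

def normalized_text_hash(text: str) -> str:
    """Hash of the extracted text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Made-up stand-ins: the same content wrapped in two different "formats".
word_bytes = b"WORD-CONTAINER:" + b"Quarterly Sales Report\nRevenue rose 5%."
pdf_bytes = b"%PDF-CONTAINER:" + b"Quarterly Sales Report  Revenue rose 5%."

# Assumed output of a text-extraction step for each file.
word_text = "Quarterly Sales Report\nRevenue rose 5%."
pdf_text = "Quarterly Sales Report  Revenue rose 5%."

same_file = file_hash(word_bytes) == file_hash(pdf_bytes)
same_content = normalized_text_hash(word_text) == normalized_text_hash(pdf_text)

print("Identical file hash:", same_file)        # False: the bytes differ
print("Identical text content:", same_content)  # True: the words are the same
```

Comparing normalized text only catches documents whose wording matches exactly.  Clustering goes further by scoring how similar the text is, so drafts and other near-duplicates are grouped as well.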

Within the parameters of a review tool which manages the review process and delivers documents quickly and effectively for review, organizing documents into clusters can speed decision making during review, saving considerable time and review costs.

So, what do you think?  Have you used software to organize documents into clusters or concepts for more effective review?  Please share any comments you might have, or let us know if you’d like to know more about a particular topic.

Full disclosure: I work for CloudNine Discovery, which provides SaaS-based eDiscovery review applications FirstPass® (for early case assessment) and OnDemand® (for linear review and production).  CloudNine Discovery has an alliance with Hot Neuron and uses Clustify™ software to provide conceptual clustering and near-duplicate identification services for its clients.