eDiscovery Daily Blog

Craig Ball Explains HASH Deduplication As Only He Can: eDiscovery Best Practices

Ever wonder why some documents are identified as duplicates and others are not, even though they appear to be identical?  Leave it to Craig Ball to explain it in plain terms.

In the latest post (Deduplication: Why Computers See Differences in Files that Look Alike) in his excellent Ball in your Court blog, Craig states that “Most people regard a Word document file, a PDF or TIFF image made from the document file, a printout of the file and a scan of the printout as being essentially “the same thing.”  Understandably, they focus on content and pay little heed to form.  But when it comes to electronically stored information, the form of the data—the structure, encoding and medium employed to store and deliver content–matters a great deal.”  The end result is that two documents may look the same, but may not be considered duplicates because of their format.

Craig also references a post from “exactly” three years ago (it’s four days off, Craig, just sayin’) that provides a “quick primer on deduplication” showing the three approaches where deduplication can occur, including the most common approach of using HASH values (MD5 or SHA-1).
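For readers who like to see the idea in code, here is a minimal sketch of HASH-based deduplication in Python. It is an illustration only, not how any particular eDiscovery platform does it: the function names (`group_by_hash`, `exact_duplicates`) and the sample file names and contents are hypothetical, and real tools typically hash files read from disk and may combine metadata fields for email.

```python
import hashlib
from collections import defaultdict


def group_by_hash(files: dict[str, bytes], algorithm: str = "md5") -> dict[str, list[str]]:
    """Group file names by the hex digest of their raw bytes."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, data in files.items():
        # hashlib.new accepts algorithm names such as "md5" or "sha1"
        digest = hashlib.new(algorithm, data).hexdigest()
        groups[digest].append(name)
    return dict(groups)


def exact_duplicates(files: dict[str, bytes], algorithm: str = "md5") -> list[list[str]]:
    """Return only the groups that contain more than one file."""
    groups = group_by_hash(files, algorithm)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical collection: two byte-identical files and one that differs
# by a single trailing space.
sample = {
    "memo.docx": b"quarterly memo",
    "memo_copy.docx": b"quarterly memo",
    "memo_v2.docx": b"quarterly memo ",
}

print(exact_duplicates(sample))
```

Note that a single extra space is enough to keep `memo_v2.docx` out of the duplicate group, which is exactly why files that look alike to a reviewer can fail to deduplicate.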

My favorite example of how two seemingly duplicate documents can be different is the publication of documents to Adobe Portable Document Format (PDF).  As I noted in our post from (nowhere near exactly) three years ago, I “publish” marketing slicks created in Microsoft® Publisher, “publish” finalized client proposals created in Microsoft Word and “publish” presentations created in Microsoft PowerPoint to PDF format regularly (still do).  With a free PDF print driver, you can conceivably create a PDF file for just about anything that you can print.  Of course, scans of printed documents that were originally electronic are another way in which two seemingly duplicate documents can be different.

The best part of Craig’s post is the exercise that he describes at the end of it – creating a Word document of the text of the Gettysburg Address (saved as both .DOC and .DOCX), generating PDF files using both the Save As and Print to PDF methods, and scanning the printed document to both TIFF and PDF at different resolutions.  He shows the MD5 HASH value and the file size of each file.  Because the format of the file is different each time, the MD5 HASH value is different each time.  When that happens for the same content, you have what some of us call “near dupes”, which have to be analyzed based on the text content of the file.

The file size is different in almost every case too.  We performed a similar test (still not exactly) three years ago (but much closer).  In our test, we took one of our one-page blog posts about the memorable Apple v. Samsung litigation and saved it to several different formats, including TXT, HTML, XLSX, DOCX, PDF and MSG – the sizes ranged from 10 KB all the way up to 221 KB.  So, as you can see, the same content can vary widely in both HASH value and file size, depending on the file format and how it was created.
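To make the point concrete, here is a small Python sketch that stores the same sentence in three different forms and compares the results. It is a simplified stand-in, not a recreation of Craig’s exercise or of our test: the rendition names, the `extract_text` helper and the whitespace-and-case normalization are all assumptions made for illustration, and real near-dupe analysis typically uses more sophisticated text similarity than a single HASH of normalized text.

```python
import hashlib
import re

# Opening words of the Gettysburg Address, used as the shared content.
TEXT = "Four score and seven years ago our fathers brought forth on this continent a new nation."

# Three hypothetical renditions of the same content.
renditions = {
    "address.txt (UTF-8)": TEXT.encode("utf-8"),
    "address.txt (UTF-16)": TEXT.encode("utf-16"),
    "address.html": f"<html><body><p>{TEXT}</p></body></html>".encode("utf-8"),
}


def md5_hex(data: bytes) -> str:
    """MD5 of the raw bytes, as a hex string."""
    return hashlib.md5(data).hexdigest()


def extract_text(name: str, data: bytes) -> str:
    """Recover the text from a rendition (illustrative only)."""
    if name.endswith("(UTF-16)"):
        return data.decode("utf-16")
    decoded = data.decode("utf-8")
    if name.endswith(".html"):
        # Crude tag stripping; a real pipeline would use an HTML parser.
        return re.sub(r"<[^>]+>", " ", decoded)
    return decoded


def normalized_text_hash(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing."""
    canonical = " ".join(text.split()).lower()
    return md5_hex(canonical.encode("utf-8"))


file_hashes = {name: md5_hex(data) for name, data in renditions.items()}
sizes = {name: len(data) for name, data in renditions.items()}
text_hashes = {
    name: normalized_text_hash(extract_text(name, data))
    for name, data in renditions.items()
}

for name in renditions:
    print(f"{name:22} {sizes[name]:4} bytes  file={file_hashes[name]}  text={text_hashes[name]}")
```

The three files come out with three different sizes and three different file-level HASH values, while the HASH of the extracted, normalized text matches across all of them. That is the basic reason near dupes have to be compared on text content rather than on the files themselves.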

As usual, I’ve tried not to steal all of Craig’s thunder from his post, so please check it out here.

So, what do you think?  What has been your most unusual deduplication challenge?  Please share any comments you might have, or let us know if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.