eDiscovery Daily Blog

Hashing Out the Idea of a Standard Hash Algorithm for Vendors: eDiscovery Best Practices

In a blog post earlier this month, Craig Ball discussed the question (which was posed at the recent ILTACON conference by Beth Patterson, Chief Legal & Technology Services Officer for Allens) of why eDiscovery service providers can’t (or don’t) standardize hash values so as to support identification and deduplication across products and collections.  Good question.  Let’s take a look.

In his post from his excellent Ball in Your Court blog (Cross-Matter & -Vendor Message ID), Craig noted that standardization would enable you to use work from one matter in another and flag emails already identified as privileged in one case so that they don’t slip through.  Wouldn’t that be great?

According to Craig, unfortunately, the panelists’ response to the question appeared to be to characterize it as “a big technical challenge.”

Craig then took a look at the issue, beginning by recapping some “hash facts” to establish a baseline for understanding the considerations in computing hash values.  He then differentiated loose documents (easy, because as long as they are properly preserved, they should generate the same hash value consistently) from emails.  It is harder to generate consistent hash values for emails because the hash value of an email depends on when it is exported, as well as other factors.  So, the same email exported at different times or from different email clients will have a different hash value – even though we see them as the same, the computer doesn’t.  Make sense?
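To make the loose file versus email distinction concrete, here is a minimal Python sketch. The two “exports” are made-up byte strings standing in for the same message saved by two different clients; the header name and contents are illustrative assumptions, not real export formats.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of raw bytes as lowercase hex."""
    return hashlib.md5(data).hexdigest()

# A loose file: identical bytes always produce the identical hash.
loose_file = b"Quarterly report, final version.\n"
copy_of_file = bytes(loose_file)

# Hypothetical exports of one email: same message, different wrapper bytes.
export_a = b"X-Exported-By: ClientA\r\nSubject: Lunch?\r\n\r\nNoon works.\r\n"
export_b = b"X-Exported-By: ClientB\r\nSubject: Lunch?\r\n\r\nNoon works.\r\n"

same_file = md5_hex(loose_file) == md5_hex(copy_of_file)
same_email = md5_hex(export_a) == md5_hex(export_b)
print(same_file, same_email)
```

The loose file and its copy match, while the two exports do not, even though a reader would call them the same email.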

Craig also looked at some approaches for generating standardized hash values for emails, compared the MD5 and SHA-1 methods of hashing, and debunked the idea that MD5 hash values aren’t unique enough to be “defensible”.  There are 340,282,366,920,938,463,463,374,607,431,768,211,456 (that is, 2^128) possible MD5 hash values.  Unique enough for you?
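The size of the MD5 value space follows from its 128-bit output, so the exact count is 2^128. The arithmetic can be checked in a few lines of Python; the SHA-1 figure is included only for comparison, since SHA-1 produces 160-bit values.

```python
import hashlib

# digest_size is reported in bytes, so multiply by 8 to get bits.
md5_bits = hashlib.md5().digest_size * 8     # 16 bytes -> 128 bits
sha1_bits = hashlib.sha1().digest_size * 8   # 20 bytes -> 160 bits

md5_values = 2 ** md5_bits
sha1_values = 2 ** sha1_bits

print(f"{md5_values:,}")   # 340,282,366,920,938,463,463,374,607,431,768,211,456
```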

I asked Bill David, Chief Technical Officer at CloudNine and architect of the platform, about the use of MD5 for generating hash values.

“Of these (and other) HASH routines, we ultimately chose MD5 for a couple of reasons,” Bill said. “First, for all practical purposes, MD5 Hash is sufficient for identifying duplicate files in a given population. Second, it’s faster than the alternatives. And third, it is widely available. You can find the MD5 Hash routine in all major computer languages as well as in most relational databases. This allows us to utilize and generate HASH values from a client’s browser all the way down the line to the relational databases used in a review platform.”

As for the idea of eDiscovery vendors agreeing to use the same routine to generate the same hash value, Bill seemed to think it was very doable and advocated a concatenation approach:

“As is commonly known, emails throw us a monkey wrench. Every email has some hidden data that is unique to that file. And as a result, we have to pick certain sections of a given email to construct a “string” of data, which we can then “HASH” to generate a unique value. But the slightest change in the format of the data affects the resulting unique hash. Something as simple as a single extra space will result in a completely different hash value.”
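Bill’s point about a single extra space is easy to see for yourself. The Python sketch below hashes two made-up strings that differ only by one space; the sample addresses and the choice of UTF-8 encoding are assumptions for illustration.

```python
import hashlib

def md5_hex(text: str) -> str:
    """Hash a string as UTF-8 (an assumed encoding) and return lowercase hex."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

original = "2009-06-15T13:45:30 alice@example.com bob@example.com Lunch?"
extra_space = "2009-06-15T13:45:30 alice@example.com  bob@example.com Lunch?"

hash_a = md5_hex(original)
hash_b = md5_hex(extra_space)

# Count the hex positions where the two digests happen to agree.
matching = sum(a == b for a, b in zip(hash_a, hash_b))
print(hash_a)
print(hash_b)
print(f"{matching} of 32 hex characters match")
```

The two values bear no useful resemblance to each other, which is why every vendor would need to agree on exactly the same normalization rules.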

“What we have to do is to take the different parts of an email, combine them all together and hash the result. At CloudNine, we pull these parts of an email and separate them with a single space.

  • SentDate (in the ISO format)
  • From
  • To
  • CC
  • BCC
  • Subject
  • Attachments (file names separated by semi-colons)
  • MsgText (text version)”

Bill, while noting that these are his initial thoughts after reading Craig’s article and might be subject to some revision, suggested a way to “code” it, in this case using the C# (C Sharp) programming language:

“The combination of these fields gives us a unique fingerprint of an email. As an extra step in trying to normalize data it’s wise to ‘trim’ up these fields (remove any leading or trailing spaces). So in code it would look like this:”

hashString = String.Format("{0} {1} {2} {3} {4} {5} {6} {7}",
     args.file.SentDate.ToString("yyyy'-'MM'-'dd'T'HH':'mm':'ss"),   // ISO format, e.g. 2009-06-15T13:45:30
     args.file.From.Trim(),
     args.file.To.Trim(),
     args.file.CC.Trim(),
     args.file.BCC.Trim(),
     args.file.Subject.Trim(),
     args.file.Attachments.Trim(),
     args.file.MsgText.Trim());

“We now have a string to hash. The last step is to hash the string. Many MD5 hash routines will contain ‘dashes’. In one more step to normalize the results let’s remove those dashes and force all of the characters to lower case.”

hash = clsHash.GetHash(hashString, clsHash.HashType.MD5).Replace("-", "").ToLower();

“Based on my initial thoughts, that’s how you could standardize a hash value to use for deduping.”
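For readers who don’t work in C#, here is a rough Python analogue of the approach Bill describes. It is a sketch under stated assumptions, not CloudNine’s implementation: the `Email` class and its field names are hypothetical, the text is encoded as UTF-8 before hashing (the post doesn’t specify an encoding), and the sent date is formatted as given, with no time zone handling.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class Email:
    """Hypothetical container for the fields listed in the post."""
    sent_date: datetime
    sender: str
    to: str
    cc: str
    bcc: str
    subject: str
    attachments: List[str] = field(default_factory=list)
    msg_text: str = ""

def build_hash_string(email: Email) -> str:
    """Trim each field and join all eight with a single space."""
    parts = [
        email.sent_date.strftime("%Y-%m-%dT%H:%M:%S"),  # e.g. 2009-06-15T13:45:30
        email.sender.strip(),
        email.to.strip(),
        email.cc.strip(),
        email.bcc.strip(),
        email.subject.strip(),
        ";".join(email.attachments).strip(),  # file names separated by semicolons
        email.msg_text.strip(),
    ]
    return " ".join(parts)

def email_hash(email: Email) -> str:
    """MD5 of the normalized string; hexdigest() is already lowercase, no dashes."""
    return hashlib.md5(build_hash_string(email).encode("utf-8")).hexdigest()

# Two copies of one message that differ only in stray leading/trailing whitespace.
sent = datetime(2009, 6, 15, 13, 45, 30)
a = Email(sent, "alice@example.com", " bob@example.com ", "", "", "Lunch?",
          ["menu.pdf"], "Noon works.\n")
b = Email(sent, "alice@example.com", "bob@example.com", "", "", " Lunch?",
          ["menu.pdf"], "Noon works.")

print(email_hash(a) == email_hash(b))
```

Because both copies trim down to the same string, they produce the same hash. Anything the trimming doesn’t cover, such as differences in address formatting or time zones, would still yield different values, so those rules would also have to be part of any agreed standard.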

Sounds like standardization on a method for generating hash values could be relatively straightforward – if you can get all the vendors to agree.

So, what do you think?  Would you benefit from a standardized method for computing hash values across all eDiscovery platforms?  Please share any comments you might have, or let us know if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.