eDiscovery Daily Blog

Skip the HASH When Deduping Outlook MSG Files – eDiscovery Best Practices

As we discussed recently in this blog, Microsoft® Outlook emails can take many forms. One of those forms is the MSG file, a self-contained unit for an individual message “family” (an email and its attachments). MSG files can exist on your computer in the same folders as Word, Excel and other data files. But, when it comes to deduping, the approach for MSG files is typically different from the approach for those other files.

A few years ago, I was assisting a client and collecting emails from their email archiving system for discovery, outputting the selected emails to individual MSG files (per their request). Because this was an enterprise-wide search of email archives, the searches that I performed found the same emails again and again in different custodian folders. There were literally hundreds of thousands of duplicate emails in this collection. Of course, this is typical – anytime you send an email to three co-workers, all four of you have a copy of the email (assuming none of you deleted it). If the email is responsive and your goal is to dedupe across custodians, you only want to review and produce one copy, not four.

However, had I performed a HASH value identification of duplicates on those output MSG files, I would have found no duplicates. Why is that?

That’s because each MSG file contains a field which stores the Creation Date and Time. Because this value is set to the date and time the MSG is saved, two MSG files with otherwise identical email content differ in their underlying bytes, so they produce different HASH values and will not be considered duplicates. Remember how “drag and drop” sets the Creation Date and Time of the copy to the current date and time? The same thing happens when an MSG file is created.
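To see why a file-level HASH misses these duplicates, here is a minimal sketch in Python. It does not parse real MSG files: the byte strings are hypothetical stand-ins in which a made-up `CREATED=` prefix plays the role of the embedded Creation Date and Time, and SHA-256 is simply one choice of hash function (the function name `file_hash` is also just for illustration).

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-in for the email content shared by both copies.
email_content = b"From: alice\nTo: bob\nSubject: Q3 numbers\n\nSee attached."

# Two saved copies of the same email. Only the embedded creation
# timestamp differs (the CREATED= prefix is an invented placeholder,
# not the real MSG layout).
copy_a = b"CREATED=2012-03-01T09:15:02;" + email_content
copy_b = b"CREATED=2012-03-01T09:15:07;" + email_content

hash_a = file_hash(copy_a)
hash_b = file_hash(copy_b)

# The email is the same, but the bytes are not, so the hashes differ.
hashes_match = hash_a == hash_b
print(hash_a)
print(hash_b)
print("Duplicates by file hash?", hashes_match)
```

A difference of even a few bytes anywhere in the file changes the whole hash value, which is why the two copies look unrelated to a file-level comparison.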

Hmmm, what to do? Typically, the approach for MSG files is to use key metadata fields to identify duplicates. Many processing vendors use a combination of fields that typically consists of: From, To, CC, BCC, Subject, Attachment Name, Sent Date/Time and Body of the email. Some use those fields only on MSG files; others use them on all emails (to dedupe individual emails within MSG files against those same emails within an OST or a PST file).
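The field-based approach can be sketched in Python as follows. This is an illustration under stated assumptions, not any particular vendor's method: the `Email` class, the `dedupe_key` and `dedupe` functions, and the normalization rules (lowercasing addresses, sorting recipients and attachment names, collapsing whitespace) are all hypothetical choices, and the fields are assumed to have been extracted from the MSG files already. Hashing the combined fields is just a convenient way to compare them; the point is that the hash covers the fields, not the file.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Email:
    """Hypothetical container for fields already extracted from an MSG file."""
    sender: str
    to: Tuple[str, ...]
    cc: Tuple[str, ...]
    bcc: Tuple[str, ...]
    subject: str
    attachment_names: Tuple[str, ...]
    sent: str  # sent date/time as text, assumed already in one consistent format
    body: str


def _addresses(values: Iterable[str]) -> str:
    # Assumption: address case and recipient order do not matter.
    return ";".join(sorted(v.strip().lower() for v in values))


def _text(value: str) -> str:
    # Assumption: runs of whitespace are not meaningful differences.
    return " ".join(value.split())


def dedupe_key(email: Email) -> str:
    """Build a comparison key from the metadata fields, not the file bytes."""
    parts = [
        _addresses([email.sender]),
        _addresses(email.to),
        _addresses(email.cc),
        _addresses(email.bcc),
        _text(email.subject),
        ";".join(sorted(_text(name) for name in email.attachment_names)),
        email.sent.strip(),
        _text(email.body),
    ]
    # A separator that is unlikely to appear in the fields keeps them distinct.
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def dedupe(emails: Iterable[Email]) -> List[Email]:
    """Keep the first email seen for each key and drop later copies."""
    seen = set()
    unique = []
    for email in emails:
        key = dedupe_key(email)
        if key not in seen:
            seen.add(key)
            unique.append(email)
    return unique
```

Because the creation timestamp of the MSG file never enters the key, copies of the same email saved at different times collapse to a single entry. How aggressively to normalize each field is a judgment call that affects which emails count as duplicates.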

So, if you’re hungry to eliminate duplicates from your collection of MSG files, skip the HASH and use the metadata fields.  It’s much more (ful)filling.

So, what do you think? Have you encountered any challenges when it comes to deduping emails? Please share any comments you might have, or let us know if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.