eDiscovery Daily Blog

Version 1 of the EDRM Enron Data Set NOW AVAILABLE – eDiscovery Trends

Last week, we reported from the Annual Meeting for the Electronic Discovery Reference Model (EDRM) group and discussed some significant efforts and accomplishments by each of the project teams within EDRM.  That included an update from the EDRM Data Set project, where an effort was underway to identify and remove personally-identifiable information (“PII”) data from the EDRM Data Set.  Now, version 1 of the Data Set is completed and available for download.

To recap, the EDRM Enron Data Set, sourced from the FERC Enron Investigation release made available by Lockheed Martin Corporation, has been a valuable resource for eDiscovery software demonstration and testing (we covered it here back in January 2011).  Initially, the data was made available for download on the EDRM site, then subsequently moved to Amazon Web Services (AWS).  However, after much recent discussion about PII data (including social security numbers, credit card numbers, dates of birth, home addresses and phone numbers) available within FERC (and consequently the EDRM Data Set), the EDRM Data Set was taken down from the AWS site.

Yesterday, EDRM, along with Nuix, announced that they have republished version 1 of the EDRM Enron PST Data Set (which contains over 1.3 million items) after cleansing it of private, health and personal financial information. Nuix and EDRM have also published the methodology Nuix’s staff used to identify and remove more than 10,000 high-risk items.

As noted in the announcement, Nuix consultants Matthew Westwood-Hill and Ady Cassidy used a series of investigative workflows to identify the items, which included:

  • 60 items containing credit card numbers, including departmental contact lists that each contained hundreds of individual credit cards;
  • 572 items containing Social Security or other national identity numbers—thousands of individuals’ identity numbers in total;
  • 292 items containing individuals’ dates of birth;
  • 532 items containing information of a highly personal nature such as medical or legal matters.

While the personal data was (and still is) available via FERC long before the EDRM version was created, completion of this process will mean that many in the eDiscovery industry that rely on this highly useful data set for testing and software demonstration can now use a version which should be free from sensitive personal information!

For more information regarding the announcement, click here. The republished version 1 of the Data Set, as well as the white paper discussing the methodology is available at nuix.com/enron.  Nuix is currently applying the same methodology to the EDRM Enron Data Set v2 (which contains nearly 2.3 million items) and will publish to the same site when complete.

So, what do you think?  Have you used the EDRM Enron Data Set?  If so, do you plan to download the new version?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.