eDiscovery Daily Blog
The Enron Data Set is No Longer a Representative Test Data Set: eDiscovery Best Practices
If you attend any legal technology conference where eDiscovery software vendors are showing their latest software developments, you may have noticed the data that is used to illustrate features and capabilities by many of the vendors – it’s data from the old Enron investigation. The Enron Data Set has remained the go-to data set for years as the best source of high-volume data to be used for demos and software testing. And, it’s still good for software demos. But, it’s no longer a representative test data set for testing processing – at least not as it’s constituted – and it hasn’t been for a good while. Let’s see why.
But first, here’s a reminder of what the Enron Data Set is.
The data set is public domain data from Enron Corporation that originated from the Federal Energy Regulatory Commission (FERC) Enron Investigation (you can still access that information here). The original data set consists of:
- Email: Data consisting of 92% of Enron’s staff emails;
- Scanned documents: 150,000 scanned pages in TIFF format of documents provided to FERC during the investigation, accompanied by OCR generated text of the images;
- Transcripts: 40 transcripts related to the case.
Over eight years ago, EDRM created a Data Set project that took the email and generated PST files for each of the custodians (about 170 PST files for 151 custodians). The result is roughly 34.5 GB in 153 zip files, probably two to three times that size unzipped (I haven’t recently unzipped it all to check the exact size). The Enron emails were originally in Lotus Notes databases, so the PSTs created aren’t a perfect rendition of what they might look like if they originated in Outlook (for example, there are a lot of internal Exchange addresses vs. SMTP email addresses), but it has still been a really useful, good-sized collection for demo and test data. EDRM has also since created some micro test data sets, which are good for specific test cases, but not for high-volume testing.
As people began to use the data, it became apparent that there was a lot of Personally Identifiable Information (PII) contained in the set, including social security numbers and credit card numbers (back in the late 90s and early 2000s, there was nowhere near as much concern about data privacy as there is today). So, a couple of years later, EDRM partnered with NUIX to “clean” the data of PII and they removed thousands of emails with PII (though a number of people identified additional PII after that process was complete, so be careful).
If there are comparable high-volume public domain collections that are representative of a typical email collection for discovery, I haven’t seen them (and, believe me, I have looked). Sure, you can get a high-volume dump of data from Wikipedia or other sites out there, but that’s not indicative of a typical eDiscovery data set. If any of you out there know of any that are, I’m all ears.
Until then, the EDRM Enron Data Set remains the gold standard as the best high-volume example of a public domain email collection. So, why isn’t it a good test data set anymore for processing?
Do you remember the days when Microsoft Outlook limited the size of a PST file to 2 GB? That limit applied to Outlook 2002 and earlier versions and, years ago, that was about the largest PST file we typically had to process in eDiscovery. Since Outlook 2003, a new PST file format has been used, which supports Unicode and doesn’t have the 2 GB size limit any more. Now, in discovery, we routinely see single PST files that are 20, 30 or 40 GB, or even larger.
What difference does it make? The challenge today with large mailstore files like these (as well as large container files, including ZIP and forensic containers) is that single-threaded processes bog down on these large files and they can take a long time to process. These days, to get through large files like these more quickly, you need multi-threaded processing capabilities and the ability to throw multiple agents at these files to get them processed in a fraction of the time. As an example, we’ve seen processing throughput increased 400-600% with multi-threaded ingestion using CloudNine’s LAW product compared to single-threaded processes (a reduction of processing time from over 24 hours to a little over 4 hours in a recent example). Large container files are very typical in eDiscovery collections today and many PST files we see today are anywhere from 10 GB to more than 50 GB in size. They’re becoming the standard in most eDiscovery email collections.
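To put rough numbers on that example: going from about 24 hours to about 4 hours is roughly a 6x speedup, which works out to about a 500% increase in throughput, right in the middle of the 400-600% range. Here is a minimal Python sketch of that arithmetic. The function names are hypothetical, and the 24 and 4 hour figures are rounded from the example above rather than exact measurements.

```python
def speedup(baseline_hours: float, improved_hours: float) -> float:
    """How many times faster the improved run is than the baseline run."""
    if baseline_hours <= 0 or improved_hours <= 0:
        raise ValueError("processing times must be positive")
    return baseline_hours / improved_hours


def percent_throughput_increase(baseline_hours: float, improved_hours: float) -> float:
    """Throughput gain as a percentage for the same volume of data.

    Throughput is data volume divided by time, so for a fixed volume the
    gain depends only on the ratio of the two processing times.
    """
    return (speedup(baseline_hours, improved_hours) - 1.0) * 100.0


if __name__ == "__main__":
    # Rounded figures from the example: roughly 24 hours down to roughly 4 hours.
    print(speedup(24, 4))
    print(percent_throughput_increase(24, 4))
```

Note that a "500% increase" and a "6x speedup" describe the same result, so it is worth asking which of the two a vendor means when quoting a number.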
As I mentioned, the Enron Data Set is 170 PST files over 151 custodians, with some of the larger custodians’ collections broken into multiple PST files (one custodian has 11 PST files in the collection). But, none of them are over 2 GB in size (presumably Lotus Notes had a similar size limit back in the day) and most of them are less than 200 MB. That’s not indicative of a typical eDiscovery email collection today and wouldn’t provide a representative speed test for processing purposes.
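If you want to check how your own test data compares with that profile, a quick size survey helps. The sketch below is a minimal Python example, assuming the PST files sit somewhere under a single folder. The function name, the keys it returns and the folder path in the usage note are all hypothetical, and it only looks at file sizes on disk, not at what is inside the PSTs.

```python
from pathlib import Path
from statistics import median

TWO_GB = 2 * 1024 ** 3  # the old PST size limit discussed above, in bytes


def summarize_pst_sizes(root, threshold_bytes=TWO_GB):
    """Summarize the sizes of all .pst files found under the folder `root`."""
    sizes = sorted(
        p.stat().st_size
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pst"
    )
    if not sizes:
        # Nothing found: return an all-zero summary instead of failing.
        return {"count": 0, "total_bytes": 0, "largest_bytes": 0,
                "median_bytes": 0, "over_threshold": 0}
    return {
        "count": len(sizes),
        "total_bytes": sum(sizes),
        "largest_bytes": sizes[-1],
        "median_bytes": median(sizes),
        # Number of files larger than the threshold (2 GB by default).
        "over_threshold": sum(1 for s in sizes if s > threshold_bytes),
    }
```

Calling something like `summarize_pst_sizes("/path/to/your/psts")` (a placeholder path) on a collection shows at a glance whether any files exceed 2 GB. If none do, a processing speed test on that collection probably says little about how a tool handles today's large mailstore files.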
Can the Enron Data Set still be used to benchmark single-threaded vs. multi-threaded processes? Yes, but not as it’s constituted – to do so, you have to combine the individual PST files into larger PST files that are more representative of today’s collections. We did that at CloudNine and came up with a 42 GB PST file that contains several hundred thousand de-duped emails and attachments. You could certainly break that up into 2-4 smaller PST files to conduct a test of multiple PST files as well. That provides a more typical eDiscovery collection in today’s terms – at least on a small scale.
So, when an eDiscovery vendor quotes processing throughput numbers for you, it’s important to know the types of files that they were processing to obtain those metrics. If they were using the Enron Data Set as is, those numbers may not be as meaningful as you think. And, if somebody out there is aware of a good, new, large-volume, public domain, “typical” eDiscovery collection with large mailstore and container files (not to mention content from mobile devices and other sources), please let me know!
So, what do you think? Do you still rely on the Enron Data Set for demo and test data? Please share any comments you might have, or let us know if you’d like to know more about a particular topic.
Sponsor: This blog is sponsored by CloudNine, which is a data and legal discovery technology company with proven expertise in simplifying and automating the discovery of data for audits, investigations, and litigation. Used by legal and business customers worldwide including more than 50 of the top 250 Am Law firms and many of the world’s leading corporations, CloudNine’s eDiscovery automation software and services help customers gain insight and intelligence on electronic data.
Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.