Processing

Evaluate a Proven Approach to eDiscovery and Data Processing with CloudNine Explore

The digital age has had a major impact on more than just how we occupy our free time. It’s also changed the way we review and process legal data.  

 Lawyers and paralegals handle much more than the physical evidence of discovery. Most law firms sift through unprecedented volumes of evidence that come with the digital age. 

 

When Data Volumes Exceed Capacity: Controlling The Ever Growing Amount of Data

Legal service providers (LSPs) review and process massive sets of complex and diverse digital content, oftentimes in the terabytes. For context, consider this comparison of data:

  • 1 MB = a 400-page book
  • 1 GB = over a thousand 400-page books
  • 1 TB = more than a million 400-page books
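The comparison above follows from simple unit arithmetic. Here is a minimal sketch, assuming one 400-page book is roughly 1 MB and binary units (1,024 MB per GB, 1,024 GB per TB); the constant and function names are illustrative only.

```python
# Rough scale arithmetic, assuming 1 MB is about one 400-page book
# and binary units (1 GB = 1024 MB, 1 TB = 1024 GB).
BOOKS_PER_MB = 1
MB_PER_GB = 1024
GB_PER_TB = 1024


def books_in(megabytes: int) -> int:
    """Return the approximate number of 400-page books in a data volume."""
    return megabytes * BOOKS_PER_MB


books_per_gb = books_in(MB_PER_GB)
books_per_tb = books_in(MB_PER_GB * GB_PER_TB)

print(f"1 GB is roughly {books_per_gb:,} books")
print(f"1 TB is roughly {books_per_tb:,} books")
```

Real collections vary widely by file type, so treat these figures as an order-of-magnitude guide rather than an estimate.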

Faced with this overwhelming volume of data, you need an eDiscovery solution that combines high processing speed with high accuracy so you can fight your cases with maximum efficiency. 

Reaping the Benefits of Cloud-Based Discovery Software

You can also lose control of your data in expensive cloud platforms. You’re completely dependent on THEIR solution, as they hold your data hostage indefinitely, at whatever rates they set.  Plus, if an eDiscovery solution doesn’t have the capacity to scale with your ever-increasing data needs or carry the solutions you need, you can suffer from ineffective workflow functionality. 

Your organization needs an eDiscovery solution that provides you with:

  • A great degree of workflow flexibility with on-premise and cloud solutions
  • The ability to add new fields, as needed
  • The power to flex up and down the data storage as you consume, allowing you to only pay for what you need
  • Continuous improvement of its data processing engine

With a cloud-based eDiscovery solution capable of handling the volume and variety of data you have in addition to the functionality and features you need, you will be able to work efficiently.

 

Evaluating The CloudNine Explore Solution

As the industry leader for processing eDiscovery data, CloudNine Explore is based on four key components:

  • Explore
  • Assess
  • Protect
  • Deliver

Explore:  

CloudNine Explore helps you navigate your way through massive volumes of data to identify risk, determine the scope of the project and control your costs. This helps you uncover important information about the data before you begin:

  • How much data is there?
  • What type of data has been collected?
  • What languages are included in the data?
  • What data is hidden from view?
  • Where did the data come from?

Knowing this information will help you evaluate the risk and potential cost of litigation at the earliest possible point.   This knowledge enables you to set more realistic costs to process and store the data, as well as to determine the size of your review team and necessary skill sets. 

Assess:  

With CloudNine Explore, you can inspect and review your data using both automated and manual processes. When you receive hundreds of thousands of documents and files through eDiscovery, you need to be able to process them quickly and efficiently. CloudNine Explore’s multi-threaded, multi-core indexing functionality helps you filter through your exported data faster so you can pull only the data you need. 

Being thoughtful about your research and using specific keyword search terms to promote specific documents helps you filter out data, such as specific email domain names, for later review.

Protect:  

If your data is lost, distorted or manipulated, or its integrity is otherwise compromised, the consequences can be devastating to your case. 

CloudNine Explore allows you to securely upload, ingest and preserve all relevant data for your ongoing investigations or litigation with an easier and more efficient eDiscovery platform.

All your data is stored securely in a single, on-prem location, housed safely behind the firewall. This safety net saves money while giving you direct access to your data so you know exactly where it is at all times.

Deliver:  

Sharing discovered assets with the opposing side for review is more than a courtesy; it’s required.  CloudNine Explore makes it easy to provide information as required for legal production or further investigation so you stay compliant. 

Avoid costly and time-consuming production of redundant and unnecessary documents while reducing the risk of producing privileged or protected content. 

 

Faster Data Ingestion, Faster ROI With CloudNine Explore

CloudNine Explore saves you money, which in turn positions your project to yield high ROI. 

  • Explore works extremely fast, which means less time spent processing and reviewing data. 
  • Explore stores your data securely in a single, on-prem location, which provides your organization with consistent and transparent pricing. 
  • Integration costs are minimal because of its simplicity. Installation, scale, and automation are simple and straightforward. You don’t even need an IT department to deploy it. 

Now that you’ve learned how CloudNine Explore allows you to safely store and process your data, request a free demo to learn how you can save time and money.

 

Working with CloudNine Explore and PST Attachments

#Did You Know: Yes, users really DO attach PSTs to emails!  When examining your early case data, you need complete content visibility including the multiple layers of PST and OST attachments within email containers.

Older processing engines have trouble extracting certain archive containers, especially when those containers have PSTs and OSTs attached to emails.  In these cases, the processing may skip over the attached email container or record it as an error.

CloudNine Explore fully expands the data container, without creating duplicate files, to process its contents in full, including multi-layered PST and OST files.

For example, a custodian creates a PST file containing several dozen messages about a particular topic and emails it to a co-worker.

Earlier processing engines could process the email sent to the co-worker but could not expand the attached PST to process its contents without requiring a separate and manual process.

 

Visual example of a PST file in a Zip file, with a PST attachment:

 

Explore uses the newest extraction technologies to fully expand the attached PST and collect the metadata and emails contained within.  The manual processes are not necessary, and the data is fully expanded and available for searching and review.
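The idea of fully expanding a container within a container can be illustrated with a short recursive sketch. This is not how CloudNine Explore is implemented; it is an assumption-laden illustration that uses nested ZIP files as a stand-in for PST attachments, because Python's standard library cannot read PST files. The function names `expand_container` and `make_zip` are hypothetical.

```python
import io
import zipfile


def expand_container(data: bytes, path: str = "") -> list[tuple[str, int]]:
    """Recursively list every leaf item in a ZIP, descending into nested ZIPs.

    Returns (path, size_in_bytes) pairs. Nested ZIPs stand in for
    attached PST/OST containers in this sketch.
    """
    items = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            child_path = f"{path}/{info.filename}" if path else info.filename
            payload = archive.read(info)
            if zipfile.is_zipfile(io.BytesIO(payload)):
                # The member is itself a container: expand it in memory
                # instead of skipping it or logging it as an error.
                items.extend(expand_container(payload, child_path))
            else:
                items.append((child_path, len(payload)))
    return items


def make_zip(members: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP from a name -> content mapping (test helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"message1.txt": b"hello", "message2.txt": b"world!"})
outer = make_zip({"email.txt": b"see attached", "attached.zip": inner})
found = expand_container(outer)
for item_path, size in found:
    print(item_path, size)
```

Expanding in memory and recording the full parent path keeps each extracted item tied to the container it came from, which is the same family relationship a reviewer needs to see.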

 

Example display of extracted email and metadata within Explore:

Have the assurance of a thorough early case assessment to find hidden or multi-layered files with CloudNine Explore.

Learn how to automate your eDiscovery with the legal industry’s most powerful processing and early case assessment tool.  Schedule a demo with a CloudNine eDiscovery specialist.

EDRM Announces Five New Projects: eDiscovery Best Practices

Did anybody doubt that EDRM under the leadership of Mary Mack and Kaylee Walstad was going to be doing BIG things?  If you did doubt it, here’s an announcement that signals that EDRM will be busy creating and improving frameworks, resources and standards within the eDiscovery community.

Last week, EDRM announced five new projects and is seeking new contributors for them.  They are:

Data Sets: This new project is being championed by Cash Butler, founder of Clarilegal, and is seeking project participants. “Everyone still tests and demonstrates with the very old and familiar data set that is comprised primarily of Enron email and attachment data,” claims Cash Butler. “A new modern data set needs to be created that is focused on modern data types as well as email. Slack, Snapchat, Instagram, text messaging, GPS and many other data types that are needed for testing and demonstrating how they process and present in a useful way. In addition, to creating the new data set we will also look to form a framework for community members to easily add, curate and update the data set to stay current.”

One word: Hallelujah!  We’ve needed new up-to-date data sets for years to replace the old Enron set, so I’m hopeful this team will make it happen.

Processing Specifications: John Tredennick, founder of Merlin Legal Open Source Foundation is championing this project with the help of co-trustees Craig Ball, president, Craig D. Ball P.C. (who recently created a processing primer) and Jeffrey Wolff, director of eDiscovery services and principal architect, ZyLAB. The Processing Specifications project will run in parallel with the Merlin Foundation’s programming project for processing.

Data Mapping: Eoghan Kenny, associate, senior manager data projects and Rachel McAdams (no, not her), data projects, at A & L Goodbody, Ireland are championing this project. The need for it has arisen due to the new SEAR Act (senior executive accountability regime), and it aims to help provide frameworks around who is responsible for what data and where it resides. “The importance of data mapping has grown enormously in Europe – not just for GDPR and investigation purposes, but also to help organizations deal with the increasingly active regulatory environment,” says Kenny. “However, most of our clients struggle with data mapping as it is a new concept to most organizations, with no clear business owner, that often sits in limbo between the “business” and “IT”! The goals of this project are to build frameworks for data mapping exercises, and provide clear guidelines on what the process should look like, because the better an organization understands its data, the cheaper it is to comply with any discovery or investigation obligations.”

State eDiscovery Rules: Suzanne Clark, discovery counsel at eDiscovery CoCounsel and Janice Yates, senior e-discovery consultant at Prism Litigation are co-championing this project, which looks at how the state rules relate to the federal eDiscovery rules in place. The vision for the State eDiscovery Rules project is to provide a starting point for attorneys to quickly reference the rules in different states and compare and contrast the federal rules with the various state rules relating to eDiscovery. For example, if an attorney is involved with a case in a state where they are not accustomed to practicing, this EDRM resource will allow them to quickly get up to speed on that state’s rules, where they differ and where they align with the federal rules. “The project work happening at the EDRM is impressive,” says Suzanne Clark. “The time and talent that the project leads and participants donate to the cause of advancing eDiscovery knowledge and good practices will surely serve to advance the industry and legal practice in the discovery realm.”  The project will start with Florida and Michigan and is looking for more contributors from other states.

I look forward to this as we need an up to date resource here – I’m not sure that the ones I’ve covered in the past are being actively updated.

Pro Bono: This project was just launched and has had an overwhelming response from people in every area: attorneys, paralegals (and associations), litigation support professionals, service providers, platforms, corporations and those in need. We are still seeking assistance as the need for access to justice is great. Stewarded by BDO director, George Socha and HB Gordon, eDiscovery manager for the Vanguard Group, the Pro Bono project will create subgroups to accelerate providing eDiscovery services to those in need.

As the announcement notes, projects, both ongoing and newly initiated, will be advanced at the EDRM Summit/Workshop 2020 at Duke University School of Law, June 24-26. I’ll have more to say about that as we get closer to it, but it certainly sounds like it will be very busy!  I’m certainly planning to be there!

So, what do you think?  Are you interested in participating in EDRM?  As always, please share any comments you might have or if you’d like to know more about a particular topic.

Sponsor: This blog is sponsored by CloudNine, which is a data and legal discovery technology company with proven expertise in simplifying and automating the discovery of data for audits, investigations, and litigation. Used by legal and business customers worldwide including more than 50 of the top 250 Am Law firms and many of the world’s leading corporations, CloudNine’s eDiscovery automation software and services help customers gain insight and intelligence on electronic data.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Here’s a Terrific Listing of eDiscovery Workstream Processes and Tasks: eDiscovery Best Practices

Let’s face it – workflows and workstreams in eDiscovery are as varied as organizations that conduct eDiscovery itself.  Every organization seems to do it a little bit differently, with a different combination of tasks, methodologies and software solutions than anyone else.  But, could a lot of organizations improve their eDiscovery workstreams?  Sure.  Here’s a resource (that you probably already know well) which could help them do just that.

Rob Robinson’s post yesterday on his terrific Complex Discovery site is titled The Workstream of eDiscovery: Considering Processes and Tasks and it provides a very comprehensive list of tasks for eDiscovery processes throughout the life cycle.  As Rob notes:

“From the trigger point for audits, investigations, and litigation to the conclusion of cases and matters with the defensible disposition of data, there are countless ways data discovery and legal discovery professionals approach and administer the discipline of eDiscovery.  Based on an aggregation of research from leading eDiscovery educators, developers, and providers, the following eDiscovery Processes and Tasks listing may be helpful as a planning tool for guiding business and technology discussions and decisions related to the conduct of eDiscovery projects. The processes and tasks highlighted in this listing are not all-inclusive and represent only one of the myriads of approaches to eDiscovery.”

Duly noted.  Nonetheless, the list of processes and tasks is comprehensive.  Here is the number of tasks for each process:

  • Initiation (8 tasks)
  • Legal Hold (11 tasks)
  • Collection (8 tasks)
  • Ingestion (17 tasks)
  • Processing (6 tasks)
  • Analytics (11 tasks)
  • Predictive Coding (6 tasks)*
  • Review (17 tasks)
  • Production/Export (6 tasks)
  • Data Disposition (6 tasks)

That’s 96 total tasks!  But, that’s not all.  There are separate lists of tasks for each method of predictive coding, as well.  Some of the tasks are common to all methods, while others are unique to each method:

  • TAR 1.0 – Simple Active Learning (12 tasks)
  • TAR 1.0 – Simple Passive Learning (9 tasks)
  • TAR 2.0 – Continuous Active Learning (7 tasks)
  • TAR 3.0 – Cluster-Centric CAL (8 tasks)

The complete list of processes and tasks can be found here.  While every organization has a different approach to eDiscovery, many have room for improvement, especially when it comes to exercising due diligence during each process.  Rob provides a comprehensive list of tasks within eDiscovery processes that could help organizations identify steps they could be missing in their processes.

So, what do you think?  How many steps do you have in your eDiscovery processes?  Please share any comments you might have or if you’d like to know more about a particular topic.

Exceptions are the Rule: eDiscovery Throwback Thursdays

Here’s our latest blog post in our Throwback Thursdays series, where we are revisiting some of the eDiscovery best practice posts we have covered over the years and discussing whether any of those recommended best practices have changed since we originally covered them.

This post was originally published on November 12, 2010, when eDiscovery Daily was less than two months old (over nine years ago!).  Despite that, the advice below is still applicable today pretty much as written back then.  Enjoy!

Virtually every collection of electronically stored information (ESI) has at least some files that cannot be effectively searched.  Corrupt files, password protected files and other types of exception files are pretty much constant components of your ESI collection and it can become very expensive to make these files searchable or reviewable.  Being without an effective plan for addressing these files could lead to problems – even spoliation claims – in your case.

How to Address Exception Files

The best way to develop a plan for addressing these files that is reasonable and cost-effective is to come to agreement with opposing counsel on how to handle them.  The prime opportunity to obtain this agreement is during the meet and confer with opposing counsel.  The meet and confer gives you the opportunity to agree on how to address the following:

  • Efforts Required to Make Unusable Files Usable: Corrupted and password protected files may be fairly easily addressed in some cases, whereas in others, it takes extreme (i.e., costly) efforts to fix those files (if they can be fixed at all). Up-front agreement with the opposition helps you determine how far to go in your recovery efforts to keep those recovery costs manageable.
  • Exception Reporting: Because there will usually be some files for which recovery is unsuccessful (or not attempted, if agreed upon with the opposition), you need to agree on how those files will be reported, so that they are accounted for in the production. The information on exception reports will vary depending on agreed upon format between parties, but should typically include: file name and path, source custodian and reason for the exception (e.g., the file was corrupt).
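As a rough illustration of the exception report fields listed above (file name and path, source custodian and reason for the exception), here is a minimal sketch that writes such a report as CSV. The record layout, column names and sample values are hypothetical; the actual format is whatever the parties agree on.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class ExceptionRecord:
    """One file that could not be processed (hypothetical layout)."""
    file_path: str   # file name and path
    custodian: str   # source custodian
    reason: str      # reason for the exception


def build_exception_report(records: list[ExceptionRecord]) -> str:
    """Return a CSV exception report with one row per exception file."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file_path", "custodian", "reason"])
    for record in records:
        writer.writerow([record.file_path, record.custodian, record.reason])
    return buffer.getvalue()


# Sample values are invented for illustration only.
records = [
    ExceptionRecord("share/finance/budget.xlsx", "Custodian A", "password protected"),
    ExceptionRecord("mail/archive/old.pst", "Custodian B", "corrupt file"),
]
report = build_exception_report(records)
print(report)
```

Keeping the reason as its own column makes it easy to total exceptions by type when discussing recovery efforts with opposing counsel.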

If your case is in a jurisdiction where a meet and confer is not required (such as state cases where the state has no meet and confer rule for eDiscovery), it is still best to reach out to opposing counsel to agree on the handling of exception files to control costs for addressing those files and avoid potential spoliation claims.

Next time, we’ll talk about the types of exception files and the options for addressing them.  Be patient – next Thursday is a holiday!  ;o)

So, what do you think?  Have you been involved in any cases where the handling of exception files was disputed?  Please share any comments you might have or if you’d like to know more about a particular topic.

The Files are Already Electronic, How Hard Can They Be to Load?: eDiscovery Throwback Thursdays

Here’s our latest blog post in our Throwback Thursdays series, where we are revisiting some of the eDiscovery best practice posts we have covered over the years and discussing whether any of those recommended best practices have changed since we originally covered them.

This post was originally published on July 25, 2013, when eDiscovery Daily was less than three years old.  It was a throwback post of sorts even back then, as it referenced several earlier posts.  Revisiting it today was inspired by Craig Ball’s new primer – Processing in E-Discovery – which I covered yesterday on our blog.  Craig’s new primer immediately confronts a myth that many attorneys believe with regard to electronic files and how easily (and quickly) they can be made ready for production.  Spoiler alert!  There’s a lot more to it than most attorneys realize.  Craig’s primer does a thorough job of explaining the ins and outs of that, but if you haven’t had a chance to read it all yet (you should), here are a few specific reasons, which I explained over six years ago, why the files need processing to be reviewable and useful.  Enjoy!

Since hard copy discovery became electronic discovery, I’ve worked with a number of clients who expect that working with electronic files in a review tool is simply a matter of loading the files and getting started.  Unfortunately, it’s not that simple!

Back when most discovery was paper based, the usefulness of the documents was understandably limited.  Documents were paper and they all required conversion to image to be viewed electronically, optical character recognition (OCR) to capture their text (though not 100% accurately) and coding (i.e., data entry) to capture key data elements (e.g., author, recipient, subject, document date, document type, names mentioned, etc.).  It was a problem, but it was a consistent problem – all documents needed the same treatment to make them searchable and usable electronically.

Though electronic files are already electronic, that doesn’t mean that they’re ready for review as is.  They don’t just represent one problem, they can represent a whole collection of problems.  For example:

These are just a few examples of why working with electronic files for review isn’t necessarily straightforward.  Of course, when processed correctly, electronic files include considerable metadata that provides useful information about how and when the files were created and used, and by whom.  They’re way more useful than paper documents.  So, it’s still preferable to work with electronic files instead of hard copy files whenever they are available.  But, despite what you might think, that doesn’t make them ready to review as is.

So, what do you think?  Do you work with attorneys who still expect the files to be available for review immediately?  Please share any comments you might have or if you’d like to know more about a particular topic.

Why Process in eDiscovery? Isn’t it “Review Ready”?: eDiscovery Best Practices

As I’ll point out in tomorrow’s blog post (spoiler alert!), I’ve been asked a variation of this question for years.  But, perhaps the best answer to this question lies in Craig Ball’s new primer – Processing in E-Discovery.

Craig, who introduced the new primer in his latest blog post – the 200th of his excellent Ball in Your Court blog – asked the questions posed in the title of this post in the beginning of that primer (after the Introduction) and confronts a myth that many attorneys believe with regard to electronic files and how easily (and quickly) they can be made ready for production.  As Craig explains:

“Though all electronically stored information is inherently electronically searchable, computers don’t structure or search all ESI in the same way; so, we must process ESI to normalize it to achieve uniformity for indexing and search.”

But I’m getting ahead of myself.  In the Introduction, Craig says this:

“Talk to lawyers about e‐discovery processing and you’ll likely get a blank stare suggesting no clue what you’re talking about.  Why would lawyers want anything to do with something so disagreeably technical? Indeed, processing is technical and strikes attorneys as something they need not know. That’s lamentable because processing is a phase of e‐discovery where things can go terribly awry in terms of cost and outcome. Lawyers who understand the fundamentals of ESI processing are better situated to avoid costly mistakes and resolve them when they happen.”

Then, Craig illustrates the point with a variation of the Electronic Discovery Reference Model (EDRM) which extracts processing as “an essential prerequisite” to Review, Analysis and Production (while noting that the EDRM model is a “conceptual view, not a workflow”).

As Craig discusses, to understand eDiscovery processing is to understand the basics of computers – from bits and bytes to ASCII and Unicode to Hex and Base64 and Encoding.  How to identify files based on file extensions, binary file signatures and file structure.  Why data compression makes smart phones, digitized music, streaming video and digital photography possible.  And, much more.

Want to know how “E-Discovery” would be written in a binary ASCII sequence?  Here you go:

0100010100101101010001000110100101110011011000110110111101110110011001010111001001111001
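For readers who want to reproduce that sequence, here is a minimal sketch of the conversion: each character is encoded as one ASCII byte and written as eight binary digits. The helper names `to_binary` and `from_binary` are illustrative and do not come from Craig's primer.

```python
def to_binary(text: str) -> str:
    """Encode text as ASCII and render each byte as 8 binary digits."""
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))


def from_binary(bits: str) -> str:
    """Decode a string of 8-bit groups back into ASCII text."""
    chunks = (bits[i:i + 8] for i in range(0, len(bits), 8))
    return bytes(int(chunk, 2) for chunk in chunks).decode("ascii")


encoded = to_binary("E-Discovery")
print(encoded)
print(from_binary(encoded))
```

The 11 characters in “E-Discovery” produce 88 binary digits, eight per character.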

Craig covers the gamut of processing – from ingestion to data extraction and document filters, from recursion and embedded object extraction to family tracking and exceptions reporting, from lexical preprocessing to building a database and Concordance(!) index.

Yes, it’s technical.  But, a very important read if you’re an attorney wanting to better understand eDiscovery processing and what’s involved and why the files need to be processed in the first place.  Many attorneys don’t understand what’s involved and that leads to unreasonable expectations and missed deadlines.

Craig’s new primer is a 55-page PDF file that is chock-full of good information about eDiscovery processing – a must read for attorneys and eDiscovery professionals alike.  He wrote it for the upcoming Georgetown Law Center Advanced E-Discovery Institute on November 21 and 22, which you can still register for (and get a discount for, per Craig’s blog post).  My only quibble with it is the spelling of “E-Discovery”, but that’s a quibble for another day (you’re welcome, Ari Kaplan!).  :o)

So, what do you think?  Are you mystified by eDiscovery processing?  Please share any comments you might have or if you’d like to know more about a particular topic.

The Number of Pages (Documents) in Each Gigabyte Can Vary Widely: eDiscovery Throwback Thursdays

Here’s our latest blog post in our Throwback Thursdays series, where we are revisiting some of the eDiscovery best practice posts we have covered over the years and discussing whether any of those recommended best practices have changed since we originally covered them.

This post was originally published on July 31, 2012 – when eDiscovery Daily wasn’t even two years old yet.  It’s “so old (how old is it?)” that it references a blog post from the now-defunct Applied Discovery blog.  We’ve even done an updated look at this topic with more file types about four years later.  Oh, and we are more focused on documents than pages for most of the EDRM life cycle (it’s the metric by which we evaluate processing through review), so it’s documents per GB that tends to get more consideration these days.

So, why is this important?  Not only for estimation purposes for review, but also for considering processing throughput.  If you have two 40 GB (or so) PST container files and one file has twice the number of documents as the other, the one with more documents will take considerably longer to process. It’s getting to a point where documents-per-hour throughput is becoming more important than GB-per-hour throughput, as the latter can vary widely depending on the number of documents per GB.  Today, we’re seeing processing throughput speeds as high as 1 million documents per hour with solutions like (shameless plug warning!) our CloudNine Explore platform.  This is why Early Data Assessment tools have become more important: they can provide that document count quickly, which leads to more accurate estimates.  Regardless, the exercise below illustrates just how widely the number of pages (or documents) can vary within a single GB.  Enjoy!
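To make the two-PST comparison concrete, here is a small sketch. It assumes processing time scales linearly with document count at a fixed documents-per-hour rate, which is a simplification, and the document counts are hypothetical; only the 40 GB size and the 1 million documents per hour figure come from the text above.

```python
DOCS_PER_HOUR = 1_000_000  # throughput figure quoted in the post
PST_SIZE_GB = 40


def processing_hours(document_count: int, docs_per_hour: int = DOCS_PER_HOUR) -> float:
    """Estimate hours to process, assuming time is linear in document count."""
    return document_count / docs_per_hour


def effective_gb_per_hour(size_gb: float, hours: float) -> float:
    """GB-per-hour rate implied by a given size and processing time."""
    return size_gb / hours


# Hypothetical document counts: the second PST holds twice as many documents.
docs_a, docs_b = 300_000, 600_000
hours_a = processing_hours(docs_a)
hours_b = processing_hours(docs_b)

print(f"PST A: {hours_a:.1f} h, {effective_gb_per_hour(PST_SIZE_GB, hours_a):.1f} GB/h")
print(f"PST B: {hours_b:.1f} h, {effective_gb_per_hour(PST_SIZE_GB, hours_b):.1f} GB/h")
```

Under these assumptions the documents-per-hour rate stays constant while the GB-per-hour rate halves for the denser file, which is why document counts give the steadier estimate.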

A long time ago, we talked about how the average number of pages in each gigabyte is approximately 50,000 to 75,000 pages and that each gigabyte effectively culled out can save $18,750 in review costs.  But, did you know just how widely the number of pages (or documents) per gigabyte can vary?  The “how many pages” question came up a lot back then and I’ve seen a variety of answers.  The aforementioned Applied Discovery blog post provided some perspective in 2012 based on the types of files contained within the gigabyte, as follows:

“For example, e-mail files typically average 100,099 pages per gigabyte, while Microsoft Word files typically average 64,782 pages per gigabyte. Text files, on average, consist of a whopping 677,963 pages per gigabyte. At the opposite end of the spectrum, the average gigabyte of images contains 15,477 pages; the average gigabyte of PowerPoint slides typically includes 17,552 pages.”

Of course, each GB of data is rarely just one type of file.  Many emails include attachments, which can be in any of a number of different file formats.  Collections of files from hard drives may include Word, Excel, PowerPoint, Adobe PDF and other file formats.  So, estimating page (or document) counts with any degree of precision is somewhat difficult.
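As a rough illustration of why mixed collections are hard to estimate, the sketch below weights the per-type averages from the quote above by a file-type mix.  The mix itself (half email, 30% Word, 20% images) is purely an assumption for the sake of the example, and the function name is mine, not from any tool.

```python
# Average pages per gigabyte by file type, as quoted from the Applied Discovery post.
PAGES_PER_GB = {
    "email": 100_099,
    "word": 64_782,
    "text": 677_963,
    "image": 15_477,
    "powerpoint": 17_552,
}


def estimate_pages(gb_by_type: dict[str, float]) -> float:
    """Sum the expected pages for each file type, weighted by its size in GB."""
    return sum(PAGES_PER_GB[file_type] * gb for file_type, gb in gb_by_type.items())


# Hypothetical 1 GB collection; this mix is an assumption for illustration only.
mix = {"email": 0.5, "word": 0.3, "image": 0.2}
total_pages = estimate_pages(mix)

print(f"Estimated pages in the mixed gigabyte: {total_pages:,.0f}")
```

With this particular mix the estimate comes out at about 72,600 pages, but shift the mix toward text files or images and the number swings dramatically in either direction.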

In fact, the same exact content ported into different applications can be a different size in each file, due to the overhead required by each application.  To illustrate this, I decided to conduct a little (admittedly unscientific) study using our one-page blog post (also from July 2012) about the Apple/Samsung litigation (the first of many as it turned out, as that litigation dragged on for years).  I decided to put the content from that page into several different file formats to illustrate how much the size can vary, even when the content is essentially the same.  Here are the results:

  • Text File Format (TXT): Created by performing a “Save As” on the web page for the blog post to text – 10 KB;
  • HyperText Markup Language (HTML): Created by performing a “Save As” on the web page for the blog post to HTML – 36 KB, over 3.5 times larger than the text file;
  • Microsoft Excel 2010 Format (XLSX): Created by copying the contents of the blog post and pasting it into a blank Excel workbook – 128 KB, nearly 13 times larger than the text file;
  • Microsoft Word 2010 Format (DOCX): Created by copying the contents of the blog post and pasting it into a blank Word document – 162 KB, over 16 times larger than the text file;
  • Adobe PDF Format (PDF): Created by printing the blog post to PDF file using the CutePDF printer driver – 211 KB, over 21 times larger than the text file;
  • Microsoft Outlook 2010 Message Format (MSG): Created by copying the contents of the blog post and pasting it into a blank Outlook message, then sending that message to myself, then saving the message out to my hard drive – 221 KB, over 22 times larger than the text file.
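For anyone who wants to check the math, the ratios above can be reproduced with a few lines of Python.  The sketch also extrapolates how many copies of this one page would fit in a gigabyte in each format; that part assumes 1 GB = 1,024 × 1,024 KB and is just as unscientific as the study itself.

```python
# File sizes in KB from the (admittedly unscientific) study above.
sizes_kb = {
    "TXT": 10,
    "HTML": 36,
    "XLSX": 128,
    "DOCX": 162,
    "PDF": 211,
    "MSG": 221,
}

KB_PER_GB = 1024 * 1024  # assumption: binary gigabyte

baseline = sizes_kb["TXT"]

# How many times larger each format is than the plain text file.
ratios = {fmt: size / baseline for fmt, size in sizes_kb.items()}

# How many copies of this single page would fit in one gigabyte, per format.
copies_per_gb = {fmt: KB_PER_GB // size for fmt, size in sizes_kb.items()}

for fmt in sizes_kb:
    print(f"{fmt}: {ratios[fmt]:.1f}x the text file, "
          f"about {copies_per_gb[fmt]:,} copies per GB")
```

The same page fits over 100,000 times into a gigabyte as text, but fewer than 5,000 times as an Outlook message.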

The Outlook example back then was probably the least representative of a typical email – most emails don’t have several embedded graphics in them (with the exception of signature logos) – and most are typically much shorter than the blog post I used (my copy also included the side text on the page).  Still, the example hopefully illustrates that a “page”, even with the same exact content, will be a different size in different applications.  Data size will enable you to provide a “ballpark” estimate for processing and review at best; to provide a more definitive estimate today, you need a document count.  Early data assessment has become more important than ever for estimating scope and time frame for delivery.

So, what do you think?  Was this example useful or highly flawed?  Or both?  Please share any comments you might have or if you’d like to know more about a particular topic.

Sponsor: This blog is sponsored by CloudNine, which is a data and legal discovery technology company with proven expertise in simplifying and automating the discovery of data for audits, investigations, and litigation. Used by legal and business customers worldwide including more than 50 of the top 250 Am Law firms and many of the world’s leading corporations, CloudNine’s eDiscovery automation software and services help customers gain insight and intelligence on electronic data.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

The Enron Data Set is No Longer a Representative Test Data Set: eDiscovery Best Practices

If you attend any legal technology conference where eDiscovery software vendors are showing their latest software developments, you may have noticed the data that is used to illustrate features and capabilities by many of the vendors – it’s data from the old Enron investigation.  The Enron Data Set has remained the go-to data set for years as the best source of high-volume data to be used for demos and software testing.  And, it’s still good for software demos.  But, it’s no longer a representative test data set for testing processing – at least not as it’s constituted – and it hasn’t been for a good while.  Let’s see why.

But first, here’s a reminder of what the Enron Data Set is.

The data set is public domain data from Enron Corporation that originated from the Federal Energy Regulatory Commission (FERC) Enron Investigation (you can still access that information here).  The original data set consists of:

  • Email: Data consisting of 92% of Enron’s staff emails;
  • Scanned documents: 150,000 scanned pages in TIFF format of documents provided to FERC during the investigation, accompanied by OCR generated text of the images;
  • Transcripts: 40 transcripts related to the case.

Over eight years ago, EDRM created a Data Set project that took the email and generated PST files for each of the custodians (about 170 PST files for 151 custodians).  The result is roughly 34.5 GB in 153 zip files, probably two to three times that size unzipped (I haven’t recently unzipped it all to check the exact size).  The Enron emails were originally in Lotus Notes databases, so the PSTs created aren’t a perfect rendition of what they might look like if they had originated in Outlook (for example, there are a lot of internal Exchange addresses vs. SMTP email addresses), but it has still been a really useful, good-sized collection for demo and test data.  EDRM has also since created some micro test data sets, which are good for specific test cases, but not for high volume.

As people began to use the data, it became apparent that there was a lot of Personally Identifiable Information (PII) contained in the set, including social security numbers and credit card numbers (back in the late 90s and early 2000s, there was nowhere near the concern about data privacy that there is today).  So, a couple of years later, EDRM partnered with NUIX to “clean” the data of PII and they removed thousands of emails with PII (though a number of people identified additional PII after that process was complete, so be careful).

If there are comparable high-volume public domain collections that are representative of a typical email collection for discovery, I haven’t seen them (and, believe me, I have looked).  Sure, you can get a high-volume dump of data from Wikipedia or other sites out there, but that’s not indicative of a typical eDiscovery data set.  If any of you out there know of any that are, I’m all ears.

Until then, the EDRM Enron Data Set remains the gold standard: the best high-volume example of a public domain email collection.  So, why isn’t it a good test data set anymore for processing?

Do you remember the days when Microsoft Outlook limited the size of a PST file to 2 GB?  That limit applied to Outlook 2002 and earlier versions, so years ago, 2 GB was about the largest PST file we typically had to process in eDiscovery.  Since Outlook 2003, a new PST file format has been used, which supports Unicode and doesn’t have the 2 GB size limit any more.  Now, in discovery, we routinely see single PST files that are 20, 30, 40 GB or even larger.

What difference does it make?  The challenge today with large mailstore files like these (as well as large container files, including ZIP and forensic containers) is that single-threaded processes bog down on these large files and they can take a long time to process.  These days, to get through large files like these more quickly, you need multi-threaded processing capabilities and the ability to throw multiple agents at these files to get them processed in a fraction of the time.  As an example, we’ve seen processing throughput increased 400-600% with multi-threaded ingestion using CloudNine’s LAW product compared to single-threaded processes (a reduction of processing time from over 24 hours to a little over 4 hours in a recent example).  Large container files are very typical in eDiscovery collections today and many PST files we see today are anywhere from 10 GB to more than 50 GB in size.  They’re becoming the standard in most eDiscovery email collections.
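To relate the before-and-after times to the percentage range, here is a quick sketch of the arithmetic.  It rounds the example to exactly 24 hours and 4 hours, which is an approximation ("over 24" and "a little over 4" in the text), and the function names are just illustrative.

```python
def speedup_factor(before_hours: float, after_hours: float) -> float:
    """How many times faster the job runs (e.g. 6.0 means six times faster)."""
    if before_hours <= 0 or after_hours <= 0:
        raise ValueError("hours must be positive")
    return before_hours / after_hours


def throughput_increase_pct(before_hours: float, after_hours: float) -> float:
    """Percentage increase in throughput implied by the time reduction."""
    return (speedup_factor(before_hours, after_hours) - 1) * 100


# Rounded figures from the example above (an approximation).
factor = speedup_factor(24, 4)
increase = throughput_increase_pct(24, 4)

print(f"{factor:.1f}x faster, a {increase:.0f}% increase in throughput")
```

Going from 24 hours to 4 hours is a 6x speedup, or a 500% increase in throughput, which sits in the middle of the 400-600% range.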

As I mentioned, the Enron Data Set is 170 PST files over 151 custodians, with some of the larger custodians’ collections broken into multiple PST files (one custodian has 11 PST files in the collection).  But, none of them are over 2 GB in size (presumably Lotus Notes had a similar size limit back in the day) and most of them are less than 200 MB.  That’s not indicative of a typical eDiscovery email collection today and wouldn’t provide a representative speed test for processing purposes.

Can the Enron Data Set still be used to benchmark single-threaded vs. multi-threaded processes?  Yes, but not as it’s constituted – to do so, you have to combine them into larger PST files more representative of today’s collections.  We did that at CloudNine and came up with a 42 GB PST file that contains several hundred thousand de-duped emails and attachments.  You could certainly break that up into 2-4 smaller PST files to conduct a test of multiple PST files as well.  That provides a more typical eDiscovery collection in today’s terms – at least on a small scale.

So, when an eDiscovery vendor quotes processing throughput numbers for you, it’s important to know the types of files that they were processing to obtain those metrics.  If they were using the Enron Data Set as is, those numbers may not be as meaningful as you think.  And, if somebody out there is aware of a good, new, large-volume, public domain, “typical” eDiscovery collection with large mailstore and container files (not to mention content from mobile devices and other sources), please let me know!

So, what do you think?  Do you still rely on the Enron Data Set for demo and test data?  Please share any comments you might have or if you’d like to know more about a particular topic.

Process This! – Close Outlook Before Compressing or Zipping PST Files for Processing: eDiscovery Best Practices

Having recently experienced this with a client, I thought I would revisit this helpful tip.  This is one of the tips Tom O’Connor and I will be covering this Friday – E-Discovery Day – on our webcast Murphy’s eDiscovery Law: How to Keep What Could Go Wrong From Going Wrong at noon CST (1:00pm EST, 10:00am PST).

As you may know, at CloudNine (shameless plug warning!), we have an automated processing capability for enabling clients to load and process their own data – they can use this capability to load their data into our review platform.  They can even process and load data straight into Relativity using our Outpost for Relativity module.

Regardless of whether they load data into CloudNine or Relativity, most of our users are using the processing capability to process emails, usually from Outlook Personal Storage Table (PST) files.  Even with increased volumes of social media and other types of electronically stored information, emails are still predominant in eDiscovery.  And, for users trying to process and load that data, we get one issue more than any other when it comes to processing those Outlook emails:

They still have Outlook open with the PST file opened when they attempt to upload that PST file or when they try to create a ZIP file containing the Outlook PST.

When that happens, the resulting ZIP file (created either by the user or by our client application if the data is not already contained in an archive file) will almost invariably be corrupted or empty.  Either way, this will result in a failure during processing of the loaded data, because the data being processed will simply be corrupt.

This is true not only for CloudNine processing, but also for any application that you use for processing, such as LAW PreDiscovery.  So, before attempting to create a ZIP (or RAR or other type of archive) of a PST file (or before you upload it to a platform like CloudNine for processing), make sure that Outlook is closed or at least that the PST file is closed within Outlook.  That’s the best way to have a positive “outlook” to discovering emails.  Get it?  :o)
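If you script your own archiving, a sanity check before and after zipping can catch this problem early.  The sketch below is a minimal example using Python's standard zipfile module; it is not how CloudNine or any other product does it.  The function names are hypothetical, and the "can I open it for writing?" test is only a heuristic: on Windows, a PST that Outlook has open will typically refuse to open this way, but on other operating systems the check may pass regardless.

```python
import zipfile
from pathlib import Path


def can_open_for_writing(path: Path) -> bool:
    """Heuristic lock check: try to open the file for read/write without changing it.

    Assumption: a PST held open by Outlook on Windows will usually raise an
    OSError (such as PermissionError) here. This is not a guarantee.
    """
    try:
        with open(path, "r+b"):
            return True
    except OSError:
        return False


def zip_and_verify(source: Path, archive: Path) -> bool:
    """Zip a single file, then confirm the archive is readable and complete."""
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as zf:
        zf.write(source, arcname=source.name)

    with zipfile.ZipFile(archive) as zf:
        # The archive must actually contain the file we put in.
        if source.name not in zf.namelist():
            return False
        # testzip() returns the name of the first corrupt member, or None.
        if zf.testzip() is not None:
            return False
        # The stored uncompressed size should match the source file on disk.
        return zf.getinfo(source.name).file_size == source.stat().st_size
```

A typical use would be to call can_open_for_writing() on the PST first and stop if it returns False, then call zip_and_verify() and only upload the archive if it returns True.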

So, what do you think?  Is email still the predominant source of discoverable ESI in your organization?  Please share any comments you might have or if you’d like to know more about a particular topic.
