Processing

The Number of Pages (Documents) in Each Gigabyte Can Vary Widely: eDiscovery Throwback Thursdays

Here’s our latest blog post in our Throwback Thursdays series, where we revisit some of the eDiscovery best practice posts we have covered over the years and discuss whether any of those recommended best practices have changed since we originally covered them.

This post was originally published on July 31, 2012 – when eDiscovery Daily wasn’t even two years old yet.  It’s “so old (how old is it?)”, it references a blog post from the now defunct Applied Discovery blog.  We’ve even done an updated look at this topic with more file types about four years later.  Oh, and because we are more focused on documents than pages for most of the EDRM life cycle (it’s the metric by which we evaluate processing through review), documents per GB tends to be the figure that gets more consideration these days.

So, why is this important?  Not only for estimation purposes for review, but also for considering processing throughput.  If you have two 40 GB (or so) PST container files and one file has twice the number of documents as the other, the one with more documents will take considerably longer to process.  It’s getting to a point where documents-per-hour throughput is becoming more important than GB-per-hour throughput, as the latter can vary widely depending on the number of documents per GB.  Today, we’re seeing processing throughput speeds as high as 1 million documents per hour with solutions like (shameless plug warning!) our CloudNine Explore platform.  This is why Early Data Assessment tools have become more important: they can quickly provide the document count that leads to more accurate estimates.  Regardless, the exercise below illustrates just how widely the number of pages (or documents) can vary within a single GB.  Enjoy!

A long time ago, we talked about how the average number of pages in each gigabyte is approximately 50,000 to 75,000 pages and that each gigabyte effectively culled out can save $18,750 in review costs.  But, did you know just how widely the number of pages (or documents) per gigabyte can vary?  The “how many pages” question came up a lot back then and I’ve seen a variety of answers.  The aforementioned Applied Discovery blog post provided some perspective in 2012 based on the types of files contained within the gigabyte, as follows:

“For example, e-mail files typically average 100,099 pages per gigabyte, while Microsoft Word files typically average 64,782 pages per gigabyte. Text files, on average, consist of a whopping 677,963 pages per gigabyte. At the opposite end of the spectrum, the average gigabyte of images contains 15,477 pages; the average gigabyte of PowerPoint slides typically includes 17,552 pages.”

Of course, each GB of data is rarely just one type of file.  Many emails include attachments, which can be in any of a number of different file formats.  Collections of files from hard drives may include Word, Excel, PowerPoint, Adobe PDF and other file formats.  So, estimating page (or document) counts with any degree of precision is somewhat difficult.
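To make the blending problem concrete, here is a minimal Python sketch that weights the per-type averages quoted above by each file type's share of the data.  The 10 GB collection and its mix percentages are purely hypothetical (they do not come from the Applied Discovery post), and the function name is made up for illustration.

```python
# Average pages per GB by file type, as quoted from the Applied Discovery post.
PAGES_PER_GB = {
    "email": 100_099,
    "word": 64_782,
    "text": 677_963,
    "image": 15_477,
    "powerpoint": 17_552,
}


def estimate_pages(total_gb, mix):
    """Estimate the page count of a collection from its size and file-type mix.

    `mix` maps a file type to its fraction of the total size; fractions must sum to 1.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix fractions must sum to 1")
    pages = sum(total_gb * share * PAGES_PER_GB[kind] for kind, share in mix.items())
    return round(pages)


# Hypothetical 10 GB collection: 60% email, 25% Word, 10% images, 5% PowerPoint.
example_mix = {"email": 0.60, "word": 0.25, "image": 0.10, "powerpoint": 0.05}
print(estimate_pages(10, example_mix))
```

Even a small shift in the mix (say, a few percent of plain text files) moves the estimate a lot, which is why a size-based figure is a ballpark at best.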

In fact, the same exact content ported into different applications can be a different size in each file, due to the overhead required by each application.  To illustrate this, I decided to conduct a little (admittedly unscientific) study using our one-page blog post (also from July 2012) about the Apple/Samsung litigation (the first of many as it turned out, as that litigation dragged on for years).  I decided to put the content from that page into several different file formats to illustrate how much the size can vary, even when the content is essentially the same.  Here are the results:

  • Text File Format (TXT): Created by performing a “Save As” on the web page for the blog post to text – 10 KB;
  • HyperText Markup Language (HTML): Created by performing a “Save As” on the web page for the blog post to HTML – 36 KB, over 3.5 times larger than the text file;
  • Microsoft Excel 2010 Format (XLSX): Created by copying the contents of the blog post and pasting it into a blank Excel workbook – 128 KB, nearly 13 times larger than the text file;
  • Microsoft Word 2010 Format (DOCX): Created by copying the contents of the blog post and pasting it into a blank Word document – 162 KB, over 16 times larger than the text file;
  • Adobe PDF Format (PDF): Created by printing the blog post to PDF file using the CutePDF printer driver – 211 KB, over 21 times larger than the text file;
  • Microsoft Outlook 2010 Message Format (MSG): Created by copying the contents of the blog post and pasting it into a blank Outlook message, then sending that message to myself, then saving the message out to my hard drive – 221 KB, over 22 times larger than the text file.
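The ratios above, and what they imply for documents per gigabyte, can be checked with a few lines of Python.  This is only a sketch: it assumes binary units (1 GB = 1,048,576 KB) and pretends a gigabyte holds nothing but copies of this one page, which no real collection does.

```python
# File sizes in KB for the same one-page blog post, as measured above.
SIZES_KB = {
    "TXT": 10,
    "HTML": 36,
    "XLSX": 128,
    "DOCX": 162,
    "PDF": 211,
    "MSG": 221,
}

KB_PER_GB = 1024 * 1024  # assumption: binary units, 1 GB = 1,048,576 KB


def ratio_to_text(fmt):
    """How many times larger a format's file is than the plain-text version."""
    return SIZES_KB[fmt] / SIZES_KB["TXT"]


def copies_per_gb(fmt):
    """How many whole copies of this one-page document would fit in 1 GB."""
    return KB_PER_GB // SIZES_KB[fmt]


for fmt in SIZES_KB:
    print(f"{fmt}: {ratio_to_text(fmt):.1f}x text size, {copies_per_gb(fmt):,} copies per GB")
```

Under those assumptions, a gigabyte would hold roughly 104,857 copies of the text file but only about 4,744 copies of the MSG file, for exactly the same content.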

The Outlook example back then was probably the least representative of a typical email – most emails don’t have several embedded graphics in them (with the exception of signature logos) – and most are typically much shorter than that blog post (which also included the side text on the page as I copied that too).  Still, the example hopefully illustrates that a “page”, even with the same exact content, will be different sizes in different applications.  Data size will enable you to provide a “ballpark” estimate for processing and review at best; to provide a more definitive estimate, you need a document count.  Early data assessment has become more important than ever for producing better estimates of scope and time frame for delivery.

So, what do you think?  Was this example useful or highly flawed?  Or both?  Please share any comments you might have or if you’d like to know more about a particular topic.

Sponsor: This blog is sponsored by CloudNine, which is a data and legal discovery technology company with proven expertise in simplifying and automating the discovery of data for audits, investigations, and litigation. Used by legal and business customers worldwide including more than 50 of the top 250 Am Law firms and many of the world’s leading corporations, CloudNine’s eDiscovery automation software and services help customers gain insight and intelligence on electronic data.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

The Enron Data Set is No Longer a Representative Test Data Set: eDiscovery Best Practices

If you attend any legal technology conference where eDiscovery software vendors are showing their latest software developments, you may have noticed the data that is used to illustrate features and capabilities by many of the vendors – it’s data from the old Enron investigation.  The Enron Data Set has remained the go-to data set for years as the best source of high-volume data to be used for demos and software testing.  And, it’s still good for software demos.  But, it’s no longer a representative test data set for testing processing – at least not as it’s constituted – and it hasn’t been for a good while.  Let’s see why.

But first, here’s a reminder of what the Enron Data Set is.

The data set is public domain data from Enron Corporation that originated from the Federal Energy Regulatory Commission (FERC) Enron Investigation (you can still access that information here).  The original data set consists of:

  • Email: Data consisting of 92% of Enron’s staff emails;
  • Scanned documents: 150,000 scanned pages in TIFF format of documents provided to FERC during the investigation, accompanied by OCR generated text of the images;
  • Transcripts: 40 transcripts related to the case.

Over eight years ago, EDRM created a Data Set project that took the email and generated PST files for each of the custodians (about 170 PST files for 151 custodians).  The result is roughly 34.5 GB in 153 zip files, probably two to three times that size unzipped (I haven’t recently unzipped it all to check the exact size).  The Enron emails were originally in Lotus Notes databases, so the PSTs created aren’t a perfect rendition of what they might look like if they originated in Outlook (for example, there are a lot of internal Exchange addresses vs. SMTP email addresses), but it still has been a really useful, good-sized collection for demo and test data.  EDRM has also since created some micro test data sets, which are good for specific test cases, but not high volume.

As people began to use the data, it became apparent that there was a lot of Personally Identifiable Information (PII) contained in the set, including social security numbers and credit card numbers (back in the late 90s and early 2000s, there was nowhere near as much concern about data privacy as there is today).  So, a couple of years later, EDRM partnered with NUIX to “clean” the data of PII and they removed thousands of emails with PII (though a number of people identified additional PII after that process was complete, so be careful).

If there are comparable high-volume public domain collections that are representative of a typical email collection for discovery, I haven’t seen them (and, believe me, I have looked).  Sure, you can get a high-volume dump of data from Wikipedia or other sites out there, but that’s not indicative of a typical eDiscovery data set.  If any of you out there know of any that are, I’m all ears.

Until then, the EDRM Enron Data Set remains the gold standard as the best high-volume example of a public domain email collection.  So, why isn’t it a good test data set anymore for processing?

Do you remember the days when Microsoft Outlook limited the size of a PST file to 2 GB?  Outlook 2002 and earlier versions limited the size of PST files to 2 GB.  Years ago, that was about the largest PST file we typically had to process in eDiscovery.  Since Outlook 2003, a new PST file format has been used, which supports Unicode and doesn’t have the 2 GB size limit any more.  Now, in discovery, we routinely see PST files that are 20, 30, 40 or even more GB in a single file.

What difference does it make?  The challenge today with large mailstore files like these (as well as large container files, including ZIP and forensic containers) is that single-threaded processes bog down on these large files and they can take a long time to process.  These days, to get through large files like these more quickly, you need multi-threaded processing capabilities and the ability to throw multiple agents at these files to get them processed in a fraction of the time.  As an example, we’ve seen processing throughput increased 400-600% with multi-threaded ingestion using CloudNine’s LAW product compared to single-threaded processes (a reduction of processing time from over 24 hours to a little over 4 hours in a recent example).  Large container files are very typical in eDiscovery collections today and many PST files we see today are anywhere from 10 GB to more than 50 GB in size.  They’re becoming the standard in most eDiscovery email collections.
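As a rough illustration of the “multiple agents” idea, here is a minimal Python sketch that hands several container files to a pool of worker threads.  It is not how LAW or any other product works internally: `process_container` is a hypothetical stand-in that merely streams each file and hashes it, and the sketch only parallelizes across files, whereas a real engine would also split the work inside a single large container.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_container(path):
    """Hypothetical stand-in for real processing: stream the file and return its MD5 digest."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MB chunks so a multi-gigabyte file never has to fit in memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return Path(path).name, digest.hexdigest()


def process_all(paths, workers=4):
    """Process several container files concurrently; returns {file name: digest}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_container, paths))
```

For scale, the reduction cited above, from roughly 24 hours to roughly 4, works out to about a sixfold speedup.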

As I mentioned, the Enron Data Set is 170 PST files over 151 custodians, with some of the larger custodians’ collections broken into multiple PST files (one custodian has 11 PST files in the collection).  But, none of them are over 2 GB in size (presumably Lotus Notes had a similar size limit back in the day) and most of them are less than 200 MB.  That’s not indicative of a typical eDiscovery email collection today and wouldn’t provide a representative speed test for processing purposes.

Can the Enron Data Set still be used to benchmark single-threaded vs. multi-threaded processes?  Yes, but not as it’s constituted – to do so, you have to combine them into larger PST files more representative of today’s collections.  We did that at CloudNine and came up with a 42 GB PST file that contains several hundred thousand de-duped emails and attachments.  You could certainly break that up into 2-4 smaller PST files to conduct a test of multiple PST files as well.  That provides a more typical eDiscovery collection in today’s terms – at least on a small scale.

So, when an eDiscovery vendor quotes processing throughput numbers for you, it’s important to know the types of files that they were processing to obtain those metrics.  If they were using the Enron Data Set as is, those numbers may not be as meaningful as you think.  And, if somebody out there is aware of a good, new, large-volume, public domain, “typical” eDiscovery collection with large mailstore and container files (not to mention content from mobile devices and other sources), please let me know!

So, what do you think?  Do you still rely on the Enron Data Set for demo and test data?  Please share any comments you might have or if you’d like to know more about a particular topic.


Process This! – Close Outlook Before Compressing or Zipping PST Files for Processing: eDiscovery Best Practices

Having recently experienced this with a client, I thought I would revisit this helpful tip.  This is one of the tips Tom O’Connor and I will be covering this Friday – E-Discovery Day – on our webcast Murphy’s eDiscovery Law: How to Keep What Could Go Wrong From Going Wrong at noon CST (1:00pm EST, 10:00am PST).  Click here to register for Friday’s webcast.

As you may know, at CloudNine (shameless plug warning!), we have an automated processing capability for enabling clients to load and process their own data – they can use this capability to load their data into our review platform.  They can even process and load data straight into Relativity using our Outpost for Relativity module.

Regardless of whether they load data into CloudNine or Relativity, most of our users are using the processing capability to process emails, usually from Outlook Personal Storage Table (PST) files.  Even with increased volumes of social media and other types of electronically stored information, emails are still predominant in eDiscovery.  And, for users trying to process and load that data, we get one issue more than any other when it comes to processing those Outlook emails:

They still have Outlook open with the PST file opened when they attempt to upload that PST file or when they try to create a ZIP file containing the Outlook PST.

When that happens, the resulting ZIP file that is created (either by the user or by our client application if the data is not already contained in an archive file) will almost invariably be corrupted or empty.  Either way, this will result in a failure during processing of the loaded data – because the data being processed will simply be corrupt.

This is true not only for CloudNine processing, but also for any application that you use for processing, such as LAW PreDiscovery.  So, before attempting to create a ZIP (or RAR or other type of archive) of a PST file (or before you upload it to a platform like CloudNine for processing), make sure that Outlook is closed or at least that the PST file is closed within Outlook.  That’s the best way to have a positive “outlook” to discovering emails.  Get it?  :o)
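If you want a quick sanity check before uploading, a few lines of Python can confirm that a ZIP at least opens, isn’t empty, and passes its own integrity checks.  This is a minimal sketch using Python’s standard `zipfile` module; the function name is made up, and it only tests the archive itself, so it cannot tell you whether the PST inside was captured in a consistent state.

```python
import zipfile


def archive_looks_healthy(zip_path):
    """Return True if the ZIP opens, has at least one member, and every member passes its CRC check."""
    if not zipfile.is_zipfile(zip_path):
        return False  # not a readable ZIP at all (for example, a zero-byte file)
    with zipfile.ZipFile(zip_path) as archive:
        if not archive.namelist():
            return False  # an empty archive is a red flag
        # testzip() reads every member and returns the name of the first bad one, or None.
        return archive.testzip() is None
```

A check like this catches the empty or obviously corrupted archive early, but closing Outlook first is still the real fix.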

So, what do you think?  Is email still the predominant source of discoverable ESI in your organization?  Please share any comments you might have or if you’d like to know more about a particular topic.


Brad Jenkins of CloudNine: eDiscovery Trends

This is the first of the 2017 LegalTech New York (LTNY) Thought Leader Interview series.  eDiscovery Daily interviewed several thought leaders at LTNY (aka LegalWeek) this year to get their observations regarding trends at the show and generally within the eDiscovery industry.

Today’s thought leader is Brad Jenkins of CloudNine™.  Brad has over 20 years of experience as an entrepreneur, as well as 15 years leading customer focused companies in the litigation technology arena. Brad also has authored several articles on document management and litigation support issues, and has appeared as a speaker before national audiences on document management practices and solutions.  He’s also my boss!  🙂

What are your observations about LTNY this year and how it compared to other LTNY shows that you have attended?

Once again, a majority of my time at LTNY was spent in meetings with colleagues and business partners as CloudNine had a suite and we had several meetings set up over the course of the three days of the show.  It seems that the meetings outside the show have become as big as the show itself.  Several people that I met with had hardly spent any time (if any) at the show when I met with them.  Because it’s the biggest conference of the year, LTNY provides a unique opportunity for face to face meetings you don’t get during the rest of the year, so it pays to take advantage of that opportunity.  Unfortunately, that comes at the expense of attending most of the conference itself.

I was able to attend some of the conference and spent a little time in the exhibit hall.  Based on what I saw, attendance seemed down this year and some of the exhibitors that I spoke with seemed to agree.  I assume the decision by ALM to charge a fee for the Exhibits Plus passes for the first time ever had an impact on attendance in the exhibit hall.  Not surprisingly, some criticized that decision, so it will be interesting to see if exhibitors push back on that and if ALM decides to charge that fee again next year.

Regardless, with so many opportunities for providers to reach prospects in a less expensive manner and with a market that clearly appears to be consolidating, I would expect that it will continue to be a challenge for ALM to retain exhibitors.  Over the past few years, the number of exhibitors has dropped and I wouldn’t be surprised to see that trend continue unless ALM gets creative in identifying new ways to attract potential exhibitors to the conference.

What about general industry trends?  Are there any notable trends that you’ve observed?

Last year, I noted a clear trend toward SaaS automation within eDiscovery and I think it’s clear that trend has not only continued, but expanded.  In addition to the investment in some automation providers, and the emergence of others like our company, CloudNine, we’ve seen several of the “big boys” (such as Ipro, Thomson Reuters and kCura) roll out their own cloud-based automation initiatives.  In the past year, we also saw organizations like Gartner acknowledge that cloud eDiscovery solutions are gaining momentum in the market due to their ease of use and competitive and straightforward pricing structures.  The move to the cloud for eDiscovery reflects a similar migration to the cloud within organizations for everything from SalesForce.com to Office 365.  In fact, Forbes.com recently published an article that reflected a prediction that, by 2020, 92% of everything we do will be in the cloud.  So, it makes sense that eDiscovery solutions would reflect that trend.

Another trend that has been happening for a few years and is certainly accelerating is the move to the left of the EDRM model for discovery and analytics.  With estimates of data doubling in organizations every 1.2 years, organizations are certainly having to turn to technology to address the challenges associated with that explosion of data.  The need for discovery is no longer initiated just by trigger events such as litigation or investigations – the need for organizations to perform discovery is a perpetual need.  You’re seeing organizations beginning to focus on data discovery to explore patterns and trends within unstructured data, even at the point of data creation, to gather insight into the data they have.  Then, when those trigger events occur, organizations are progressing into more traditional legal discovery to identify, preserve, collect, process, analyze, review and produce key ESI to support legal or investigative activities.  I think you’ll see that trend toward an increased focus on data discovery continue to accelerate as a way for organizations to address the challenges associated with the explosion of data in their environments.

One last trend that I’ll mention is the growing number of state bar associations that have adopted some sort of expectation or guidance for technology competence among their bar members.  I believe that there are 26 states now that have adopted some version of Comment 8 to ABA Model Rule 1.1 and Florida has become the first state to actually mandate technology CLE for its attorneys – three hours of technology CLE over a three-year period.  At CloudNine, we believe that educated clients make the best clients and we’ve tried to do our part for the past several years to help educate the legal profession with our blog and, this year, we are adding educational webcasts (with CLE certification in some states) to help educate lawyers.  While I think we still have a long way to go before the legal profession is generally knowledgeable about technology, I think the increased focus on technology competence along with the continued trend toward simplified discovery automation puts attorneys in a better position than ever to use technology to support their discovery needs.

What are you working on that you’d like our readers to know about?

In addition to the educational webcasts that we have started conducting this year, CloudNine recently announced our latest accomplishment in simplified discovery automation with our integration with Relativity that provides Relativity users with a client application that automates the upload, processing, and ingestion of ESI into Relativity, directly from their desktop.  Just as CloudNine users have been able to automate the upload, processing, and ingestion of ESI into CloudNine for several years now, the universe of more than 150,000 Relativity users will now be able to do the same.

We have several other new features and capabilities that provide simplified discovery automation capabilities to our clients that are also in the works and I look forward to having more information to share on those soon.

We are also very active in the data discovery space that I referred to earlier, providing solutions and assistance to help clients address their data discovery needs.  We’re finding that organizations’ need to gain insight into their data arises long before litigation and other events trigger the duty of those organizations, and CloudNine is at the forefront in helping organizations address their data discovery needs.

As I said during last year’s interview, we feel that CloudNine is the leader in simplifying discovery automation and our unique combination of Speed, Simplicity, Security and Services enables CloudNine to simplify discovery for our clients.  That continues to be our mission as a company and has been throughout our more than 14 years as a company assisting our clients.

Thanks, Brad, for participating in the interview!

And to the readers, as always, please share any comments you might have or if you’d like to know more about a particular topic!


Hashing Out the Idea of a Standard Hash Algorithm for Vendors: eDiscovery Best Practices

In a blog post earlier this month, Craig Ball discussed the question (which was posed at the recent ILTACON conference by Beth Patterson, Chief Legal & Technology Services Officer for Allens) of why eDiscovery service providers can’t (or don’t) standardize hash values so as to support identification and deduplication across products and collections.  Good question.  Let’s take a look.

In his post from his excellent Ball in Your Court blog (Cross-Matter & -Vendor Message ID), Craig noted that standardization would enable you to use work from one matter in another and flag emails already identified as privileged in one case so that they don’t slip through.  Wouldn’t that be great?

According to Craig, unfortunately, the panelists’ response to the question appeared to be to characterize it as “a big technical challenge.”

Craig then took a look at the issue, beginning by recapping some “hash facts” to establish a baseline for understanding considerations for computing hash values.  He then differentiated loose documents (easy, because as long as they are properly preserved, they should generate the same hash value consistently) from emails.  Emails are more difficult to construct consistent hash values for because the hash value of an email depends on when it is exported as well as other factors.  So, the same email exported at different times or from different email clients will have a different hash value – even though we see them as the same, the computer doesn’t.  Make sense?
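The loose-document case is easy to demonstrate.  The short Python sketch below (the sample text is invented for illustration) shows that identical bytes always produce the same MD5 value, while a single extra byte produces a completely different one, which is exactly why two exports of the “same” email rarely match.

```python
import hashlib


def md5_of_bytes(data):
    """Return the MD5 digest of a byte string as lowercase hex."""
    return hashlib.md5(data).hexdigest()


original = b"Quarterly report, final version."
exact_copy = bytes(original)
one_extra_space = b"Quarterly report,  final version."

# Identical bytes give an identical hash, no matter when or where it is computed.
print(md5_of_bytes(original) == md5_of_bytes(exact_copy))
# One added byte gives a completely different hash.
print(md5_of_bytes(original) == md5_of_bytes(one_extra_space))
```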

Craig also looked at some approaches for generating standardized hash values for emails, compared the MD5 and SHA-1 methods of hashing, and debunked the idea that MD5 hash values aren’t unique enough to be “defensible”.  There are 340,282,366,920,938,463,463,374,607,431,768,211,456 (that is, 2^128) possible MD5 hash values.  Unique enough for you?

I asked Bill David, Chief Technical Officer at CloudNine and architect of the platform, about the use of MD5 for generating hash values.

“Of these (and other) HASH routines, we ultimately chose MD5 for a couple of reasons”, Bill said. “First, for all practical purposes, MD5 Hash is sufficient for identifying duplicate files in a given population. Second, it’s faster than the alternatives. And third, it is widely available. You can find the MD5 Hash routine in all major computer languages as well as in most relational databases. This allows us to utilize and generate HASH values from a client’s browser all the way down the line to the relational databases used in a review platform.”

As for the idea of eDiscovery vendors agreeing to use the same routine to generate the same hash value, Bill seemed to think it was very doable and advocated a concatenation approach:

“As is commonly known, emails throw us a monkey wrench. Every email has some hidden data that is unique to that file. And as a result, we have to pick certain sections of a given email to construct a “string” of data, which we can then “HASH” to generate a unique value. But the slightest change in the format of the data affects the resulting unique hash. Something as simple as a single extra space will result in a completely different hash value.”

“What we have to do is to take the different parts of an email, combine them all together and hash the result. At CloudNine, we pull these parts of an email and separate them with a single space.

  • SentDate (in the ISO format)
  • From
  • To
  • CC
  • BCC
  • Subject
  • Attachments (file names separated by semi-colons)
  • MsgText (text version)”

Bill, while noting that these are his initial thoughts after reading Craig’s article and might be subject to some revision, suggested a way to “code” it, in this case using C# (C Sharp) programming language:

“The combination of these fields give us a unique finger print of an email. As an extra step in trying to normalize data it’s wise to ‘trim’ up these fields (remove any leading or trailing spaces). So in code it would look like this:”

hashString = String.Format("{0} {1} {2} {3} {4} {5} {6} {7}",
     args.file.SentDate.ToString("yyyy'-'MM'-'dd'T'HH':'mm':'ss"),   // ISO format, for example 2009-06-15T13:45:30
     args.file.From.Trim(),
     args.file.To.Trim(),
     args.file.CC.Trim(),
     args.file.BCC.Trim(),
     args.file.Subject.Trim(),
     args.file.Attachments.Trim(),
     args.file.MsgText.Trim());

“We now have a string to hash. The last step is to hash the string. Many MD5 hash routines will contain ‘dashes’. In one more step to normalize the results let’s remove those dashes and force all of the characters to lower case.”

hash = clsHash.GetHash(hashString, clsHash.HashType.MD5).Replace("-", "").ToLower();

“Based on my initial thoughts, that’s how you could standardize a hash value to use for deduping.”
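For readers who don’t work in C#, the same concatenation approach can be sketched in Python.  This is an illustration of the idea Bill describes, not CloudNine’s implementation: the function and parameter names are made up, and the UTF-8 encoding is an assumption, so the values it produces won’t necessarily match those from the C# routine above.

```python
import hashlib
from datetime import datetime


def email_dedupe_hash(sent, sender, to, cc, bcc, subject, attachments, body):
    """Build a normalized string from selected email fields and return its MD5 hex digest.

    `attachments` is a list of file names; they are joined with semi-colons.
    """
    parts = [
        sent.strftime("%Y-%m-%dT%H:%M:%S"),  # ISO format, e.g. 2009-06-15T13:45:30
        sender.strip(),
        to.strip(),
        cc.strip(),
        bcc.strip(),
        subject.strip(),
        ";".join(attachments).strip(),
        body.strip(),
    ]
    hash_string = " ".join(parts)  # fields separated by a single space
    # hexdigest() is already lowercase with no dashes; UTF-8 is an assumed encoding.
    return hashlib.md5(hash_string.encode("utf-8")).hexdigest()


sent = datetime(2009, 6, 15, 13, 45, 30)
first = email_dedupe_hash(sent, "a@example.com", "b@example.com", "", "", "Status",
                          ["report.docx"], "See attached.")
# Same email with stray leading/trailing whitespace, as a different export might produce.
second = email_dedupe_hash(sent, " a@example.com ", "b@example.com", "", "", "Status ",
                           ["report.docx"], "See attached.\n")
print(first == second)
```

Trimming only handles leading and trailing whitespace; differences inside a field (an extra space in the middle of the subject, a different address format) would still yield different hashes, which is why every vendor would need to agree on the exact same normalization rules.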

Sounds like standardization on a method for generating hash values could be relatively straightforward – if you can get all the vendors to agree.

So, what do you think?  Would you benefit from a standardized method for computing hash values across all eDiscovery platforms?  Please share any comments you might have or if you’d like to know more about a particular topic.


Here’s a New Twist to Text Overlays on Image-Only PDF Files That Can Be Even More Problematic: eDiscovery Best Practices

Remember when we discussed the issue of text overlays on image-only PDF files (typically represented as Bates numbers) and the problems they cause?  Well, we found a variation to the issue that is even more of a problem.

Here’s a recap of the issue we identified a couple of years ago.  The client was using the Discovery Client that allows clients to upload their own native data for automated processing and loading into new or existing projects in our CloudNine platform.  The collection was purported to consist mostly of image-only PDF files, which is one way to create PDF files (click back to the old post for more info on both ways to do so).

Like many processing tools, such as LAW PreDiscovery®, CloudNine was programmed back then to handle PDF files by extracting the text if present or, if not, performing OCR on the files to capture text from the image.  Text from the file is always preferable to OCR text because it’s a lot more accurate, so this is why OCR is typically only performed on the PDF files lacking text.

After the client loaded their data, we did a spot quality control check (like we always do) and discovered that the text for several of the documents only consisted of Bates numbers.

Why?

Because the Bates numbers were added as text overlays to the pre-existing image-only PDF files.  When the processing software viewed the file, it found that there was extractable text, so it extracted that text instead of OCRing the PDF file.  In effect, adding the Bates numbers as text overlays to the image-only PDF rendered it as no longer an image-only PDF.

As a result of this issue a couple of years ago, we added logic to the processing engine of CloudNine to perform OCR if there is minimal text per page (to account for the scenarios where there is only a Bates number).  Therefore, the content portion of the text would still be captured, so it would be available for indexing and searching.  Problem solved, right?
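The minimal-text check described above might look roughly like the sketch below.  This is not CloudNine's actual logic; the function name `needs_ocr` and the 50-characters-per-page threshold are assumptions made purely for illustration, and the input is assumed to be a list of already-extracted text strings, one per page.

```python
# Hypothetical threshold: pages averaging fewer characters than this are
# treated as "image-only with a stamp" and sent to OCR. The value is an
# assumption for illustration, not a documented setting.
MIN_CHARS_PER_PAGE = 50


def needs_ocr(page_texts, min_chars_per_page=MIN_CHARS_PER_PAGE):
    """Return True when the extracted text is too sparse to trust.

    page_texts: list of strings, one per page, as extracted from the PDF.
    """
    if not page_texts:
        # No extractable text at all: the classic image-only PDF.
        return True
    # Count non-whitespace characters so padding does not inflate the total.
    total_chars = sum(len("".join(text.split())) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page
```

With this sketch, a two-page file whose only extracted text is a Bates number on each page falls well under the threshold and gets OCRed, while a file with a full page of extracted text does not.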

For the most part, yes.  Until a couple of weeks ago, when we ran into the situation again on a few PDF files.  Again, these files only generated the Bates numbers during processing.  What made them different?

Ever hear of a watermark?  These documents were stamped DRAFT via a light gray watermark on the PDF file.  Then, they were Bates stamped with the Adobe Acrobat Bates Numbering functionality.

Evidently, because of the watermark, the document image and the text overlaid Bates number were on separate levels of the PDF.  The processing tool failed to pick up the text because it essentially couldn’t find it.  Our production team ultimately had to re-generate the PDF files (by printing them back to PDF) and then OCR them.  That’s one reason why it’s good to have a team in place – to handle anomalies like this when they occur.

As we noted a couple of years ago, if you haven’t applied Bates numbers on the files yet (or have a backup of the files before they were applied – highly recommended) and they haven’t been produced, you should process the files before putting Bates numbers on the images to ensure that you capture the most text available.  And, if opposing counsel will be producing any image-only PDF files, you will want to request the text as well (along with a load file) so that you can maximize your ability to search their production.  Doing so will save you additional processing charges.

Of course, your first choice should be to receive native format productions whenever possible – here’s a link to an excellent guide on that subject.

So, what do you think?  Have you dealt with image-only PDF files with text overlaid Bates numbers?  Please share any comments you might have or if you’d like to know more about a particular topic.


Here is Where You Can Catch Last Week’s ACEDS Webinar: eDiscovery Trends

Our webinar panel discussion conducted by ACEDS last week was highly attended, well reviewed and generated some interesting discussion (more on that soon).  Were you unable to attend last week’s webinar?  Good news, we have it for you here, on demand, whenever you want to check it out.

The webinar panel discussion, titled How Automation is Revolutionizing eDiscovery, was sponsored by CloudNine.  Our panel discussion provided an overview of eDiscovery automation technologies and we took a hard look at the technology and definition of TAR and potential limitations associated with both.  Mary Mack, Executive Director of ACEDS, moderated the discussion and I was one of the panelists, along with Bill Dimm, CEO of Hot Neuron, and Bill Speros, Evidence Consulting Attorney with Speros & Associates, LLC.

Thanks to our friends at ACEDS for presenting the webinar and to Bill Dimm and Bill Speros for participating in an interesting and thought-provoking discussion.  Hope you enjoy the presentation!

So, what do you think?  Do you think automation is revolutionizing eDiscovery?  As always, please share any comments you might have or if you’d like to know more about a particular topic.

Happy Anniversary to my wife (and the love of my life), Paige!  I’m very lucky to be married to such a wonderful woman!


The Cloud is a “Rush” Project’s Best Friend: eDiscovery Best Practices

Today is Friday.  While many of you can look forward to a long, enjoyable Memorial Day weekend, chances are that at least a few of you will be making weekend plans when, late in the day, you will receive a CD, DVD, hard drive or link to data on a server somewhere that needs to be reviewed over the weekend.  There goes your weekend!

Not only that, good luck connecting with your in-house litigation support person or a vendor for assistance late on a Friday – you may play a game of “phone tag” or wait for email responses for a bit.  Lit support people and vendors have weekend plans too.  Even if you do get in touch with them, you then have to fill out a form and arrange to get the data to them, which can be tricky.  It’s a lot of time, hassle and cost to get started – especially if you’re at a small law firm that doesn’t already have an eDiscovery software application to support processing and review of the data.

When consumers quickly need to find that special item to buy, or that new cool song to download, or need to stream the new season of Bloodline (available starting today on Netflix) for binge watching, they turn to the cloud.  More than ever, attorneys are turning to the cloud as well to help them get their “rush” project started immediately.  And, you don’t even have to own the software or interact with anyone to get started.

As an eDiscovery provider that offers a no-risk free trial, CloudNine (shameless plug warning!) sees at least one or two clients a week that give our software a try (many of them with “rush” projects just like this).  The trend toward automation and the cloud in the industry has not only made eDiscovery more affordable than ever, it has also made it easier than ever to get a “rush” project off and running.

If you find yourself in that situation later today, here are three easy steps to get started:

  1. Sign up for a free account here. You will receive an email with your credentials (including temporary password), to get started.
  2. When you first log in, you’ll see a button to “Upload Data”. That will take you to a form to download the CloudNine Discovery client (which is a Windows based client application that resides on your desktop) for uploading data for processing.  Download and install the client to upload data.
  3. Once the client is downloaded and installed, launch the client, log in with your newly created credentials and simply follow the wizard prompts to upload the desired data set and put it into the project of your choice (which you can create if it doesn’t already exist). It’s that easy!

We can’t get you out of working this weekend.  But, we can take the hassle out of getting started.  You’re welcome.  :o)

So, what do you think?  Have you been faced with any “rush” eDiscovery projects lately?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Daily will return on Tuesday as we remember this Memorial Day the people who gave their lives while serving in our armed forces.


Brad Jenkins of CloudNine: eDiscovery Trends

This is the first of the 2016 LegalTech New York (LTNY) Thought Leader Interview series.  eDiscovery Daily interviewed several thought leaders at LTNY this year to get their observations regarding trends at the show and generally within the eDiscovery industry.  Unlike previous years, some of the questions posed to each thought leader were tailored to their position in the industry, so we have dispensed with the standard questions we normally ask all thought leaders.

Today’s thought leader is Brad Jenkins of CloudNine™.  Brad has over 20 years of experience as an entrepreneur, as well as 15 years leading customer focused companies in the litigation technology arena. Brad also has authored several articles on document management and litigation support issues, and has appeared as a speaker before national audiences on document management practices and solutions.  He’s also my boss!  :o)

What are your general observations about LTNY this year and how it compared to other LTNY shows that you have attended?

Again this year, LTNY seemed reasonably well attended.  Thankfully, we didn’t have the weather and travel issues that we had the past few years, so that probably helped boost attendance.  And, the Hilton Lobby Lounge was back this year, so that provided an additional location to meet, though most of our meetings were in our suite.  Though I was really busy and didn’t get much chance to attend sessions, I understand that they were very good as always.  I did notice a drop in the number of exhibitors again this year and the exhibit hall did seem to be less crowded.  One colleague of mine who exhibited indicated that the number of leads he received at the show dropped about 30 percent from last year, so that’s consistent with my own observations and those of my colleagues.

For me, LTNY has become as much about the meetings with colleagues and business partners as it is about the show itself.  CloudNine had meetings practically booked throughout the show, with various people including industry analysts, partners and potential partners and clients and prospects.  Because it is the biggest show of the year, most people in the industry attend, so it’s an ideal opportunity to meet face to face and move business relationships along further.  Sometimes, there is just no substitute for in-person meetings to further business relationships and to communicate your message to other business colleagues.

What about general industry trends?  Are there any notable trends that you’ve observed?

Certainly one trend that I have noticed, as others have as well, is the accelerated consolidation within the provider community in our industry and the growth of investment by outside venture capital firms.  Just in the past couple of months, we have seen Huron Legal acquired by Consilio (which received a major investment from Shamrock Capital Advisors a few months before that), Millnet acquired by Advanced Discovery, Orange Legal acquired by Xact Data Discovery and Kiersted Systems acquired by OmniVere.  Rob Robinson does a terrific job of tracking mergers, acquisitions and investments in our industry and, according to his list, there have been eleven significant acquisitions and investments in just the past three months!

Another noticeable trend in the industry is the clear trend toward automation within eDiscovery.  You wrote about it earlier this year and, like you, I believe that the age of automation is here.  Some have dismissed the term “automation” as a marketing term, but I can’t think of a better term to describe the transformation of tasks that used to require a high degree of manual intervention and supervision to a point where little, if any, human involvement is necessary.  We’ve seen it for years through automation of review with technology assisted review techniques such as clustering and predictive coding and we have begun to see use of some artificial intelligence techniques on the information governance side.  Now, we are seeing automation of the processing of data to get it into a review platform and cloud-based providers (including CloudNine) automating that process.

Having been in the legal technology industry for many years, I have really seen an evolution of technology offerings in the marketplace.  At the beginning, I saw applications that were originally developed for other purposes being adapted for eDiscovery and those solutions were incomplete.  As the market developed, there started to be applications that were specifically designed for eDiscovery and those solutions were an improvement, but they were designed for isolated processes, such as collection or processing or review, with no automation of tasks.  The next generation of solutions were designed for eDiscovery and designed for task integration, but still adapted for task automation – some of those are the most popular solutions in the market today.  The new solutions – the “fourth generation” technology offerings – are not only designed for eDiscovery and designed for task integration, they’re designed for task automation as well.

Many people say that if you want to tell where an industry is heading, follow the money.  In the past several months, you’ve seen providers like Logikcull and Everlaw that emphasize automation receive significant capital investments and, just before LTNY, you saw Thomson Reuters announce a new platform where automated processing is a key component.  It’s clear that big money is being invested in the growing automation sector of the industry.  You can get on the bus, or you can get run over by the bus.  As a provider that has been committed to simplified eDiscovery automation for several years now, CloudNine is on the bus and we feel that we have an excellent “seat” on that bus and are well positioned to help usher eDiscovery into the automation age.

What are you working on that you’d like our readers to know about?

Well, since I was just talking about fourth generation technology solutions, it seems appropriate to discuss how CloudNine has gotten to the point where we are in that evolution.  About 3 1/2 years ago at CloudNine, we looked at our legacy platform that had been in place since the early 2000s and was on version 14.  Our clients were happy with the platform overall, but we realized that if we were going to stay competitive as the market evolved, our legacy platform wasn’t going to be able to support those future needs.  So, we made the decision to almost completely start from scratch and re-develop our platform from the ground up, using the latest technology with an eye toward a truly simplified eDiscovery automation approach.  The platform that you see today via the user interface is just the tip of the iceberg of the overall solution – behind it is a series of workflows to accomplish various tasks.  For example, there are 34 distinct workflows (our CTO and co-founder Bill David calls them “cascading buckets” that enable the workflows to scale) just in our Discovery Client application that enables clients to upload and process data into our CloudNine review platform.  This modularized approach of putting together re-usable workflows enables us to both scale and adapt as needed to meet changing client needs and positions us well for the future.

We feel that CloudNine is the leader in simplifying eDiscovery automation.  We do this through what we call the 4 S’s: Speed, Simplicity, Security and Services.  Clients, even brand new clients, can be up and running in five minutes (Speed) through their ability to sign up for their own account and upload and process their own data.  We recently had a brand new client who signed up for an account, uploaded and processed 27 GB of Outlook PST files (which amounted to over 300,000 emails and attachments) and culled out nearly two-thirds of the collection via hash deduplication and irrelevant domain culling – all within 24 hours without ever having to speak to a CloudNine representative!  The ease of use (Simplicity) of the platform through the wizard-based client application for uploading data and a browser independent review module enables our clients to get up to speed with no more than an hour of training required.

Our approach to Security is unique as well – we operate within a protected cloud, not a public cloud, where the clients know that their data will be located on our servers in a Tier IV data center that is located 5 minutes from our offices.  This data center hosts data for nine of the top Fortune 20 corporations and was instrumental in us being selected over a year ago by a Fortune 150 corporation to host their data.  Finally, what makes us unique are the Services that we provide to support the software and automation – in addition to the software that we provide to help automate the eDiscovery process, we also provide managed services ranging from forensic collection to data conversion to technical advice and eDiscovery best practices and managed document review.  This enables our clients to rely on one provider for all of their services needs – as opposed to software-only providers that would have to outsource those services to a third party.

We believe that the combination of Speed, Simplicity, Security and Services enables CloudNine to provide the simplified eDiscovery automation approach that our clients want.  It’s an exciting time in our industry and CloudNine is excited to be at the forefront of its continued evolution, as we have been for the last 13 years!

Thanks, Brad, for participating in the interview!

And to the readers, as always, please share any comments you might have or if you’d like to know more about a particular topic!


For a Positive Outlook to Discovering Emails, You Need a Closed Outlook: eDiscovery Best Practices

Does that statement seem confusing?  Let me explain.

Let’s call this a “tip of the day”.  As you may know, at CloudNine (shameless plug warning!), we have an automated processing capability for enabling clients to load and process their own data – they can use this capability to load their data into our review platform or they can even process data for loading into their own preferred review platform if they want.  So, we can still help you even if you already use Relativity or a number of other popular platforms.

Regardless of that fact, most of our users are using the processing capability to process emails, usually from Outlook Personal Storage Table (PST) files.  Let’s face it, despite increased volumes of social media and other types of electronically stored information, emails are still predominant in eDiscovery.  And, for those users, we get one issue more than any other when it comes to processing those Outlook emails:

They still have Outlook open with the PST file opened when they attempt to upload that PST file or when they try to create a ZIP file containing the Outlook PST.

The resulting ZIP file that is created (either by the user or by our client application if the data is not already contained in an archive file) will almost invariably be corrupted or empty.  Either way, this results in a failure during processing of the loaded data, because that data is simply corrupt.

So, my tip of the day is this: Before attempting to create a ZIP (or RAR or other type of archive) of a PST file (or before you upload it to a platform like CloudNine for processing), make sure that Outlook is closed or at least that the PST file is closed within Outlook.  For a positive outlook to discovering emails, you need a closed Outlook.
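If you want a quick sanity check on an archive before uploading it, something like the sketch below can catch the "corrupted or empty" cases described above.  It uses only Python's standard `zipfile` module; the function name `archive_problems` is made up for this illustration and is not part of any CloudNine tool.  Note that a clean result only means the ZIP is internally consistent; it does not prove the PST inside was captured while Outlook was closed, so closing Outlook first is still the real fix.

```python
import zipfile


def archive_problems(zip_path):
    """Return a list of problems found in a ZIP archive.

    An empty list means the archive is readable, non-empty, and every
    member passed its CRC check.
    """
    if not zipfile.is_zipfile(zip_path):
        return ["not a readable ZIP archive"]

    problems = []
    with zipfile.ZipFile(zip_path) as zf:
        members = zf.infolist()
        if not members:
            problems.append("archive is empty")
        for info in members:
            # A zero-byte PST (or any zero-byte file) is a red flag.
            if not info.is_dir() and info.file_size == 0:
                problems.append(f"{info.filename} is zero bytes")
        # testzip() reads every member and returns the first one whose
        # CRC or header is bad, or None if all members check out.
        bad_member = zf.testzip()
        if bad_member is not None:
            problems.append(f"{bad_member} failed its CRC check")
    return problems
```

Calling `archive_problems("custodian_mail.zip")` (a hypothetical file name) and getting back an empty list is a reasonable signal that the archive is at least intact before you spend time uploading it.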

Does that make sense now?  :o)

So, what do you think?  Is email still the predominant source of discoverable ESI in your organization?  Please share any comments you might have or if you’d like to know more about a particular topic.
