If Your Documents Are Not Logical, Discovery Won’t Be Either – eDiscovery Best Practices

Scanning may no longer be cool, but it’s still necessary.  Electronic discovery still typically includes a paper component.  When it comes to paper, how documents are identified is critical to how useful they will be.  Here’s an example.

Your client collects hard copy documents from various custodians related to the case and organizes them into folders.  In one of the folders is a one page fax cover sheet attached to a two page letter, as well as an unrelated report and four different contracts, each 15-20 pages.  The entire folder is scanned as a single document, as either a TIFF or PDF file.

Only the letter is retrieved in a search as responsive to the case.  But, because it is contained within a single scanned file containing 70 to 80 other pages, you wind up reviewing 70 to 80 unrelated pages that you would not otherwise have to review.  It complicates production, as well – how do you produce partial “documents”?  Also, if the non-responsive report and contracts have duplicates in the collection, you can’t effectively de-dupe them out of the review population because they’re combined with responsive material.

It happens more often than you think.  It also can happen – sometimes quite often – with the scanned documents that the other side produces to you.  So, how do you get the documents into a more logical and usable organization?

Logical Document Determination (or LDD) is a process that some eDiscovery providers offer (including – shameless plug warning! – CloudNine Discovery).  It’s a process where each image page in a scanned document set is reviewed and the “logical document breaks” (i.e., each page that starts a new document) are identified.  Then, the documents are re-assembled based on those logical document breaks.

Once the documents are logically organized, other processes – like Optical Character Recognition (OCR) and clustering (including near-duplicate identification) – can then be performed at the appropriate document level, and the smaller, more precisely unitized documents can be indexed for searching.  Instead of reviewing a 70 to 80 page “document” comprised of several logical documents, your search will retrieve the two page letter that is actually responsive, making your review and production processes more efficient.
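To make the near-duplicate identification step concrete, one common approach is to compare the OCR text of each unitized document against the others using a similarity measure such as Jaccard similarity over word shingles.  Here is a minimal sketch, assuming a shingle size and similarity threshold that are illustrative only (actual eDiscovery tools use more sophisticated and scalable methods):

```python
def shingles(text, k=3):
    """Build the set of k-word shingles for a document's text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(docs, threshold=0.8, k=3):
    """Return pairs of document ids whose texts overlap above the threshold.

    docs: mapping of document id -> extracted (OCR) text.
    """
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    ids = sorted(sets)
    pairs = []
    for i, d1 in enumerate(ids):
        for d2 in ids[i + 1:]:
            if jaccard(sets[d1], sets[d2]) >= threshold:
                pairs.append((d1, d2))
    return pairs
```

The point of the sketch is simply that this comparison only works once documents are correctly unitized – if several logical documents are glued into one scanned file, their combined text won’t match the text of any individual duplicate elsewhere in the collection.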

LDD is typically priced on a per-page basis for the pages reviewed for logical document breaks – prices can vary depending on the volume of pages to be reviewed and where the work is being performed (there are providers in the US and overseas).  While it’s a manual process, it’s well worth it if your collection of imaged documents is poorly unitized.

So, what do you think? Have you ever received a collection of poorly organized image files? If so, did you use Logical Document Determination to organize them properly?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Tom O’Connor of Gulf Coast Legal Technology Center – eDiscovery Trends

This is the ninth of the 2014 LegalTech New York (LTNY) Thought Leader Interview series.  eDiscoveryDaily interviewed several thought leaders after LTNY this year (don’t get us started) and generally asked each of them the following questions:

  1. What significant eDiscovery trends did you see at LTNY this year and what do you see for 2014?
  2. With new amendments to discovery provisions of the Federal Rules of Civil Procedure now in the comment phase, do you see those being approved this year and what do you see as the impact of those Rules changes?
  3. It seems that, despite numerous resources in the industry, most attorneys still don’t know a lot about eDiscovery.  Do you agree with that and, if so, what do you think can be done to improve the situation?
  4. What are you working on that you’d like our readers to know about?

Today’s thought leader is Tom O’Connor.  Tom is a nationally known consultant, speaker and writer in the area of computerized litigation support systems.  A frequent lecturer on the subject of legal technology, Tom has been on the faculty of numerous national CLE providers and has taught college level courses on legal technology.  Tom’s involvement with large cases led him to become familiar with dozens of software applications for litigation support, and he has both designed databases and trained legal staffs in their use on many of those cases.  This work has involved both public and private law firms of all sizes across the nation.  Tom is the Director of the Gulf Coast Legal Technology Center in New Orleans.

What significant eDiscovery trends did you see at LTNY this year and what do you see for 2014?

In my opinion, LegalTech has become a real car show.  There are just too many vendors on the show floor, all saying they do the same thing.  Someone at the show tallied it up and determined that 38% of the exhibitors were eDiscovery vendors.  And, that’s just the dedicated eDiscovery vendors – there are other companies like Lexis, who do other things, but half of their booth was focused on eDiscovery.  The show has sections of the booths down one long hall with sales people standing in front of each section and it’s like “running the gauntlet” when you walk by them.  It’s a bit overwhelming.

Having said that, a lot of people were still getting stuff done, but they were doing so in the suites either at the hotel or across the street.  I saw a lot of good B-to-B activities off the sales floor and I think you can get more done with the leads that you get if you can get them off the sales floor in a more sane environment.  At the same time, if you’re not at the show, people question you.  They’ll say “hey, what happened to the wombat company?”  So, being at the show still helps, at least with name recognition.

One trend that has been going on for a while is that “everybody under the sun” is doing eDiscovery or says that they’re doing eDiscovery.  The phenomenal growth of the number of eDiscovery vendors of all sizes surprises me.  We see headlines about providers getting bought out and some companies acquiring other companies, but it seems like every time one gets acquired, two more take its place.  That surprised me as I expected to see more stratification, but did not.  Not that buyouts aren’t occurring, but there’s just so much growth in the space that the number of players is not shrinking.

Another trend that I noticed, which puzzled me until I walked around the show and took a good look at the products, is the entry of companies like IBM and Xerox into the eDiscovery space.  The trend is to get more throughput in processing.  Our data sets are getting so big.  A terabyte is just not that unusual anymore.  Two to five terabytes is becoming typical in large cases, and 500 GB to 1 terabyte is becoming more common, even in small cases.  Being able to process 5 to 10 GB an hour isn’t cutting it anymore and I saw more pressure on vendors to process up to a terabyte (or even more) per day.  So, it makes sense that companies like IBM and Xerox are going to get into the big data space for corporate clients because they’re already there and they have the horsepower.  So, I see the industry focused on different ways to speed up ingestion and processing of data.

That has been accompanied by another trend: pricing pressures.  Providers are starting to offer deals like $20 per GB all in, with hosting, processing, review, unlimited users, etc.  At the other end of the spectrum from companies like IBM and Xerox are small technology companies, coming not from legal but from a very high-end technology background, looking to apply their technology skills in the eDiscovery space and offering deeply discounted prices.  I’ve seen a lot of that and we started to see it last year, with providers beginning to offer project pricing and getting away from a per GB pricing model.  I think we’re going to see more and more of that as the year goes along.  I hesitate to use the word “commoditized” because I don’t think it is.  It’s not like scanning – every eDiscovery job is different in the types of files you have and what you want to accomplish.  But, there will certainly be a big push to lower pricing from what we’ve seen over the last one to three years, and I think you’re going to see some pretty dramatic price cuts with pressure from new players coming into the market and increased competition.

With new amendments to discovery provisions of the Federal Rules of Civil Procedure now in the comment phase, do you see those being approved this year and what do you see as the impact of those Rules changes?

I’ve been astonished that, after the first wave of comments last fall, there has been little or no public comment or even discussion in the media about the rules changes.  The public comment period closes tomorrow (Tom was interviewed on February 14) and you know the saying “March comes in like a lion and goes out like a lamb”?  That seems to be how it is with the end of the comment period.  I think I saw one article mentioning the fact that the comments were closing this week.  It has been a surprising non-issue to me.

For that reason, I think the rules changes will go through.  I don’t think there has been a concerted effort to speak out against them.  As I understand it, the rules won’t be enacted until 2016 because they still have to go back to the committee and through Congress and the Supreme Court.  It’s a really lengthy period which allows for intervention at a number of different steps.  But, I haven’t seen any concerted effort mounted to talk against them, though Judge Scheindlin has been quite adamant in her comments.  My personal feeling is that we didn’t need the new rules.  I think they benefit the corporate defense world and change some standards.  Craig Ball pointed out in a column last year that they don’t even address the issue of metadata, which is problematic.  I don’t think we needed the rules changes, quite frankly.  And, I wrote a column about that last year.  In a world where I hear commentators and judges say that 90% of the attorneys that appear in front of them still don’t understand ESI or how things work, clearly if they don’t understand the current rules, why do we need rules changes?  Let’s get people up to speed on what they’re supposed to be doing now before we worry about fine-tuning it.  I understand the motivation of the people who are pushing to get them enacted and why they wanted them, and I suspect they will pretty much go through as written.

It seems that, despite numerous resources in the industry, most attorneys still don’t know a lot about eDiscovery.  Do you agree with that and, if so, what do you think can be done to improve the situation?

I absolutely agree with that.  I think the obvious remedy is to educate them where lawyers get educated, which is in law schools and I think the law schools have been negligent, if not grossly negligent, in addressing that issue.  Browning Marean and I went around to the different law schools to try to get them to sponsor a clinic or educational program in this area eight or nine years ago and were rebuffed.  Even to this day, though there are some individuals that are teaching classes at individual law schools, with the exception of a new program at Northeastern, there has been no curriculum devoted to technology as part of the regular law school curriculum.

Even the programs that have sprung up: the wonderful job that Craig Ball and Judge Facciola do at Georgetown Law School is sponsored by the CLE department, not the law school itself.  Michael Arkfeld has a great program that he conducts for three days down at the Sandra Day O’Connor College of Law at Arizona State University (covered on the blog here).  But, it’s a three day program – not a course, not a curriculum.  It’s not a focus in the curriculum of the actual law school itself.  We’ve had “grass roots” efforts spring up with Craig’s and Michael’s programs, what Ralph Losey and his son Adam have been doing, as well as a number of people at the local level with CLE programs.  But, the fact is that lawyers get educated in law schools and, if you really want to solve this, you make it part of the curriculum at law schools.

There has always been an attitude on the part of law schools.  As Browning and I were told by the dean of a top flight law school several years ago, “we train architects, not carpenters”.  I myself was referred to, face-to-face, by a group of law professors as a “tradesman”.  They said “Gee, Tom, this proposal is a great idea, but why would we trust the education of our students to a tradesman like you?”  There’s this sort of disdainful academic outlook on anything that involves the hands-on use of computers and that’s got to change.  Judge Rosenthal said that “we have to change the paradigm” on how we handle things.  Lawyers and judges alike have to look at things differently and all of us need to adjust how we look at the world today.  Because it’s not just a legal issue, it’s a social issue.  Society has changed how it manufactures, creates and stores information/data/documents.  Other professional areas have caught onto that and legal education has really lagged behind.

I mentioned the eDiscovery Institute at Georgetown Law School, which happens every June.  But, they cap attendance at about 60.  Do the math: there are about a million lawyers in the country and, if you’re only going to educate 60 per year, you’ll never get there.  I also think that bar associations could be much more forthright about education in this area and about requiring it.  Judicial pressure is having the best results – judges are requiring some sort of certification of competence in this area.  I know of several Federal judges who require the parties to state for the record that they’re qualified to address eDiscovery.  Some of the pilot projects that have sprung up, like the one at the University of Chicago, are going to require a self-certifying affidavit of competence stating that you’re qualified to talk about these issues.  Judges are expecting lawyers, regardless of how they learn it, to know what they’re talking about with regard to technology and not to waste the court’s time.

What are you working on that you’d like our readers to know about?

I just recently published a new guide on Technolawyer, titled LitigationWorld Quick Start Guide to Mastering Ediscovery (and covered on this blog here).  There are a lot of beginner’s guides to eDiscovery, but this one doesn’t really focus on eDiscovery, it focuses on technology, answering questions like:  How do computers work?  What are bits, bytes and RAM?  What’s a gigabyte?  What’s a terabyte?

I literally had a discussion about an hour ago with a client for whom we have a big case going on in Federal court, with a large production – over a terabyte – being processed by our opponents in the case right now.  I asked the client how much paper he thought that was and he had no idea.  The next time we start arguing cost in front of the judge, I’m going to bring in a chart that says a gigabyte is X number of pages of paper so that it has some meaning to them.  So, I think it’s really important to explain these basic concepts, and we in the technology world often forget how little many lawyers know about technology.  So, the guide is designed to talk about how electronic media stores data, how that data is retrieved, and some of the common terms and phrases used in the physical construction and workings of a computer.  Before you even start talking about eDiscovery, you need an understanding of how computers work, how they find data and where data can reside.  We throw around terms like “slack space” and “metadata” casually without realizing that not everyone understands those terms.  This guide is meant to address that knowledge gap.
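A chart like the one Tom describes is straightforward to sketch.  The pages-per-gigabyte figures below are common industry rules of thumb, not numbers from the interview or the guide – actual yields vary widely by file type, so treat them strictly as illustrative assumptions:

```python
# Rough printed-page equivalents per gigabyte, by file type.
# These are illustrative rules of thumb only; real yields vary widely.
PAGES_PER_GB = {
    "email": 100_000,
    "word": 65_000,
    "pdf_image": 15_000,
}

def estimated_pages(gigabytes, file_type="email"):
    """Estimate the printed-page equivalent of a data volume."""
    return int(gigabytes * PAGES_PER_GB[file_type])

# One terabyte (1,024 GB) of email, by this rule of thumb:
print(f"{estimated_pages(1024, 'email'):,} pages")
```

Even as a rough estimate, translating “a terabyte” into “over a hundred million pages” is exactly the kind of concrete framing that gives the number meaning in a cost argument before a judge.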

I’m continuing some of my case work, of course.  Lastly, I recently joined a company called Cavo, which is bringing a new eDiscovery product to market that I’m excited about.  Busy as always!  And, of course, there are always good things going on in New Orleans!

Thanks, Tom, for participating in the interview!

And to the readers, as always, please share any comments you might have or if you’d like to know more about a particular topic!


Are You Scared Yet? – eDiscovery Horrors!

Today is Halloween.  Every year at this time, because (after all) we’re an eDiscovery blog, we try to “scare” you with tales of eDiscovery horrors.  So, I have one question: Are you scared yet?

Did you know that over 3.4 sextillion bytes have been created in the Digital Universe since the beginning of the year, and that data in the world will grow nearly three times as much from 2012 to 2017?  How do you handle your own growing universe of data?

What about this?

The proposed blended hourly rate was $402 for firm associates and $632 for firm partners. However, the firm asked for contract attorney hourly rates as high as $550 with a blended rate of $466.

How about this?

You’ve got an employee suing her ex-employer for discrimination, hostile work environment and being forced to resign. During discovery, it was determined that a key email was deleted due to the employer’s routine auto-delete policy, so the plaintiff filed a motion for sanctions. Sound familiar? Yep. Was her motion granted? Nope.

Or maybe this?

After identifying custodians relevant to the case and collecting files from each, you’ve collected roughly 100 gigabytes (GB) of Microsoft Outlook email PST files and loose electronic files from the custodians. You identify a vendor to process the files to load into a review tool, so that you can perform review and produce the files to opposing counsel. After processing, the vendor sends you a bill – and they’ve charged you to process over 200 GB!!
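The billing surprise above typically comes from expansion: PST files are compressed containers, and extracting the individual messages and attachments inside them can roughly double the billable volume.  A quick sanity check you might run before engaging a vendor – the 2x expansion factor and the per-GB rate here are illustrative assumptions, since actual ratios depend on the data mix:

```python
def estimated_processing_cost(collected_gb, rate_per_gb, expansion_factor=2.0):
    """Estimate charges when a vendor bills on expanded (post-extraction) volume.

    expansion_factor is an assumption; compressed containers like PSTs
    commonly expand during processing, but actual ratios vary.
    """
    expanded_gb = collected_gb * expansion_factor
    return expanded_gb, expanded_gb * rate_per_gb

# 100 GB of collected PSTs at a hypothetical $100/GB processing rate:
expanded, cost = estimated_processing_cost(100, 100)
print(f"{expanded:.0f} GB billed, ${cost:,.0f}")
```

The practical takeaway: ask up front whether the vendor bills on collected or expanded volume, and build the expansion factor into your budget either way.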

Scary, huh?  If the possibility of exponential data growth, vendors holding data hostage and billable review rates of $466 per hour keep you awake at night, then the folks at eDiscovery Daily will do our best to provide useful information and best practices to enable you to relax and sleep soundly, even on Halloween!

Then again, if the expense, difficulty and risk of processing and loading up to 100 GB of data into an eDiscovery review application that you’ve never used before terrifies you, maybe you should check this out.

Of course, if you seriously want to get into the spirit of Halloween, click here.  This will really terrify you!

What do you think?  Is there a particular eDiscovery issue that scares you?  Please share your comments and let us know if you’d like more information on a particular topic.

Happy Halloween!


Plaintiffs’ Supreme Effort to Recuse Judge Peck in Da Silva Moore Denied – eDiscovery Case Law

As we discussed back in July, attorneys representing lead plaintiff Monique Da Silva Moore and five other employees filed a petition for a writ of certiorari with the US Supreme Court arguing that New York Magistrate Judge Andrew Peck, who approved an eDiscovery protocol agreed to by the parties that included predictive coding technology, should have recused himself given his previous public statements expressing strong support of predictive coding.  Earlier this month, on October 7, that petition was denied by the Supreme Court.

Da Silva Moore and her co-plaintiffs had argued in the petition that the Second Circuit Court of Appeals was too deferential to Judge Peck when denying the plaintiffs’ petition to recuse him, asking the Supreme Court to order the Second Circuit to use the less deferential “de novo” standard.

The plaintiffs have now been denied in their recusal efforts in four courts.  Here is the link to the Supreme Court docket item, referencing denial of the petition.

This battle over predictive coding and Judge Peck’s participation has continued for over 18 months.  For those who may have not been following the case or may be new to the blog, here’s a recap.

Last year, back in February, Judge Peck issued an opinion that made this likely the first case to accept the use of computer-assisted review of electronically stored information (“ESI”).  However, on March 13, District Court Judge Andrew L. Carter, Jr. granted the plaintiffs’ request to submit additional briefing on their February 22 objections to the ruling.  In that briefing (filed on March 26), the plaintiffs claimed that the protocol approved for predictive coding “risks failing to capture a staggering 65% of the relevant documents in this case” and questioned Judge Peck’s relationship with defense counsel and with the selected vendor for the case, Recommind.

Then, on April 5, 2012, Judge Peck issued an order in response to Plaintiffs’ letter requesting his recusal, directing plaintiffs to indicate whether they would file a formal motion for recusal or ask the Court to consider the letter as the motion.  On April 13, (Friday the 13th, that is), the plaintiffs did just that, by formally requesting the recusal of Judge Peck (the defendants issued a response in opposition on April 30).  But, on April 25, Judge Carter issued an opinion and order in the case, upholding Judge Peck’s opinion approving computer-assisted review.

Not done, the plaintiffs filed an objection on May 9 to Judge Peck’s rejection of their request to stay discovery pending the resolution of outstanding motions and objections (including the recusal motion, which had yet to be ruled on at that point).  Then, on May 14, Judge Peck issued a stay, stopping defendant MSLGroup’s production of electronically stored information.  On June 15, in a 56 page opinion and order, Judge Peck denied the plaintiffs’ motion for recusal.  Judge Carter ruled on the plaintiffs’ recusal request on November 7 of last year, denying the request and stating that “Judge Peck’s decision accepting computer-assisted review … was not influenced by bias, nor did it create any appearance of bias”.

The plaintiffs then filed a petition for a writ of mandamus with the US Court of Appeals for the Second Circuit, which was denied this past April, leading to their petition for a writ of certiorari with the US Supreme Court, which has now also been denied.

So, what do you think?  Will we finally move on to the merits of the case?  Please share any comments you might have or if you’d like to know more about a particular topic.


For Successful Discovery, Think Backwards – eDiscovery Best Practices

The Electronic Discovery Reference Model (EDRM) has become the standard model for the workflow of the process for handling electronically stored information (ESI) in discovery.  But, to succeed in discovery, regardless of whether you’re the producing party or the receiving party, it might be helpful to think about the EDRM model backwards.

Why think backwards?

You can’t achieve a successful outcome without first envisioning the outcome that you want.  The end of the discovery process includes the production and presentation stages, so it’s important to determine what you want to get out of those stages.  Let’s look at them.

Presentation

As a receiving party, it’s important to think about what types of evidence you need to support your case when presenting at depositions and at trial – this is the type of information that needs to be included in your production requests at the beginning of the case.

Production

The format of the ESI produced is important to both sides in the case.  For the receiving party, it’s important to get as much useful information included in the production as possible.  This includes metadata and searchable text for the produced documents, typically with an index or load file to facilitate loading into a review application.  The most useful form of production is native format files with all metadata preserved as used in the normal course of business.

For the producing party, it’s important to save costs, so it’s important to agree to a production format that minimizes production costs.  Converting files to an image based format (such as TIFF) adds costs, so producing in native format can be cost effective for the producing party as well.  It’s also important to determine how to handle issues such as privilege logs and redaction of privileged or confidential information.

Addressing production format issues up front will maximize cost savings and enable each party to get what they want out of the production of ESI.

Processing-Review-Analysis

It also pays to make decisions early in the process that affect processing, review and analysis.  How should exception files be handled?  What do you do about files that are infected with malware?  These are examples of issues that need to be decided up front to determine how processing will be handled.

As for review, the review tool being used may impact production specs in terms of how files are viewed and production of load files that are compatible with the review tool, among other considerations.  As for analysis, surely you test search terms to determine their effectiveness before you agree on those terms with opposing counsel, right?

Preservation-Collection-Identification

Long before you have to conduct preservation and collection for a case, you need to establish procedures for implementing and monitoring litigation holds, as well as prepare a data map to identify where corporate information is stored for identification, preservation and collection purposes.

As you can see, at the beginning of a case (and even before), it’s important to think backwards within the EDRM model to ensure a successful discovery process.  Decisions made at the beginning of the case affect the success of those latter stages, so don’t forget to think backwards!

So, what do you think?  What do you do at the beginning of a case to ensure success at the end?   Please share any comments you might have or if you’d like to know more about a particular topic.

P.S. — Notice anything different about the EDRM graphic?


eDiscovery Daily is Three Years Old!

We’ve always been free, now we are three!

It’s hard to believe that it was three years ago today that we launched the eDiscoveryDaily blog.  We’re past the “terrible twos” and heading towards pre-school.  Before you know it, we’ll be ready to take our driver’s test!

We have seen traffic on our site (from our first three months of existence to our most recent three months) grow an amazing 575%!  Our subscriber base has grown over 50% in the last year alone!  Back in June, we hit over 200,000 visits on the site and now we have over 236,000!

We continue to appreciate the interest you’ve shown in the topics and will do our best to continue to provide interesting and useful posts about eDiscovery trends, best practices and case law.  That’s what this blog is all about.  And, in each post, we like to ask for you to “please share any comments you might have or if you’d like to know more about a particular topic”, so we encourage you to do so to make this blog even more useful.

We also want to thank the blogs and publications that have linked to our posts and raised our public awareness, including Pinhawk, Ride the Lightning, Litigation Support Guru, Complex Discovery, Bryan College, The Electronic Discovery Reading Room, Litigation Support Today, Alltop, ABA Journal, Litigation Support Blog.com, Litigation Support Technology & News, InfoGovernance Engagement Area, EDD Blog Online, eDiscovery Journal, Learn About E-Discovery, e-Discovery Team ® and any other publication that has picked up at least one of our posts for reference (sorry if I missed any!).  We really appreciate it!

As many of you know by now, we like to take a look back every six months at some of the important stories and topics during that time.  So, here are some posts over the last six months you may have missed.  Enjoy!

Rodney Dangerfield might put it this way – “I Tell Ya, Information Governance Gets No Respect”

Is it Time to Ditch the Per Hour Model for Document Review?  Here’s some food for thought.

Is it Possible for a File to be Modified Before it is Created?  Maybe, but here are some mechanisms for avoiding that scenario (here, here, here, here, here and here).  Best of all, they’re free.

Did you know changes to the Federal eDiscovery Rules are coming?  Here’s some more information.

Count Minnesota and Kansas among the states that are also making changes to support eDiscovery.

By the way, since the Electronic Discovery Reference Model (EDRM) annual meeting back in May, several EDRM projects (Metrics, Jobs, Data Set and the new Native Files project) have already announced new deliverables and/or requested feedback.

When it comes to electronically stored information (ESI), ensuring proper chain of custody tracking is an important part of handling that ESI through the eDiscovery process.

Do you self-collect?  Don’t Forget to Check for Image Only Files!

The Files are Already Electronic, How Hard Can They Be to Load?  A sound process makes it easier.

When you remove a virus from your collection, does it violate your discovery agreement?

Do you think that you’ve read everything there is to read on Technology Assisted Review?  If you missed anything, it’s probably here.

Consider using a “SWOT” analysis or Decision Tree for better eDiscovery planning.

If you’re an eDiscovery professional, here is what you need to know about litigation.

BTW, eDiscovery Daily has had 242 posts related to eDiscovery Case Law since the blog began!  Forty-four of them have been in the last six months.

Our battle cry for next September?  “Four more years!”  🙂

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Data Needs to Be Converted More Often than You Think – eDiscovery Best Practices

We’ve discussed previously that electronic files aren’t necessarily ready to review just because they’re electronic.  They often need processing and good processing requires a sound process.  Sometimes that process includes data conversion if the data isn’t in the most useful format.

Case in point: I recently worked with a client that received a multi-part production from the other side (via another party involved in the litigation, per agreement between the parties) that included image files, OCR text files and metadata.  The files that my client received were produced over several months to several other parties in the litigation.  The production contained numerous emails, each of which (of course) included an email sent date.  Can you guess which format the email sent date was provided in?  Here are some choices (using today’s date and 1:00 PM as an example):

  • 09/03/2013 13:00:00
  • 9/03/2013 1:00 PM
  • September 3, 2013 1:00 PM
  • Sep-03-2013 1:00 PM
  • 2013/09/03 13:00:00

The answer: all of them.

Because there were several productions to different parties with (apparently) different format agreements, my client didn’t have the option to request that the data be reproduced in a standard format.  Not only that, the name of the produced metadata field wasn’t consistent between productions – in about 15 percent of the documents the producing party named the field email_date_sent; in the rest it was named date_sent.

Ever try to sort emails chronologically when they’re not only in different formats, but also in two different fields?  It’s impossible.  Fortunately, at CloudNine Discovery, there is no shortage of computer “geeks” to address problems like this (I’m admittedly one of them).

So we had to get the dates into a single format in a single field.  We used SQL queries to consolidate the data into one field, then string functions and regular expressions to re-parse dates that didn’t fit a standard SQL date format.  For example, the date 2013/09/03 was re-parsed into 09/03/2013.
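As an illustration, here’s a minimal sketch in Python of that kind of re-parsing logic, using the five formats from the example above.  The two field names match the production described here, but everything else – the function, the target format – is a sketch of the approach, not the actual SQL-based process we used:

```python
import re
from datetime import datetime

# The five date formats observed in the production (from the example above).
FORMATS = [
    "%m/%d/%Y %H:%M:%S",   # 09/03/2013 13:00:00
    "%m/%d/%Y %I:%M %p",   # 9/03/2013 1:00 PM
    "%B %d, %Y %I:%M %p",  # September 3, 2013 1:00 PM
    "%b-%d-%Y %I:%M %p",   # Sep-03-2013 1:00 PM
    "%Y/%m/%d %H:%M:%S",   # 2013/09/03 13:00:00
]

def normalize_sent_date(record):
    """Coalesce the two field names into one, then re-parse the value
    into a single standard format (MM/DD/YYYY HH:MM:SS, 24-hour)."""
    raw = record.get("email_date_sent") or record.get("date_sent")
    if not raw:
        return None
    raw = re.sub(r"\s+", " ", raw.strip())  # collapse stray whitespace
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue
    return None  # leave unparseable values for manual review
```

Whatever the tooling, the key design point is the same: try each known format in turn, and route anything that still won’t parse to manual review rather than guessing.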

Getting the dates into a standard format in a single field not only enabled us to sort the emails chronologically by date sent, it also enabled us (in combination with other standard email metadata fields) to identify duplicates in the collection.  Since the data was produced as images and OCR text rather than native files, hash algorithms weren’t a viable option for de-duplication.
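A simplified sketch of that metadata-based de-duplication might look like the following – the field names here are illustrative, not the actual production fields, and a real matter would pick its matching fields by agreement:

```python
import hashlib

def metadata_dedupe_key(doc):
    """Build a de-dupe key from standard email metadata fields.

    With an image-and-OCR production there are no native files to hash,
    so duplicates are identified by matching metadata instead.
    """
    parts = [
        doc.get("sent_date", ""),          # already normalized to one format
        doc.get("from", "").lower().strip(),
        doc.get("to", "").lower().strip(),
        doc.get("subject", "").strip(),
    ]
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

def find_duplicates(docs):
    """Group documents that share the same metadata key."""
    groups = {}
    for doc in docs:
        groups.setdefault(metadata_dedupe_key(doc), []).append(doc)
    return [g for g in groups.values() if len(g) > 1]
```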

Over the years, I’ve seen many examples where data (either from our side or the other side) needs to be converted.  It happens more than you think.  When that happens, it’s good to have a computer “geek” on your side to address the problem.

So, what do you think?  Have you encountered data conversion issues in your cases?   Please share any comments you might have or if you’d like to know more about a particular topic.

How Big is Your ESI Collection, Really? – eDiscovery Best Practices

When I was at ILTA last week, this topic came up in a discussion with a colleague during the show, so I thought it would be good to revisit here.

After identifying custodians relevant to the case and collecting files from each, you’ve collected roughly 100 gigabytes (GB) of Microsoft Outlook email PST files and loose electronic files from the custodians.  You identify a vendor to process the files to load into a review tool, so that you can perform review and produce the files to opposing counsel.  After processing, the vendor sends you a bill – and they’ve charged you to process over 200 GB!!  Are they trying to overbill you?

Yes and no.

Many of the files in most ESI collections are stored in what are known as “archive” or “container” files.  For example, while Outlook emails can be stored in different file formats, they are typically collected from each custodian and saved in a personal storage (.PST) file format – a container file that holds many messages and attachments.  The scanned size of the PST file is its size on disk.

Did you ever see one of those vacuum bags that you store clothes in and then suck all the air out so that the clothes won’t take up as much space?  The PST file is like one of those vacuum bags – it often stores the emails and attachments in a compressed format to save space.  There are other types of archive container files that compress their contents – .ZIP and .RAR files are two examples.  These files are used not only to compress files for storage on hard drives, but also to compact or group a set of files when transmitting them, often via email.  With email comprising a major portion of most ESI collections and the popularity of other archive container files for compressing file collections, the expanded size of your collection may be considerably larger than it appears when stored on disk.

When PST, ZIP, RAR or other compressed file formats are processed for loading into a review tool, they are expanded to their normal size.  This expanded size can be 1.5 to 2 times larger than the scanned size (or more).  And, that’s what some vendors will bill processing on – the expanded size.  In those cases, you won’t know what the processing costs will be until processing is complete and the data is fully expanded.
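You can see the scanned-versus-expanded difference for yourself with any compressed container.  This short Python sketch compares a ZIP file’s size on disk (the “scanned” size) with the total uncompressed size of its contents (the “expanded” size):

```python
import os
import zipfile

def scanned_vs_expanded(zip_path):
    """Compare a ZIP's on-disk ('scanned') size with the total
    uncompressed ('expanded') size of the files inside it."""
    scanned = os.path.getsize(zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        # file_size is the uncompressed size of each entry
        expanded = sum(info.file_size for info in zf.infolist())
    return scanned, expanded
```

Run it against a ZIP full of office documents and you’ll typically see the expanded figure come back well above the scanned one – which is exactly the gap that shows up on a processing bill priced on expanded size.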

It’s important to be prepared for that and know your options when processing that data.  Make sure your vendor selection criteria includes questions about how processing is billed, on the scanned or expanded size.  Some vendors (like the company I work for, CloudNine Discovery), do bill based on the scanned size of the collection for processing, so shop around to make sure you’re getting the best deal from your vendor.

So, what do you think?  Have you ever been surprised by processing costs of your ESI?   Please share any comments you might have or if you’d like to know more about a particular topic.

A Technical Explanation of Near-Dupes – eDiscovery Tutorial

Bill Dimm provides a comprehensive and interesting description of near-dupes and the algorithms used to identify them in his Clustify blog (What is a near-dupe, really?).  If you want to understand the “three reasonable, but different, ways of defining the near-dupe similarity between two documents”, bring your brain and check it out.

As we discussed last month, just because information volume in most organizations doubles every 18-24 months doesn’t mean that it’s all original.  When reviewers review the same data again and again, the review becomes unnecessarily expensive and prone to mistakes.

As Bill notes in his post, “Near-duplicates are documents that are nearly, but not exactly, the same.  They could be different revisions of a memo where a few typos were fixed or a few sentences were added.  They could be an original email and a reply that quotes the original and adds a few sentences.  They could be a Microsoft Word document and a printout of the same document that was scanned and OCRed with a few words not matching due to OCR errors.”  I also classify examples such as a Word document published to an Adobe PDF file (where the content is the same, but the file format is different, so the hash value will be different) as near-duplicates because they won’t be de-duped with an MD5 or SHA-1 hash algorithm at the file level.  You need an algorithm that looks for similarity in the document content.
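A tiny sketch shows why file-level hashing misses these.  Below, the same text is stored in two different formats (plain text and HTML, standing in for a Word document and its PDF); the file-level hashes differ because the bytes differ, but hashing the normalized extracted text matches.  This is illustrative only – real text extraction from Word or PDF is considerably more involved:

```python
import hashlib
import re

def file_hash(data: bytes) -> str:
    """File-level MD5: any byte difference yields a different hash."""
    return hashlib.md5(data).hexdigest()

def content_hash(text: str) -> str:
    """MD5 of normalized extracted text, ignoring format and whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same letter stored two ways: the bytes differ, the words don't.
as_text = b"Dear counsel, please find the contract attached."
as_html = b"<p>Dear counsel, please find the\ncontract attached.</p>"

# Crude tag-stripping as a stand-in for real text extraction.
extracted_html = re.sub(rb"<[^>]+>", b"", as_html).decode("utf-8")
```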

Identifying near-duplicates that contain almost the same information reduces redundant review and saves costs.  A recent client of mine had over 800,000 emails belonging to near-duplicate groupings that would have been impossible to identify without an effective algorithm to group them together.

Bill’s blog post goes on to discuss different methods for measuring similarity using mechanisms like a Jaccard index and a MinHash algorithm which counts shingles (don’t worry, they’re neither painful nor scaly).  Understanding how your near-dupe software works is important.  As Bill notes, “If misunderstandings about how the algorithm works cause the similarity values generated by the software to be higher than you expected when you chose the similarity threshold, you risk tagging near-dupes of non-responsive documents incorrectly (grouped documents are not as similar as you expected).  If the similarity values are lower than you expected when you chose the threshold, you risk failing to group some highly similar documents together, which leads to less efficient review (extra groups to review).”  His post is an excellent primer to developing that understanding.
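For the curious, here’s a bare-bones sketch of one of those similarity measures: word shingles compared with a Jaccard index.  (Production near-dupe tools use techniques like MinHash to approximate this efficiently at scale – this shows the idea, not how Clustify or any other product actually works.)

```python
def shingles(text, k=3):
    """Break text into its set of overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b, k=3):
    """Jaccard index: size of the intersection of the two shingle sets
    divided by the size of their union (1.0 = identical, 0.0 = disjoint)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Two revisions of a memo that differ by a word or two share most of their shingles and score near 1.0; unrelated documents share almost none and score near 0.0.  The similarity threshold you choose against scores like these is exactly the setting Bill warns you to understand.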

So, what do you think?  Do you have a plan for handling near-duplicates in your collection?   Please share any comments you might have or if you’d like to know more about a particular topic.

I Removed a Virus, Did I Just Violate My Discovery Agreement? – eDiscovery Best Practices

As we discussed last month, working with electronic files in a review tool is NOT simply a matter of loading the files and getting started.  Electronic files are diverse and can present a whole collection of issues to address in order to process them for loading, and processing them effectively requires a sound process.  But what if the evidentiary files you collect from your custodians contain viruses or other malware?

It’s common to refer to all types of malware as “viruses”, but a computer virus is only one type of malware.  Malware includes computer viruses, worms, trojan horses, spyware, dishonest adware, scareware, crimeware, most rootkits, and other malicious and unwanted software or programs.  A report from 2008 stated that more malicious code and other unwanted programs were being created than legitimate software applications.  If you’ve ever had to attempt to remove files from an infected computer, you’ve seen just how prolific different types of malware can be.

Having worked with a lot of clients who don’t understand why it can take time to get ESI processed and loaded into their review platform, I’ve had to spend some time educating those clients as to the various processes required (including those we discussed last month).  Before any of those processes can happen, you must first scan the files for viruses and other malware that may be infecting those files.  If malware is found in any files, one of two things must happen:

  • Attempt to remove the malware with virus protection software, or
  • Isolate and log the infected files as exceptions (which you will also have to do if the virus protection software fails to remove the malware).

So, let’s get started, right?  Not so fast.

While it may seem logical that the malware should always be removed, doing so is technically altering the file.  It’s important to address how malware should be handled as part of the Rule 26(f) “meet and confer” conference, so neither party can be accused of spoliating data when removing malware from potentially discoverable files.  If both sides agree that malware removal is acceptable, there still needs to be a provision to handle files for which malware removal attempts fail (i.e., exception logs).  Regardless, the malware needs to be addressed so that it doesn’t affect the entire collection.
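To illustrate that triage workflow (scan, attempt removal where the parties have agreed to it, then isolate and log failures as exceptions), here’s a simplified Python sketch.  The scan and clean functions are placeholders for whatever antivirus engine a project actually uses – the point is the isolate-and-log pattern, not a real AV API:

```python
import csv
import shutil
from pathlib import Path

def triage_collection(paths, scan, attempt_clean, quarantine_dir, log_path):
    """Scan each collected file before processing.  Infected files are
    cleaned if possible; files that can't be cleaned are isolated in a
    quarantine folder and written to an exception log.

    scan(path) returns a threat name or None; attempt_clean(path) returns
    True on success.  Both are placeholders for a real antivirus engine.
    """
    quarantine = Path(quarantine_dir)
    quarantine.mkdir(parents=True, exist_ok=True)
    clean_files, exceptions = [], []
    for path in map(Path, paths):
        threat = scan(path)
        if threat is None:
            clean_files.append(path)
            continue
        if attempt_clean(path) and scan(path) is None:
            clean_files.append(path)  # malware removed; the file was altered
            continue
        shutil.move(str(path), str(quarantine / path.name))  # isolate
        exceptions.append((path.name, threat))
    with open(log_path, "w", newline="") as f:  # the exception log
        writer = csv.writer(f)
        writer.writerow(["file", "threat"])
        writer.writerows(exceptions)
    return clean_files, exceptions
```

Note the comment on the cleaning branch: even a successful removal alters the file, which is exactly why the handling needs to be agreed upon at the meet and confer rather than decided unilaterally.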

By the way, malware can hit anybody, as I learned (the hard way) a couple of years ago.

So, what do you think?  How do you handle malware in your negotiations with opposing counsel and in your ESI collections?   Please share any comments you might have or if you’d like to know more about a particular topic.
