Processing

eDiscovery Best Practices: Your ESI Collection May Be Larger Than You Think

 

Here’s a sample scenario: You identify custodians relevant to the case and collect files from each.  Roughly 100 gigabytes (GB) of Microsoft Outlook email PST files and loose “efiles” are collected in total from the custodians.  You identify a vendor to process the files to load into a review tool, so that you can perform first pass review and, eventually, linear review and produce the files to opposing counsel.  After processing, the vendor sends you a bill – and they’ve charged you to process over 200 GB!  What happened?

Did the vendor accidentally “double-bill” you?  That would be great – but no.  There’s a much more logical explanation and, unfortunately, you may wind up paying a lot more to process these files than you expected.

Many of the files in most ESI collections are stored in what are known as “archive” or “container” files.  For example, as noted above, Outlook emails are typically saved for each custodian in a personal storage (.PST) file format, which is an expanding container file. For most custodians, all of their email (and the corresponding attachments, if present) resides in a few PST files.  The scanned size for the PST file is the size of the file on disk.

Did you ever see one of those vacuum bags that you store clothes in and then suck all the air out so that the clothes won’t take as much space?  The PST file is like one of those vacuum bags – it typically stores the emails and attachments in a compressed format to save space.  When the emails and attachments are processed into a review tool, they are expanded into their normal size.  This expanded size can be 1.5 to 2 times larger than the scanned size (or more).  And, that’s what many vendors will bill on – the expanded size.

There are other types of archive container files that compress the contents – .zip and .rar files are two examples of compressed container files.  These files are used not only to compress files for storage on hard drives, but also to compact or group a set of files when transmitting them, usually in – you guessed it – email.  With email comprising a majority of most ESI collections and the popularity of other archive container files for compressing file collections, the expanded size of your collection may be considerably larger than it appears when stored on disk.  It’s important to be prepared for that and know your options when processing that data, so you can effectively anticipate those processing costs.
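
To make the math concrete, here’s a minimal Python sketch (the “./collection” folder is hypothetical, and only generic .zip archives are expanded – PST files would require a specialized library) that compares the scanned, on-disk size of a collection to an estimate of its expanded size using the standard zipfile module:

import os
import zipfile

def expansion_estimate(root):
    """Compare on-disk (compressed) size to estimated expanded size for a collection."""
    on_disk = expanded = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            on_disk += size
            if name.lower().endswith(".zip"):
                try:
                    with zipfile.ZipFile(path) as zf:
                        # Sum the uncompressed size of every member in the archive
                        expanded += sum(info.file_size for info in zf.infolist())
                except zipfile.BadZipFile:
                    expanded += size  # corrupt archive: count it at its on-disk size
            else:
                expanded += size  # loose files don't expand
    return on_disk, expanded

disk, full = expansion_estimate("./collection")  # hypothetical collection folder
print(f"Scanned size: {disk / 2**30:.1f} GB, expanded size: {full / 2**30:.1f} GB")

Run against a collection heavy in container files, the expanded figure is the one that more closely tracks what many vendors will bill.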

So, what do you think?  Have you ever been surprised by processing costs of your ESI?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Trends: Tom O’Connor of Gulf Coast Legal Technology Center

 

This is the eighth of the LegalTech New York (LTNY) Thought Leader Interview series.  eDiscoveryDaily interviewed several thought leaders at LTNY this year and asked each of them the same three questions:

  1. What do you consider to be the current significant trends in eDiscovery on which people in the industry are, or should be, focused?
  2. Which of those trends are evident here at LTNY, which are not being talked about enough, and/or what are your general observations about LTNY this year?
  3. What are you working on that you’d like our readers to know about?

Today’s thought leader is Tom O’Connor.  Tom is a nationally known consultant, speaker and writer in the area of computerized litigation support systems.  A frequent lecturer on the subject of legal technology, Tom has been on the faculty of numerous national CLE providers and has taught college level courses on legal technology.  Tom's involvement with large cases led him to become familiar with dozens of software applications for litigation support, and he has both designed databases and trained legal staffs in their use on many of those cases.  This work has involved both public and private law firms of all sizes across the nation.  Tom is the Director of the Gulf Coast Legal Technology Center in New Orleans.

What do you consider to be the current significant trends in eDiscovery on which people in the industry are, or should be, focused?

I think that there is still a lack of general baseline understanding of, not just eDiscovery principles, but technology principles.  Attorneys have been coming to LegalTech for over 30 years and have seen people like Michael Arkfeld, Browning Marean and folks like Neil Aresty, who got me started in the business.  The nouns have changed, from DOS to Windows, from paper to images, and now it’s eDiscovery.  The attorneys just haven’t been paying attention.  Bottom line is: for years and years, they didn’t care about technology.  They didn’t learn it in law school because a) they had no inclination to learn technology and b) they didn’t have any real ability to learn it, myself included.  With the exception of a few people like Craig Ball and George Socha, who are versed in the technical side of things – the average attorney is not versed at all.  So, the technology side of the litigation world consisted of the lit support people, the senior paralegals, the support staff and the IT people (to the minimal extent they assisted in litigation).  That all changed when the Federal Civil Rules changed, and it became a requirement.

So, if I pick up a piece of paper here and ten years ago used this as an exhibit, would the judge say “Hey, counsel, that’s quite a printout you have there, is that a Sans Serif font?  Is that 14 point or 15 point?  Did you print this on an IBM 3436?”  Of course not.  The judge would authenticate it and admit it – or not – and there might be an argument.  Now, when we go to introduce evidence, there are all sorts of questions that are technical in nature – “Where did you get that PST file?  How did that email get generated?  Did you run HASH values on that?”, etc.  And, I’m not just making this up.  If you look at decisions by Judge Grimm or Facciola or Peck or Waxse, they’re asking these questions.  Attorneys, of course, have been caught like the “deer in the headlights” in response to those questions and now they’re trying to pick up that knowledge.  If there’s one real trend I’m seeing this year, it’s that attorneys are finally taking technology seriously and trying to play catch up with their staff on understanding what all of this stuff is about.  Judges are irritated about it.  We have had major sanctions because of it.  And, if they had been paying attention for the last ten years, we wouldn’t be in the mess that we are now.

Of course, some people disagree and think that the sheer volume of data that we have is contributing to that, and folks like Ralph Losey, whom I respect, think we should tweak the rules to change what’s relevant.  It shouldn’t be anything that reasonably could lead to something of value in the case; we should “ratchet it down” so that the volume is reduced.  My feeling on that is that we’ve got the technology tools to reduce the volume – if they’re used properly.  The tools are better now than they were three years ago, but we’ve had the tools to do that for a while.  There’s no reason for these wholesale “data dumps” that we see, and I forget if it was either Judge Grimm or Facciola who had a case where in his opinion he said “we’ve got to stop with these boilerplate requests for discovery and responses for requests for discovery and make them specific”.

So, that’s the trend I see, that lawyers are finally trying to take some time to try to get up to speed – whining and screaming pitifully all the way about how it’s not fair, and the sanctions are too high and there’s too much data.  Get a life, get a grip.  Use the tools that are out there that have been given to you for years.  So, if I sound cynical, it’s because I am.

Which of those trends are evident here at LTNY, which are not being talked about enough, and/or what are your general observations about LTNY this year?

{Interviewed on the final afternoon of LTNY}  Well, as always, a good show.  This year, I think it was a great show, which is actually a bit of a surprise to me.  I was worried, not that it would go down from last year, but that we had maybe flattened out because of the economy (and the weather).  But, the turnout was great, the exhibit halls were great, a lot of good information.  I think we’re seeing a couple of trends from vendors in general, especially in the eDiscovery space.  We’re seeing vendors trying to consolidate.  I think attorneys who work in this space are concerned with moving large amounts of data from one stage of the EDRM model to another.  That’s problematic, because of the time and energy involved, the possible hazards involved and even authentication issues involved.  So, the response to that is that some vendors attempt to do “end-to-end” or at least do three out of the six stages and reduce the movement or partner with each other with open APIs and transparent calls, so that process is easier.

At the same time, we’re seeing the process become faster and more efficient, with increased speeds for ingestion and processing, which is great.  Maybe a bigger trend, and one that will play out as the year goes along, is a change in the pricing model, clearly getting away from per GB pricing to some other alternative such as, maybe, per case or per matter.  Because of the huge amount of data we have, we need to do so.  But also, we’re leaving out an area that Craig Ball addressed last year with his EDna challenge – what about the low end of the spectrum?  This is great if you’re Pillsbury or DLA Piper or Fulbright & Jaworski – they can afford Clearwell or Catalyst or Relativity and can afford to call in KPMG or Deloitte.  But, what about the smaller cases?  They can benefit from technology as well.  Craig addressed it with his EDna challenge for the $1,000 case and asked people to respond within those parameters.  Browning Marean and I were asking “what about the $500,000 case?”  Not that there’s anything bad about low end technology, you can use Adobe and S1 and some simple databases to do a great job.  But, what about in the middle, where I still can’t afford to buy Relativity and I still can’t afford to process with Clearwell?  What am I going to use?  And, that’s where I think new pricing and some of the new products will address that.  I’ve seen some hot new products, especially cloud-based products, for small firms.  That’s a big change for this year’s show, which, since it’s in New York, has been geared to big firms and big cases.

What are you working on that you’d like our readers to know about?

I think the things that excite me the most that are going on this year are the educational efforts I’m involved in.  They include Ralph Losey’s online educational series through his blog, eDiscovery Team, and Craig Ball’s eDiscovery Training Academy at Georgetown Law School in June.  Both are very exciting.

And, my organization, the Gulf Coast Legal Technology Center, continues to do a lot of CLE and pro bono activities for the Mississippi and Louisiana bars, which still consist primarily of small firms.  We also continue to assist Gulf Coast firms with technology needs as they continue to rebuild their legal technology infrastructure after Katrina.

Thanks, Tom, for participating in the interview!

And to the readers, as always, please share any comments you might have or if you’d like to know more about a particular topic!

eDiscovery Trends: George Socha of Socha Consulting

 

This is the seventh of the LegalTech New York (LTNY) Thought Leader Interview series.  eDiscoveryDaily interviewed several thought leaders at LTNY this year and asked each of them the same three questions:

  1. What do you consider to be the current significant trends in eDiscovery on which people in the industry are, or should be, focused?
  2. Which of those trends are evident here at LTNY, which are not being talked about enough, and/or what are your general observations about LTNY this year?
  3. What are you working on that you’d like our readers to know about?

Today’s thought leader is George Socha.  A litigator for 16 years, George is President of Socha Consulting LLC, offering services as an electronic discovery expert witness, special master and advisor to corporations, law firms and their clients, and legal vertical market software and service providers in the areas of electronic discovery and automated litigation support. George has also been co-author of the leading survey on the electronic discovery market, The Socha-Gelbmann Electronic Discovery Survey.  In 2005, he and Tom Gelbmann launched the Electronic Discovery Reference Model project to establish standards within the eDiscovery industry – today, the EDRM model has become a standard in the industry for the eDiscovery life cycle and there are eight active projects with over 300 members from 81 participating organizations. George has a J.D. from Cornell Law School and a B.A. from the University of Wisconsin – Madison.

What do you consider to be the current significant trends in eDiscovery on which people in the industry are, or should be, focused?

On the very “flip” side, the number one trend to date in 2011 is predictions about trends in 2011.  They are part of a consistent and long-term pattern, which is that many of these trend predictions are not trend predictions at all – they are marketing material and the prediction is “you will buy my product or service in the coming year”.

That said, there are a couple of things of note.  Since I understand you talked to Tom about Apersee, it’s worth noting that corporations are struggling with working through a list of providers to find out who provides what services.  You would figure that there is somewhere in the range of 500 or so total providers.  But, my ever-growing list, which includes both external and law firm providers, is at more than 1,200.  Of course, some of those are probably not around anymore, but I am confident that there are at least 200-300 that I do not yet have on the list.  My guess when the list shakes out is that there are roughly 1,100 active providers out there today.  If you look at information from the National Center for State Courts and the Federal Judicial Center, you’ll see that there are about 11 million new lawsuits filed every year.  I saw an article in the Cornell Law Forum a week or two ago which indicated that there are roughly 1.1 million lawyers in the country.  So, there are 11 million lawsuits, 1.1 million lawyers and 1,100 providers.  Most of those lawyers have no experience with eDiscovery and most of those lawsuits have no provider involved, which means eDiscovery is still very much an emerging market, not even close to being a mature market.  As fast as providers disappear, through attrition or acquisition, new providers enter the market to take their place.

Which of those trends are evident here at LTNY, which are not being talked about enough, and/or what are your general observations about LTNY this year?

{Interviewed on the second afternoon of LTNY}  Maybe this is overly optimistic, but part of what I’m seeing leading up to the conference, on various web sites and at the conference itself, is that a series of incremental changes taking place over a long period are finally leading to some radical differences.  One of those differences is that we finally are reaching a point where a number of providers can claim to be “end-to-end providers” with some legitimacy.  For as long as we’ve had the EDRM model, we’ve had providers that have professed to cover the full EDRM landscape, by which they generally have meant Identification through Production.  A growing number of providers not only cover that portion of the EDRM spectrum but have some ability to address Information Management, Presentation, or both.  By and large, those providers are getting there by building their software and services based on experience and learning over the past 8 to 10 to 12 years, introducing new offerings at the show that reflect that learned experience.

A couple of days ago, I only half-jokingly issued “the Dyson challenge” (as in the Dyson vacuum cleaner).  Every year, come January, our living room carpet is strewn with pine tree needles, and none of the vacuum cleaners that we have ever had have done a good job of picking up those needles.  The Dyson vacuum cleaner claims its cyclones capture more dirt than anything, but I was convinced that could not include those needles.  Nonetheless, I tried, and to my surprise it worked like a charm!  I want to see the providers offering products able to perform at that high level, not just meeting but exceeding expectations.

I also see a feeling of excitement and optimism that wasn’t apparent at last year’s show.

What are you working on that you’d like our readers to know about?

As I mentioned, we have launched the Apersee web site, designed to allow consumers to find providers and products that fit their specific needs.  The site is in beta and the link is live.  It’s in beta because we’re still working on features to make it as useful as possible to customers and providers.  We’re hoping it’s a question of weeks, not months, before those features are implemented.  Once we go fully live, we will go two months with the system “wide open” – where every consumer can see all the provider and product information that any provider has put in the system.  After that, consumers will be able to see full provider and product profiles for providers who have purchased blocks of views.  Even if a provider does not purchase views, all selection criteria it enters are searchable, but search results will display only the provider’s name and website name.  Providers will be able to get stats on queries and how many times their information is viewed, but not detailed information as to which customers are connecting and performing the queries.

As for EDRM, we continue to make progress with an array of projects and a growing number of collaborative efforts, such as the work the Data Set group has done with TREC Legal and the work the Metrics group has done with the LEDES Committee.  We not only want to see membership continue to grow, but we also want to continue to push for more active participation to continue to make progress in the various working groups.  We’ve just met at the show here regarding the EDRM Testing pilot project to address testing standards.  There are very few guidelines for testing of electronic discovery software and services, so the Testing project will become a full EDRM project as of the EDRM annual meeting this May to begin to address the need for those guidelines.

Thanks, George, for participating in the interview!

And to the readers, as always, please share any comments you might have or if you’d like to know more about a particular topic!

eDiscovery Trends: Jim McGann of Index Engines

 

This is the third of the LegalTech New York (LTNY) Thought Leader Interview series.  eDiscoveryDaily interviewed several thought leaders at LTNY this year and asked each of them the same three questions:

  1. What do you consider to be the current significant trends in eDiscovery on which people in the industry are, or should be, focused?
  2. Which of those trends are evident here at LTNY, which are not being talked about enough, and/or what are your general observations about LTNY this year?
  3. What are you working on that you’d like our readers to know about?

Today’s thought leader is Jim McGann.  Jim is Vice President of Information Discovery at Index Engines.  Jim has extensive experience with eDiscovery and Information Management in the Fortune 2000 sector. He has worked for leading software firms, including Information Builders and the French-based engineering software provider Dassault Systemes.  In recent years he has worked for technology-based start-ups that provided financial services and information management solutions.

What do you consider to be the current significant trends in eDiscovery on which people in the industry are, or should be, focused?

What we’re seeing is that companies are becoming a bit more proactive.  Over the past few years we’ve seen companies that have simply been reacting to litigation and it’s been a very painful process because ESI collection has been a “fire drill” – a very last minute operation.  Not because lawyers have waited and waited, but because the data collection process has been slow, complex and overly expensive.  But things are changing. Companies are seeing that eDiscovery is here to stay, ESI collection is not going away and the argument of saying that it’s too complex or expensive for us to collect is not holding water. So, companies are starting to take a proactive stance on ESI collection and understanding their data assets proactively.  We’re talking to companies that are not specifically responding to litigation; instead, they’re building a defensible policy that they can apply to their data sources and make data available on demand as needed.    

Which of those trends are evident here at LTNY, which are not being talked about enough, and/or what are your general observations about LTNY this year?

{Interviewed on the first morning of LTNY}  Well, in walking the floor as people were setting up, you saw a lot of early case assessment last year; this year you’re seeing a lot of information governance.  That’s showing that eDiscovery is really rolling into the records management/information governance area.  On the CIO and General Counsel level, information governance is getting a lot of exposure and there’s a lot of technology that can solve the problems.  Litigation support’s role will be to help the executives understand the available technology and how it applies to information governance and records management initiatives.  You’ll see more information governance messaging, which is really a higher level records management message.

As for other trends, one that I’ll tie Index Engines into is ESI collection and pricing.  Per GB pricing is going down as the volume of data is going up.  Years ago, prices were a thousand dollars per GB, then hundreds of dollars per GB, etc.  Now the cost is close to tens of dollars per GB.  To really manage large volumes of data more cost-effectively, the collection price had to become more affordable.  Because Index Engines can make data on backup tapes searchable very cost-effectively, for as little as $50 per tape, data on tape has become as easy to access and search as online data – perhaps even easier, because it’s not on a live network.  Backup tapes have a bad reputation because people think of them as complex or expensive, but if you take away the complexity and expense (which is what Index Engines has done), then they really become “full point-in-time” snapshots.  So, if you have litigation from a specific date range, you can request that data snapshot (which is a tape) and perform discovery on it.  Tape is really a natural litigation hold when you think about it, and there is no need to perform the hold retroactively.

So, what does the ease with which information can be indexed from tape do to address the “inaccessible” argument for tape retrieval?  That argument has been eroding over the years, thanks to technology like ours.  And, you see decisions from judges like Judge Scheindlin saying “if you cannot find data in your primary network, go to your backup tapes”, indicating that they consider backup tapes the next source right after online networks.  You also see people like Craig Ball writing that backup tapes may be the most convenient and cost-effective way to get access to data.  If you had a choice between doing a “server crawl” in a corporate environment or just asking for a backup tape of that time frame, tape is the much more convenient and less disruptive option.  So, if your opponent goes to the judge and says it’s going to take millions of dollars to get the information off of twenty tapes, you must know enough to be in front of a judge and say “that’s not accurate”.  Those are old numbers.  There are court cases where parties have been instructed to use tapes as a cost-effective means of getting to the data.  Technology removes the inaccessible argument by making it easier, faster and cheaper to retrieve data from backup tapes.

The erosion of the accessibility burden is sparking the information governance initiatives. We’re seeing companies come to us for legacy data remediation or management projects, basically getting rid of old tapes. They are saying “if I’ve got ten years of backup tapes sitting in offsite storage, I need to manage that proactively and address any liability that’s there” (that they may not even be aware exists).  These projects reflect a proactive focus towards information governance by remediating those tapes and getting rid of data they don’t need.  Ninety-eight percent of the data on old tapes is not going to be relevant to any case.  The remaining two percent can be found and put into the company’s litigation hold system, and then they can get rid of the tapes.

How do incremental backups play into that?  Tapes are very incremental and repetitive.  If you’re backing up the same data over and over again, you may have 50+ copies of the same email.  Index Engines technology automatically gets rid of system files and applies a standard MD5 hash to dedupe.  Also, by using tape cataloguing, you can read the header and say “we have a Saturday full backup and five incrementals during the week, then another Saturday full backup”.  You can ignore the incremental tapes and just go after the full backups.  That’s a significant percentage of the tapes you can ignore.
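
To illustrate the hash-based deduplication Jim describes – this is a generic sketch, not Index Engines’ actual implementation, and the “./restored_tapes” folder is hypothetical – here is how duplicate copies restored from repetitive backups might be suppressed using Python’s standard hashlib module:

import hashlib
import os

def dedupe_by_md5(root):
    """Keep one representative path per unique MD5 hash; later copies count as duplicates."""
    seen = {}  # MD5 hex digest -> first path encountered with that content
    duplicates = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            md5 = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    md5.update(chunk)
            digest = md5.hexdigest()
            if digest in seen:
                duplicates += 1  # same content already restored from another backup
            else:
                seen[digest] = path
    return seen, duplicates

uniques, dupes = dedupe_by_md5("./restored_tapes")  # hypothetical restore folder
print(f"{len(uniques)} unique files, {dupes} duplicate copies suppressed")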

What are you working on that you’d like our readers to know about?

Index Engines just announced today a partnership with LeClairRyan.  This partnership combines legal expertise for data retention with the technology that makes applying the policy to legacy data possible.  For companies that want to build policy for the retention of legacy data and implement the tape remediation process, we have advisors like LeClairRyan that can provide legacy data consultation and oversight.  By proactively managing the potential liability of legacy data, you are also saving the IT costs to explore that data.

Index Engines also just announced a new cloud-based tape load service that will provide full identification, search and access to tape data for eDiscovery.  The Look & Learn service, starting at $50 per tape, will provide clients with full access to the index of their tape data without the need to install any hardware or software.  Customers will be able to search the index and gather knowledge about content, custodians, email and metadata, all via cloud access to the Index Engines interface, making discovery of data from tapes even more convenient and affordable.

Thanks, Jim, for participating in the interview!

And to the readers, as always, please share any comments you might have or if you’d like to know more about a particular topic!

eDiscovery Best Practices: Judges’ Guide to Cost-Effective eDiscovery

 

Last week at LegalTech, I met Joe Howie at the blogger’s breakfast on Tuesday morning.  Joe is the founder of Howie Consulting and is the Director of Metrics Development and Communications for the eDiscovery Institute, which is a 501(c)(3) nonprofit research organization for eDiscovery.

eDiscovery Institute has just released a new publication that is a vendor-neutral guide for approaches to considerably reduce discovery costs for ESI.  The Judges’ Guide to Cost-Effective E-Discovery, co-written by Anne Kershaw (co-Founder and President of the eDiscovery Institute) and Joe Howie, also contains a foreword by the Hon. James C. Francis IV, Magistrate Judge for the Southern District of New York.  Joe gave me a copy of the guide, which I read during my flight back to Houston and found to be a terrific publication that details various mechanisms that can reduce the volume of ESI to review by up to 90 percent or more.  You can download the publication here (for personal review, not re-publication), and also read a summary article about it from Joe in InsideCounsel here.

Mechanisms for reducing costs covered in the Guide include:

  • DeNISTing: Excluding files known to be associated with commercial software, such as help files, templates, etc., as compiled by the National Institute of Standards and Technology, can eliminate a high number of files that will clearly not be responsive;
  • Duplicate Consolidation (aka “deduping”): Deduping across custodians, rather than just within each custodian’s files, saves more – 38% cost reduction for across-custodian deduplication versus 21% for within-custodian;
  • Email Threading: The ability to review the entire email thread at once reduces costs 36% over having to review each email in the thread;
  • Domain Name Analysis (aka Domain Categorization): As noted previously in eDiscoveryDaily, the ability to classify items based on the domain of the email’s sender can significantly reduce the collection to be reviewed by identifying emails from parties that are clearly not responsive to the case.  It can also be a great way to quickly identify some of the privileged emails (see the sketch after this list);
  • Predictive Coding: As noted previously in eDiscoveryDaily, predictive coding is the use of machine learning technologies to categorize an entire collection of documents as responsive or non-responsive, based on human review of only a subset of the document collection. According to this report, “A recent survey showed that, on average, predictive coding reduced review costs by 45 percent, with several respondents reporting much higher savings in individual cases”.
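
As a simple illustration of the domain analysis idea referenced in the list above – assuming a custodian’s email is available as an mbox file (the file name is hypothetical) – this Python sketch tallies sender domains so that clearly non-responsive domains can be culled and likely privileged domains (e.g., outside counsel) can be flagged:

import mailbox
from collections import Counter

def domains_by_count(mbox_path):
    """Tally the sender domain of each message in an mbox file."""
    counts = Counter()
    for message in mailbox.mbox(mbox_path):
        sender = message.get("From", "")
        if "@" in sender:
            domain = sender.rsplit("@", 1)[1].strip(" >").lower()
            counts[domain] += 1
    return counts

# Print the 20 most common sender domains to support culling and privilege decisions
for domain, count in domains_by_count("custodian.mbox").most_common(20):  # hypothetical file
    print(f"{domain}: {count}")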

The publication also addresses concepts such as focused sampling, foreign language translation costs and searching audio records and tape backups.  It even addresses some of the most inefficient (and therefore, costly) practices of ESI processing and review, such as wholesale printing of ESI to paper for review (either in paper form or ultimately converted to TIFF or PDF), which is still more common than you might think.  Finally, it references some key rules of the ABA Model Rules of Professional Conduct to address the ethical duty of attorneys in effective management of ESI.  It’s a comprehensive publication that does a terrific job of explaining best practices for efficient discovery of ESI.

So, what do you think?  How many of these practices have been implemented by your organization?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Trends: 2011 Predictions — By The Numbers

 

Comedian Nick Bakay always ends his Tale of the Tape skits, where he compares everything from Married vs. Single to Divas vs. Hot Dogs, with the phrase “It's all so simple when you break things down scientifically.”

The late December/early January time frame is always when various people in eDiscovery make their annual predictions as to what trends to expect in the coming year.  We’ll have some of our own in the next few days (hey, the longer we wait, the more likely we are to be right!).  However, before stating those predictions, I thought we would take a look at other predictions and see if we could spot some common trends among them, so I “googled” for 2011 eDiscovery predictions and organized the predictions I found into common themes.  I found serious predictions here, here, here, here and here.  Oh, also here and here.

A couple of quick comments: 1) I had NO IDEA how many times predictions are re-posted by other sites, so it took some work to isolate each unique set of predictions.  I even found two sets of predictions from ZL Technologies, one with twelve predictions and another with seven, so I had to pick one set and I chose the one with seven (sorry, eWEEK!). If I have failed to accurately attribute the original source for a set of predictions, please feel free to comment.  2) This is probably not an exhaustive list of predictions (I have other duties in my “day job”, so I couldn’t search forever), so I apologize if I’ve left anybody’s published predictions out.  Again, feel free to comment if you’re aware of other predictions.

Here are some of the common themes:

  • Cloud and SaaS Computing: Six out of seven “prognosticators” indicated that adoption of Software as a Service (SaaS) “cloud” solutions will continue to increase, which will become increasingly relevant in eDiscovery.  No surprise here, given last year’s IDC forecast for SaaS growth and many articles addressing the subject, including a few posts right here on this blog.
  • Collaboration/Integration: Six out of seven “augurs” also had predictions related to various themes associated with collaboration (more collaboration tools, greater legal/IT coordination, etc.) and integration (greater focus by software vendors on data exchange with other systems, etc.).  Two people specifically noted an expectation of greater eDiscovery integration within organization governance, risk management and compliance (GRC) processes.
  • In-House Discovery: Five “pundits” forecasted eDiscovery functions and software will continue to be brought in-house, especially on the “left-side of the EDRM model” (Information Management).
  • Diverse Data Sources: Three “soothsayers” presaged that sources of data will continue to be more diverse, which shouldn’t be a surprise to anyone, given the popularity of gadgets and the rise of social media.
  • Social Media: Speaking of social media, three “prophets” (yes, I’ve been consulting my thesaurus!) expect social media to continue to be a big area to be addressed for eDiscovery.
  • End to End Discovery: Three “psychics” also predicted that there will continue to be more single-source end-to-end eDiscovery offerings in the marketplace.

The “others receiving votes” category (two predicting each of these) included maturing and acceptance of automated review (including predictive coding), early case assessment moving toward the Information Management stage, consolidation within the eDiscovery industry, more focus on proportionality, maturing of global eDiscovery and predictive/disruptive pricing.

Predictive/disruptive pricing (via the respective blogs of Kriss Wilson of Superior Document Services and Charles Skamser of eDiscovery Solutions Group) is a particularly intriguing prediction to me because data volumes are continuing to grow at an astronomical rate, so greater volumes lead to greater costs.  Creativity will be key in how companies deal with the larger volumes effectively, and pressures will become greater for providers (even, dare I say, review attorneys) to price their services more creatively.

Another interesting prediction (via ZL Technologies) is that “Discovery of Databases and other Structured Data will Increase”, which is something I’ve expected to see for some time.  I hope this is finally the year for that.

Finally, I said that I found serious predictions and analyzed them; however, there are a couple of not-so-serious sets of predictions here and here.  My favorite prediction is from The Posse List, as follows: “LegalTech…renames itself “EDiscoveryTech” after Law.com survey reveals that of the 422 vendors present, 419 do e-discovery, and the other 3 are Hyundai HotWheels, Speedway Racers and Convert-A-Van who thought they were at the Javits Auto Show.”

So, what do you think?  Care to offer your own “hunches” from your crystal ball?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Trends: Predictive Coding Strategy and Survey Results

Yesterday, we introduced the Virtual LegalTech online educational session Frontiers of E-Discovery: What Lawyers Need to Know About “Predictive Coding” and defined predictive coding while also noting the two “learning” methods that most predictive coding mechanisms use to predict document classifications.  To get background information regarding the session, including information about the speakers (Jason Baron, Maura Grossman and Bennett Borden), click here.

The session also focused on strategies for using predictive coding and results of the TREC 2010 Legal Track Learning Task on the effectiveness of “Predictive Coding” technologies.  Strategies discussed by Bennett Borden include:

  • Understand the technology used by a particular provider:  Not only will supervised and active learning mechanisms often yield different results, but there are differing technologies within each of these learning mechanisms.
  • Understand the state of the law regarding predictive coding technology: So far, there is no case law available regarding use of this technology and, while it may eventually be the future of document review, that has yet to be established.
  • Obtain buy-in by the requesting party to use predictive coding technology: It’s much easier when the requesting party has agreed to your proposed approach and that agreement is included in an order of the court which covers the approach and also includes a FRE 502 “clawback” agreement and order.  To have a chance to obtain that buy-in and agreement, you’ll need a diligent approach that includes “tiering” of the collection by probable responsiveness and appropriate sampling of each tier level.

Maura Grossman then described TREC 2010 Legal Track Learning Task on the effectiveness of “Predictive Coding” technologies.  The team took the EDRM Enron Version 2 Dataset of 1.3 million public domain files, deduped it down to 685,000+ unique files and 5.5 GB of uncompressed data.  The team also identified eight different hypothetical eDiscovery requests for the test.

Participating predictive coding technologies were then given a “seed set” of roughly 1,000 documents that had previously been identified by TREC as responsive or non-responsive to each of the requests.  Using this information, participants were required to rank the documents in the larger collection from most likely to least likely to be responsive, and to estimate the likelihood of responsiveness as a probability for each document.  The study scored participants on actual recall achieved when the top 30% of the collection (roughly 200,000 files) was retrieved, and compared each participant’s predicted recall to the actual result to determine prediction accuracy.
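
The study’s exact scoring methodology is spelled out in the TREC materials; as a simplified illustration of the recall measurement only, here is a Python sketch (with toy data) of computing recall at a fixed retrieval depth:

def recall_at_depth(ranked_ids, responsive_ids, depth):
    """Fraction of all responsive documents found within the top `depth` of the ranking."""
    retrieved = set(ranked_ids[:depth])
    return len(retrieved & responsive_ids) / len(responsive_ids)

# Toy example: a 10-document collection with 4 responsive documents, reviewing the top 30%
ranking = ["d7", "d2", "d9", "d1", "d4", "d3", "d8", "d5", "d6", "d0"]
responsive = {"d2", "d9", "d4", "d5"}
depth = int(0.3 * len(ranking))  # top 3 documents
print(f"Recall at 30%: {recall_at_depth(ranking, responsive, depth):.0%}")  # 2 of 4 found = 50%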

The results?  Actual recall rates for all eight discovery requests ranged widely among the tools from 85.1% actual recall down to 38.2% (on individual requests, the range was even wider – as much as 82% different between the high and the low).  The prediction accuracy rates for the tools also ranged somewhat widely, from a high of 95% to a low of 42%.

Based on this study, it is clear that these technologies can differ significantly on how effective and efficient they are at correctly ranking and categorizing remaining documents in the collection based on the exemplar “seed set” of documents.  So, it’s always important to conduct sampling of both machine coded and human coded documents for quality control in any project, with or without predictive coding (we sometimes forget that human coded documents can just as often be incorrectly coded!).

For more about the TREC 2010 Legal Track study, click here.  As noted yesterday, you can also check out a replay of the session or download the slides for the presentation at the Virtual LegalTech site.

Full Disclosure: Trial Solutions provides predictive coding services using Hot Neuron LLC’s Clustify™, which categorizes documents by looking for similar documents in the exemplar set that satisfy user-specified criteria, such as a minimum conceptual similarity or near-duplicate percentage.

So, what do you think?  Have you used predictive coding on a case?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Trends: What the Heck is “Predictive Coding”?

 

Yesterday, ALM hosted another Virtual LegalTech “live” day online.  Every quarter, the Virtual LegalTech site has a “live” day with educational sessions from 9 AM to 5 PM ET, most of which provide CLE credit in certain states (New York, California, Florida, and Illinois).

One of yesterday’s sessions was Frontiers of E-Discovery: What Lawyers Need to Know About “Predictive Coding”.  The speakers for this session were:

Jason Baron: Director of Litigation for the National Archives and Records Administration, a founding co-coordinator of the National Institute of Standards and Technology’s Text Retrieval Conference (“TREC”) legal track and co-chair and editor-in-chief for various working groups for The Sedona Conference®;

Maura Grossman: Counsel at Wachtell, Lipton, Rosen & Katz, co-chair of the eDiscovery Working Group advising the New York State Unified Court System and coordinator of the 2010 TREC legal track; and

Bennett Borden: co-chair of the e-Discovery and Information Governance Section at Williams Mullen and member of Working Group I of The Sedona Conference on Electronic Document Retention and Production, as well as the Cloud Computing Drafting Group.

This highly qualified panel discussed a number of topics related to predictive coding, including practical applications of predictive coding technologies and results of the TREC 2010 Legal Track Learning Task on the effectiveness of “Predictive Coding” technologies.

Before discussing the strategies for using predictive coding technologies and the results of the TREC study, it’s important to understand what predictive coding is.  The panel gave the best descriptive definition that I’ve seen yet for predictive coding, as follows:

“The use of machine learning technologies to categorize an entire collection of documents as responsive or non-responsive, based on human review of only a subset of the document collection. These technologies typically rank the documents from most to least likely to be responsive to a specific information request. This ranking can then be used to “cut” or partition the documents into one or more categories, such as potentially responsive or not, in need of further review or not, etc.”

The panel used an analogy for predictive coding by relating it to spam filters, which review and classify email and learn, based on previous classifications, which emails can be considered “spam”.  Just as no spam filter perfectly classifies all emails as spam or legitimate, predictive coding does not perfectly identify all relevant documents.  However, it can “learn” to identify most of the relevant documents based on one of two “learning” methods:

  • Supervised Learning: a human chooses a set of “exemplar” documents that feed the system and enable it to rank the remaining documents in the collection based on their similarity to the exemplars (e.g., “more like this”) – see the sketch after this list;
  • Active Learning: the system chooses the exemplars on which human reviewers make relevancy determinations, then the system learns from those classifications to apply to the remaining documents in the collection.
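
As a rough illustration of the supervised learning approach – not any particular vendor’s tool, and assuming the scikit-learn library is available – the following Python sketch trains a model on a tiny human-reviewed seed set and then ranks the unreviewed documents by predicted likelihood of responsiveness:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Exemplar "seed set" reviewed by a human: 1 = responsive, 0 = non-responsive
seed_docs = ["quarterly revenue forecast for the merger",
             "lunch order for the holiday party",
             "draft term sheet for the acquisition",
             "fantasy football league standings"]
seed_labels = [1, 0, 1, 0]

# Unreviewed remainder of the collection
remaining = ["updated valuation model for the deal",
             "parking garage access codes"]

vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(seed_docs), seed_labels)

# Rank the remaining documents from most to least likely to be responsive
scores = model.predict_proba(vectorizer.transform(remaining))[:, 1]
for score, doc in sorted(zip(scores, remaining), reverse=True):
    print(f"{score:.2f}  {doc}")

A real collection would, of course, involve far more exemplars and far richer features, but the ranking step is the same in spirit.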

Tomorrow, I “predict” we will get into the strategies and the results of the TREC study.  You can check out a replay of the session at the Virtual LegalTech site. You’ll need to register – it’s free – then login and go to the CLE Center Auditorium upon entering the site (which is up all year, not just on “live” days).  Scroll down until you see this session and then click on “Attend Now” to view the replay presentation.  You can also go to the Resource Center at the site and download the slides for the presentation.

So, what do you think?  Do you have experience with predictive coding?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Searching: Types of Exception Files

Friday, we talked about how to address the handling of exception files through agreement with opposing counsel (typically, via the meet and confer) to manage costs and avoid the potential for spoliation claims.  There are different types of exception files that might be encountered in a typical ESI collection and it’s important to know how those files can be recovered.

Types of Exception Files

It’s important to note that efforts to “fix” these files will often also change the files (and the metadata associated with them), so it’s important to establish with opposing counsel what measures to address the exceptions are acceptable.  Some files may not be recoverable and you need to agree up front how far to go to attempt to recover them.

  • Corrupted Files: Files can become corrupted for a variety of reasons, from application failures to system crashes to computer viruses.  I recently had a case where 40% of the collection was contained in 2 corrupt Outlook PST files – fortunately, we were able to repair those files and recover the messages.  If you have readily accessible backups of the files, try to restore them from backup.  If not, you will need to try using a repair utility.  Outlook comes with a utility called SCANPST.EXE that scans and repairs PST and OST files, and there are utilities (including freeware utilities) available via the web for most file types.  If all else fails, you can hire a data recovery expert, but that can get very expensive.
  • Password Protected Files: Most collections usually contain at least some password protected files.  Files can require a password to enable them to be edited, or even just to view them.  As the most popular publication format, PDF files are often password protected from editing, but they can still be viewed to support review (though some search engines may fail to index them).  If a file is password protected, you can try to obtain the password from the custodian providing the file – if the custodian is unavailable or unable to remember the password, you can try a password cracking application, which will run through a series of character combinations to attempt to find the password.  Be patient, it takes time, and doesn’t always succeed.
  • Unsupported File Types: In most collections, there are some unusual file types that aren’t supported by the review application, such as files for legacy or specialized applications (e.g., AutoCad for engineering drawings).  You may not even initially know what type of files they are; if not, you can find out based on file extension by looking the file extension up in FILExt.  If your review application can’t read the files, it also can’t index the files for searching or display them for review.  If those files may be responsive to discovery requests, review them with the native application to determine their relevancy.
  • No-Text Files: Files with no searchable text aren’t really exceptions – they have to be accounted for, but they won’t be retrieved in searches, so it’s important to make sure they don’t “slip through the cracks”.  It’s common to perform Optical Character Recognition (OCR) on TIFF files and image-only PDF files, because they are common document formats.  Other types of no-text files, such as pictures in JPEG or PNG format, are usually not OCRed, unless there is an expectation that they will have significant text.

It’s important for review applications to be able to identify exception files, so that you know they won’t be retrieved in searches without additional processing.  FirstPass™, powered by Venio FPR™, is one example of an application that will flag those files during processing and enable you to search for those exceptions, so you can determine how to handle them.
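
How FirstPass flags these files internally isn’t detailed here, but as a rough illustration, the following Python sketch pre-scans a hypothetical “./collection” folder and flags unsupported extensions, plus corrupt or password protected .zip archives (other formats, such as PSTs or PDFs, would each need their own library); the supported-extension list is purely illustrative:

import os
import zipfile

SUPPORTED = {".pdf", ".docx", ".xlsx", ".pptx", ".txt", ".msg", ".htm", ".zip"}  # illustrative list

def flag_exceptions(root):
    """Yield (path, reason) for files that will need special handling before review."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower()
            if ext not in SUPPORTED:
                yield path, "unsupported file type"
            elif ext == ".zip":
                try:
                    with zipfile.ZipFile(path) as zf:
                        # Bit 0 of the flag bits marks an encrypted (password protected) member
                        if any(info.flag_bits & 0x1 for info in zf.infolist()):
                            yield path, "password protected"
                except zipfile.BadZipFile:
                    yield path, "corrupted"

for path, reason in flag_exceptions("./collection"):  # hypothetical collection folder
    print(f"{reason}: {path}")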

So, what do you think?  Have you encountered other types of exceptions?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Searching: Exceptions are the Rule

 

Virtually every collection of electronically stored information (ESI) has at least some files that cannot be effectively searched.  Corrupt files, password protected files and other types of exception files are frequent components of your ESI collection, and it can become very expensive to make these files searchable or reviewable.  Without an effective plan for addressing these files, you could face problems – even spoliation claims – in your case.

How to Address Exception Files

The best way to develop a plan for addressing these files that is reasonable and cost-effective is to come to agreement with opposing counsel on how to handle them.  The prime opportunity to obtain this agreement is during the meet and confer with opposing counsel.  The meet and confer gives you the opportunity to agree on how to address the following:

  • Efforts Required to Make Unusable Files Usable: Corrupted and password protected files may be fairly easily addressed in some cases, whereas in others, it takes extreme (i.e., costly) efforts to fix those files (if they can be fixed at all).  Up-front agreement with the opposition helps you determine how far to go in your recovery efforts to keep those recovery costs manageable.
  • Exception Reporting: Because there will usually be some files for which recovery is unsuccessful (or not attempted, if agreed upon with the opposition), you need to agree on how those files will be reported, so that they are accounted for in the production.  The information on exception reports will vary depending on the format agreed upon between the parties, but should typically include: file name and path, source custodian and reason for the exception (e.g., the file was corrupt) – see the sample report sketch after this list.
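
As a simple illustration of such a report – the field names and entries below are purely hypothetical – the agreed-upon fields could be captured in a CSV using Python’s standard csv module:

import csv

# Illustrative exception records: (file name and path, source custodian, reason for exception)
exceptions = [
    (r"\\fileserver\jsmith\archive_2009.zip", "J. Smith", "Corrupted file"),
    (r"\\fileserver\rjones\budget_final.pdf", "R. Jones", "Password protected"),
]

with open("exception_report.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["File Name and Path", "Source Custodian", "Reason for Exception"])
    writer.writerows(exceptions)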

If your case is in a jurisdiction where a meet and confer is not required (such as state cases where the state has no rules for eDiscovery), it is still best to reach out to opposing counsel to agree on the handling of exception files to control costs for addressing those files and avoid potential spoliation claims.

On Monday, we will talk about the types of exception files and the options for addressing them.  Oh, the suspense!  Hang in there!

So, what do you think?  Have you been involved in any cases where the handling of exception files was disputed?  Please share any comments you might have or if you’d like to know more about a particular topic.