Analysis

eDiscovery Trends: Jim McGann of Index Engines

 

This is the third of the LegalTech New York (LTNY) Thought Leader Interview series.  eDiscoveryDaily interviewed several thought leaders at LTNY this year and asked each of them the same three questions:

  1. What do you consider to be the current significant trends in eDiscovery on which people in the industry are, or should be, focused?
  2. Which of those trends are evident here at LTNY, which are not being talked about enough, and/or what are your general observations about LTNY this year?
  3. What are you working on that you’d like our readers to know about?

Today’s thought leader is Jim McGann.  Jim is Vice President of Information Discovery at Index Engines.  Jim has extensive experience with eDiscovery and Information Management in the Fortune 2000 sector. He has worked for leading software firms, including Information Builders and the French-based engineering software provider Dassault Systemes.  In recent years he has worked for technology-based start-ups that provided financial services and information management solutions.

What do you consider to be the current significant trends in eDiscovery on which people in the industry are, or should be, focused?

What we’re seeing is that companies are becoming a bit more proactive.  Over the past few years we’ve seen companies simply reacting to litigation, and it’s been a very painful process because ESI collection has been a “fire drill” – a very last minute operation.  Not because lawyers have waited and waited, but because the data collection process has been slow, complex and overly expensive.  But things are changing.  Companies are seeing that eDiscovery is here to stay, ESI collection is not going away, and the argument that it’s too complex or expensive to collect is no longer holding water.  So, companies are starting to take a proactive stance on ESI collection and on understanding their data assets.  We’re talking to companies that are not specifically responding to litigation; instead, they’re building a defensible policy that they can apply to their data sources to make data available on demand as needed.

Which of those trends are evident here at LTNY, which are not being talked about enough, and/or what are your general observations about LTNY this year?

{Interviewed on the first morning of LTNY}  Well, in walking the floor as people were setting up, you saw a lot of early case assessment last year; this year you’re seeing a lot of information governance.  That’s showing that eDiscovery is really rolling into the records management/information governance area.  On the CIO and General Counsel level, information governance is getting a lot of exposure and there’s a lot of technology that can solve the problems.  Litigation support’s role will be to help the executives understand the available technology and how it applies to information governance and records management initiatives.  You’ll see more information governance messaging, which is really a higher-level records management message.

As for other trends, one that I’ll tie Index Engines into is ESI collection and pricing.  Per GB pricing is going down as the volume of data is going up.  Years ago, prices were a thousand dollars per GB, then hundreds of dollars per GB, and so on.  Now the cost is close to tens of dollars per GB.  To really manage large volumes of data more cost-effectively, the collection price had to become more affordable.  Because Index Engines can make data on backup tapes searchable very cost-effectively, for as little as $50 per tape, data on tape has become as easy to access and search as online data.  Perhaps even easier, because it’s not on a live network.  Backup tapes have a bad reputation because people think of them as complex or expensive, but if you take away the complexity and expense (which is what Index Engines has done), then they really become “full point-in-time” snapshots.  So, if you have litigation from a specific date range, you can request that data snapshot (which is a tape) and perform discovery on it.  Tape is really a natural litigation hold when you think about it, and there is no need to perform the hold retroactively.

So, what does the ease with which information can be indexed from tape do to address the argument that tape is not reasonably accessible?  That argument has been eroding over the years, thanks to technology like ours.  And, you see decisions from judges like Judge Scheindlin saying “if you cannot find data in your primary network, go to your backup tapes”, indicating that they consider backup tapes to be the next source right after online networks.  You also see people like Craig Ball writing that backup tapes may be the most convenient and cost-effective way to get access to data.  If you had a choice between doing a “server crawl” in a corporate environment or just asking for a backup tape of that time frame, tape is the much more convenient and less disruptive option.  So, if your opponent goes to the judge and says it’s going to take millions of dollars to get the information off of twenty tapes, you must know enough to be in front of a judge and say “that’s not accurate”.  Those are old numbers.  There are court cases where parties have been instructed to use tapes as a cost-effective means of getting to the data.  Technology removes the inaccessibility argument by making it easier, faster and cheaper to retrieve data from backup tapes.

The erosion of the accessibility burden is sparking the information governance initiatives. We’re seeing companies come to us for legacy data remediation or management projects, basically getting rid of old tapes. They are saying “if I’ve got ten years of backup tapes sitting in offsite storage, I need to manage that proactively and address any liability that’s there” (that they may not even be aware exists).  These projects reflect a proactive focus towards information governance by remediating those tapes and getting rid of data they don’t need.  Ninety-eight percent of the data on old tapes is not going to be relevant to any case.  The remaining two percent can be found and put into the company’s litigation hold system, and then they can get rid of the tapes.

How do incremental backups play into that?  Tapes are very incremental and repetitive.  If you’re backing up the same data over and over again, you may have 50+ copies of the same email.  Index Engines technology automatically gets rid of system files and applies a standard MD5 hash to dedupe.  Also, by using tape cataloguing, you can read the header and say “we have a Saturday full backup and five incrementals during the week, then another Saturday full backup.”  You can ignore the incremental tapes and just go after the full backups.  That’s a significant percentage of the tapes you can ignore.
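
Jim doesn’t go into implementation detail, and the sketch below is not Index Engines’ method; it is simply a minimal illustration of hash-based deduplication in Python, assuming the messages or files have already been extracted from tape to disk (the folder name and file pattern are made up for the example).  In practice, email dedupe keys are often computed over normalized metadata fields (sender, recipients, date, subject, body) rather than raw bytes, so the same message restored from different backup sets still matches.

```python
import hashlib
from pathlib import Path

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def dedupe(paths):
    """Keep one representative path per unique MD5 hash (the 50th copy is ignored)."""
    seen = {}
    for path in paths:
        seen.setdefault(md5_of_file(path), path)
    return list(seen.values())

if __name__ == "__main__":
    # Hypothetical folder of messages restored from several backup tapes.
    unique_files = dedupe(Path("extracted_tape_data").rglob("*.msg"))
    print(f"{len(unique_files)} unique files kept")
```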

What are you working on that you’d like our readers to know about?

Index Engines just announced today a partnership with LeClairRyan.  This partnership combines legal expertise for data retention with the technology that makes applying that policy to legacy data possible.  For companies that want to build a retention policy for legacy data and implement the tape remediation process, we have advisors like LeClairRyan that can provide legacy data consultation and oversight.  By proactively managing the potential liability of legacy data, you also save the IT costs of exploring that data.

Index Engines also just announced a new cloud-based tape load service that will provide full identification, search and access to tape data for eDiscovery. The Look & Learn service, starting at $50 per tape, will provide clients with full access to the index of their tape data without the need to install any hardware or software. Customers will be able to search the index and gather knowledge about content, custodians, email and metadata all via cloud access to the Index Engines interface, making discovery of data from tapes even more convenient and affordable.

Thanks, Jim, for participating in the interview!

And to the readers, as always, please share any comments you might have or if you’d like to know more about a particular topic!

eDiscovery Trends: Alon Israely, Esq., CISSP of BIA

 

This is the second of the LegalTech New York (LTNY) Thought Leader Interview series.  eDiscoveryDaily interviewed several thought leaders at LTNY this year and asked each of them the same three questions:

  1. What do you consider to be the current significant trends in eDiscovery on which people in the industry are, or should be, focused?
  2. Which of those trends are evident here at LTNY, which are not being talked about enough, and/or what are your general observations about LTNY this year?
  3. What are you working on that you’d like our readers to know about?

Today’s thought leader is Alon Israely.  Alon is a Senior Advisor in BIA’s Advisory Services group and when he’s not advising clients on e-discovery issues he works closely with BIA’s product development group for its core technology products.  Alon has over fifteen years of experience in a variety of advanced computing-related technologies and has consulted with law firms and their clients on a variety of technology issues, including expert witness services related to computer forensics, digital evidence management and data security.

What do you consider to be the current significant trends in eDiscovery on which people in the industry are, or should be, focused?

I think one of the important trends for corporate clients and law firms is cost control, whether it’s trying to minimize the amount of project management hours that are being billed or the manner in which the engagement is facilitated.  I’m not suggesting going full-bore necessarily, but taking baby steps to help control costs is a good approach.  I don’t think it’s only about bringing prices down, because I think that the industry in general has been able to do that naturally well.  But, I definitely see a new focus on the manner in which costs are managed and outsourced.  So, very specifically, scoping correctly is key: making sure you’re using the right tool for the right job and maintaining efficiencies (whether that’s on the vendor side or the client side) by, for example, not having five phone calls for a meeting to figure out the key words for field searching, and not just going out and imaging every drive before deciding what’s really needed.  Bringing simple efficiencies to the mechanics of doing e-discovery saves tons of money in unnecessary legal, vendor and project management fees.  You can do things that are about creating efficiencies, but are not necessarily changing the process or changing the pricing.

I also see trends in technology, using more focused tools and different tools to facilitate a single project.  Historically, parties would hire three or four different vendors for a single project, but today it may be just one or two vendors, or maybe even no vendors (just the law firm).  It’s the use of the right technologies for the right situations – maybe not just one piece of software, but leveraging several for different parts of the process.  Overall, I foresee fewer vendors per project, but more vendors increasing their stable of tools.  So, whereas a vendor may have had a review tool and one way of doing collection, now they may have two or three review tools, including an ECA tool, and one or two ways of doing collections. They have a toolkit from which they can choose the best set of tools to bring to the engagement.  Because they have more tools to market, vendors can have the right tool in their back pocket, whereas before, a tool belonged to just one service provider, so you either bought from them or you just didn’t have it.

Which of those trends are evident here at LTNY, which are not being talked about enough, and/or what are your general observations about LTNY this year?

{Interviewed on the first morning of LTNY}  I think there is either a little or a lot of a disconnect – depending on how aggressive I want to be with my opinion – between what they’re speaking about in the panels and what we’re seeing on the floor.  But, I think that’s OK, in that the conference itself is usually a little bit ahead of the curve with respect to topics, and the technology will catch up.  You have topics such as predictive coding and social networking related issues – those are two big ones that you’ll see.  I think, for example, there are very few companies that have a solution for social networking, though we happen to have one.  And, predictive coding is the same scenario.  You have a lot of providers that talk about it, but you have a handful that actually do it, and you have probably even fewer than that who do it right.  I think that next year you’ll see many predictive coding solutions and technologies and many more tools that have that capability built into them.  So, on the conference side, there is one level of information and on the floor side, a different level.

What are you working on that you’d like our readers to know about?

BIA has a new product called TotalDiscovery.com, the industry’s first SaaS (software-as-a-service), on-demand collection technology that provides defensible collections.  We just rolled it out, we’re introducing it here at LegalTech, and we’re starting a technology preview and signing up people who want to use the application or try it.  It’s specifically for attorneys, corporations, service providers – anyone who’s in the business and needs a tool for defensible data collection performed with agility (always hard to balance) – so without having to buy software or have expert training, users simply log in or register and can start immediately.  You don’t have to worry about the traditional business processes to get things set up and started.  Which, if you think about it on the collections side of e-discovery, means that the client’s CEO or VP of Marketing can call you up and say “I’m leaving, I have my PST here, can you just come get it?” and you can facilitate that process through the web, download an application, walk through a wizard, collect it defensibly, encrypt it and then deliver a filtered set, as needed, for review.

The tool is designed to collect defensibly and to move the collected data – or some subset of that data – to delivery; from there, you would select your review tool of choice and we hand it off to that tool.  So, we’re not trying to be everything; we’re focused on automating the left side of the EDRM.  We have load formats for certain tools, having been a service provider for ten years, and we’re connecting with partners so that we can do the handoff, so when the client says “I’m ready to deliver my data”, they can choose OnDemand or Concordance or another review tool, and then either send it directly or download and ship it.  We’re not trying to be a review tool, and we’re not trying to be an ECA tool that helps you find the needle in the haystack; instead, we’re focused on collecting the data, normalizing it, cataloguing it and handing it off for the attorneys to do their work.

Thanks, Alon, for participating in the interview!

And to the readers, as always, please share any comments you might have or if you’d like to know more about a particular topic!

eDiscovery Best Practices: Judges’ Guide to Cost-Effective eDiscovery

 

Last week at LegalTech, I met Joe Howie at the blogger’s breakfast on Tuesday morning.  Joe is the founder of Howie Consulting and is the Director of Metrics Development and Communications for the eDiscovery Institute, which is a 501(c)(3) nonprofit research organization for eDiscovery.

eDiscovery Institute has just released a new publication that is a vendor-neutral guide to approaches that can considerably reduce discovery costs for ESI.  The Judges’ Guide to Cost-Effective E-Discovery, co-written by Anne Kershaw (co-Founder and President of the eDiscovery Institute) and Joe Howie, also contains a foreword by the Hon. James C. Francis IV, Magistrate Judge for the Southern District of New York.  Joe gave me a copy of the guide, which I read during my flight back to Houston and found to be a terrific publication that details various mechanisms that can reduce the volume of ESI to review by 90 percent or more.  You can download the publication here (for personal review, not re-publication), and also read a summary article about it from Joe in InsideCounsel here.

Mechanisms for reducing costs covered in the Guide include:

  • DeNISTing: Excluding files known to be associated with commercial software, such as help files, templates, etc., as compiled by the National Institute of Standards and Technology, can eliminate a high number of files that will clearly not be responsive;
  • Duplicate Consolidation (aka “deduping”): Deduping across custodians, as opposed to just within each custodian, yields greater savings: an average cost reduction of 38% for across-custodian deduping versus 21% for within-custodian deduping;
  • Email Threading: The ability to review the entire email thread at once reduces costs 36% over having to review each email in the thread;
  • Domain Name Analysis (aka Domain Categorization): As noted previously in eDiscoveryDaily, the ability to classify items based on the domain of the sender of the email can significantly reduce the collection to be reviewed by identifying emails from parties that are clearly not responsive to the case.  It can also be a great way to quickly identify some of the privileged emails (a short sketch of this kind of domain tallying follows this list);
  • Predictive Coding: As noted previously in eDiscoveryDaily, predictive coding is the use of machine learning technologies to categorize an entire collection of documents as responsive or non-responsive, based on human review of only a subset of the document collection. According to this report, “A recent survey showed that, on average, predictive coding reduced review costs by 45 percent, with several respondents reporting much higher savings in individual cases”.
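
As a quick illustration of the Domain Name Analysis mechanism (see the note in that bullet above), here is a minimal sketch that tallies sender domains across a folder of .eml files so that clearly non-responsive or likely-privileged domains can be identified and bulk-tagged.  The folder layout, the .eml format and the choice to count only the From: header are assumptions made for the example; the Guide itself does not prescribe an implementation.

```python
import email
from collections import Counter
from email.utils import parseaddr
from pathlib import Path

def sender_domain(eml_path):
    """Return the domain of the From: address of an .eml file, or None if absent."""
    with open(eml_path, "rb") as f:
        msg = email.message_from_binary_file(f)
    _, addr = parseaddr(msg.get("From", ""))
    return addr.rsplit("@", 1)[-1].lower() if "@" in addr else None

def domain_counts(collection_dir):
    """Tally sender domains across a collection of .eml files."""
    counts = Counter()
    for path in Path(collection_dir).rglob("*.eml"):
        domain = sender_domain(path)
        if domain:
            counts[domain] += 1
    return counts

if __name__ == "__main__":
    # Review the most common domains first; newsletters, vendors and outside
    # counsel tend to surface quickly and can be tagged in bulk.
    for domain, count in domain_counts("collection").most_common(25):
        print(f"{count:8d}  {domain}")
```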

The publication also addresses concepts such as focused sampling, foreign language translation costs and searching audio records and tape backups.  It even addresses some of the most inefficient (and therefore, costly) practices of ESI processing and review, such as wholesale printing of ESI to paper for review (either in paper form or ultimately converted to TIFF or PDF), which is still more common than you might think.  Finally, it references some key rules of the ABA Model Rules of Professional Conduct to address the ethical duty of attorneys in effective management of ESI.  It’s a comprehensive publication that does a terrific job of explaining best practices for efficient discovery of ESI.

So, what do you think?  How many of these practices have been implemented by your organization?  Please share any comments you might have or if you’d like to know more about a particular topic.

Deadline Extended to Vote for the Most Significant eDiscovery Case of 2010

 

Our ‘little experiment’ to see what the readers of eDiscoveryDaily think about case law developments in 2010 needs more time, as we have not yet received enough votes to have a statistically significant result.  So, we’ve extended the deadline to select the case with the most significant impact on eDiscovery practices in 2010 to February 28.  Evidently, calling out the vote on the last business day before LegalTech is not the best timing.  Live and learn!

As noted previously, we have “nominated” five cases, which we feel were the most significant in different issues of case law, including duty to preserve and sanctions, clawback agreements under Federal Rule of Evidence 502, not reasonably accessible arguments and discoverability of social media content.  If you feel that some other case was the most significant case of 2010, you can select that case instead.  Again, it’s very important to note that you can vote anonymously, so we’re not using this as a “hook” to get your information.  You can select your case without providing any personal information.  However, we would welcome your comments as to why you selected the case you did and you can – optionally – identify yourself as well.

To get more information about the nominated cases (as well as other significant cases), click here.  To cast your vote, click here.

And, as always, please share any comments you might have or if you’d like to know more about a particular topic.

Vote for the Most Significant eDiscovery Case of 2010!

 

Since it’s awards season, we thought we would get into the act from an eDiscovery standpoint.  Sure, you have Oscars, Emmys and Grammys – but what about “EDDies”?  (I’ll bet you wondered what Eddie Munster could possibly have to do with eDiscovery, didn’t you?)

So, we’re conducting a ‘little experiment’ to see what the readers of eDiscoveryDaily think about case law developments in 2010.  This is our first annual “EDDies” award to select the case with the most significant impact on eDiscovery practices in 2010.  No cash or prizes being awarded, or even a statuette, but a chance to see what the readers think was the most important case of the year from an eDiscovery standpoint.

We have “nominated” five cases below, which we feel were the most significant in different issues of case law, including duty to preserve and sanctions, clawback agreements under Federal Rule of Evidence 502, not reasonably accessible arguments and discoverability of social media content.  We have a link to review more information about each case, and a link at the bottom of this post to cast your vote.

Very Important!  You can vote anonymously, so we’re not using this as a “hook” to get your information.  You can click on the link at the bottom, select your case and be done with it.  However, we would welcome your comments as to why you selected the case you did and you can – optionally – identify yourself as well.  eDiscoveryDaily will publish selected comments to reflect opinion of the voters as well as the vote results on February 7.  Click here to cast your vote now!

So, here are the cases:

Duty to Preserve/Sanctions

  • The Pension Committee of the University of Montreal Pension Plan v. Banc of America Securities, LLC, 2010 U.S. Dist. Lexis 4546 (S.D.N.Y. Jan. 15, 2010) (as amended May 28, 2010) – “Pension Committee”: The case that defined negligence, gross negligence, and willfulness in the electronic discovery context and demonstrated the consequences (via sanctions) resulting from those activities.  Judge Shira Scheindlin titled her 85-page opinion “Zubulake Revisited: Six Years Later”.  For more on this case, click here.
  • Victor Stanley, Inc. v. Creative Pipe, Inc., 2010 WL 3530097 (D. Md. 2010) – “Victor Stanley II”: The case of “the gang that couldn’t spoliate straight”, where one of the defendants faced imprisonment for up to 2 years (subsequently set aside on appeal) and the opinion included a 12-page chart delineating the preservation and spoliation standards in each judicial circuit.  For more on this case, click here and here.

Clawback Agreements

  • Rajala v. McGuire Woods LLP, 2010 WL 2949582 (D. Kan. July 22, 2010) – “Rajala”: The case that addressed the applicability of Federal Rule of Evidence 502(d) and (e) for “clawback” provisions for inadvertently produced privileged documents.  For more on this case, click here.

Not Reasonably Accessible

  • Major Tours, Inc. v. Colorel, 2010 WL 2557250 (D.N.J. June 22, 2010) – “Major Tours”: The case that established a precedent that a party may obtain a Protective Order relieving it of the duty to access backup tapes, even when that party’s failure to issue a litigation hold caused the data not to be available via any other means.  For more on this case, click here.

Social Media Discovery

  • Crispin v. Christian Audigier Inc., 2010 U.S. Dist. Lexis 52832 (C.D. Calif. May 26, 2010) – “Crispin”: The case that used a 24-year-old law (The Stored Communications Act of 1986) to address whether ‘private’ data on social networks is discoverable.  For more on this case, click here.

If you feel that some other case was the most significant case of 2010, you can select that case instead.  Other notable cases include:

  • Rimkus v. Cammarata, 2010 WL 645253 (S.D. Tex. Feb. 19, 2010): Where District Court Judge Lee Rosenthal examined spoliation laws of each of the 13 Federal Circuit Courts of Appeal.
  • Orbit One Communications Inc. v. Numerex Corp., 2010 WL 4615547 (S.D.N.Y. Oct. 26, 2010): Magistrate Judge James C. Francis concluded that sanctions for spoliation must be based on the loss of at least some information relevant to the dispute (differing with “Pension Committee” in this manner).
  • DeGeer v. Gillis, 2010 U.S. Dist. Lexis 97457(N.D. Ill. Sept. 17, 2010): Demonstration of inadvertent disclosure made FRE 502(d) effective, negating waiver of privilege.
  • Takeda Pharmaceutical Co., Ltd. v. Teva Pharmaceuticals USA, Inc., 2010 WL 2640492 (D. Del. June 21, 2010): Defendants’ motion to compel the production of ESI for a period of 18 years was granted, with imposed cost-shifting.
  • E.E.O.C. v. Simply Storage Management, LLC, 2010 U.S. Dist. Lexis 52766 (S.D. Ind. May 11, 2010): EEOC is ordered to produce certain social networking communications.
  • McMillen v. Hummingbird Speedway, Inc., No. 113-2010 CD (C.P. Jefferson, Sept. 9, 2010): Motion to Compel discovery of social network account log-in names and passwords was granted.

Click here to cast your vote now!  Results will be published in eDiscoveryDaily on February 7.

The success of this ‘little experiment’ will determine whether next year there is a second annual “EDDies” award.  😉

And, as always, please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Searching: For Defensible Searching, Be a "STARR"

 

Defensible searching has become a priority in eDiscovery as parties in several cases have experienced significant consequences (including sanctions) for not implementing a defensible search strategy in responding to discovery requests.

Probably the most famous case where search approach has been an issue was Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251 (D. Md. 2008), where Judge Paul Grimm noted that “the only prudent way to test the reliability of the keyword search is to perform some appropriate sampling of the documents” and found that privilege on 165 inadvertently produced documents was waived, in part, because of the inadequacy of the search approach.

A defensible search strategy is partly about using an effective tool (with advanced search capabilities such as “fuzzy”, wildcard, synonym and proximity searching) and partly about using an effective approach to test and verify search results.

I have an acronym that I use to reflect the defensible search process.  I call it “STARR” – as in “STAR” with an extra “R” or Green Bay Packer football legend Bart Starr (sorry, Bears fans!).  For each search that you need to conduct, here’s how it goes:

  • Search: Construct the best search you can to maximize recall and precision for the desired result.  An effective tool gives you more options for constructing a more effective search, which should help in maximizing recall and precision.  For example, as noted on this blog a few days ago, a proximity search can, under the right circumstances, provide a more precise search result without sacrificing recall.
  • Test: Once you’ve conducted the search, it’s important to test two datasets to determine the effectiveness of the search:
    • Result Set: Test the result set by randomly selecting an appropriate sample percentage of the files and reviewing those to determine their responsiveness to the intent of the search.  The appropriate percentage of files to be reviewed depends on the size of the result set – the smaller the set, the higher the percentage of it that should be reviewed.
    • Files Not Retrieved: While testing the result set is important, it is also important to randomly select an appropriate sample percentage of the files that were not retrieved in the search and review those as well to see if any responsive hits are identified as missed by the search.
  • Analyze: Analyze the results of the random sample testing of both the result set and the files not retrieved to determine how effective the search was in retrieving mostly responsive files and whether any responsive files were missed by the search (a sketch of this estimation follows the list).
  • Revise: If the search retrieved a low percentage of responsive files and retrieved a high percentage of non-responsive files, then precision of the search may need to be improved.  If the files not retrieved contained any responsive files, then recall of the search may need to be improved.  Evaluate the results and see what, if any, revisions can be made to the search to improve precision and/or recall.
  • Repeat: Once you’ve identified revisions you can make to your search, repeat the process.  Search, Test, Analyze and (if necessary) Revise the search again until the precision and recall of the search is maximized to the extent possible.
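
The Test and Analyze steps boil down to simple arithmetic on the two samples.  Here is a minimal sketch (referenced in the Analyze step above) that estimates precision and recall from reviewer calls on random samples of the result set and of the files not retrieved; the sample sizes and counts are hypothetical, not drawn from any actual matter.

```python
import random

def draw_sample(doc_ids, sample_size, seed=42):
    """Randomly select documents from a set for manual review."""
    rng = random.Random(seed)
    doc_ids = list(doc_ids)
    return rng.sample(doc_ids, min(sample_size, len(doc_ids)))

def estimate_precision_recall(retrieved_calls, not_retrieved_calls,
                              n_retrieved, n_not_retrieved):
    """Estimate precision and recall from reviewer calls on two random samples.

    Each *_calls list holds booleans: True means the reviewer judged the
    sampled document responsive.
    """
    # Precision: share of the retrieved sample judged responsive.
    precision = sum(retrieved_calls) / len(retrieved_calls)

    # Scale the sample rates up to the full sets to estimate responsive counts.
    est_responsive_retrieved = precision * n_retrieved
    miss_rate = sum(not_retrieved_calls) / len(not_retrieved_calls)
    est_responsive_missed = miss_rate * n_not_retrieved

    recall = est_responsive_retrieved / (est_responsive_retrieved + est_responsive_missed)
    return precision, recall

# Hypothetical example: 384-document samples drawn from a 20,000-file result
# set and a 180,000-file not-retrieved set.
precision, recall = estimate_precision_recall(
    retrieved_calls=[True] * 300 + [False] * 84,
    not_retrieved_calls=[True] * 4 + [False] * 380,
    n_retrieved=20_000,
    n_not_retrieved=180_000,
)
print(f"Estimated precision: {precision:.0%}, estimated recall: {recall:.0%}")
```

If the estimated precision is low, the search needs to be narrowed; if the estimated recall is low, it needs to be broadened, which is exactly what the Revise and Repeat steps address.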

While you can’t guarantee that you will retrieve all of the responsive files or eliminate all of the non-responsive ones, a defensible approach to get as close as you can to that goal will minimize the number of files for review, potentially saving considerable costs and making you a “STARR” in the courtroom when defending your search approach.

So, what do you think?  Are you a “STARR” when it comes to defensible searching?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: EDRM Data Set for Great Test Data

 

In its almost six years of existence, the Electronic Discovery Reference Model (EDRM) Project has implemented a number of mechanisms to standardize the practice of eDiscovery.  Having worked on the EDRM Metrics project for the past four years, I have seen some of those mechanisms implemented firsthand.

One of the most significant recent accomplishments by EDRM is the EDRM Data Set.  Anyone who works with eDiscovery applications and processes understands the importance of being able to test those applications in as many ways as possible using realistic data that will illustrate expected results.  Test data is extremely useful in crafting a defensible discovery approach, enabling you to determine the expected results within those applications and processes before using them with your organization’s live data.  It can also help you identify potential anomalies (those never occur, right?) up front, so that you can proactively develop an approach to address them before encountering them in your own data.

Using public domain data from Enron Corporation (originating from the Federal Energy Regulatory Commission Enron Investigation), the EDRM Data Set Project provides industry-standard, reference data sets of electronically stored information (ESI) to test those eDiscovery applications and processes.  In 2009, the EDRM Data Set project released its first version of the Enron Data Set, comprised of Enron e-mail messages and attachments within Outlook PST files, organized in 32 zipped files.

This past November, the EDRM Data Set project launched Version 2 of the EDRM Enron Email Data Set.  Straight from the press release announcing the launch, here are some of the improvements in the newest version:

  • Larger Data Set: Contains 1,227,255 emails with 493,384 attachments (included in the emails) covering 151 custodians;
  • Rich Metadata: Includes threading information, tracking IDs, and general Internet headers;
  • Multiple Email Formats: Provision of both full and de-duplicated email in PST, MIME and EDRM XML, which allows organizations to test and compare results across formats.

The Text REtrieval Conference (TREC) Legal Track project provided input for this version of the data set, which, as noted previously on this blog, has used the EDRM data set for its research.  Kudos to John Wang, Project Lead for the EDRM Data Set Project and Product Manager at ZL Technologies, Inc., and the rest of the Data Set team for such an extensive test set collection!

So, what do you think?  Do you use the EDRM Data Set for testing your eDiscovery processes?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Searching: Proximity, Not Absence, Makes the Heart Grow Fonder

Recently, I assisted a large corporate client where there were several searches conducted across the company’s enterprise-wide document management systems (DMS) for ESI potentially responsive to the litigation.  Some of the individual searches on these systems retrieved over 200,000 files by themselves!

DMS systems are great for what they are intended to do – provide a storage archive for documents generated within the organization, version tracking of those documents and enable individuals to locate specific documents for reference or modification (among other things).  However, few of them are developed with litigation retrieval in mind.  Sure, they have search capabilities, but it can sometimes be like using a sledgehammer to hammer a thumbtack into the wall – advanced features to increase the precision of those searches may often be lacking.

Let’s say in an oil company you’re looking for documents related to “oil rights” (such as “oil rights”, “oil drilling rights”, “oil production rights”, etc.).  You could perform phrase searches, but any variations that you didn’t think of would be missed (e.g., “rights to drill for oil”, etc.).  You could perform an AND search (i.e., “oil” AND “rights”), and that could very well retrieve all of the files related to “oil rights”, but it would also retrieve a lot of files where “oil” and “rights” appear, but have nothing to do with each other.  A search for “oil” AND “rights” in an oil company’s DMS systems may retrieve every published and copyrighted document in the systems mentioning the word “oil”.  Why?  Because almost every published and copyrighted document will have the phrase “All Rights Reserved” in the document.

That’s an example of the type of issue we were encountering with some of those searches that yielded 200,000 files with hits.  And, that’s where proximity searching comes in.  Proximity searching is simply looking for two or more words that appear close to each other in the document (e.g., “oil within 5 words of rights”) – the search will only retrieve the file if those words are as close as specified to each other, in either order.  Proximity searching helped us reduce that collection to a more manageable number for review, even though the enterprise-wide document management system didn’t have a proximity search feature.

How?  We wound up taking a two-step approach to get the collection to a more likely responsive set.  First, we did the “AND” search in the DMS system, understanding that we would retrieve a large number of files, and exported those results.  After indexing them with a first pass review tool that has more precise search alternatives (at Trial Solutions, we use FirstPass™, powered by Venio FPR™, for first pass review), we performed a second search on the set using proximity searching to limit the result set to only files where the terms were near each other.  Then, we tested the results and revised where necessary to retrieve a result set that maximized both recall and precision.

The result?  We were able to reduce an initial result set of 200,000 files to just over 5,000 likely responsive files by applying the proximity search to the first result set.  And, we probably saved $50,000 to $100,000 in review costs on a single search.

I also often use proximity searches as alternatives to phrase searches to broaden the recall of those searches to identify additional potentially responsive hits.  For example, a search for “Doug Austin” doesn’t retrieve “Austin, Doug” and a search for “Dye 127” doesn’t retrieve “Dye #127”.  One character difference is all it takes for a phrase search to miss a potentially responsive file.  With proximity searching, you can look for these terms close to each other and catch those variations.
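
Proximity matching is straightforward to approximate in code once text has been extracted, which can be a handy sanity check on a tool’s results.  Here is a minimal sketch in Python, assuming plain-text extracts and a simple tokenizer; production tools build an index rather than scanning every document, so treat this only as an illustration of the matching rule (the sample documents are invented).

```python
import re

def tokens(text):
    """Lowercase word tokens; punctuation such as '#' and ',' is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def within_n_words(text, term_a, term_b, n=5):
    """True if term_a and term_b occur within n word positions of each other, in either order."""
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(a - b) <= n for a in pos_a for b in pos_b)

docs = [
    "The lease conveys oil and gas drilling rights to the operator.",             # hit
    "Rights to drill for oil expire at the end of the lease term.",               # hit, reversed order
    "Copyright 2011 Example Oil Company of North America. All Rights Reserved.",  # no hit
    "Please route this to Austin, Doug before the Dye #127 results go out.",
]

print([within_n_words(d, "oil", "rights") for d in docs])   # [True, True, False, False]
print(within_n_words(docs[3], "doug", "austin", n=2))       # True: catches "Austin, Doug"
print(within_n_words(docs[3], "dye", "127", n=2))           # True: '#' doesn't block the match
```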

So, what do you think?  Do you use proximity searching in your culling for review?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Trends: 2011 Predictions — By The Numbers

 

Comedian Nick Bakay always ends his Tale of the Tape skits, where he compares everything from Married vs. Single to Divas vs. Hot Dogs, with the phrase “It's all so simple when you break things down scientifically.”

The late December/early January time frame is always when various people in eDiscovery make their annual predictions as to what trends to expect in the coming year.  We’ll have some of our own in the next few days (hey, the longer we wait, the more likely we are to be right!).  However, before stating those predictions, I thought we would take a look at other predictions and see if we could spot some common trends among them, so I “googled” for 2011 eDiscovery predictions and organized the predictions I found into common themes.  I found serious predictions here, here, here, here and here.  Oh, also here and here.

A couple of quick comments: 1) I had NO IDEA how many times predictions are re-posted by other sites, so it took some work to isolate each unique set of predictions.  I even found two sets of predictions from ZL Technologies, one with twelve predictions and another with seven, so I had to pick one set and I chose the one with seven (sorry, eWEEK!).  If I have failed to accurately attribute the original source for a set of predictions, please feel free to comment.  2) This is probably not an exhaustive list of predictions (I have other duties in my “day job”, so I couldn’t search forever), so I apologize if I’ve left anybody’s published predictions out.  Again, feel free to comment if you’re aware of other predictions.

Here are some of the common themes:

  • Cloud and SaaS Computing: Six out of seven “prognosticators” indicated that adoption of Software as a Service (SaaS) “cloud” solutions will continue to increase, which will become increasingly relevant in eDiscovery.  No surprise here, given last year’s IDC forecast for SaaS growth and many articles addressing the subject, including a few posts right here on this blog.
  • Collaboration/Integration: Six out of seven “augurs” also had predictions related to various themes associated with collaboration (more collaboration tools, greater legal/IT coordination, etc.) and integration (greater focus by software vendors on data exchange with other systems, etc.).  Two people specifically noted an expectation of greater eDiscovery integration within organization governance, risk management and compliance (GRC) processes.
  • In-House Discovery: Five “pundits” forecasted eDiscovery functions and software will continue to be brought in-house, especially on the “left-side of the EDRM model” (Information Management).
  • Diverse Data Sources: Three “soothsayers” presaged that sources of data will continue to be more diverse, which shouldn’t be a surprise to anyone, given the popularity of gadgets and the rise of social media.
  • Social Media: Speaking of social media, three “prophets” (yes, I’ve been consulting my thesaurus!) expect social media to continue to be a big area to be addressed for eDiscovery.
  • End to End Discovery: Three “psychics” also predicted that there will continue to be more single-source end-to-end eDiscovery offerings in the marketplace.

The “others receiving votes” category (two predicting each of these) included maturing and acceptance of automated review (including predictive coding), early case assessment moving toward the Information Management stage, consolidation within the eDiscovery industry, more focus on proportionality, maturing of global eDiscovery and predictive/disruptive pricing.

Predictive/disruptive pricing (via the respective blogs of Kriss Wilson of Superior Document Services and Charles Skamser of eDiscovery Solutions Group) is a particularly intriguing prediction to me because data volumes are continuing to grow at an astronomical rate, so greater volumes lead to greater costs.  Creativity will be key in how companies deal with the larger volumes effectively, and pressures will become greater for providers (even, dare I say, review attorneys) to price their services more creatively.

Another interesting prediction (via ZL Technologies) is that “Discovery of Databases and other Structured Data will Increase”, which is something I’ve expected to see for some time.  I hope this is finally the year for that.

Finally, I said that I found serious predictions and analyzed them; however, there are a couple of not-so-serious sets of predictions here and here.  My favorite prediction is from The Posse List, as follows: “LegalTech…renames itself “EDiscoveryTech” after Law.com survey reveals that of the 422 vendors present, 419 do e-discovery, and the other 3 are Hyundai HotWheels, Speedway Racers and Convert-A-Van who thought they were at the Javits Auto Show.”

So, what do you think?  Care to offer your own “hunches” from your crystal ball?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Trends: Predictive Coding Strategy and Survey Results

Yesterday, we introduced the Virtual LegalTech online educational session Frontiers of E-Discovery: What Lawyers Need to Know About “Predictive Coding” and defined predictive coding while also noting the two “learning” methods that most predictive coding mechanisms use to predict document classifications.  To get background information regarding the session, including information about the speakers (Jason Baron, Maura Grossman and Bennett Borden), click here.

The session also focused on strategies for using predictive coding and results of the TREC 2010 Legal Track Learning Task on the effectiveness of “Predictive Coding” technologies.  Strategies discussed by Bennett Borden include:

  • Understand the technology used by a particular provider: Not only will supervised and active learning mechanisms often yield different results, but there are differing technologies within each of these learning mechanisms.
  • Understand the state of the law regarding predictive coding technology: So far, there is no case law available regarding use of this technology and, while it may eventually be the future of document review, that has yet to be established.
  • Obtain buy-in from the requesting party to use predictive coding technology: It’s much easier when the requesting party has agreed to your proposed approach and that agreement is included in an order of the court which covers the approach and also includes a FRE 502 “clawback” agreement and order.  To have a chance to obtain that buy-in and agreement, you’ll need a diligent approach that includes “tiering” of the collection by probable responsiveness and appropriate sampling of each tier level (a simple sketch of such tiering follows this list).
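
To make the “tiering” idea in the last bullet concrete, here is a minimal sketch; the probability scores are assumed to come from whatever predictive coding tool is in use, and the cutoffs and sample size are purely illustrative.

```python
import random

def tier_by_score(doc_ids, scores, high_cutoff=0.8, low_cutoff=0.5):
    """Bucket documents into tiers by their predicted probability of responsiveness."""
    tiers = {"high": [], "medium": [], "low": []}
    for doc_id, score in zip(doc_ids, scores):
        if score >= high_cutoff:
            tiers["high"].append(doc_id)
        elif score >= low_cutoff:
            tiers["medium"].append(doc_id)
        else:
            tiers["low"].append(doc_id)
    return tiers

def sample_each_tier(tiers, sample_size=200, seed=7):
    """Draw a random validation sample from every tier for attorney review."""
    rng = random.Random(seed)
    return {name: rng.sample(docs, min(sample_size, len(docs)))
            for name, docs in tiers.items()}
```

Reviewing a sample from every tier, including the low tier, is what supports the representation to the requesting party (and, if needed, the court) that the cutoffs are reasonable.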

Maura Grossman then described the TREC 2010 Legal Track Learning Task on the effectiveness of “Predictive Coding” technologies.  The team took the EDRM Enron Version 2 Dataset of 1.3 million public domain files and deduped it down to 685,000+ unique files and 5.5 GB of uncompressed data.  The team also identified eight different hypothetical eDiscovery requests for the test.

Participating predictive coding technologies were then given a “seed set” of roughly 1,000 documents that had previously been identified by TREC as responsive or non-responsive to each of the requests.  Using this information, participants were required to rank the documents in the larger collection from most likely to least likely to be responsive, and to estimate the likelihood of responsiveness as a probability for each document.  The study measured each participant’s actual recall when the top 30% of its ranked collection (roughly 200,000 files) was retrieved, and compared each participant’s predicted recall against that actual recall to determine prediction accuracy.
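
For readers who want to see the mechanics of the task, here is a minimal sketch that trains on a seed set, ranks the rest of the collection, and measures recall at a 30% cutoff.  It assumes scikit-learn with a simple TF-IDF/logistic regression model purely as a stand-in; it is not what any TREC participant actually used.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def rank_collection(seed_texts, seed_labels, collection_texts):
    """Train on the seed set, then rank the collection from most to least likely responsive."""
    vectorizer = TfidfVectorizer(min_df=2, stop_words="english")
    model = LogisticRegression(max_iter=1000)
    model.fit(vectorizer.fit_transform(seed_texts), seed_labels)  # labels: 1 responsive, 0 not
    scores = model.predict_proba(vectorizer.transform(collection_texts))[:, 1]
    return np.argsort(-scores), scores  # indices sorted by descending probability

def recall_at_cutoff(ranked_indices, true_labels, fraction=0.30):
    """Share of all truly responsive documents found in the top `fraction` of the ranking."""
    k = int(len(ranked_indices) * fraction)
    true_labels = np.asarray(true_labels)
    found = true_labels[ranked_indices[:k]].sum()
    total = true_labels.sum()
    return found / total if total else 0.0
```

Comparing the recall a tool predicted for itself against the recall actually achieved at that cutoff is what produced the prediction accuracy figures discussed below.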

The results?  Actual recall rates for all eight discovery requests ranged widely among the tools from 85.1% actual recall down to 38.2% (on individual requests, the range was even wider – as much as 82% different between the high and the low).  The prediction accuracy rates for the tools also ranged somewhat widely, from a high of 95% to a low of 42%.

Based on this study, it is clear that these technologies can differ significantly on how effective and efficient they are at correctly ranking and categorizing remaining documents in the collection based on the exemplar “seed set” of documents.  So, it’s always important to conduct sampling of both machine coded and human coded documents for quality control in any project, with or without predictive coding (we sometimes forget that human coded documents can just as often be incorrectly coded!).

For more about the TREC 2010 Legal Track study, click here.  As noted yesterday, you can also check out a replay of the session or download the slides for the presentation at the Virtual LegalTech site.

Full Disclosure: Trial Solutions provides predictive coding services using Hot Neuron LLC’s Clustify™, which categorizes documents by looking for similar documents in the exemplar set that satisfy a user-specified criteria, such as a minimum conceptual similarity or near-duplicate percentage.

So, what do you think?  Have you used predictive coding on a case?  Please share any comments you might have or if you’d like to know more about a particular topic.