Searching

eDiscovery Best Practices: Testing Your Search Using Sampling

Friday, we talked about how to determine an appropriate sample size to test your search results as well as the items NOT retrieved by the search, using a site that provides a sample size calculator.  Yesterday, we talked about how to make sure the sample size is randomly selected.

Today, we’ll walk through an example of how you can test and refine a search using sampling.

TEST #1: Let’s say we’re at an oil company looking for documents related to oil rights.  To be as inclusive as possible, we search for “oil” AND “rights”.  Here is the result:

  • Files retrieved with “oil” AND “rights”: 200,000
  • Files NOT retrieved with “oil” AND “rights”: 1,000,000

Using the sample size calculator site we identified before, we determine a sample size of 662 for the retrieved files and 664 for the NOT retrieved files to achieve a 99% confidence level with a margin of error of 5%.  We then use the random number generator site from yesterday to select the items and proceed to review each selected item in the retrieved and NOT retrieved sets to determine responsiveness to the case.  Here are the results:

  • Retrieved Items: 662 reviewed, 24 responsive, 3.6% responsive rate.
  • NOT Retrieved Items: 664 reviewed, 661 non-responsive, 99.5% non-responsive rate.

Nearly every item in the NOT retrieved category was non-responsive, which is good.  But, only 3.6% of the retrieved items were responsive, which means our search was WAY over-inclusive.  At that rate, 192,800 out of 200,000 files retrieved will be NOT responsive and will be a waste of time and resources to review.  Why?  Because, as we determined during the review, almost every published and copyrighted document in our oil company contains the phrase “All Rights Reserved” and will therefore be retrieved.
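
To make the over-inclusiveness concrete, here is a minimal sketch (in Python, using only the figures above) of how the sample rates project onto the full retrieved and NOT retrieved sets; note that the 192,800 figure above comes from rounding the responsive rate to 3.6%:

  # Projecting the Test #1 sample rates onto the full populations (figures from above)
  retrieved, not_retrieved = 200_000, 1_000_000

  resp_rate_retrieved = 24 / 662       # ~3.6% of the sampled retrieved items were responsive
  resp_rate_not_retrieved = 3 / 664    # ~0.5% of the sampled NOT retrieved items were responsive

  est_responsive_retrieved = retrieved * resp_rate_retrieved           # ~7,251 files
  est_nonresponsive_retrieved = retrieved - est_responsive_retrieved   # ~192,749 files reviewed for nothing
  est_responsive_missed = not_retrieved * resp_rate_not_retrieved      # ~4,518 responsive files left behind

  print(round(est_responsive_retrieved), round(est_nonresponsive_retrieved), round(est_responsive_missed))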

TEST #2: Let’s try again.  This time, we’ll conduct a phrase search for “oil rights” (which requires those words as an exact phrase).  Here is the result:

  • Files retrieved with “oil rights”: 1,500
  • Files NOT retrieved with “oil rights”: 1,198,500

This time, we determine a sample size of 461 for the retrieved files and (again) 664 for the NOT retrieved files to achieve a 99% confidence level with a margin of error of 5%.  Even though we still have a sample size of 664 for the NOT retrieved files, we generate a new list of random numbers to select those items for review, as well as the 461 randomly selected retrieved items.  Here are the results:

  • Retrieved Items: 461 reviewed, 435 responsive, 94.4% responsive rate.
  • NOT Retrieved Items: 664 reviewed, 523 non-responsive, 78.8% non-responsive rate.

Nearly every item in the retrieved category was responsive, which is good.  But, only 78.8% of the NOT retrieved items were non-responsive, which means over 20% of the NOT retrieved items were actually responsive to the case (we also failed to retrieve 8 of the items identified as responsive in the first iteration).  So, now what?

TEST #3: If you saw this previous post, you know that proximity searching is a good alternative for finding hits that are close to each other without requiring the exact phrase.  So, this time, we’ll conduct a proximity search for “oil within 5 words of rights”.  Here is the result:

  • Files retrieved with “oil within 5 words of rights”: 5,700
  • Files NOT retrieved with “oil within 5 words of rights”: 1,194,300
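
By the way, exactly how a proximity operator works varies from one search tool to another (some count intervening words, some care about term order), but conceptually it just checks whether the two terms occur close together.  Here is a rough, tool-agnostic sketch of that idea in Python – a simplification, not the syntax of any particular product:

  import re

  def within_n_words(text, term1, term2, n=5):
      # Return True if term1 and term2 appear within n word positions of each other
      words = re.findall(r"[a-z']+", text.lower())
      pos1 = [i for i, w in enumerate(words) if w == term1]
      pos2 = [i for i, w in enumerate(words) if w == term2]
      return any(abs(i - j) <= n for i in pos1 for j in pos2)

  print(within_n_words("the oil and mineral rights to this land", "oil", "rights"))  # True
  print(within_n_words("All Rights Reserved.  This report covers our planned expansion of oil production.", "oil", "rights"))  # False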

This time, we determine a sample size of 595 for the retrieved files and (once again) 664 for the NOT retrieved files, generating a new list of random numbers for both sets of items.  Here are the results:

  • Retrieved Items: 595 reviewed, 542 responsive, 91.1% responsive rate.
  • NOT Retrieved Items: 664 reviewed, 655 non-responsive, 98.6% non-responsive rate.

Over 90% of the items in the retrieved category were responsive AND nearly every item in the NOT retrieved category was non-responsive, which is GREAT.  Also, all but one of the items previously identified as responsive was retrieved.  So, this is a search that appears to maximize recall and precision.

Had we proceeded with the original search, we would have reviewed 200,000 files – 192,800 of which would have been NOT responsive to the case.  By testing and refining, we only had to review 8,815 files – 3,710 sample files reviewed plus the remaining retrieved items from the third search (5,700 – 595 = 5,105) – most of which ARE responsive to the case.  We saved tens of thousands of dollars in review costs while still retrieving most of the responsive files, using a defensible approach.
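
For anyone who wants to check the math, here is the review-volume arithmetic as a quick Python sketch (the counts come from the tests above; the dollar savings will depend on your own per-document review cost):

  # Sample reviews performed across the three iterations (from the figures above)
  sample_reviews = (662 + 664) + (461 + 664) + (595 + 664)   # = 3,710

  # Remaining retrieved items from the third search, net of its already-reviewed sample
  remaining_retrieved = 5_700 - 595                           # = 5,105

  total_reviewed = sample_reviews + remaining_retrieved       # = 8,815
  review_avoided = 200_000 - total_reviewed                   # ~191,000 files we never had to review

  print(sample_reviews, remaining_retrieved, total_reviewed, review_avoided)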

Keep in mind that this is a simple example — we’re not taking into account misspellings and other variations we may want to include in our criteria.

So, what do you think?  Do you use sampling to test your search results?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: A “Random” Idea on Search Sampling

Friday, we talked about how to determine an appropriate sample size to test your search results as well as the items NOT retrieved by the search, using a site that provides a sample size calculator.  Today, we’ll talk about how to make sure the sample size is randomly selected.

A randomly selected sample gives each file an equal chance of being reviewed and eliminates the chance that bias introduced into the sample might skew the results.  Merely selecting the first or last x number of items (or any other group) in the set may not reflect the population as a whole – for example, all of those items could come from a single custodian.  To ensure a fair, defensible sample, it needs to be selected randomly.

So, how do you select the numbers randomly?  Once again, the Internet helps us out here.

One site, Random.org, has a random integer generator which will randomly generate whole numbers.  You simply supply the number of random integers you need, along with the starting and ending numbers of the range within which the generated numbers should fall.  The site will then generate a list of numbers that you can copy and paste into a text file or even a spreadsheet.  The site also offers an Advanced mode, with options for the number format (e.g., decimal, hexadecimal), the output format and how the randomization is ‘seeded’ (to generate the numbers).

In the example from Friday, you would request 660 random integers with a starting number of 1 and an ending number of 100,000 to get a list of random numbers for testing the search that yielded 100,000 files with hits (and 664, 1 and 1,000,000, respectively, to get a list of numbers to test the non-hits).  You could paste the numbers into a spreadsheet, sort them, retrieve the files at those positions in the result set and review each of them to determine whether they reflect the intent of the search.  You’ll then have a good sense of how effective your search was, based on the random sample.  And, probably more importantly, using that random sample to test your search results will be a highly defensible method to verify your approach in court.
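
Random.org generates its numbers from atmospheric noise; if you would rather produce the list locally, a pseudo-random alternative (not the site’s method, just an equivalent sketch in Python) looks like this:

  import random

  # Pick 660 unique positions from the 100,000 files with hits,
  # and 664 unique positions from the 1,000,000 files without hits.
  hit_sample = sorted(random.sample(range(1, 100_001), 660))
  non_hit_sample = sorted(random.sample(range(1, 1_000_001), 664))

  print(hit_sample[:10])      # the first few positions to pull for review
  print(non_hit_sample[:10])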

Tomorrow, we'll walk through a sample iteration to show how the sampling will ultimately help us refine our search.

So, what do you think?  Do you use sampling to test your search results?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: Determining Appropriate Sample Size to Test Your Search

We’ve talked about searching best practices quite a bit on this blog.  One part of searching best practices (as part of the “STARR” approach I described in an earlier post) is to test your search results (both the result set and the files not retrieved) to determine whether the search you performed is effective at maximizing both precision and recall to the extent possible, so that you retrieve as many responsive files as possible without having to review too many non-responsive files.  One question I often get is: how many files do you need to review to test the search?

If you remember from statistics class in high school or college, statistical sampling is choosing a percentage of the results population at random for inspection to gather information about the population as a whole.  This saves considerable time, effort and cost over reviewing every item in the results population and enables you to obtain a “confidence level” that the characteristics of your sample reflect the population as a whole.  Statistical sampling is used for everything from exit polls that predict elections to marketing surveys that gauge brand popularity, and it is a generally accepted method of drawing conclusions about an overall results population.  You can sample a small portion of a large set to obtain a 95% or 99% confidence level in your findings (with a margin of error, of course).

So, does that mean you have to find your old statistics book and dust off your calculator or (gasp!) slide rule?  Thankfully, no.

There are several sites that provide sample size calculators to help you determine an appropriate sample size, including this one.  You’ll simply need to identify a desired confidence level (typically 95% to 99%), an acceptable margin of error (typically 5% or less) and the population size.

So, if you perform a search that retrieves 100,000 files and you want a sample size that provides a 99% confidence level with a margin of error of 5%, you’ll need to review 660 of the retrieved files to achieve that level of confidence in your sample (only 383 files if a 95% confidence level will do).  If 1,000,000 files were not retrieved, you would only need to review 664 of the not retrieved files to achieve that same level of confidence (99%, with a 5% margin of error) in your sample.  As you can see, the sample size doesn’t need to increase much when the population gets really large and you can review a relatively small subset to understand your collection and defend your search methodology to the court.
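
If you’re curious what is going on under the hood, calculators like the one linked above typically implement Cochran’s formula with a finite population correction.  Assuming that (it is an assumption, but one that reproduces the figures quoted above), a short Python sketch looks like this:

  import math

  def sample_size(population, confidence=0.99, margin_of_error=0.05, p=0.5):
      # Cochran's formula with a finite population correction, assuming the
      # worst-case proportion p = 0.5; z-scores for the common confidence levels
      z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]
      n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)   # infinite-population sample size
      return math.ceil(n0 / (1 + (n0 - 1) / population))     # adjust for the actual population

  print(sample_size(100_000, confidence=0.99))    # 660 - retrieved files at 99% confidence
  print(sample_size(100_000, confidence=0.95))    # 383 - retrieved files at 95% confidence
  print(sample_size(1_000_000, confidence=0.99))  # 664 - NOT retrieved files at 99% confidence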

On Monday, we will talk about how to randomly select the files to review for your sample.  Same bat time, same bat channel!

So, what do you think?  Do you use sampling to test your search results?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Trends: Forbes on the Rise of Predictive Coding

First the New York Times with an article about eDiscovery, now Forbes.  Who’s next, The Wall Street Journal?  😉

Forbes published a blog post entitled E-Discovery And the Rise of Predictive Coding a few days ago.  Written by Ben Kerschberg, Founder of Consero Group LLC, it gets into some legal issues and considerations regarding predictive coding that are interesting.  For some background on predictive coding, check out our December blog posts, here and here.

First, the author provides a very brief history of document review, starting with bankers boxes and WordPerfect and “[a]fter an interim phase best characterized by simple keyword searches and optical character recognition”, it evolved to predictive coding.  OK, that’s like saying that Gone with the Wind started with various suitors courting Scarlett O’Hara and, after an interim phase best characterized by the Civil War, marriage and heartache, Rhett says to Scarlett, “Frankly, my dear, I don’t give a damn.”  A bit of an oversimplification of how review has evolved.

Nonetheless, the article gets into a couple of important legal issues raised by predictive coding.  They are:

  • Satisfying Reasonable Search Requirements: Whether counsel can utilize the benefits of predictive coding and still meet legal obligations to conduct a reasonable search for responsive documents under the federal rules.  The question is, what constitutes a reasonable search under Federal Rule 26(g)(1)(A), which requires that the responding attorney attest by signature that “with respect to a disclosure, it is complete and correct as of the time it is made”?
  • Protecting Privilege: Whether counsel can protect attorney-client privilege for their client when a privileged document is inadvertently disclosed.  Federal Rule of Evidence 502 provides that a court may order that a privilege or protection is not waived by disclosure if the disclosure was inadvertent and the holder of the privilege took reasonable steps to prevent disclosure.  Again, what’s reasonable?

The author concludes that the use of predictive coding is reasonable, because it a) makes document review more efficient by providing only those documents to the reviewer that have been selected by the algorithm; b) makes it more likely that responsive documents will be produced, saving time and resources; and c) refines relevant subsets for review, which can then be validated statistically.

So, what do you think?  Does predictive coding enable attorneys to satisfy these legal issues?   Is it reasonable?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: Does Size Matter?

I admit it, with a title like “Does Size Matter?”, I’m looking for a few extra page views….  😉

I frequently get asked how big an ESI collection needs to be to benefit from eDiscovery technology.  In a recent case, one of my clients had a fairly small collection – only about 4 GB.  But, when a judge ruled that they had to start conducting depositions in a week, they needed to review that data in a weekend.  Without FirstPass™, powered by Venio FPR™, to cull the data and OnDemand® to manage the linear review, they would not have been able to make that deadline.  So, they clearly benefited from the use of eDiscovery technology in that case.

But, if you’re not facing a tight deadline, how large does your collection need to be for the use of eDiscovery technology to provide benefits?

I recently conducted a webinar regarding the benefits of First Pass Review – aka Early Case Assessment or, as George Socha regularly points out, the more accurate term, Early Data Assessment.  One of the topics discussed in that webinar was the cost of review for each gigabyte (GB).  Extrapolated from an analysis conducted by Anne Kershaw a few years ago (and published in the Gartner report E-Discovery: Project Planning and Budgeting 2008-2011), here is a breakdown:

Estimated Cost to Review All Documents in a GB:

  • Pages per GB:                75,000
  • Pages per Document:      4
  • Documents Per GB:        18,750
  • Review Rate:                 50 documents per hour
  • Total Review Hours:       375
  • Reviewer Billing Rate:     $50 per hour

Total Cost to Review Each GB:      $18,750

Notes: The number of pages per GB can vary widely.  Pages-per-GB estimates tend to range from 50,000 to 100,000, so 75,000 pages (18,750 documents) seems an appropriate average.  Reviewing 50 documents per hour is considered a fast review rate, and $50 per hour is considered a bargain price.  eDiscovery Daily provided an earlier estimate of $16,650 per GB based on assumptions of 20,000 documents per GB and 60 documents reviewed per hour – the assumptions may change somewhat, but, either way, the cost for attorney review of each GB could be expected to range from at least $16,000 to $18,000, possibly more.

Advanced culling and searching capabilities of First Pass Review tools like FirstPass can enable you to cull out 70-80% of most collections as clearly non-responsive without having to conduct attorney review on those files.  If you have merely a 2 GB collection and assume the lowest review cost above of $16,000 per GB, the use of a First Pass Review tool to cull out 70% of the collection can save $22,400 in attorney review costs.  Is that worth it?
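
Here is the same arithmetic as a quick Python sketch – the rates are just the assumptions stated above, so plug in your own numbers:

  # Review cost per GB, using the assumptions above
  pages_per_gb = 75_000
  pages_per_doc = 4
  docs_per_gb = pages_per_gb // pages_per_doc       # 18,750 documents
  review_rate = 50                                   # documents reviewed per hour
  billing_rate = 50                                  # dollars per hour

  cost_per_gb = docs_per_gb / review_rate * billing_rate
  print(cost_per_gb)                                 # $18,750 per GB

  # Savings from culling 70% of a 2 GB collection at the low-end $16,000/GB estimate
  print(2 * 16_000 * 0.70)                           # $22,400 in attorney review costs avoided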

So, what do you think?  Do you use eDiscovery technology for only the really large cases or ALL cases?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: Is Disclosure of Search Terms Required?

I read a terrific article a couple of days ago from the New York Law Journal via Law Technology News entitled Search Terms Are More Than Mere Words, that had some interesting takes about the disclosure of search terms in eDiscovery.  The article was written by David J. Kessler, Robert D. Owen, and Emily Johnston of Fulbright & Jaworski.  The primary emphasis of the article was with regard to the forced disclosure of search terms by courts.

In the age of “meet and confer”, it has become much more common for parties to agree to exchange search terms in a case to limit costs and increase transparency.  However, as the authors correctly note, search terms reflect counsel’s strategy for the case and are, therefore, work product.  Their position is that courts should not force disclosure of search terms and that disclosure of terms is “not appropriate under the Federal Rules of Civil Procedure”.  The article provides a compelling argument as to why forced disclosure is not appropriate and offers some good case cites where courts have accepted or rejected requests to compel disclosure of search terms.  I won’t try to recap them all here – check out the article for more information.

So, should disclosure of search terms be generally required?  If not, what does that mean in terms of utilizing a defensible approach to searching?

Personally, I agree with the authors that forced disclosure of search terms is generally not appropriate, as it does reflect strategy and work product.  However, each party has an obligation to preserve, collect, review and produce all relevant, non-privileged materials to the best of its ability.  Searching is an integral part of that process.  And, the article does note that “chosen terms may come under scrutiny if there is a defect in the production”, though “[m]ere speculation or unfounded accusations” should not lead to a requirement to disclose search terms.

With that said, the biggest component of most eDiscovery collections today is email, and that email often reflects discussions between parties in the case.  In these cases, it’s much easier for opposing counsel to identify legitimate defects in the production because they have some of the same correspondence and documents and can often easily spot discrepancies in the production set.  If they identify legitimate omissions from the production, those omissions could cause the court to call into question your search procedures.  Therefore, it’s important to take a defensible approach to searching (such as the “STARR” approach I described in an earlier post) so you can defend yourself if those questions arise.  Demonstrating a defensible approach to searching offers the best chance of preserving your right to protect, as work product, the search terms that reflect your case strategy.

So, what do you think?  Do you think that forced disclosure of search terms is appropriate?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Trends: Despite What NY Times Says, Lawyers Not Going Away

There was a TV commercial in the mid-80’s where a soap opera actor delivered the line “I’m not a doctor, but I play one on TV”.  Can you remember the product it was advertising (without clicking on the link)?  If so, you win the trivia award of the day!  😉

I’m a technologist who has been working in litigation support and eDiscovery for over twenty years.  If you’ve been reading eDiscovery Daily for a while, you’ve probably noticed that I’ve written several posts regarding significant case law as it pertains to eDiscovery.  I often feel that I should offer a disclaimer before each of these posts saying “I’m not a lawyer, but I play one on the Web”.  As the disclaimer at the bottom of the page stipulates, these posts aren’t meant to provide legal advice; my intention is merely to identify cases that may be of interest to our readers, provide a basic recap of each and leave it at that.  As Clint Eastwood once said, “A man’s got to know his limitations”.

A few days ago, The New York Times published an article entitled Armies of Expensive Lawyers, Replaced by Cheaper Software which discussed how, using ‘artificial intelligence, “e-discovery” software can analyze documents in a fraction of the time for a fraction of the cost’ (extraneous comma in the title notwithstanding).  The article goes on to discuss linguistic and sociological techniques for retrieval of relevant information and discusses how the Enron Corpus, available in a number of forms, including through EDRM, has enabled software providers to make great strides in analytical capabilities by using this large base of data for testing.  It also discusses whether this will precipitate a march to the unemployment line for scores of attorneys.

A number of articles and posts since then have offered commentary as to whether that will be the case.  Technology tools will certainly reduce document populations significantly, but, as the article noted, “[t]he documents that the process kicks out still have to be read by someone”.  Not only that, the article still makes the assumption that people too often make with search technology – that it’s a “push a button and get your answer” approach to identifying relevant documents.  But, as has been noted in several cases and also here on this blog, searching is an iterative process where sampling the search results is recommended to confirm that the search maximizes recall and precision to the extent possible.  Who do you think is going to perform that sampling?  Lawyers – that’s who (working with technologists like me, of course!).  And, some searches will require multiple iterations of sampling and analysis before the search is optimized.

Therefore, while the “armies” of lawyers may not need nearly as many members of the infantry, they will still need plenty of corporals, sergeants, captains, colonels and generals.  And, for those entry-level reviewing attorneys that no longer have a place on review projects?  Well, we could always use a few more doctors on TV, right?  😉

So, what do you think?  Are you a review attorney that has been impacted by technology – positively or negatively?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Case Law: Spoliate Evidence, Don’t Go to Jail, but Pay a Million Dollars

As previously referenced in eDiscovery Daily, defendant Mark Pappas, President of Creative Pipe, Inc., was ordered by Magistrate Judge Paul W. Grimm to “be imprisoned for a period not to exceed two years, unless and until he pays to Plaintiff the attorney's fees and costs that will be awarded to Plaintiff as the prevailing party pursuant to Fed. R. Civ. P. 37(b)(2)(C).”  Judge Grimm found that “Defendants…deleted, destroyed, and otherwise failed to preserve evidence; and repeatedly misrepresented the completeness of their discovery production to opposing counsel and the Court.”

However, ruling on the defendants’ appeal, District Court Judge Marvin J. Garbis declined to adopt the order regarding incarceration, stating: “[T]he court does not find it appropriate to Order Defendant Pappas incarcerated for future possible failure to comply with his obligation to make payment of an amount to be determined in the course of further proceedings.”

So, how much is he ordered to pay?  Now we know.

On January 24, 2011, Judge Grimm entered an order awarding a total of $1,049,850.04 in “attorney’s fees and costs associated with all discovery that would not have been un[der]taken but for Defendants' spoliation, as well as the briefings and hearings regarding Plaintiff’s Motion for Sanctions.”  Judge Grimm explained that “the willful loss or destruction of relevant evidence taints the entire discovery and motions practice.”  The court found that “Defendants’ first spoliation efforts corresponded with the beginning of litigation” and that “Defendants’ misconduct affected the entire discovery process since the commencement of this case.”

As a result, the court awarded $901,553.00 in attorney’s fees and $148,297.04 in costs.  Those costs included $95,969.04 for the Plaintiff’s computer forensic consultant that was “initially hired . . . to address the early evidence of spoliation by Defendants and to prevent further destruction of data”.  The Plaintiff’s forensic consultant also provided processing services and participated in the preparation of plaintiff’s search and collection protocol, which the court found “pertained to Defendants’ spoliation efforts.”

So, what do you think?  Will the defendant pay?  Or will he be subject to possible jail time yet again?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Trends: Tom O’Connor of Gulf Coast Legal Technology Center

This is the eighth of the LegalTech New York (LTNY) Thought Leader Interview series.  eDiscoveryDaily interviewed several thought leaders at LTNY this year and asked each of them the same three questions:

  1. What do you consider to be the current significant trends in eDiscovery on which people in the industry are, or should be, focused?
  2. Which of those trends are evident here at LTNY, which are not being talked about enough, and/or what are your general observations about LTNY this year?
  3. What are you working on that you’d like our readers to know about?

Today’s thought leader is Tom O’Connor.  Tom is a nationally known consultant, speaker and writer in the area of computerized litigation support systems.  A frequent lecturer on the subject of legal technology, Tom has been on the faculty of numerous national CLE providers and has taught college level courses on legal technology.  Tom's involvement with large cases led him to become familiar with dozens of various software applications for litigation support and he has both designed databases and trained legal staffs in their use on many of the cases mentioned above. This work has involved both public and private law firms of all sizes across the nation.  Tom is the Director of the Gulf Coast Legal Technology Center in New Orleans.

What do you consider to be the current significant trends in eDiscovery on which people in the industry are, or should be, focused?

I think that there is still a lack of general baseline understanding of, not just eDiscovery principles, but technology principles.  Attorneys have been coming to LegalTech for over 30 years and have seen people like Michael Arkfeld, Browning Marean and folks like Neil Aresty, who got me started in the business.  The nouns have changed, from DOS to Windows, from paper to images, and now it’s eDiscovery.  The attorneys just haven’t been paying attention.  Bottom line is: for years and years, they didn’t care about technology.  They didn’t learn it in law school because a) they had no inclination to learn technology and b) they didn’t have any real ability to learn it, myself included.  With the exception of a few people like Craig Ball and George Socha, who are versed in the technical side of things, the average attorney is not versed at all.  So, the technology side of the litigation world consisted of the lit support people, the senior paralegals, the support staff and the IT people (to the minimal extent they assisted in litigation).  That all changed when the Federal Civil Rules changed, and it became a requirement.

So, ten years ago, if I picked up a piece of paper like this and used it as an exhibit, would the judge say “Hey, counsel, that’s quite a printout you have there, is that a Sans Serif font?  Is that 14 point or 15 point?  Did you print this on an IBM 3436?”  Of course not.  The judge would authenticate it and admit it – or not – and there might be an argument.  Now, when we go to introduce evidence, there are all sorts of questions that are technical in nature – “Where did you get that PST file?  How did that email get generated?  Did you run HASH values on that?”, etc.  And, I’m not just making this up.  If you look at decisions by Judge Grimm or Facciola or Peck or Waxse, they’re asking these questions.  Attorneys, of course, have been caught like the “deer in the headlights” in response to those questions and now they’re trying to pick up that knowledge.  If there’s one real trend I’m seeing this year, it’s that attorneys are finally taking technology seriously and trying to play catch up with their staff on understanding what all of this stuff is about.  Judges are irritated about it.  We have had major sanctions because of it.  And, if attorneys had been paying attention for the last ten years, we wouldn’t be in the mess that we are now.

Of course, some people disagree and think that the sheer volume of data we have is contributing to that, and folks like Ralph Losey, who I respect, think we should tweak the rules to change what’s relevant.  It shouldn’t be anything that reasonably could lead to something of value in the case; we should “ratchet it down” so that the volume is reduced.  My feeling on that is that we’ve got the technology tools to reduce the volume – if they’re used properly.  The tools are better now than they were three years ago, but we’ve had the tools to do that for a while.  There’s no reason for these wholesale “data dumps” that we see, and I forget whether it was Judge Grimm or Facciola who had a case where, in his opinion, he said “we’ve got to stop with these boilerplate requests for discovery and responses for requests for discovery and make them specific”.

So, that’s the trend I see, that lawyers are finally trying to take some time to try to get up to speed – whining and screaming pitifully all the way about how it’s not fair, and the sanctions are too high and there’s too much data.  Get a life, get a grip.  Use the tools that are out there that have been given to you for years.  So, if I sound cynical, it’s because I am.

Which of those trends are evident here at LTNY, which are not being talked about enough, and/or what are your general observations about LTNY this year?

{Interviewed on the final afternoon of LTNY}  Well, as always, a good show.  This year, I think it was a great show, which is actually a bit of a surprise to me.  I was worried, not that it would go down from last year, but that we had maybe flattened out because of the economy (and the weather).  But, the turnout was great, the exhibit halls were great, a lot of good information.  I think we’re seeing a couple of trends from vendors in general, especially in the eDiscovery space.  We’re seeing vendors trying to consolidate.  I think attorneys who work in this space are concerned with moving large amounts of data from one stage of the EDRM model to another.  That’s problematic because of the time and energy involved, the possible hazards and even the authentication issues.  So, the response to that is that some vendors attempt to do “end-to-end” or at least cover three out of the six stages to reduce the movement, or partner with each other with open APIs and transparent calls, so that the process is easier.

At the same time, we’re seeing the process get faster and more efficient, with increased speeds for ingestion and processing, which is great.  Maybe a bigger trend, and one that will play out as the year goes along, is a change in the pricing model, clearly getting away from per GB pricing to some other alternative such as, maybe, per case or per matter.  Because of the huge amount of data, we have to do so.  But also, we’re leaving out an area that Craig Ball addressed last year with his EDna challenge – what about the low end of the spectrum?  This is great if you’re Pillsbury or DLA Piper or Fulbright & Jaworski – they can afford Clearwell or Catalyst or Relativity and can afford to call in KPMG or Deloitte.  But, what about the smaller cases?  They can benefit from technology as well.  Craig addressed it with his EDna challenge for the $1,000 case and asked people to respond within those parameters.  Browning Marean and I were asking “what about the $500,000 case?”  Not that there’s anything bad about low end technology, you can use Adobe and S1 and some simple databases to do a great job.  But, what about in the middle, where I still can’t afford to buy Relativity and I still can’t afford to process with Clearwell?  What am I going to use?  That’s where I think new pricing and some of the new products will help.  I’ve seen some hot new products, especially cloud based products, for small firms.  That’s a big change for this year’s show, which, since it’s in New York, has been geared to big firms and big cases.

What are you working on that you’d like our readers to know about?

I think the things that excite me the most that are going on this year are the educational efforts I’m involved in.  They include Ralph Losey’s online educational series through his blog, eDiscovery Team, and Craig Ball’s work through the eDiscovery Training Academy at Georgetown Law School in June.  Both are very exciting.

And, my organization, the Gulf Coast Legal Technology Center, continues to do a lot of CLE and pro bono activities for the Mississippi and Louisiana bars, which still consist primarily of small firms.  We also continue to assist Gulf Coast firms with their technology needs as they continue to rebuild their legal technology infrastructure after Katrina.

Thanks, Tom, for participating in the interview!

And to the readers, as always, please share any comments you might have or if you’d like to know more about a particular topic!

eDiscovery Trends: Jack Halprin of Autonomy

This is the fifth of the LegalTech New York (LTNY) Thought Leader Interview series.  eDiscoveryDaily interviewed several thought leaders at LTNY this year and asked each of them the same three questions:

  1. What do you consider to be the current significant trends in eDiscovery on which people in the industry are, or should be, focused?
  2. Which of those trends are evident here at LTNY, which are not being talked about enough, and/or what are your general observations about LTNY this year?
  3. What are you working on that you’d like our readers to know about?

Today’s thought leader is Jack Halprin.  As Vice President, eDiscovery and Compliance with Autonomy, Jack serves as internal and external legal subject matter expert for best practices and defensible processes around litigation, electronic discovery, legal hold, and compliance issues. He speaks frequently on enterprise legal risk management, compliance, and eDiscovery at industry events and seminars, and has authored numerous articles on eDiscovery, legal hold, social media, and knowledge management. He is actively involved in The Sedona Conference, ACC, and Electronic Discovery Reference Model (EDRM). With a BA in Chemistry from Yale University, a JD from the University of California-Los Angeles, and certifications from the California, Connecticut, Virginia and Patent Bars, Mr. Halprin has varied expertise that lends itself well to both the legal and technical aspects of electronic discovery collection and preservation.

What do you consider to be the current significant trends in eDiscovery on which people in the industry are, or should be, focused?

If I look at the overall trends, social media and the cloud are probably the two hottest topics from a technology perspective and also a data management perspective.  From the legal perspective, you’re looking at preservation issues and sanctions as well as the idea of proportionality.  You also see a greater need for technology that can meet the needs of attorneys and understand the meaning of information.  More and more, everyone is realizing that keyword searches are lacking – they aren’t really as effective as everyone thinks they are.

We’re also starting to see two other technology related trends.  The industry is consolidating and customers are really starting to look for a single platform.  The current process of importing and exporting data from storage to legal hold collection, to early case assessment, to review and to production – creating several extra copies of the documents along the way – is not manageable going forward.  Customers want to be able to preserve in place, to analyze in place, and they don’t want to have to collect and duplicate the data again and again.  If you look at the left side of EDRM, the more proactive side, they don’t want to put data or documents in a special repository unless it’s a true record that no one needs to access on a regular basis.  They want to work with active data where it lives.

You’ll see a reduction in the number of vendors in the next year or two, and the technology will not only be able to handle the current data sources, but the increased data volumes and new types of data we’re seeing.  Everyone is looking at social media and saying “how are we going to handle this”, when it’s really just another data source that has to be addressed.  Yes, it’s challenging because there is so much of it and it is even more conversational than email, taking it to a whole new level, but it’s really no different from other data sources.  A keyword search on a social media site is not going to net you the results you’re looking for, but conceptual search to understand the context of what people mean will help you identify the relevant information.  Growth rates are predicted at more than 60 percent for unstructured information, but social media is growing at a much faster clip.  A lot of people are looking at social media and moving to the cloud to manage this data, reducing some of the infrastructure costs, taking strain off the network and reducing their IT footprint.

Which of those trends are evident here at LTNY, which are not being talked about enough, and/or what are your general observations about LTNY this year?

{Interviewed on the first afternoon of LTNY}  I’ll take it first from the Autonomy perspective.  We have social media solutions, which we’ve had for our marketing business (Interwoven) for some time.  We’ve also had social media governance technology for quite some time as well, and we announced today new capabilities for identifying, preserving and collecting social media for eDiscovery, which is part of and builds on our end-to-end solution.  I haven’t spent much time on the floor yet, but based on everything I’ve seen in the eDiscovery space, a lot of people are talking about social media, but no one really understands how to address it.  You’ve got people scraping {social media} pages, but if you scrape the page without the active link or without capturing the context behind it, you’re missing the wealth of information behind it.  We’re taking a different approach: we take the entire page, including the context and active links.

There’s also a wide disparity in terms of the cloud.  Is it public?  Is it private?  How much control do you have over your data when it’s in the cloud?  You’ve got a lot of vendors out there that aren’t transparent about their data centers.  You’ve got vendors that say they’re SAS 70 Type II certified, but it’s their data center, not the vendor itself, that is certified.  So, who’s got the experience?  Every year at LegalTech, there are probably forty new vendors out there and the next year, half or more of them are gone.

As for the tone of the show, I think it’s certainly more upbeat than last year when attendance was down, and it’s a bit more “bouncy” this year.  With that in mind, you’ll continue to see acquisitions, and you’ll have the issue of companies merged through acquisition using different technologies and different search engines, meaning they’re not on a single platform and not really offering a single solution.  So, that gets back to the idea that customers are really looking for a single platform with a single engine underneath it.  That’s how we approach it, and I think others are trying to get to that point, but I don’t think there are many vendors there yet.  That’s where the trend is heading.

What are you working on that you’d like our readers to know about?

In addition to the new social media eDiscovery capabilities described above, we’ve announced the Autonomy Chaining Console, which is a dashboard to provide corporate legal departments with greater visibility and defensibility across the entire process and to eliminate those risky data import/export handoffs through each step.  Many of the larger corporations have hundreds of cases, dozens of outside law firms, and terabytes of data to manage.  The process today is very “silo” oriented – data is sent to processing vendors, it is sent to law firms, etc.  So, you get these “weak links in the chain” where data can get lost and risks of spoliation and costs increase.  Autonomy announced the whole idea of chaining last year promoting the idea that we can seamlessly connect law firms and their corporate clients in a secure manner, so that the law firm can login to a secure portal and can manage the data that they’re allowed to access.  The Chaining Console strengthens that capability, and it adds Autonomy IDOL’s ability to understand meaning and allows corporate and outside counsel to look at the same data on the same solution.  It uses IDOL to determine potential custodians, understand fact patterns and identify other companies that may be involved by really analyzing the data and providing an understanding of what’s there.  It can also monitor and track risk, so you can set up certain policies around key issues; for example, insider trading, securities fraud, FCPA, etc.  Using those policies, it can alert you to the risks that are there and possibly identify the custodians that are engaging in risky behavior.  And, of course, it tracks the data from start to finish, giving corporate counsel, legal IT, IT, litigation support, litigation counsel as well as outside counsel a single view of the data on a single dashboard.  It strengthens our message and takes us to the next step in really providing the end-to-end platform for our clients.

We’ve also announced iManage in the cloud for legal information management.  The cloud-based Information Management platform combines WorkSite, Records Manager, Universal Search, Process Automation and ConflictsManager to help attorneys manage content throughout the matter lifecycle from inception to disposition.  It uses IDOL’s ability to group concepts, so if you have a conflict with Apple, it knows that you’re searching for terms related to Apple computer such as Mac, iPhone, Steve Jobs, Steve Wozniak and Jonathon Ives and understands that these are related terms and individuals.  We’re already managing information governance in the cloud for a lot of our clients, and the platform leverages our private cloud, which is the world’s largest private cloud with over 17 petabytes of data.

And, then we have a market leadership announcement with additional major law firms that are using our solutions, such as Brownstein Hyatt Farber Schreck LLP, Brown Rudnick LLP, Fennemore Craig, etc.  So, we have four press releases with new developments at Autonomy that we’ve announced here at the show.

Thanks, Jack, for participating in the interview!

And to the readers, as always, please share any comments you might have or if you’d like to know more about a particular topic!