Review

eDiscovery Best Practices: Your ESI Collection May Be Larger Than You Think


Here’s a sample scenario: You identify custodians relevant to the case and collect files from each.  Roughly 100 gigabytes (GB) of Microsoft Outlook email PST files and loose “efiles” are collected in total from the custodians.  You identify a vendor to process the files and load them into a review tool, so that you can perform first pass review and, eventually, linear review and produce the files to opposing counsel.  After processing, the vendor sends you a bill – and they’ve charged you to process over 200 GB!  What happened?!

Did the vendor accidentally “double-bill” you?  That would be great – but no.  There’s a much more logical explanation and, unfortunately, you may wind up paying a lot more to process these files than you expected.

Many of the files in most ESI collections are stored in what are known as “archive” or “container” files.  For example, as noted above, Outlook emails are typically saved for each custodian in a personal storage (.PST) file format, which is an expanding container file. For most custodians, all of their email (and the corresponding attachments, if present) resides in a few PST files.  The scanned size for the PST file is the size of the file on disk.

Did you ever see one of those vacuum bags that you store clothes in and then suck all the air out so that the clothes won’t take up as much space?  The PST file is like one of those vacuum bags – it typically stores the emails and attachments in a compressed format to save space.  When the emails and attachments are processed into a review tool, they are expanded to their normal size.  This expanded size can be 1.5 to 2 times the scanned size (or more).  And, that’s what many vendors will bill on – the expanded size.
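
To budget for this, a quick back-of-the-envelope calculation helps.  Here is a minimal Python sketch that applies the 1.5x to 2x rule of thumb above to an on-disk collection size; the expansion factors are assumptions, and the actual ratio depends on the data:

  # Rough estimate of the "expanded" (billable) size of a collection,
  # given its size on disk and an assumed expansion range.
  def estimate_expanded_size(scanned_gb, low_factor=1.5, high_factor=2.0):
      """Return (low, high) estimates of the expanded size in GB."""
      return scanned_gb * low_factor, scanned_gb * high_factor

  low, high = estimate_expanded_size(100)  # the 100 GB scenario above
  print(f"Expect roughly {low:.0f} to {high:.0f} GB (or more) after processing")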

There are other types of archive container files that compress their contents – .zip and .rar files are two examples of compressed container files.  These files are used not only to compress files for storage on hard drives, but also to compact or group a set of files for transmission, usually via – you guessed it – email.  With email comprising a majority of most ESI collections and the popularity of other archive container files for compressing file collections, the expanded size of your collection may be considerably larger than it appears when stored on disk.  It’s important to be prepared for that and know your options when processing that data, so you can effectively anticipate those processing costs.
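
If you would rather measure than guess, many container formats record both sizes.  As a rough illustration (not how a processing tool actually works), here is a short Python sketch that compares a .zip archive’s size on disk to the expanded size of its contents using the standard zipfile module; the archive name is hypothetical, and PST files would require a PST-aware tool instead:

  import os
  import zipfile

  def zip_expansion(path):
      """Compare a .zip container's on-disk size to the expanded size of its contents."""
      on_disk = os.path.getsize(path)
      with zipfile.ZipFile(path) as zf:
          expanded = sum(info.file_size for info in zf.infolist())
      return on_disk, expanded

  on_disk, expanded = zip_expansion("custodian_files.zip")  # hypothetical archive
  print(f"{on_disk:,} bytes on disk expands to {expanded:,} bytes ({expanded / on_disk:.1f}x)")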

So, what do you think?  Have you ever been surprised by processing costs of your ESI?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: Testing Your Search Using Sampling

Friday, we talked about how to determine an appropriate sample size to test your search results as well as the items NOT retrieved by the search, using a site that provides a sample size calculator.  Yesterday, we talked about how to make sure the sample size is randomly selected.

Today, we’ll walk through an example of how you can test and refine a search using sampling.

TEST #1: Let’s say in an oil company we’re looking for documents related to oil rights.  To try to be as inclusive as possible, we will search for “oil” AND “rights”.  Here is the result:

  • Files retrieved with “oil” AND “rights”: 200,000
  • Files NOT retrieved with “oil” AND “rights”: 1,000,000

Using the sample size calculator site we identified before, we determine a sample size of 662 for the retrieved files and 664 for the NOT retrieved files to achieve a 99% confidence level with a margin of error of 5%.  We then use this site to generate random numbers and proceed to review each item in the retrieved and NOT retrieved sets to determine responsiveness to the case.  Here are the results:

  • Retrieved Items: 662 reviewed, 24 responsive, 3.6% responsive rate.
  • NOT Retrieved Items: 664 reviewed, 661 non-responsive, 99.5% non-responsive rate.

Nearly every item in the NOT retrieved category was non-responsive, which is good.  But, only 3.6% of the retrieved items were responsive, which means our search was WAY over-inclusive.  At that rate, 192,800 out of 200,000 files retrieved will be NOT responsive and will be a waste of time and resources to review.  Why?  Because, as we determined during the review, almost every published and copyrighted document in our oil company has the phrase “All Rights Reserved” in it and will be retrieved.

TEST #2: Let’s try again.  This time, we’ll conduct a phrase search for “oil rights” (which requires those words as an exact phrase).  Here is the result:

  • Files retrieved with “oil rights”: 1,500
  • Files NOT retrieved with “oil rights”: 1,198,500

This time, we determine a sample size of 461 for the retrieved files and (again) 664 for the NOT retrieved files to achieve a 99% confidence level with a margin of error of 5%.  Even though we still have a sample size of 664 for the NOT retrieved files, we generate a new list of random numbers to review those items, as well as the 461 randomly selected retrieved items.  Here are the results:

  • Retrieved Items: 461 reviewed, 435 responsive, 94.4% responsive rate.
  • NOT Retrieved Items: 664 reviewed, 523 non-responsive, 78.8% non-responsive rate.

Nearly every item in the retrieved category was responsive, which is good.  But, only 78.8% of the NOT retrieved items were non-responsive, which means over 20% of the NOT retrieved items were actually responsive to the case (we also failed to retrieve 8 of the items identified as responsive in the first iteration).  So, now what?

TEST #3: If you saw this previous post, you know that proximity searching is a good alternative for finding hits that are close to each other without requiring the exact phrase.  So, this time, we’ll conduct a proximity search for “oil within 5 words of rights”.  Here is the result:

  • Files retrieved with “oil within 5 words of rights”: 5,700
  • Files NOT retrieved with “oil within 5 words of rights”: 1,194,300
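
Before looking at the sample results, here is a minimal Python sketch of what a “within 5 words” test might look like on a single document’s text.  It is a simplified stand-in for a real search engine’s proximity operator (the tokenizer is deliberately crude), but it shows why this search catches variations of the phrase while skipping the copyright boilerplate that sank Test #1:

  import re

  def within_n_words(text, term_a, term_b, n=5):
      """True if term_a and term_b occur within n words of each other."""
      words = re.findall(r"[a-z]+", text.lower())
      pos_a = [i for i, w in enumerate(words) if w == term_a]
      pos_b = [i for i, w in enumerate(words) if w == term_b]
      return any(abs(a - b) <= n for a in pos_a for b in pos_b)

  # Caught even though the exact phrase "oil rights" never appears:
  print(within_n_words("The lease conveys rights to drill for oil", "oil", "rights"))  # True
  # Typical copyright boilerplate keeps the two terms more than 5 words apart:
  print(within_n_words("All rights reserved. This report covers regional oil output", "oil", "rights"))  # False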

This time, we determine a sample size of 595 for the retrieved files and (once again) 664 for the NOT retrieved files, generating a new list of random numbers for both sets of items.  Here are the results:

  • Retrieved Items: 595 reviewed, 542 responsive, 91.1% responsive rate.
  • NOT Retrieved Items: 664 reviewed, 655 non-responsive, 98.6% non-responsive rate.

Over 90% of the items in the retrieved category were responsive AND nearly every item in the NOT retrieved category was non-responsive, which is GREAT.  Also, all but one of the items previously identified as responsive was retrieved.  So, this is a search that appears to maximize recall and precision.

Had we proceeded with the original search, we would have reviewed 200,000 files – 192,800 of which would have been NOT responsive to the case.  By testing and refining, we only had to review 8,815 files – 3,710 sample files reviewed plus the remaining retrieved items from the third search (5,700 – 595 = 5,105) – most of which ARE responsive to the case.  We saved tens of thousands of dollars in review costs while still retrieving most of the responsive files, using a defensible approach.
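
For anyone who wants to check the arithmetic, here is a short Python sketch that recomputes the responsive rates from the three retrieved-set samples and the total number of files reviewed under the iterative approach (all figures are the ones used in this example):

  # Responsive-rate estimates from each test's retrieved-set sample
  tests = {
      "Test 1 (oil AND rights)":      (662, 24),
      "Test 2 (phrase 'oil rights')": (461, 435),
      "Test 3 (oil w/5 of rights)":   (595, 542),
  }
  for name, (reviewed, responsive) in tests.items():
      print(f"{name}: {responsive / reviewed:.1%} responsive")

  # Total documents reviewed with the iterative approach vs. the first search
  sample_reviews = (662 + 664) + (461 + 664) + (595 + 664)  # 3,710 sampled items
  remaining_hits = 5_700 - 595                              # 5,105 other Test 3 hits
  print(sample_reviews + remaining_hits)                    # 8,815, versus 200,000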

Keep in mind that this is a simple example — we’re not taking into account misspellings and other variations we may want to include in our criteria.

So, what do you think?  Do you use sampling to test your search results?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: A “Random” Idea on Search Sampling


Friday, we talked about how to determine an appropriate sample size to test your search results as well as the items NOT retrieved by the search, using a site that provides a sample size calculator.  Today, we’ll talk about how to make sure the sample size is randomly selected.

A randomly selected sample gives each file an equal chance of being reviewed and eliminates the chance of bias being introduced into the sample which might skew the results.  Merely selecting the first or last x number of items (or any other group) in the set may not reflect the population as a whole – for example, all of those items could come from a single custodian.  To ensure a fair, defensible sample, it needs to be selected randomly.

So, how do you select the numbers randomly?  Once again, the Internet helps us out here.

One site, Random.org, has a random integer generator which will randomly generate whole numbers.  You simply supply the number of random integers you need and the starting and ending numbers of the range within which they should fall.  The site will then generate a list of numbers that you can copy and paste into a text file or even a spreadsheet.  The site also provides an Advanced mode, with options for the number format (e.g., decimal, hexadecimal), the output format and how the randomization is ‘seeded’.

In the example from Friday, your search yielded 100,000 files with hits, so you would provide 660 as the number of random integers to be generated, with a starting number of 1 and an ending number of 100,000 (664, 1 and 1,000,000, respectively, to get a list of numbers to test the non-hits).  You could paste the numbers into a spreadsheet, sort them, retrieve the files by their position in the result set based on the random numbers generated and review each of them to determine whether they reflect the intent of the search.  You’ll then have a good sense of how effective your search was, based on the random sample.  And, probably more importantly, using that random sample to test your search results will be a highly defensible method to verify your approach in court.
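
If you would rather script the selection than use the website, the same idea is only a few lines in most languages.  Here is a minimal Python sketch that picks distinct 1-based positions to review; Python’s standard random module stands in for Random.org’s generator, and you can pass a seed if you need to reproduce the list later:

  import random

  def pick_sample_positions(population_size, sample_size, seed=None):
      """Randomly choose distinct 1-based positions in the result set to review."""
      rng = random.Random(seed)
      return sorted(rng.sample(range(1, population_size + 1), sample_size))

  hits_to_review = pick_sample_positions(100_000, 660)        # sample of the files with hits
  non_hits_to_review = pick_sample_positions(1_000_000, 664)  # sample of the non-hits
  print(hits_to_review[:10])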

Tomorrow, we'll walk through a sample iteration to show how the sampling will ultimately help us refine our search.

So, what do you think?  Do you use sampling to test your search results?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: Determining Appropriate Sample Size to Test Your Search


We’ve talked about searching best practices quite a bit on this blog.  One part of searching best practices (as part of the “STARR” approach I described in an earlier post) is to test your search results (both the result set and the files not retrieved) to determine whether the search you performed is effective at maximizing both precision and recall to the extent possible, so that you retrieve as many responsive files as possible without having to review too many non-responsive files.  One question I often get is: how many files do you need to review to test the search?

If you remember from statistics class in high school or college, statistical sampling is choosing a percentage of the results population at random for inspection to gather information about the population as a whole.  This saves considerable time, effort and cost over reviewing every item in the results population and enables you to obtain a “confidence level” that the characteristics of your sample reflect those of the population as a whole.  Statistical sampling is used for everything from exit polls that predict elections to marketing surveys that gauge brand popularity, and it is a generally accepted method of drawing conclusions about an overall results population.  You can sample a small portion of a large set to obtain a 95% or 99% confidence level in your findings (with a margin of error, of course).

So, does that mean you have to find your old statistics book and dust off your calculator or (gasp!) slide rule?  Thankfully, no.

There are several sites that provide sample size calculators to help you determine an appropriate sample size, including this one.  You’ll simply need to identify a desired confidence level (typically 95% to 99%), an acceptable margin of error (typically 5% or less) and the population size.

So, if you perform a search that retrieves 100,000 files and you want a sample size that provides a 99% confidence level with a margin of error of 5%, you’ll need to review 660 of the retrieved files to achieve that level of confidence in your sample (only 383 files if a 95% confidence level will do).  If 1,000,000 files were not retrieved, you would only need to review 664 of the not retrieved files to achieve that same level of confidence (99%, with a 5% margin of error) in your sample.  As you can see, the sample size doesn’t need to increase much when the population gets really large and you can review a relatively small subset to understand your collection and defend your search methodology to the court.
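
If you are curious where figures like 660, 664 and 383 come from, a standard sample size calculation (Cochran’s formula with a finite population correction, assuming a worst-case 50/50 responsive split) reproduces them.  Here is a minimal Python sketch; the z-scores 2.576 and 1.96 correspond to 99% and 95% confidence levels:

  import math

  def sample_size(population, z=2.576, margin=0.05, p=0.5):
      """Cochran's formula with a finite population correction.
      z = 2.576 for 99% confidence, 1.96 for 95%."""
      n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
      return math.ceil(n0 / (1 + (n0 - 1) / population))

  print(sample_size(100_000))           # 660 at 99% confidence, 5% margin of error
  print(sample_size(1_000_000))         # 664
  print(sample_size(100_000, z=1.96))   # 383 at 95% confidence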

On Monday, we will talk about how to randomly select the files to review for your sample.  Same bat time, same bat channel!

So, what do you think?  Do you use sampling to test your search results?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Trends: Forbes on the Rise of Predictive Coding


First the New York Times with an article about eDiscovery, now Forbes.  Who’s next, The Wall Street Journal?  😉

Forbes published a blog post entitled E-Discovery And the Rise of Predictive Coding a few days ago.  Written by Ben Kerschberg, Founder of Consero Group LLC, it gets into some interesting legal issues and considerations regarding predictive coding.  For some background on predictive coding, check out our December blog posts, here and here.

First, the author provides a very brief history of document review, starting with bankers boxes and WordPerfect and “[a]fter an interim phase best characterized by simple keyword searches and optical character recognition”, it evolved to predictive coding.  OK, that’s like saying that Gone with the Wind started with various suitors courting Scarlett O’Hara and, after an interim phase best characterized by the Civil War, marriage and heartache, Rhett says to Scarlett, “Frankly, my dear, I don’t give a damn.”  That’s a bit of an oversimplification of how review has evolved.

Nonetheless, the article gets into a couple of important legal issues raised by predictive coding.  They are:

  • Satisfying Reasonable Search Requirements: Whether counsel can utilize the benefits of predictive coding and still meet legal obligations to conduct a reasonable search for responsive documents under the federal rules.  The question is, what constitutes a reasonable search under Federal Rule of Civil Procedure 26(g)(1)(A), which requires that the responding attorney attest by signature that “with respect to a disclosure, it is complete and correct as of the time it is made”?
  • Protecting Privilege: Whether counsel can protect attorney-client privilege for their client when a privileged document is inadvertently disclosed.  Federal Rule of Evidence 502 provides that a court may order that a privilege or protection is not waived by disclosure if the disclosure was inadvertent and the holder of the privilege took reasonable steps to prevent disclosure.  Again, what’s reasonable?

The author concludes that the use of predictive coding is reasonable, because it a) makes document review more efficient by providing to the reviewer only those documents that have been selected by the algorithm; b) makes it more likely that responsive documents will be produced, saving time and resources; and c) refines relevant subsets for review, which can then be validated statistically.

So, what do you think?  Does predictive coding enable attorneys to satisfy these legal issues?   Is it reasonable?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: Does Size Matter?


I admit it, with a title like “Does Size Matter?”, I’m looking for a few extra page views….  😉

I frequently get asked how big an ESI collection needs to be to benefit from eDiscovery technology.  In a recent case, one of my clients had a fairly small collection – only about 4 GB.  But, when a judge ruled that they had to start conducting depositions in a week, they needed to review that data in a weekend.  Without FirstPass™, powered by Venio FPR™, to cull the data and OnDemand® to manage the linear review, they would not have been able to make that deadline.  So, they clearly benefited from the use of eDiscovery technology in that case.

But, if you’re not facing a tight deadline, how large does your collection need to be for the use of eDiscovery technology to provide benefits?

I recently conducted a webinar regarding the benefits of First Pass Review – aka Early Case Assessment or, to use a more accurate term (as George Socha points out regularly), Early Data Assessment.  One of the topics discussed in that webinar was the cost of review for each gigabyte (GB).  Here is a breakdown, extrapolated from an analysis conducted by Anne Kershaw a few years ago (and published in the Gartner report E-Discovery: Project Planning and Budgeting 2008-2011):

Estimated Cost to Review All Documents in a GB:

  • Pages per GB:                75,000
  • Pages per Document:      4
  • Documents Per GB:        18,750
  • Review Rate:                 50 documents per hour
  • Total Review Hours:       375
  • Reviewer Billing Rate:     $50 per hour

Total Cost to Review Each GB:      $18,750

Notes: The number of pages per GB can vary widely; estimates tend to range from 50,000 to 100,000 pages per GB, so 75,000 pages (18,750 documents) seems an appropriate average.  50 documents reviewed per hour is considered to be a fast review rate and $50 per hour is considered to be a bargain price.  eDiscovery Daily provided an earlier estimate of $16,650 per GB based on assumptions of 20,000 documents per GB and 60 documents reviewed per hour – the assumptions may change somewhat, but, either way, the cost for attorney review of each GB could be expected to range from at least $16,000 to $18,000, possibly more.

Advanced culling and searching capabilities of First Pass Review tools like FirstPass can enable you to cull out 70-80% of most collections as clearly non-responsive without having to conduct attorney review on those files.  If you have merely a 2 GB collection and assume the lowest review cost above of $16,000 per GB, the use of a First Pass Review tool to cull out 70% of the collection can save $22,400 in attorney review costs.  Is that worth it?
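
Here is a small Python sketch that reproduces the per-GB figure above and the savings calculation; all of the inputs (pages per GB, review rate, billing rate, cull rate) are just the assumptions from this post and can be swapped for your own:

  def review_cost_per_gb(pages_per_gb=75_000, pages_per_doc=4,
                         docs_per_hour=50, rate_per_hour=50):
      """Estimated attorney review cost for one GB, using the assumptions above."""
      docs = pages_per_gb / pages_per_doc  # 18,750 documents
      hours = docs / docs_per_hour         # 375 review hours
      return hours * rate_per_hour         # $18,750

  def culling_savings(collection_gb, cost_per_gb, cull_rate):
      """Review cost avoided by culling a fraction of the collection before review."""
      return collection_gb * cost_per_gb * cull_rate

  print(review_cost_per_gb())              # 18750.0
  print(culling_savings(2, 16_000, 0.70))  # 22400.0 for the 2 GB example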

So, what do you think?  Do you use eDiscovery technology for only the really large cases or ALL cases?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: What is “Reduping?”


As emails are sent out to multiple custodians, deduplication (or “deduping”) has become a common practice to eliminate multiple copies of the same email or file from the review collection.  It saves considerable review costs and ensures consistency by preventing different reviewers from applying different responsiveness or privilege determinations to the same file (e.g., if one copy of a file is designated as privileged while another copy is not, a privileged file may slip into the production set).  Deduping can be performed either across custodians in a case or within each custodian.

Everyone who works in electronic discovery knows what “deduping” is.  But how many of you know what “reduping” is?  Here’s the answer:

“Reduping” is the process of re-introducing duplicates back into the population for production after completing review.  There are a couple of reasons why a producing party may want to “redupe” the collection after review:

  • Deduping Not Requested by Receiving Party: As opposing parties in many cases still don’t conduct a meet and confer or discuss specifications for production, they may not have discussed whether or not to include duplicates in the production set.  In those cases, the producing party may choose to produce the duplicates, giving the receiving party more files to review and driving up their costs.  The attitude of the producing party can be “hey, they didn’t specify, so we’ll give them more than they asked for.”
  • Receiving Party May Want to See Who Has Copies of Specific Files: Sometimes, the receiving party does request that “dupes” are identified, but only within custodians, not across them.  In those cases, it’s because they want to see who had a copy of a specific email or file.  However, the producing party still doesn’t want to review the duplicates (because of increasing costs and the possibility of inconsistent designations), so they review a deduped collection and then redupe after review is complete.

Many review applications support the capability for reduping.  For example, FirstPass™, powered by Venio FPR™, suppresses the duplicates from review, but applies the same tags to the duplicates of any files tagged during first pass review.  When it’s time to export the collection – either to move the potentially responsive files on to linear review (in a product like OnDemand®) or to go straight to production – the user can decide whether or not to export the dupes.  Those dupes have the same designations as the primary copies, ensuring consistency in handling them downstream.
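
Conceptually, the mechanics are straightforward.  Here is a simplified Python sketch of hash-based deduping and reduping; real tools like FirstPass use their own (often metadata-aware) duplicate identification, such as hashing selected email fields rather than raw file bytes, so treat this purely as an illustration:

  import hashlib
  from collections import defaultdict

  def file_hash(path):
      """Hash file contents to identify exact duplicates."""
      h = hashlib.sha256()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(65536), b""):
              h.update(chunk)
      return h.hexdigest()

  def dedupe(paths):
      """Group files by hash; only the first ("primary") copy in each group gets reviewed."""
      groups = defaultdict(list)
      for p in paths:
          groups[file_hash(p)].append(p)
      primaries = {h: g[0] for h, g in groups.items()}
      return primaries, groups

  def redupe(groups, primaries, tags):
      """Propagate each primary's review tag back to its suppressed duplicates."""
      return {path: tags[primaries[h]] for h, group in groups.items() for path in group}

The last function is the one that matters here: the duplicates never get reviewed, but they leave with the same designations as their primaries.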

So, what do you think?  Does your review tool support “reduping”?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Trends: Despite What NY Times Says, Lawyers Not Going Away


There was a TV commercial in the mid-80’s where a soap opera actor delivered the line “I’m not a doctor, but I play one on TV”.  Can you remember the product it was advertising (without clicking on the link)?  If so, you win the trivia award of the day!  😉

I’m a technologist who has been working in litigation support and eDiscovery for over twenty years.  If you’ve been reading eDiscovery Daily for a while, you’ve probably noticed that I’ve written several posts regarding significant case law as it pertains to eDiscovery.  I often feel that I should offer a disclaimer before each of these posts saying “I’m not a lawyer, but I play one on the Web”.  As the disclaimer at the bottom of the page stipulates, these posts aren’t meant to provide legal advice; my intention is merely to identify cases that may be of interest to our readers, provide a basic recap of each and leave it at that.  As Clint Eastwood once said, “A man’s got to know his limitations”.

A few days ago, The New York Times published an article entitled Armies of Expensive Lawyers, Replaced by Cheaper Software, which discussed how, using ‘artificial intelligence, “e-discovery” software can analyze documents in a fraction of the time for a fraction of the cost’ (extraneous comma in the title notwithstanding).  The article goes on to discuss linguistic and sociological techniques for retrieving relevant information and how the Enron Corpus, available in a number of forms, including through EDRM, has enabled software providers to make great strides in analytical capabilities by using this large base of data for testing.  It also discusses whether this will precipitate a march to the unemployment line for scores of attorneys.

A number of articles and posts since then have offered commentary as to whether that will be the case.  Technology tools will certainly reduce document populations significantly, but, as the article noted, “[t]he documents that the process kicks out still have to be read by someone”.  Not only that, the article still makes the assumption that people too often make with search technology – that it’s a “push a button and get your answer” approach to identifying relevant documents.  But, as has been noted in several cases and also here on this blog, searching is an iterative process where sampling the search results is recommended to confirm that the search maximizes recall and precision to the extent possible.  Who do you think is going to perform that sampling?  Lawyers – that’s who (working with technologists like me, of course!).  And, some searches will require multiple iterations of sampling and analysis before the search is optimized.

Therefore, while the “armies” of lawyers may not need nearly as many members of the infantry, they will still need plenty of corporals, sergeants, captains, colonels and generals.  And, for those entry-level reviewing attorneys who no longer have a place on review projects?  Well, we could always use a few more doctors on TV, right?  😉

So, what do you think?  Are you a review attorney that has been impacted by technology – positively or negatively?   Please share any comments you might have or if you’d like to know more about a particular topic.

Managing an eDiscovery Contract Review Team: Use the Team’s Knowledge

The document review effort is the litigation team’s first in-depth exposure to the client’s electronic documents.  The review staff will have more exposure to a broader range of documents than anyone else on the team, at least in the beginning of the case.  When you are using contract reviewers, they will go away when the review is completed.  You don’t want to lose what they’ve learned when the project is over, so you should take some steps to use their knowledge.  Here are two things you can do:

  • Ask for summary memos:  Ask supervisors on the project to prepare a summary memo for each custodian.  To get good summary information you should provide specific instructions for the information you would like included.  You could, for example, ask for this information about each custodian:
    • A description of the types of documents in the collection (for example, letters, monthly reports, worksheets, and so on).
    • A description of the general topics that are covered.
    • An approximate date range of the documents in the custodian’s files.
    • A list of key individuals (and organizations) with whom the custodian frequently corresponds.
  • Interview the review team:  Meet periodically with the group.  Spend an hour at the end of a workday and interview them about what they are seeing in the collection.  If there are certain topics you are hoping to see covered in the documents, ask the team about them.  Likewise, if there are certain topics that you hope not to see, ask about those as well.  This type of exchange will serve three purposes:
    • It will give senior litigation team members useful information about the document collection.
    • It will be useful for review team members to learn about what other team members are seeing.
    • It’s great for team morale.  It really reinforces that their work is important and that their input is valuable.

What steps do you take to make use of what the review team learns in the document review?  Do you have suggestions you can share with us?

This concludes our blog series on Managing an eDiscovery Contract Review Team.  I hope you found it useful!

Please share any comments you have and let us know if you’d like to know more about an eDiscovery topic.

Managing an eDiscovery Contract Review Team: Keep the Staff Motivated


In the last blog post, we talked about steps you can take to ensure high-quality, consistent work from a contract review staff.  There is one more, very important thing you should do:  keep the staff motivated.  There is no question that a motivated, content staff will produce better work than a staff that is indifferent.  Here are a few things you can do:

  • Give them the big picture:  Let the review staff know how their work fits into the overall litigation process, how their work product will be used, and how important their contribution is to the case.
  • Keep them up-to-date on the status of the case:  Let them know what’s going on.  Tell them when case milestones have been met, when initial production deadlines have been met, and what the attorneys are doing.  
  • Have senior attorneys give them some attention:  Ask senior attorneys on the case to stop by periodically and speak to the group.  This, more than anything, will reinforce how important their work is to the case.
  • Give frequent feedback to each member of the team:  Each supervisor should be responsible for giving regular feedback to members of the team.  This should be a daily task, done with team members on a rotating basis.  Every team member – even those doing excellent work – should get one-on-one time with the supervisor. 
  • Make sure the work environment is comfortable and pleasant:  Things like good lighting, comfortable chairs, good ventilation and a comfortable temperature can have a huge effect on both morale and productivity.

What do you do to keep a contract review staff motivated?  Do you have suggestions you can share with us?  Please share any comments you have and let us know if you’d like to know more about an eDiscovery topic.