eDiscovery Daily Blog

That Was Random: eDiscovery Best Practices

This may be the shortest title in the history of this blog.  In the first year of the blog’s existence (when we had far fewer subscribers and readers than we do now), we ran a three-part series on how to perform an iterative random sample of a search result set to evaluate the results.  Since we discussed the topic recently in two webinars, I thought I would revisit it here for those who didn’t attend the webinars (though you can still check them out on demand) and weren’t readers of our blog back then.

Searching is part science and part art – there are too many variables to make assumptions about your search results, so you must test them to determine whether your search criteria are valid or need revision.  Often, you’ll find variables you didn’t anticipate that force you to revise your search.

It’s not just important to test the result set from your search; it’s also important to test what was not retrieved, to look for hits you might have missed.

How many documents should you review in your test of each set?  How do you select the documents to test?  Initial testing can help identify some issues, but to develop a level of demonstrable confidence in your search, your test should involve random sampling of each set.

To determine the number of documents you need to sample, you need three things:

  • Size of the Test Set: This would be the size of the result set OR the size of the set of documents NOT retrieved in the result set;
  • Confidence Level: How certain you need to be that your sample reflects the entire set. A typical confidence level is 95% to 99%;
  • Margin of Error: The amount of error that you can tolerate (typically 5% or less).

The good news is that you don’t have to dust off your statistics textbook from college.  There are several sites that provide sample size calculators to help you determine an appropriate sample size, including this one.

Here’s an example.  A search retrieves 100,000 files, with 1,000,000 files NOT retrieved:

Retrieved: Size of the test set = 100,000; confidence level = 99%, margin of error = 5%.  You need to review 660 retrieved files to obtain a 99% confidence level in the resulting test (goes down to 383 retrieved files if a 95% confidence level will do).

NOT Retrieved: Size of the test set = 1,000,000; confidence level = 99%, margin of error = 5%.  You need to review 664 NOT retrieved files to obtain a 99% confidence level in the resulting test (goes down to 384 files if a 95% confidence level will do).

As you can see, the sample size doesn’t need to increase much as the population gets very large, so you can review a relatively small subset to check the results.
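If you’d rather script the calculation than use a website, the numbers above can be reproduced with Cochran’s sample size formula plus a finite population correction, which is what typical sample size calculators compute.  This is a minimal sketch using only Python’s standard library; the 50% response proportion is the conservative default most calculators assume:

```python
from math import ceil
from statistics import NormalDist

def sample_size(population: int, confidence: float = 0.95,
                margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's formula with finite population correction.

    population -- size of the test set (retrieved or NOT retrieved)
    confidence -- confidence level, e.g. 0.95 or 0.99
    margin     -- margin of error, e.g. 0.05
    p          -- assumed response proportion (0.5 is most conservative)
    """
    # z-score for the two-tailed confidence level (1.96 for 95%, 2.576 for 99%)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Infinite-population sample size, then correct for the finite population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(100_000, confidence=0.99))    # 660
print(sample_size(100_000, confidence=0.95))    # 383
print(sample_size(1_000_000, confidence=0.99))  # 664
print(sample_size(1_000_000, confidence=0.95))  # 384
```

These match the retrieved and NOT retrieved examples above, and illustrate the same point: going from 100,000 to 1,000,000 documents adds only a handful of files to the sample.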

Once you have determined the number of documents you need in your test set, it’s best to select those documents randomly to avoid potential bias in your results.  This site has a random integer generator that generates whole numbers within a range you specify.  Simply supply the number of random integers you need, along with the starting and ending numbers of the range, and the site will generate a list of numbers that you can copy and paste into a text file or spreadsheet.

You can then apply those randomly generated numbers to your result set to review those selected documents to test your search.
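The same random selection can be sketched with Python’s standard library.  The population size and sample size below are illustrative, taken from the retrieved-set example above (100,000 documents, 660 sampled at 99% confidence and a 5% margin of error):

```python
import random

# Illustrative numbers from the example above: a result set of
# 100,000 documents, sampled at 660 (99% confidence, 5% margin of error).
POPULATION = 100_000
SAMPLE_SIZE = 660

random.seed(42)  # optional: fix the seed only if you need a reproducible sample

# Draw unique document numbers without replacement, then sort them
# so reviewers can walk through the result set in order.
selected = sorted(random.sample(range(1, POPULATION + 1), SAMPLE_SIZE))

print(len(selected))       # 660
print(len(set(selected)))  # 660 -- sampling is without replacement, no duplicates
```

Because `random.sample` draws without replacement, every document number in the list is unique; you can then pull those numbered documents from your result set for review.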

In our Best Practices for Effective eDiscovery Searching webcast earlier this year, I walked through an iterative example of testing and refining a search until it achieved the best balance of recall and precision.  It’s based on a real-life search I once performed for a client.  You can check it out in the webcast or via this old post here from our first year.

So, what do you think?  Do you use sampling to test your search results?  If not, why not?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.