Determining Appropriate Sample Size to Test Your Search: eDiscovery Throwback Thursdays

June 13, 2019

If you missed it last week, we started a new series – Throwback Thursdays – here on the blog, where we are revisiting some of the eDiscovery best practice posts we have covered over the years and discuss whether any of those recommended best practices have changed since we originally covered them.

This post was originally published on April 1, 2011 – no fooling! It was part of a three-post series that we will revisit over the next three weeks – we have continued to touch on this topic over the years, including our webcast just last month. One of our best!

One part of searching best practices is to test your search results (both the result set and the files not retrieved) to determine whether the search you performed is effective at maximizing both precision and recall to the extent possible, so that you retrieve as many responsive files as possible without having to review too many non-responsive files. One question I often get is: how many files do you need to review to test the search?

If you remember from statistics class in high school or college, statistical sampling is choosing a percentage of the results population at random for inspection to gather information about the population as a whole. This saves considerable time, effort and cost over reviewing every item in the results population and enables you to obtain a “confidence level” that the characteristics of the population reflect your sample. Statistical sampling is a method used for everything from exit polls to predict elections to marketing surveys to poll customers on brand popularity and is a generally accepted method of drawing conclusions for an overall results population. You can sample a small portion of a large set to obtain a 95% or 99% confidence level in your findings (with a margin of error, of course).

So, does that mean you have to find your old statistics book and dust off your calculator or (gasp!) slide rule? Thankfully, no.

There are several sites that provide sample size calculators to help you determine an appropriate sample size, including this one. Many eDiscovery platforms do so as well. You’ll simply need to identify a desired confidence level (typically 95% to 99%), an acceptable margin of error (typically 5% or less) and the population size.

So, if you perform a search that retrieves 100,000 files and you want a sample size that provides a 99% confidence level with a margin of error of 5%, you’ll need to review 660 of the retrieved files to achieve that level of confidence in your sample (only 383 files if a 95% confidence level will do). Here’s an illustration of that using the site referenced above.

If 1,000,000 files were not retrieved, you would only need to review 664 of the not retrieved files to achieve that same level of confidence (99%, with a 5% margin of error) in your sample – only four more files to review than the previous sample, even though the collection is 900,000 files larger! Don’t believe me? See for yourself here.

As you can see, the sample size doesn’t need to increase much when the population gets really large and you can review a relatively small subset to understand your collection and defend your search methodology to the court.

Next week, we will talk about how to randomly select the files to review for your sample.

So, what do you think? Do you use sampling to test your search results? Please share any comments you might have or if you’d like to know more about a particular topic.

Sponsor: This blog is sponsored by CloudNine, which is a data and legal discovery technology company with proven expertise in simplifying and automating the discovery of data for audits, investigations, and litigation. Used by legal and business customers worldwide including more than 50 of the top 250 Am Law firms and many of the world’s leading corporations, CloudNine’s eDiscovery automation software and services help customers gain insight and intelligence on electronic data.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

eDiscovery Daily Blog