eDiscovery Daily Blog
eDiscovery Best Practices: Determining Appropriate Sample Size to Test Your Search
We’ve talked about searching best practices quite a bit on this blog. One of those practices (part of the “STARR” approach I described in an earlier post) is to test your search results, both the result set and the files not retrieved, to determine whether the search you performed maximizes precision and recall to the extent possible, so that you retrieve as many responsive files as possible without having to review too many non-responsive files. One question I often get is: how many files do you need to review to test the search?
If you remember from statistics class in high school or college, statistical sampling means choosing a portion of the results population at random and inspecting it to gather information about the population as a whole. This saves considerable time, effort and cost over reviewing every item in the results population and enables you to obtain a “confidence level” that the characteristics of your sample reflect those of the population. Statistical sampling is used for everything from exit polls that predict elections to marketing surveys that gauge brand popularity, and it is a generally accepted method of drawing conclusions about an overall results population. You can sample a small portion of a large set to obtain a 95% or 99% confidence level in your findings (with a margin of error, of course).
So, does that mean you have to find your old statistics book and dust off your calculator or (gasp!) slide rule? Thankfully, no.
There are several sites that provide sample size calculators to help you determine an appropriate sample size, including this one. You’ll simply need to identify a desired confidence level (typically 95% to 99%), an acceptable margin of error (typically 5% or less) and the population size.
So, if you perform a search that retrieves 100,000 files and you want a sample size that provides a 99% confidence level with a margin of error of 5%, you’ll need to review 660 of the retrieved files (only 383 files if a 95% confidence level will do). If 1,000,000 files were not retrieved, you would only need to review 664 of those files to achieve that same level of confidence (99%, with a 5% margin of error). As you can see, the sample size barely increases as the population gets really large, so you can review a relatively small subset to understand your collection and defend your search methodology to the court.
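If you are curious where those numbers come from, here is a minimal Python sketch of one common way to compute them. It is an assumption on my part that the calculator works this way: the sketch uses the standard sample size formula for a proportion with a finite population correction, assumes the worst case of a 50% response distribution, and rounds up to the next whole file. The function name sample_size and its parameters are my own, purely for illustration.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, proportion=0.5):
    """Estimate how many files to review at a given confidence level and margin of error.

    Assumes simple random sampling; proportion=0.5 is the worst case
    (it yields the largest sample size).
    """
    # z-score for a two-sided interval, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size if the population were effectively unlimited
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    # Finite population correction shrinks the sample for smaller populations
    n = n0 / (1 + (n0 - 1) / population)
    # Round up so the sample is never smaller than required
    return math.ceil(n)


for population, confidence in [(100_000, 0.99), (100_000, 0.95), (1_000_000, 0.99)]:
    print(population, confidence, sample_size(population, confidence))
```

Under these assumptions the sketch reproduces the figures above: 660 and 383 files for a population of 100,000, and 664 files for a population of 1,000,000. A calculator that rounds differently or uses a slightly different z-score may be off by a file or two.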
On Monday, we will talk about how to randomly select the files to review for your sample. Same bat time, same bat channel!
So, what do you think? Do you use sampling to test your search results? Please share any comments you might have, or let us know if you’d like to know more about a particular topic.