eDiscovery Best Practices: A “Random” Idea on Search Sampling

April 4, 2011

Friday, we talked about how to determine an appropriate sample size to test your search results as well as the items NOT retrieved by the search, using a site that provides a sample size calculator. Today, we’ll talk about how to make sure the sample itself is randomly selected.

A randomly selected sample gives each file an equal chance of being reviewed, which keeps selection bias from skewing the results. Merely selecting the first or last x number of items (or any other convenient group) in the set may not reflect the population as a whole; for example, all of those items could come from a single custodian. For a sample to be fair and defensible, it needs to be selected randomly.

So, how do you select the numbers randomly? Once again, the Internet helps us out here.

One site, Random.org, has a random integer generator that produces whole numbers at random. You simply supply how many random integers you need, along with the starting and ending numbers of the range within which they should fall. The site will then generate a list of numbers that you can copy and paste into a text file or even a spreadsheet. The site also provides an Advanced mode, which offers options for the number format (e.g., decimal, hexadecimal), the output format and how the randomization is ‘seeded’ (to generate the numbers).

In the example from Friday, the search yielded 100,000 files with hits, so to test those results you would provide 660 as the number of random integers to be generated, with a starting number of 1 and an ending number of 100,000. To test the non-hits, you would provide 664, 1 and 1,000,000 respectively. You could paste the numbers into a spreadsheet, sort them, and then retrieve the file at each of those positions in the result set and review it to determine whether it reflects the intent of the search. You’ll then have a good sense of how effective your search was, based on the random sample. And, probably more importantly, using that random sample to test your search results will give you a highly defensible way to verify your approach in court.
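If you would rather generate the list locally than copy it from a web page, the same selection can be sketched in a few lines of Python. This is only an illustration, not the method Random.org uses: it relies on Python's built-in pseudo-random generator, and the function name `draw_sample_positions` and the seed value are hypothetical choices made for this example. It draws without replacement, so no position can be selected twice.

```python
import random


def draw_sample_positions(population_size, sample_size, seed=None):
    """Return sorted 1-based positions drawn at random without replacement.

    The seed is optional; recording it lets you regenerate the same list later.
    """
    if not 0 < sample_size <= population_size:
        raise ValueError("sample_size must be between 1 and population_size")
    rng = random.Random(seed)
    # range(1, population_size + 1) mirrors a starting number of 1 and an
    # ending number equal to the size of the result set.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


# Numbers from the example: 660 of 100,000 hits, 664 of 1,000,000 non-hits.
# The seed 2011 is an arbitrary, hypothetical value.
hit_positions = draw_sample_positions(100_000, 660, seed=2011)
non_hit_positions = draw_sample_positions(1_000_000, 664, seed=2011)

print(len(hit_positions), "positions to review among the hits")
print(len(non_hit_positions), "positions to review among the non-hits")
```

Each number in the resulting lists is a position in the corresponding result set, already sorted so the files can be pulled in order. Keeping a note of the seed alongside the list is one way to show later how the sample was chosen.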

Tomorrow, we'll walk through a sample iteration to show how the sampling will ultimately help us refine our search.

So, what do you think? Do you use sampling to test your search results? Please share any comments you might have, or let us know if you’d like to know more about a particular topic.

eDiscovery Daily Blog