eDiscovery Daily Blog

A “Random” Idea on Search Sampling: eDiscovery Throwback Thursdays

If you missed it the past couple of weeks, we started a new series – Throwback Thursdays – here on the blog, where we are revisiting some of the eDiscovery best practice posts we have covered over the years and discuss whether any of those recommended best practices have changed since we originally covered them.

This post was originally published on April 4, 2011.  It was part of a three-post series that we started revisiting last week and will conclude the series next week.  We have continued to touch on this topic over the years, including our webcast just last month.  One of our best!

Last Thursday, we talked about how to determine an appropriate sample size to test your search results as well as the items NOT retrieved by the search, using a site that provides a sample size calculator.  Today, we’ll talk about how to make sure the sample size is randomly selected.

A randomly selected sample gives each file an equal chance of being reviewed and eliminates the chance of bias being introduced into the sample which might skew the results.  Merely selecting the first or last x number of items (or any other group) in the set may not reflect the population as a whole – for example, all of those items could come from a single custodian.  To ensure a fair, defensible sample, it needs to be selected randomly.

So, how do you select the numbers randomly?  Of course, many eDiscovery platforms do that for you and enable you to generate your own random sample.  However, to illustrate the concept, we will once again demonstrate here with a web site.

Here’s one site, Random.org, that has a random integer generator which will randomly generate whole numbers.  You simply need to supply the number of random integers that you need to be generated, the starting number and ending number of the range within which the randomly generated numbers should fall.  The site will then generate a list of numbers that you can copy and paste into a text file or even a spreadsheet.  The site also provides an Advanced mode, which provides options for the numbers (e.g., decimal, hexadecimal), output format and how the randomization is ‘seeded’ (to generate the numbers).

In the example from Friday, you would provide 660 as the number of random integers to be generated, with a starting number of 1 and an ending number of 100,000 to get a list of random numbers for testing your search that yielded 100,000 files with hits (664, 1 and 1,000,000 respectively to get a list of numbers to test the non-hits).  Here’s how that looks on the site:

And, here is an example of the beginning of the list of numbers it generates (run it again and it will generate a different set of numbers):

You could paste the numbers into an Excel spreadsheet (Paste Special, then select Text) and sort them.  Here’s how that looks (sort dialog shown with the sort already performed):

You can then retrieve the files by position in the result set (typically Doc ID if you’re doing the entire collection) based on the random numbers retrieved review each of them to determine whether they reflect the intent of the search, which will give you a good sense of how effective your search was, based on the random sample.  And, probably more importantly, using that random sample to test your search results will be a highly defensible method to verify your approach in court.

So, what do you think?  Do you use sampling to test your search results?   Please share any comments you might have or if you’d like to know more about a particular topic.

Sponsor: This blog is sponsored by CloudNine, which is a data and legal discovery technology company with proven expertise in simplifying and automating the discovery of data for audits, investigations, and litigation. Used by legal and business customers worldwide including more than 50 of the top 250 Am Law firms and many of the world’s leading corporations, CloudNine’s eDiscovery automation software and services help customers gain insight and intelligence on electronic data.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.