eDiscovery Best Practices: Testing Your Search Using Sampling

April 5, 2011

Friday, we talked about how to determine an appropriate sample size to test your search results as well as the items NOT retrieved by the search, using a site that provides a sample size calculator. Yesterday, we talked about how to make sure the sample is randomly selected.

Today, we’ll walk through an example of how you can test and refine a search using sampling.

TEST #1: Let’s say in an oil company we’re looking for documents related to oil rights. To try to be as inclusive as possible, we will search for “oil” AND “rights”. Here is the result:

Files retrieved with “oil” AND “rights”: 200,000
Files NOT retrieved with “oil” AND “rights”: 1,000,000

Using the sample size calculator site that we identified before, we determine a sample size of 662 for the retrieved files and 664 for the non-retrieved files to achieve a 99% confidence level with a margin of error of 5%. We then generate random numbers to select the sample items and review each selected item in the retrieved and NOT retrieved sets to determine responsiveness to the case. Here are the results:
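If you'd like to check the calculator's numbers yourself, here is a minimal sketch. It assumes the calculator uses the standard sample size formula with a finite population correction and a worst-case 50% response distribution. That is our assumption rather than something the site documents here, but it does reproduce the sample sizes used in this post. The function name `sample_size` is just for illustration.

```python
import math
from statistics import NormalDist

def sample_size(population, confidence=0.99, margin=0.05, p=0.5):
    """Sample size for estimating a proportion, rounded up.

    Assumes the standard formula with a finite population correction
    and a worst-case response distribution of p = 0.5.
    """
    # Two-sided z-score for the confidence level (about 2.576 for 99%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an unlimited population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Shrink it to account for the actual size of the set being sampled.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(200_000))    # retrieved files: 662
print(sample_size(1_000_000))  # NOT retrieved files: 664
```

Notice that the sample size barely moves once the set gets large, which is why the NOT retrieved set needs only two more files than the retrieved set despite being five times its size.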

Retrieved Items: 662 reviewed, 24 responsive, 3.6% responsive rate.
NOT Retrieved Items: 664 reviewed, 661 non-responsive, 99.5% non-responsive rate.

Nearly every item in the NOT retrieved category was non-responsive, which is good. But only 3.6% of the retrieved items were responsive, which means our search was WAY over-inclusive. At that rate, 192,800 out of 200,000 files retrieved will be NOT responsive and will be a waste of time and resources to review. Why? Because, as we determined during the review, almost every published and copyrighted document in our oil company has the phrase “All Rights Reserved” in the document and will be retrieved.
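The projection from the sample to the full retrieved set is simple arithmetic, and it can be sketched as follows. The function name is hypothetical, and we round the sample rate to one decimal place (3.6%) before projecting so that the result matches the figure above. Projecting from the unrounded rate would give a slightly lower number.

```python
def projected_non_responsive(total_retrieved, reviewed, responsive, digits=3):
    """Project the sample's responsive rate onto the full retrieved set."""
    # Responsive rate observed in the sample, rounded (0.036 means 3.6%).
    rate = round(responsive / reviewed, digits)
    # Files in the full set we would expect to be NOT responsive.
    return rate, round(total_retrieved * (1 - rate))

rate, wasted = projected_non_responsive(200_000, reviewed=662, responsive=24)
print(f"{rate:.1%} responsive, about {wasted:,} NOT responsive")
```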

TEST #2: Let’s try again. This time, we’ll conduct a phrase search for “oil rights” (which requires those words as an exact phrase). Here is the result:

Files retrieved with “oil rights”: 1,500
Files NOT retrieved with “oil rights”: 1,198,500

This time, we determine a sample size of 461 for the retrieved files and (again) 664 for the NOT retrieved files to achieve a 99% confidence level with a margin of error of 5%. Even though we still have a sample size of 664 for the NOT retrieved files, we generate a new list of random numbers to review those items, as well as the 461 randomly selected retrieved items. Here are the results:
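If you'd rather generate the random numbers yourself than use a site, a minimal sketch looks like this. It assumes the files in each set are simply numbered 1 through N, and the function name `draw_sample` and the seed values are made up for illustration. Recording the seed is optional, but it lets you reproduce the exact same selection later if anyone asks how the sample was drawn.

```python
import random

def draw_sample(population_size, sample_size, seed=None):
    """Pick distinct item numbers (1..population_size) at random."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no item is picked twice.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

# A fresh draw for each set in Test #2, with a different seed for each.
retrieved_sample = draw_sample(1_500, 461, seed=2)
not_retrieved_sample = draw_sample(1_198_500, 664, seed=3)
print(retrieved_sample[:5])
```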

Retrieved Items: 461 reviewed, 435 responsive, 94.4% responsive rate.
NOT Retrieved Items: 664 reviewed, 523 non-responsive, 78.8% non-responsive rate.

Nearly every item in the retrieved category was responsive, which is good. But only 78.8% of the NOT retrieved items were non-responsive, which means over 20% of the NOT retrieved items were actually responsive to the case (we also failed to retrieve 8 of the items identified as responsive in the first iteration). So, now what?
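Checking a new search against items already identified as responsive is easy to automate if you keep a list of their document IDs. Here is a minimal sketch. The IDs below are invented for illustration, and how you export IDs from your review platform will vary.

```python
def missed_known_responsive(known_responsive_ids, retrieved_ids):
    """Return the known-responsive IDs that the new search did NOT retrieve."""
    return set(known_responsive_ids) - set(retrieved_ids)

# Hypothetical document IDs, for illustration only.
known_responsive = {"DOC-0012", "DOC-0457", "DOC-1033", "DOC-2210"}
retrieved_by_new_search = {"DOC-0012", "DOC-1033", "DOC-5150"}

missed = missed_known_responsive(known_responsive, retrieved_by_new_search)
print(sorted(missed))  # ['DOC-0457', 'DOC-2210']
```

Any ID that shows up in the result is a known responsive document the revised search would leave behind, which is a strong hint that the search is now too narrow.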

TEST #3: If you saw this previous post, you know that proximity searching is a good alternative for finding hits that are close to each other without requiring the exact phrase. So, this time, we’ll conduct a proximity search for “oil within 5 words of rights”. Here is the result:
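To see why the three searches behave so differently, here is a toy sketch of each one. It is not how any particular search platform implements them: the tokenizer is deliberately crude, the proximity match ignores word order, and "within 5 words" is taken to mean the two word positions differ by no more than 5. Real platforms may count distance differently. All function names and sample sentences are made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def matches_and(text, a, b):
    """Test #1: both words appear anywhere in the document."""
    words = set(tokenize(text))
    return a in words and b in words

def matches_phrase(text, a, b):
    """Test #2: the words appear as the exact phrase 'a b'."""
    words = tokenize(text)
    return any(x == a and y == b for x, y in zip(words, words[1:]))

def matches_within(text, a, b, distance=5):
    """Test #3: the words appear within `distance` positions of each other."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

boilerplate = ("Quarterly oil production figures for the Gulf region are "
               "summarized in this report. All Rights Reserved.")
exact = "The lease grants oil rights to the operator."
reworded = "Rights to extract oil and gas were assigned in 2009."

for doc in (boilerplate, exact, reworded):
    print(matches_and(doc, "oil", "rights"),
          matches_phrase(doc, "oil", "rights"),
          matches_within(doc, "oil", "rights"))
```

The "All Rights Reserved" document only matches the AND search, the reworded sentence is missed by the phrase search, and the proximity search catches both of the relevant documents while skipping the boilerplate.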

Files retrieved with “oil within 5 words of rights”: 5,700
Files NOT retrieved with “oil within 5 words of rights”: 1,194,300

This time, we determine a sample size of 595 for the retrieved files and (once again) 664 for the NOT retrieved files, generating a new list of random numbers for both sets of items. Here are the results:

Retrieved Items: 595 reviewed, 542 responsive, 91.1% responsive rate.
NOT Retrieved Items: 664 reviewed, 655 non-responsive, 98.6% non-responsive rate.

Over 90% of the items in the retrieved category were responsive AND nearly every item in the NOT retrieved category was non-responsive, which is GREAT. Also, all but one of the items previously identified as responsive were retrieved. So, of the three searches we tested, this is the one that appears to offer the best balance of recall and precision.

Had we proceeded with the original search, we would have reviewed 200,000 files, 192,800 of which would have been NOT responsive to the case. By testing and refining, we only had to review 8,815 files: the 3,710 sample files reviewed across the three tests plus the remaining retrieved items from the third search (5,700 – 595 = 5,105), most of which ARE responsive to the case. We saved tens of thousands of dollars in review costs while still retrieving most of the responsive files, using a defensible approach.
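For anyone who wants to check the math, the review totals above can be tallied as follows, using only the sample sizes and file counts from the three tests. The variable names are ours.

```python
# (retrieved sample, NOT retrieved sample) reviewed in each test.
samples = {
    "Test #1": (662, 664),
    "Test #2": (461, 664),
    "Test #3": (595, 664),
}

# Every sampled file was reviewed, across all three tests.
sample_total = sum(retrieved + not_retrieved
                   for retrieved, not_retrieved in samples.values())

# Retrieved files from the third search that were not already sampled.
remaining = 5_700 - 595

total_reviewed = sample_total + remaining
print(sample_total, remaining, total_reviewed)  # 3710 5105 8815
print(200_000 - total_reviewed)                 # 191185 fewer files reviewed
```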

Keep in mind that this is a simple example — we’re not taking into account misspellings and other variations we may want to include in our criteria.

So, what do you think? Do you use sampling to test your search results? Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Daily Blog