
eDiscovery Best Practices: Message Thread Review Saves Costs and Improves Consistency

 

Insanity is doing the same thing over and over again and expecting a different result.  But, in ESI review, it can be even worse when you do get a different result.

One of the biggest challenges when reviewing ESI is identifying duplicates so that your reviewers aren’t reviewing the same files again and again.  Not only does that drive up costs unnecessarily, but it could lead to problems if the same file is categorized differently by different reviewers (for example, inadvertent production of a duplicate of a privileged file if it is not correctly categorized).

Of course, there are a number of ways to identify duplicates.  Exact duplicates (files that contain the exact same content in the same file format) can be identified through hash values, which act as a digital fingerprint of the content of the file.  MD5 and SHA-1 are the most popular hashing algorithms; files with matching hash values are exact duplicates and can be removed from the review population.  Since the same emails are often sent to multiple parties and the same files are stored on different drives, deduplication through hashing can save considerable review costs.
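For the technically inclined, here is a minimal sketch of what hash-based deduplication looks like under the hood, using Python’s standard hashlib module.  The folder name and the “first file wins” rule are purely illustrative; production tools do much more (for example, hashing emails on selected metadata fields rather than raw bytes).

```python
import hashlib
from pathlib import Path

def file_hash(path: Path, algorithm: str = "md5") -> str:
    """Compute a digital fingerprint (hash value) of a file's contents."""
    h = hashlib.new(algorithm)          # "md5" or "sha1"
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(paths):
    """Keep the first file seen for each unique hash; report the rest as exact duplicates."""
    seen, duplicates = {}, []
    for path in paths:
        digest = file_hash(path)
        if digest in seen:
            duplicates.append((path, seen[digest]))
        else:
            seen[digest] = path
    return list(seen.values()), duplicates

# Hypothetical usage against a collection folder:
# files = [p for p in Path("collection").rglob("*") if p.is_file()]
# originals, dupes = deduplicate(files)
```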

Sometimes, files are not exact duplicates but contain the same (or almost the same) information.  One example is a Word document published to an Adobe PDF file – the content is the same, but the file format is different, so the hash value will be different.  Near-deduplication can be used to identify files where most or all of the content matches so they can be verified as duplicates and eliminated from review.
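Commercial near-deduplication engines use more sophisticated techniques (shingling, clustering and so on), but as a rough sketch of the idea, here is a comparison of two documents’ extracted text using Python’s standard difflib.  The sample strings, the upstream text extraction and the 90% threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def similarity(text_a: str, text_b: str) -> float:
    """Rough 0-to-1 similarity score between two documents' extracted text."""
    return SequenceMatcher(None, text_a, text_b).ratio()

# Text assumed to have been extracted upstream from the Word document and the PDF
word_doc_text = "This agreement conveys the oil rights described in Exhibit A."
pdf_text      = "This agreement conveys the oil rights described in Exhibit A. Page 1 of 1"

if similarity(word_doc_text, pdf_text) >= 0.9:   # threshold is a judgment call
    print("Likely near-duplicates; verify and review only one copy")
```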

Then, there is message thread analysis.  Of course, most email messages are part of a larger discussion, which could be just between two parties or could include a number of parties.  To review each email in the discussion thread would result in much of the same information being reviewed over and over again.  Instead, message thread analysis pulls those messages together and enables them to be reviewed as an entire discussion.  That includes any side conversations within the discussion that may or may not be related to the original topic (e.g., a side discussion about lunch plans or whether anyone saw American Idol last night).

FirstPass®, powered by Venio FPR™, is one example of an application that provides a mechanism for message thread analysis of Outlook emails that pulls the entire thread into one conversation for review as one big “tree”.  The “tree” representation gives you the ability to see all of the conversations within the discussion and focus your review on the last emails in each conversation to see what is said without having to review each email.  Side conversations are “branches” of the tree and FirstPass enables you to tag individual messages, specific branches or the entire tree as responsive, non-responsive, privileged or some other designation.  Also, because of the way that Outlook tracks emails in the thread, FirstPass identifies messages that are missing from the collection with a red X, enabling you to investigate and determine if additional collection is needed and helping you avoid potential spoliation claims.
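FirstPass’s internal mechanism isn’t documented here, but for a feel of how messages get grouped into a thread “tree” in the first place, here is a generic sketch that groups standard RFC 5322 messages by their References/In-Reply-To headers.  Outlook’s own ConversationIndex property works differently, so treat this purely as an illustration of the concept.

```python
from collections import defaultdict
from email import message_from_string

def thread_key(msg) -> str:
    """Use the root Message-ID from the References header as the thread key,
    falling back to the message's own Message-ID for thread starters."""
    references = (msg.get("References") or msg.get("In-Reply-To") or "").split()
    return references[0] if references else msg.get("Message-ID", "")

def group_into_threads(raw_messages):
    """Group raw RFC 5322 message strings into threads keyed by the root message."""
    threads = defaultdict(list)
    for raw in raw_messages:
        msg = message_from_string(raw)
        threads[thread_key(msg)].append(msg)
    return threads
```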

With message thread analysis, you can minimize review of duplicative information within emails, saving time and cost and ensuring consistency in the review.

So, what do you think?  Does your review tool support message thread analysis?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Case Law: Written Litigation Hold Notice Not Required

The Pension Committee case was one of the most important cases of 2010 (or any year, for that matter).  So, perhaps it’s not surprising that it is starting to be frequently cited by those seeking sanctions for failure to issue a written litigation hold.

In Steuben Foods, Inc. v. Country Gourmet Foods, LLC, No. 08-CV-561S(F), (W.D.N.Y. Apr. 21, 2011), a U.S. District Court in the Western District of New York declined to follow the Pension Committee decision in the Southern District of New York to the extent that the Pension Committee decision held “that implementation of a written litigation hold notice is required in order to avoid an inference that relevant evidence has been presumptively destroyed by the party failing to implement such written litigation hold.”

Steuben Foods alleged that Country Gourmet breached its exclusive supply contract with Steuben when Country Gourmet sold all its assets except the supply contract to Campbell Soup. Campbell sought sanctions against Steuben when several emails were not produced by Steuben and Steuben conceded that its litigation hold procedure had not included a written notice. Steuben’s corporate counsel had orally directed each of eight managers and corporate officers to identify all electronically stored information, including paper documents and email communications, pertaining to Country Gourmet or Campbell and not to discard or delete or otherwise destroy such documents pending the litigation.

Campbell pointed to the Pension Committee decision, Pension Committee of the University of Montreal Pension Plan v. Banc of America Securities, LLC, 685 F. Supp. 2d 456, 476 (S.D.N.Y. 2010), “in which the court found that the absence of a written litigation hold notice supported its conclusion that plaintiffs had been grossly negligent in their obligations to preserve relevant electronically stored documents and that plaintiffs’ document production failures, coupled with the absence of a timely written litigation hold, permitted the inference that relevant documents were culpably destroyed or lost as a result.”

The court declined to infer from the absence of a written litigation hold, as the Pension Committee court did, that relevant documents were culpably destroyed or lost:

“Accordingly, the court in this case declines to hold that implementation of a written litigation hold notice is required in order to avoid an inference that relevant evidence has been presumptively destroyed by the party failing to implement such written litigation hold.”

The court noted that the relatively small size of Steuben with 400 employees “lends itself to a direct oral communication of the need to preserve documents relevant to Plaintiff’s case” and was a reason “why a written litigation hold is not essential to avoid potential sanctions for spoliation.” In any event, according to the court, Campbell was not prejudiced by any failure of Steuben to produce email because Country Gourmet provided copies of the email to Campbell and Campbell could show no prejudice resulting from any claimed negligence of Steuben in not having a written litigation hold.

So, what do you think?  Should a written litigation hold be required in every case?  Would that have made a difference in this one?  Please share any comments you might have or if you’d like to know more about a particular topic.

Case Summary Source: Applied Discovery (free subscription required).  For eDiscovery news and best practices, check out the Applied Discovery Blog here.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

eDiscovery Best Practices: Does Anybody Really Know What Time It Is?

 

Does anybody really know what time it is?  Does anybody really care?

OK, it’s an old song by Chicago (back then, they were known as the Chicago Transit Authority).  But, the question of what time it really is has a significant effect on how eDiscovery is handled.

Time Zone: In many litigation cases, one of the issues that should be discussed and agreed upon is the time zone to apply to the produced files.  Why is it a big deal?  Let’s look at one example:

A multinational corporation has offices from coast to coast and potentially responsive emails are routinely sent between East Coast and West Coast offices.  If an email is sent from a party in the West Coast office at 10 PM on June 30, 2005 and is received by a party in the East Coast office at 1 AM on July 1, 2005, and the relevant date range is from July 1, 2005 thru December 31, 2006, then the choice of time zones will determine whether or not that email falls within the relevant date range.  The time zone is based on the workstation setting, so they could actually be in the same office when the email is sent (if someone is traveling).
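To make the example concrete, here is a small sketch using Python’s zoneinfo module showing how the same email falls inside or outside the relevant date range depending on the time zone applied.  The specific zones and the naive date comparison are just for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Email sent 10 PM Pacific time on June 30, 2005
sent = datetime(2005, 6, 30, 22, 0, tzinfo=ZoneInfo("America/Los_Angeles"))

range_start = datetime(2005, 7, 1)      # relevant date range (naive, for illustration)
range_end   = datetime(2006, 12, 31)

as_eastern = sent.astimezone(ZoneInfo("America/New_York"))   # 2005-07-01 01:00
as_utc     = sent.astimezone(timezone.utc)                   # 2005-07-01 05:00

print(range_start <= sent.replace(tzinfo=None) <= range_end)        # False (Pacific time)
print(range_start <= as_eastern.replace(tzinfo=None) <= range_end)  # True  (Eastern time)
```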

Usually the choice is to either use a standard time zone for all files in the litigation – such as Greenwich Mean Time (GMT) or the time zone where the producing party is located – or to use the time zone associated with each custodian, which means that the time zone used will depend on where the data came from.  It’s important to determine the handling of time zones up front in cases where multiple time zones are involved to avoid potential disputes down the line.

Which Date to Use?: Each email and efile has one or more date and time stamps associated with it.  Emails have date/time sent, as well as date/time received.  Efiles have creation date/time, last modified date/time and even last printed date/time.  Efile creation dates do not necessarily reflect when a file was actually created; they indicate when a file came to exist on a particular storage medium, such as a hard drive. So, creation dates can reflect when a user or computer process created a file. However, they can also reflect the date and time that a file was copied to the storage medium – as a result, the creation date can be later than the last modified date.  It’s common to use date sent for Sent Items emails and date received for Inbox emails and to use last modified date for efiles.  But, there are exceptions, so again it’s important to agree up front as to which date to use.
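As a small illustration of the creation vs. last modified distinction, here is a sketch using Python’s os.stat.  Note that the meaning of st_ctime is platform-dependent (creation time on Windows, inode change time on Unix), which is itself a reminder that “the” date of a file is rarely as simple as it looks.  The file name is hypothetical.

```python
import os
from datetime import datetime

stat = os.stat("contract_draft.docx")   # hypothetical efile

modified = datetime.fromtimestamp(stat.st_mtime)
# st_birthtime is the true creation time where available (e.g., macOS);
# on Windows, st_ctime is the creation time; on Linux, st_ctime is the inode change time.
created = datetime.fromtimestamp(getattr(stat, "st_birthtime", stat.st_ctime))

if created > modified:
    print("Creation date is later than last modified; the file was likely "
          "copied to this storage medium after it was last edited.")
```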

So, what do you think?  Have you had any date disputes in your eDiscovery projects?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: What Are the Skeletons in Your ESI Closet?

 

At eDiscoveryDaily, we try not to re-post articles or blog posts from other sources.  We may reference them, but we usually try to add some commentary or analysis to provide a unique spin on the topic.  However, I read a post Thursday on one of the better legal blogs out there – Ride the Lightning from Sharon Nelson – that was a guest post by Jim McGann, VP of Information Discovery at Index Engines, and I thought it was well done and good information for our readers.  Jim has been interviewed by eDiscoveryDaily here and here and always has terrific insight on ESI issues.  You can click here to read the post directly on Ride the Lightning or simply read below.

Law firms and corporations alike tend to keep data storage devices well beyond what their compliance requirements or business needs actually dictate.  These so-called “skeletons in the closet” pose a major problem when the entity gets sued or subpoenaed. All that dusty data is suddenly potentially discoverable. Legal counsel can be proactive and initiate responsible handling of this legacy data by defining a new, defensible information governance process.

  1. Understand all data sources. The first choice when faced with an ESI collection is to look at current online network data. However, many other sources of email and files exist on corporate networks, sources that may be more defensible and even more cost-effective to collect from, including offsite storage that typically resides on backup tapes. Tape has often been overlooked as a collection source because collecting from legacy backup tapes was historically difficult and expensive.
  2. Get proactive with legal requirements. Defining what ESI data should be kept and placed on litigation hold and what can be purged are the first steps in a proactive strategy. These legal requirements will allow clients to put a policy in place to save specific content, certain custodians and intellectual property so that it is identifiable and ready for on-demand discovery.
  3. Understand technology limitations. Only use tools that index all of the content and don’t change any of the metadata. Some older search solutions compromise the indexing process, and this may come back to haunt you in the end.
  4. Become a policy expert. As new technology comes on the market, it tends to improve and strengthen the discovery process. Taking the time to understand technology trends allows you to stay one step ahead of the game and create a current defensible collection process and apply policy to it.

So, what do you think?  Do you have “skeletons” in your ESI closet?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Case Law: Facebook Did Not Deduce That They Must Produce

In this case, United States Magistrate Judge Howard Lloyd of the Northern District of California compelled Facebook to produce ESI that was previously produced in a converted, non-searchable format and further ordered Facebook not to use a third-party vendor’s online production software to merely “provide access” to it.  The court’s order granting the plaintiff’s Motion To Compel Production in In re Facebook PPC Advertising Litigation, 2011 WL 1324516 (N.D.Cal. Apr. 6, 2011) addressed the importance of ESI Protocols, the requirement to produce ESI in native formats, and production of documents versus providing access to them.  A copy of the order can be found here.

Several plaintiffs brought a class action against Facebook for breach of contract and violation of California’s Unfair Competition Law, suing Facebook for allegedly misrepresenting the quality of its “click filters,” which are filters used to prevent charging merchants when advertisements are inadvertently clicked.  When discovery disputes occurred, plaintiffs filed their Motion To Compel, alleging:

  1. Facebook refused to agree to an ESI Protocol to establish the manner and form of electronic production, including agreement on search words or phrases, custodians and time frames for production.
  2. Facebook uploaded its responses to discovery requests to a commercial website (Watchdox.com) in a manner that seriously limited the plaintiffs’ ability to review them.  Documents on Watchdox.com could not be printed and Facebook, citing confidentiality concerns, retained the ability to cause documents to expire and no longer be accessible after a period of time.
  3. The documents loaded to Watchdox.com, as well as others that were actually produced, were not in their native format, and thus were unsearchable and unusable.  One such document was an 18,000-page customer complaint database printed to PDF that lacked any searchable features.

With regard to the refusal to agree to an ESI protocol, Facebook argued that such a protocol would result in “forcing the parties to anticipate and address all potential issues on the form of electronic production” and “would likely have the result of frustrating and slowing down the discovery process.” The court rejected this argument, noting “The argument that an ESI Protocol cannot address every single issue that may arise is not an argument to have no ESI Protocol at all”.

In reviewing Facebook’s production protocol, the Court noted that “each of these steps make the discovery process less efficient without providing any real benefit” and found that Facebook’s privacy concerns were unreasonable, since a two-tiered protective order already existed in the case and confidential documents could be marked as such to prevent inadvertent disclosures.  The Court held that Facebook’s use of Watchdox.com was unduly burdensome on the Plaintiffs and thus ordered Facebook to produce any documents that had been uploaded to Watchdox.com in their native searchable formats.  The Court also ordered Facebook to reproduce, in their native searchable formats, previously produced documents that had been provided in an unsearchable format.

So, what do you think?  Is merely providing access to documents sufficient for production?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: 4 Steps to Effective eDiscovery With Software Analytics

 

I read an interesting article from Texas Lawyer via Law.com entitled “4 Steps to Effective E-Discovery With Software Analytics” that has some interesting takes on project management principles related to eDiscovery and I’ve interjected some of my thoughts into the analysis below.  A copy of the full article is located here.  The steps are as follows:

1. With the vendor, negotiate clear terms that serve the project's key objectives.  The article notes the importance of tying each collection and review milestone (e.g., collecting and imaging data; filtering data by file type; removing duplicates; processing data for review in a specific review platform; processing data to allow for optical character recognition (OCR) searching; and converting data into a tag image file format (TIFF) for final production to opposing counsel) to contract terms with the vendor.

The specific milestones will vary – for example, conversion to TIFF may not be necessary if the parties agree to a native production – so it’s important to know the size and complexity of the project, and choose only an experienced eDiscovery vendor who can handle the variations.

2. Collect and process data.  Forensically sound data collection and culling of obviously unresponsive files (such as system files) to drastically decrease the overall review costs are key services that a vendor provides in this area.  As we’ve noted many times on this blog, effective culling can save considerable review costs – each gigabyte (GB) culled can save $16-$18K in attorney review costs.
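As an illustration of both a common culling step (filtering known system files against a list of known hashes, such as a NIST NSRL export) and the savings arithmetic, here is a hedged sketch; the $16,000 to $18,000 per GB figure is simply the estimate quoted above, and the hash list is hypothetical.

```python
import hashlib
from pathlib import Path

REVIEW_COST_PER_GB = (16_000, 18_000)   # estimated attorney review cost per GB, per the article

def md5(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

def cull(paths, known_system_hashes):
    """Drop files whose hash appears in a known-file list (e.g., a NIST NSRL export)."""
    kept, culled_bytes = [], 0
    for p in paths:
        if md5(p) in known_system_hashes:
            culled_bytes += p.stat().st_size
        else:
            kept.append(p)
    culled_gb = culled_bytes / 1024**3
    low, high = (culled_gb * cost for cost in REVIEW_COST_PER_GB)
    print(f"Culled {culled_gb:.1f} GB; estimated review savings ${low:,.0f} to ${high:,.0f}")
    return kept
```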

The article notes that a hidden cost is the OCR process of converting image-based content into searchable text and that it’s an optimal negotiation point with the vendor.  This may have been true when most collections were paper-based, but since most collections today are electronic, the percentage of documents requiring OCR is considerably lower than it used to be.  However, it is important to be prepared for the fact that some native files will be “image only”, such as TIFFs and scanned PDFs – those will require OCR to be effectively searched.
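To flag those image-only files up front, one simple heuristic is to check the file type and, for PDFs, whether any text layer can be extracted.  Here is a rough sketch assuming the third-party pypdf package is available; production processing tools use far more robust detection.

```python
from pathlib import Path
from pypdf import PdfReader   # assumes the pypdf package is installed

IMAGE_EXTENSIONS = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}

def needs_ocr(path: Path, min_chars: int = 25) -> bool:
    """Rough check for files with no usable text layer that would need OCR to be searchable."""
    suffix = path.suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return True
    if suffix == ".pdf":
        text = "".join(page.extract_text() or "" for page in PdfReader(path).pages)
        return len(text.strip()) < min_chars   # scanned PDFs typically extract little or no text
    return False
```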

3. Select a data and document review platform.  Factors such as ease of use, robustness, and reliability of analytic tools, support staff accessibility to fix software bugs quickly, monthly user and hosting fees, and software training and support fees should be considered when selecting a document review platform.

The article notes that a hidden cost is selecting a platform with which the firm’s litigation support staff has no experience as follow-up consultation with the vendor could be costly.  This can be true, though a good vendor training program and an intuitive interface can minimize or even eliminate this component.

The article also notes that to take advantage of the vendor’s more modern technology “[a] viable option is to use a vendor's review platform that fits the needs of the current data set and then transfer the data to the in-house system”.  I’m not sure why the need exists to transfer the data back – there are a number of vendors that provide a cost-effective solution appropriate for the duration of the case.

4. Designate clear areas of responsibility.  By doing so, you minimize or eliminate inefficiencies in the project and the article mentions the RACI matrix to determine who is responsible (individuals responsible for performing each task, such as review or litigation support), accountable (the attorney in charge of discovery), consulted (the lead attorney on the case), and informed (the client).

Managing these areas of responsibility effectively is probably the biggest key to project success and the article does a nice job of providing a handy reference model (the RACI matrix) for defining responsibility within the project.
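If it helps to see the RACI idea concretely, here is a tiny sketch of what such a matrix might look like in code form; the tasks and role assignments are purely illustrative.

```python
# R = Responsible, A = Accountable, C = Consulted, I = Informed (illustrative assignments)
RACI = {
    "Collect custodian data":         {"R": "Litigation support", "A": "Discovery attorney",
                                       "C": "Lead attorney",      "I": "Client"},
    "First pass review":              {"R": "Review team",        "A": "Discovery attorney",
                                       "C": "Lead attorney",      "I": "Client"},
    "Production to opposing counsel": {"R": "Litigation support", "A": "Discovery attorney",
                                       "C": "Lead attorney",      "I": "Client"},
}

for task, roles in RACI.items():
    print(f"{task}: " + ", ".join(f"{k}={v}" for k, v in roles.items()))
```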

So, what do you think?  Do you have any specific thoughts about this article?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Case Law: Conclusion of Case Does Not Preclude Later Sanctions

In Green v. Blitz U.S.A., Inc., (E.D. Tex. Mar. 1, 2011), the defendant in a product liability action that had been settled over a year earlier was sanctioned for “blatant discovery abuses” prior to the settlement. Defendant was ordered to add $250,000 to its settlement with plaintiff, to provide a copy of the court’s order to every plaintiff in every lawsuit against defendant for the past two years or else forfeit an additional $500,000 “purging” sanction, and to include the order in its first responsive pleading in every lawsuit for the next five years in which defendant became involved.

Defendant, a manufacturer of gasoline containers, was named in several product liability lawsuits, including this case in which plaintiff alleged that her husband’s death was caused in part by the lack of a flame arrestor on defendant’s gas cans. The jury in plaintiff’s case returned a verdict for defendant after counsel for defendant argued that “science shows” that flame arrestors did not work. The case was settled after the jury verdict for an undisclosed amount, but two years later, counsel for plaintiff sought sanctions and to have the case reopened after learning in another case against defendant that while the gas can lawsuits were underway, defendant had been instructing its employees to destroy email.

The court described defendant’s failure to implement a litigation hold as gas can cases were filed. A single employee met with other employees to ask them to look for documents, but he did not have any electronic searches made for documents and he did not consult with defendant’s information technology department on how to retrieve electronic documents.

The court held that defendant willfully violated the discovery order in the case by not producing key documents such as a handwritten note indicating a desire to install flame arrestors on gas cans and an email noting that the technology for flame arrestors existed given the common use of flame arrestors in the marine industry. “Any competent electronic discovery effort would have located this email,” according to the court, through a key word search. Defendant’s employee in charge of discovery did not conduct a key word search and, despite acknowledging that he was as computer “illiterate as they get,” did not seek help from defendant’s information technology department, which was routinely sending out instructions to employees to delete email and rotating backup tapes every two weeks while the litigation was underway.

The court declined to reopen the case since it had been closed for a year. However, based on its knowledge of the confidential settlement of the parties, the court ordered defendant to pay plaintiff an additional $250,000 as a civil contempt sanction to match the minimum amount that the settlement would have been if plaintiff had been provided documents withheld by defendant. The court also ordered a “civil purging sanction” of $500,000 which defendant could avoid upon showing proof that a copy of the court’s decision had been provided to every plaintiff in a lawsuit against defendant for the past two years. The court added a requirement that defendant include a copy of the court’s opinion in its first pleading in any lawsuit for the next five years in which defendant became a party.

As Yogi Berra would say, “It ain’t over ‘til it’s over”.

So, what do you think?  Should cases be re-opened after they’re concluded for discovery violations?  Please share any comments you might have or if you’d like to know more about a particular topic.

Case Summary Source: Applied Discovery (free subscription required).  For eDiscovery news and best practices, check out the Applied Discovery Blog here.


eDiscovery Best Practices: Your ESI Collection May Be Larger Than You Think

 

Here’s a sample scenario: You identify custodians relevant to the case and collect files from each.  Roughly 100 gigabytes (GB) of Microsoft Outlook email PST files and loose “efiles” is collected in total from the custodians.  You identify a vendor to process the files to load into a review tool, so that you can perform first pass review and, eventually, linear review and produce the files to opposing counsel.  After processing, the vendor sends you a bill – and they’ve charged you to process over 200 GB!!  What happened?!?

Did the vendor accidentally “double-bill” you?  That would be great – but no.  There’s a much more logical explanation and, unfortunately, you may wind up paying a lot more to process these files than you expected.

Many of the files in most ESI collections are stored in what are known as “archive” or “container” files.  For example, as noted above, Outlook emails are typically saved for each custodian in a personal storage (.PST) file format, which is an expanding container file. For most custodians, all of their email (and the corresponding attachments, if present) resides in a few PST files.  The scanned size for the PST file is the size of the file on disk.

Did you ever see one of those vacuum bags that you store clothes in and then suck all the air out so that the clothes won’t take up as much space?  The PST file is like one of those vacuum bags – it typically stores the emails and attachments in a compressed format to save space.  When the emails and attachments are processed into a review tool, they are expanded to their normal size.  This expanded size can be 1.5 to 2 times larger than the scanned size (or more).  And, that’s what many vendors will bill on – the expanded size.

There are other types of archive container files that compress the contents – .zip and .rar files are two examples of compressed container files.  These files are often used not only to compress files for storage on hard drives, but also to compact or group a set of files when transmitting them, usually in – you guessed it – email.  With email comprising a majority of most ESI collections and the popularity of other archive container files for compressing file collections, the expanded size of your collection may be considerably larger than it appears when stored on disk.  It’s important to be prepared for that and know your options when processing that data, so you can effectively anticipate those processing costs.
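You can see the “vacuum bag” effect for yourself on a .zip container with Python’s standard zipfile module, which reports both the compressed (on-disk) and expanded sizes.  PST files compress differently internally, so treat the ratio below as an illustration rather than a rule of thumb; the file name is hypothetical.

```python
import zipfile

def expansion_ratio(zip_path: str) -> float:
    """Compare a container's compressed (on-disk) size to its expanded size."""
    with zipfile.ZipFile(zip_path) as zf:
        compressed = sum(info.compress_size for info in zf.infolist())
        expanded   = sum(info.file_size for info in zf.infolist())
    return expanded / compressed if compressed else float("inf")

# Hypothetical usage:
# print(f"Expands to {expansion_ratio('custodian_files.zip'):.1f}x its size on disk")
```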

So, what do you think?  Have you ever been surprised by processing costs of your ESI?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: Testing Your Search Using Sampling

Friday, we talked about how to determine an appropriate sample size to test your search results as well as the items NOT retrieved by the search, using a site that provides a sample size calculator.  Yesterday, we talked about how to make sure the sample size is randomly selected.
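The calculator site does the math for you, but for the curious, the sample sizes used in today’s example (662, 664, 461 and 595) are consistent with the standard Cochran formula plus a finite population correction.  Here is a sketch of that calculation; the calculator’s exact method isn’t stated, so treat this as an approximation.

```python
import math

Z_SCORES = {0.95: 1.960, 0.99: 2.576}   # z-scores for common confidence levels

def sample_size(population: int, confidence: float = 0.99,
                margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's sample size formula with a finite population correction."""
    z = Z_SCORES[confidence]
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # infinite-population estimate (about 664)
    n = n0 / (1 + (n0 - 1) / population)        # correct for the actual population size
    return math.ceil(n)

print(sample_size(200_000))     # 662 - retrieved files in the first test below
print(sample_size(1_000_000))   # 664 - NOT retrieved files in the first test below
```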

Today, we’ll walk through an example of how you can test and refine a search using sampling.

TEST #1: Let’s say we work for an oil company and we’re looking for documents related to oil rights.  To try to be as inclusive as possible, we will search for “oil” AND “rights”.  Here is the result:

  • Files retrieved with “oil” AND “rights”: 200,000
  • Files NOT retrieved with “oil” AND “rights”: 1,000,000

Using the sample size calculator site we identified before, we determine a sample size of 662 for the retrieved files and 664 for the NOT retrieved files to achieve a 99% confidence level with a margin of error of 5%.  We then use this site to generate random numbers and proceed to review each item in the retrieved and NOT retrieved item sets to determine responsiveness to the case.  Here are the results:

  • Retrieved Items: 662 reviewed, 24 responsive, 3.6% responsive rate.
  • NOT Retrieved Items: 664 reviewed, 661 non-responsive, 99.5% non-responsive rate.

Nearly every item in the NOT retrieved category was non-responsive, which is good.  But, only 3.6% of the retrieved items were responsive, which means our search was WAY over-inclusive.  At that rate, 192,800 out of 200,000 files retrieved will be NOT responsive and will be a waste of time and resources to review.  Why?  Because, as we determined during the review, almost every published and copyrighted document in our oil company has the phrase “All Rights Reserved” in the document and will be retrieved.

TEST #2: Let’s try again.  This time, we’ll conduct a phrase search for “oil rights” (which requires those words as an exact phrase).  Here is the result:

  • Files retrieved with “oil rights”: 1,500
  • Files NOT retrieved with “oil rights”: 1,198,500

This time, we determine a sample size of 461 for the retrieved files and (again) 664 for the NOT retrieved files to achieve a 99% confidence level with a margin of error of 5%.  Even though we still have a sample size of 664 for the NOT retrieved files, we generate a new list of random numbers to review those items, as well as the 461 randomly selected retrieved items.  Here are the results:

  • Retrieved Items: 461 reviewed, 435 responsive, 94.4% responsive rate.
  • NOT Retrieved Items: 664 reviewed, 523 non-responsive, 78.8% non-responsive rate.

Nearly every item in the retrieved category was responsive, which is good.  But, only 78.8% of the NOT retrieved items were non-responsive, which means over 20% of the NOT retrieved items were actually responsive to the case (we also failed to retrieve 8 of the items identified as responsive in the first iteration).  So, now what?

TEST #3: If you saw this previous post, you know that proximity searching is a good alternative for finding hits that are close to each other without requiring the exact phrase.  So, this time, we’ll conduct a proximity search for “oil within 5 words of rights”.  Here is the result:

  • Files retrieved with “oil within 5 words of rights”: 5,700
  • Files NOT retrieved with “oil within 5 words of rights”: 1,194,300

This time, we determine a sample size of 595 for the retrieved files and (once again) 664 for the NOT retrieved files, generating a new list of random numbers for both sets of items.  Here are the results:

  • Retrieved Items: 595 reviewed, 542 responsive, 91.1% responsive rate.
  • NOT Retrieved Items: 664 reviewed, 655 non-responsive, 98.6% non-responsive rate.

Over 90% of the items in the retrieved category were responsive AND nearly every item in the NOT retrieved category was non-responsive, which is GREAT.  Also, all but one of the items previously identified as responsive was retrieved.  So, this is a search that appears to maximize recall and precision.
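Your search tool evaluates the proximity operator for you, but as a naive sketch of what “oil within 5 words of rights” means, here is an illustration in Python.  Real search engines handle stemming, punctuation and word counting in their own ways, so the details here are assumptions.

```python
import re

def within_n_words(text: str, term_a: str, term_b: str, n: int = 5) -> bool:
    """True if term_a and term_b appear within n word positions of each other."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(a - b) <= n for a in pos_a for b in pos_b)

print(within_n_words("The company acquired mineral and oil drilling rights in 2004.",
                     "oil", "rights"))   # True - the terms are two word positions apart
print(within_n_words("All rights reserved. This report discusses refinery output and crude oil prices.",
                     "oil", "rights"))   # False - too far apart to be a hit
```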

Had we proceeded with the original search, we would have reviewed 200,000 files – 192,800 of which would have been NOT responsive to the case.  By testing and refining, we only had to review 8,815 files – 3,710 sample files reviewed plus the remaining retrieved items from the third search (5,700 – 595 = 5,105) – most of which ARE responsive to the case.  We saved tens of thousands in review costs while still retrieving most of the responsive files, using a defensible approach.

Keep in mind that this is a simple example — we’re not taking into account misspellings and other variations we may want to include in our criteria.

So, what do you think?  Do you use sampling to test your search results?   Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: A “Random” Idea on Search Sampling

 

Friday, we talked about how to determine an appropriate sample size to test your search results as well as the items NOT retrieved by the search, using a site that provides a sample size calculator.  Today, we’ll talk about how to make sure the sample size is randomly selected.

A randomly selected sample gives each file an equal chance of being reviewed and eliminates the chance of bias being introduced into the sample which might skew the results.  Merely selecting the first or last x number of items (or any other group) in the set may not reflect the population as a whole – for example, all of those items could come from a single custodian.  To ensure a fair, defensible sample, it needs to be selected randomly.

So, how do you select the numbers randomly?  Once again, the Internet helps us out here.

One site, Random.org, has a random integer generator which will randomly generate whole numbers.  You simply supply the number of random integers you need and the starting and ending numbers of the range within which the generated numbers should fall.  The site will then generate a list of numbers that you can copy and paste into a text file or even a spreadsheet.  The site also offers an Advanced mode with options for the number format (e.g., decimal, hexadecimal), the output format and how the randomization is ‘seeded’ (to generate the numbers).

In the example from Friday, you would provide 660 as the number of random integers to be generated, with a starting number of 1 and an ending number of 100,000 to get a list of random numbers for testing your search that yielded 100,000 files with hits (664, 1 and 1,000,000 respectively to get a list of numbers to test the non-hits).  You could paste the numbers into a spreadsheet, sort them and then retrieve the files by position in the result set based on the random numbers retrieved and review each of them to determine whether they reflect the intent of the search.  You’ll then have a good sense of how effective your search was, based on the random sample.  And, probably more importantly, using that random sample to test your search results will be a highly defensible method to verify your approach in court.
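If you’d rather stay inside your own scripts than visit a website, Python’s standard random module can do the same job.  Here is a minimal sketch that draws 660 unique positions between 1 and 100,000, as in Friday’s example; random.sample draws without replacement, and you would only seed the generator if you need to reproduce the exact draw later.

```python
import random

population_size = 100_000   # files with hits, from Friday's example
n = 660                     # sample size for that population

rng = random.Random()       # optionally random.Random(seed) for a reproducible draw
positions = sorted(rng.sample(range(1, population_size + 1), n))

# Each number is a position in the (consistently ordered) result set; pull those
# files and review them to see whether they reflect the intent of the search.
print(positions[:10])
```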

Tomorrow, we'll walk through a sample iteration to show how the sampling will ultimately help us refine our search.

So, what do you think?  Do you use sampling to test your search results?   Please share any comments you might have or if you’d like to know more about a particular topic.