Analysis

What’s in a Name? Potentially, a Lot of Permutations – eDiscovery Best Practices

When looking for documents in your collection that mention key individuals, conducting a name search for those individuals isn’t always as straightforward as you might think.  A name can potentially be represented in a number of different ways, and if you don’t account for each of them, you might fail to retrieve key responsive documents – OR retrieve way too many non-responsive ones.  Here are some considerations for conducting name searches.

The Ever-Limited Phrase Search vs. Proximity Searching

Routinely, when clients give me their preliminary search term lists to review, they include the names of individuals they want to search for, like this:

  • “Jim Smith”
  • “Doug Austin”

Phrase searches are the most limited alternative for searching because the search must exactly match the phrase.  For example, a phrase search of “Jim Smith” won’t retrieve “Smith, Jim” if his name appears that way in the documents.

That’s why I prefer to use a proximity search for individual names: it catches several variations and expands the recall of the search.  Proximity searching is simply looking for two or more words that appear close to each other in the document.  A proximity search for “Jim within 3 words of Smith” will retrieve “Jim Smith”, “Smith, Jim”, and even “Jim T. Smith”.  Proximity searching is also a more precise option in most cases than “AND” searches – Doug AND Austin will retrieve any document where someone named Doug is in (or traveling to) Austin, whereas “Doug within 3 words of Austin” will ensure those words are near each other, making it much more likely they’re responsive to the name search.

Accounting for Name Variations

Proximity searches won’t always account for all variations in a person’s name.  What are other variations of the name “Jim”?  How about “James” or “Jimmy”?  Or even “Jimbo”?  I have a friend named “James” who is called “Jim” by some of his friends and “Jimmy” by a few others.  Also, some documents may refer to him by his initials – e.g., “J.T. Smith”.  All are potential variations to search for in your collection.

Common name derivations like those above can be deduced in many cases, but you may not always know the middle name or initial.  If not, it may take searching just the last name and sampling several documents until you are able to determine that middle initial for searching (this may also enable you to identify nicknames like “JayDog”, which could be important given the frequently informal tone of emails, even business emails).

Applying the proximity and name variation concepts into our search, we might perform something like this to get our “Jim Smith” documents:

(jim OR jimmy OR james OR “j.t.”) w/3 Smith, where “w/3” is “within 3 words of”.  This is the syntax you would use to perform the search in OnDemand®, CloudNine Discovery’s online review tool.

That’s a bit more inclusive than the “Jim Smith” phrase search the client originally gave me.
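If it helps to see the mechanics, here is a minimal sketch of how a proximity match like this can be evaluated – it assumes simple tokenization and is illustrative only (OnDemand®’s actual implementation will differ):

  import re

  def tokenize(text):
      """Lowercase word tokens; periods are kept so initials like "j.t." survive."""
      return re.findall(r"[a-z.]+", text.lower())

  def within(tokens, variations, last_name, distance=3):
      """True if any name variation appears within `distance` words of the last name."""
      positions = {term: [i for i, tok in enumerate(tokens) if tok == term]
                   for term in variations + [last_name]}
      return any(abs(i - j) <= distance
                 for term in variations
                 for i in positions[term]
                 for j in positions[last_name])

  doc = "Please forward this to Smith, Jim by Friday."
  print(within(tokenize(doc), ["jim", "jimmy", "james", "j.t."], "smith"))  # True

Note that “Smith, Jim” matches here, even though the phrase search “Jim Smith” would have missed it.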

BTW, why did I use “jim OR jimmy” instead of the wildcard “jim*”?  Because wildcard searches could yield additional terms I might not want (e.g., Joe Smith jimmied the lock).  Don’t get wild with wildcards!  Using the specific variations you want (e.g., “jim OR jimmy”) is usually best.
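A quick sketch of that over-inclusiveness, using Python’s fnmatch module to stand in for a search engine’s wildcard matching:

  import fnmatch

  # Terms that might appear in an indexed collection
  terms = ["jim", "jimmy", "james", "jimmied", "jimenez"]
  print(fnmatch.filter(terms, "jim*"))
  # ['jim', 'jimmy', 'jimmied', 'jimenez'] – false hits like "jimmied" and
  # "jimenez" come along for the ride, while "james" is missed entirely.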

Thursday, we will talk about another way to retrieve documents that mention key individuals – through their email addresses.  Same bat time, same bat channel!

So, what do you think?  How do you handle searching for key individuals within your document collections?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Judge Carter Refuses to Recuse Judge Peck in Da Silva Moore – eDiscovery Trends

It seems like ages ago when New York Magistrate Judge Andrew J. Peck denied the plaintiffs’ motion in Da Silva Moore v. Publicis Groupe & MSL Group, No. 11 Civ. 1279 (ALC) (AJP) that he recuse himself from the case.  It was all the way back in June.  Now, District Court Judge Andrew L. Carter, Jr. has ruled on the plaintiffs’ recusal request.

In his order from last Wednesday (November 7), Judge Carter stated as follows:

“On the basis of this Court’s review of the entire record, the Court is not persuaded that sufficient cause exists to warrant Magistrate Judge Peck’s disqualification…Judge Peck’s decision accepting computer-assisted review … was not influenced by bias, nor did it create any appearance of bias.”

Judge Carter also noted, “Disagreement or dissatisfaction with Magistrate Judge Peck’s ruling is not enough to succeed here…A disinterested observer fully informed of the facts in this case would find no basis for recusal”.

Since it has been a while, let’s recap the case for those who may have not been following it and may be new to the blog.

Back in February, Judge Peck issued an opinion making this likely the first case to accept the use of computer-assisted review of electronically stored information (“ESI”).  However, on March 13, District Court Judge Andrew L. Carter, Jr. granted the plaintiffs’ request to submit additional briefing on their February 22 objections to the ruling.  In that briefing (filed on March 26), the plaintiffs claimed that the protocol approved for predictive coding “risks failing to capture a staggering 65% of the relevant documents in this case” and questioned Judge Peck’s relationship with defense counsel and with the selected vendor for the case, Recommind.

Then, on April 5, Judge Peck issued an order in response to the plaintiffs’ letter requesting his recusal, directing plaintiffs to indicate whether they would file a formal motion for recusal or ask the Court to consider the letter as the motion.  On April 13 (Friday the 13th, that is), the plaintiffs did just that, formally requesting the recusal of Judge Peck (the defendants filed a response in opposition on April 30).  But, on April 25, Judge Carter issued an opinion and order in the case, upholding Judge Peck’s opinion approving computer-assisted review.

Not done, the plaintiffs filed an objection on May 9 to Judge Peck’s rejection of their request to stay discovery pending the resolution of outstanding motions and objections (including the recusal motion, which had yet to be ruled on at the time).  Then, on May 14, Judge Peck issued a stay, stopping defendant MSLGroup’s production of electronically stored information.  Finally, on June 15, in a 56-page opinion and order, Judge Peck denied the plaintiffs’ motion for recusal, which Judge Carter has now upheld.

So, what do you think?  Will Judge Carter’s decision not to recuse Judge Peck restart the timetable for predictive coding on this case?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Does This Scare You? – eDiscovery Horrors!

Today is Halloween.  While we could try to “scare” you with the traditional “frights”, we’re an eDiscovery blog, so every year we try to “scare” you in a different way instead.  Does this scare you?

The defendant had previously been sanctioned $500,000 ($475,000 to the plaintiff and $25,000 to the court) and held in contempt of court for spoliation by the magistrate judge, who also recommended that an adverse inference instruction be issued at trial.  The defendant appealed to the district court, where Minnesota District Judge John Tunheim increased the award to the plaintiff to $600,000.  Oops!

What about this?

Even though the litigation hold letter from April 2008 was sent to the primary custodians, at least one principal was determined to have actively deleted relevant emails. Additionally, the plaintiffs made no effort to suspend the automatic destruction policy for emails, so deleted emails could not be recovered.  Ultimately, the court found that 9 of 14 key custodians had deleted relevant documents. After the defendants raised their spoliation concerns with the court, the plaintiffs continued to delete relevant information, including decommissioning and discarding an email server without preserving any of the relevant ESI.  As a result, the New York Supreme Court imposed the severest of sanctions against the plaintiffs for spoliation of evidence – dismissal of their $20 million case.

Or this?

For most organizations, information volume doubles every 18-24 months and 90% of the data in the world has been created in the last two years. In a typical company in 2011, storing that data consumed about 10% of the IT budget. At a growth rate of 40% (even as storage unit costs decline), storing this data will consume over 20% of the typical IT budget by 2014.
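For the arithmetically inclined, that projection roughly checks out.  Here’s a back-of-the-envelope sketch – the 40% data growth rate comes from the figures above, but the 10% annual unit-cost decline is purely our assumption:

  # Storage's share of the IT budget grows with data volume and shrinks
  # as unit costs fall.  Assume 40% annual data growth and (hypothetically)
  # a 10% annual decline in storage unit costs.
  share = 0.10  # storage share of the IT budget in 2011
  for year in (2012, 2013, 2014):
      share *= 1.40 * 0.90
      print(year, f"{share:.1%}")
  # 2014 lands right around 20% – and over 20% if unit costs fall any slower.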

How about this?

There “was stunned silence by all attorneys in the court room after that order. It looks like neither side saw it coming.”

Or maybe this?

If you have deleted any of your photos from Facebook in the past three years, you may be surprised to find that they are probably still on the company’s servers.

Scary, huh?  If the possibility of sanctions, exponential data growth and judges ordering parties to perform predictive coding keep you awake at night, then the folks at eDiscovery Daily will do our best to provide useful information and best practices to enable you to relax and sleep soundly, even on Halloween!

Then again, if the expense, difficulty and risk of processing and loading up to 100 GB of data into an eDiscovery review application that you’ve never used before terrifies you, maybe you should check this out.

Of course, if you seriously want to get into the spirit of Halloween, click here.  This will really terrify you!

Those of you who are really mortified that the next post in Jane Gennarelli’s “Litigation 101” series won’t run this week, fear not – it will run tomorrow.

What do you think?  Is there a particular eDiscovery issue that scares you?  Please share your comments and let us know if you’d like more information on a particular topic.

Happy Halloween!

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Both Sides Instructed to Use Predictive Coding or Show Cause Why Not – eDiscovery Case Law

As reported in Ralph Losey’s e-Discovery Team® blog, Vice Chancellor J. Travis Laster in Delaware Chancery Court – in EORHB, Inc., et al v. HOA Holdings, LLC, C.A. No. 7409-VCL (Del. Ch. Oct. 15, 2012) – has issued a “surprise” bench order requiring both sides to use predictive coding and to use the same vendor.

As Ralph notes, this “appears to be the first time a judge has required both sides of a dispute to use predictive coding when neither has asked for it. It may also be the first time a judge has ordered parties to use the same vendor.”  Vice Chancellor Laster’s instruction was as follows:

“This seems to me to be an ideal non-expedited case in which the parties would benefit from using predictive coding.  I would like you all, if you do not want to use predictive coding, to show cause why this is not a case where predictive coding is the way to go.

I would like you all to talk about a single discovery provider that could be used to warehouse both sides’ documents to be your single vendor.  Pick one of these wonderful discovery super powers that is able to maintain the integrity of both side’s documents and insure that no one can access the other side’s information.  If you cannot agree on a suitable discovery vendor, you can submit names to me and I will pick one for you.

One thing I don’t want to do – one of the nice things about most of these situations is once people get to the indemnification realm, particularly if you get the business guys involved, they have some interest in working out a number and moving on.  The problem is that these types of indemnification claims can generate a huge amount of documents.  That’s why I would really encourage you all, instead of burning lots of hours with people reviewing, it seems to me this is the type of non-expedited case where we could all benefit from some new technology use.”

Ralph notes that there “was stunned silence by all attorneys in the court room after that order. It looks like neither side saw it coming.”  It will be interesting to see if either or both parties proceed to object and attempt to “show cause” as to why they shouldn’t use predictive coding.

So, what do you think?  Is this an isolated case or the start of a trend?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Are You Requesting the Best Production Format for Your Case? – eDiscovery Best Practices

One of the blogs I read regularly is Ball in your Court from Craig Ball, a previous thought leader interviewee on this blog.  His post from last Tuesday, Are They Trying to Screw Me?, is one that all attorneys who request ESI productions should read.

Ball describes a fairly typical proposed production format, as follows:

“Documents will be produced as single page TIFF files with multi-page extracted text or OCR.  We will furnish delimited IPRO or Opticon load files and will later identify fielded information we plan to exchange.”

Then, he asks the question: “Are they trying to screw you?”  Answer: “Probably not.”  But, “Are you screwing yourself by accepting the proposed form of production?  Yes, probably.”

With regard to producing TIFF files, Ball notes that “Converting a native document to TIFF images is lobotomizing the document.”  The TIFF image is devoid of any of the metadata that provides valuable information about the way in which the document was used, making analysis of the produced documents a much more difficult effort.  Ball sums up TIFF productions by saying “Think of a TIFF as a PDF’s retarded little brother.  I mean no offense by that, but TIFFs are not just differently abled; they are severely handicapped.  Not born that way, but lamed and maimed on purpose.  The other side downgrades what they give you, making it harder to use and stripping it of potentially-probative content.”
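If you want to see the “lobotomy” for yourself, here is a hedged sketch comparing the application metadata in a native Word file to what a TIFF rendition retains.  It uses the python-docx and Pillow libraries, and the file names are hypothetical:

  from docx import Document   # pip install python-docx
  from PIL import Image       # pip install Pillow

  # The native file carries authorship and revision metadata...
  props = Document("contract.docx").core_properties
  print("Author:", props.author)
  print("Created:", props.created)
  print("Last modified by:", props.last_modified_by)
  print("Revision:", props.revision)

  # ...while the TIFF rendition carries only image-level tags
  # (dimensions, resolution, compression) – the document metadata is gone.
  page = Image.open("contract_page1.tif")
  print(dict(page.tag_v2))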

Opposing counsel isn’t trying to screw you with a TIFF production.  They just do it because they always provide it that way.  And, you accept it that way because you’ve always accepted it that way.  Ball notes that “You may accept the screwed up proposal because, even if the data is less useful and incomplete, you won’t have to evolve.  You’ll pull the TIFF images into your browser and painstakingly read them one-by-one, just like good ol’ paper; all-the-while telling yourself that what you didn’t get probably wasn’t that important and promising yourself that next time, you’ll hold out for the good stuff—the native stuff.”

We recently ran a blog series called First Pass Review – Of Your Opponent’s Data.  In that series, we discussed how useful Early Data Assessment/First Pass Review applications can be in reviewing your opponent’s produced ESI.  At CloudNine Discovery, we use FirstPass®, powered by Venio FPR™, for first pass review – it provides a number of mechanisms that are useful in analyzing your opponent’s produced data.  Capabilities like email analytics and message thread analysis (where missing emails in threads can be identified), synonym searching, fuzzy searching and domain categorization are quite useful in developing an understanding of your opponent’s production.  However, these mechanisms are only as useful as the data they’re analyzing.  Email analytics, message thread analysis and domain categorization are driven by metadata, so they are useless on TIFF/OCR productions.  You can’t analyze what you don’t have.

It’s time to evolve.  To get the most information out of your opponent’s production, you need to request the production in native format.  Opponents are probably not trying to screw you by producing in TIFF format, but you are screwing yourself if you decide to accept it in that format.

So, what do you think?  Do you request native productions from your opponents?  If not, why not?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Don’t Be “Duped”, Files with Different HASH Values Can Still Be the Same – eDiscovery Best Practices

A couple of months ago, we published a post discussing how the number of pages in each gigabyte can vary widely.  To help illustrate the concept, we saved one of our blog posts in several different file formats to show how each file had the same content, yet a different size.  That’s not the only concept that example illustrates.

Content is Often Republished

How many of you have ever printed or saved a file to Adobe Acrobat PDF format?  Personally, I do it all the time.  For example, I “publish” marketing slicks created in Microsoft® Publisher, “publish” finalized client proposals created in Microsoft Word and “publish” presentations created in Microsoft PowerPoint to PDF format regularly.  Microsoft now even includes Adobe PDF as one of the standard file formats to which you can save a file, and I have a free PDF print driver on my laptop, so I can conceivably create a PDF file for just about anything that I can print.  In each case, I’m duplicating the content of the file, but in a different file format designed for publishing that content.

Another way content is republished is via the ubiquitous “copy and paste” capability that is used by so many to duplicate content to another file.  Whether copying part or all of the content, “copy and paste” functionality is available in just about every application, enabling you to duplicate content from one application to another, or even from one file to the next within the same application.

Same Content, Different HASH

When publishing a file to PDF or copying the entire contents of a file to a new file, the content of the file may be the same, but the HASH value, which is a digital fingerprint that reflects the contents and format of the file, will be different.  So, a Word file and the PDF file published from it may contain the same content, but their HASH values will be different.  Even copying the content from one file to another in the same software program can result in different HASH values, or even different file sizes.  For example, I copied the entire contents of yesterday’s blog post, written in Word, into a brand new Word file.  Not only did the two files have different HASH values, they were different sizes – the copied file was 8K smaller than the original.  So, these files, while identical in content, won’t be considered “duplicates” based on HASH value and won’t be “de-duped” out of the collection.  Instead, they are treated as “near-dupes” for analysis purposes, even though the content is essentially identical.
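Here’s a minimal sketch of the concept using Python’s hashlib – even a single extra byte (say, a trailing newline introduced during a copy) produces a completely different fingerprint:

  import hashlib

  def file_hash(path, algorithm="md5"):
      """Digital fingerprint of a file's raw bytes (content and format)."""
      digest = hashlib.new(algorithm)
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(8192), b""):
              digest.update(chunk)
      return digest.hexdigest()

  # Two files whose visible text is identical...
  with open("original.txt", "w") as f:
      f.write("Same content, different HASH values.")
  with open("copied.txt", "w") as f:
      f.write("Same content, different HASH values.\n")  # one extra byte

  print(file_hash("original.txt"))
  print(file_hash("copied.txt"))  # an entirely different digest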

What to Do with the Near-Dupes?

Identifying and culling these essentially identical near-dupes isn’t necessary in every case, but if it is, you’ll need to perform a process that groups similar documents together so that those near-dupes can be identified and addressed.  We call that “clustering”.  For more on the benefits of clustering, check out this blog post.
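One common way to measure that similarity is to compare overlapping word sequences (“shingles”) between documents.  This is a sketch of the general idea only, not any particular clustering tool’s implementation:

  def shingles(text, k=5):
      """The set of overlapping k-word sequences in a document."""
      words = text.lower().split()
      return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

  def jaccard(a, b):
      """Similarity of two shingle sets: 1.0 means identical."""
      return len(a & b) / len(a | b)

  doc1 = "the quarterly report was finalized and published to pdf on friday afternoon"
  doc2 = "the quarterly report was finalized and published to pdf on friday"
  print(f"{jaccard(shingles(doc1), shingles(doc2)):.2f}")  # 0.88 – near-dupes

Documents whose scores exceed some threshold (say, 0.80) would be grouped into the same cluster for review.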

So, what do you think?  What do you do with “dupes” that have different HASH values?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

eDiscovery Daily is Two Years Old Today!


It’s hard to believe that it was two years ago today that we launched the eDiscoveryDaily blog.  Now that we’ve hit the “terrible twos”, is the blog going to start going off on rants about various eDiscovery topics, like Will McAvoy in The Newsroom?  Maybe.  Or maybe not.  Wouldn’t that be fun!

As we noted when recently acknowledging our 500th post, we have seen traffic on our site (from our first three months of existence to our most recent three months) grow an amazing 442%!  Our subscriber base has nearly doubled in the last year alone!  We now have nearly seven times the visitors to the site as we did when we first started.  We continue to appreciate the interest you’ve shown in the topics and will do our best to continue to provide interesting and useful eDiscovery news and analysis.  That’s what this blog is all about.  And, in each post, we like to ask you to “please share any comments you might have or if you’d like to know more about a particular topic”, so we encourage you to do so to make this blog even more useful.

We also want to thank the blogs and publications that have linked to our posts and raised our public awareness, including Pinhawk, The Electronic Discovery Reading Room, Unfiltered Orange, Litigation Support Blog.com, Litigation Support Technology & News, Ride the Lightning, InfoGovernance Engagement Area, Learn About E-Discovery, Alltop, Law.com, Justia Blawg Search, Atkinson-Baker (depo.com), ABA Journal, Complex Discovery, Next Generation eDiscovery Law & Tech Blog and any other publication that has picked up at least one of our posts for reference (sorry if I missed any!).  We really appreciate it!

We like to take a look back every six months at some of the important stories and topics during that time.  So, here are some posts over the last six months you may have missed.  Enjoy!

We talked about best practices for issuing litigation holds and how issuing the litigation hold is just the beginning.

By the way, did you know that if you deleted a photo on Facebook three years ago, it may still be online?

We discussed states (Delaware, Pennsylvania and Florida) that have implemented new rules for eDiscovery in the past few months.

We talked about how to achieve success as a non-attorney in a law firm, providing quality eDiscovery services to your internal “clients” and how to be an eDiscovery consultant, and not just an order taker, for your clients.

We warned you that stop words can stop your searches from being effective, talked about how important it is to test your searches before the meet and confer and discussed the importance of the first 7 to 10 days once litigation hits in addressing eDiscovery issues.

We told you that, sometimes, you may need to collect from custodians that aren’t there, differentiated between quality assurance and quality control and discussed the importance of making sure that file counts add up to what was collected (with an example, no less).

By the way, did you know the number of pages in a gigabyte can vary widely and the same exact content in different file formats can vary by as much as 16 to 20 times in size?

We provided a book review on Zubulake’s e-Discovery and then interviewed the author, Laura Zubulake, as well.

BTW, eDiscovery Daily has had 150 posts related to eDiscovery Case Law since the blog began.  Fifty of them have been in the last six months.

P.S. – We haven’t missed a business day without a post yet.  Yes, we are crazy.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

eDiscovery Best Practices: Quality Control, Making Sure the Numbers Add Up


Yesterday, we wrote about tracking file counts from collection to production, the concept of expanded file counts, and the categorization of files during processing.  Today, let’s walk through a scenario to show how the files collected are accounted for during the discovery process.

Tracking the Counts after Processing

We discussed the typical categories of excluded files after processing – obviously, what’s not excluded is available for searching and review.  Even if your approach includes a technology assisted review (TAR) methodology such as predictive coding, it’s still likely that you will want to do some culling out of files that are clearly non-responsive.

Documents during review may be classified in a number of ways, but the most common approach is to classify each document as responsive, non-responsive, or privileged.  Privileged documents are also typically classified as responsive or non-responsive, so that only the responsive documents that are privileged need to be identified on a privilege log.  Responsive documents that are not privileged are then produced to opposing counsel.

Example of File Count Tracking

So, now that we’ve discussed the various categories for tracking files from collection to production, let’s walk through a simple eMail-based example.  We conduct a fairly targeted collection of a PST file from each of seven custodians in a given case.  The relevant time period for the case is January 1, 2010 through December 31, 2011.  Other than date range, we plan to do no other filtering of files during processing.  Duplicates will not be reviewed or produced.  We’re going to provide an exception log to opposing counsel for any file that cannot be processed and a privilege log for any responsive files that are privileged.  Here’s what this collection might look like:

  • Collected Files: 101,852 – After expansion, 7 PST files expand to 101,852 eMails and attachments.
  • Filtered Files: 23,564 – Filtering eMails outside of the relevant date range eliminates 23,564 files.
  • Remaining Files after Filtering: 78,288 – After filtering, there are 78,288 files to be processed.
  • NIST/System Files: 0 – eMail collections typically don’t have NIST or system files, so we’ll assume zero files here.  Collections with loose electronic documents from hard drives typically contain some NIST and system files.
  • Exception Files: 912 – Let’s assume that a little over 1% of the collection (912) is exception files like password protected, corrupted or empty files.
  • Duplicate Files: 24,215 – It’s fairly common for approximately 30% of the collection to include duplicates, so we’ll assume 24,215 files here.
  • Remaining Files after Processing: 53,161 – We have 53,161 files left after subtracting NIST/System, Exception and Duplicate files from the total files after filtering.
  • Files Culled During Searching: 35,618 – If we assume that we are able to cull out 67% (approximately 2/3 of the collection) as clearly non-responsive, we are able to cull out 35,618 files.
  • Remaining Files for Review: 17,543 – After culling, we have 17,543 files that will actually require review (whether manual or via a TAR approach).
  • Files Tagged as Non-Responsive: 7,017 – If approximately 40% of the document collection is tagged as non-responsive, that would be 7,017 files tagged as such.
  • Remaining Files Tagged as Responsive: 10,526 – After QC to ensure that all documents are either tagged as responsive or non-responsive, this leaves 10,526 documents as responsive.
  • Responsive Files Tagged as Privileged: 842 – If roughly 8% of the responsive documents are privileged, that would be 842 privileged documents.
  • Produced Files: 9,684 – After subtracting the privileged files, we’re left with 9,684 responsive, non-privileged files to be produced to opposing counsel.

The percentages I used for estimating the counts at each stage are just examples, so don’t get too hung up on them.  The key is the category totals above – Filtered, NIST/System, Exception, Duplicate, Culled During Searching, Tagged as Non-Responsive, Tagged as Privileged and Produced.  Excluding the interim “Remaining Files” counts, these categories partition the collection – each file should wind up in exactly one of them.  What happens if you add those totals together?  You should get 101,852 – the number of collected files after expanding the PST files.  As a result, every one of the collected files is accounted for and none “slips through the cracks” during discovery.  That’s the way it should be.  If not, investigation is required to determine where files were missed.
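Here’s a minimal sketch of that reconciliation as a QC script, using the bucket totals from the example above:

  # Category totals from the example – each collected file should fall
  # into exactly one of these buckets.
  categories = {
      "filtered (outside date range)": 23564,
      "NIST/system files": 0,
      "exception files": 912,
      "duplicate files": 24215,
      "culled during searching": 35618,
      "tagged as non-responsive": 7017,
      "responsive and privileged": 842,
      "produced to opposing counsel": 9684,
  }

  collected = 101852  # expanded count from the seven PST files

  accounted = sum(categories.values())
  print(f"Accounted for {accounted:,} of {collected:,} collected files")
  assert accounted == collected, "Files slipped through the cracks - investigate!"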

So, what do you think?  Do you have a plan for accounting for all collected files during discovery?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

eDiscovery Best Practices: Quality Control, It’s a Numbers Game


Previously, we wrote about Quality Assurance (QA) and Quality Control (QC) in the eDiscovery process.  Both are important in improving the quality of work product and making the eDiscovery process more defensible overall.  For example, in attorney review, QA mechanisms include validation rules to ensure that entries are recorded correctly while QC mechanisms include a second review (usually by a review supervisor or senior attorney) to ensure that documents are being categorized correctly.  Another overall QC mechanism is tracking of document counts through the discovery process, especially from collection to production, to identify how every collected file was handled and why each non-produced document was not produced.

Expanded File Counts

Raw counts of files collected are not the same as expanded file counts.  There are certain container file types, like Outlook PST files and ZIP archives, that exist essentially to store a collection of other files.  So, the count that is important to track is the “expanded” file count after processing, which includes all of the files contained within the container files.  In a simple scenario where you collect Outlook PST files from seven custodians, the actual number of documents (emails and attachments) within those PST files could be in the tens of thousands.  That’s the starting count that matters if your goal is to account for every document in the discovery process.
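For ZIP archives, the expansion is easy to sketch with Python’s standard library (PST files require specialized processing tools, so consider this illustrative only – the container and its contents are hypothetical):

  import zipfile

  # Build a hypothetical container file...
  with zipfile.ZipFile("collection.zip", "w") as z:
      for name in ("memo.txt", "budget.csv", "notes/old.txt"):
          z.writestr(name, "sample content")

  # ...then count what it expands to (skipping directory entries).
  contained = [n for n in zipfile.ZipFile("collection.zip").namelist()
               if not n.endswith("/")]
  print(f"1 collected container expands to {len(contained)} files for processing")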

Categorization of Files During Processing

Of course, not every document gets reviewed or even included in the search process.  During processing, files are categorized, with some categories usually set aside and excluded from review.  Here are some typical categories of excluded files in most collections:

  • Filtered Files: Some files may be collected, and then filtered during processing.  A common filter for the file collection is the relevant date range of the case.  If you’re collecting custodians’ source PST files, those may include messages outside the relevant date range; if so, those messages may need to be filtered out of the review set.  Files may also be filtered based on type of file or other reasons for exclusion.
  • NIST and System Files: Many file collections also contain system files, like executable files (EXEs) or Dynamic Link Libraries (DLLs), that are part of the software on a computer and do not contain client data, so those are typically excluded from the review set.  NIST files are included on the National Institute of Standards and Technology list of files that are known to have no evidentiary value, so any files in the collection matching those on the list are “De-NISTed”.
  • Exception Files: These are files that cannot be processed or indexed, for whatever reason.  For example, they may be password-protected or corrupted.  Just because these files cannot be processed doesn’t mean they can be ignored; depending on your agreement with opposing counsel, you may need to at least provide a list of them on an exception log to prove they were addressed, if not attempt to repair them or make them accessible (BTW, it’s good to establish that agreement for disposition of exception files up front).
  • Duplicate Files: During processing, files that are exact duplicates may be set aside to avoid redundant review (and potential inconsistencies).  Exact duplicates are typically identified based on the HASH value, which is a digital fingerprint generated based on the content and format of the file – if two files have the same HASH value, they have the same exact content and format.  Emails (and their attachments) are typically identified as duplicates based on key metadata fields instead, so an attachment cannot be “de-duped” out of the collection by a standalone copy of the same file.  (A minimal sketch of HASH-based duplicate identification follows this list.)
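Here’s that sketch: group the collection by HASH value, and everything after the first file in each group is an exact duplicate that can be set aside.  The mini-collection is hypothetical:

  import hashlib
  from collections import defaultdict
  from pathlib import Path

  def fingerprint(path):
      """HASH value (digital fingerprint) of a file's content and format."""
      return hashlib.md5(Path(path).read_bytes()).hexdigest()

  def group_by_hash(paths):
      """Map each HASH value to the list of files that share it."""
      groups = defaultdict(list)
      for path in paths:
          groups[fingerprint(path)].append(path)
      return groups

  # Hypothetical mini-collection: two exact duplicates and one distinct file
  Path("a.txt").write_text("Quarterly report, final version.")
  Path("b.txt").write_text("Quarterly report, final version.")  # dupe of a.txt
  Path("c.txt").write_text("Quarterly report, DRAFT version.")

  for digest, files in group_by_hash(["a.txt", "b.txt", "c.txt"]).items():
      if len(files) > 1:
          print(f"{digest[:12]}...  keep {files[0]}, set aside {files[1:]}")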

All of these categories of excluded files can reduce the set of files to actually be searched and reviewed.  Tomorrow, we’ll walk through an example of a file set from collection to production to illustrate how each file is accounted for during the discovery process.

So, what do you think?  Do you have a plan for accounting for all collected files during discovery?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

eDiscovery Milestones: Our 500th Post!

One thing about being a daily blog is that the posts accumulate more quickly.  As a result, I’m happy to announce that today is our 500th post on eDiscoveryDaily!  In less than two years of existence!

When we launched on September 20, 2010, our goal was to be a daily resource for eDiscovery news and analysis and we have done our best to deliver on that goal.  During that time, we have published 144 posts on eDiscovery Case Law and have identified numerous cases related to Spoliation Claims and Sanctions.  We’ve covered every phase of the EDRM life cycle.

We’ve discussed key industry trends in Social Media Technology and Cloud Computing.  We’ve published a number of posts on eDiscovery best practices on topics ranging from Project Management to coordinating eDiscovery within Law Firm Departments to Searching and Outsourcing.  And, a lot more.  Every post we have published is still available on the site for your reference.

Comparing our first three months of existence with our most recent three months, we have seen traffic on our site grow an amazing 442%!  Our subscriber base has nearly doubled in the last year alone!

And, we have you to thank for that!  Thanks for making the eDiscoveryDaily blog a regular resource for your eDiscovery news and analysis!  We really appreciate the support!

I also want to extend a special thanks to Jane Gennarelli, who has provided some wonderful best practice post series on a variety of topics, ranging from project management to coordinating review teams to learning how to be a true eDiscovery consultant instead of an order taker.  Her contributions are always well received and appreciated by the readers – and also especially by me, since I get a day off!

We always end each post with a request: “Please share any comments you might have or if you’d like to know more about a particular topic.”  And, we mean it.  We want to cover the topics you want to hear about, so please let us know.

Tomorrow, we’ll be back with a new, original post.  In the meantime, feel free to click on any of the links above and peruse some of our 499 previous posts.  Maybe you missed some?  😉

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.