Processing

Craig Ball of Craig D. Ball, P.C. – eDiscovery Trends, Part 3

This is the tenth (and final) of the 2013 LegalTech New York (LTNY) Thought Leader Interview series.  eDiscoveryDaily interviewed several thought leaders at LTNY this year and generally asked each of them the following questions:

  1. What are your general observations about LTNY this year and how it fits into emerging trends?
  2. If last year’s “next big thing” was the emergence of predictive coding, what do you feel is this year’s “next big thing”?
  3. What are you working on that you’d like our readers to know about?

Today’s thought leader is Craig Ball.  A frequent court appointed special master in electronic evidence, Craig is a prolific contributor to continuing legal and professional education programs throughout the United States, having delivered over 1,000 presentations and papers.  Craig’s articles on forensic technology and electronic discovery frequently appear in the national media, and he writes a monthly column on computer forensics and eDiscovery for Law Technology News called Ball in your Court, as well as blogs on those topics at ballinyourcourt.com.

Craig was very generous with his time again this year and our interview with him had so much good information in it that we couldn’t fit it all into a single post.  Wednesday was part 1 and yesterday was part 2.  Today is the third and last part.  A three-parter!

Note: I asked Craig the questions in a different order and, since the show had not started yet when I interviewed him, instead asked about the sessions in which he was speaking.

What are you working on that you’d like our readers to know about?

I’m really trying to make 2013 the year of distilling an extensive but idiosyncratic body of work that I’ve amassed through years of writing and bringing it together into a more coherent curriculum.  I want to develop a no-cost casebook for law students and to structure my work so that it can be more useful for people in different places and phases of their eDiscovery education.  So, I’ll be working on that in the first six or eight months of 2013 as both an academic and a personal project.

I’m also trying to go back to my roots and rethink some of the assumptions that I’ve made about what people understand.  It’s frustrating to find lawyers talking about, say, load files when they don’t really know what a load file is, because they’ve never looked at one.  They’ve left it to somebody else and, so, the resolution of difficulties has gone through so many hands and is plagued by so much miscommunication.  I’d like to put some things out there that will enable lawyers, in a non-threatening and accessible way, to gain comfort in having a dialog about the fundamentals of eDiscovery that you and I take for granted, so that we don’t have to have this reliance upon vendors for the simplest issues.  I don’t mean that vendors won’t do the work, but I don’t think we should have to bring in a technical translator for every phone call.

There should be a corpus of competence that every litigator brings to the party, enabling them to frame basic protocols and agreements that aren’t merely parroting something that they don’t understand, but enabling them to negotiate about issues in ways that the resolutions actually make sense.  Saying “I won’t give you 500 search terms, but I’ll give you 250” isn’t a rational resolution.  It’s arbitrary.

There are other kinds of cases where you can identify search terms “all the livelong day” and they’re really never going to get you that much closer to the documents you want.  The best example in recent years was the Pippins v. KPMG case.  KPMG was arguing that they could use search terms against samples to identify forensically significant information about work day and work responsibility.  That didn’t make any sense to me at all.  The kind of data they were looking for wasn’t going to be easily found by using keyword search.  It was going to require finding data of a certain character and bringing a certain kind of analysis to it, not an objective culling method like search terms.  Search terms have become like the old saying: if all you have is a hammer, the whole world looks like a nail.  We need to get away from that.

I think a little education made palatable will go a long way.  We need some good solid education and I’m trying to come up with something that people will borrow and build on.  I want it to be something that’s good enough that people will say “let’s just steal his stuff”.  That’s why I put it out there – it’s nice that they credit me and I appreciate it; but if what you really want to do is teach people, you don’t do it for the credit, you do it for the education.  That’s what I’m about, more this year than ever before.

Thanks, Craig, for participating in the interview!

And to the readers, as always, please share any comments you might have or if you’d like to know more about a particular topic!

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Craig Ball of Craig D. Ball, P.C. – eDiscovery Trends, Part 2

This is the tenth (and final) of the 2013 LegalTech New York (LTNY) Thought Leader Interview series.  eDiscoveryDaily interviewed several thought leaders at LTNY this year and generally asked each of them the following questions:

  1. What are your general observations about LTNY this year and how it fits into emerging trends?
  2. If last year’s “next big thing” was the emergence of predictive coding, what do you feel is this year’s “next big thing”?
  3. What are you working on that you’d like our readers to know about?

Today’s thought leader is Craig Ball.  A frequent court appointed special master in electronic evidence, Craig is a prolific contributor to continuing legal and professional education programs throughout the United States, having delivered over 1,000 presentations and papers.  Craig’s articles on forensic technology and electronic discovery frequently appear in the national media, and he writes a monthly column on computer forensics and eDiscovery for Law Technology News called Ball in your Court, as well as blogs on those topics at ballinyourcourt.com.

Craig was very generous with his time again this year and our interview with him had so much good information in it that we couldn’t fit it all into a single post.  Yesterday was part 1.  Today is part 2, and part 3 will be published in the blog on Friday.  A three-parter!

Note: I asked Craig the questions in a different order and, since the show had not started yet when I interviewed him, instead asked about the sessions in which he was speaking.

I noticed that you are speaking at a couple of sessions here.  What would you like to tell me about those sessions?

{Interviewed the evening before the show}  I am on a Technology Assisted Review panel with Maura Grossman and Ralph Losey that should be as close to a barrel of laughs as one can have talking about technology assisted review.  It is based on a poker theme, which was actually the idea of Matt Nelson (of Symantec).  I think it is a nice analogy, because a good poker player is a master or mistress of probabilities, whether intuitively or overtly performing mental arithmetic that is essentially a series of statistical and probability calculations.  Such calculations are key to quality assurance and quality control in modern review.

We have to be cautious not to require the standards for electronic assessments to be dramatically higher than the standards applied to human assessments.  It is one thing with a new technology to demand more of it to build trust.  That’s a pragmatic imperative.  It is another thing to demand so exalted a level of scrutiny that you essentially void all advantages of the new technology, including the cost savings and efficiencies it brings.  You know the old story about the two hikers who encounter the angry grizzly bear?  They freeze, and then one guy pulls out running shoes and starts changing into them.  His friend says “What are you doing? You can’t outrun a grizzly bear!” The other guy says “I know.  I only have to outrun you.”  That is how I look at technology assisted review.  It does not have to be vastly superior to human review; it only has to outrun human review.  It just has to be as good or better while being faster and cheaper.

We cannot let vague uneasiness about the technology cause it to implode.  If we have to examine essentially everything in the discard pile, we not only pay for the new technology but also back it up with the old.  That’s not going to work.  It will take a few pioneers who get the “arrows in the back” early on, people who spend more to build the trust in the technology that is missing at this juncture.  Eventually, people are going to say “I’ve looked at the discard pile for the last three cases and this stuff works.  I don’t need to look at all of that any more.”

Even the best predictive coding systems are not going to be anywhere near 100% accurate.  They start from human judgment, and we’re not even sure what “100% accurate” means in the context of responsiveness and relevance.  There’s no “gold standard”.  Two different qualified people can look at the same document and give different assessments, and approximately 40% of the time, they do.  And, the way we decide who’s right is that we bring in a third person.  We indulge the idea that the third person is the “topic authority” and what they say goes.  We define their judgment as right; but even their judgments are human.  To err is human, so they’re going to make misjudgments based on assumptions, fatigue, inattention, whatever.

So, getting back to the topic at hand, I do think that the focus on quality assurance is going to prompt a larger and long overdue discussion about the efficacy of human review.  We’ve kept human review in this mystical world of work product for a very long time.  Honestly, the rationale for work product doesn’t naturally extend to decisions about responsiveness and relevance, though most of my colleagues would disagree with me out of hand.  They don’t want anybody messing with privilege or work product.  It’s like religion or gun control: you can’t even start a rational debate.

Things are still so partisan and bitter.  The notions of cooperation, collaboration, transparency, translucency, communication – they’re not embedded yet.  People come to these processes with animosity so deep-seated that you’re not really starting on a level playing field with an assessment of what’s best for our system of justice.  Justice is someone else’s problem.  The players just want to win.  That will be tough to change.

We “dinosaurs” will die off, and we won’t have to wait for the glaciers to advance.  I think we will have some meteoric events that will change the speed at which the dinosaurs die.  Technology assisted review is one.  We’ve seen a meteoric rise in the discussion of the topic, the interest in the topic, and I think it will have a meteoric effect in terms of more rapidly extinguishing very bad and very expensive practices that don’t carry with them any more superior assurance of quality.

More from Craig tomorrow!

And to the readers, as always, please share any comments you might have or if you’d like to know more about a particular topic!

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

 

Craig Ball of Craig D. Ball, P.C. – eDiscovery Trends, Part 1

This is the tenth (and final) of the 2013 LegalTech New York (LTNY) Thought Leader Interview series.  eDiscoveryDaily interviewed several thought leaders at LTNY this year and generally asked each of them the following questions:

  1. What are your general observations about LTNY this year and how it fits into emerging trends?
  2. If last year’s “next big thing” was the emergence of predictive coding, what do you feel is this year’s “next big thing”?
  3. What are you working on that you’d like our readers to know about?

Today’s thought leader is Craig Ball.  A frequent court appointed special master in electronic evidence, Craig is a prolific contributor to continuing legal and professional education programs throughout the United States, having delivered over 1,000 presentations and papers.  Craig’s articles on forensic technology and electronic discovery frequently appear in the national media, and he writes a monthly column on computer forensics and eDiscovery for Law Technology News called Ball in your Court, as well as blogs on those topics at ballinyourcourt.com.

Craig was very generous with his time again this year and our interview with him had so much good information in it that we couldn’t fit it all into a single post.  So, today is part 1.  Parts 2 and 3 will be published in the blog on Thursday and Friday.  A three-parter!

Note: I asked Craig the questions in a different order and, since the show had not started yet when I interviewed him, instead asked about the sessions in which he was speaking.

If last year’s “next big thing” was the emergence of predictive coding, what do you feel is this year’s “next big thing”?

I think this is the first year where I do not have a ready answer to that question.  It’s like the wonderful movie Groundhog Day.  I am on the educational planning board for the show, and as hard as we try to find and present fresh ideas, technology assisted review is once again the dominant topic.

This year, we will see the marketing language shift, repositioning the (forgive the jargon) “value proposition” for the tools being sold and continuing the move toward the concept of information governance.  If knowledge management had a “hook up” here at LTNY with eDiscovery, their offspring would be information governance.  Information governance represents a way to spread the cost of eDiscovery infrastructure among different budgets.  It’s not a made-up value proposition.  Security and regulatory people do have a need, and many departments can ultimately benefit from more granular and regimented management of their unstructured and legacy information stores.

I remain something of a skeptic about what has come to be called “defensible deletion.”  Most in-house IT people do not understand that, even after you purchase a single-instance de-duplication solution, you’re still going to have as much as 40% “bloat” in your collection of data between local stores, embedded and encoded attachments, etc.  So, there are marked efficiencies we can achieve by implementing sensible de-duplication and indexing mechanisms that are effective, ongoing and systemic.  Consider enterprise indexing models that basically let your organization and its information face an indexing mechanism in much the same way as the internet faces Google.  Almost all of us interact with the internet through Google, and often get the information we are seeking from the Google index or synopsis of the data without actually proceeding to the indexed site.  The index itself becomes the resource, and the document indexed a distinct (and often secondary) source.  We must ask ourselves: “if a document is indexed, does it ever leave our collection?”

I also think eDiscovery education is changing, and I am cautiously optimistic.  But, people are getting just enough better information about eDiscovery to be dangerous.  And, they are still hurting themselves by expecting there to be some simple “I don’t really need to know it” rule of thumb that will get them through.  And, that’s an enormous problem.  You can’t cross examine from a script.  Advocates need to understand the answers they get and know how to frame the follow up and the kill.  My cautious optimism respecting education is a function of my devoting so much more of my time to education at the law school and professional levels as well as for judicial organizations.  I am seeing a lot more students interested in the material at a deeper level, and my law class that just concluded in December impressed me greatly.  The level of enthusiasm the students brought to the topic and the quality and caliber of their questions were as good as any I get from my colleagues in the day-to-day practice of eDiscovery.  Not just from lawyers, but also from people like you who are deeply immersed in this topic.

That is not so much a credit to my teaching (although I hope it might be).  The greatest advantage that students have is that they haven’t yet acquired bad habits and don’t come with preconceived notions about what eDiscovery is supposed to be.  Conversely, many lawyers literally do not want to hear about certain topics: they “glaze” and immediately start looking for a way to say “this cannot be important, I cannot have to know this”.  Law students don’t waste their energy that way.  If the professor says “you need to know this”, then they make it their mission to learn.  Yesterday, I had a conversation with a student where she said “I really wish we could have learned more about search strategies and more ways to apply sophisticated tools hands on”.  That’s exactly what I wish lawyers would say.

I wish lawyers were clamoring to better understand things like search or de-duplication or the advantages of one form of production over another.  Sometimes, I feel like I am alone in my assessment that these are crucial issues.  If I am the only one thinking that settling on forms of production early and embracing native forms of production is crucial to quality, what is wrong with me?

I am still surprised at how many people TIFF most of their collection or production.

They have no clue how really bad that is, not just in terms of cost but also in terms of efficiency.  I am hoping the dialogue about TAR will bring us closer to a serious discussion about quality in eDiscovery.  We never had much of a dialogue about the quality of human review or the quality of paper production.  Either we didn’t have the need or, more likely, we were so immersed in what we were doing that we did not have the language to even begin the conversation.

I wrote in a blog post recently about a cool experiment discussed in my college Introductory Psychology course, in which kittens were raised so that they could only see for a few hours a day, in an environment composed entirely of horizontals or verticals.  Apparently, if you are raised from birth only seeing verticals, you do not learn to see horizontals, and vice-versa.  So, if I raise a kitten among the horizontals and take a black rod and put it in front of them, they see it when it is horizontal.  But, if I orient it vertically, it disappears in their brain.  That is kind of how we are with lawyers and eDiscovery.

There are just some topics that you and I and our colleagues see the importance of, but lawyers have been literally raised without the ability to see why those things matter.  They see what has long been presented to them in, say, Summation or Concordance, as an assemblage of lousy load files and error-ridden OCR and colorless images stripped of embedded commentary.  They see this information so frequently and so exclusively that they think that’s the document and, since they only have paper document frames of reference (which aren’t really that much better than TIFFs), they think this must be what electronic evidence looks like.  They can’t see the invisible plane they’ve been bred to overlook.

You can look at a stone axe and appreciate the merits of a bronze axe – if all that you’re comparing it to are prehistoric tools, a bronze axe looks pretty good.  But, today we have chainsaws.  I want lawyers demanding chainsaws to deal with electronic information and to throw away those incredibly expensive stone axes; but, unfortunately, they make more money using stone axes.  But, not for long.  I am seeing the “house of cards” start to shake, and the house of cards I am talking about is the $100 to $300 (or more) per gigabyte pricing for eDiscovery.  I think that model is not only going to be short lived, but will soon be seen as negligence by the lawyers who go that route and as exploitative gouging by service providers, like selling a bottle of water for $10 after Hurricane Sandy.  There is a point at which price gouging will be called out.  We can’t get there fast enough.

More from Craig tomorrow!

And to the readers, as always, please share any comments you might have or if you’d like to know more about a particular topic!

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

 

Judge Carter Refuses to Recuse Judge Peck in Da Silva Moore – eDiscovery Trends

It seems like ages ago that New York Magistrate Judge Andrew J. Peck denied the plaintiffs’ motion in Da Silva Moore v. Publicis Groupe & MSL Group, No. 11 Civ. 1279 (ALC) (AJP) that he recuse himself from the case.  It was all the way back in June.  Now, District Court Judge Andrew L. Carter, Jr. has ruled on the plaintiffs’ recusal request.

In his order from last Wednesday (November 7), Judge Carter stated as follows:

“On the basis of this Court’s review of the entire record, the Court is not persuaded that sufficient cause exists to warrant Magistrate Judge Peck’s disqualification…Judge Peck’s decision accepting computer-assisted review … was not influenced by bias, nor did it create any appearance of bias.”

Judge Carter also noted, “Disagreement or dissatisfaction with Magistrate Judge Peck’s ruling is not enough to succeed here…A disinterested observer fully informed of the facts in this case would find no basis for recusal”.

Since it has been a while, let’s recap the case for those who may have not been following it and may be new to the blog.

Back in February, Judge Peck issued an opinion making this likely the first case to approve the use of computer-assisted review of electronically stored information (“ESI”).  However, on March 13, District Court Judge Andrew L. Carter, Jr. granted the plaintiffs’ request to submit additional briefing on their February 22 objections to the ruling.  In that briefing (filed on March 26), the plaintiffs claimed that the protocol approved for predictive coding “risks failing to capture a staggering 65% of the relevant documents in this case” and questioned Judge Peck’s relationship with defense counsel and with the selected vendor for the case, Recommind.

Then, on April 5, Judge Peck issued an order in response to the plaintiffs’ letter requesting his recusal, directing plaintiffs to indicate whether they would file a formal motion for recusal or ask the Court to consider the letter as the motion.  On April 13 (Friday the 13th, that is), the plaintiffs did just that, formally requesting the recusal of Judge Peck (the defendants issued a response in opposition on April 30).  But, on April 25, Judge Carter issued an opinion and order in the case, upholding Judge Peck’s opinion approving computer-assisted review.

Not done, the plaintiffs filed an objection on May 9 to Judge Peck’s rejection of their request to stay discovery pending the resolution of outstanding motions and objections (including the recusal motion, which had yet to be ruled on).  Then, on May 14, Judge Peck issued a stay, stopping defendant MSLGroup’s production of electronically stored information.  Finally, on June 15, in a 56-page opinion and order, Judge Peck denied the plaintiffs’ motion for recusal, which Judge Carter has now upheld.

So, what do you think?  Will Judge Carter’s decision not to recuse Judge Peck restart the timetable for predictive coding on this case?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Don’t Be “Duped”, Files with Different HASH Values Can Still Be the Same – eDiscovery Best Practices

A couple of months ago, we published a post discussing how the number of pages in each gigabyte can vary widely and, to help illustrate the concept, we took one of our blog posts and saved it in several different file formats, showing that each file had the same content, yet was a different size.  That’s not the only concept that example illustrates.

Content is Often Republished

How many of you have ever printed or saved a file to Adobe Acrobat PDF format?  Personally, I do it all the time.  For example, I “publish” marketing slicks created in Microsoft® Publisher, “publish” finalized client proposals created in Microsoft Word and “publish” presentations created in Microsoft PowerPoint to PDF format regularly.  Microsoft now even includes Adobe PDF as one of the standard file formats to which you can save a file.  I even have a free PDF print driver on my laptop, so I can conceivably create a PDF file for just about anything that I can print.  In each case, I’m duplicating the content of the file, but in a different file format designed for publishing that content.

Another way content is republished is via the ubiquitous “copy and paste” capability that so many use to duplicate content into another file.  Whether copying part or all of the content, “copy and paste” functionality is available in just about every application, enabling you to duplicate content from one application to the next, or from one file to the next within the same application.

Same Content, Different HASH

When publishing a file to PDF or copying the entire contents of a file to a new file, the contents of the file may be the same, but the HASH value, which is a digital fingerprint that reflects the contents and format of the file, will be different.  So, a Word file and the PDF file published from the Word file may contain the same content, but the HASH values will be different.  Even copying the content from one file to another in the same software program can result in different HASH values, or even different file sizes.  For example, I copied the entire contents of yesterday’s blog post, written in Word, into a brand new Word file.  Not only did the two files have different HASH values, they were different sizes – the copied file was 8K smaller than the original.  So, these files, while identical in content, won’t be considered “duplicates” based on HASH value and won’t be “de-duped” out of the collection.  Instead, they are considered “near-dupes” for analysis purposes, even though the content is essentially identical.
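
To see this for yourself, you can compute HASH values directly.  Here’s a minimal Python sketch (the file names are hypothetical, and MD5 is just one common choice of algorithm):

    import hashlib

    def file_hash(path, algorithm="md5"):
        """Compute a file's digital fingerprint from its raw bytes."""
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    # Hypothetical file names: the same post saved as Word and as PDF.
    # The visible content matches, but the bytes (and HASH values) don't.
    print(file_hash("blog_post.docx"))
    print(file_hash("blog_post.pdf"))

Because the fingerprint is computed over the file’s bytes rather than its visible text, any difference in format, encoding or embedded metadata yields a completely different HASH.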

What to Do with the Near-Dupes?

Identifying and culling these essentially identical near-dupes isn’t necessary in every case, but if it is, you’ll need to perform a process that groups similar documents together so that those near-dupes can be identified and addressed.  We call that “clustering”.  For more on the benefits of clustering, check out this blog post.
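
Vendor tools implement clustering in various proprietary ways; purely as an illustration of the idea, here is a minimal Python sketch that groups documents by comparing overlapping five-word sequences (“shingles”) and measuring their Jaccard similarity (the shingle size and threshold are arbitrary assumptions):

    def shingles(text, k=5):
        """Break normalized text into overlapping k-word sequences."""
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        """Jaccard similarity: shared shingles over total distinct shingles."""
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def cluster(docs, threshold=0.8):
        """Greedily group documents whose shingle overlap meets the threshold."""
        clusters = []  # each cluster: representative shingle set plus member IDs
        for doc_id, text in docs.items():
            s = shingles(text)
            for c in clusters:
                if jaccard(s, c["rep"]) >= threshold:
                    c["members"].append(doc_id)
                    break
            else:
                clusters.append({"rep": s, "members": [doc_id]})
        return [c["members"] for c in clusters]

    # Hypothetical usage: docs maps document IDs to their extracted text.
    docs = {"original.docx": "the quick brown fox...", "published.pdf": "the quick brown fox..."}
    print(cluster(docs))

Documents that land in the same cluster are candidate near-dupes; a reviewer (or a downstream rule) then decides how to treat each group.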

So, what do you think?  What do you do with “dupes” that have different HASH values?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

eDiscovery Daily is Two Years Old Today!

 

It’s hard to believe that it was two years ago today that we launched the eDiscoveryDaily blog.  Now that we’ve hit the “terrible twos”, is the blog going to start going off on rants about various eDiscovery topics, like Will McAvoy in The Newsroom?   Maybe.  Or maybe not.  Wouldn’t that be fun!

As we noted when recently acknowledging our 500th post, we have seen traffic on our site (from our first three months of existence to our most recent three months) grow an amazing 442%!  Our subscriber base has nearly doubled in the last year alone!  We now have nearly seven times as many visitors to the site as we did when we first started.  We continue to appreciate the interest you’ve shown in the topics and will do our best to continue to provide interesting and useful eDiscovery news and analysis.  That’s what this blog is all about.  And, in each post, we like to ask for you to “please share any comments you might have or if you’d like to know more about a particular topic”, so we encourage you to do so to make this blog even more useful.

We also want to thank the blogs and publications that have linked to our posts and raised our public awareness, including Pinhawk, The Electronic Discovery Reading Room, Unfiltered Orange, Litigation Support Blog.com, Litigation Support Technology & News, Ride the Lightning, InfoGovernance Engagement Area, Learn About E-Discovery, Alltop, Law.com, Justia Blawg Search, Atkinson-Baker (depo.com), ABA Journal, Complex Discovery, Next Generation eDiscovery Law & Tech Blog and any other publication that has picked up at least one of our posts for reference (sorry if I missed any!).  We really appreciate it!

We like to take a look back every six months at some of the important stories and topics during that time.  So, here are some posts over the last six months you may have missed.  Enjoy!

We talked about best practices for issuing litigation holds and how issuing the litigation hold is just the beginning.

By the way, did you know that if you deleted a photo on Facebook three years ago, it may still be online?

We discussed states (Delaware, Pennsylvania and Florida) that have implemented new rules for eDiscovery in the past few months.

We talked about how to achieve success as a non-attorney in a law firm, providing quality eDiscovery services to your internal “clients” and how to be an eDiscovery consultant, and not just an order taker, for your clients.

We warned you that stop words can stop your searches from being effective, talked about how important it is to test your searches before the meet and confer and discussed the importance of the first 7 to 10 days once litigation hits in addressing eDiscovery issues.

We told you that, sometimes, you may need to collect from custodians that aren’t there, differentiated between quality assurance and quality control and discussed the importance of making sure that file counts add up to what was collected (with an example, no less).

By the way, did you know the number of pages in a gigabyte can vary widely and the same exact content in different file formats can vary by as much as 16 to 20 times in size?

We provided a book review on Zubulake’s e-Discovery and then interviewed the author, Laura Zubulake, as well.

BTW, eDiscovery Daily has had 150 posts related to eDiscovery Case Law since the blog began.  Fifty of them have been in the last six months.

P.S. – We still haven't missed a business day without a post.  Yes, we are crazy.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

eDiscovery Best Practices: Quality Control, Making Sure the Numbers Add Up

 

Yesterday, we wrote about tracking file counts from collection to production, the concept of expanded file counts, and the categorization of files during processing.  Today, let’s walk through a scenario to show how the files collected are accounted for during the discovery process.

Tracking the Counts after Processing

We discussed the typical categories of excluded files after processing – obviously, what’s not excluded is available for searching and review.  Even if your approach includes a technology assisted review (TAR) methodology such as predictive coding, it’s still likely that you will want to cull out files that are clearly non-responsive.

Documents during review may be classified in a number of ways, but the most common way is to classify them as responsive, non-responsive, or privileged.  Privileged documents are also typically classified as responsive or non-responsive, so that only the responsive documents that are privileged need be identified on a privilege log.  Responsive documents that are not privileged are then produced to opposing counsel.

Example of File Count Tracking

So, now that we’ve discussed the various categories for tracking files from collection to production, let’s walk through a fairly simple eMail-based example.  We conduct a fairly targeted collection of a PST file from each of seven custodians in a given case.  The relevant time period for the case is January 1, 2010 through December 31, 2011.  Other than date range, we plan to do no other filtering of files during processing.  Duplicates will not be reviewed or produced.  We’re going to provide an exception log to opposing counsel for any file that cannot be processed and a privilege log for any responsive files that are privileged.  Here’s what this collection might look like:

  • Collected Files: 101,852 – After expansion, 7 PST files expand to 101,852 eMails and attachments.
  • Filtered Files: 23,564 – Filtering eMails outside of the relevant date range eliminates 23,564 files.
  • Remaining Files after Filtering: 78,288 – After filtering, there are 78,288 files to be processed.
  • NIST/System Files: 0 – eMail collections typically don’t have NIST or system files, so we’ll assume zero files here.  Collections with loose electronic documents from hard drives typically contain some NIST and system files.
  • Exception Files: 912 – Let’s assume that a little over 1% of the collection (912) is exception files like password protected, corrupted or empty files.
  • Duplicate Files: 24,215 – It’s fairly common for approximately 30% of a collection to be duplicates, so we’ll assume 24,215 files here.
  • Remaining Files after Processing: 53,161 – We have 53,161 files left after subtracting NIST/System, Exception and Duplicate files from the total files after filtering.
  • Files Culled During Searching: 35,618 – If we assume that we are able to cull out 67% (approximately 2/3 of the collection) as clearly non-responsive, we are able to cull out 35,618 files.
  • Remaining Files for Review: 17,543 – After culling, we have 17,543 files that will actually require review (whether manual or via a TAR approach).
  • Files Tagged as Non-Responsive: 7,017 – If approximately 40% of the document collection is tagged as non-responsive, that would be 7,017 files tagged as such.
  • Remaining Files Tagged as Responsive: 10,526 – After QC to ensure that all documents are either tagged as responsive or non-responsive, this leaves 10,526 documents as responsive.
  • Responsive Files Tagged as Privileged: 842 – If roughly 8% of the responsive documents are privileged, that would be 842 privileged documents.
  • Produced Files: 9,684 – After subtracting the privileged files, we’re left with 9,684 responsive, non-privileged files to be produced to opposing counsel.

The percentages I used for estimating the counts at each stage are just examples, so don’t get too hung up on them.  The key is to note the category counts above: Filtered, NIST/System, Exception and Duplicate files, Files Culled During Searching, Files Tagged as Non-Responsive, Responsive Files Tagged as Privileged and Produced Files.  Excluding the interim subtotals, these categories represent the final disposition of the file collection – each collected file should wind up in exactly one of them.  What happens if you add those counts together?  You should get 101,852 – the number of collected files after expanding the PST files.  As a result, every one of the collected files is accounted for and none “slips through the cracks” during discovery.  That’s the way it should be.  If not, investigation is required to determine where files were missed.  (See the sketch below for the arithmetic.)
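
If you track these counts in a script or spreadsheet, the reconciliation check is simple arithmetic.  Here’s a minimal Python sketch using the example counts above:

    # Expanded count: 7 PST files yield 101,852 eMails and attachments.
    collected = 101852

    # Final disposition categories; every collected file lands in exactly one.
    dispositions = {
        "filtered (outside date range)": 23564,
        "NIST/system": 0,
        "exceptions (password-protected, corrupted, empty)": 912,
        "duplicates": 24215,
        "culled as clearly non-responsive": 35618,
        "tagged non-responsive in review": 7017,
        "responsive but privileged": 842,
        "produced": 9684,
    }

    accounted = sum(dispositions.values())
    assert accounted == collected, f"{collected - accounted:,} files unaccounted for!"
    print(f"All {accounted:,} collected files are accounted for.")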

So, what do you think?  Do you have a plan for accounting for all collected files during discovery?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

eDiscovery Best Practices: Quality Control, It’s a Numbers Game

 

Previously, we wrote about Quality Assurance (QA) and Quality Control (QC) in the eDiscovery process.  Both are important in improving the quality of work product and making the eDiscovery process more defensible overall.  For example, in attorney review, QA mechanisms include validation rules to ensure that entries are recorded correctly while QC mechanisms include a second review (usually by a review supervisor or senior attorney) to ensure that documents are being categorized correctly.  Another overall QC mechanism is tracking of document counts through the discovery process, especially from collection to production, to identify how every collected file was handled and why each non-produced document was not produced.

Expanded File Counts

Counts of files as collected are not the same as expanded file counts.  There are certain container file types, like Outlook PST files and ZIP archives, that exist essentially to store a collection of other files.  So, the count that is important to track is the “expanded” file count after processing, which includes all of the files contained within the container files.  So, in a simple scenario where you collect Outlook PST files from seven custodians, the actual number of documents (emails and attachments) within those PST files could be in the tens of thousands.  That’s the starting count that matters if your goal is to account for every document in the discovery process.
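
Purely as an illustration of the difference between as-collected and expanded counts, here is a minimal Python sketch that counts files in a folder, expanding ZIP archives to count their contents (the folder name is hypothetical; PST expansion requires a specialized library such as libpff or an eDiscovery processing tool, so ZIPs stand in for the concept):

    import zipfile
    from pathlib import Path

    def expanded_count(root):
        """Count files under root, expanding ZIP containers to count their
        contents.  Nested containers inside ZIPs are not expanded in this
        simple sketch."""
        total = 0
        for path in Path(root).rglob("*"):
            if not path.is_file():
                continue
            if zipfile.is_zipfile(path):
                with zipfile.ZipFile(path) as z:
                    total += sum(1 for info in z.infolist() if not info.is_dir())
            else:
                total += 1
        return total

    # Hypothetical folder holding a collection of loose files and archives.
    print(expanded_count("collection"))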

Categorization of Files During Processing

Of course, not every document gets reviewed or even included in the search process.  During processing, files are usually categorized, with some categories of files usually being set aside and excluded from review.  Here are some typical categories of excluded files in most collections:

  • Filtered Files: Some files may be collected, and then filtered during processing.  A common filter for the file collection is the relevant date range of the case.  If you’re collecting custodians’ source PST files, those may include messages outside the relevant date range; if so, those messages may need to be filtered out of the review set.  Files may also be filtered based on type of file or other reasons for exclusion.
  • NIST and System Files: Many file collections also contain system files, like executable files (EXEs) or Dynamic Link Libraries (DLLs), that are part of the software on a computer and do not contain client data, so those are typically excluded from the review set.  NIST files are included on the National Institute of Standards and Technology list of files that are known to have no evidentiary value, so any files in the collection matching those on the list are “De-NISTed”.
  • Exception Files: These are files that cannot be processed or indexed, for whatever reason.  For example, they may be password-protected or corrupted.  Just because these files cannot be processed doesn’t mean they can be ignored; depending on your agreement with opposing counsel, you may need to at least provide a list of them on an exception log to prove they were addressed, if not attempt to repair them or make them accessible (BTW, it’s good to establish that agreement for disposition of exception files up front).
  • Duplicate Files: During processing, files that are exact duplicates may be put aside to avoid redundant review (and potential inconsistencies).  Exact duplicates are typically identified based on the HASH value, which is a digital fingerprint generated based on the content and format of the file – if two files have the same HASH value, they have the same exact content and format.  Emails (and their attachments) may instead be identified as duplicates based on key metadata fields, so that an attachment cannot be “de-duped” out of the collection by a standalone copy of the same file.

All of these categories of excluded files can reduce the set of files to actually be searched and reviewed.  A simple sketch of the duplicate identification step appears below.  Tomorrow, we’ll walk through an example of a file set from collection to production to show how each file is accounted for during the discovery process.
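
As a simple illustration of that last category, exact-duplicate identification amounts to grouping files by HASH value.  Here’s a minimal Python sketch using SHA-1 (real processing tools also apply the metadata-based eMail de-duplication described above, which this sketch does not):

    import hashlib
    from pathlib import Path

    def find_duplicates(paths):
        """Group files by content HASH; return the first-seen original for
        each digest plus a list of (duplicate, original) pairs."""
        seen = {}        # digest -> first file seen with that content
        duplicates = []  # (duplicate, original) pairs
        for path in paths:
            digest = hashlib.sha1(Path(path).read_bytes()).hexdigest()
            if digest in seen:
                duplicates.append((path, seen[digest]))
            else:
                seen[digest] = path
        return seen, duplicates

    # Hypothetical usage: flag exact duplicates among loose collected files.
    files = sorted(p for p in Path("collection").rglob("*") if p.is_file())
    originals, dupes = find_duplicates(files)
    print(f"{len(dupes)} exact duplicates set aside.")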

So, what do you think?  Do you have a plan for accounting for all collected files during discovery?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

eDiscovery Best Practices: Repairing a Corrupted Outlook PST File

 

We like to believe that there will never be any problems with the data that we preserve, collect and process for eDiscovery purposes.  Sometimes, however, critical data may be difficult or impossible to use.  Perhaps key files are password-protected and the only way to open them is to “crack” the password.  Or, perhaps a key file may be corrupted.  If that file is an Outlook Personal Storage Table (PST) file, that file corruption could literally make tens of thousands of documents unavailable for discovery unless the file can be repaired.

I recently had a case where 40% of the collection was contained in two corrupt Outlook PST files.  Had we not been able to repair those files, we would have been unable to access nearly half of the collection that needed to be reviewed for responsiveness in the case.

Fortunately, there is a repair tool for Outlook designed to repair corrupted PST files.  It’s called SCANPST.EXE.  It’s an official repair tool that is included in Office 2010 (as well as Office 2007 before it).  Since it’s such a useful utility, you might think that SCANPST would be located in the Microsoft Office 2010 Tools folder within the Microsoft Office folder in Program Files.  But, you’d be wrong.  Instead, you’ll have to open Windows Explorer and navigate to the C:\Program Files\Microsoft Office\Office14 folder (for Office 2010, at least) to find the SCANPST.EXE utility.

Double-click this file to open the Microsoft Outlook Inbox Repair Tool.  The utility will prompt for the path and name of the PST file (with a Browse button to browse to the corrupted PST file).  There is also an Options button to enable you to log activity to a new log file, append to an existing log file or choose not to write to a log file.  Before you start, you’ll need to close Outlook and all mail-enabled applications.

Once ready, press the Start button and the application will begin checking for errors.  When the process is complete, it should indicate that it found errors in the corrupted PST file, along with a count of folders and items found in the file.  The utility will also provide a check box to make a backup of the scanned file before repairing.  ALWAYS make a backup – you never know what might happen during the repair process.  Click the Repair button when ready and the utility will hopefully repair the corrupted PST file.

If SCANPST.EXE fails to repair the file, then there are some third party utilities available that may succeed where SCANPST failed.  If all else fails, you can hire a data recovery expert, but that can get very expensive.  Hopefully, you don’t have to resort to that.

By repairing the PST file, you are technically changing the file, so if the PST file is discoverable, it will probably be necessary to disclose the corruption to opposing counsel and the intent to attempt to repair the file to avoid potential spoliation claims.

So, what do you think?  Have you encountered corrupted PST files in discovery?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

eDiscovery Milestones: Our 500th Post!

One thing about being a daily blog is that the posts accumulate more quickly.  As a result, I’m happy to announce that today is our 500th post on eDiscoveryDaily!  In less than two years of existence!

When we launched on September 20, 2010, our goal was to be a daily resource for eDiscovery news and analysis, and we have done our best to deliver on that goal.  During that time, we have published 144 posts on eDiscovery Case Law and have identified numerous cases related to Spoliation Claims and Sanctions.  We’ve covered every phase of the EDRM life cycle.

We’ve discussed key industry trends in Social Media Technology and Cloud Computing.  We’ve published a number of posts on eDiscovery best practices on topics ranging from Project Management to coordinating eDiscovery within Law Firm Departments to Searching and Outsourcing.  And, a lot more.  Every post we have published is still available on the site for your reference.

Comparing our first three months of existence with our most recent three months, we have seen traffic on our site grow an amazing 442%!  Our subscriber base has nearly doubled in the last year alone!

And, we have you to thank for that!  Thanks for making the eDiscoveryDaily blog a regular resource for your eDiscovery news and analysis!  We really appreciate the support!

I also want to extend a special thanks to Jane Gennarelli, who has provided some wonderful best practice post series on a variety of topics, ranging from project management to coordinating review teams to learning how to be a true eDiscovery consultant instead of an order taker.  Her contributions are always well received and appreciated by the readers – and also especially by me, since I get a day off!

We always end each post with a request: “Please share any comments you might have or if you’d like to know more about a particular topic.”  And, we mean it.  We want to cover the topics you want to hear about, so please let us know.

Tomorrow, we’ll be back with a new, original post.  In the meantime, feel free to click on any of the links above and peruse some of our 499 previous posts.  Maybe you missed some?  😉

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.