Analysis

The Grossman-Cormack Glossary of Technology Assisted Review – eDiscovery Resources

Do you know what a “Confidence Level” is?  No, I’m not talking about Tom Brady completing football passes in coverage.  How about “Harmonic Mean”?  Maybe if I hum a few bars?  Gaussian Calculator?  Sorry, it has nothing to do with how many Tums you should eat after a big meal.  No, the answer to all of these can be found in the new Grossman-Cormack Glossary of Technology Assisted Review.

Maura Grossman and Gordon Cormack are educating us yet again with regard to Technology Assisted Review (TAR) with a comprehensive glossary that defines key TAR-related terms and also provides some key case references, including EORHB, Global Aerospace, In Re: Actos, Kleen Products and, of course, Da Silva Moore.  The authors of the heavily cited article Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review have provided a new reference document that may help many in the industry better understand key TAR concepts.  Or, at least recognize key terms associated with TAR.  This is version 1.01, published just this month and clearly intended to evolve over time.  As the authors note in the Preamble:

“The introduction of TAR into the legal community has brought with it much confusion because different terms are being used to refer to the same thing (e.g., ‘technology assisted review,’ ‘computer-assisted review,’ ‘computer-aided review,’ ‘predictive coding,’ and ‘content based advanced analytics,’ to name but a few), and the same terms are also being used to refer to different things (e.g., ‘seed sets’ and ‘control sample’). Moreover, the introduction of complex statistical concepts, and terms-of-art from the science of information retrieval, have resulted in widespread misunderstanding and sometimes perversion of their actual meanings.

This glossary is written in an effort to bring order to chaos by introducing a common framework and set of definitions for use by the bar, the bench, and service providers. The glossary endeavors to be comprehensive, but its definitions are necessarily brief. Interested readers may look elsewhere for detailed information concerning any of these topics. The terms in the glossary are presented in alphabetical order, with all defined terms in capital letters.

In the future, we plan to create an electronic version of this glossary that will contain live links, cross references, and annotations. We also envision this glossary to be a living, breathing work that will evolve over time. Towards that end, we invite our colleagues in the industry to send us their comments on our definitions, as well as any additional terms they would like to see included in the glossary, so that we can reach a consensus on a consistent, common language relating to technology assisted review. Comments can be sent to us at mrgrossman@wlrk.com and gvcormac@uwaterloo.ca.”

Live links, with a Table of Contents, in a (hopefully soon) next iteration will definitely make this guide even more useful.  Nonetheless, it’s a great resource for those of us who have bandied these terms around for some time.
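To make two of the glossary’s statistical terms concrete, here is a small sketch (the counts are invented for illustration and are not from the glossary itself) showing how the four cells of a confusion matrix yield precision, recall, and the F1 measure – the harmonic mean of the two:

```python
# Illustrative counts from a hypothetical review (invented numbers):
# tp = relevant documents retrieved, fp = non-relevant retrieved,
# fn = relevant documents missed, tn = non-relevant correctly excluded.
tp, fp, fn, tn = 80, 20, 40, 860   # the four cells of a confusion matrix

precision = tp / (tp + fp)   # 0.8 - how much of what you retrieved is relevant
recall = tp / (tp + fn)      # ~0.667 - how much of the relevant set you retrieved

# The F1 measure is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 3))  # 0.727
```

Notice that F1 punishes imbalance: a review that retrieves everything gets perfect recall but terrible precision, and the harmonic mean drags the combined score down accordingly.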

So, what do you think?  Will this glossary help educate the industry and help standardize use of the terms?  Or will it lead to one big “Confusion Matrix”? (sorry, I couldn’t resist)  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Louisiana Order Dictates That the Parties Cooperate on Technology Assisted Review – eDiscovery Case Law

During this Thanksgiving week, we at eDiscovery Daily thought it would be a good time to catch up on some cases we missed earlier in the year.  So, we will cover a different case each day this week.  Enjoy!

In the case In re Actos (Pioglitazone) Products Liability Litigation, No. 6:11-md-2299 (W.D. La. July 27, 2012), a case management order applicable to pretrial proceedings in a multidistrict litigation consolidating eleven civil actions, the court issued comprehensive instructions for the use of technology-assisted review (“TAR”).

In an order entitled “Procedures and Protocols Governing the Production of Electronically Stored Information (“ESI”) by the Parties,” U.S. District Judge Rebecca Doherty of the Western District of Louisiana set forth how the parties would treat data sources, custodians, costs, and format of production, among other topics. Importantly, the order contains a “Search Methodology Proof of Concept,” which governs the parties’ usage of TAR during the search and review of ESI.

The order states that the parties “agree to meet and confer regarding the use of advanced analytics” as a “document identification mechanism for the review and production of . . . data.” The parties will meet and confer to select four key custodians whose e-mail will be used to create an initial sample set, after which three experts will train the TAR system to score every document based on relevance. To quell the fears of TAR skeptics, the court provided that both parties will collaborate to train the system, and after the TAR process is completed, the documents will not only be randomly sampled for quality control, but the defendants may also manually review documents for relevance, confidentiality, and privilege.

The governance order repeatedly emphasizes that the parties are committing to collaborating throughout the TAR process and requires that they meet and confer prior to contacting the court for a resolution.

So, what do you think?  Should more cases issue instructions like this?  Please share any comments you might have or if you’d like to know more about a particular topic.

Case Summary Source: Applied Discovery (free subscription required).  For eDiscovery news and best practices, check out the Applied Discovery Blog here.


Email Metadata Leads to Petraeus Resignation – eDiscovery Trends

As reported by Megan Garber of The Atlantic, email location data led FBI investigators to discover CIA director David Petraeus’ affair with Paula Broadwell, which in turn led to his resignation.  The irony is that FBI investigators weren’t aware of, or looking for, information regarding the affair.  Here’s what happened, according to the article.

“Sometime in May, The New York Times reports, Broadwell apparently began sending emails to Jill Kelley, the Petraeus acquaintance (her precise connection to the family isn’t yet fully clear) — and those emails were “harassing,” according to Kelley. The messages were apparently sent from an anonymous (or, at least, pseudonymous) account. Kelley reported those emails to the FBI, which launched an investigation — not into Petraeus, but into the harassing emails.”

“From there, the dominoes began to fall. And they were helped along by the rich data that email providers include in every message they send and deliver — even on behalf of its pseudonymous users. Using the ‘metadata footprints left by the emails,’ the Wall Street Journal reports, ‘FBI agents were able to determine what locations they were sent from. They matched the places, including hotels, where Ms. Broadwell was during the times the emails were sent.’ From there, ‘FBI agents and federal prosecutors used the information as probable cause to seek a warrant to monitor Ms. Broadwell’s email accounts.’”

Once the investigators received that warrant, they “learned that Ms. Broadwell and Mr. Petraeus had set up private Gmail accounts to use for their communications, which included explicit details of a sexual nature, according to U.S. officials. But because Mr. Petraeus used a pseudonym, agents doing the monitoring didn’t immediately uncover that he was the one communicating with Ms. Broadwell.”

Ultimately, monitoring of Ms. Broadwell’s emails identified the link to Mr. Petraeus and the investigation escalated, despite the fact that the investigators “never monitored Mr. Petraeus’s email accounts.”

Needless to say, if the Director of the CIA can be tripped up by email metadata from an account other than his own, it could happen to anyone.  It certainly gives you an idea of the type of information that is discoverable not just from opposing parties, but third parties as well.

So, what do you think?  Have you ever identified additional sources of data through discovery of email metadata?  Please share any comments you might have or if you’d like to know more about a particular topic.

Thanks to Perry Segal’s e-Discovery Insights blog for the tip on this story!


Searching for Email Addresses Can Have Lots of Permutations Too – eDiscovery Best Practices

Tuesday, we discussed the various permutations of names of individuals to include in your searching for a more complete result set, as well as the benefits of proximity searching (broader than a phrase search, more precise than an AND search) to search for names of individuals.  Another way to identify documents associated with individuals is through their email addresses.

Variations of Email Addresses within a Domain

You may be planning to search for an individual based on their name and the email domain of their company (e.g., daustin@cloudnine.com), but that’s not always inclusive of all possible email addresses for that individual.  Email addresses within an individual’s domain might appear to be straightforward, but there might be aliases or other variations to search for to retrieve emails to and from that individual at that domain.  For example, here are three of the email addresses to which I can receive email as a member of CloudNine Discovery:

To retrieve all of the emails to and from me, you would have to include all of the above addresses (and others too).  There are other variations you may need to account for, as well.  Here are a couple:

  • Jim Smith[/O=FIRST ORGANIZATION/OU=EXCHANGE ADMINISTRATIVE GROUP (GZEJCPIG34TQEMU)/CN=RECIPIENTS/CN=JimSmith] (legacy Exchange distinguished name from old versions of Microsoft Exchange);
  • IMCEANOTES-Andy+20Zipper_Corp_Enron+40ECT@ENRON.com (an internal Lotus Notes representation of an email address from the Enron Data Set).

As you can see, email addresses from the business domain can be represented several different ways, so it’s important to account for that in your searching for emails for your key individuals.
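As a minimal sketch of accounting for those representations in a search, the snippet below matches a single hypothetical custodian (“Jim Smith” at an invented example.com domain) across a plain SMTP address, a legacy Exchange distinguished name, and a Notes-style address. The names, domain, and patterns are all illustrative, not from any particular tool or matter:

```python
import re

# Hypothetical representations of one custodian, "Jim Smith" at example.com:
# plain SMTP, a legacy Exchange distinguished name, and an internal
# Lotus Notes style address. All names and the domain are invented.
PATTERNS = [
    re.compile(r"jsmith@example\.com", re.IGNORECASE),           # SMTP address
    re.compile(r"/CN=RECIPIENTS/CN=JimSmith\b", re.IGNORECASE),  # Exchange DN
    re.compile(r"IMCEANOTES-Jim\+20Smith.*@", re.IGNORECASE),    # Notes format
]

def matches_custodian(address: str) -> bool:
    """Return True if any known representation of the custodian matches."""
    return any(p.search(address) for p in PATTERNS)

addresses = [
    "jsmith@example.com",
    "Jim Smith[/O=FIRST ORG/OU=GROUP/CN=RECIPIENTS/CN=JimSmith]",
    "IMCEANOTES-Jim+20Smith_Corp+40example.com@example.com",
    "jdoe@example.com",
]
hits = [a for a in addresses if matches_custodian(a)]
print(hits)  # the first three addresses; jdoe is a different custodian
```

The point of the sketch is simply that one custodian requires several patterns; a search on the SMTP address alone would miss two of the three representations.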

Personal Email Addresses

Raise your hand if you’ve ever sent any emails from your personal email account(s) through the business domain, even if it’s to remind you of something.  I suspect most of your hands are raised – I know mine is.  Identifying personal email accounts for key individuals can be important for two reasons: 1) those emails within your collection may also be relevant and, 2) you may have to request additional emails from the personal email addresses in discovery if it can be demonstrated that those accounts contain relevant emails.

Searching for Email Addresses

To find all of the relevant email addresses (including the personal ones), you may need to perform searches of the email fields for variations of the person’s name.  So, for example, to find emails for “Jim Smith”, you may need to find occurrences of “Jim”, “James”, “Jimmy”, “JT” and “Smith” within the “To”, “From”, “Cc” and “Bcc” fields.  Then, you have to go through the list and identify the email addresses that appear to be those for Jim Smith.  For any email addresses where you’re not sure whether they belong to the individual (e.g., does jsmith1963@gmail.com belong to Jim Smith or Joe Smith?), you may need to retrieve and examine some of the emails to make that determination.  If he uses nicknames for his personal email addresses (e.g., huggybear2012@msn.com), you should hopefully be able to identify those through emails that he sends to his business account.
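The field search described above can be sketched as follows. This is an illustration only – the field names, message structure, and addresses are invented, not taken from any review tool:

```python
# Scan the To, From, Cc, and Bcc fields of a (toy) collection for name
# variants, collecting candidate addresses for manual review.
NAME_VARIANTS = {"jim", "james", "jimmy", "jt", "smith"}
HEADER_FIELDS = ("to", "from", "cc", "bcc")

def candidate_addresses(messages):
    """Collect addresses whose local part contains any name variant."""
    found = set()
    for msg in messages:
        for field in HEADER_FIELDS:
            for address in msg.get(field, []):
                local = address.split("@")[0].lower()
                if any(v in local for v in NAME_VARIANTS):
                    found.add(address.lower())
    return sorted(found)

messages = [
    {"from": ["jsmith1963@gmail.com"], "to": ["doug.austin@example.com"]},
    {"from": ["huggybear2012@msn.com"], "to": ["james.smith@example.com"]},
]
found = candidate_addresses(messages)
print(found)
```

Note that huggybear2012@msn.com is missed: no name variant appears in it, which is exactly why nickname accounts have to be identified indirectly, through the emails they send to the business account.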

In its Email Analytics module, FirstPass® makes it easy to search for email addresses for an individual – simply go to Global Email Search and type in the string to retrieve all email addresses in the collection with that string.  It really streamlines the process of identifying email addresses for an individual and then reviewing those emails.

Whether or not your application simplifies that process, searching by email address is another way to identify documents pertaining to a key individual.  The key is making sure your search includes all the email addresses possible for that individual.

So, what do you think?  How do you handle searching for key individuals within your document collections?  Please share any comments you might have or if you’d like to know more about a particular topic.


What’s in a Name? Potentially, a Lot of Permutations – eDiscovery Best Practices

When looking for documents in your collection that mention key individuals, conducting a name search for those individuals isn’t always as straightforward as you might think.  There are potentially a number of different ways names could be represented, and if you don’t account for each one of them, you might fail to retrieve key responsive documents – OR retrieve way too many non-responsive documents.  Here are some considerations for conducting name searches.

The Ever-Limited Phrase Search vs. Proximity Searching

Routinely, when clients give me their preliminary search term lists to review, they include names of individuals that they want to search for, like this:

  • “Jim Smith”
  • “Doug Austin”

Phrase searches are the most limited alternative for searching because the search must exactly match the phrase.  For example, a phrase search of “Jim Smith” won’t retrieve “Smith, Jim” if his name appears that way in the documents.

That’s why I prefer to use a proximity search for individual names: it catches several variations and expands the recall of the search.  Proximity searching is simply looking for two or more words that appear close to each other in the document.  A proximity search for “Jim within 3 words of Smith” will retrieve “Jim Smith”, “Smith, Jim”, and even “Jim T. Smith”.  Proximity searching is also a more precise option in most cases than “AND” searches – Doug AND Austin will retrieve any document where someone named Doug is in (or traveling to) Austin, whereas “Doug within 3 words of Austin” will ensure those words are near each other, making it much more likely they’re responsive to the name search.

Accounting for Name Variations

Proximity searches won’t always account for all variations in a person’s name.  What are other variations of the name “Jim”?  How about “James” or “Jimmy”?  Or even “Jimbo”?  I have a friend named “James” who is also called “Jim” by some of his other friends and “Jimmy” by a few of his other friends.  Also, some documents may refer to him by his initials – i.e., “J.T. Smith”.  All are potential variations to search for in your collection.

Common name derivations like those above can be deduced in many cases, but you may not always know the middle name or initial.  If not, it may take performing a search of just the last name and sampling several documents until you are able to determine that middle initial for searching (this may also enable you to identify nicknames like “JayDog”, which could be important given the frequently informal tone of emails, even business emails).

Applying the proximity and name variation concepts into our search, we might perform something like this to get our “Jim Smith” documents:

(jim OR jimmy OR james OR “j.t.”) w/3 Smith, where “w/3” is “within 3 words of”.  This is the syntax you would use to perform the search in OnDemand®, CloudNine Discovery’s online review tool.

That’s a bit more inclusive than the “Jim Smith” phrase search the client originally gave me.

BTW, why did I use “jim OR jimmy” instead of the wildcard “jim*”?  Because wildcard searches could yield additional terms I might not want (e.g., Joe Smith jimmied the lock).  Don’t get wild with wildcards!  Using the specific variations you want (e.g., “jim OR jimmy”) is usually best.
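A rough sketch of how that combined search behaves, assuming a simple tokenizer (real review tools such as OnDemand run the w/3 operator against their own indexes, so treat this purely as an illustration of the logic):

```python
import re

# Explicit first-name variants, deliberately NOT the wildcard "jim*"
FIRST = {"jim", "jimmy", "james", "j.t."}

def jim_smith_hit(text: str, n: int = 3) -> bool:
    """Approximate (jim OR jimmy OR james OR "j.t.") w/3 smith."""
    tokens = re.findall(r"[a-z.]+", text.lower())
    firsts = [i for i, t in enumerate(tokens) if t in FIRST]
    lasts = [i for i, t in enumerate(tokens) if t == "smith"]
    # a hit when any variant falls within n tokens of "smith"
    return any(abs(i - j) <= n for i in firsts for j in lasts)

print(jim_smith_hit("Jim Smith"))                   # True
print(jim_smith_hit("Smith, Jim"))                  # True
print(jim_smith_hit("Jim T. Smith"))                # True
print(jim_smith_hit("Joe Smith jimmied the lock"))  # False
```

The last example is the wildcard trap in miniature: “jim*” would have matched “jimmied” and retrieved that document, while the explicit variant list does not.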

Thursday, we will talk about another way to retrieve documents that mention key individuals – through their email addresses.  Same bat time, same bat channel!

So, what do you think?  How do you handle searching for key individuals within your document collections?  Please share any comments you might have or if you’d like to know more about a particular topic.


Judge Carter Refuses to Recuse Judge Peck in Da Silva Moore – eDiscovery Trends

It seems like ages ago when New York Magistrate Judge Andrew J. Peck denied the motion of the plaintiffs in Da Silva Moore v. Publicis Groupe & MSL Group, No. 11 Civ. 1279 (ALC) (AJP) to recuse himself in the case.  It was all the way back in June.  Now, District Court Judge Andrew L. Carter, Jr. has ruled on the plaintiffs’ recusal request.

In his order from last Wednesday (November 7), Judge Carter stated as follows:

“On the basis of this Court’s review of the entire record, the Court is not persuaded that sufficient cause exists to warrant Magistrate Judge Peck’s disqualification…Judge Peck’s decision accepting computer-assisted review … was not influenced by bias, nor did it create any appearance of bias.”

Judge Carter also noted, “Disagreement or dissatisfaction with Magistrate Judge Peck’s ruling is not enough to succeed here…A disinterested observer fully informed of the facts in this case would find no basis for recusal”.

Since it has been a while, let’s recap the case for those who may have not been following it and may be new to the blog.

Back in February, Judge Peck issued an opinion making this likely the first case to approve the use of computer-assisted review of electronically stored information (“ESI”).  However, on March 13, District Court Judge Andrew L. Carter, Jr. granted the plaintiffs’ request to submit additional briefing on their February 22 objections to the ruling.  In that briefing (filed on March 26), the plaintiffs claimed that the protocol approved for predictive coding “risks failing to capture a staggering 65% of the relevant documents in this case” and questioned Judge Peck’s relationship with defense counsel and with the selected vendor for the case, Recommind.

Then, on April 5, Judge Peck issued an order in response to Plaintiffs’ letter requesting his recusal, directing plaintiffs to indicate whether they would file a formal motion for recusal or ask the Court to consider the letter as the motion.  On April 13 (Friday the 13th, that is), the plaintiffs did just that, by formally requesting the recusal of Judge Peck (the defendants issued a response in opposition on April 30).  But, on April 25, Judge Carter issued an opinion and order in the case, upholding Judge Peck’s opinion approving computer-assisted review.

Not done, the plaintiffs filed an objection on May 9 to Judge Peck’s rejection of their request to stay discovery pending the resolution of outstanding motions and objections (including the recusal motion, which had yet to be ruled on).  Then, on May 14, Judge Peck issued a stay, stopping defendant MSLGroup’s production of electronically stored information.  Finally, on June 15, in a 56-page opinion and order, Judge Peck denied the plaintiffs’ motion for recusal, which Judge Carter has now upheld.

So, what do you think?  Will Judge Carter’s decision not to recuse Judge Peck restart the timetable for predictive coding on this case?  Please share any comments you might have or if you’d like to know more about a particular topic.


Does This Scare You? – eDiscovery Horrors!

Today is Halloween.  While we could try to “scare” you with the traditional “frights”, we’re an eDiscovery blog, so every year we try to “scare” you in a different way instead.  Does this scare you?

The defendant had previously been sanctioned $500,000 ($475,000 to the plaintiff and $25,000 to the court) and held in contempt of court for spoliation by the magistrate judge, who also recommended that an adverse inference instruction be issued at trial.  The defendant appealed to the district court, where Minnesota District Judge John Tunheim increased the award to the plaintiff to $600,000.  Oops!

What about this?

Even though the litigation hold letter from April 2008 was sent to the primary custodians, at least one principal was determined to have actively deleted relevant emails. Additionally, the plaintiffs made no effort to suspend the automatic destruction policy of emails, so emails that were deleted could not be recovered.  Ultimately, the court found that 9 of 14 key custodians had deleted relevant documents. After the defendants raised their spoliation concerns with the court, the plaintiffs continued to delete relevant information, including decommissioning and discarding an email server without preserving any of the relevant ESI.  As a result, the New York Supreme Court imposed the severest of sanctions against the plaintiffs for spoliation of evidence – dismissal of their $20 million case.

Or this?

For most organizations, information volume doubles every 18-24 months and 90% of the data in the world has been created in the last two years. In a typical company in 2011, storing that data consumed about 10% of the IT budget. At a growth rate of 40% (even as storage unit costs decline), storing this data will consume over 20% of the typical IT budget by 2014.
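Those budget percentages are consistent with a simple compound-growth calculation.  A sketch, assuming a flat IT budget and an annual storage unit-cost decline of about 10% (the decline rate is my assumption; the figures above state only that unit costs decline):

```python
# Sanity-check of the budget claim: data volume grows 40% per year,
# storage unit cost declines ~10% per year (assumed), and the overall
# IT budget stays flat. Starting from 10% of the budget in 2011,
# storage spend roughly doubles its share of the budget by 2014.
share = 0.10          # storage as a fraction of the IT budget, 2011
growth = 1.40         # annual data volume growth
cost_decline = 0.90   # annual storage unit-cost multiplier (assumed)

for year in range(2012, 2015):
    share *= growth * cost_decline

print(round(share, 3))  # 0.2 - about 20% of the budget by 2014
```

In other words, even with unit costs falling, 40% volume growth compounds fast enough to double storage’s share of a flat budget in three years.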

How about this?

There “was stunned silence by all attorneys in the court room after that order. It looks like neither side saw it coming.”

Or maybe this?

If you have deleted any of your photos from Facebook in the past three years, you may be surprised to find that they are probably still on the company’s servers.

Scary, huh?  If the possibility of sanctions, exponential data growth and judges ordering parties to perform predictive coding keep you awake at night, then the folks at eDiscovery Daily will do our best to provide useful information and best practices to enable you to relax and sleep soundly, even on Halloween!

Then again, if the expense, difficulty and risk of processing and loading up to 100 GB of data into an eDiscovery review application that you’ve never used before terrifies you, maybe you should check this out.

Of course, if you seriously want to get into the spirit of Halloween, click here.  This will really terrify you!

Those of you who are really mortified that the next post in Jane Gennarelli’s “Litigation 101” series won’t run this week, fear not – it will run tomorrow.

What do you think?  Is there a particular eDiscovery issue that scares you?  Please share your comments and let us know if you’d like more information on a particular topic.

Happy Halloween!


Both Sides Instructed to Use Predictive Coding or Show Cause Why Not – eDiscovery Case Law

As reported in Ralph Losey’s e-Discovery Team® blog, Vice Chancellor J. Travis Laster in Delaware Chancery Court – in EORHB, Inc., et al v. HOA Holdings, LLC, C.A. No. 7409-VCL (Del. Ch. Oct. 15, 2012) – has issued a “surprise” bench order requiring both sides to use predictive coding and to use the same vendor.

As Ralph notes, this “appears to be the first time a judge has required both sides of a dispute to use predictive coding when neither has asked for it. It may also be the first time a judge has ordered parties to use the same vendor.”  Vice Chancellor Laster’s instruction was as follows:

“This seems to me to be an ideal non-expedited case in which the parties would benefit from using predictive coding.  I would like you all, if you do not want to use predictive coding, to show cause why this is not a case where predictive coding is the way to go.

I would like you all to talk about a single discovery provider that could be used to warehouse both sides’ documents to be your single vendor.  Pick one of these wonderful discovery super powers that is able to maintain the integrity of both side’s documents and insure that no one can access the other side’s information.  If you cannot agree on a suitable discovery vendor, you can submit names to me and I will pick one for you.

One thing I don’t want to do – one of the nice things about most of these situations is once people get to the indemnification realm, particularly if you get the business guys involved, they have some interest in working out a number and moving on.  The problem is that these types of indemnification claims can generate a huge amount of documents.  That’s why I would really encourage you all, instead of burning lots of hours with people reviewing, it seems to me this is the type of non-expedited case where we could all benefit from some new technology use.”

Ralph notes that there “was stunned silence by all attorneys in the court room after that order. It looks like neither side saw it coming.”  It will be interesting to see if either party, or both, proceeds to object and attempts to “show cause” as to why they shouldn’t use predictive coding.

So, what do you think?  Is this an isolated case or the start of a trend?  Please share any comments you might have or if you’d like to know more about a particular topic.


Are You Requesting the Best Production Format for Your Case? – eDiscovery Best Practices

One of the blogs I read regularly is Ball in your Court from Craig Ball, a previous thought leader interviewee on this blog.  His post from last Tuesday, Are They Trying to Screw Me?, is one that all attorneys who request ESI productions should read.

Ball describes a fairly typical proposed production format, as follows:

“Documents will be produced as single page TIFF files with multi-page extracted text or OCR.  We will furnish delimited IPRO or Opticon load files and will later identify fielded information we plan to exchange.”

Then, he asks the question: “Are they trying to screw you?”  Answer: “Probably not.”  But, “Are you screwing yourself by accepting the proposed form of production?  Yes, probably.”

With regard to producing TIFF files, Ball notes that “Converting a native document to TIFF images is lobotomizing the document.”  The TIFF image is devoid of any of the metadata that provides valuable information about the way in which the document was used, making analysis of the produced documents a much more difficult effort.  Ball sums up TIFF productions by saying “Think of a TIFF as a PDF’s retarded little brother.  I mean no offense by that, but TIFFs are not just differently abled; they are severely handicapped.  Not born that way, but lamed and maimed on purpose.  The other side downgrades what they give you, making it harder to use and stripping it of potentially-probative content.”

Opposing counsel isn’t trying to screw you with a TIFF production.  They just do it because they always provide it that way.  And, you accept it that way because you’ve always accepted it that way.  Ball notes that “You may accept the screwed up proposal because, even if the data is less useful and incomplete, you won’t have to evolve.  You’ll pull the TIFF images into your browser and painstakingly read them one-by-one, just like good ol’ paper; all-the-while telling yourself that what you didn’t get probably wasn’t that important and promising yourself that next time, you’ll hold out for the good stuff—the native stuff.”

We recently ran a blog series called First Pass Review – Of Your Opponent’s Data.  In that series, we discussed how useful Early Data Assessment/First Pass Review applications can be in reviewing your opponent’s produced ESI.  At CloudNine Discovery, we use FirstPass®, powered by Venio FPR™, for first pass review – it provides a number of mechanisms that are useful in analyzing your opponent’s produced data.  Capabilities like email analytics and message thread analysis (where missing emails in threads can be identified), synonym searching, fuzzy searching and domain categorization are quite useful in developing an understanding of your opponent’s production.  However, these mechanisms are only as useful as the data they’re analyzing.  Email analytics, message thread analysis and domain categorization are driven by metadata, so they are useless on TIFF/OCR productions.  You can’t analyze what you don’t have.
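As a minimal illustration of why those analytics need metadata, here is a domain categorization sketch that tallies sender domains from From headers (the addresses are invented).  With a TIFF/OCR production, there are simply no such fields to tally:

```python
from collections import Counter
from email.utils import parseaddr

# Tally sender domains from From headers - a toy version of the
# domain categorization analytics described above. Illustrative data.
froms = [
    "Jim Smith <jsmith@example.com>",
    "jsmith@example.com",
    "counsel@lawfirm.example",
    "jdoe@example.com",
]

# parseaddr splits a display name from the address itself; the domain
# is everything after the "@"
domains = Counter(parseaddr(f)[1].split("@")[-1].lower() for f in froms)
print(domains.most_common())
```

A tally like this quickly surfaces, say, a personal webmail domain lurking in a business collection – but only when the production preserves the underlying header fields.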

It’s time to evolve.  To get the most information out of your opponent’s production, you need to request the production in native format.  Opponents are probably not trying to screw you by producing in TIFF format, but you are screwing yourself if you decide to accept it in that format.

So, what do you think?  Do you request native productions from your opponents?  If not, why not?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Don’t Be “Duped”, Files with Different HASH Values Can Still Be the Same – eDiscovery Best Practices

A couple of months ago, we published a post discussing how the number of pages in each gigabyte can vary widely.  To help illustrate the concept, we took one of our blog posts and saved it in several different file formats, showing that each file had the same content, yet was a different size.  That’s not the only concept that example illustrates.

Content is Often Republished

How many of you have ever printed or saved a file to Adobe Acrobat PDF format?  Personally, I do it all the time.  For example, I “publish” marketing slicks created in Microsoft® Publisher, “publish” finalized client proposals created in Microsoft Word and “publish” presentations created in Microsoft PowerPoint to PDF format regularly.  Microsoft now even includes PDF as one of the standard file formats to which you can save a file, and I have a free PDF print driver on my laptop, so I can conceivably create a PDF file for just about anything that I can print.  In each case, I’m duplicating the content of the file, but in a different file format designed for publishing that content.

Another way content is republished is via the ubiquitous “copy and paste” capability that so many use to duplicate content into another file.  Whether copying part or all of the content, “copy and paste” functionality is available in just about every application, making it easy to duplicate content from one application to another – or even from one file to the next within the same application.

Same Content, Different HASH

When publishing a file to PDF, or copying the entire contents of a file to a new file, the content may be the same, but the HASH value – a digital fingerprint computed from the exact bytes of the file – will be different.  So, a Word file and the PDF file published from it may contain the same content, but they will have different HASH values.  Even copying the content from one file to another in the same software program can result in different HASH values, or even different file sizes.  For example, I copied the entire contents of yesterday’s blog post, written in Word, into a brand new Word file.  Not only did the two files have different HASH values, but they were different sizes – the copied file was 8K smaller than the original.  So these files, while identical in content, won’t be considered “duplicates” based on HASH value and won’t be “de-duped” out of the collection.  Instead, they are treated as “near-dupes” for analysis purposes, even though the content is essentially identical.
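Here’s a quick illustration of the idea using Python’s standard hashlib module (the RTF wrapper below simply stands in for a different file format, like Word vs. PDF – it’s a contrived example, not a real conversion):

```python
import hashlib

# The same sentence, stored two different ways: as plain text, and as the
# same text wrapped in minimal RTF markup (a stand-in for a format change).
content = "Identical content, different file formats."

plain_bytes = content.encode("utf-8")
rtf_bytes = (r"{\rtf1\ansi " + content + "}").encode("utf-8")

# The HASH is computed from the exact bytes, so any difference in the
# file's byte stream yields a completely different fingerprint.
plain_hash = hashlib.sha256(plain_bytes).hexdigest()
rtf_hash = hashlib.sha256(rtf_bytes).hexdigest()

print(plain_hash == rtf_hash)  # False: same content, different fingerprints
```

Because hash-based de-duplication compares those fingerprints – not the words a human reads – it will never match these two files, no matter how identical their content looks on screen.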

What to Do with the Near-Dupes?

Identifying and culling these essentially identical near-dupes isn’t necessary in every case, but if it is, you’ll need to perform a process that groups similar documents together so that those near-dupes can be identified and addressed.  We call that “clustering”.  For more on the benefits of clustering, check out this blog post.
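As a rough sketch of what clustering-style near-dupe detection does under the hood (a toy example, not any vendor’s actual algorithm), you can compare documents by the overlapping word sequences they share – pairs whose similarity score approaches 1.0 are likely near-dupes:

```python
import re
from itertools import combinations

def shingles(text, k=3):
    # Break a document into overlapping k-word "shingles".
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # Similarity of two shingle sets: |intersection| / |union|.
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical collection: a Word file, its published PDF (same text,
# different bytes/HASH), and an unrelated document.
docs = {
    "original.docx": "The quick brown fox jumps over the lazy dog near the river bank.",
    "copy.pdf":      "The quick brown fox jumps over the lazy dog near the river bank.",
    "unrelated.txt": "Quarterly revenue figures exceeded expectations this fiscal year.",
}

sets = {name: shingles(text) for name, text in docs.items()}
for x, y in combinations(docs, 2):
    score = jaccard(sets[x], sets[y])
    if score >= 0.8:  # threshold for flagging near-duplicates
        print(f"{x} and {y} look like near-dupes (similarity {score:.2f})")
```

Running this flags only the original/PDF pair, even though their HASH values would differ – which is exactly the gap that content-based clustering fills.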

So, what do you think?  What do you do with “dupes” that have different HASH values?  Please share any comments you might have or if you’d like to know more about a particular topic.