Analysis

Moneycase: Should Your Law Practice Be Run Like a Baseball Team? — eDiscovery Trends

Remember the movie Moneyball (adapted from the book of the same name) about Oakland A’s general manager Billy Beane’s use of computer-generated analytics to assemble a baseball team that advanced to the playoffs while spending a fraction of what other teams spent?  Can law firms learn from that example?

According to Angela Hunt in a recent article in Law Technology News (Why Attorneys Love-Hate Data Analytics), maybe they can.  As she notes in her article, James Michalowicz, managing director of Huron Legal, advises firms to use big data and performance metrics to minimize legal spending.

Like the old-time baseball experts in Moneyball who scoffed at the use of computer analytics to pick baseball players, some attorneys question the benefits in the legal arena.  “As much as I think the use of analytics is now penetrating the sports world, I think it’s slower in the legal world,” Michalowicz told Law Technology News. Since a law firm’s value depends heavily on its legal knowledge base, installing a program that does all the heavy thinking can make attorneys feel like their hard-earned legal education is being undermined, explains Michalowicz. “There’s this emotional piece to it. Lawyers don’t want to rely on data. It’s a challenge to their pride.”

However, for large firms and corporations that deal with litigation regularly, Michalowicz recommends using strategic case analytics, a predictive technology that helps attorneys pick their battles.  As the article notes, “[b]y evaluating venue data and case histories within a jurisdiction, law firms and corporate legal departments can give unbiased advice on whether to litigate or settle.”

For the past three years, at LegalTech New York (LTNY), we have conducted and published a Thought Leader Series of interviews with various thought leaders in the litigation and eDiscovery industry (here’s the link to this year’s set of interviews).  One of the interviews was with Don Philbin, President and Founder of Picture It Settled®, which is a predictive analytics tool for the settlement negotiation process.  To support this process, they collected data for about ten thousand cases – not just the outcomes, but also the incremental moves that people make in negotiation.  If Billy Beane were an attorney, he’d love it!

Over the next few weeks, we’ll look at other analytics mechanisms to improve efficiency in the litigation and discovery process.

So, what do you think?  Do you employ any data analytics in your discovery practice?   Please share any comments you might have or if you’d like to know more about a particular topic.

Image © 2011 – Sony Pictures

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Does Size Matter? – eDiscovery Replay

Even those of us at eDiscovery Daily have to take an occasional vacation (see above); however, instead of “going dark” for the week, we thought we would use the week to do something interesting.  Up to this week, we have had 815 posts over 3+ years of the blog.  Some have been quite popular, so we thought we would “replay” the top four all-time posts this week in terms of page views since the blog began (in case you missed them).  Casey Kasem would be proud!  Apparently, my catchy title worked as, with over 1,150 lifetime views, here is the third most viewed post all time, originally published in March 2011.  Enjoy!

______________________________

I admit it, with a title like “Does Size Matter?”, I’m looking for a few extra page views.  😉

I frequently get asked how big an ESI collection needs to be to benefit from eDiscovery technology.  In a recent case with one of my clients, the client had a fairly small collection – only about 4 GB.  But, when a judge ruled that they had to start conducting depositions in a week, they needed to review that data in a weekend.  Without the ability to cull the data and use OnDemand® to manage the linear review, they would not have been able to make that deadline.  So, they clearly benefited from the use of eDiscovery technology in that case.

But, if you’re not facing a tight deadline, how large does your collection need to be for the use of eDiscovery technology to provide benefits?

I recently conducted a webinar regarding the benefits of First Pass Review – aka Early Case Assessment, or a more accurate term (as George Socha points out regularly), Early Data Assessment.  One of the topics discussed in that webinar was the cost of review for each gigabyte (GB).  Extrapolated from an analysis conducted by Anne Kershaw a few years ago (and published in the Gartner report E-Discovery: Project Planning and Budgeting 2008-2011), here is a breakdown:

Estimated Cost to Review All Documents in a GB:

  • Pages per GB: 75,000
  • Pages per Document: 4
  • Documents per GB: 18,750
  • Review Rate: 50 documents per hour
  • Total Review Hours: 375
  • Reviewer Billing Rate: $50 per hour

Total Cost to Review Each GB:      $18,750

Notes: The number of pages per GB can vary widely.  Page per GB estimates tend to range from 50,000 to 100,000 pages per GB, so 75,000 pages (18,750 documents) seems an appropriate average.  50 documents reviewed per hour is considered to be a fast review rate and $50 per hour is considered to be a bargain price.  eDiscovery Daily provided an earlier estimate of $16,650 per GB based on assumptions of 20,000 documents per GB and 60 documents reviewed per hour – the assumptions may change somewhat, but, either way, the cost for attorney review of each GB could be expected to range from at least $16,000 to $18,000, possibly more.
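
The arithmetic behind that breakdown is easy to rerun with your own assumptions.  Here is a minimal Python sketch of it; the function name review_cost_per_gb is just an illustrative choice, and the inputs are the figures from the breakdown above.

```python
def review_cost_per_gb(docs_per_gb, docs_per_hour, hourly_rate):
    """Return (review_hours, total_cost) for reviewing every document in one GB."""
    review_hours = docs_per_gb / docs_per_hour
    return review_hours, review_hours * hourly_rate


# Figures taken from the breakdown above.
pages_per_gb = 75_000
pages_per_document = 4
docs_per_gb = pages_per_gb / pages_per_document  # 18,750 documents

hours, cost = review_cost_per_gb(docs_per_gb, docs_per_hour=50, hourly_rate=50)
print(f"{docs_per_gb:,.0f} documents, {hours:,.0f} hours, ${cost:,.0f} per GB")
```

Swapping in 20,000 documents per GB and 60 documents per hour, and assuming the same $50 hourly rate (the rate behind the earlier estimate is not restated here), gives roughly $16,667 per GB, which is in line with the earlier figure.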

Advanced culling and searching can enable you to cull out 70-80% of most collections as clearly non-responsive without having to conduct attorney review on those files.  If you have merely a 2 GB collection and assume the lowest review cost above of $16,000 per GB, the use of advanced culling and searching to cull out 70% of the collection can save $22,400 in attorney review costs.  Is that worth it?
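
To see how the savings scale with collection size and cull rate, here is a small Python sketch.  The function name culling_savings is hypothetical; the 2 GB collection, the $16,000 per GB cost and the 70-80% cull range come from the paragraph above.

```python
def culling_savings(collection_gb, review_cost_per_gb, cull_percent):
    """Attorney review cost avoided by culling cull_percent of the collection."""
    return collection_gb * review_cost_per_gb * cull_percent / 100


# Low and high ends of the 70-80% cull range for a 2 GB collection.
low = culling_savings(2, 16_000, 70)
high = culling_savings(2, 16_000, 80)
print(f"Savings at 70%: ${low:,.0f}; at 80%: ${high:,.0f}")
```

This sketch only counts avoided attorney review costs; it does not subtract whatever the culling technology itself costs.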

So, what do you think?  Do you use eDiscovery technology for only the really large cases or ALL cases?   Please share any comments you might have or if you’d like to know more about a particular topic.


The Number of Pages in Each Gigabyte Can Vary Widely – eDiscovery Replay

Even those of us at eDiscovery Daily have to take an occasional vacation (see above); however, instead of “going dark” for the week, we thought we would use the week to do something interesting.  Up to this week, we have had 815 posts over 3+ years of the blog.  Some have been quite popular, so we thought we would “replay” the top four all-time posts this week in terms of page views since the blog began (in case you missed them).  Casey Kasem would be proud!  With nearly 1,000 lifetime views, here is the fourth most viewed post all time, originally published in July 2012.  Enjoy!

_________________________

A while back, we talked about how the average number of pages in each gigabyte is approximately 50,000 to 75,000 pages and that each gigabyte effectively culled out can save $18,750 in review costs.  But, did you know just how widely the number of pages per gigabyte can vary?

The “how many pages” question comes up a lot and I’ve seen a variety of answers.  Michael Recker of Applied Discovery posted an article to their blog last week titled Just How Big Is a Gigabyte?, which provides some perspective based on the types of files contained within the gigabyte, as follows:

“For example, e-mail files typically average 100,099 pages per gigabyte, while Microsoft Word files typically average 64,782 pages per gigabyte. Text files, on average, consist of a whopping 677,963 pages per gigabyte. At the opposite end of the spectrum, the average gigabyte of images contains 15,477 pages; the average gigabyte of PowerPoint slides typically includes 17,552 pages.”

Of course, each GB of data is rarely just one type of file.  Many emails include attachments, which can be in any of a number of different file formats.  Collections of files from hard drives may include Word, Excel, PowerPoint, Adobe PDF and other file formats.  So, estimating page counts with any degree of precision is somewhat difficult.

In fact, the same exact content ported into different applications can be a different size in each file, due to the overhead required by each application.  To illustrate this, I decided to conduct a little (admittedly unscientific) study using yesterday’s one page blog post about the Apple/Samsung litigation.  I decided to put the content from that page into several different file formats to illustrate how much the size can vary, even when the content is essentially the same.  Here are the results:

  • Text File Format (TXT): Created by performing a “Save As” on the web page for the blog post to text – 10 KB;
  • HyperText Markup Language (HTML): Created by performing a “Save As” on the web page for the blog post to HTML – 36 KB, over 3.5 times larger than the text file;
  • Microsoft Excel 2010 Format (XLSX): Created by copying the contents of the blog post and pasting it into a blank Excel workbook – 128 KB, nearly 13 times larger than the text file;
  • Microsoft Word 2010 Format (DOCX): Created by copying the contents of the blog post and pasting it into a blank Word document – 162 KB, over 16 times larger than the text file;
  • Adobe PDF Format (PDF): Created by printing the blog post to PDF file using the CutePDF printer driver – 211 KB, over 21 times larger than the text file;
  • Microsoft Outlook 2010 Message Format (MSG): Created by copying the contents of the blog post and pasting it into a blank Outlook message, then sending that message to myself, then saving the message out to my hard drive – 221 KB, over 22 times larger than the text file.
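
If you want to check those multiples (or repeat the experiment with your own files), a few lines of Python will do it.  The sizes below are the ones measured above; the variable names are just for illustration.

```python
# File sizes (in KB) measured for the same one-page blog post.
sizes_kb = {
    "TXT": 10,
    "HTML": 36,
    "XLSX": 128,
    "DOCX": 162,
    "PDF": 211,
    "MSG": 221,
}

baseline = sizes_kb["TXT"]
# Size of each format expressed as a multiple of the plain text file.
ratios = {fmt: size / baseline for fmt, size in sizes_kb.items()}

for fmt, ratio in ratios.items():
    print(f"{fmt}: {sizes_kb[fmt]} KB, {ratio:.1f} times the text file")
```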

The Outlook example was probably the least representative of a typical email – most emails don’t have several embedded graphics in them (with the exception of signature logos) – and most are typically much shorter than yesterday’s blog post (which also included the side text on the page as I copied that too).  Still, the example hopefully illustrates that a “page”, even with the same exact content, will be different sizes in different applications.  As a result, to estimate the number of pages in a collection with any degree of accuracy, it’s important to understand not only the size of the data collection, but also its makeup.
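
As a rough illustration of why the makeup matters, here is a Python sketch that estimates page counts from a breakdown by file type, using the per-gigabyte averages quoted from the Applied Discovery article above.  The 10 GB mix is entirely hypothetical, and the function name estimate_pages is my own; treat the output as a ballpark, not a prediction.

```python
# Average pages per GB by file type, as quoted from the Applied Discovery article.
PAGES_PER_GB = {
    "email": 100_099,
    "word": 64_782,
    "text": 677_963,
    "image": 15_477,
    "powerpoint": 17_552,
}


def estimate_pages(gb_by_type):
    """Estimate total pages from a {file_type: gigabytes} breakdown."""
    return sum(PAGES_PER_GB[file_type] * gb for file_type, gb in gb_by_type.items())


# Hypothetical 10 GB collection: mostly email, some Word, a little of the rest.
mix = {"email": 6.0, "word": 2.5, "powerpoint": 1.0, "image": 0.5}
total_gb = sum(mix.values())

by_makeup = estimate_pages(mix)
flat_rate = 75_000 * total_gb  # the single 75,000 pages per GB average

print(f"By makeup: {by_makeup:,.0f} pages; flat rate: {flat_rate:,.0f} pages")
```

Even this is a simplification, since email attachments blur the line between file types, but it shows how far a makeup-based estimate can drift from a flat per-gigabyte average.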

So, what do you think?  Was this example useful or highly flawed?  Or both?  Please share any comments you might have or if you’d like to know more about a particular topic.


Plaintiffs’ Supreme Effort to Recuse Judge Peck in Da Silva Moore Denied – eDiscovery Case Law

As we discussed back in July, attorneys representing lead plaintiff Monique Da Silva Moore and five other employees filed a petition for a writ of certiorari with the US Supreme Court arguing that New York Magistrate Judge Andrew Peck, who approved an eDiscovery protocol agreed to by the parties that included predictive coding technology, should have recused himself given his previous public statements expressing strong support of predictive coding.  Earlier this month, on October 7, that petition was denied by the Supreme Court.

Da Silva Moore and her co-plaintiffs had argued in the petition that the Second Circuit Court of Appeals was too deferential to Peck when denying the plaintiffs’ petition to recuse him, asking the Supreme Court to order the Second Circuit to use the less deferential “de novo” standard.

The plaintiffs have now been denied in their recusal efforts in four courts.  Here is the link to the Supreme Court docket item, referencing denial of the petition.

This battle over predictive coding and Judge Peck’s participation has continued for over 18 months.  For those who may have not been following the case or may be new to the blog, here’s a recap.

Last year, back in February, Judge Peck issued an opinion making this likely the first case to accept the use of computer-assisted review of electronically stored information (“ESI”).  However, on March 13, District Court Judge Andrew L. Carter, Jr. granted the plaintiffs’ request to submit additional briefing on their February 22 objections to the ruling.  In that briefing (filed on March 26), the plaintiffs claimed that the protocol approved for predictive coding “risks failing to capture a staggering 65% of the relevant documents in this case” and questioned Judge Peck’s relationship with defense counsel and with the selected vendor for the case, Recommind.

Then, on April 5, 2012, Judge Peck issued an order in response to Plaintiffs’ letter requesting his recusal, directing plaintiffs to indicate whether they would file a formal motion for recusal or ask the Court to consider the letter as the motion.  On April 13 (Friday the 13th, that is), the plaintiffs did just that, by formally requesting the recusal of Judge Peck (the defendants issued a response in opposition on April 30).  But, on April 25, Judge Carter issued an opinion and order in the case, upholding Judge Peck’s opinion approving computer-assisted review.

Not done, the plaintiffs filed an objection on May 9 to Judge Peck’s rejection of their request to stay discovery pending the resolution of outstanding motions and objections (including the recusal motion, which had yet to be ruled on).  Then, on May 14, Judge Peck issued a stay, stopping defendant MSLGroup’s production of electronically stored information.  On June 15, in a 56-page opinion and order, Judge Peck denied the plaintiffs’ motion for recusal.  Judge Carter ruled on the plaintiffs’ recusal request on November 7 of last year, denying the request and stating that “Judge Peck’s decision accepting computer-assisted review … was not influenced by bias, nor did it create any appearance of bias”.

The plaintiffs then filed a petition for a writ of mandamus with the US Court of Appeals for the Second Circuit, which was denied this past April, leading to their petition for a writ of certiorari with the US Supreme Court, which has now also been denied.

So, what do you think?  Will we finally move on to the merits of the case?  Please share any comments you might have or if you’d like to know more about a particular topic.


For Successful Discovery, Think Backwards – eDiscovery Best Practices

The Electronic Discovery Reference Model (EDRM) has become the standard model for the workflow of the process for handling electronically stored information (ESI) in discovery.  But, to succeed in discovery, regardless of whether you’re the producing party or the receiving party, it might be helpful to think about the EDRM model backwards.

Why think backwards?

You can’t have a successful outcome without envisioning the successful outcome that you want to achieve.  The end of the discovery process includes the production and presentation stages, so it’s important to determine what you want to get out of those stages.  Let’s look at them.

Presentation

As a receiving party, it’s important to think about what types of evidence you need to support your case when presenting at depositions and at trial – this is the type of information that needs to be included in your production requests at the beginning of the case.

Production

The format of the ESI produced is important to both sides in the case.  For the receiving party, it’s important to get as much useful information included in the production as possible.  This includes metadata and searchable text for the produced documents, typically with an index or load file to facilitate loading into a review application.  The most useful form of production is native format files with all metadata preserved as used in the normal course of business.

For the producing party, it’s important to control costs, so agree to a production format that minimizes production costs.  Converting files to an image-based format (such as TIFF) adds costs, so producing in native format can be cost-effective for the producing party as well.  It’s also important to determine how to handle issues such as privilege logs and redaction of privileged or confidential information.

Addressing production format issues up front will maximize cost savings and enable each party to get what they want out of the production of ESI.

Processing-Review-Analysis

It also pays to make decisions early in the process that affect processing, review and analysis.  How should exception files be handled?  What do you do about files that are infected with malware?  These are examples of issues that need to be decided up front to determine how processing will be handled.

As for review, the review tool being used may impact production specs in terms of how files are viewed and production of load files that are compatible with the review tool, among other considerations.  As for analysis, surely you test search terms to determine their effectiveness before you agree on those terms with opposing counsel, right?

Preservation-Collection-Identification

Long before you have to conduct preservation and collection for a case, you need to establish procedures for implementing and monitoring litigation holds, as well as prepare a data map to identify where corporate information is stored for identification, preservation and collection purposes.

As you can see, at the beginning of a case (and even before), it’s important to think backwards within the EDRM model to ensure a successful discovery process.  Decisions made at the beginning of the case affect the success of those later stages, so don’t forget to think backwards!

So, what do you think?  What do you do at the beginning of a case to ensure success at the end?   Please share any comments you might have or if you’d like to know more about a particular topic.

P.S. — Notice anything different about the EDRM graphic?


eDiscovery Daily is Three Years Old!

We’ve always been free, now we are three!

It’s hard to believe that it was three years ago today that we launched the eDiscoveryDaily blog.  We’re past the “terrible twos” and heading towards pre-school.  Before you know it, we’ll be ready to take our driver’s test!

We have seen traffic on our site (from our first three months of existence to our most recent three months) grow an amazing 575%!  Our subscriber base has grown over 50% in the last year alone!  Back in June, we hit over 200,000 visits on the site and now we have over 236,000!

We continue to appreciate the interest you’ve shown in the topics and will do our best to continue to provide interesting and useful posts about eDiscovery trends, best practices and case law.  That’s what this blog is all about.  And, in each post, we like to ask you to “please share any comments you might have or if you’d like to know more about a particular topic”, so we encourage you to do so to make this blog even more useful.

We also want to thank the blogs and publications that have linked to our posts and raised our public awareness, including Pinhawk, Ride the Lightning, Litigation Support Guru, Complex Discovery, Bryan College, The Electronic Discovery Reading Room, Litigation Support Today, Alltop, ABA Journal, Litigation Support Blog.com, Litigation Support Technology & News, InfoGovernance Engagement Area, EDD Blog Online, eDiscovery Journal, Learn About E-Discovery, e-Discovery Team ® and any other publication that has picked up at least one of our posts for reference (sorry if I missed any!).  We really appreciate it!

As many of you know by now, we like to take a look back every six months at some of the important stories and topics during that time.  So, here are some posts over the last six months you may have missed.  Enjoy!

Rodney Dangerfield might put it this way – “I Tell Ya, Information Governance Gets No Respect”

Is it Time to Ditch the Per Hour Model for Document Review?  Here’s some food for thought.

Is it Possible for a File to be Modified Before it is Created?  Maybe, but here are some mechanisms for avoiding that scenario (here, here, here, here, here and here).  Best of all, they’re free.

Did you know changes to the Federal eDiscovery Rules are coming?  Here’s some more information.

Count Minnesota and Kansas among the states that are also making changes to support eDiscovery.

By the way, since the Electronic Discovery Reference Model (EDRM) annual meeting back in May, several EDRM projects (Metrics, Jobs, Data Set and the new Native Files project) have already announced new deliverables and/or requested feedback.

When it comes to electronically stored information (ESI), ensuring proper chain of custody tracking is an important part of handling that ESI through the eDiscovery process.

Do you self-collect?  Don’t Forget to Check for Image Only Files!

The Files are Already Electronic, How Hard Can They Be to Load?  A sound process makes it easier.

When you remove a virus from your collection, does it violate your discovery agreement?

Do you think that you’ve read everything there is to read on Technology Assisted Review?  If you missed anything, it’s probably here.

Consider using a “SWOT” analysis or Decision Tree for better eDiscovery planning.

If you’re an eDiscovery professional, here is what you need to know about litigation.

BTW, eDiscovery Daily has had 242 posts related to eDiscovery Case Law since the blog began!  Forty-four of them have been in the last six months.

Our battle cry for next September?  “Four more years!”  🙂


SWOT Away Uncertainty in Your Discovery Practice – eDiscovery Best Practices

Understanding the relationships between your organization’s internal and external challenges allows your organization to approach ongoing and future discovery in a more strategic manner.  A “SWOT” analysis is a tool that can be used to develop that understanding.

A “SWOT” analysis is a structured planning method used to evaluate the Strengths, Weaknesses, Opportunities, and Threats associated with a specific business objective.  That can be a specific project or all of the activities of a business unit.  It involves specifying the objective and identifying the internal and external factors that are favorable and unfavorable to achieving that objective.  The SWOT analysis is broken down as follows:

  • Strengths: characteristics of the business or project that give it an advantage over others;
  • Weaknesses: characteristics that place the team at a disadvantage relative to others;
  • Opportunities: elements that the project could exploit to its advantage;
  • Threats: elements in the environment that could cause trouble for the business or project.

“SWOT”, get it?

From an eDiscovery perspective, a SWOT analysis enables you to take an objective look at how your organization handles discovery issues – what you do well and where you need to improve – and the external factors that can affect how your organization addresses its discovery challenges.  Specifically, it enables you to assess how your organization handles each phase of the discovery process – from Information Governance to Presentation – to evaluate where your strengths and weaknesses exist so that you can capitalize on your strengths and implement changes to address your weaknesses.

How solid is your information governance plan?  How well does your legal department communicate with IT?  How well formalized is your coordination with outside counsel and vendors?  Do you have a formalized process for implementing and tracking litigation holds?  These are examples of questions you might ask about your organization and, based on the answers, identify your organization’s strengths and weaknesses in managing the discovery process.

However, if you only look within your organization, that’s only half the battle.  You also need to look at external factors and how they affect your organization in its handling of discovery issues.  Trends such as the growth of social media, and changes to state or federal rules addressing handling of electronically stored information (ESI) need to be considered in your organization’s strategic discovery plan.

Having worked through the strategic analysis process with several organizations, I find that the SWOT analysis is a useful tool for summarizing where the organization currently stands with regard to managing discovery, which naturally leads to recommendations for improvement.

So, what do you think?  Has your organization performed a SWOT analysis of your discovery process?   Please share any comments you might have or if you’d like to know more about a particular topic.

Graphic source: Wikipedia.


Everything You Wanted to Know about Technology Assisted Review – eDiscovery Trends

Whether you were “afraid to ask” or not…

Rob Robinson has put together another terrific compilation, this time a compilation of articles about Technology Assisted Review and Predictive Coding over the past 1 1/2 years (from February 2012, last updated on August 12).  If you simply can’t get enough of the topic, you’ll want to check it out.

His compilation can be found at his Complex Discovery web site here (the title of the page is Technology-Assisted Review: From Expert Explanations to Mainstream Mentions).  According to my count, there are 632(!) articles regarding the topic.  Happy reading!

Of course, eDiscovery Daily made its fair share of contributions to the list.  Here are our posts regarding the topic on the site, in case you missed them and want to catch up:

Here are a few others that aren’t listed – just sayin’ Rob!  😉:

Thanks to Rob, once again, for providing a very useful compilation on a very important eDiscovery topic.  And, Rob, if you want to add links for the additional posts above, we won’t complain.  🙂

So, what do you think?  Do you keep up with articles about technology assisted review?  Please share any comments you might have or if you’d like to know more about a particular topic.


How Big is Your ESI Collection, Really? – eDiscovery Best Practices

When I was at ILTA last week, this topic came up in a discussion with a colleague during the show, so I thought it would be good to revisit here.

After identifying custodians relevant to the case and collecting files from each, you’ve collected roughly 100 gigabytes (GB) of Microsoft Outlook email PST files and loose electronic files from the custodians.  You identify a vendor to process the files to load into a review tool, so that you can perform review and produce the files to opposing counsel.  After processing, the vendor sends you a bill – and they’ve charged you to process over 200 GB!!  Are they trying to overbill you?

Yes and no.

Many of the files in most ESI collections are stored in what are known as “archive” or “container” files.  For example, while Outlook emails can be stored in different file formats, they are typically collected from each custodian and saved in a personal storage (.PST) file format, which is an expanding container file. The scanned size for the PST file is the size of the file on disk.

Did you ever see one of those vacuum bags that you store clothes in and then suck all the air out so that the clothes won’t take as much space?  The PST file is like one of those vacuum bags – it often stores the emails and attachments in a compressed format to save space.  There are other types of archive container files that compress the contents – .ZIP and .RAR files are two examples of compressed container files.  These files are often used not only to compress files for storage on hard drives, but also to compact or group a set of files when transmitting them, often in email.  With email comprising a major portion of most ESI collections and the popularity of other archive container files for compressing file collections, the expanded size of your collection may be considerably larger than it appears when stored on disk.

When PST, ZIP, RAR or other compressed file formats are processed for loading into a review tool, they are expanded to their normal size.  This expanded size can be 1.5 to 2 times larger than the scanned size (or more).  And, that’s what some vendors will bill processing on – the expanded size.  In those cases, you won’t know what the processing costs will be until processing is complete, because the expanded size is difficult to determine in advance.
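To make the scanned-versus-expanded distinction concrete, here is a minimal Python sketch that compares the two sizes for a ZIP archive using the standard `zipfile` module. The sample archive, its file names, and the helper names (`scanned_vs_expanded`, `build_sample_archive`) are made up for illustration, and this is not how any particular vendor measures size. It only looks at one level of a ZIP file; PST and RAR files need other tools, and nested containers would need to be expanded recursively.

```python
import io
import zipfile


def scanned_vs_expanded(zip_bytes: bytes) -> tuple[int, int]:
    """Return (scanned_size, expanded_size) in bytes for a ZIP archive in memory.

    scanned_size is the size of the archive as stored; expanded_size is the
    total uncompressed size of the files it contains (one level deep only).
    """
    scanned = len(zip_bytes)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        expanded = sum(info.file_size for info in zf.infolist())
    return scanned, expanded


def build_sample_archive() -> bytes:
    """Build a small, hypothetical archive of repetitive text files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for i in range(5):
            zf.writestr(f"memo_{i}.txt", "Please see the attached report. " * 500)
    return buf.getvalue()


archive = build_sample_archive()
scanned, expanded = scanned_vs_expanded(archive)
ratio = expanded / scanned
print(f"scanned: {scanned} bytes, expanded: {expanded} bytes, ratio: {ratio:.1f}x")
```

Because the sample text is highly repetitive, the ratio here will be far larger than the 1.5 to 2 times that is typical of real collections.  The point is simply that the expanded size can be estimated from the archive's own metadata before a single file is extracted.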

It’s important to be prepared for that and know your options when processing that data.  Make sure your vendor selection criteria include questions about whether processing is billed on the scanned or the expanded size.  Some vendors (like the company I work for, CloudNine Discovery) do bill based on the scanned size of the collection for processing, so shop around to make sure you’re getting the best deal from your vendor.

So, what do you think?  Have you ever been surprised by processing costs of your ESI?  Please share any comments you might have, or let us know if you’d like to know more about a particular topic.

A Technical Explanation of Near-Dupes – eDiscovery Tutorial

Bill Dimm provides a comprehensive and interesting description of near-dupes and the algorithms used to identify them in his Clustify blog (What is a near-dupe, really?).  If you want to understand the “three reasonable, but different, ways of defining the near-dupe similarity between two documents”, bring your brain and check it out.

As we discussed last month, just because information volume in most organizations doubles every 18-24 months doesn’t mean that it’s all original.  When reviewers are reviewing the same data again and again, it’s unnecessarily expensive and prone to mistakes.

As Bill notes in his post, “Near-duplicates are documents that are nearly, but not exactly, the same.  They could be different revisions of a memo where a few typos were fixed or a few sentences were added.  They could be an original email and a reply that quotes the original and adds a few sentences.  They could be a Microsoft Word document and a printout of the same document that was scanned and OCRed with a few words not matching due to OCR errors.”  I also classify examples such as a Word document published to an Adobe PDF file (where the content is the same, but the file format is different, so the hash value will be different) as near-duplicates because they won’t be de-duped with an MD5 or SHA-1 hash algorithm at the file level.  You need an algorithm that looks for similarity in the document content.
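A short Python sketch shows why file-level hashing cannot catch near-duplicates.  It uses the standard `hashlib` module; the two sample messages and the `md5_hex` helper are made up for illustration.  Changing a single word produces a completely different MD5 value, so only exact copies match.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hash of the given bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical documents: the revision differs from the original by one word.
original = b"Quarterly results are attached. Please review before Friday."
revised = b"Quarterly results are attached. Please review before Monday."
exact_copy = bytes(original)

print(md5_hex(original) == md5_hex(exact_copy))  # True: exact duplicates match
print(md5_hex(original) == md5_hex(revised))     # False: a one-word edit breaks the match
```

The same thing happens when a Word document is published to PDF: the bytes in the file differ even though the text a reviewer sees is the same.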

Identifying near-duplicates that contain almost the same information reduces redundant review and saves costs.  A recent client of mine had over 800,000 emails belonging to near-duplicate groupings that would have been impossible to identify without an effective algorithm to group them together.

Bill’s blog post goes on to discuss different methods for measuring similarity using mechanisms like a Jaccard index and a MinHash algorithm, which counts shingles (don’t worry, they’re neither painful nor scaly).  Understanding how your near-dupe software works is important.  As Bill notes, “If misunderstandings about how the algorithm works cause the similarity values generated by the software to be higher than you expected when you chose the similarity threshold, you risk tagging near-dupes of non-responsive documents incorrectly (grouped documents are not as similar as you expected).  If the similarity values are lower than you expected when you chose the threshold, you risk failing to group some highly similar documents together, which leads to less efficient review (extra groups to review).”  His post is an excellent primer for developing that understanding.
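For readers who want to see the idea in miniature, here is a Python sketch of one common way to compute a Jaccard index over word shingles.  It is an illustration only, not the method from Bill’s post or from any particular near-dupe product; the shingle length of three words, the function names, and the sample sentences are assumptions chosen for the example.  MinHash is typically used to approximate this same value when comparing every pair of documents exactly would be too slow.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles (overlapping word sequences) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical documents that differ by a single word.
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"

similarity = jaccard(shingles(doc_a), shingles(doc_b))
print(f"{similarity:.2f}")  # 0.40
```

Notice that changing one word out of nine drops the similarity to 0.40, because that word appears in three of the seven shingles in each document.  That is exactly the kind of behavior you need to understand before choosing a similarity threshold.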

So, what do you think?  Do you have a plan for handling near-duplicates in your collection?  Please share any comments you might have, or let us know if you’d like to know more about a particular topic.
