eDiscovery Trends: Predictive Coding Strategy and Survey Results

Yesterday, we introduced the Virtual LegalTech online educational session Frontiers of E-Discovery: What Lawyers Need to Know About “Predictive Coding” and defined predictive coding while also noting the two “learning” methods that most predictive coding mechanisms use to predict document classifications.  To get background information regarding the session, including information about the speakers (Jason Baron, Maura Grossman and Bennett Borden), click here.

The session also focused on strategies for using predictive coding and results of the TREC 2010 Legal Track Learning Task on the effectiveness of “Predictive Coding” technologies.  Strategies discussed by Bennett Borden include:

  • Understand the technology used by a particular provider:  Not only will supervised and active learning mechanisms often yield different results, but there are differing technologies within each of these learning mechanisms.
  • Understand the state of the law regarding predictive coding technology: So far, there is no case law available regarding use of this technology and, while it may eventually be the future of document review, that has yet to be established.
  • Obtain buy-in by the requesting party to use predictive coding technology: It’s much easier when the requesting party has agreed to your proposed approach and that agreement is included in an order of the court which covers the approach and also includes a FRE 502 “clawback” agreement and order.  To have a chance to obtain that buy-in and agreement, you’ll need a diligent approach that includes “tiering” of the collection by probable responsiveness and appropriate sampling of each tier level.

Maura Grossman then described the TREC 2010 Legal Track Learning Task on the effectiveness of “Predictive Coding” technologies.  The team took the EDRM Enron Version 2 Dataset of 1.3 million public domain files, deduped it down to 685,000+ unique files and 5.5 GB of uncompressed data.  The team also identified eight different hypothetical eDiscovery requests for the test.

Participating predictive coding technologies were then given a “seed set” of roughly 1,000 documents that had previously been identified by TREC as responsive or non-responsive to each of the requests. Using this information, participants were required to rank the documents in the larger collection from most likely to least likely to be responsive, and estimate the likelihood of responsiveness as a probability for each document.  The study ranked the participants on the recall achieved with 30% of the collection retrieved (roughly 200,000 files) and also compared that actual recall to the recall each participant predicted, to determine a prediction accuracy.
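To make the recall measurement concrete, here is a minimal sketch of recall at a fixed cutoff. It is not the TREC scoring code; the function name, document IDs and toy collection are all made up for illustration, and the only thing taken from the study is the idea of measuring recall within the top 30% of a ranked collection.

```python
def recall_at_cutoff(ranked_ids, responsive_ids, cutoff_percent=30):
    """Share of all responsive documents found in the top slice of a ranking."""
    responsive = set(responsive_ids)
    if not responsive:
        raise ValueError("need at least one responsive document")
    # Integer arithmetic avoids floating-point surprises at the cutoff boundary.
    cutoff = len(ranked_ids) * cutoff_percent // 100
    retrieved = set(ranked_ids[:cutoff])
    return len(retrieved & responsive) / len(responsive)


# Hypothetical 10-document collection, ranked from most to least likely responsive.
ranking = ["d7", "d2", "d9", "d1", "d4", "d3", "d8", "d5", "d6", "d0"]
truly_responsive = {"d7", "d9", "d3", "d0"}

actual_recall = recall_at_cutoff(ranking, truly_responsive)
print(f"Recall at 30% of the collection: {actual_recall:.0%}")
```

In this toy example the top three documents contain two of the four responsive ones, so recall at the cutoff is 50%.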

The results?  Actual recall rates across all eight discovery requests ranged widely among the tools, from 85.1% down to 38.2% (on individual requests, the range was even wider – as much as an 82% difference between the high and the low).  The prediction accuracy rates for the tools also ranged somewhat widely, from a high of 95% to a low of 42%.

Based on this study, it is clear that these technologies can differ significantly on how effective and efficient they are at correctly ranking and categorizing remaining documents in the collection based on the exemplar “seed set” of documents.  So, it’s always important to conduct sampling of both machine coded and human coded documents for quality control in any project, with or without predictive coding (we sometimes forget that human coded documents can just as often be incorrectly coded!).

For more about the TREC 2010 Legal Track study, click here.  As noted yesterday, you can also check out a replay of the session or download the slides for the presentation at the Virtual LegalTech site.

Full Disclosure: Trial Solutions provides predictive coding services using Hot Neuron LLC’s Clustify™, which categorizes documents by looking for similar documents in the exemplar set that satisfy user-specified criteria, such as a minimum conceptual similarity or near-duplicate percentage.

So, what do you think?  Have you used predictive coding on a case?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Trends: What the Heck is “Predictive Coding”?


Yesterday, ALM hosted another Virtual LegalTech “live” day online.  Every quarter, the Virtual LegalTech site has a “live” day with educational sessions from 9 AM to 5 PM ET, most of which provide CLE credit in certain states (New York, California, Florida, and Illinois).

One of yesterday’s sessions was Frontiers of E-Discovery: What Lawyers Need to Know About “Predictive Coding”.  The speakers for this session were:

Jason Baron: Director of Litigation for the National Archives and Records Administration, a founding co-coordinator of the National Institute of Standards and Technology’s Text Retrieval Conference (“TREC”) legal track and co-chair and editor-in-chief for various working groups for The Sedona Conference®;

Maura Grossman: Counsel at Wachtell, Lipton, Rosen & Katz, co-chair of the eDiscovery Working Group advising the New York State Unified Court System and coordinator of the 2010 TREC legal track; and

Bennett Borden: co-chair of the e-Discovery and Information Governance Section at Williams Mullen and member of Working Group I of The Sedona Conference on Electronic Document Retention and Production, as well as the Cloud Computing Drafting Group.

This highly qualified panel discussed a number of topics related to predictive coding, including practical applications of predictive coding technologies and results of the TREC 2010 Legal Track Learning Task on the effectiveness of “Predictive Coding” technologies.

Before discussing the strategies for using predictive coding technologies and the results of the TREC study, it’s important to understand what predictive coding is.  The panel gave the best descriptive definition that I’ve seen yet for predictive coding, as follows:

“The use of machine learning technologies to categorize an entire collection of documents as responsive or non-responsive, based on human review of only a subset of the document collection. These technologies typically rank the documents from most to least likely to be responsive to a specific information request. This ranking can then be used to “cut” or partition the documents into one or more categories, such as potentially responsive or not, in need of further review or not, etc.”

The panel used an analogy for predictive coding by relating it to spam filters that review and classify email and learn based on previous classifications which emails can be considered “spam”.  Just as no spam filter perfectly classifies all emails as spam or legitimate, predictive coding does not perfectly identify all relevant documents.  However, these technologies can “learn” to identify most of the relevant documents based on one of two “learning” methods:

  • Supervised Learning: a human chooses a set of “exemplar” documents that feed the system and enable it to rank the remaining documents in the collection based on their similarity to the exemplars (e.g., “more like this”);
  • Active Learning: the system chooses the exemplars on which human reviewers make relevancy determinations, then the system learns from those classifications to apply to the remaining documents in the collection.
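As a rough illustration of the supervised “more like this” idea, the sketch below ranks a tiny collection by word-overlap (cosine) similarity to a human-chosen exemplar. This is only an assumption about how such ranking could work in its simplest form; commercial predictive coding tools use their own, more sophisticated technologies, and every name and sample document here is hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a document into a simple bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_more_like_this(exemplars, collection):
    """Rank documents by their best similarity to any responsive exemplar."""
    exemplar_vecs = [vectorize(text) for text in exemplars]
    scored = []
    for doc_id, text in collection.items():
        vec = vectorize(text)
        scored.append((doc_id, max(cosine(vec, ex) for ex in exemplar_vecs)))
    # Most similar (most likely responsive) documents come first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical exemplar chosen by a human reviewer as responsive.
exemplars = ["pipeline contract pricing dispute"]
collection = {
    "memo": "notes on the pipeline contract and pricing",
    "lunch": "lunch order for friday",
    "email": "pricing dispute escalated to legal",
}

ranked = rank_more_like_this(exemplars, collection)
for doc_id, score in ranked:
    print(f"{doc_id}: {score:.2f}")
```

An active learning system would differ mainly in who picks the exemplars: the system would select the documents it wants a human to review, then re-rank after each round of decisions.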

Tomorrow, I “predict” we will get into the strategies and the results of the TREC study.  You can check out a replay of the session at the Virtual LegalTech site. You’ll need to register – it’s free – then log in and go to the CLE Center Auditorium upon entering the site (which is up all year, not just on “live” days).  Scroll down until you see this session and then click on “Attend Now” to view the replay presentation.  You can also go to the Resource Center at the site and download the slides for the presentation.

So, what do you think?  Do you have experience with predictive coding?  Please share any comments you might have or if you’d like to know more about a particular topic.

Thought Leader Q&A: Brad Jenkins of Trial Solutions


Tell me about your company and the products you represent. Trial Solutions is an electronic discovery software and services company in Houston, Texas that assists corporations and law firms in the collection, processing and review of electronic data. Trial Solutions developed OnDemand™, formerly known as ImageDepot™, an online e-discovery review application which is currently used by over fifty of the top 250 law firms including seven of the top ten.  Trial Solutions also offers FirstPass™, an early case assessment and first-pass review application.  Both applications are offered as a software-as-a-service (SaaS), where Trial Solutions licenses the applications to customers for use and provides access via the Internet. Trial Solutions provides litigation support services in over 90 metropolitan areas throughout the United States and Canada.

What do you see as emerging trends for eDiscovery SaaS solutions?  I believe that one emerging trend that you’ll see is simplified pricing.  Pricing for many eDiscovery SaaS solutions is too complex and difficult for clients to understand.  Many providers base pricing on a combination of collection size and number of users (among other factors), which is confusing and penalizes organizations for adding users into a case.  I believe that organizations will expect simpler pricing models from providers with the ability to add an unlimited number of users to each case.

Another trend I expect to see is provision of more self-service capabilities giving legal teams greater control over managing their own databases and cases.  Organizations need the ability to administer their own databases, add users and maintain their rights without having to rely on the hosting provider to provide these services.  A major self-service capability is the ability to load your own data on your schedule without having to pay load fees to the hosting provider.

Why do you think that more eDiscovery SaaS solutions don’t provide a free self loading capability?  I don’t know.  Many SaaS solutions outside of eDiscovery enable you to upload your own data to use and share via the Web.  Facebook and YouTube enable you to upload and share pictures and videos, Google Docs is designed for sharing and maintaining business documents, and even allows you to upload contacts via a comma-separated values (CSV) file.  So, loading your own data is not a new concept for SaaS solutions.  OnDemand™ is about to roll out a new SelfLoader™ module to enable clients to load their own data, for free.  With SelfLoader, clients can load their own images, OCR text files, native files and metadata to an existing OnDemand database using an industry-standard load file (IPRO’s .lfp or Concordance’s .opt) format.

Are there any other trends that you see in the industry?  One clear trend is the rising popularity in first pass review/early case assessment (or, early data assessment, as some prefer) solutions like FirstPass as corporate data proliferates at an amazing pace.  According to International Data Corporation (IDC), the amount of digital information created, captured and replicated in the world as of 2006 was 161 exabytes or 161 billion gigabytes and that is expected to rise more than six-fold by 2010 (to 988 exabytes)!  That’s enough data for a stack of books from the sun to Pluto and back again!  With more data than ever to review, attorneys will have to turn to applications to enable them to quickly cull the data to a manageable level for review – it will simply be impossible to review the entire collection in a cost-efficient and timely manner.  It will also be important for there to be a seamless transition from first pass review for culling collections to attorney linear review for final determination of relevancy and privilege and Trial Solutions provides a fully integrated approach with FirstPass and OnDemand.

About Brad Jenkins
Brad Jenkins, President and CEO of Trial Solutions, has over 20 years of experience leading customer focused companies in the litigation support arena. Brad has authored many articles on litigation support issues, and has spoken before national audiences on document management practices and solutions.

Thought Leader Q&A: Chris Jurkiewicz of Venio Systems


Tell me about your company and the products you represent.  Venio Systems is an Electronic Discovery software solution provider specializing in early case assessment and first pass review.  Our product, Venio FPR™, allows forensic units, attorneys and litigation support teams to process, analyze, search, report, interact with and export responsive data for linear review or production.

What do you consider to be the reason for the enormous growth of early case assessment/first pass review tools in the industry?  I believe much of the growth we’ve seen in the past few years can be attributed to many factors, of which the primary one is the exponential growth of data within an organization.  The inexpensive cost of data storage available to an organization is making it easier for them to keep unnecessary data on their systems.  Companies who practice litigation and/or work with litigative data are seeking out quick and cost effective methods of funneling the necessary data from all the unnecessary data stored in these vast systems thereby making early case assessment/first pass review tools not only appealing but necessary.

Are there other areas where first pass review tools can be useful during eDiscovery?  Clients have found creative ways to use first pass review/ECA technology; recently, a client utilized it to analyze a production received from opposing counsel. They were able to determine that the email information produced was not complete.  They were then able to force the opposing counsel to fill in the missing email gaps.

There have been several key cases related to search defensibility in the past couple of years.  How will those decisions affect organizations’ approach to ESI searching?  More organizations will have to adopt a defensible process for searching and use tools that support that process.  Venio’s software has many key features focused on search defensibility including: Search List Analysis, Wild Card Variation searching, Search Audit Reporting and Fuzzy Searching.  All searches run in Venio FPR™ are audited by user, date and time, terms, scope, and frequency.  By using these tools, clients have been able to find additional responsive files that would be otherwise missed and easily document their search approach and refinement.

How do you think the explosion of data and technology will affect the review process in the future?  I believe that technology will continue to evolve and provide innovative tools to allow for more efficient reviews of ESI.  In the past few years the industry has already seen several new technologies released such as near deduping, concept searching and clustering which have significantly improved the speed of the review.  Legal teams will have to continue to make greater utilization of these technologies to provide efficient and cost-effective review as their clients will demand it.

About Chris Jurkiewicz
Chris graduated in 2000 with a Bachelor of Science in Computer Information Systems at Marymount University in Arlington, Virginia.  He began working for On-Site Sourcing while still an intern at Marymount and became the youngest Director on On-Site’s management team within three years as the Director of their Electronic Data Discovery Division.  In 2009, Chris co-founded Venio Systems to fill a void in Early Case Assessment (ECA) technology with Venio FPR™ to provide law firms, corporations and government entities the ability to gain a comprehensive picture of their data set at the front-end, thereby saving precious time and money on the back-end.  Chris is an industry recognized expert in the field of eDiscovery, having spoken on several eDiscovery panels and served as an eDiscovery expert witness.

Reporting from the EDRM Mid-Year Meeting


Launched in May 2005, the Electronic Discovery Reference Model (EDRM) Project was created to address the lack of standards and guidelines in the electronic discovery market.  Now, in its sixth year of operation, EDRM has become the gold standard for…well…standards in eDiscovery.  Most references to the eDiscovery industry these days refer to the EDRM model as a representation of the eDiscovery life cycle.

At the first meeting in May 2005, there were 35 attendees, according to Tom Gelbmann of Gelbmann & Associates, co-founder of EDRM along with George Socha of Socha Consulting LLC.  Check out the preliminary first draft of the EDRM diagram – it has evolved a bit!  Most participants were eDiscovery providers and, according to Gelbmann, they asked “Do you really expect us all to work together?”  The answer was “yes”, and the question hasn’t been asked again.  Today, there are over 300 members from 81 participating organizations including eDiscovery providers, law firms and corporations (as well as some individual participants).

This week, the EDRM Mid-Year meeting is taking place in St. Paul, MN.  Twice a year, in May and October, eDiscovery professionals who are EDRM members meet to continue the process of working together on various standards projects.  EDRM has eight currently active projects, as follows:

  • Data Set: provides industry-standard, reference data sets of electronically stored information (ESI) and software files that can be used to test various aspects of eDiscovery software and services,
  • Evergreen: ensures that EDRM remains current, practical and relevant and educates about how to make effective use of the Model,
  • Information Management Reference Model (IMRM): provides a common, practical, flexible framework to help organizations develop and implement effective and actionable information management programs,
  • Jobs: develops a framework for evaluating pre-discovery and discovery personnel needs or issues,
  • Metrics: provides an effective means of measuring the time, money and volumes associated with eDiscovery activities,
  • Model Code of Conduct: evaluates and defines acceptable boundaries of ethical business practices within the eDiscovery service industry,
  • Search: provides a framework for defining and managing various aspects of Search as applied to eDiscovery workflow,
  • XML: provides a standard format for e-discovery data exchange between parties and systems, reducing the time and risk involved with data exchange.

This is my fourth year participating in the EDRM Metrics project and it has been exciting to see several accomplishments made by the group, including creation of a code schema for measuring activities across the EDRM phases, glossary definitions of those codes and tools to track early data assessment, collection and review activities.  Today, we made significant progress in developing survey questions designed to gather and provide typical metrics experienced by eDiscovery legal teams in today’s environment.

So, what do you think?  Has EDRM impacted how you manage eDiscovery?  If so, how?  Please share any comments you might have or if you’d like to know more about a particular topic.

Announcing eDiscovery Thought Leader Q&A Series!


eDiscovery Daily is excited to announce a new blog series of Q&A interviews with various eDiscovery thought leaders.  Over the next three weeks, we will publish interviews conducted with six individuals with unique and informative perspectives on various eDiscovery topics.  Mark your calendars for these industry experts!

Christine Musil is Director of Marketing for Informative Graphics Corporation, a viewing, annotation and content management software company based in Arizona.  Christine will be discussing issues associated with native redaction and redaction of Adobe PDF files.  Her interview will be published this Thursday, October 14.

Jim McGann is Vice President of Information Discovery for Index Engines. Jim has extensive experience with eDiscovery and Information Management.  Jim will be discussing issues associated with tape backup and retrieval.  His interview will be published this Friday, October 15.

Alon Israely is a Senior Advisor in BIA’s Advisory Services group and currently oversees BIA’s product development for its core technology products.  Alon will be discussing best practices associated with “left side of the EDRM model” processes such as preservation and collection.  His interview will be published next Thursday, October 21.

Chris Jurkiewicz is Co-Founder of Venio Systems, which provides Venio FPR™ allowing legal teams to analyze data, provide an early case assessment and a first pass review of any size data set.  Chris will be discussing current trends associated with early case assessment and first pass review tools.  His interview will be published next Friday, October 22.

Kirke Snyder is Owner of Legal Information Consultants, a consulting firm specializing in eDiscovery Process Audits to help organizations lower the risk and cost of e-discovery.  Kirke will be discussing best practices associated with records and information management.  His interview will be published on Monday, October 25.

Brad Jenkins is President and CEO for Trial Solutions, which is an electronic discovery software and services company that assists litigators in the collection, processing and review of electronic information.  Brad will be discussing trends associated with SaaS eDiscovery solutions.  His interview will be published on Tuesday, October 26.

We thank all of our guests for participating!

So, what do you think?  Is there someone you would like to see interviewed for the blog?  Are you an industry expert with some information to share from your “soapbox”?  If so, please share any comments or contact me at  We’re looking to assemble our next group of interviews now!

eDiscovery Case Study: Term List Searching for Deadline Emergencies!


A few weeks ago, I was preparing to conduct a Friday morning training session for a client to show them how to use FirstPass™, powered by Venio FPR™, to conduct a first pass review of their data when I received a call from the client.  “We thought we were going to have a month to review this data, but because of a judge’s ruling in the case, we now have to start depo prep for two key custodians on Monday for depositions now scheduled next week”, said Megan Moore, attorney with Steele Sturm, PLLC, in Houston.  “We have to complete our review of their files this weekend.”

So, what do you do when you have to conduct both a first pass and final review of the data in a weekend?

It was determined that Steele Sturm had to complete first pass review that Friday, so that we could prepare the potentially responsive files for an attorney review starting Saturday morning.  Steele Sturm identified a list of responsive search terms and Trial Solutions worked with the attorneys to include variations of the terms (such as proximity searches and synonyms) to finalize a list of terms to apply to the data to identify potentially responsive files.  Because FirstPass provides the ability to import and search an entire term list at once, we were able to identify potentially responsive files in a simple, two-step process.  “Using FirstPass, Trial Solutions helped us cull out 75% of the collection as non-responsive, enabling our review team to focus review on the remaining 25%”, said Moore.

Once the potentially responsive files were identified, they were imported into OnDemand™, powered by ImageDepot™, for linear attorney review.  During review, the attorneys identified that some of the terms used in identifying potentially responsive files were overbroad, so additional searches were performed in OnDemand to “group tag” those files as non-responsive.  “Trial Solutions provided training and support throughout the weekend to enable our review team to quickly "tag" each file using OnDemand as to responsiveness and privilege to enable us to meet our deadline”, said Moore.

So, what do you think?  Do you have any “emergency” war stories to share?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: Cost of Data Storage is Declining – Or Is It?

Recently, I was gathering information on the cost of data storage and ran across this ad from the early 1980s for a 10 MB disk drive – for $3,398! That’s MB (megabytes), not GB (gigabytes) or TB (terabytes). What a deal!

Even in 2000, storage costs were around $20 per GB, so an 8 GB drive would cost about $160.

Today, 1 TB is available for $100 or less. HP has a 2 TB external drive available at Best Buy for $140 (prices subject to change of course). That’s 7 cents per GB. Network storage drives are more expensive, but still available for around $100 per TB.

At these prices, it’s natural for online, accessible data in corporations to rise exponentially. It’s great to have more and more data readily available to you, until you are hit with litigation or regulatory requests. Then, you potentially have to go through all that data for discovery to determine what to preserve, collect, process, analyze, review and produce.

Here is what each additional GB can cost to review (based on typical industry averages):

  • 1 GB = 20,000 documents (can vary widely, depending on file formats)
  • Review attorneys typically average 60 documents reviewed per hour (for simple relevancy determinations)
  • That equals an average of 333 review hours per GB (20,000 / 60)
  • If you’re using contract reviewers at $50 per hour – each extra GB just cost you $16,650 to review (333×50)
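If you want to plug in your own numbers, the arithmetic above is easy to script. The sketch below uses the same industry-average assumptions from the list (20,000 documents per GB, 60 documents per hour, $50 per hour); the function name is made up for illustration. Note that the list rounds to 333 hours before multiplying, which gives $16,650, while carrying the unrounded hours gives roughly $16,667.

```python
def review_cost_per_gb(docs_per_gb=20_000, docs_per_hour=60, rate_per_hour=50):
    """Return (review hours, review cost in dollars) for one GB of data."""
    hours = docs_per_gb / docs_per_hour
    return hours, hours * rate_per_hour


hours, cost = review_cost_per_gb()
print(f"Review hours per GB: {hours:,.0f}")
print(f"Review cost per GB (unrounded hours): ${cost:,.0f}")
print(f"Review cost per GB (hours rounded to 333): ${333 * 50:,}")
```

Changing any one input (for example, a faster review rate after culling) shows immediately how sensitive the per-GB cost is to that assumption.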

That’s expensive storage! And, that doesn’t even take into consideration the costs to identify, preserve, collect, and process each additional GB.

Managing Storage Costs Effectively

One way to manage those costs is to limit the data retained in the first place through an effective records management program that calls for regular destruction of data not subject to a litigation hold. If you’re eliminating expired data on a regular basis, there is less data to go through the EDRM discovery “funnel” to production.

Sophisticated collection tools or first pass review tools (like FirstPass™, powered by Venio FPR™) can also reduce those costs by culling data before attorney review, which is the most expensive component of eDiscovery.

So, what do you think? Do you track GB metrics for your eDiscovery cases? Please share any comments you might have or if you’d like to know more about a particular topic.

First Pass Review: Domain Categorization of Your Opponent’s Data

Yesterday, we talked about the use of First Pass Review (FPR) applications (such as FirstPass™, powered by Venio FPR™) to not only conduct first pass review of your own collection, but also to analyze your opponent’s ESI production. One way to analyze that data is through “fuzzy” searching to find misspellings or OCR errors in an opponent’s produced ESI.

Domain Categorization

Another type of analysis is the use of domain categorization. Email is generally the biggest component of most ESI collections and each participant in an email communication belongs to a domain associated with the email server that manages their email.

FirstPass supports domain categorization by providing a list of domains associated with the ESI collection being reviewed, with a count for each domain that appears in emails in the collection. Domain categorization provides several benefits when reviewing your opponent’s ESI:

  • Non-Responsive Produced ESI: Domains in the list that are obviously non-responsive to the case can be quickly identified and all messages associated with those domains can be “group-tagged” as non-responsive. If a significant percentage of files are identified as non-responsive, that may be a sign that your opponent is trying to “bury you with paper” (albeit electronic).
  • Inadvertent Disclosures: If there are any emails associated with outside counsel’s domain, they could be inadvertent disclosures of attorney work product or attorney-client privileged communications. If so, you can then address those according to the agreed-upon process for handling inadvertent disclosures and clawback of same.
  • Issue Identification: Messages associated with certain parties might be related to specific issues (e.g., an alleged design flaw of a specific subcontractor’s product), so domain categorization can isolate those messages more quickly.
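For readers who want to see the basic idea, here is a minimal sketch of building a domain list with counts from a set of emails. It is not how FirstPass works internally; the message structure, function name and `.example` domains are all hypothetical, and it simply counts each domain once per message in which it appears.

```python
from collections import Counter


def domain_counts(messages):
    """Count how many messages each sender/recipient domain appears in."""
    counts = Counter()
    for msg in messages:
        addresses = [msg["from"], *msg["to"]]
        # A set ensures a domain is counted once per message, not once per address.
        domains = {
            addr.rsplit("@", 1)[1].lower() for addr in addresses if "@" in addr
        }
        counts.update(domains)
    return counts


# Hypothetical produced emails.
messages = [
    {"from": "ann@acme.example", "to": ["bob@acme.example", "deals@shopping.example"]},
    {"from": "counsel@lawfirm.example", "to": ["ann@acme.example"]},
    {"from": "news@shopping.example", "to": ["bob@ACME.example"]},
]

counts = domain_counts(messages)
for domain, count in counts.most_common():
    print(f"{domain}: {count}")
```

From a list like this, a reviewer could spot an obviously non-responsive domain (the shopping newsletter) or a law firm domain that may signal an inadvertent disclosure, and then group-tag the related messages.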

In summary, there are several ways to use first pass review tools, like FirstPass, for reviewing your opponent’s ESI production, including: email analytics, synonym searching, fuzzy searching and domain categorization. First pass review isn’t just for your own production; it’s also an effective process to quickly evaluate your opponent’s production.

So, what do you think? Have you used first pass review tools to assess an opponent’s produced ESI? Please share any comments you might have or if you’d like to know more about a particular topic.

First Pass Review: Fuzzy Searching Your Opponent’s Data

Yesterday, we talked about the use of First Pass Review (FPR) applications (such as FirstPass™, powered by Venio FPR™) to not only conduct first pass review of your own collection, but also to analyze your opponent’s ESI production. One way to analyze that data is through synonym searching to find variations of your search terms to increase the possibility of finding the terminology used by your opponents.

Fuzzy Searching

Another type of analysis is the use of fuzzy searching. Attorneys know what terms they’re looking for, but those terms may not always be spelled correctly. Also, opposing counsel may produce a number of image-only files that require Optical Character Recognition (OCR), which is usually not 100% accurate.

FirstPass supports “fuzzy” searching, which is a mechanism for finding alternate words that are close in spelling to the word you’re looking for (usually one or two characters off). FirstPass will display all of the words – in the collection – close to the word you’re looking for, so if you’re looking for the term “petroleum”, you can find variations such as “peroleum”, “petoleum” or even “petroleom” – misspellings or OCR errors that could be relevant. Then, simply select the variations you wish to include in the search. Fuzzy searching is an effective way to broaden your search to include potential misspellings and OCR errors, and FirstPass provides a terrific capability to select those variations to review additional potential “hits” in your collection.
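To show what “one or two characters off” can mean in practice, here is a minimal sketch that uses edit distance (the number of single-character insertions, deletions or substitutions between two words) to pull close variants out of a word list. This is an assumption about one common way to do fuzzy matching, not a description of FirstPass internals; the function names and the sample vocabulary are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_variants(term, vocabulary, max_distance=2):
    """Words in the collection within max_distance edits of the search term."""
    term = term.lower()
    words = {word.lower() for word in vocabulary}
    return sorted(
        w for w in words if w != term and edit_distance(term, w) <= max_distance
    )


# Hypothetical list of distinct words found in the collection.
vocabulary = ["petroleum", "peroleum", "petoleum", "petroleom", "petrol", "pipeline"]

variants = fuzzy_variants("petroleum", vocabulary)
print(variants)
```

Here the three misspellings are each one edit away from “petroleum” and are returned as candidates, while “petrol” (three edits away) and “pipeline” are left out. The reviewer would still choose which candidates to include in the search.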

Tomorrow, I’ll talk about the use of domain categorization to quickly identify potential inadvertent disclosures and weed out non-responsive files produced by your opponent, based on the domain of the communicators. Hasta la vista, baby!  🙂

In the meantime, what do you think? Have you used fuzzy searching to find misspellings or OCR errors in an opponent’s produced ESI? Please share any comments you might have or if you’d like to know more about a particular topic.