Review

If You Play “Tag” Too Often, You Might Find Yourself Playing “Hide and Seek”: eDiscovery Best Practices

If you’ve used any review tool, you’re familiar with the “tag” field used to classify documents.  Whether classifying documents as responsive, non-responsive, privileged, or applicable to any of a number of issues, you’ve probably used a tag field to check a document and indicate that the associated characteristic of that document is “true”.  But, if you fall in love with tag fields too much, your database can become unmanageable and you may find yourself playing “hide and seek” to find the desired tag.

So, what is a “tag” field?

In databases such as SQL Server (which many review platforms use for managing the data associated with the ESI being reviewed), a “tag” field is typically a “bit” field: a yes/no boolean (true/false) field.  As a “bit” field, its valid values are 0 (false) and 1 (true).  In the review platform, the tag field is typically represented by a check box that can simply be clicked to set it to true (or clicked again to set it back to false).  Easy, right?
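To make that concrete, here is a minimal sketch (in Python, purely for illustration; it is not how CloudNine or any other review platform is actually implemented) of a tag field as a boolean value that a check box simply toggles:

```python
# Illustrative only: a document with a couple of boolean "tag" fields.
# In the underlying database these would be bit fields (0 = false, 1 = true).
class ReviewDocument:
    def __init__(self, doc_id):
        self.doc_id = doc_id
        self.tags = {"Responsive": False, "Privileged": False}

    def toggle_tag(self, tag_name):
        # Clicking the check box flips the value between true and false
        self.tags[tag_name] = not self.tags[tag_name]

doc = ReviewDocument("DOC00001")
doc.toggle_tag("Responsive")   # checked: True (1)
doc.toggle_tag("Responsive")   # clicked again: back to False (0)
print(doc.tags)
```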

One of the most popular features of CloudNine’s review platform (shameless plug warning!) is the ability for users to create their own fields – as many as they want.  This can be useful for classifying documents in a variety of ways – in many cases, using the aforementioned “tag” field.  Users can create their fields and organize them in the order they want to make review more efficient.  Easy, right?

Sometimes, too much of a good thing can be a bad thing.

I have worked with some clients who have used tag fields to classify virtually everything they track within their collection – in some cases, to the extent that their field lists grew to over 200 data fields!  Try finding the data field you need quickly when you have that many.  Not easy, right?  Here are a couple of examples where use of the tag field was probably not the best choice:

  • Document Types: I have seen instances where clients have created a tag field for each type of document. So, instead of creating one text-based “DocType” field and populating it with the description of the type of document (e.g., Bank Statements, Correspondence, Reports, Tax Documents, etc.), the client created a tag field for each separate document type.  For clients who have identified 15-20 distinct document types (or more), it can become quite difficult to find the right tag to classify the type of document.
  • Account Numbers: Once again, instead of creating one text-based field for tracking key account numbers mentioned in a document, I have seen clients create a separate tag field for each key account number, which can drive the data field count up quite a bit.

Up-front planning is one key to avoiding “playing tag” too often.  Identify the classifications that you intend to track and look for common themes among larger groups of classifications (e.g., document types, organizations mentioned, account numbers, etc.).  Develop an approach for standardizing descriptions for those within text-based fields (which can then be effectively searched using “equal to” or “contains” searches, depending on what you’re trying to accomplish) and you can keep your data field count at a manageable level.  That will keep your game of “tag” from turning into “hide and seek”.
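To illustrate the difference, here is a simple Python sketch of that approach (the field names and values are hypothetical, and this is not any particular platform’s search syntax): one standardized, text-based “DocType” field and one text-based account number field, searched with “equal to” and “contains” style searches:

```python
# Hypothetical documents with standardized text-based fields instead of
# dozens of separate tag (boolean) fields.
documents = [
    {"DocID": "DOC00001", "DocType": "Bank Statements", "AcctNumbers": "1234; 5678"},
    {"DocID": "DOC00002", "DocType": "Correspondence",  "AcctNumbers": ""},
    {"DocID": "DOC00003", "DocType": "Tax Documents",   "AcctNumbers": "1234"},
]

# "Equal to" search: exact match on the standardized description
bank_statements = [d for d in documents if d["DocType"] == "Bank Statements"]

# "Contains" search: useful for a multi-value text field like account numbers
acct_1234 = [d for d in documents if "1234" in d["AcctNumbers"]]

print([d["DocID"] for d in bank_statements])  # ['DOC00001']
print([d["DocID"] for d in acct_1234])        # ['DOC00001', 'DOC00003']
```

Two text fields do the work that dozens of individual tag fields would otherwise do, which is the point.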

So, what do you think?  Have you worked with databases that have so many data fields that it becomes difficult to find the right field?   Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Here are a Few Common Myths About Technology Assisted Review: eDiscovery Best Practices

A couple of years ago, after my annual LegalTech New York interviews with various eDiscovery thought leaders (a list of which can be found here, with links to each interview), I wrote a post about some of the perceived myths that exist regarding Technology Assisted Review (TAR) and what it means to the review process.  After a recent discussion with a client whose misperceptions regarding TAR were evident, it seemed appropriate to revisit this topic and debunk a few myths that others may believe as well.

  1. TAR is New Technology

Actually, with all due respect to each of the various vendors that have their own custom algorithms for TAR, the technology behind TAR as a whole is not new.  Ever heard of artificial intelligence?  TAR, in fact, applies artificial intelligence to the review process.  With all of the acronyms we use to describe TAR, here’s one more for consideration: “Artificial Intelligence for Review” or “AIR”.  It may not catch on, but I like it (much to my disappointment, it didn’t)…

Maybe attorneys would be more receptive to it if they understood it as artificial intelligence.  As Laura Zubulake pointed out in my interview with her, “For years, algorithms have been used in government, law enforcement, and Wall Street.  It is not a new concept.”  With that in mind, Ralph Losey predicts that “The future is artificial intelligence leveraging your human intelligence and teaching a computer what you know about a particular case and then letting the computer do what it does best – which is read at 1 million miles per hour and be totally consistent.”

  2. TAR is Just Technology

Treating TAR as just the algorithm that “reviews” the documents is shortsighted.  TAR is a process that includes the algorithm.  Without a sound approach for identifying appropriate example documents from the collection, educated and knowledgeable reviewers to appropriately code those documents, and testing and evaluation of the results to confirm success, the algorithm alone would simply be another case of “garbage in, garbage out”, doomed to fail.  In a post from last week, we referenced Tom O’Connor’s recent post where he quoted Maura Grossman, probably the most recognized TAR expert, who stated that “TAR is a process, not a product.”  True that.

  3. TAR and Keyword Searching are Mutually Exclusive

I’ve talked to some people who think that TAR and keyword searching are mutually exclusive, i.e., that you wouldn’t perform keyword searching on a case where you plan to use TAR.  Not necessarily.  Ralph Losey continues to advocate a “multimodal” approach, describing it as: “more than one kind of search – using TAR, but also using keyword search, concept search, similarity search, all kinds of other methods that we have developed over the years to help train the machine.  The main goal is to train the machine.”

  4. TAR Eliminates Manual Review

Many people (including the New York Times) think of TAR as the death of manual review, with all attorney reviewers being replaced by machines.  Actually, manual review is a part of the TAR process in several respects, including: 1) reviewers knowledgeable in the subject matter are necessary to create a training set of documents for the technology, 2) after the process is performed, both sets (the included and excluded documents) are sampled and the samples are reviewed to determine the effectiveness of the process, and 3) the resulting responsive set is generally reviewed to confirm responsiveness and also to determine whether the documents are privileged.  Without manual review to train the technology and verify the results, the process would fail.
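As a simple illustration of the second point, here is a rough Python sketch of how sampling both sets might be used to estimate how many responsive documents the process missed.  The counts are entirely hypothetical and this is not any vendor’s actual validation protocol:

```python
# Project a sample's responsiveness rate onto the whole population.
def project(population_size, sample_size, responsive_in_sample):
    return (responsive_in_sample / sample_size) * population_size

# Hypothetical post-TAR sets and hypothetical sample review results
included_size, excluded_size = 10_000, 40_000
found  = project(included_size, sample_size=400, responsive_in_sample=300)  # ~7,500
missed = project(excluded_size, sample_size=400, responsive_in_sample=8)    # ~800

estimated_recall = found / (found + missed)
print(f"Estimated responsive documents missed: {missed:,.0f}")
print(f"Estimated recall: {estimated_recall:.1%}")  # roughly 90% in this example
```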

  5. TAR Has to Be Perfect to Be Useful

Detractors of TAR note that it can miss plenty of responsive documents and is nowhere near 100% accurate.  In one recent case, the producing party estimated that as many as 31,000 relevant documents may have been missed by the TAR process.  However, they also estimated that a much more costly manual review would have missed as many as 62,000 relevant documents.

Craig Ball’s analogy about the two hikers who encounter the angry grizzly bear is appropriate – the one hiker doesn’t have to outrun the bear, just the other hiker.  Craig notes: “That is how I look at technology assisted review.  It does not have to be vastly superior to human review; it only has to outrun human review.  It just has to be as good or better while being faster and cheaper.”

So, what do you think?  Do you agree that these are myths?  Can you think of any others?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Keyword Searching Isn’t Dead, If It’s Done Correctly: eDiscovery Best Practices

In the latest post on the Advanced Discovery blog, Tom O’Connor (an industry thought leader who has been a thought leader interviewee on this blog several times) posed an interesting question: Is Keyword Searching Dead?

In his post, Tom recapped the discussion of a session with the same name at the recent Today’s General Counsel Institute in New York City, where he was a co-moderator along with Maura Grossman, a recognized Technology Assisted Review (TAR) expert who was recently appointed as Special Master in the Rio Tinto case.  Tom then went on to cover some of the arguments for and against keyword searching discussed by the panelists and participants in the session, while also noting that numerous polls and client surveys show that the majority of people are NOT using TAR today.  So, they must be using keyword searching, right?

Should they be?  Is there still room for keyword searching in today’s eDiscovery landscape, given the advances that have been made in recent years in TAR technology?

There is, if it’s done correctly.  Tom quotes Maura in the article as stating that “TAR is a process, not a product.”  The same could be said for keyword searching.  If the process within which the keyword searches are performed is flawed, you could either retrieve way more documents to be reviewed than necessary and drive up eDiscovery costs, or leave yourself open to challenges in the courtroom regarding your approach.  Many lawyers at corporations and law firms identify search terms to be used (and, in many cases, agree on those terms with opposing counsel) without any testing to confirm the validity of those terms.

Way back in the first few months of this blog (over four years ago), I advocated an approach to searching that I called “STARR”: Search, Test, Analyze, Revise (if necessary) and Repeat (also, if necessary).  With an effective platform (using advanced search capabilities such as “fuzzy”, wildcard, synonym and proximity searching), knowledge and experience of that platform, and knowledge of search best practices, you can start with a well-planned search that can be confirmed or adjusted using the “STARR” approach.

And, even when you’ve been searching databases for as long as I have (decades now), an effective process is key because you never know what you will find until you test the results.  My favorite example from recent years (walked through in this earlier post) comes from work I did for a petroleum (oil) company: looking for documents related to “oil rights”, a search of “oil AND rights” retrieved almost every published and copyrighted document in the company’s collection.  Why?  Because almost every published and copyrighted document contained the phrase “All Rights Reserved”.  Testing and an iterative process eventually enabled me to find the search that offered the best balance of recall and precision.
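Here is a toy Python sketch of that lesson (it is not a real review platform’s search engine, and the two “documents” are made up): a simple “oil AND rights” search hits the copyright boilerplate, while a proximity-style search does not:

```python
import re

docs = {
    "doc1": "The lease conveys all oil rights to the operator.",
    "doc2": "Annual Report of Example Oil Company. Copyright 2014. All Rights Reserved.",
}

def and_search(text):
    # Both words appear somewhere in the document
    return all(word in text.lower() for word in ("oil", "rights"))

def proximity_search(text, max_gap=3):
    # True only if "oil" and "rights" occur within max_gap words of each other
    words = re.findall(r"[a-z]+", text.lower())
    positions = {w: [i for i, x in enumerate(words) if x == w] for w in ("oil", "rights")}
    return any(abs(i - j) <= max_gap
               for i in positions["oil"] for j in positions["rights"])

for doc_id, text in docs.items():
    print(doc_id, "AND search:", and_search(text), "| proximity search:", proximity_search(text))
# doc2 hits the AND search only because of "All Rights Reserved";
# the proximity search keeps doc1 and drops doc2.
```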

Like TAR, keyword searching is a process, not a product.  And, you can quote me on that.  (-:

So, what do you think?  Is keyword searching dead?  And, please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Should Contract Review Attorneys Receive Overtime Pay?: eDiscovery Trends

Whether they should or not, maybe they can – if they’re found NOT to be practicing law, according to a ruling from the Second U.S. Circuit Court of Appeals.

According to a story in The Posse List (Contract attorney lawsuit against Skadden Arps can proceed, appeals court says; case could enable temporary lawyers hired for routine document review to earn extra wages), the Second U.S. Circuit Court of Appeals vacated the judgment of the district court and remanded the matter for further proceedings, ruling that a lawsuit demanding overtime pay from law firm Skadden, Arps and legal staffing agency Tower Legal Solutions can proceed.

The plaintiff, David Lola, on behalf of himself and all others similarly situated, filed the case as a Fair Labor Standards Act collective action against Skadden, Arps and Tower Legal Staffing.  He alleged that, beginning in April 2012, he worked for the defendants for fifteen months in North Carolina, working 45 to 55 hours per week at $25 per hour. He conducted document review for Skadden in connection with a multi-district litigation pending in the United States District Court for the Northern District of Ohio. Lola is an attorney licensed to practice law in California, but he is not admitted to practice law in either North Carolina or the Northern District of Ohio.

According to the ruling issued by the appellate court, “Lola alleged that his work was closely supervised by the Defendants, and ‘his entire responsibility . . . consisted of (a) looking at documents to see what search terms, if any, appeared in the documents, (b) marking those documents into the categories predetermined by Defendants, and (c) at times drawing black boxes to redact portions of certain documents based on specific protocols that Defendants provided.’  Lola also alleged that Defendants provided him with the documents he reviewed, the search terms he was to use in connection with those documents, and the procedures he was to follow if the search terms appeared.

The defendants moved to dismiss the complaint, arguing that Lola was exempt from FLSA’s overtime rules because he was a licensed attorney engaged in the practice of law. The district court granted the motion, finding (1) state, not federal, standards applied in determining whether an attorney was practicing law under FLSA; (2) North Carolina had the greatest interest in the outcome of the litigation, thus North Carolina’s law should apply; and (3) Lola was engaged in the practice of law as defined by North Carolina law, and was therefore an exempt employee under FLSA.”

While the appellate court agreed with the first two points, it disagreed with the third.  In vacating the judgment of the district court and remanding the matter for further proceedings, the appellate court stated in its ruling:

“The gravamen of Lola’s complaint is that he performed document review under such tight constraints that he exercised no legal judgment whatsoever—he alleges that he used criteria developed by others to simply sort documents into different categories. Accepting those allegations as true, as we must on a motion to dismiss, we find that Lola adequately alleged in his complaint that he failed to exercise any legal judgment in performing his duties for Defendants. A fair reading of the complaint in the light most favorable to Lola is that he provided services that a machine could have provided.”

A link to the appeals court ruling, also available in the article in The Posse List, can be found here.

So, what do you think?  Are document reviewers practicing law?  If not, should they be entitled to overtime pay?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

“Da Silva Moore Revisited” Will Be Visited by a Newly Appointed Special Master: eDiscovery Case Law

In Rio Tinto Plc v. Vale S.A., 14 Civ. 3042 (RMB)(AJP) (S.D.N.Y. Jul. 15, 2015), New York Magistrate Judge Andrew J. Peck, at the request of the defendant, entered an Order appointing Maura Grossman as a special master in this case to assist with issues concerning Technology-Assisted Review (TAR).

Back in March (as covered here on this blog), Judge Peck approved the proposed protocol for technology assisted review (TAR) presented by the parties, titling his opinion “Predictive Coding a.k.a. Computer Assisted Review a.k.a. Technology Assisted Review (TAR) — Da Silva Moore Revisited”.  Alas, as some unresolved issues remained regarding the parties’ TAR-based productions, Judge Peck decided to prepare the order appointing Grossman as special master for the case.  Grossman, of course, is a recognized TAR expert who (along with Gordon Cormack) wrote Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review as well as the Grossman-Cormack Glossary of Technology Assisted Review (covered on our blog here).

While noting that it has “no objection to Ms. Grossman’s qualifications”, the plaintiff issued several objections to the appointment, including:

  • The defendant should have agreed much earlier to appointment of a special master: Judge Peck’s response was that “The Court certainly agrees, but as the saying goes, better late than never. There still are issues regarding the parties’ TAR-based productions (including an unresolved issue raised at the most recent conference) about which Ms. Grossman’s expertise will be helpful to the parties and to the Court.”
  • The plaintiff stated a “fear that [Ms. Grossman’s] appointment today will only cause the parties to revisit, rehash, and reargue settled issues”: Judge Peck stated that “the Court will not allow that to happen. As I have stated before, the standard for TAR is not perfection (nor of using the best practices that Ms. Grossman might use in her own firm’s work), but rather what is reasonable and proportional under the circumstances. The same standard will be applied by the special master.”
  • One of the defendant’s lawyers had three conversations with Ms. Grossman about TAR issues: Judge Peck did not find that the one contact in connection with The Sedona Conference “should or does prevent Ms. Grossman from serving as special master”, and noted that, as to the other two, the plaintiff “does not suggest that Ms. Grossman did anything improper in responding to counsel’s question, and Ms. Grossman has made clear that she sees no reason why she cannot serve as a neutral special master” – a statement with which the Court agreed.

Judge Peck did agree with the plaintiff on allocation of the special master’s fees, stating that the defendant’s “propsal [sic] is inconsistent with this Court’s stated requirement in this case that whoever agreed to appointment of a special master would have to agree to pay, subject to the Court reallocating costs if warranted”.

So, what do you think?  Was the appointment of a special master (albeit an eminently qualified one) appropriate at this stage of the case?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Quality Control, Making Sure the Numbers Add Up: eDiscovery Best Practices

Having touched on this topic a few years ago, a recent client experience spurred me to revisit it.

Friday, we wrote about tracking file counts from collection to production, the concept of expanded file counts, and the categorization of files during processing.  Today, let’s walk through a scenario to show how the files collected are accounted for during the discovery process.

Tracking the Counts after Processing

We discussed the typical categories of excluded files after processing – obviously, what’s not excluded is available for searching and review.  Even if your approach includes technology assisted review (TAR) as part of your methodology, it’s still likely that you will want to cull out files that are clearly non-responsive.

Documents may be classified in a number of ways during review, but the most common classification is whether they are responsive, non-responsive, or privileged.  Privileged documents are also often classified as responsive or non-responsive, so that only the responsive documents that are privileged need be identified on a privilege log.  Responsive documents that are not privileged are then produced to opposing counsel.

Example of File Count Tracking

So, now that we’ve discussed the various categories for tracking files from collection to production, let’s walk through a fairly simple eMail-based example.  We conduct a targeted collection of a PST file from each of seven custodians in a given case.  The relevant time period for the case is January 1, 2013 through December 31, 2014.  Other than date range, we plan to do no other filtering of files during processing.  Identified duplicates will not be reviewed or produced.  We’re going to provide an exception log to opposing counsel for any file that cannot be processed and a privilege log for any responsive files that are privileged.  Here’s what this collection might look like:

  • Collected Files: After expansion and processing, 7 PST files expand to 101,852 eMails and attachments.
  • Filtered Files: Filtering eMails outside of the relevant date range eliminates 23,564 files.
  • Remaining Files after Filtering: After filtering, there are 78,288 files to be processed.
  • NIST/System Files: eMail collections typically don’t have NIST or system files, so we’ll assume zero (0) files here. Collections with loose electronic documents from hard drives typically contain some NIST and system files.
  • Exception Files: Let’s assume that a little less than 1% of the collection (912 files) consists of exception files, like password-protected, corrupted or empty files.
  • Duplicate Files: It’s fairly common for approximately 30% or more of a collection to be duplicates, so we’ll assume 24,215 files here.
  • Remaining Files after Processing: We have 53,161 files left after subtracting NIST/System, Exception and Duplicate files from the total files after filtering.
  • Files Culled During Searching: If we assume that we are able to cull out 67% (approximately two-thirds) of the remaining files as clearly non-responsive, that eliminates 35,618 files.
  • Remaining Files for Review: After culling, we have 17,543 files that will actually require review (whether manual or via a TAR approach).
  • Files Tagged as Non-Responsive: If approximately 40% of the documents reviewed are tagged as non-responsive, that would be 7,017 files tagged as such.
  • Remaining Files Tagged as Responsive: After QC to ensure that all documents are either tagged as responsive or non-responsive, this leaves 10,526 documents as responsive.
  • Responsive Files Tagged as Privileged: If roughly 8% of the responsive documents are determined to be privileged during review, that would be 842 privileged documents.
  • Produced Files: After subtracting the privileged files, we’re left with 9,684 responsive, non-privileged files to be produced to opposing counsel.

The percentages I used for estimating the counts at each stage are just examples, so don’t get too hung up on them.  The key is to note the final category counts: Filtered, NIST/System, Exception, Duplicate, Culled, Non-Responsive, Privileged and Produced files.  Excluding the interim “Remaining” counts, each collected file should wind up in exactly one of those category totals.  What happens if you add those counts together?  You should get 101,852 – the number of collected files after expanding the PST files.  As a result, every one of the collected files is accounted for and none “slips through the cracks” during discovery.  That’s the way it should be.  If not, investigation is required to determine where files were missed.
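If you want to see the math, here is a quick Python sketch that performs the same sanity check a QC spreadsheet would, using the example counts above:

```python
# Every collected file should land in exactly one final category.
collected = 101_852

final_categories = {
    "Filtered (outside date range)": 23_564,
    "NIST/System": 0,
    "Exception": 912,
    "Duplicate": 24_215,
    "Culled as clearly non-responsive": 35_618,
    "Tagged non-responsive in review": 7_017,
    "Responsive but privileged": 842,
    "Produced": 9_684,
}

accounted_for = sum(final_categories.values())
print(f"Accounted for: {accounted_for:,} of {collected:,}")
assert accounted_for == collected, "Some files slipped through the cracks; investigate!"
```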

So, what do you think?  Do you have a plan for accounting for all collected files during discovery?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Quality Control By The Numbers: eDiscovery Best Practices

Having touched on this topic a few years ago, a recent client experience spurred me to revisit it.

A while back, we wrote about Quality Assurance (QA) and Quality Control (QC) in the eDiscovery process.  Both are important in improving the quality of work product and making the eDiscovery process more defensible overall.  With regard to QC, one overall mechanism is the tracking of document counts through the discovery process, especially from collection to production, to identify how every collected file was handled and why each non-produced document was not produced.

Expanded File Counts

The count of files as collected is not the same as the expanded file count.  There are certain container file types, like Outlook PST files and ZIP archives, that exist essentially to store a collection of other files.  So, the count that is important to track is the “expanded” file count after processing, which includes all of the files contained within the container files.  In a simple scenario where you collect Outlook PST files from seven custodians, the actual number of documents (emails and attachments) within those PST files could be in the tens of thousands.  That’s the starting count that matters if your goal is to account for every document or file in the discovery process.
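For example, here is a minimal Python sketch (the archive name is hypothetical, and expanding PST files would require an email-processing library) showing how the expanded count of a single ZIP container differs from the collected count of one file:

```python
import zipfile

def expanded_count(zip_path):
    with zipfile.ZipFile(zip_path) as archive:
        # Count entries that are actual files, not directory placeholders
        return sum(1 for name in archive.namelist() if not name.endswith("/"))

collected_files = ["custodian1.zip"]          # collected count: 1 container
total = sum(expanded_count(p) for p in collected_files)
print(f"Collected containers: {len(collected_files)}, expanded files: {total}")
```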

Categorization of Files During Processing

Of course, not every document gets reviewed or even included in the search process.  During processing, files are usually categorized, with some categories of files usually being set aside and excluded from review.  Here are some typical categories of excluded files in most collections:

  • Filtered Files: Some files may be collected, and then filtered during processing. A common filter for the file collection is the relevant date range of the case.  If you’re collecting custodians’ source PST files, those may include messages outside the relevant date range; if so, those messages may need to be filtered out of the review set.  Files may also be filtered based on type of file or other reasons for exclusion.
  • NIST and System Files: Many file collections also contain system files, like executables (EXEs) or Dynamic Link Libraries (DLLs), that are part of the software on a computer and do not contain client data, so those are typically excluded from the review set. NIST files are those included on the National Institute of Standards and Technology list of files that are known to have no evidentiary value, so any files in the collection matching those on the list are “De-NISTed”.
  • Exception Files: These are files that cannot be processed or indexed, for whatever reason. For example, they may be password-protected or corrupted.  Just because these files cannot be processed doesn’t mean they can be ignored; depending on your agreement with opposing counsel, you may need to at least provide a list of them on an exception log to prove they were addressed, if not attempt to repair them or make them accessible (BTW, it’s good to establish that agreement for disposition of exception files up front).
  • Duplicate Files: During processing, files that are exact duplicates may be put aside to avoid redundant review (and potential inconsistencies). Some exact duplicates are typically identified based on the HASH value, which is a digital fingerprint generated from the content and format of the file – if two files have the same HASH value, they have the same exact content and format.  Emails (and their attachments) may be identified as duplicates based on key metadata fields, so that an attachment cannot be “de-duped” out of the collection by a standalone copy of the same file (a bare-bones illustration of the hash-based portion follows this list).
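Here is that illustration in Python; the file names are hypothetical, and real processing tools handle email family de-duplication by metadata separately, as noted above:

```python
import hashlib

def file_hash(path, algorithm="md5"):
    # Digital fingerprint of a file's content
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def group_duplicates(paths):
    # Files with the same digest have the same content; only one copy
    # from each group would go on to review.
    groups = {}
    for path in paths:
        groups.setdefault(file_hash(path), []).append(path)
    return groups

dupes = group_duplicates(["memo_v1.docx", "copy_of_memo_v1.docx", "budget.xlsx"])
for digest, paths in dupes.items():
    if len(paths) > 1:
        print(f"Duplicates (hash {digest[:8]}...): {paths}")
```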

All of these categories of excluded files can reduce the set of files that actually needs to be searched and reviewed.  On Monday, we’ll walk through an example of a file set from collection to production to illustrate how each file is accounted for during the discovery process.

So, what do you think?  Do you have a plan for accounting for all collected files during discovery?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

This Study Discusses the Benefits of Including Metadata in Machine Learning for TAR: eDiscovery Trends

A month ago, we discussed the Discovery of Electronically Stored Information (DESI) workshop and the papers describing research or practice presented at the workshop, which had been held earlier that month, and we covered one of those papers a couple of weeks later.  Today, let’s cover another paper from the workshop.

The Role of Metadata in Machine Learning for Technology Assisted Review (by Amanda Jones, Marzieh Bazrafshan, Fernando Delgado, Tania Lihatsh and Tamara Schuyler) attempts to study the  role of metadata in machine learning for technology assisted review (TAR), particularly with respect to the algorithm development process.

Let’s face it, we all generally agree that metadata is a critical component of ESI for eDiscovery.  But, opinions are mixed as to its value in the TAR process.  For example, the Grossman-Cormack Glossary of Technology Assisted Review (which we covered here in 2012) includes metadata as one of the “typical” identified features of a document that are used as input to a machine learning algorithm.  However, a couple of eDiscovery software vendors have both produced documentation stating that “machine learning systems typically rely upon extracted text only and that experts engaged in providing document assessments for training should, therefore, avoid considering metadata values in making responsiveness calls”.

So, the authors decided to conduct a study to establish the potential benefit of incorporating metadata into TAR algorithm development processes, as well as to evaluate the benefits of using extended metadata and of using the field origins of that metadata.  Extended metadata fields included Primary Custodian, Record Type, Attachment Name, Bates Start, Company/Organization, Native File Size, Parent Date and Family Count, to name a few.  They evaluated three distinct data sets (one drawn from Topic 301 of the TREC 2010 Interactive Task, and two proprietary business data sets) and generated a random sample of 4,500 documents for each (split into a 3,000-document Control Set and a 1,500-document Training Set).
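For readers who like to see how that might look in practice, here is a generic Python sketch (using scikit-learn) of combining body text with a couple of metadata fields when training a model.  The column names, tiny example data and model choice are my own illustrative assumptions, not the study’s actual methodology:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({
    "body_text":   ["quarterly results attached", "lunch on friday?", "draft contract terms"],
    "custodian":   ["cfo", "assistant", "counsel"],
    "record_type": ["email", "email", "attachment"],
    "responsive":  [1, 0, 1],
})

features = ColumnTransformer([
    ("text", TfidfVectorizer(), "body_text"),                                        # extracted text
    ("meta", OneHotEncoder(handle_unknown="ignore"), ["custodian", "record_type"]),  # metadata fields
])

model = Pipeline([("features", features), ("classifier", LogisticRegression())])
model.fit(train[["body_text", "custodian", "record_type"]], train["responsive"])

# Rank new documents by predicted probability of responsiveness
new_docs = pd.DataFrame({"body_text": ["please sign the contract"],
                         "custodian": ["counsel"], "record_type": ["email"]})
print(model.predict_proba(new_docs)[:, 1])
```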

The metric they used throughout to compare model performance is Area Under the Receiver Operating Characteristic Curve (AUROC). Say what?  According to the report, the metric indicates the probability that a given model will assign a higher ranking to a randomly selected responsive document than a randomly selected non-responsive document.
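In code, that definition can be computed directly as a pairwise comparison: the fraction of (responsive, non-responsive) document pairs in which the model scores the responsive document higher, with ties counting as half.  Here is a small Python sketch with made-up scores:

```python
def auroc(scored_docs):
    """scored_docs: list of (model_score, is_responsive) tuples."""
    responsive = [s for s, r in scored_docs if r]
    non_responsive = [s for s, r in scored_docs if not r]
    wins = 0.0
    for r_score in responsive:
        for n_score in non_responsive:
            if r_score > n_score:
                wins += 1.0
            elif r_score == n_score:
                wins += 0.5
    return wins / (len(responsive) * len(non_responsive))

sample = [(0.95, True), (0.80, True), (0.70, False), (0.60, True), (0.20, False)]
print(f"AUROC: {auroc(sample):.2f}")  # 1.0 is a perfect ranking, 0.5 is random
```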

Their findings were that incorporating metadata as an integral component of machine learning processes for TAR improved results (based on the AUROC metric).  In particular, models incorporating Extended metadata significantly outperformed models based on body text alone in each condition for every data set.  While there’s still a lot to learn about the use of metadata in modeling for TAR, it’s an interesting study and a start to the discussion.

A copy of the twelve-page study (including Bibliography and Appendix) is available here.  There is also a link to the PowerPoint presentation file from the workshop, which is a condensed way to look at the study, if desired.

So, what do you think?  Do you agree with the report’s findings?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Here’s One Study That Shows Potential Savings from Technology Assisted Review: eDiscovery Trends

A couple of weeks ago, we discussed the Discovery of Electronically Stored Information (DESI) workshop and the papers describing research or practice presented at the workshop that was held earlier this month.  Today, let’s cover one of those papers.

The Case for Technology Assisted Review and Statistical Sampling in Discovery (by Christopher H. Paskach, F. Eli Nelson and Matthew Schwab) aims to show how Technology Assisted Review (TAR) and statistical sampling can significantly reduce risk and improve productivity in eDiscovery processes.  The easy-to-read, six-page report concludes with the observation that, with measures like statistical sampling, “attorney stakeholders can make informed decisions about the reliability and accuracy of the review process, thus quantifying actual risk of error and using that measurement to maximize the value of expensive manual review. Law firms that adopt these techniques are demonstrably faster, more informed and productive than firms who rely solely on attorney reviewers who eschew TAR or statistical sampling.”

The report begins with an introduction that includes a history of eDiscovery, starting with printing documents, “Bates” stamping them, and scanning and using Optical Character Recognition (OCR) programs to capture text for searching.  As the report notes, “Today we would laugh at such processes, but in a profession based on ‘stare decisis,’ changing processes takes time.”  Of course, as we know now, “studies have concluded that machine learning techniques can outperform manual document review by lawyers”.  The report also references key cases such as Da Silva Moore, Kleen Products and Global Aerospace, which were among the first of many cases to approve the use of technology assisted review for eDiscovery.

Probably the most interesting portion of the report is the section titled Cost Impact of TAR, which illustrates a case scenario comparing the cost of TAR to the cost of manual review.  On a strictly relevance-based review of 90,000 documents (after keyword filtering, which implies a multimodal approach to TAR), the TAR approach was over $57,000 less expensive ($136,225 vs. $193,500 for manual review).  The report illustrates the comparison with both a spreadsheet of the numbers and a pie chart comparison of costs, based on the assumptions provided.  Sounds like the basis for a budgeting tool!
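In that spirit, here is a rough Python sketch of such a budgeting tool.  The review speeds, rates and TAR assumptions below are hypothetical placeholders (not the figures from the report), so plug in your own:

```python
def manual_review_cost(doc_count, docs_per_hour=50, rate_per_hour=75):
    # Straight linear review: every document is reviewed by a person
    return (doc_count / docs_per_hour) * rate_per_hour

def tar_review_cost(doc_count, tar_setup_and_training=25_000,
                    fraction_reviewed_manually=0.35, docs_per_hour=50, rate_per_hour=75):
    # TAR setup/training cost plus manual review of only a fraction of the set
    manual_portion = manual_review_cost(doc_count * fraction_reviewed_manually,
                                        docs_per_hour, rate_per_hour)
    return tar_setup_and_training + manual_portion

doc_count = 90_000
manual = manual_review_cost(doc_count)
tar = tar_review_cost(doc_count)
print(f"Manual review: ${manual:,.0f}")
print(f"TAR-assisted:  ${tar:,.0f}")
print(f"Savings:       ${manual - tar:,.0f}")
```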

Anyway, the report goes on to discuss the benefits of statistical sampling to validate the results, demonstrating that the only way to attempt to do so in a manual review scenario is to review the documents multiple times, which is prone to human error and inconsistent assessments of responsiveness.  The report then covers necessary process changes to realize the benefits of TAR and statistical sampling and concludes with the declaration that:

“Companies and law firms that take advantage of the rapid advances in TAR will be able to keep eDiscovery review costs down and reduce the investment in discovery by getting to the relevant facts faster. Those firms who stick with unassisted manual review processes will likely be left behind.”

The report is a quick, easy read and can be viewed here.

So, what do you think?  Do you agree with the report’s findings?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

DESI Got Your Input, and Here It Is: eDiscovery Trends

Back in January, we discussed the Discovery of Electronically Stored Information (DESI, not to be confused with Desi Arnaz) workshop and its call for papers describing research or practice for the DESI VI workshop, which was held last week at the University of San Diego as part of the 15th International Conference on Artificial Intelligence & Law (ICAIL 2015). Now, links to those papers are available on the workshop’s web site.

The DESI VI workshop aims to bring together researchers and practitioners to explore innovation and the development of best practices for application of search, classification, language processing, data management, visualization, and related techniques to institutional and organizational records in eDiscovery, information governance, public records access, and other legal settings. Ideally, the aim of the DESI workshop series has been to foster a continuing dialogue leading to the adoption of further best practice guidelines or standards in using machine learning, most notably in the eDiscovery space. Organizing committee members include Jason R. Baron of Drinker Biddle & Reath LLP and Douglas W. Oard of the University of Maryland.

The workshop included keynote addresses by Bennett Borden and Jeremy Pickens, a session regarding Topics in Information Governance moderated by Jason R. Baron, presentations of some of the “refereed” papers and other moderated discussions. Sounds like a very informative day!

As for the papers themselves, the workshop site provides the full list, organized into Refereed Papers and Position Papers, with links to each paper.

If you’re interested in discovery of ESI, Information Governance and artificial intelligence, these papers are for you! Kudos to all of the authors who submitted them. Over the next few weeks, we plan to dive deeper into at least a few of them.

So, what do you think? Did you attend DESI VI? Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.