Monthly Archives: January 2011

eDiscovery Best Practices: EDRM Data Set for Great Test Data

 

In its almost six years of existence, the Electronic Discovery Reference Model (EDRM) Project has implemented a number of mechanisms to standardize the practice of eDiscovery.  Having worked on the EDRM Metrics project for the past four years, I have seen some of those mechanisms implemented firsthand.

One of the most significant recent accomplishments by EDRM is the EDRM Data Set.  Anyone who works with eDiscovery applications and processes understands the importance of being able to test those applications in as many ways as possible, using realistic data that will illustrate expected results.  Test data is extremely useful in crafting a defensible discovery approach, enabling you to determine the expected results within those applications and processes before using them with your organization’s live data.  It can also help you identify potential anomalies (those never occur, right?) up front, so that you can proactively develop an approach to address those anomalies before encountering them in your own data.

Using public domain data from Enron Corporation (originating from the Federal Energy Regulatory Commission Enron Investigation), the EDRM Data Set Project provides industry-standard, reference data sets of electronically stored information (ESI) to test those eDiscovery applications and processes.  In 2009, the EDRM Data Set project released its first version of the Enron Data Set, consisting of Enron e-mail messages and attachments within Outlook PST files, organized into 32 zipped files.

This past November, the EDRM Data Set project launched Version 2 of the EDRM Enron Email Data Set.  Straight from the press release announcing the launch, here are some of the improvements in the newest version:

  • Larger Data Set: Contains 1,227,255 emails with 493,384 attachments (included in the emails) covering 151 custodians;
  • Rich Metadata: Includes threading information, tracking IDs, and general Internet headers;
  • Multiple Email Formats: Provision of both full and de-duplicated email in PST, MIME and EDRM XML, which allows organizations to test and compare results across formats.

The Text REtrieval Conference (TREC) Legal Track project, which, as noted previously on this blog, has used the EDRM data set for its research, provided input for this version of the data set.  Kudos to John Wang, Project Lead for the EDRM Data Set Project and Product Manager at ZL Technologies, Inc., and the rest of the Data Set team for such an extensive test set collection!

So, what do you think?  Do you use the EDRM Data Set for testing your eDiscovery processes?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Searching: Proximity, Not Absence, Makes the Heart Grow Fonder

Recently, I assisted a large corporate client with several searches conducted across the company’s enterprise-wide document management systems (DMS) for ESI potentially responsive to the litigation.  Some of the individual searches on these systems retrieved over 200,000 files by themselves!

DMS systems are great for what they are intended to do – provide a storage archive for documents generated within the organization, track versions of those documents and enable individuals to locate specific documents for reference or modification (among other things).  However, few of them are developed with litigation retrieval in mind.  Sure, they have search capabilities, but using them can sometimes be like using a sledgehammer to hammer a thumbtack into the wall – advanced features to increase the precision of those searches are often lacking.

Let’s say in an oil company you’re looking for documents related to “oil rights” (such as “oil rights”, “oil drilling rights”, “oil production rights”, etc.).  You could perform phrase searches, but any variations that you didn’t think of would be missed (e.g., “rights to drill for oil”, etc.).  You could perform an AND search (i.e., “oil” AND “rights”), and that could very well retrieve all of the files related to “oil rights”, but it would also retrieve a lot of files where “oil” and “rights” appear, but have nothing to do with each other.  A search for “oil” AND “rights” in an oil company’s DMS systems may retrieve every published and copyrighted document in the systems mentioning the word “oil”.  Why?  Because almost every published and copyrighted document will have the phrase “All Rights Reserved” in the document.

That’s an example of the type of issue we were encountering with some of those searches that yielded 200,000 files with hits.  And, that’s where proximity searching comes in.  Proximity searching is simply looking for two or more words that appear close to each other in the document (e.g., “oil within 5 words of rights”) – the search will only retrieve the file if those words are as close as specified to each other, in either order.  Proximity searching helped us reduce that collection to a more manageable number for review, even though the enterprise-wide document management system didn’t have a proximity search feature.

How?  We wound up taking a two-step approach to get the collection to a more likely responsive set.  First, we did the “AND” search in the DMS system, understanding that we would retrieve a large number of files, and exported those results.  After indexing them with a first pass review tool that has more precise search alternatives (at Trial Solutions, we use FirstPass™, powered by Venio FPR™, for first pass review), we performed a second search on the set using proximity searching to limit the result set to only files where the terms were near each other.  Then, we tested the results and revised the searches where necessary to retrieve a result set that maximized both recall and precision.

The result?  We were able to reduce an initial result set of 200,000 files to just over 5,000 likely responsive files by applying the proximity search to the first result set.  And, we probably saved $50,000 to $100,000 in review costs on a single search.

I also often use proximity searches as alternatives to phrase searches to broaden the recall of those searches to identify additional potentially responsive hits.  For example, a search for “Doug Austin” doesn’t retrieve “Austin, Doug” and a search for “Dye 127” doesn’t retrieve “Dye #127”.  One character difference is all it takes for a phrase search to miss a potentially responsive file.  With proximity searching, you can look for these terms close to each other and catch those variations.
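
If it helps to see the concept in action, here’s a minimal sketch in Python – my own illustration, not the implementation of any particular tool – of how a proximity check differs from an AND search or an exact phrase search:

```python
import re

def words_within(text, term_a, term_b, distance):
    """True if term_a and term_b occur within `distance` words of each other,
    in either order, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z0-9#]+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == term_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b.lower()]
    return any(abs(a - b) <= distance for a in pos_a for b in pos_b)

# An AND search hits both of these documents; a proximity search hits only the second.
boilerplate = "Crude oil price outlook. Copyright 2011. All rights reserved."
lease = "This agreement conveys the rights to drill for oil on the property."
print(words_within(boilerplate, "oil", "rights", 5))   # False
print(words_within(lease, "oil", "rights", 5))         # True

# A phrase search for "Doug Austin" misses "Austin, Doug"; proximity catches it.
print(words_within("Austin, Doug - Project Manager", "doug", "austin", 1))  # True
```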

So, what do you think?  Do you use proximity searching in your culling for review?  Please share any comments you might have or if you’d like to know more about a particular topic.

Managing an eDiscovery Contract Review Team: First Steps in Drafting Criteria

 

In theory, responsive documents are described in the other side’s request for production.  In practice, those requests are often open to interpretation.  Your goal in drafting responsive criteria is to distill those requests and create a clear set of objective rules that leave little room for interpretation – a set of rules that can be applied correctly and consistently to the document collection.  This step is important for a couple of reasons:

  • It is difficult to get consistent results from a group of people doing the same task.  No two people will make exactly the same decision about every document – not even attorneys.  Even an individual attorney will not always make the same decision about duplicates of the same document.  Thorough, clear, detailed and objective criteria will minimize inconsistencies.
  • If discovery disputes arise, it may be necessary to demonstrate a good-faith effort.  Thorough, detailed criteria will help.  Judges understand the human error factor.  They are less tolerant of work that was approached casually or sloppily.  Clear, detailed criteria will demonstrate a carefully thought-out approach.

Where do you start?  First, do a little preparation.  There are some basic materials and information that you’ll need:

  • The complaint.
  • The request for production.
  • Knowledge of the document collection (in the last blog in this series, we talked about sampling the collection).
  • Knowledge of the strategy for defending or prosecuting the case.

Once you’ve read the complaint and the document request and you’ve sampled the collection, you’ll have a feel for the materials that reviewers are likely to see and how those documents relate to the facts and legal issues in the case.  If a strategy for defending or prosecuting the case has been developed, make sure you understand that strategy.  It is likely that an understanding of the allegations and the strategy will broaden your view of what is responsive and important.

After these preparation steps, you’ll be ready to develop a first draft of the criteria.  In the next issue, we’ll talk about how to structure and write effective criteria.

Have you drafted criteria for a document review of a large collection?  How did you approach it and how well did it work?  Please share any comments you might have and let us know if you’d like to know more about an eDiscovery topic.

Managing an eDiscovery Contract Review Team: Get a Handle on the Document Collection

 

Once you’ve defined the objectives of the review, you need to move forward with other preparation steps: You need to draft review criteria, you need to identify the type of people that are appropriate for the review (do you need a staff of attorneys?  lay people?  staff with expertise in a specific subject matter?), and you need to pull that team together.  

Before moving forward with these steps, you need a bit more information.  You need to know what’s in the document collection.  You need to know what types of documents are in the collection and you need to know what type of content is in the documents.  Once you’ve got a handle on the collection, you’ll be in a better position to make decisions on subsequent steps.

Start by interviewing custodians.  You don’t need to talk to every custodian, but talk to a representative sample.  For example, if you are collecting documents from a corporate client, speak to at least one person from each department from which you’ve collected documents.  The person you speak to should probably be a manager or someone who has a good handle on the overall operation of the department.  Find out about the department’s operations and determine its role in the events that are at issue in the case.  Ask about the types of documents that are generated and retained.  Information that you glean here will help in the next step:  sampling the collection.

After you’ve collected information from the custodians, take a look at the documents.  Review a representative sample.  Look at documents from each custodian.  Take notes on what you are finding and make copies of documents that can be used as examples to illustrate the criteria you’ll be drafting and to be used in training.

Your ultimate goal is to develop a set of objective rules that a well-trained staff can apply effectively and consistently to the collection during the review.  The more you learn about the documents in advance, the better you’ll be able to do that.  So spend the time up front learning what you can about what’s in your document collection.

Do you typically sample an eDiscovery document collection before a review?  How did you approach it?  Please share any comments you might have and let us know if you’d like to know more about an eDiscovery topic.

eDiscovery Trends: Sanctions Down in 2010 — at least thru December 1

Recently, this blog cited a Duke Law Journal study that indicated that eDiscovery sanctions were at an all-time high through 2009.  Then, a couple of weeks ago, I saw a story from Williams Mullen recapping the year 2010 in eDiscovery.  It provides a very thorough recap, including 2010 trends in sanctions (identifying several cases where sanctions were at issue), advances made during the year in cooperation and proportionality, challenges associated with privacy concerns in foreign jurisdictions and trends in litigation dealing with social media.  It’s a very comprehensive summary of the year in eDiscovery.

One noteworthy finding is that, according to the report, sanctions were sought and awarded in fewer cases in 2010.  Some notable stats from the report:

  • There were 208 eDiscovery opinions in 2009 versus 209 through December 1, 2010.
  • Out of 209 cases with eDiscovery opinions in 2010, sanctions were sought in 79 of them (38%) and awarded in 49 (62% of those cases, and 23% of all eDiscovery cases).
  • Compare that with 2009 when sanctions were sought in 42% of eDiscovery cases and were awarded in 70% of the cases in which they were requested (30% of all eDiscovery cases).
  • While overall requests for sanctions decreased, motions to compel more than doubled in 2010, being filed in 43% of all eDiscovery cases, compared to 20% in 2009.
  • Costs and fees were by far the most common sanction, being awarded in 60% of the cases involving sanctions.
  • However, each type of sanction declined: costs and fees (from 33 to 29 total sanctions), adverse inference (13 to 7), terminating (10 to 7), additional discovery (10 to 6) and preclusion (5 to 3).

The date of this report was December 17, and the report noted a total of 209 eDiscovery cases as of December 1, 2010.  So, final tallies for the year were not yet tabulated.  It will be interesting to see if the trend in decline of sanctions held true once the entire year is considered.

So, what do you think?  Is this a significant indication that more organizations are getting a handle on their eDiscovery obligations – or just a “blip in the radar”?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: Database Discovery Pop Quiz ANSWERS

 

So, how did you do?  Did you know all the answers from Friday’s post – without “googling” them?  😉

Here are the answers – enjoy!

What is a “Primary Key”? The primary key of a relational table uniquely identifies each record in the table. It can be a normal attribute that you expect to be unique (e.g., Social Security Number); however, it’s usually best to use a sequential ID generated by the Database Management System (DBMS).
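
If you’d like to see one, here’s a minimal sketch using Python’s built-in sqlite3 module and a hypothetical employees table (the names and numbers are made up), where the DBMS generates the sequential ID:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY tells SQLite to generate a unique, sequential ID for each record.
conn.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, ssn TEXT)")
conn.execute("INSERT INTO employees (name, ssn) VALUES ('Doug', '123-45-6789')")
conn.execute("INSERT INTO employees (name, ssn) VALUES ('Jane', '987-65-4321')")
for row in conn.execute("SELECT emp_id, name FROM employees"):
    print(row)  # (1, 'Doug') then (2, 'Jane') -- each record uniquely identified
```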

What is an “Inner Join” and how does it differ from an “Outer Join”?  An inner join is the most common join operation used in applications, creating a new result table by combining column values of two tables.  An outer join does not require each record in the two joined tables to have a matching record. The joined table retains each record in one of the tables – even if no other matching record exists.  Sometimes, there is a reason to keep all of the records in one table in your result, such as a list of all employees, whether or not they participate in the company’s benefits program.
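
Here’s a small sketch of the difference, again using sqlite3 with hypothetical employees and benefits tables – the inner join drops the employee with no benefits record, while the left outer join keeps every employee:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE benefits  (emp_id INTEGER, plan TEXT);
    INSERT INTO employees VALUES (1, 'Doug'), (2, 'Jane');
    INSERT INTO benefits  VALUES (1, 'Health');
""")
# Inner join: only employees with a matching benefits record.
print(conn.execute("""SELECT e.name, b.plan FROM employees e
                      INNER JOIN benefits b ON e.emp_id = b.emp_id""").fetchall())
# Left outer join: every employee, whether or not they participate.
print(conn.execute("""SELECT e.name, b.plan FROM employees e
                      LEFT OUTER JOIN benefits b ON e.emp_id = b.emp_id""").fetchall())
```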

What is “Normalization”?  Normalization is the process of organizing data to minimize redundancy of that data. Normalization involves organizing a database into multiple tables and defining relationships between the tables.
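
As a quick sketch (hypothetical tables, sqlite3 again), compare a single table that repeats the department details on every employee record with a normalized design that stores them once:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Un-normalized: department details are repeated on every employee record.
    CREATE TABLE employees_flat (name TEXT, dept_name TEXT, dept_phone TEXT);

    -- Normalized: department details are stored once and referenced by ID.
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT, dept_phone TEXT);
    CREATE TABLE employees   (emp_id INTEGER PRIMARY KEY, name TEXT,
                              dept_id INTEGER REFERENCES departments(dept_id));
""")
```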

How does a “View” differ from a “Table”?  A view is a virtual table that consists of columns from one or more tables. Though it is similar to a table, it is a query stored as an object.
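
A minimal sketch, with a hypothetical employees table – the view stores the query, not the data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT);
    INSERT INTO employees (name, dept) VALUES ('Doug', 'Legal'), ('Jane', 'IT');
    -- A view is a stored query; no data is copied into it.
    CREATE VIEW legal_staff AS SELECT name FROM employees WHERE dept = 'Legal';
""")
print(conn.execute("SELECT * FROM legal_staff").fetchall())  # [('Doug',)]
```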

What does “BLOB” stand for?  A Binary Large OBject (BLOB) is a collection of binary data stored as a single entity in a database management system. BLOBs are typically images or other multimedia objects, though sometimes binary executable code is stored as a blob.  So, if you’re not including databases in your discovery collection process, you could also be missing documents stored as BLOBs.  BTW, if you didn’t click on the link next to the BLOB question in Friday’s blog, it takes you to the amusing trailer for the 1958 movie, The Blob, starring a young Steve McQueen (so early in his career, he was billed as “Steven McQueen”).
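
Here’s a small sqlite3 sketch of a document stored as a BLOB – the kind of content you could miss if databases were excluded from collection (hypothetical table and placeholder content, of course):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attachments (doc_id INTEGER PRIMARY KEY, filename TEXT, content BLOB)")
# Any binary content -- an image, a PDF, a Word document -- can be stored as a BLOB.
pdf_bytes = b"%PDF-1.4 ...binary document content..."
conn.execute("INSERT INTO attachments (filename, content) VALUES (?, ?)",
             ("contract.pdf", pdf_bytes))
filename, size = conn.execute("SELECT filename, length(content) FROM attachments").fetchone()
print(filename, size)  # a document that lives only inside the database
```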

What is the difference between a “flat file” and a “relational” database?  A flat file database is a database designed around a single table, like a spreadsheet. The flat file design puts all database information in one table, or list, with fields to represent all parameters. A flat file is prone to considerable duplicate data, as each value is repeated for each item.  A relational database, on the other hand, incorporates multiple tables with methods (such as normalization and inner and outer joins, defined above) to store data efficiently and minimize duplication.

What is a “Trigger”?  A trigger is a procedure which is automatically executed in response to certain events in a database and is typically used for keeping the integrity of the information in the database. For example, when a new record (for a new employee) is added to the employees table, a trigger might create new records in the taxes, vacations, and salaries tables.
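
A minimal sketch of that exact example, using sqlite3 and hypothetical employees and vacations tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE vacations (emp_id INTEGER, days_remaining INTEGER);
    -- When a new employee record is added, automatically create a vacation record.
    CREATE TRIGGER new_employee AFTER INSERT ON employees
    BEGIN
        INSERT INTO vacations (emp_id, days_remaining) VALUES (NEW.emp_id, 10);
    END;
""")
conn.execute("INSERT INTO employees (name) VALUES ('Doug')")
print(conn.execute("SELECT * FROM vacations").fetchall())  # [(1, 10)]
```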

What is “Rollback”?  A rollback is the undoing of partly completed database changes when a database transaction is determined to have failed, thus returning the database to its previous state before the transaction began.  Rollbacks help ensure database integrity by enabling the database to be restored to a clean copy after erroneous operations are performed or database server crashes occur.
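
A quick sqlite3 sketch – the uncommitted insert is undone and the table returns to its prior state:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO employees (name) VALUES ('Doug')")
conn.commit()  # this change is saved

conn.execute("INSERT INTO employees (name) VALUES ('Oops')")  # partly completed change
conn.rollback()  # the transaction failed or was abandoned -- undo it

print(conn.execute("SELECT name FROM employees").fetchall())  # [('Doug',)] -- prior state restored
```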

What is “Referential Integrity”?  Referential integrity ensures that relationships between tables remain consistent. When one table has a foreign key to another table, referential integrity ensures that a record is not added to the table that contains the foreign key unless there is a corresponding record in the linked table. Many databases use cascading updates and cascading deletes to ensure that changes made to the linked table are reflected in the primary table.
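
A small sketch using sqlite3 (which requires a pragma to enforce foreign keys) with hypothetical employees and vacations tables – the orphan record is rejected, and the cascading delete keeps the tables consistent:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE vacations (emp_id INTEGER REFERENCES employees(emp_id) ON DELETE CASCADE,
                            days_remaining INTEGER);
    INSERT INTO employees VALUES (1, 'Doug');
    INSERT INTO vacations VALUES (1, 10);
""")
try:
    # No employee 99 exists, so this record violates referential integrity.
    conn.execute("INSERT INTO vacations VALUES (99, 10)")
except sqlite3.IntegrityError as e:
    print("Rejected:", e)

conn.execute("DELETE FROM employees WHERE emp_id = 1")  # cascade removes the vacation record too
print(conn.execute("SELECT * FROM vacations").fetchall())  # []
```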

Why is a “Cartesian Product” in SQL almost always a bad thing?  A Cartesian Product occurs in SQL when a join condition (via a WHERE clause in a SQL statement) is omitted, causing all combinations of records from two or more tables to be displayed.  For example, when you go to the Department of Motor Vehicles (DMV) to pay your vehicle registration, they use a database with an Owners and a Vehicles table joined together to determine for which vehicle(s) you need to pay taxes.  Without that join condition, you would have a Cartesian Product and every vehicle in the state would show up as registered to you – that’s a lot of taxes to pay!
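
Here’s a minimal sqlite3 sketch of the DMV example (hypothetical owners and vehicles tables): with the join condition, each owner gets only his or her own vehicles; without it, every owner is paired with every vehicle:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE owners   (owner_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE vehicles (vehicle_id INTEGER PRIMARY KEY, owner_id INTEGER, plate TEXT);
    INSERT INTO owners   VALUES (1, 'Doug'), (2, 'Jane');
    INSERT INTO vehicles VALUES (1, 1, 'ABC-123'), (2, 2, 'XYZ-789'), (3, 2, 'LMN-456');
""")
# Correct join: each owner matched only to his or her own vehicles (3 rows).
good = conn.execute("""SELECT o.name, v.plate FROM owners o, vehicles v
                       WHERE o.owner_id = v.owner_id""").fetchall()
# Cartesian product: the WHERE clause is omitted, so you get 2 x 3 = 6 rows --
# every vehicle in the "state" shows up as registered to every owner.
bad = conn.execute("SELECT o.name, v.plate FROM owners o, vehicles v").fetchall()
print(len(good), len(bad))  # 3 6
```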

If you didn’t know the answers to most of these questions, you’re not alone.  But knowledge of databases at this level is often necessary to collect and produce the appropriate information from a database in response to an eDiscovery request.  As Craig Ball noted in his Law.com article Ubiquitous Databases, “Get the geeks together, and get out of their way”.  Hey, I resemble that remark!

So, what do you think?  Did you learn anything?  Please share any comments you might have or if you’d like to know more about a particular topic.

eDiscovery Best Practices: Database Discovery Pop Quiz

 

Databases: You can’t live with them, you can’t live without them.

Or so it seems in eDiscovery.  On a regular basis, I’ve seen various articles and discussions related to discovery of databases and other structured data, and I remain very surprised at how few legal teams understand database discovery and know how to handle it.  A colleague of mine (whom I’ve known over the years to be honest and reliable), who works for a nationally known eDiscovery provider, even claimed to me a few months back that the provider’s collection procedures actually excluded database files.

Last month, Law.com had an article written by Craig Ball, called Ubiquitous Databases, which provided a lot of good information about database discovery. It included various examples of how databases touch our lives every day, while noting that eDiscovery is still ultra document-centric, even when those “documents” are generated from databases.  There is some really good information in that article about Database Management Systems (DBMS), Structured Query Language (SQL), Entity Relationship Diagrams (ERDs) and how they are used to manage, access and understand the information contained in databases.  It’s a really good article, especially for database novices who need to understand more about databases and how they “tick”.

But, maybe you already know all you need to know about databases?  Maybe you would already be ready to address eDiscovery on your databases today?

Having worked with databases for over 20 years (I stopped counting at 20), I know a few things about databases.  So, here is a brief “pop” quiz on database concepts.  Call them “Database 101” questions.  See how many you can answer!

  • What is a “Primary Key”? (hint: it is not what you start the car with)
  • What is an “Inner Join” and how does it differ from an “Outer Join”?
  • What is “Normalization”?
  • How does a “View” differ from a “Table”?
  • What does “BLOB” stand for? (hint: it’s not this)
  • What is the difference between a “flat file” and a “relational” database?
  • What is a “Trigger”?
  • What is “Rollback”? (hint: it has nothing to do with Wal-Mart prices)
  • What is “Referential Integrity”?
  • Why is a “Cartesian Product” in SQL almost always a bad thing?

So, what do you think?  Are you a database guru or a database novice?  Please share any comments you might have or if you’d like to know more about a particular topic.

Did you think I was going to provide the answers at the bottom?  No cheating!!  I’ll answer the questions on Monday.  Hope you can stand it!!

Managing an eDiscovery Contract Review Team: Clearly Define Objectives

 

Yesterday, we introduced the blog series to discuss Managing an eDiscovery Contract Review Team.  Now, it’s time to get started!  The first step in preparing for a document review is to very clearly define the objectives of the review.  It’s an easy step, but it’s very important.  It will drive several subsequent decisions that you’ll make regarding management of the project. 

Here are some likely objectives you may choose:

  • Identify responsive documents
  • Identify privileged documents
  • Identify documents to be reviewed by an expert
  • Identify significant helpful and harmful documents

The choices you make here will affect the type of people you’ll assign to the review, the amount of time the review will take, the type of criteria you’ll need to draft, and the level of training you’ll need to do.

How do you make these decisions?  There are a few factors that should affect your choices:

  • The nature of the case and the nature of the document collection:  What type of case are you handling and what types of documents are in the collection?  If the case involves highly technical or scientific subject matter, you may need to train the review staff to segregate those documents that require review by an expert.
  • Where are you on the case and what do you know so far?  If you don’t know much yet about the case and what will be important, you won’t be in a position to ask reviewers to recognize significant materials.
  • What’s the pool of available reviewers?  Can you easily pull together a team that’s qualified to identify potentially privileged or significant documents?   If you need a very large team, you might be better off working with a team that can more easily focus on objective criteria, and use a smaller group of attorney staff to work with a smaller collection after the initial review.

Determine the objectives that will work best for your case and that can be accomplished with the available resources.  Make sure that the objectives are clearly defined and that everyone on the litigation team understands the objectives and has the same expectations.

What do you look to accomplish with an eDiscovery document review?  Have you had objectives in addition to those listed above?  Please share any comments you might have and let us know if you’d like to know more about an eDiscovery topic.

Managing an eDiscovery Contract Review Team: Introduction

 

In a perfect world, attorneys responsible for a case would review an entire document collection for responsive materials.  On large cases with huge collections, that’s just not practical or possible.  In those situations, your only choice may be to pull together a team of contract reviewers to identify responsive materials.  

How well does this work?  A review done by a contract review team will certainly cost less than one done by a team of law firm attorneys.  More likely than not, it will be done more efficiently.  And if there’s good preparation and management, the quality will be just as good (in fact, it may be better because a contract staff is more likely to stay better focused on the inevitable, more mundane aspects of the work).

I’ve managed many successful review projects done by teams of contract employees.  Sometimes those teams were made up of attorneys, but more often they included mostly paralegals and college-educated lay personnel with good reading and comprehension skills.  These projects were successful because they were structured and managed in a way that kept decision-making responsibility in the hands of the attorneys, while effective mechanisms were in place for disseminating those decisions to the team.  In this blog series, I’m going to walk through how to do this.  Specifically, we’ll be covering:

  • Clearly defining the objectives of the document review
  • Getting a handle on the document collection
  • Determining the right mix of people for the project
  • Creating effective document review criteria
  • Effectively training the review team
  • Managing the project
  • Disseminating updated project information
  • Implementing effective quality control procedures

What has been your experience with contract review teams for large projects?  Do you have good or bad experiences you can tell us about?  Are there any specific problems you’ve had with review teams?  Please share any comments you might have and let us know if you’d like to know more about an eDiscovery topic.

eDiscovery Trends: 2011 Predictions — By The Numbers

 

Comedian Nick Bakay always ends his Tale of the Tape skits where he compares everything from Married vs. Single to Divas vs. Hot Dogs with the phrase “It’s all so simple when you break things down scientifically.”

The late December/early January time frame is always when various people in eDiscovery make their annual predictions as to what trends to expect in the coming year.  We’ll have some of our own in the next few days (hey, the longer we wait, the more likely we are to be right!).  However, before stating those predictions, I thought we would take a look at other people’s predictions and see if we can spot some common trends among them, so I “googled” for 2011 eDiscovery predictions and organized what I found into common themes.  I found serious predictions here, here, here, here and here.  Oh, also here and here.

A couple of quick comments: 1) I had NO IDEA how many times predictions are re-posted by other sites, so it took some work to isolate each unique set of predictions.  I even found two sets of predictions from ZL Technologies, one with twelve predictions and another with seven, so I had to pick one set and I chose the one with seven (sorry, eWEEK!). If I have failed to accurately attribute the original source for a set of predictions, please feel free to comment.  2) This is probably not an exhaustive list of predictions (I have other duties in my “day job”, so I couldn’t search forever), so I apologize if I’ve left anybody’s published predictions out.  Again, feel free to comment if you’re aware of other predictions.

Here are some of the common themes:

  • Cloud and SaaS Computing: Six out of seven “prognosticators” indicated that adoption of Software as a Service (SaaS) “cloud” solutions will continue to increase, which will become increasingly relevant in eDiscovery.  No surprise here, given last year’s IDC forecast for SaaS growth and many articles addressing the subject, including a few posts right here on this blog.
  • Collaboration/Integration: Six out of seven “augurs” also had predictions related to various themes associated with collaboration (more collaboration tools, greater legal/IT coordination, etc.) and integration (greater focus by software vendors on data exchange with other systems, etc.).  Two people specifically noted an expectation of greater eDiscovery integration within organization governance, risk management and compliance (GRC) processes.
  • In-House Discovery: Five “pundits” forecasted eDiscovery functions and software will continue to be brought in-house, especially on the “left-side of the EDRM model” (Information Management).
  • Diverse Data Sources: Three “soothsayers” presaged that sources of data will continue to be more diverse, which shouldn’t be a surprise to anyone, given the popularity of gadgets and the rise of social media.
  • Social Media: Speaking of social media, three “prophets” (yes, I’ve been consulting my thesaurus!) expect social media to continue to be a big area to be addressed for eDiscovery.
  • End to End Discovery: Three “psychics” also predicted that there will continue to be more single-source end-to-end eDiscovery offerings in the marketplace.

The “others receiving votes” category (two predicting each of these) included maturing and acceptance of automated review (including predictive coding), early case assessment moving toward the Information Management stage, consolidation within the eDiscovery industry, more focus on proportionality, maturing of global eDiscovery and predictive/disruptive pricing.

Predictive/disruptive pricing (via the respective blogs of Kriss Wilson of Superior Document Services and Charles Skamser of eDiscovery Solutions Group) is a particularly intriguing prediction to me because data volumes are continuing to grow at an astronomical rate, and greater volumes lead to greater costs.  Creativity will be key in how companies deal with the larger volumes effectively, and pressures will become greater for providers (even, dare I say, review attorneys) to price their services more creatively.

Another interesting prediction (via ZL Technologies) is that “Discovery of Databases and other Structured Data will Increase”, which is something I’ve expected to see for some time.  I hope this is finally the year for that.

Finally, I said that I found serious predictions and analyzed them; however, there are a couple of not-so-serious sets of predictions here and here.  My favorite prediction is from The Posse List, as follows: “LegalTech…renames itself “EDiscoveryTech” after Law.com survey reveals that of the 422 vendors present, 419 do e-discovery, and the other 3 are Hyundai HotWheels, Speedway Racers and Convert-A-Van who thought they were at the Javits Auto Show.”

So, what do you think?  Care to offer your own “hunches” from your crystal ball?  Please share any comments you might have or if you’d like to know more about a particular topic.