Data Needs to Be Converted More Often than You Think – eDiscovery Best Practices

We’ve discussed previously that electronic files aren’t necessarily ready to review just because they’re electronic.  They often need processing and good processing requires a sound process.  Sometimes that process includes data conversion if the data isn’t in the most useful format.

Case in point: I recently worked with a client that received a multi-part production from the other side (via another party involved in the litigation, per agreement between the parties) that included image files, OCR text files and metadata.  The files that my client received were produced over several months to several other parties in the litigation.  The production contained numerous emails, each of which (of course) included an email sent date.  Can you guess which format the email sent date was provided in?  Here are some choices (using today’s date and 1:00 PM as an example):

  • 09/03/2013 13:00:00
  • 9/03/2013 1:00 PM
  • September 3, 2013 1:00 PM
  • Sep-03-2013 1:00 PM
  • 2013/09/03 13:00:00

The answer: all of them.

Because there were several productions to different parties with (apparently) different format agreements, my client didn’t have the option to request that the data be reproduced in a standard format.  Not only that, the name of the produced metadata field wasn’t consistent between productions – in about 15 percent of the documents, the producing party named the field email_date_sent; in the rest, it was named date_sent.

Ever try to sort emails chronologically when they’re not only in different formats, but also in two different fields?  It’s impossible.  Fortunately, at CloudNine Discovery, there is no shortage of computer “geeks” to address problems like this (I’m admittedly one of them).

As a result, we had to standardize the dates into a single format in a single field.  We used SQL queries to consolidate the data into one field, then string functions and regular expressions to re-parse dates that didn’t fit a standard SQL date format.  For example, the date 2013/09/03 was re-parsed into 09/03/2013.
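
The re-parsing logic can be sketched in a few lines of Python (the actual work was done in SQL with string functions and regular expressions; the formats below are the ones from the example list above, and the target format is just an illustrative choice):

```python
from datetime import datetime

# Formats from the example list above; illustrative, not the actual
# production spec. The target format is an arbitrary standard choice.
KNOWN_FORMATS = [
    "%m/%d/%Y %H:%M:%S",   # 09/03/2013 13:00:00
    "%m/%d/%Y %I:%M %p",   # 9/03/2013 1:00 PM
    "%B %d, %Y %I:%M %p",  # September 3, 2013 1:00 PM
    "%b-%d-%Y %I:%M %p",   # Sep-03-2013 1:00 PM
    "%Y/%m/%d %H:%M:%S",   # 2013/09/03 13:00:00
]

def normalize_date(raw):
    """Try each known format; return the date in one standard format."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue
    raise ValueError("Unrecognized date format: %r" % raw)
```

With this approach, every variant in the list above normalizes to the same value, so the emails can finally be sorted chronologically.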

Getting the dates into a standard format in a single field not only enabled us to sort the emails chronologically by date sent, it also enabled us to identify duplicates in the collection based on that field in combination with other standard email metadata fields (since the data was produced as images and OCR text, hash algorithms weren’t a viable option for de-duplication).
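
A rough sketch of that metadata-based de-duplication, assuming the dates have already been standardized; the field names here are hypothetical, not the actual production fields:

```python
# Hypothetical field names -- the actual production used different (and
# inconsistent) names. Documents sharing the same composite key of
# standard email metadata fields are flagged as likely duplicates.
def metadata_key(doc):
    return (
        doc.get("date_sent", "").strip(),
        doc.get("from", "").strip().lower(),
        doc.get("to", "").strip().lower(),
        doc.get("subject", "").strip().lower(),
    )

def find_duplicates(docs):
    """Group document IDs that share identical metadata keys."""
    groups = {}
    for doc in docs:
        groups.setdefault(metadata_key(doc), []).append(doc["doc_id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```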

Over the years, I’ve seen many examples where data (either from our side or the other side) needs to be converted.  It happens more than you think.  When that happens, it’s good to have a computer “geek” on your side to address the problem.

So, what do you think?  Have you encountered data conversion issues in your cases?   Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Scheindlin Reverses Magistrate Judge Ruling, Orders Sanction for Spoliation of Data – eDiscovery Case Law

If you’re hoping to get away with failing to preserve data in eDiscovery, you might want to think again if your case appears in the docket for the Southern District of New York with Judge Shira Scheindlin presiding.

As reported by Victor Li in Law Technology News (Scheindlin Not Charmed When Revisiting Spoliation a Third Time), Judge Scheindlin, who issued two of the most famous rulings regarding eDiscovery sanctions for spoliation of data – Zubulake v. UBS Warburg and Pension Committee of the University of Montreal Pension Plan v. Banc of America Securities – sanctioned Sekisui America Corp. and Sekisui Medical Co. with an adverse inference jury instruction for deleting emails in its ongoing breach of contract case, as well as an award of “reasonable costs, including attorneys’ fees, associated with bringing this motion”.

Last year, the plaintiffs sued two former executives, including CEO Richard Hart of America Diagnostica, Inc. (ADI), a medical diagnostic products manufacturer acquired by Sekisui in 2009, for breach of contract.  While the plaintiffs informed the defendants in October 2010 that they intended to sue, they did not impose a litigation hold on their own data until May 2012. According to court documents, during the interim, thousands of emails were deleted in order to free up server space, including Richard Hart’s entire email folder and that of another ADI employee (Leigh Ayres).

U.S. Magistrate Judge Frank Maas of the Southern District of New York, while finding that the actions could constitute gross negligence by the plaintiffs, recommended against sanctions because:

  • There was no showing of bad faith; and
  • The defendants could not prove that the emails would have been beneficial to them, or that they were prejudiced by the deletion of the emails.

The defendants appealed.  Judge Scheindlin reversed the ruling by Magistrate Judge Maas, finding that “the destruction of Hart’s and Ayres’ ESI was willful and that prejudice is therefore presumed” and the “Magistrate Judge’s Decision denying the Harts’ motion for sanctions was therefore ‘clearly erroneous.’”

With regard to the defendants proving whether the deleted emails would have been beneficial to them, Judge Scheindlin stated “When evidence is destroyed intentionally, such destruction is sufficient evidence from which to conclude that the missing evidence was unfavorable to that party.  As such, once willfulness is established, no burden is imposed on the innocent party to point to now-destroyed evidence which is no longer available because the other party destroyed it.”

Judge Scheindlin also found fault with the proposed amendment to Rule 37(e) of the Federal Rules of Civil Procedure, which would limit the imposition of eDiscovery sanctions for spoliation to instances where the destruction of evidence caused substantial prejudice and was willful or in bad faith, stating “I do not agree that the burden to prove prejudice from missing evidence lost as a result of willful or intentional misconduct should fall on the innocent party.  Furthermore, imposing sanctions only where evidence is destroyed willfully or in bad faith creates perverse incentives and encourages sloppy behavior.”

As a result, Judge Scheindlin granted the defendants’ request for an adverse inference jury instruction and also awarded “reasonable costs, including attorneys’ fees, associated with bringing this motion”.  To see the full opinion order (via Law Technology News), click here.

So, what do you think?  Should sanctions have been awarded?   Please share any comments you might have or if you’d like to know more about a particular topic.


How Big is Your ESI Collection, Really? – eDiscovery Best Practices

When I was at ILTA last week, this topic came up in a discussion with a colleague during the show, so I thought it would be good to revisit here.

After identifying custodians relevant to the case, you’ve collected roughly 100 gigabytes (GB) of Microsoft Outlook email PST files and loose electronic files from those custodians.  You identify a vendor to process the files for loading into a review tool, so that you can perform review and produce the files to opposing counsel.  After processing, the vendor sends you a bill – and they’ve charged you to process over 200 GB!  Are they trying to overbill you?

Yes and no.

Many of the files in most ESI collections are stored in what are known as “archive” or “container” files.  For example, while Outlook emails can be stored in different file formats, they are typically collected from each custodian in a personal storage (.PST) file, which is an expanding container file.  The scanned size of the PST file is simply its size on disk.

Did you ever see one of those vacuum bags that you store clothes in, sucking the air out so the clothes take up less space?  The PST file is like one of those vacuum bags – it often stores the emails and attachments in a compressed format to save space.  Other types of archive container files compress their contents as well – .ZIP and .RAR files are two examples.  These files are used not only to compress files for storage on hard drives, but also to compact or group a set of files for transmission, often via email.  With email comprising a major portion of most ESI collections and the popularity of other archive container files for compressing file collections, the expanded size of your collection may be considerably larger than it appears when stored on disk.

When PST, ZIP, RAR or other compressed file formats are processed for loading into a review tool, they are expanded to their normal size.  This expanded size can be 1.5 to 2 times larger than the scanned size (or more).  And that’s what some vendors will bill processing on – the expanded size.  In those cases, you won’t know what the processing costs will be until processing is complete and the data has been expanded.
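
A quick way to see the effect: compare a container file’s size on disk (the scanned size) with the sum of its members’ uncompressed sizes (the expanded size). A minimal Python sketch using an in-memory ZIP of highly compressible text:

```python
import io
import zipfile

def container_sizes(zip_bytes):
    """Return (scanned, expanded) sizes in bytes for a ZIP container."""
    scanned = len(zip_bytes)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # file_size is the uncompressed size of each member
        expanded = sum(info.file_size for info in zf.infolist())
    return scanned, expanded

# Build a small in-memory ZIP of highly compressible text to illustrate.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("email_export.txt", "the same sentence repeated " * 10000)

scanned, expanded = container_sizes(buf.getvalue())
print("scanned: %d bytes, expanded: %d bytes" % (scanned, expanded))
```

Real-world ratios vary with content: office documents and email text compress far more than already-compressed formats like JPEG images.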

It’s important to be prepared for that and know your options when processing that data.  Make sure your vendor selection criteria include questions about how processing is billed – on the scanned or expanded size.  Some vendors (like the company I work for, CloudNine Discovery) bill based on the scanned size of the collection for processing, so shop around to make sure you’re getting the best deal from your vendor.

So, what do you think?  Have you ever been surprised by processing costs of your ESI?   Please share any comments you might have or if you’d like to know more about a particular topic.


Data May Be Doubling Every Couple of Years, But How Much of it is Original? – eDiscovery Best Practices

According to the Compliance, Governance and Oversight Council (CGOC), information volume in most organizations doubles every 18-24 months. However, just because it doubles doesn’t mean that it’s all original. Like a bad cover band singing Free Bird, the rendition may be unique, but the content is the same. The key is limiting review to unique content.

When reviewers are reviewing the same files again and again, it not only drives up costs unnecessarily, but it could also lead to problems if the same file is categorized differently by different reviewers (for example, inadvertent production of a duplicate of a privileged file if it is not correctly categorized).

Of course, we all know the importance of identifying exact duplicates (files with the exact same content in the same file format) through MD5 and SHA-1 hash values, so that they can be removed from the review population to save considerable review costs.
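
Computing those hash values is straightforward; a minimal Python sketch:

```python
import hashlib

def file_hashes(data):
    """Compute MD5 and SHA-1 digests for exact-duplicate identification."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }
```

Identical byte content always yields identical hashes, while a single changed byte yields completely different ones – which is exactly why hashing catches only exact duplicates, not near duplicates.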

Identifying near duplicates that contain the same (or almost the same) information (such as a Word document published to an Adobe PDF file where the content is the same, but the file format is different, so the hash value will be different) also reduces redundant review and saves costs.

Then, there is message thread analysis. Many email messages are part of a larger discussion, sometimes just between two parties and other times among several. Reviewing each email in the thread individually means much of the same information gets reviewed over and over again. Pulling those messages together and enabling them to be reviewed as an entire discussion eliminates that redundant review. That includes any side conversations within the discussion that may or may not be related to the original topic (e.g., a side discussion about the latest misstep by Anthony Weiner).
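
A much-simplified sketch of how messages might be grouped into threads by normalized subject line (real thread analysis also relies on Message-ID/In-Reply-To headers and message content; the prefix list here is only illustrative):

```python
import re

# Strip common reply/forward prefixes (illustrative list only).
PREFIXES = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)

def thread_key(subject):
    """Normalize a subject line into a crude thread identifier."""
    return PREFIXES.sub("", subject).strip().lower()

def group_threads(emails):
    """Group (msg_id, subject) pairs into discussion threads."""
    threads = {}
    for msg_id, subject in emails:
        threads.setdefault(thread_key(subject), []).append(msg_id)
    return threads
```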

Clustering is a process which pulls similar documents together based on content so that the duplicative information can be identified more quickly and eliminated to reduce redundancy. With clustering, you can minimize review of duplicative information within documents and emails, saving time and cost and ensuring consistency in the review. As a result, even if the data in your organization doubles every couple of years, the cost of your review shouldn’t.
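
To illustrate the idea (not any particular product’s algorithm), here is a minimal greedy clustering sketch that groups documents whose word sets overlap beyond a threshold; production tools use far more sophisticated similarity measures:

```python
def jaccard(a, b):
    """Word-set overlap between 0.0 and 1.0."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def cluster(docs, threshold=0.5):
    """Greedily assign each document to the first sufficiently similar cluster."""
    clusters = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        for c in clusters:
            if jaccard(words, c["words"]) >= threshold:
                c["ids"].append(doc_id)
                c["words"] |= words  # grow the cluster's vocabulary
                break
        else:
            clusters.append({"ids": [doc_id], "words": words})
    return [c["ids"] for c in clusters]
```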

So, what do you think? Does your review tool support clustering technology to pull similar content together for review? Please share any comments you might have or if you’d like to know more about a particular topic.


Good Processing Requires a Sound Process – eDiscovery Best Practices

As we discussed yesterday, working with electronic files in a review tool is NOT simply a matter of loading the files and getting started.  Electronic files are diverse and can present a whole collection of issues to address in order to process them for loading.  To address those issues effectively, processing requires a sound process.

eDiscovery providers like (shameless plug warning!) CloudNine Discovery process electronic files regularly to enable their clients to work with those files during review and production.  As a result, we are aware of some of the information that must be provided by the client to ensure that the resulting processed data meets their needs, and have created an EDD processing spec sheet to gather that information before processing begins.  Examples of the information we collect from our clients:

  • Do you need de-duplication?  If so, should it be performed at the case or the custodian level?
  • Should Outlook emails be extracted in MSG or HTM format?
  • What time zone should we use for email extraction?  Typically, it’s the local time zone of the client or Greenwich Mean Time (GMT).  If you don’t think that matters, consider this example.
  • Should we perform Optical Character Recognition (OCR) on image-only files that don’t have corresponding text?  If we don’t OCR those files, responsive documents could be missed during searching.
  • If any password-protected files are encountered, should we attempt to crack those passwords or log them as exception files?
  • Should the collection be culled based on a responsive date range?
  • Should the collection be culled based on key terms?
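
The time zone question above is worth a concrete illustration (the offset below assumes US Central Daylight Time and is purely hypothetical): an email sent late in the evening local time falls on a different calendar day in GMT, which can change date-range culling results.

```python
from datetime import datetime, timedelta, timezone

# Assumed offset: US Central Daylight Time (UTC-5).
central = timezone(timedelta(hours=-5))

# An email sent at 11:00 PM Central on June 30...
sent_local = datetime(2013, 6, 30, 23, 0, tzinfo=central)
sent_gmt = sent_local.astimezone(timezone.utc)

# ...lands on July 1 in GMT, so a cutoff of "on or before June 30"
# includes or excludes this email depending on the processing time zone.
print(sent_local.date())  # 2013-06-30
print(sent_gmt.date())    # 2013-07-01
```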

Those are some general examples for native processing.  If the client requests creation of image files (many still do, despite the well-documented advantages of native files), there are a number of additional questions we ask regarding image processing.  Some examples:

  • Generate as single-page TIFF, multi-page TIFF, text-searchable PDF or non-text-searchable PDF?
  • Should color images be created when appropriate?
  • Should we generate placeholder images for unsupported or corrupt files that cannot be repaired?
  • Should we create images of Excel files?  If so, we proceed to ask a series of questions about formatting preferences, including orientation (portrait or landscape), scaling options (auto-size columns or fit to page), printing gridlines, printing hidden rows/columns/sheets, etc.
  • Should we endorse the images?  If so, how?

Those are just some examples.  Questions about print format options for Excel, Word and PowerPoint take up almost a full page by themselves – there are a lot of formatting options for those files and we identify default parameters that we typically use.  Don’t get me started.

We also ask questions about load file generation (if the data is not being loaded into our own review tool, OnDemand®), including what load file format is preferred and parameters associated with the desired load file format.

This isn’t a comprehensive list of questions we ask, just a sample to illustrate how many decisions must be made to process electronic data effectively.  Processing data is not just a matter of feeding native electronic files into the processing tool and generating results; it requires a sound process to ensure that the resulting output will meet the needs of the case.

So, what do you think?  How do you handle processing of electronic files?  Please share any comments you might have or if you’d like to know more about a particular topic.

P.S. – No hamsters were harmed in the making of this blog post.


The Files are Already Electronic, How Hard Can They Be to Load? – eDiscovery Best Practices

Since hard copy discovery became electronic discovery, I’ve worked with a number of clients who expect that working with electronic files in a review tool is simply a matter of loading the files and getting started.  Unfortunately, it’s not that simple!

Back when most discovery was paper based, the usefulness of the documents was understandably limited.  Documents were paper and they all required conversion to image to be viewed electronically, optical character recognition (OCR) to capture their text (though not 100% accurately) and coding (i.e., data entry) to capture key data elements (e.g., author, recipient, subject, document date, document type, names mentioned, etc.).  It was a problem, but it was a consistent problem – all documents needed the same treatment to make them searchable and usable electronically.

Though electronic files are already electronic, that doesn’t mean that they’re ready for review as is.  They don’t represent just one problem; they can represent a whole collection of problems – for example, emails and attachments stored in compressed container files (such as PST, ZIP and RAR) that must be expanded, password-protected or corrupt files, image-only files with no searchable text that require OCR, duplicates across custodians, and inconsistent date and time zone handling.

These are just a few examples of why working with electronic files for review isn’t necessarily straightforward.  Of course, when processed correctly, electronic files include considerable metadata that provides useful information about how and when the files were created and used, and by whom.  They’re way more useful than paper documents.  So, it’s still preferable to work with electronic files instead of hard copy files whenever they are available.  But, despite what you might think, that doesn’t make them ready to review as is.

So, what do you think?  Have you encountered difficulties or challenges when processing electronic files?  Please share any comments you might have or if you’d like to know more about a particular topic.


Court Rules that Stored Communications Act Applies to Former Employee Emails – eDiscovery Case Law

In Lazette v. Kulmatycki, No. 3:12CV2416, 2013 U.S. Dist. (N.D. Ohio June 5, 2013), the court held that the Stored Communications Act (SCA) applied when a supervisor reviewed his former employee’s Gmail through her company-issued smartphone; the SCA covered emails the former employee had not yet opened, but not emails she had read but not yet deleted.

When the plaintiff left her employer, she returned her company-issued Blackberry, which she believed the company would recycle and give to another employee. Over the next eighteen months, her former supervisor read 48,000 emails on the plaintiff’s personal Gmail account without her knowledge or authorization. The plaintiff also claimed her supervisor shared the contents of her emails with others. As a result, she filed a lawsuit alleging violations of the SCA, among other claims.

The SCA allows recovery where someone “(1) intentionally accesses without authorization a facility through which an electronic communication service is provided; or (2) intentionally exceeds an authorization to access that facility; and thereby obtains . . . access to a wire or electronic communication while it is in electronic storage in such system.” “Electronic storage” includes “(A) any temporary, intermediate storage of a wire or electronic communication incidental to the electronic transmission thereof; and (B) any storage of such communication by an electronic communication service for purposes of backup protection of such communication.”

The defendants claimed that Kulmatycki’s review of the plaintiff’s emails did not violate the SCA for several reasons: the SCA was aimed at “‘high-tech’ criminals, such as computer hackers”; Kulmatycki had authority to access the plaintiff’s emails; his access “did not occur via ‘a facility through which an electronic communication service is provided’” other than the company-owned Blackberry; “the emails were not in electronic storage when Kulmatycki read them”; and the company was exempt because “the person or entity providing an electronic communications service is exempt from the Act,” as the complaint did not make clear that the plaintiff’s Gmail account was separate from her company account.

The court rejected all but one of the defendants’ arguments. The SCA’s scope extended beyond high-tech hackers, and the Gmail server was the “facility” in question, not the plaintiff’s Blackberry. The court also found that the plaintiff’s failure to delete her Gmail account from her Blackberry did not give her supervisor implied consent to access her emails; the plaintiff’s negligence did not amount to “approval, much less authorization. There is a difference between someone who fails to leave the door locked when going out and one who leaves it open knowing someone will be stopping by.” The court also found that the former employer could be held liable through respondeat superior: the actions of the supervisor could be imputed to the company.

The defendants did score a minor victory in their interpretation of “storage”: any emails that the plaintiff had opened but not deleted before the defendant saw them were not being kept “for the purposes of backup protection” and thus were not protected under the SCA.

Accordingly, the court allowed the plaintiff’s SCA claim to proceed.

So, what do you think?  Should the emails have been protected under the SCA?  Please share any comments you might have or if you’d like to know more about a particular topic.

Case Summary Source: Applied Discovery (free subscription required).  For eDiscovery news and best practices, check out the Applied Discovery Blog here.


Appellate Court Upholds District Court Discretion for Determining the Strength of Adverse Inference Sanction – eDiscovery Case Law

In Flagg v. City of Detroit, No. 11-2501, 2013 U.S. App. (6th Cir. Apr. 25, 2013), the Sixth Circuit held that the district court did not abuse its discretion in issuing a permissive rather than mandatory adverse inference instruction for the defendant’s deletion of emails, noting that the district court has discretion in determining the strength of the inference to be applied.

In this appeal, the plaintiff children of a murder victim argued that the district court did not go far enough in issuing a permissive adverse inference instruction against the defendants for the destruction of evidence; instead, they believed a mandatory adverse inference instruction was warranted.

During discovery, the plaintiffs had filed a motion for preservation of evidence that covered emails. The court granted the motion. Later, the plaintiffs asked the defendants to produce all emails for a number of city officials, including the mayor. However, the city had deleted and purged the email of several officials when they resigned, including those of the mayor. The district court found the city had acted “culpably and in bad faith” in destroying the emails. Though it denied the plaintiffs’ request for a default judgment and a mandatory adverse inference, it did grant their request for a permissive inference. The plaintiffs appealed the district court’s choice of sanction.

The Sixth Circuit reviewed the district court’s opinion for abuse of discretion. It found that the plaintiffs met all three elements required for an adverse inference instruction: that the defendants had an obligation to preserve the evidence they destroyed, that the defendants destroyed the evidence with a culpable state of mind, and that the destroyed evidence was relevant to the plaintiffs’ claim.

Because the district court has the power to decide the strength of the inference, the Sixth Circuit upheld its decision, despite noting that “[i]f the severity of a spoliation sanction were required to be based solely on the sanctioned party’s degree of fault, this Court likely would be compelled to agree with Plaintiffs that the district court abused its discretion. After all, ‘intentionality’ is the highest degree of fault contemplated by this Court . . . and the district court found it to be present in this case.”

So, what do you think?  Should the District Court decision have been upheld?  Please share any comments you might have or if you’d like to know more about a particular topic.

Case Summary Source: Applied Discovery (free subscription required).  For eDiscovery news and best practices, check out the Applied Discovery Blog here.


Some Additional Perspective on the EDRM Enron Data Set “Controversy” – eDiscovery Trends

Sharon Nelson wrote a terrific post about the “controversy” regarding the Electronic Discovery Reference Model (EDRM) Enron Data Set in her Ride the Lightning blog (Is the Enron E-Mail Data Set Worth All the Mudslinging?).  I wanted to repeat some of her key points here and offer some of my own perspective directly from sitting in on the Data Set team during the EDRM Annual Meeting earlier this month.

But First, a Recap

To recap, the EDRM Enron Data Set, sourced from the FERC Enron Investigation release made available by Lockheed Martin Corporation, has been a valuable resource for eDiscovery software demonstration and testing (we covered it here back in January 2011).  Initially, the data was made available for download on the EDRM site, then subsequently moved to Amazon Web Services (AWS).  However, after much recent discussion about personally-identifiable information (PII) data (including social security numbers, credit card numbers, dates of birth, home addresses and phone numbers) available within FERC (and consequently the EDRM Data Set), the EDRM Data Set was taken down from the AWS site.

Then, a couple of weeks ago, EDRM, along with Nuix, announced that they had republished version 1 of the EDRM Enron PST Data Set (which contains over 1.3 million items) after cleansing it of private, health and personal financial information. Nuix and EDRM also published the methodology Nuix’s staff used to identify and remove more than 10,000 high-risk items, including credit card numbers (60 items), Social Security or other national identity numbers (572), individuals’ dates of birth (292) and other personal data.  All personal data gone, right?

Not so fast.

As noted in this Law Technology News article by Sean Doherty (Enron Sandbox Stirs Up Private Data, Again), “Index Engines (IE) obtained a copy of the Nuix-cleansed Enron data for review and claims to have found many ‘social security numbers, legal documents, and other information that should not be made public.’ IE evidenced its ‘find’ by republishing a redacted version of a document with PII” (actually, a handful of them).  IE and others were quite critical of the effort by Nuix/EDRM and the extent of the PII data still remaining.

As he does so well, Rob Robinson has compiled a list of articles, comments and posts related to the PII issue; here is the link.

Collaboration, not criticism

Sharon’s post had several observations regarding the data set “controversy”, some of which are repeated here:

  • “Is the legal status of the data pretty clear? Yes, when a court refused to block it from being made public apparently accepting the greater good of its release, the status is pretty clear.”
  • “Should Nuix be taken to task for failure to wholly cleanse the data? I don’t think so. I am not inclined to let perfect be the enemy of the good. A lot was cleansed and it may be fair to say that Nuix was surprised by how much PII remained.”
  • “The terms governing the download of the data set made clear that there was no guarantee that all the PII was removed.” (more on that below in my observations)
  • “While one can argue that EDRM should have done something about the PII earlier, at least it is doing something now. It may be actively helpful to Nuix to point out PII that was not cleansed so it can figure out why.”
  • “Our expectations here should be that we are in the midst of a cleansing process, not looking at the data set in a black or white manner of cleansed or uncleansed.”
  • “My suggestion? Collaboration, not criticism. I believe Nuix is anxious to provide the cleanest version of the data possible – to the extent that others can help, it would be a public service.”

My Perspective from the Data Set Meeting

I sat in on part of the Data Set meeting earlier this month, and there were a couple of points discussed during the meeting that I thought were worth relaying:

1.     We understood that there was no guarantee that all of the PII data was removed.

As with any process, we understood that there was no effective way to ensure that all PII data had been removed once the process was complete, and we discussed the need for a mechanism for people to continue to report PII data that they find.  On the download page for the data set, there was a link to the legal disclaimer page, which states in section 1.8:

“While the Company endeavours to ensure that the information in the Data Set is correct and all PII is removed, the Company does not warrant the accuracy and/or completeness of the Data Set, nor that all PII has been removed from the Data Set. The Company may make changes to the Data Set at any time without notice.”

With regard to a mechanism for reporting persistent PII data, there is this statement on the Data Set page on the EDRM site:

“PII: These files may contain personally identifiable information, in spite of efforts to remove that information. If you find PII that you think should be removed, please notify us at mail@edrm.net.”

2.     We agreed that any documents with PII data should be removed, not redacted.

Because the original data set, with all of the original PII data, is available via FERC, we agreed that any documents containing sensitive personal information should be removed from the data set – NOT redacted.  In essence, redacting those documents puts a beacon on them, making it easier to find them in the FERC set or in downloaded copies of the original EDRM set, so publishing redacted examples of missed PII only serves to facilitate finding those documents in the original sets.

Conclusion

Regardless of how effective some perceived the “cleansing” of the data set to be, it did result in the removal of over 10,000 items containing personal data.  Yet, some PII data evidently remains.  While some people think (and they may have a point) that the data set should not have been published until after an independent audit for remaining PII data, it seems impractical (to me, at least) to wait until it is “perfect” before publishing the set.  So, when is it good enough to publish?  That appears to be open to interpretation.

Like Sharon, my hope is that we can move forward to continue to improve the Data Set through collaboration and that those who continue to find PII data in the set will notify EDRM, so that they can remove those items and continue to make the set better.  I’d love to see the Data Set page on the EDRM site reflect a history of each data set update, with the revision date, the number of additional PII items found and removed and who identified them (to give credit to those finding the data).  As Canned Heat would say, “Let’s Work Together”.

And, we haven’t even gotten to version 2 of the Data Set yet – more fun ahead!  🙂

So, what do you think?  Have you used the EDRM Enron Data Set?  If so, do you plan to download the new version?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Hard Drive Turned Over to Criminal Defendant – Eight Years Later – eDiscovery Case Law

If you think discovery violations by the other side can cause you problems, imagine being this guy.

As reported by WRAL.com in Durham, North Carolina, the defense in State of North Carolina v. Raven S. Abaroa, No. 10 CRS 1087 filed a Motion to Dismiss the Case for Discovery Violations after the state produced a forensic image of a hard drive (in the middle of trial) that had been locked away in the Durham Police Department for eight years.

After the state responded to the defendant’s March 2010 discovery request, the defendant filed a Motion to Compel Discovery in October 2012, alleging that the state had failed to disclose all discoverable “information in the possession of the state, including law enforcement officers, that tends to undermine the statements of or reflects negatively on the credibility of potential witnesses”.  At the hearing on the motion, the Assistant DA stated that all emails had been produced and the court agreed.

On April 29 of this year, the defendant filed another Motion to Compel Specific Items of Discovery “questioning whether all items within the state’s custody had been revealed, including information with exculpatory or impeachment value”.  Once again, the state assured the court it had met its discovery obligations and the court again denied the motion.

During pre-trial preparation of a former forensic examiner of the Durham Police Department (DPD) and testimony of detectives in the case, it became apparent that an imaged hard drive of the victim’s had never been turned over to the defense.  On May 15, representatives of the DPD located the image of the victim’s hard drive, which had been locked away in a cabinet for eight years.  Once defense counsel obtained a copy of the drive, their forensic examiner retrieved several emails between the victim and her former boyfriend, exchanged within a few weeks of the murder, that belied the prosecution’s portrayal of the defendant as an unfaithful, verbally abusive and controlling husband feared by his wife.  The defendant’s forensic examiner testified that, had he known about the hard drive in 2005, steps could have been taken to preserve the emails on the email server and that they could have provided a better snapshot of the victim’s email and Internet activity.

This led to the filing of the Motion to Dismiss the Case for Discovery Violations by the defense (link to the filing here).

As reported by WTVD, Judge Orlando Hudson, having been recently ruled against by the North Carolina Court of Appeals in another murder case where he dismissed the case based on discovery violations by Durham prosecutors, denied the defense’s requests for a dismissal or a mistrial.  Sounds like interesting grounds for appeal if the defendant is convicted.

So, what do you think?  Should the judge have granted the defense’s request for a dismissal, or at least a mistrial?  Please share any comments you might have or if you’d like to know more about a particular topic.
