eDiscovery Daily Blog

Here’s a New Twist to Text Overlays on Image-Only PDF Files That Can Be Even More Problematic: eDiscovery Best Practices

Remember when we discussed the issue of text overlays on image-only PDF files (typically Bates numbers) and the problems they cause?  Well, we found a variation on the issue that is even more of a problem.

Here’s a recap of the issue we identified a couple of years ago.  The client was using the Discovery Client, which allows clients to upload their own native data for automated processing and loading into new or existing projects in our CloudNine platform.  The collection was purported to consist mostly of image-only PDF files, which is one of two ways to create PDF files (click back to the old post for more info on both ways to do so).

Like many processing tools, such as LAW PreDiscovery®, CloudNine was programmed back then to handle PDF files by extracting the text if present or, if not, performing OCR on the files to capture text from the image.  Text extracted from the file is always preferable to OCR text because it’s a lot more accurate, which is why OCR is typically performed only on PDF files lacking text.
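To make that rule concrete, here is a minimal sketch in Python.  It is not CloudNine’s or LAW PreDiscovery’s actual code; the function name `text_for_pdf` and the `run_ocr` callback are hypothetical stand-ins for whatever text extraction and OCR engines a processing tool really uses.

```python
from typing import Callable


def text_for_pdf(extracted_text: str, run_ocr: Callable[[], str]) -> str:
    """Return the embedded text if the PDF has any; otherwise fall back to OCR.

    extracted_text: text already pulled from the PDF's text layer (may be empty).
    run_ocr: hypothetical callback that OCRs the file and returns its text.
    """
    # Any non-whitespace text counts as "text present", so OCR is skipped.
    if extracted_text.strip():
        return extracted_text
    return run_ocr()
```

Note that under this rule even a single short string of embedded text is enough to skip OCR for the whole file.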

After the client loaded their data, we did a spot quality control check (like we always do) and discovered that the text for several of the documents consisted only of Bates numbers.

Why?

Because the Bates numbers were added as text overlays to the pre-existing image-only PDF files.  When the processing software examined the file, it found that there was extractable text, so it extracted that text instead of OCRing the PDF file.  In effect, adding the Bates numbers as text overlays meant that the file was no longer an image-only PDF.
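A quality control check for this condition could look something like the sketch below.  The Bates format here (a short uppercase prefix, an optional separator and a run of digits) is purely an assumption for illustration, as are the names `BATES_PATTERN` and `is_bates_only`; real Bates schemes vary, so the pattern would need to match the numbering actually used in the collection.

```python
import re

# Assumed Bates format for illustration only: 2-10 uppercase letters,
# an optional separator, then 4-10 digits (e.g. ABC0000123 or ABC-0000123).
BATES_PATTERN = re.compile(r"\b[A-Z]{2,10}[-_ ]?\d{4,10}\b")


def is_bates_only(text: str) -> bool:
    """Return True if the text is non-empty but contains nothing except Bates-like tokens."""
    if not text.strip():
        return False  # empty text is a different problem (no text at all)
    # Remove every Bates-like token and see whether anything else is left.
    remainder = BATES_PATTERN.sub("", text)
    return remainder.strip() == ""
```

Documents flagged by a check like this are candidates for OCR, since their extracted text carries none of the actual content.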

As a result of this issue a couple of years ago, we added logic to the processing engine of CloudNine to perform OCR if there is minimal text per page (to account for the scenarios where there is only a Bates number).  Therefore, the content portion of the text would still be captured, so it would be available for indexing and searching.  Problem solved, right?
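The revised rule might be sketched as follows.  This is an illustration rather than the actual CloudNine logic: the 50-character threshold, the names `MIN_CHARS_PER_PAGE`, `text_for_pages` and `ocr_page`, and the choice to replace (rather than supplement) a sparse page’s text with OCR output are all assumptions.

```python
from typing import Callable, List

# Assumed threshold: pages with fewer characters than this are treated as
# "minimal text" (for example, a lone Bates number) and sent to OCR.
MIN_CHARS_PER_PAGE = 50


def text_for_pages(
    page_texts: List[str],
    ocr_page: Callable[[int], str],
    min_chars: int = MIN_CHARS_PER_PAGE,
) -> List[str]:
    """Keep embedded text for text-rich pages; OCR pages with little or no text.

    page_texts: embedded text extracted from each page, in page order.
    ocr_page: hypothetical callback that OCRs the page at the given index.
    """
    result = []
    for index, text in enumerate(page_texts):
        if len(text.strip()) < min_chars:
            result.append(ocr_page(index))  # sparse page: use OCR text instead
        else:
            result.append(text)  # enough embedded text: keep it
    return result
```

A threshold like this is a trade-off: set it too low and a page bearing only a long Bates number or stamp slips through without OCR; set it too high and genuinely short pages get OCRed unnecessarily.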

For the most part, yes.  Until a couple of weeks ago, when we ran into the situation again on a few PDF files.  Again, processing these files yielded only the Bates numbers.  What made them different?

Ever hear of a watermark?  These documents were stamped DRAFT via a light gray watermark on the PDF file.  Then, they were Bates stamped with the Adobe Acrobat Bates Numbering functionality.

Evidently, because of the watermark, the document image and the text-overlaid Bates number were on separate levels of the PDF.  The processing tool failed to pick up the text because it essentially couldn’t find it.  Our production team ultimately had to re-generate the PDF files (by printing them back to PDF) and then OCR them.  That’s one reason why it’s good to have a team in place: to handle anomalies like that when they occur.

As we noted a couple of years ago, if you haven’t applied Bates numbers on the files yet (or have a backup of the files before they were applied – highly recommended) and they haven’t been produced, you should process the files before putting Bates numbers on the images to ensure that you capture the most text available.  And, if opposing counsel will be producing any image-only PDF files, you will want to request the text as well (along with a load file) so that you can maximize your ability to search their production.  Doing so will save you additional processing charges.

Of course, your first choice should be to receive native format productions whenever possible – here’s a link to an excellent guide on that subject.

So, what do you think?  Have you dealt with image-only PDF files with text-overlaid Bates numbers?  Please share any comments you might have, or let us know if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.