Text Overlays on Image-Only PDF Files Can Be Problematic – eDiscovery Best Practices

Recently, we at CloudNine Discovery received a set of Adobe PDF files from a client that raised an issue regarding the handling of those files for searching and reviewing purposes.   The issue serves as a cautionary tale for those working with image-only PDFs in their document collection.  Here’s a recap of the issue.

The client was using OnDemand Discovery®, which is our new Client Side add-on to OnDemand® that allows clients to upload their own native data for automated processing and loading into new or existing projects.  The collection was purported to consist mostly of image-only PDF files.  PDF files are created in two ways:

  1. By saving or printing from applications to a PDF file: Many applications, such as Microsoft Office applications like Word, Excel and PowerPoint, provide the ability to save the document or spreadsheet that you’ve created to a PDF file, which is common when you want to “publish” the document.  If the application you’re using doesn’t provide that option, you can print the document to PDF using any of several PDF printer drivers available (some of which are free).  PDFs created this way usually include the text of the file from which the PDF was created.
  2. By scanning or otherwise creating an image to a PDF file: Typically, this occurs either by scanning hard copy documents to PDF or by receiving a document in image-only PDF form (such as through fax software).  PDFs created this way are images and do not include the text of the document from which they came.

Like many processing tools, such as LAW PreDiscovery®, OnDemand Discovery is programmed to handle PDF files by extracting the text if present or, if not, performing OCR on the files to capture text from the image.  Text extracted from the file is generally preferable to OCR text because it’s far more accurate, which is why OCR is typically performed only on PDF files that lack text.
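
To make that rule concrete, here is a minimal sketch in Python of the "extract text if present, otherwise OCR" decision.  It is only an illustration, not the actual logic of OnDemand Discovery or LAW PreDiscovery; the function names `needs_ocr` and `text_source` are hypothetical, and the input is assumed to be whatever text a PDF library has already pulled from the file.

```python
def needs_ocr(extracted_text: str) -> bool:
    """Naive rule: run OCR only when the PDF yields no extractable text."""
    return extracted_text.strip() == ""


def text_source(extracted_text: str) -> str:
    """Report which text a processing tool following the naive rule would keep."""
    return "ocr" if needs_ocr(extracted_text) else "embedded"


if __name__ == "__main__":
    # A PDF saved from a word processor carries its own text.
    print(text_source("Quarterly results for the third quarter"))
    # A scanned, image-only PDF yields nothing, so it is sent to OCR.
    print(text_source(""))
```

Under this rule, any extractable text at all, however little, is enough to skip OCR.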

After the client loaded their data, we did a spot Quality Control check (like we always do) and discovered that the text for several of the documents consisted only of Bates numbers.


Why?  Because the Bates numbers were added as text overlays to the pre-existing image-only PDF files.  When the processing software examined each file, it found extractable text, so it extracted that text instead of OCRing the PDF file.  In effect, adding the Bates numbers as text overlays meant the files were no longer image-only PDFs.  As a result, the actual content of the documents was never captured, so it wasn’t available for indexing and searching.  These documents remained essentially non-searchable even after processing.
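
A quality control check for this situation can be sketched in a few lines of Python.  This is an illustration under stated assumptions rather than a description of any product’s behavior: the Bates format in `BATES_TOKEN` (two to six letters, an optional separator, then four to ten digits) is hypothetical and would need to match your own numbering scheme, and the function names are invented for the example.

```python
import re

# Hypothetical Bates format: 2-6 letters, optional separator, 4-10 digits,
# e.g. "ABC0000123" or "ABC-0000123".  Adjust to your own numbering scheme.
BATES_TOKEN = re.compile(r"[A-Za-z]{2,6}[-_ ]?\d{4,10}")


def is_bates_only(page_text: str) -> bool:
    """True when every non-blank line of the text looks like a Bates number."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    return bool(lines) and all(BATES_TOKEN.fullmatch(ln) for ln in lines)


def needs_ocr(page_text: str) -> bool:
    """OCR when there is no text, or when the only text is a Bates stamp."""
    return page_text.strip() == "" or is_bates_only(page_text)


if __name__ == "__main__":
    print(needs_ocr("ABC0000123"))                   # Bates overlay only
    print(needs_ocr("Dear Mr. Smith,\nABC0000123"))  # real content plus stamp
```

Flagging documents whose extracted text is nothing but Bates numbers gives you a list of files to review, or to send to OCR, before they are indexed.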

How did this happen?  Likely through Adobe Acrobat’s Bates Numbering functionality, which is available in Acrobat version 8 and higher.  It does exactly that: it applies a text overlay Bates number to each page of the document.  Once that happens, eDiscovery processing applications that skip OCR whenever extractable text is present will not perform OCR on what was an image-only PDF.

What can you do about it?  If you haven’t applied Bates numbers on the files yet (or have a backup of the files before they were applied – highly recommended) and they haven’t been produced, you should process the files before putting Bates numbers on the images to ensure that you capture the most text available.  And, if opposing counsel will be producing any image-only PDF files, you will want to request the text as well (along with a load file) so that you can maximize your ability to search their production (of course, your first choice should be to receive native format productions whenever possible – here’s a link to an excellent guide on that subject).

If the Bates numbers are already applied and you don’t have a backup of the files without the Bates numbers (oops!), you’re faced with additional processing charges to convert them to TIFF and perform OCR of both the text and the Bates number, a charge that is entirely avoidable if you plan ahead.

So, what do you think?  Have you dealt with image-only PDF files with text-overlaid Bates numbers?  Please share any comments you might have, or let us know if you’d like to know more about a particular topic.

