eDiscovery Daily Blog

The Number of Files in Each Gigabyte Can Vary Widely: eDiscovery Best Practices

Now and then, clients ask me how many documents (files) are typically contained in one gigabyte (GB) of data.  A good estimate of the number of files is an important input when estimating review costs.  However, because the number of files per GB can vary widely, estimating those costs accurately can be a challenge.

About four years ago, I conducted a little (unscientific) experiment to show how the number of pages in each GB can vary widely, depending on the file formats that comprise that GB.  Since we now tend to think more about files per GB than pages, I have taken a fresh look using the updated estimate below.

Each GB of data is rarely just one type of file.  Many emails include attachments, which can be in any of a number of different file formats.  Collections of files from hard drives may include Word, Excel, PowerPoint, Adobe PDF and other file formats.  Even files within the same application can vary, depending on the version in which they are stored.  For example, newer versions of Office files (e.g., .docx, .xlsx) incorporate zip compression of the text, so the data sizes tend to be smaller than their older counterparts.  So, estimating file counts with any degree of precision can be somewhat difficult.

To illustrate this, I put the content from yesterday’s case law blog post into several different file formats to show how much the size can vary, even when the content is essentially the same.  Here are the results, with file sizes rounded to the nearest kilobyte (KB) and file counts based on 1 GB = 1,048,576 KB:

  • Text File Format (TXT): Created by performing a “Save As” on the web page for the blog post to text – 4 KB, it would take 262,144 text files at 4 KB each to equal 1 GB;
  • HyperText Markup Language (HTML): Created by performing a “Save As” on the web page for the blog post to HTML – 57 KB, it would take 18,396 HTML files at 57 KB each to equal 1 GB;
  • Microsoft Excel 97-2003 Format (XLS): Created by copying the contents of the blog post and pasting it into a blank Excel XLS workbook – 325 KB, it would take 3,226 XLS files at 325 KB each to equal 1 GB;
  • Microsoft Excel 2010 Format (XLSX): Created by copying the contents of the blog post and pasting it into a blank Excel XLSX workbook – 296 KB, it would take 3,542 XLSX files at 296 KB each to equal 1 GB;
  • Microsoft Word 97-2003 Format (DOC): Created by copying the contents of the blog post and pasting it into a blank Word DOC document – 312 KB, it would take 3,361 DOC files at 312 KB each to equal 1 GB;
  • Microsoft Word 2010 Format (DOCX): Created by copying the contents of the blog post and pasting it into a blank Word DOCX document – 299 KB, it would take 3,507 DOCX files at 299 KB each to equal 1 GB;
  • Microsoft Outlook 2010 Message Format (MSG): Created by copying the contents of the blog post and pasting it into a blank Outlook message, then sending that message to myself, then saving the message out to my hard drive – 328 KB, it would take 3,197 MSG files at 328 KB each to equal 1 GB;
  • Adobe PDF Format (PDF): Created by printing the blog post to PDF file using the CutePDF printer driver – 1,550 KB, it would take 677 PDF files at 1,550 KB each to equal 1 GB.
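
The arithmetic behind the list above can be reproduced with a few lines of Python.  This is only a minimal sketch: the file sizes are the ones reported above, the names (KB_PER_GB, sizes_kb, files_per_gb) are my own for illustration, and it assumes 1 GB = 1,048,576 KB with counts rounded to the nearest whole file.

```python
# Assumption: binary convention, 1 GB = 1024 * 1024 KB = 1,048,576 KB
KB_PER_GB = 1024 * 1024

# File sizes in KB, taken from the list above
sizes_kb = {
    "TXT": 4,
    "HTML": 57,
    "XLS": 325,
    "XLSX": 296,
    "DOC": 312,
    "DOCX": 299,
    "MSG": 328,
    "PDF": 1550,
}


def files_per_gb(size_kb):
    """Number of files of the given size (in KB) that fit in 1 GB, rounded."""
    return round(KB_PER_GB / size_kb)


# Files per GB for each format
counts = {fmt: files_per_gb(kb) for fmt, kb in sizes_kb.items()}

for fmt, n in counts.items():
    print(f"{fmt:5s} {sizes_kb[fmt]:>6,} KB each -> {n:>8,} files per GB")
```

Even in this small sample, the count ranges from a few hundred files per GB (PDF) to a few hundred thousand (TXT).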

The HTML and PDF examples weren’t exactly an “apples to apples” comparison to the other formats – they included other content from the web page as well.  Nonetheless, the examples above hopefully illustrate that, to estimate the number of files in a collection with any degree of accuracy, it’s important to understand not only the size of the data collection, but also its makeup.  Performing an Early Data Assessment on your data beforehand can provide the file counts you need to estimate your review costs more accurately.
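
To make the point about makeup concrete, here is a rough sketch of how a file count estimate might be built from a collection's size and its mix of formats.  The 10 GB collection and its 50/30/20 split are purely hypothetical numbers I made up for illustration, as is the estimate_file_count function; the average file sizes are borrowed from the single-document experiment above and would not be representative of a real collection.

```python
# Assumption: binary convention, 1 GB = 1,048,576 KB
KB_PER_GB = 1024 * 1024


def estimate_file_count(total_gb, mix):
    """Estimate the number of files in a collection.

    mix maps a format name to (share_of_volume, average_file_size_kb),
    where the shares are fractions of total data volume that sum to 1.
    """
    total_share = sum(share for share, _ in mix.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("volume shares must sum to 1")
    total_kb = total_gb * KB_PER_GB
    return sum(total_kb * share / avg_kb for share, avg_kb in mix.values())


# Hypothetical 10 GB collection: 50% email, 30% Word, 20% PDF by volume
hypothetical_mix = {
    "MSG": (0.5, 328),
    "DOCX": (0.3, 299),
    "PDF": (0.2, 1550),
}

estimate = estimate_file_count(10, hypothetical_mix)
print(f"Estimated files in collection: {estimate:,.0f}")
```

Changing the shares while holding the total size constant moves the estimate considerably, which is why actual file counts from an Early Data Assessment are preferable to a per-GB rule of thumb.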

So, what do you think?  Was the 2016 example useful, highly flawed or both?  Please share any comments you might have or if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.
