eDiscovery Daily Blog

To Be or Not To Be? Not to Be, if Your Search Contains Noise Words: eDiscovery Best Practices

This is an issue that comes up frequently when my clients ask me to review their proposed search terms, or need help in understanding why a particular term doesn’t retrieve the intended result.

When I review a proposed list of search terms, one of the things I check is whether the terms contain any “noise” words that might affect the search results.  Noise words (also known as stop words) are words – such as “to”, “or” and “not” – that are so common that they are not considered useful in searches.

Search engines rely on indices to find information quickly – these indices are built and updated each time documents are loaded into the database.  To save time, both in creating the index and (especially) in performing searches, noise words are not indexed and are ignored in indexed searches. This enables even the most complex searches on the largest databases to be performed quickly; for example, a search for 30 to 40 terms within a 15 million document database in CloudNine usually returns results in a matter of seconds.  I know, because I recently performed several of those searches for a client on their 15 million document database.

The advantage of excluding these words is smaller indices and faster indexing and searching.  If every “the”, “do”, “can” and “up” were indexed, searching eDiscovery databases would be much slower – painfully slow, in fact.
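To make the idea concrete, here is a minimal sketch of an index that skips noise words. It is an illustration only, not how CloudNine or any particular tool is implemented; the `NOISE_WORDS` list and the `tokenize`, `build_index` and `search` helpers are all hypothetical names invented for this example.

```python
import re
from collections import defaultdict

# Hypothetical noise-word list for illustration; real tools ship their own.
NOISE_WORDS = {"to", "be", "or", "not", "the", "do", "can", "up", "of", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each non-noise word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            if word not in NOISE_WORDS:  # noise words never reach the index
                index[word].add(doc_id)
    return index


def search(index, word):
    """Return the IDs of documents containing the word (empty if not indexed)."""
    return index.get(word.lower(), set())


docs = {
    1: "To be or not to be",
    2: "The deed of trust",
}
index = build_index(docs)
print(sorted(index))          # only the non-noise words were indexed
print(search(index, "deed"))  # found in document 2
print(search(index, "to"))    # empty: "to" was never indexed
```

Note that document 1 contributes nothing to this index at all, because every one of its words is on the noise list.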

However, there can be drawbacks to not indexing these noise words.  One disadvantage is that if your searches are typically for common phrases, you may not be able to search with precision and you may either get additional non-responsive results or (even worse) miss some responsive results.

Years ago, I attended a presentation by Craig Ball, where he identified the perfect phrase that illustrates the problems with noise words:

“To Be or Not to Be”

This famous phrase from Shakespeare’s Hamlet would not be indexed at all by most search engines – every word in the phrase is a typical noise word.

If a quoted phrase in a search query includes a noise word, the results may include documents with any word in place of the noise word. For example, a search for “deed of trust” might return documents containing the phrases “deed and trust” or “deed under trust”.
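The behavior described above can be sketched in a few lines. This assumes one plausible approach, in which each noise word in the quoted phrase acts as a placeholder that matches any single word in that position; actual behavior varies from tool to tool. The `NOISE_WORDS` list and the `phrase_matches` helper are hypothetical.

```python
import re

# Hypothetical noise-word list for illustration; real tools ship their own.
NOISE_WORDS = {"to", "be", "or", "not", "the", "of", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_matches(phrase, text):
    """Return True if the phrase occurs in the text, treating each noise
    word in the phrase as a placeholder for any single word."""
    query = tokenize(phrase)
    words = tokenize(text)
    if not query:
        return False
    # Slide a window the length of the phrase across the document.
    for start in range(len(words) - len(query) + 1):
        window = words[start:start + len(query)]
        if all(q in NOISE_WORDS or q == w for q, w in zip(query, window)):
            return True
    return False


print(phrase_matches("deed of trust", "The deed of trust was recorded."))
print(phrase_matches("deed of trust", "Held by deed under trust since 1990."))
print(phrase_matches("deed of trust", "The deed was recorded."))
print(phrase_matches("to be or not to be", "Please send the quarterly report today."))
```

Under this assumption, the second call matches even though the document says “deed under trust”, and the last call matches an unrelated six-word sentence, because a phrase made up entirely of noise words places no constraint on the text.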

Most search tools can provide a list of the noise words used, so that you can adjust accordingly when constructing your searches.  So, when preparing a list of search terms, it’s important to remember that noise words exist and that they could affect your search results.  If you have to search for phrases that contain noise words, you may retrieve some non-responsive hits, so be prepared to review the results to determine how effectively the search retrieved what you intended.  Don’t let noise words drown out your ability to effectively search your collection!
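One way to apply this when preparing terms is to screen the proposed list against the tool's noise-word list before running anything. The sketch below assumes you have exported that list yourself; the `NOISE_WORDS` set shown here is a made-up sample, and `review_terms` is a hypothetical helper, not a feature of any product.

```python
import re

# Made-up sample; in practice, substitute the list your search tool provides.
NOISE_WORDS = {"to", "be", "or", "not", "the", "of", "and"}


def review_terms(terms, noise_words=NOISE_WORDS):
    """Flag each proposed search term that contains noise words.

    Returns a dict mapping the term to (status, noise words found), where
    status is "all noise" if every word is a noise word, else "partial".
    Terms with no noise words are left out of the report.
    """
    report = {}
    for term in terms:
        words = re.findall(r"[a-z0-9]+", term.lower())
        hits = [w for w in words if w in noise_words]
        if hits:
            status = "all noise" if len(hits) == len(words) else "partial"
            report[term] = (status, hits)
    return report


proposed = ["deed of trust", "to be or not to be", "merger"]
for term, (status, hits) in review_terms(proposed).items():
    print(f"{term!r}: {status}, noise words: {hits}")
```

Terms flagged as “partial” are the ones likely to pull in extra hits that need review; terms flagged as “all noise” probably need to be reworked before they can be searched at all.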

So, what do you think?  Have you encountered issues with noise words in your searches?  How have you addressed those issues?  Please share any comments you might have, or let us know if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.