eDiscovery Daily Blog

You Should Check the Level of Your Fuzzy When Searching: eDiscovery Best Practices

If the title seems odd, let me clarify. I’m talking about “fuzzy” searching, which is a mechanism for finding alternate words that are close in spelling to the word you’re looking for. Fuzzy searching will expand your search recall, but too much “fuzzy” will leave you reviewing a lot of non-responsive hits.

Attorneys may know what terms they’re looking for, but those terms may not always be spelled correctly in the documents. Let’s face it, we all make mistakes. For example, if you’re searching for emails that relate to management decisions, can you be certain that “management” is spelled perfectly throughout the collection? Unlikely. It could be spelled “managment” or “mangement”. Also, you may have a number of image-only files that require Optical Character Recognition (OCR), which is usually not 100% accurate. Without an effective search mechanism, you could miss key documents.

That’s where fuzzy searching comes in. Fuzzy searching enables you to find not just the exact matches of the word or words you’re seeking, but also alternate words that are close in spelling to the word you’re looking for (usually one or two characters off). For example, if you’re looking for the term “petroleum”, you can find variations such as “peroleum”, “petoleum” or even “petroleom” – misspellings, OCR errors or other variations (such as the term in a foreign language) that could be relevant.
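
To make “characters off” concrete, here is a minimal sketch that counts how many single-character insertions, deletions or substitutions separate two words (commonly called Levenshtein or edit distance). This is an assumption about how closeness could be measured, not a description of how any particular review platform implements fuzzy matching, and the function name `edit_distance` is ours.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b (Levenshtein distance)."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


# Each variant from the paragraph above is one character off.
for variant in ["peroleum", "petoleum", "petroleom"]:
    print(variant, edit_distance("petroleum", variant))
```

Under this measure, all three “petroleum” variants are a distance of 1 from the original term.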

However, fuzzy searching can also retrieve other legitimate words that are not relevant. Let’s take the term “concept” – if you perform a fuzzy search which retrieves words that are up to two characters off, you’ll get variations like “consent”, “content” and “concern”. So, it’s important to test your results to evaluate your level of precision vs. recall.
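
If it helps to see the trade-off as numbers, precision is the share of retrieved documents that are responsive, and recall is the share of responsive documents that were retrieved. The sketch below computes both for an exact search and a fuzzy search. The document IDs and the responsive set are entirely hypothetical, made up for illustration.

```python
def precision_recall(retrieved: set, responsive: set) -> tuple:
    """Return (precision, recall) for a set of retrieved document IDs."""
    true_hits = retrieved & responsive
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(responsive) if responsive else 0.0
    return precision, recall


# Hypothetical review results: which documents are actually responsive.
responsive = {"doc1", "doc2", "doc3", "doc4"}

# Hypothetical hits from an exact search and from a broad fuzzy search.
exact_hits = {"doc1", "doc2"}
fuzzy_hits = {"doc1", "doc2", "doc3", "doc5", "doc6", "doc7"}

print(precision_recall(exact_hits, responsive))  # (1.0, 0.5)
print(precision_recall(fuzzy_hits, responsive))  # (0.5, 0.75)
```

In this made-up case, the fuzzy search finds one more responsive document (recall rises from 0.5 to 0.75), but half of what it retrieves is non-responsive (precision falls from 1.0 to 0.5).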

In CloudNine’s review platform, our search interface provides a check box to apply fuzzy searching to the entire term, along with a drop-down to select the level of “fuzzy” (from 1 to 10; the higher the number, the more “fuzzy” the search results). We also enable the user to apply “fuzzy” to individual terms via the ‘%’ character, generally placed after the first character of the term to match words that are one or two characters off. This enables you to run a search that finds documents with only fuzzy hits, excluding the exact term. Here are a couple of text search examples using an Enron demo set of over 117,000 documents:

  • p%%etroleum and not petroleum: Retrieves all documents that have words within two characters of “petroleum”, but not the word “petroleum” itself. In this case, 59 total documents were retrieved and the variations retrieved included words like “petróleos”, “petróleo” and “pertroleum”. The first two are Spanish-language variations of “petroleum”; the third appears to be a misspelling. All of these terms appear responsive, so the precision is still good at this level and we retrieved 59 additional documents that are likely responsive that we wouldn’t have retrieved without fuzzy searching.
  • c%%oncept and not concept: Retrieves all documents that have words within two characters of “concept”, but not the word “concept” itself. In this case, 5,304 total documents were retrieved and the variations retrieved included words like “consent”, “Concast”, “content” and “concern”. We retrieved a high number of documents with clearly non-responsive terms, so this search is proving to be overbroad and we may need to dial it back. If we reduce it to words within one character of “concept”, but not the word “concept” itself, we get 291 total documents retrieved and a number of those non-responsive variations are eliminated, giving us a more precise search.
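
The “fuzzy but not exact” idea in the two searches above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses plain edit distance as a stand-in for the platform’s matching (which may well work differently, for example in how it treats accented characters), the helper names `edit_distance` and `fuzzy_not_exact` are ours, and the four tiny documents are made up rather than taken from the Enron set.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two words."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def fuzzy_not_exact(term: str, documents: dict, max_distance: int) -> dict:
    """Map doc ID -> sorted variant words, for documents that contain a word
    within max_distance of term but do not contain term itself."""
    hits = {}
    for doc_id, text in documents.items():
        words = set(re.findall(r"\w+", text.lower()))
        if term in words:
            continue  # the "and not <term>" half of the search
        variants = sorted(w for w in words
                          if edit_distance(term, w) <= max_distance)
        if variants:
            hits[doc_id] = variants
    return hits


# Hypothetical mini-collection, for illustration only.
documents = {
    "A": "The concept was approved by management.",
    "B": "We need written consent before the content is released.",
    "C": "Two new concepts were raised as a concern.",
    "D": "Please review the conceptt memo.",
}

for level in (2, 1):
    print(level, fuzzy_not_exact("concept", documents, level))
```

At level 2 this sketch pulls in documents B, C and D, including the unrelated words “consent”, “content” and “concern”. At level 1 it keeps only “concepts” (document C) and the misspelling “conceptt” (document D), which mirrors the way dialing back from two characters to one trimmed the non-responsive variations in the second search above.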

Think of fuzzy searching as a “dial”. If you “dial” it up a little bit, you can retrieve additional responsive hits without sacrificing precision in your search. If you “dial” it up too much, you’ll be reviewing a lot of non-responsive hits and documents. Test your results to play with the “dial” until you get the most appropriate balance of recall and precision in your search.

So, what do you think? Does your keyword search strategy include the use of fuzzy searching? Please share any comments you might have, or let us know if you’d like to know more about a particular topic.

eDiscovery Daily will return on Monday. Have a nice Easter!

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscoveryDaily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.
