Sunday, October 25, 2020

Scalable Search Engines

I have now worked at three different companies where we needed to provide a search feature that could scale into the millions of documents.  Two of these involved consumer products and the third involved documents for legal review.  I saw a very interesting contrast in the requirements between these two domains that I thought might be worth sharing.

The Use Cases

In the consumer product space, we had millions of users a day searching across hundreds of millions of documents, while in the legal technology space, we had only a few thousand users a day searching across tens of millions of documents.

On those numbers, it would seem like the legal domain is a much easier system design problem to solve.  However, there are a couple of key differences between the two use cases that can make the legal search problem much, much harder.

Data Shapes

Consumer products result in fairly small documents: there is a title, some attributes and maybe a description that need to be indexed. Legal review can involve any type of document: if it is on someone's hard drive or in their cloud accounts, it may be subject to legal review.  Out in the wild, there are Excel data files in excess of 120MB and PDF files with tens of thousands of pages.  They all need to be in the search index.  (Processing files that exhibit these extremes is a nightmare in itself, but here we'll stick with the search-related problems.)

Besides the potential for these large documents, the big issue that makes the system design hard is the variance in sizes. There are plenty of small documents to go along with the larger ones in a legal review.  In contrast, consumer products are relatively uniform in the size of what needs to be indexed.

Query Shapes

With consumer products, a typical search expression consists of a handful of words at most (on average, about 2.3 words). Users mostly express simple concepts like "Sony 55 in. TV". In the legal space, queries are not only much longer, but a considerable amount of time is spent crafting the searches to home in on specific concepts. Search expressions are serious business for lawyers. Competing parties will negotiate search terms and argue about them before judges. It is not uncommon for a single search to have hundreds of terms.

The size of the search expression is only a small part of the story though.  The legal domain requires all the special search operators you can think of: wildcard searches, proximity searches, range searches, etc.  Not only do they require these operators, they use them extensively.

Consider searching for a specific person of interest in a legal case whose full legal name is "Larry Thomas Johnson".  How do you search for documents that mention this name so as not to miss any?  "Larry Johnson", "L. Johnson", "L. T. Johnson", "Larry T. Johnson" are some of the more obvious variations that might be out there. In the legal world, they will do something like this: "L? \2 Johnson", where "?" is a wildcard matching any suffix and "\2" is a proximity search looking for occurrences within 2 words.
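
To give a sense of what a query like that asks of the engine, here is a minimal sketch of how it might be expressed directly against Lucene.  The field name "body" and the class choices are my own illustration based on the Lucene 8.x span query APIs, not how any particular product builds it:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.spans.SpanMultiTermQueryWrapper;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class NameVariantQuery {
        public static void main(String[] args) {
            // "L? \2 Johnson": any term starting with "l" within two positions of "johnson".
            SpanQuery lPrefix = new SpanMultiTermQueryWrapper<>(new PrefixQuery(new Term("body", "l")));
            SpanQuery johnson = new SpanTermQuery(new Term("body", "johnson"));
            SpanQuery near = new SpanNearQuery(new SpanQuery[] { lPrefix, johnson }, 2, true);
            System.out.println(near);
        }
    }

The wildcard has to be expanded into every matching term in the index before the proximity check can even run, which is exactly where the CPU and memory costs come from.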

Requiring these sophisticated queries makes perfect sense in the legal context, but they are also the types of queries that stress a search engine (in both CPU and memory resources).  The contrast of these two domains is between a high volume of simple searches and a lower volume of more complex searches. 

As with the data shapes, in the legal space the query shapes tend to have a very high variance.  There are plenty of simple queries to mix in with the more sophisticated ones.  A handful of sophisticated queries (sometimes just one) can consume enough resources to add latency to the simpler queries.  In contrast, the consumer product search query complexity is much more regularized and predictable.

Precision vs. Recall

Suppose you have a product catalog containing 10,543 televisions and a user searches for the term "TV". Let's say your search engine happens to match only 10,327 of them for some reason, a recall of roughly 98%. Is the user going to notice?  Probably not. They have no idea how many products are in your catalog, or even which specific products are in it. The recall requirements here are a bit loose.

In contrast, in the legal domain, there is often a single document that is the "smoking gun" that can make or break a case.  If an attorney searches for it and your search engine misses it in the search results, you will soon be out of business. If the document exists, your search engine *must* match it. 

The recall requirements are strict in the legal domain, but the precision requirements are not. An attorney will not mind sifting through a few non-relevant documents.  They are good at pattern matching and filtering and would much prefer to get a few extra documents than to miss any.

Consumer product searches are a bit different.  Showing a bunch of microwave ovens in your search results for a "TV" search is going to be a bad user experience if it happens too frequently.  For product catalogs, trading off some amount of recall to improve precision is a good choice: it leads to "cleaner" search results for the user, and they are none the wiser about what they might be missing. But in the legal domain, flawless recall is a non-negotiable requirement.

Data Normalization

It is common for search engines to use "stop words" to filter out words with low semantic content, e.g., "the", "a", "an". Often this filtering is the default behavior.  What the consumer product and legal domains share in common is that stop words are a bad idea.  There was a consumer brand named "THE" and an important legal case involving "Project A".  Feed that data through a stop word filter and you lose some very important semantic information.
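
As a rough illustration of what that default behavior does (the analyzers and field name are my choices for this sketch, roughly following the Lucene 8.x APIs), compare the same text run through a stop-word analyzer and a stop-word-free one:

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class StopWordDemo {
        static void printTokens(Analyzer analyzer, String text) throws IOException {
            try (TokenStream ts = analyzer.tokenStream("body", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.print("[" + term + "] ");
                }
                ts.end();
                System.out.println();
            }
        }

        public static void main(String[] args) throws IOException {
            // With English stop words, "a" is dropped and "Project A" indexes as just "project".
            printTokens(new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET), "Project A");
            // With no stop words, both tokens survive and the phrase remains searchable.
            printTokens(new StandardAnalyzer(CharArraySet.EMPTY_SET), "Project A");
        }
    }

Once "a" is gone from the index, no amount of clever querying brings it back, so this is a decision that has to be made at indexing time.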

Technology

All of the search technologies I have used were based on Lucene. This includes Elasticsearch, Solr and even a home-grown distributed version of Lucene back in the days before Elasticsearch existed. Lucene is an amazing piece of software and so is Elasticsearch.  There is mostly no reason to use anything else. The only downside I found was in the legal domain, where the data shapes and query shapes are atypical of most other domains.

Specifically, we ran into issues with limits and resource usage.  Earlier versions of Lucene/Elasticsearch lacked limits in some important code paths, which resulted in runaway CPU or memory usage. As they have addressed these "holes", they have put limits in place that are too low to support some of the legitimate legal queries that need to be supported.  Some thresholds are adjustable, but not all are, and there is always some amount of peril in playing with too many of the default parameters.
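
One concrete example of this kind of threshold (my example, not a specific incident from any of these systems) is Lucene's boolean clause limit, which a long negotiated term list or an expanding wildcard can run into.  A sketch of reading and raising it, as it looked in the Lucene 8.x line:

    import org.apache.lucene.search.BooleanQuery;

    public class ClauseLimit {
        public static void main(String[] args) {
            // The default is 1024 clauses; exceeding it throws BooleanQuery.TooManyClauses.
            // Elasticsearch exposes a similar knob as indices.query.bool.max_clause_count,
            // and later Lucene releases moved this setter over to IndexSearcher.
            System.out.println("default max clause count: " + BooleanQuery.getMaxClauseCount());

            // Raising the limit buys headroom for legitimate giant queries, at the cost of
            // letting a single query consume more CPU and heap.
            BooleanQuery.setMaxClauseCount(8192);
        }
    }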

The Lucene and Elasticsearch teams focus on the most common cases for query and data shapes, which is the right thing for them to do. Unfortunately, the legal domain use cases fall outside of the norm.  Consumer product searches, on the other hand, are more in line with their priorities.

Conclusion

There are critically important differences in the search requirements for the two domains I have worked in.  It feels like these represent two extremes of a problem space.  I wonder how many other domains match one of these two, or whether there are additional "classes" of search requirements with their own unique characteristics.
