Sunday, December 20, 2020

Microservices - Why?

 


I read an article with the basic theme:

"Microservices was a good idea taken too far and applied too bluntly."

Link: Microservices — architecture nihilism in minimalism's clothes

I agree with a lot of it: I have seen the complications this design pattern leads to and have often wondered whether it was worth it.

This matches a familiar pattern that plays out time and again. A similar thing happened (is still happening?) with NoSQL, and you could say the same thing:

"NoSQL was a good idea taken too far and applied too bluntly."

Ditto for the semi-recent trend of "Single Page Webapps". XML was another one.  The list is broad, deep and goes further back than my time in the field.

Why is there the constant trend in software development of chasing the latest idea and overusing it until it collapses on itself? Haven't we seen this pattern enough to not repeat our mistakes? Ultimately, this is an immature behavior, but what is the source of that immaturity?  Is it simply because the field only goes back a few decades? Is it that the field is dominated by a younger and less experienced group of people?

Software engineers are initially drawn to the discovery aspect of technology, so they naturally gravitate toward the "new". Human nature leans toward doing the familiar, often ignoring the "why". There is also the time required to learn a new technology, which drives people to get the most out of that investment. Maybe these forces combine so that once the new thing is learned, engineers propagate the new pattern or technology without asking "why".

Maybe it is an education problem. Are we giving students these lessons and warnings in their Software Engineering classes? Wouldn't it be helpful if every student coming out of school knew about this danger and could recognize it as quickly as they can regurgitate the big-O complexity of a bubble sort?

There's also a class culture in software where the "coolness" factor is related to the "newness" factor.  Many engineers look down upon the use of older technologies, ridiculing their use and sometimes shaming people into using new tech.  Who wants to be coding in PHP and be socially outcast from all the cool kids?  This behavior is especially troubling because it promotes the idea that the technology is more important than the problem it is meant to solve.

As an engineering leader, it is important to combat these less-than-rational reasons for adopting a technology. I think I am often viewed as a curmudgeon about adopting new technology, but asking "why" is the responsible thing to do.

 

Sunday, November 8, 2020

Code Reuse is Overrated


As an engineer, I have often been conflicted about the interplay of code reuse and dependency management. Not repeating myself is innate behavior for me, and it is through hard and painful experience that I have learned to be hyper-critical of every new dependency.

Why re-invent the wheel? Because the existing wheel comes with lots of strings attached.   This article does a good job of explaining the pitfalls in depth:

Redundancy vs dependencies: which is worse?

If it is my wheel and it breaks, I know how to fix it and can do so quickly and cheaply (measured in time). If I depend on someone else's wheel, now I have lots of problems. Will they fix it, and when? Will their new version introduce things I am not expecting? Maybe they have dependencies of their own. And that's to name just a few.

I was on a team where a large third-party Java library was added to the project for the sole purpose of using its "isBlank()" function. That was not a good trade-off between reuse and dependency cost. I hear the Javascript/npm world has this same sort of problem in spades.
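For contrast, here is roughly what the in-house alternative looks like. This is a sketch, not the code from that project, and it assumes the only behavior needed was the null/empty/whitespace check:

    // A minimal blank-check helper.  A sketch, assuming the only behavior
    // needed is "is this string null, empty, or whitespace?"
    public final class Strings {
        private Strings() {}

        public static boolean isBlank(String s) {
            return s == null || s.trim().isEmpty();
        }
    }

A handful of lines you own versus megabytes of code you do not: that is the trade-off in miniature.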

Too often the Don't Repeat Yourself (DRY) mantra is treated as gospel, with no accounting for the dependency cost. I've seen development cultures where adding dependencies is done often and effortlessly: it does not even register that there should be a decision process around it. As a group, we need to be more thoughtful about the trade-offs we are making when introducing a dependency and assign it the proper cost.

Here is another good article related to this theme:

Small Functions considered Harmful

Thursday, November 5, 2020

Fallacies of Distributed Computing

 

I came across this Wikipedia page:

Fallacies of Distributed Computing

It lists the fallacies as:

  • The network is reliable;
  • Latency is zero;
  • Bandwidth is infinite;
  • The network is secure;
  • Topology doesn't change;
  • There is one administrator;
  • Transport cost is zero;
  • The network is homogeneous;
  • We all trust each other.

These are all good things to keep in mind while you design a distributed system, but I think the use of the word "fallacy" is a bit overstated. I've seen a lot of designs (and existing systems) where some of those items were neglected, but not because the authors held "mistaken beliefs".

Even someone new to distributed systems, if asked "Is the network reliable?", will rightly answer that it is not. If their first designs do not properly account for this, it is not because they had mistaken beliefs, but because of inexperience or oversight.

The same is true for the remaining items: if you asked someone the specific question, you would likely get the right answer, even though their designs may still be lax in that area.

If you forgot to pay your electric bill, I would not conclude that you have the false belief that electricity is free.


Sunday, October 25, 2020

Scalable Search Engines

I have now worked at three different companies where we needed to provide a search feature that could scale into the millions of documents. Two of these involved consumer products and the third involved documents for legal review. I saw a very interesting contrast in the requirements between these two domains that I thought might be worth sharing.

The Use Cases

In the consumer product space, we had millions of users a day searching across hundreds of millions of documents, while in the legal technology space, we had only a few thousand users a day searching across tens of millions of documents.

At this level, it would seem like the legal domain is a much easier system design problem to solve. However, there are a couple of key differences between the two use cases that can make the legal search problem much, much harder.

Data Shapes

Consumer products result in fairly small documents: there is a title, some attributes, and maybe a description that need to be indexed. Legal review can involve any type of document: if it is on someone's hard drive or in their cloud accounts, it may be subject to legal review. Out in the wild, there are Excel data files in excess of 120MB and PDF files with tens of thousands of pages. They all need to be in the search index. (Processing files at these extremes is a nightmare in itself, but here we'll stick with the search-related problems.)

Besides the potential for these large documents, the big issue that makes the system design hard is the variance in sizes. A legal review has plenty of small documents to go with the larger ones. In contrast, consumer products are relatively uniform in the size of what needs to be indexed.

Query Shapes

With consumer products, a typical search expression consists of a handful of words at most (on average, about 2.3 words). Users mostly express simple concepts like "Sony 55 in. TV". In the legal space, queries are not only much longer, but a considerable amount of time is spent crafting the searches to home in on specific concepts. Search expressions are serious business for lawyers. Competing parties will negotiate search terms and argue about them before judges. It is not uncommon for a single search to have hundreds of terms.

The size of the search expression is only a small part of the story though. The legal domain requires all the special search operators you can think of: wildcard searches, proximity searches, range searches, etc. Not only are these operators required, they are used extensively.

Consider searching for a specific person of interest in a legal case whose full legal name is "Larry Thomas Johnson". How do you search for documents that mention this name so as to not miss any? "Larry Johnson", "L. Johnson", "L. T. Johnson", and "Larry T. Johnson" are some of the more obvious variations that might be out there. In the legal world, they will do something like this: "L? \2 Johnson", where "?" is a wildcard matching any suffix and "\2" is a proximity search looking for occurrences within 2 words.
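To give a rough idea of what that looks like at the engine level, here is a sketch of the same name search using Lucene's span queries. The field name, the lowercased terms, and the slop value are illustrative assumptions, not the exact query our system generated:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.WildcardQuery;
    import org.apache.lucene.search.spans.SpanMultiTermQueryWrapper;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class NameQueryExample {
        // "L? \2 Johnson": a wildcard on the first initial within
        // two positions of the surname.
        public static SpanQuery larryJohnsonQuery() {
            SpanQuery prefix = new SpanMultiTermQueryWrapper<>(
                    new WildcardQuery(new Term("body", "l*")));   // "L?"
            SpanQuery surname = new SpanTermQuery(new Term("body", "johnson"));
            return new SpanNearQuery(
                    new SpanQuery[] { prefix, surname },
                    2,      // "\2": at most two positions apart
                    true);  // require the terms in order
        }
    }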

Requiring these sophisticated queries makes perfect sense in the legal context, but they are also the types of queries that stress a search engine (in both CPU and memory resources).  The contrast of these two domains is between a high volume of simple searches and a lower volume of more complex searches. 

As with the data shapes, in the legal space the query shapes tend to have a very high variance.  There are plenty of simple queries to mix in with the more sophisticated ones.  A handful of sophisticated queries (sometimes just one) can consume enough resources to add latency to the simpler queries.  In contrast, the consumer product search query complexity is much more regularized and predictable.

Precision vs. Recall

Suppose you have a product catalog containing 10,543 televisions and a user searches for the term "TV". Let's say your search engine happens to match only 10,327 of them for some reason. Is the user going to notice? Probably not. They have no idea how many products are in your catalog, or even which specific products are in it. The recall requirements here are a bit loose.
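(In information-retrieval terms, the recall on that query is 10,327 / 10,543, or roughly 98%, and no shopper is counting.)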

In contrast, in the legal domain, there is often a single document that is the "smoking gun" that can make or break a case.  If an attorney searches for it and your search engine misses it in the search results, you will soon be out of business. If the document exists, your search engine *must* match it. 

The recall requirements are strict in the legal domain, but the precision requirements are not. An attorney will not mind sifting through a few non-relevant documents. They are good at pattern matching and filtering, and would much prefer to get a few extra documents than to miss any.

Consumer searches for products are a bit different. Showing a bunch of microwave ovens in the results for a "TV" search is going to be a bad user experience if it happens too frequently. For product catalogs, trading off some amount of recall to improve precision is a good choice: it leads to "cleaner" search results for the user, and they are none the wiser about what they might be missing. But in the legal domain, flawless recall is non-negotiable.

Data Normalization

It is common for search engines to use "stop words" to filter out words with low semantic content, e.g., "the", "a", "an". Often this filtering is the default behavior. What the consumer product and legal domains share is that stopwords are a bad idea. There was a consumer brand named "THE" and an important legal case involving "Project A". Feed that data through a stopword filter and you lose some very important semantic information.
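In Lucene terms, the remedy is simply to index with an empty stopword set. A minimal sketch, assuming a StandardAnalyzer-based setup (the analyzer choice here is illustrative):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class NoStopwordsAnalyzer {
        // Keep every token, including "the", "a", "an", so terms like the
        // brand "THE" or "Project A" survive indexing.
        public static Analyzer create() {
            return new StandardAnalyzer(CharArraySet.EMPTY_SET);
        }
    }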

Technology

All of the search technologies I have used were based on Lucene. This includes Elasticsearch, SOLR, and even a home-grown distributed version of Lucene back in the days before Elasticsearch existed. Lucene is an amazing piece of software and so is Elasticsearch. There is mostly no reason to use anything else. The only downside I found was in the legal domain, where the data shapes and query shapes are atypical of most other domains.

Specifically, we ran into issues with limits and resource usage. Earlier versions of Lucene/Elasticsearch lacked limits in some important code paths, which resulted in runaway CPU or memory usage. As they have addressed these "holes", they have put limits in place that are too low to support some of the legitimate legal queries that need to run. Some thresholds are adjustable, but not all are, and there is always some amount of peril in playing with too many of the default parameters.
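As one concrete example of the kind of threshold I mean (which knob bites will vary by version), Lucene has historically capped boolean query expansion at 1024 clauses, a limit that a single wildcard-heavy legal query can exceed. A hedged sketch of raising it:

    import org.apache.lucene.search.BooleanQuery;

    public class QueryLimits {
        public static void raiseClauseLimit() {
            // The historical default of 1024 clauses is easily exceeded once
            // a wildcard in a large legal query expands.  (This setter has
            // moved or been deprecated in newer Lucene versions; Elasticsearch
            // exposes a similar setting, indices.query.bool.max_clause_count.)
            BooleanQuery.setMaxClauseCount(10_000);
        }
    }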

The Lucene and Elasticsearch teams focus on the most common query and data shapes, which is the right thing for them to do. Unfortunately, the legal domain's use cases fall outside the norm. Consumer product searches, on the other hand, are more in line with their priorities.

Conclusion

There are critically important differences in the search requirements for the two domains I have worked in. It feels like they represent two extremes of a problem space. I wonder how many other domains match one of these two, or whether there are additional "classes" of search requirements with their own unique characteristics.