Are the following two products the same?
Do you think this is an easy question to answer or a hard one? The correct answer is: "it depends" and the bigger part of that is "it depends on how you define the word 'product'".
I have worked on this problem, called "product normalization", for more than 5 years and at two different companies. Not only was I required to be able to answer questions like these, but I needed to encode the "rules" for doing this and build software systems that could make these decisions automatically (and at scale).
Before diving any deeper, if you believe the answer to the question is either "yes" or "no", let me try to convince you that the answer is not so cut-and-dried.
The first search-based company I worked for was a comparison shopping company. It was a site where you could find the best price for a product by comparing all the store prices in one place. In this context, those two products above are definitely *not* the same. One of them is going to cost twice as much, so if we showed the price for the "2-pack" from one store next to the price for a "4-pack" from another store, we would not be providing the user with a meaningful comparison experience.
The next company I worked for dealt with managing customer reviews. Suppose there are some meaningful reviews on the 2-pack and none for the 4-pack, but the user is looking at the 4-pack product page. Do we show the 2-pack reviews on the 4-pack page? In this case, the answer is "yes": we would not want the user to miss out on relevant reviews about the quality of the product.
Therefore, when we are comparing prices, the quantity in the package is very important, and these two should not be considered the same. But when trying to make a buying decision based on the quality and features of the product, the quantity is not important.
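One way to make this concrete is to treat "sameness" as equality of a key that depends on why you are matching. The sketch below is only an illustration, not how either company did it: the `Product` fields, the `MatchContext` values, and the `match_key` function are all hypothetical names, and it assumes the brand, product line, and pack quantity have already been extracted cleanly (which is most of the real work).

```python
from dataclasses import dataclass
from enum import Enum


class MatchContext(Enum):
    """Why we are asking whether two products are the same (hypothetical)."""
    PRICE_COMPARISON = "price_comparison"
    REVIEWS = "reviews"


@dataclass(frozen=True)
class Product:
    # Illustrative attributes only; real catalogs are far messier.
    brand: str
    line: str
    pack_quantity: int


def match_key(product: Product, context: MatchContext) -> tuple:
    """Build the identity key that defines 'same product' in a given context."""
    base = (product.brand.lower(), product.line.lower())
    if context is MatchContext.PRICE_COMPARISON:
        # Quantity changes the price, so it is part of the identity.
        return base + (product.pack_quantity,)
    # For reviews, quantity says nothing about quality, so it is ignored.
    return base


def same_product(a: Product, b: Product, context: MatchContext) -> bool:
    return match_key(a, context) == match_key(b, context)
```

With a made-up 2-pack and 4-pack of the same item, `same_product` returns `False` for `MatchContext.PRICE_COMPARISON` and `True` for `MatchContext.REVIEWS`, which is exactly the "it depends" answer from above.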
The Fallacy of UPC matching
The most basic way to compare products for "sameness" is to look at their UPC codes (EAN codes for the Europeans). As illustrated above, this is not going to help if you are matching for the purpose of consolidating product reviews. The 2-pack and 4-pack surely have different UPC codes.
But suppose we only need to solve the problem in the price comparison domain. Comparing UPC codes will tell us immediately these are not the same. This starts to make the price comparison case seem trivial to solve: just match on UPC codes.
This would seem to work great, though you have to assume the following:
- you have all the UPC data for all the products; and
- the UPC data is accurate for all your products.
However, there are some harsh realities about UPCs that really complicate things:
- assumptions about the completeness and accuracy of data are always highly dubious;
- not all products are assigned UPCs (e.g., clothes);
- some products are assigned more than one UPC;
- UPC values can be re-used for completely different products;
- trivial changes to a product can result in a new UPC (e.g., sometimes just different packaging causes a new UPC to be used).
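To see how these realities bite, here is a minimal sketch of the naive "just match on UPC" approach. The `group_by_upc` function and the `(offer_id, upc)` input shape are assumptions made for illustration, and the UPC strings in the usage note are made-up placeholders, not real codes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def group_by_upc(
    offers: Iterable[Tuple[str, Optional[str]]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Naively group store offers by exact UPC string.

    Returns (groups, unmatched): groups maps each UPC to the offer ids
    that carry it; unmatched lists offers with no usable UPC at all.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    unmatched: List[str] = []
    for offer_id, upc in offers:
        cleaned = (upc or "").strip()
        if not cleaned:
            # No UPC (e.g., clothes, or simply missing data): cannot be matched.
            unmatched.append(offer_id)
        else:
            groups[cleaned].append(offer_id)
    return dict(groups), unmatched
```

Feed it `[("store_a", "000000000011"), ("store_b", "000000000011"), ("store_c", "000000000022"), ("store_d", None)]` and stores A and B are grouped together, but store C ends up in its own group even if it sells the identical item under a second UPC, and store D is not matched at all. Nothing in this approach can detect a re-used UPC either: two different products that share a code are silently merged.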
We can illustrate further with a final example. Are these two products the same?
The quantity is the same, they are now both 4-packs, but the packaging is slightly different.
The answer would seem to be "yes, they are the same" for both the price comparison and the product review cases. However, flip the packages over and you will notice they have completely different UPC values:
Hopefully this gives you some introductory flavor for the complications of product normalization. I've written a much more in-depth and technical document on this subject. It's a whopping 60 pages, so not something to read unless you really need to solve this problem and need some ideas of approaches. You can read it here: