How to Match Products with AI

The Product LLM specializes in product matching: the machine learning task of classifying whether two product descriptions refer to the same real-world product. Across our various test sets, the model achieves 95% accuracy or better on high-certainty cases. We got there by leveraging the latest research, aggregating a large dataset of manually scored product pairs, and experimenting with our own techniques.

In this article, we will share how a product matching AI is built. However, the easiest way to match products with AI is actually to head to our API playground and try our model for free. Our documentation covers how to match products with our service.

Building a product matching AI

Matching products is well suited to large language models, as discussed in this article. So we'll start with a large language model (LLM) from one of the many providers. We won't cover the basics of using LLMs here, but the latest ChatGPT models are an easy place to start. We recommend working through the API, though the chat interface is fine for a quick test.
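
To make this concrete, here is a minimal sketch of asking an LLM whether two products match through an API. We use the OpenAI Python SDK purely as an illustration; the model name and product descriptions are placeholders, and any provider's chat API would work the same way.

```python
# Minimal sketch: asking an LLM whether two product descriptions match.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY environment variable;
# the model name below is a placeholder -- any capable chat model will do.
from openai import OpenAI

client = OpenAI()

product_1 = "Crest fresh mint toothpaste 8.2 oz w whitening"
product_2 = "Colgate Total Whitening Paste, 4.8 oz"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a product matching assistant."},
        {"role": "user", "content": (
            "Do the following two product descriptions match?\n"
            f"Product 1: {product_1}\n"
            f"Product 2: {product_2}\n"
            "Answer Yes or No."
        )},
    ],
)

print(response.choices[0].message.content)
```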

Now that you have an LLM to prompt, the temptation is to tweak a prompt a few times and compare "improvements" one prompt at a time. Don't fall into this trap! Without a decent sample of tests, you don't really know whether your changes are helping. Instead, start by labeling a product matching dataset and building a structured testing and iteration pipeline (see below).

That being said, let's first look at some of the prompts that perform the best.

Prompting methods: the building blocks of AI matching

There are many prompting strategies, but one of the most important principles is that you need to be very precise in what you ask in an LLM prompt. The model does not know what you mean unless you give it the appropriate context.

In a similar vein, since you will be feeding product data for matching, you want to make sure your product data is presented in a clean way. Are the descriptions themselves clean? Abbreviations are fine, but missing information, missing spaces, and data without context on where it came from can be confusing. Additionally, if your prompt text runs long, you can make use of certain characters to group sections of your prompt. For example, three tick marks ``` can demarcate the start and end of a section of text.
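
As an illustration, here is one way to present product data cleanly inside a delimited section. The helper function and field names below are our own hypothetical choices, not a required format.

```python
# Hypothetical helper that wraps a product record in a labeled, delimited
# block so the model knows where each description starts and ends.
DELIM = "`" * 3  # three tick marks, built programmatically for readability

def format_product(label: str, fields: dict) -> str:
    lines = [f"{key}: {value}" for key, value in fields.items() if value]
    return f"{label}:\n{DELIM}\n" + "\n".join(lines) + f"\n{DELIM}"

print(format_product(
    "Product 1",
    {
        "Source": "Retailer feed A",
        "Brand": "Crest",
        "Description": "Fresh mint toothpaste w/ whitening",
        "Size": "8.2 oz",
    },
))
```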

As for the prompting strategies...

First, we have in-context learning. This lets the model learn from examples embedded directly in each model call. In-context learning performs especially well when the examples form a representative sample of what you expect to see. You can give the model a lot of in-context examples, but at a certain point some models struggle with too much information, and more tokens mean higher cost.
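
As a sketch, in-context examples can be packed into the prompt you send on every call. The labeled pairs below are illustrative only; in practice you would draw them from your own labeled data.

```python
# Sketch: embedding labeled pairs directly in the prompt (in-context learning).
# The example pairs are illustrative only.
examples = [
    ("Dyno D1 Tape 24mm", "Dymo D1 21mm x 7m", "No"),
    ("Duracell AA batteries 12 pack", "Duracell Coppertop AA, 12 ct", "Yes"),
    ("HP 65XL Black Ink Cartridge", "HP 65 Tri-color Ink Cartridge", "No"),
]

def build_few_shot_prompt(product_1: str, product_2: str) -> str:
    parts = ["Do the following two product descriptions match?"]
    for i, (a, b, answer) in enumerate(examples, start=1):
        parts.append(f"Example {i}:\n{a}\n{b}\nExample Answer: {answer}")
    parts.append(f"Product 1: {product_1}\nProduct 2: {product_2}\nAnswer Yes or No.")
    return "\n\n".join(parts)
```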

Next, there is chain-of-thought prompting. This asks the model to reason through multiple steps before reaching an answer. While this is less relevant to product matching, you could ask the model to first compare the similarity of product codes and/or brands.
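
A chain-of-thought style prompt for matching might look something like the sketch below; the exact wording is ours, not a benchmarked prompt.

```python
# Sketch of a chain-of-thought style prompt template: the model is asked to
# compare specific attributes step by step before giving a final answer.
# Fill in the placeholders with str.format() before sending.
cot_prompt = (
    "Do the following two product descriptions match?\n"
    "Think step by step:\n"
    "1. Compare any product codes (UPC, manufacturer part numbers).\n"
    "2. Compare the brands.\n"
    "3. Compare attributes such as size, color, and quantity.\n"
    "Then give a final answer of Yes or No.\n\n"
    "Product 1: {product_1}\n"
    "Product 2: {product_2}"
)
```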

We can also try instruction-based prompts. These can perform well if your matching logic can be expressed as simple rules.

Let's look at a few prompt options. We pulled battle-tested prompts from several research papers and from our own experiments to see which ones performed best. Some of the best options are summarized below. For more details, see "Entity Matching using Large Language Models" by Ralph Peeters and Christian Bizer, 2024.

1. Simple Prompt

“Do the following two product descriptions match?”

While simple prompts can work with fine-tuning, they can be a bit ambiguous for a model used out of the box. The model needs either context on your requirements or real-world training examples to teach it what counts as a match. Peeters and Bizer tested simple prompts across different models and found that a few performed well with them, but most performed poorly. In our own tests, we've seen simple prompts win out over other strategies for some language models.

2. In-Context Learning

“Do the following two product descriptions match?
Example 1:
Dyno D1 Tape 24mm
Dymo D1 21mm x 7m
Example Answer: No

Product 1: Crest fresh mint toothpaste 8.2 oz w whitening

Product 2: Crest Bspx Whitening8.2 Size 8.2z Crest Baking Soda & Peroxide Toothpaste W/Tartar Control Fresh Mint

Please provide your answer in JSON format…”

When you preface your request with three or more examples, the model generalizes from them to your specific pair, improving performance. In Peeters and Bizer's study, this few-shot method often boosted accuracy significantly, especially for datasets with unusual products. You can hand-pick static examples, or you can randomly choose examples from the same product category, as sketched below.
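
If you go the random-selection route, drawing examples from the same category as the pair being matched might look like this. The record fields are hypothetical; adapt them to your own dataset schema.

```python
import random

# Sketch: draw a few labeled examples from the same category as the pair
# being matched. The record keys ("category", "left", "right", "label")
# are hypothetical -- adapt them to your own dataset schema.
def sample_examples(labeled_pairs: list[dict], category: str, k: int = 3) -> list[dict]:
    pool = [pair for pair in labeled_pairs if pair["category"] == category]
    return random.sample(pool, min(k, len(pool)))
```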

3. Score-Based Methods

Instead of forcing a yes or no decision, this method asks the model to give a probability-style score:

"On a scale of 1 to 10, score the similarity between the two products below, where a score out of 10 represents the likelihood that they are the same product. That is, a score of 8 suggests there is an 80% chance they are the same products."

This approach seems attractive for conveying certainty and match class simultaneously, but we saw poor results with numeric scoring. This is likely because the model cannot reliably distinguish a 60% probability from a 70% probability, and in any case a 10-category outcome variable is not well suited to fine-tuning.

4. Instruction-Based Methods

If examples aren’t broad enough, clear guidance could help. We tried a few instructions, such as the one below:

“Do the following two product descriptions match?

Instructions: The descriptions below may be missing information. Base your answer on your judgement.
1. Products with matching UPC codes are matches
2. Products with matching manufacturer codes are usually matches, if description details do not differ (e.g., color, size, etc.)
3. Products are definitely different if they have different brands, manufacturers, models, colors, sizes, etc.”

This method can work well when your domain knowledge can be distilled into rules, but make sure the rules are flexible enough to handle different matching scenarios and potentially missing information.

5. Other methods

There are other methods, like chain of thought, that ask the model to compare product codes or brands before comparing the products themselves. You can also sample the model multiple times and use the level of agreement as a certainty estimate, as sketched below. Most importantly, you can add image data to the mix to improve accuracy. However, an image request can cost 10x as much as a small text-based request, so that trade-off may not be worth it depending on your use case.
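
Here is a rough sketch of sampling the model several times and using agreement as a certainty signal. The `ask_model` function is a placeholder for whatever API call you use.

```python
from collections import Counter

# Sketch: query the model several times and treat the level of agreement
# as a rough certainty score. `ask_model` is a placeholder for your own
# API call that returns "Yes" or "No".
def match_with_certainty(product_1: str, product_2: str, ask_model, n_samples: int = 5):
    votes = Counter(ask_model(product_1, product_2) for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    certainty = count / n_samples
    return answer, certainty
```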

Finally, when training, there are ways to help guide the model toward learning more quickly.

Ultimately, it’s important to check multiple prompting methods for your given model, because different models respond differently.

There is no universally best prompt. Prompt design is a parameter that needs to be tested and tuned efficiently.

Training your model: you need to iterate

LLMs still don't perform well enough on product matching out of the box. You'll need to fine-tune your model, and you'll need to experiment with the best prompting strategy for your chosen LLM.

The best way to start is by aggregating a dataset with labeled training and testing data. This dataset should store product pairs and whether or not each pair is a match. Be careful here: your model is only as good as your dataset!
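
A labeled pair can be as simple as two descriptions and a boolean. Here is a minimal sketch of the structure; the field names are our own, not a required schema.

```python
from dataclasses import dataclass

# Minimal sketch of a labeled record for training and testing.
# Field names are illustrative, not a required schema.
@dataclass
class LabeledPair:
    product_1: str
    product_2: str
    is_match: bool
    category: str = ""  # optional, useful for sampling in-context examples

train_set = [
    LabeledPair("Dyno D1 Tape 24mm", "Dymo D1 21mm x 7m", False, "office supplies"),
]
```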

Once you have your dataset, you can build a structured pipeline to run and evaluate how each prompt does. Pick the prompt with the best accuracy, precision, or recall (depending on your needs).
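
A sketch of that evaluation loop using scikit-learn metrics is shown below; `run_prompt` stands in for whatever call applies a given prompt template to a pair and returns a yes/no decision.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Sketch: score each prompt template against the labeled test set.
# `run_prompt(template, pair)` is a placeholder that returns True/False.
def evaluate_prompts(templates, test_set, run_prompt):
    results = {}
    y_true = [pair.is_match for pair in test_set]
    for name, template in templates.items():
        y_pred = [run_prompt(template, pair) for pair in test_set]
        results[name] = {
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred),
            "recall": recall_score(y_true, y_pred),
        }
    return results
```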

Armed with the best prompting strategy, it's time to train your model. LoRA and QLoRA are common fine-tuning approaches, and there is plenty of public guidance on how to set them up. Make sure that you use the same prompt format for training as you will use in production. Also, make sure you evaluate your model on a different testing set than it was trained on! You don't want to overfit.
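
For reference, a LoRA fine-tune with the Hugging Face peft library is typically configured along these lines. The base model name and hyperparameters below are placeholders, not recommendations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Sketch: wrap a base model with LoRA adapters before fine-tuning.
# The model name and hyperparameters are placeholders.
base_model_name = "your-base-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

lora_config = LoraConfig(
    r=16,                 # adapter rank
    lora_alpha=32,        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # depends on the model architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```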

The takeaway here is that you can only achieve strong performance with the right infrastructure to test prompts, evaluate model performance, and track results. Each time you edit the prompt or fine-tune the model, you need to be able to measure the improvement and iterate. Additionally, what works for one model may not work for others, and what works in one product category might fail in another.

Matching with The Product LLM

If you want to get the best model without the time cost, we’ve already battle-tested the best prompts across the top LLMs. We're constantly staying up to date with the latest models, and we're optimizing performance based on a rapidly growing real-world product dataset.

You can try our model, "theproductllm-mini," directly in our API playground. It delivers top-tier accuracy out of the box, it's free to try, and it's affordable for use at high volumes.