How to Match Products with AI


The Product LLM specializes in product matching: the machine learning task of classifying whether two product descriptions refer to the same real-world product. Across various real-world datasets, our model achieves 95% accuracy or higher on high-certainty cases.

In this article, we will share how a product matching AI is built. To get the best performance, you need to:

  1. Aggregate accurate testing data
  2. Build a structured pipeline for testing any changes to your model
  3. Experiment with different models, prompts, and fine-tuning parameters

That being said, the easiest way to match products with AI is to head to our API playground and try our model for free. Our documentation covers how to match products with our service.

Start with an existing large language model

Large language models recently became the state of the art for matching products, as discussed in this article. Thus, we'll start with a large language model (LLM) from one of the many providers (OpenAI, Google, Meta, Qwen, DeepSeek, etc.). To access one, refer to the latest documentation from your provider of choice.
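
To make this concrete, here is a minimal sketch of a single matching call using the OpenAI Python SDK; the model name, system message, and prompt wording are placeholders rather than recommendations, and other providers' SDKs follow the same pattern.

```python
# Minimal sketch of one product-matching call with the OpenAI Python SDK.
# The model name and prompt wording below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

product_a = "Dymo D1 label tape, 24mm x 7m, black on white"
product_b = "DYMO D1 Standard Labels 24 mm x 7 m"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # swap in whichever model your provider offers
    messages=[
        {"role": "system", "content": "You decide whether two product descriptions refer to the same product."},
        {"role": "user", "content": (
            "Do the following two product descriptions match?\n\n"
            f"Product 1: {product_a}\nProduct 2: {product_b}\n\n"
            "Answer Yes or No."
        )},
    ],
)

print(response.choices[0].message.content)
```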

Once you have an LLM to prompt, most people immediately start trying a couple of prompts and comparing "improvements" one at a time. Don't fall for this trap! Without a decent sample of tests, you can't know whether your changes are actually helping. Instead, start by aggregating a clean product matching dataset and building a structured testing and iteration pipeline (see below).

Aggregate clean data

We aggregated data on hundreds of thousands of products across a broad array of product categories. Each row contains an individual product pair and whether or not the pair is a match.

Each record is reviewed by multiple quality-checking experts for accuracy. We then split that data into separate datasets for fine-tuning our model (training) and for testing our final model configurations.
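
For illustration, assuming your labeled pairs live in a CSV with hypothetical columns description_a, description_b, and is_match, a stratified train/test split might look like the following sketch.

```python
# Sketch of splitting labeled product pairs into training and testing sets.
# Column names (description_a, description_b, is_match) are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

pairs = pd.read_csv("product_pairs.csv")  # one labeled product pair per row

train_df, test_df = train_test_split(
    pairs,
    test_size=0.2,               # hold out 20% of pairs for final evaluation
    stratify=pairs["is_match"],  # keep the match/non-match ratio similar in both splits
    random_state=42,             # reproducible split
)

train_df.to_csv("train_pairs.csv", index=False)
test_df.to_csv("test_pairs.csv", index=False)
```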

Ultimately, your model is only as strong as your training data, which is why our matching data is our secret sauce.

Iterate with a structured pipeline

Before trying different prompts one by one, you should set up a structured pipeline for testing future improvements to your model.

We built a testing script that evaluates the accuracy, precision, and recall of a given model configuration. Each time we made a change to the model, we tested whether it actually performed better on our testing dataset. Note: always test your model on a different dataset from the one you use to train it.
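
As a minimal sketch, assuming you have already parsed the model's answers into 0/1 predictions alongside the ground-truth labels from the testing dataset, the scoring step can be this simple.

```python
# Sketch of scoring one model configuration on the held-out test set.
# y_true and y_pred are lists of 0/1 labels (1 = match); the values are dummies.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground-truth labels from the test set
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]  # model predictions parsed from its responses

print(f"accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"precision: {precision_score(y_true, y_pred):.3f}")
print(f"recall:    {recall_score(y_true, y_pred):.3f}")
```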

If your tests are efficient to run, you can make many parameter tweaks and test enough iterations to find what performs the best. This structured iteration pipeline is the only way to get to the best model.

Experiment with different prompt techniques

With our testing pipeline in place, it's time to try various prompting techniques.

Overall, we should keep a few general principles in mind (a prompt sketch applying them follows the list):

  1. Be precise in what you ask for in an LLM prompt. The model does not have context unless you provide it.
  2. Present your product data cleanly. Generally, if it's human-legible, it should be fine. However, missing spaces or data without context on where it came from can be confusing.
  3. Missing information makes it very hard to distinguish between similar products.
  4. Separate long-running text sections with special characters. For example, three tick marks ``` can demarcate the start and end of a long product description.
  5. Define the format you would like the model to respond with (e.g., JSON).
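
To make these principles concrete, here is a hypothetical prompt template that labels each field, fences the long descriptions with triple tick marks, and pins down the response format. The field names (brand, title, description) are illustrative, not a required schema.

```python
# Sketch of a prompt template applying the formatting principles above.
# The field names (brand, title, description) are illustrative placeholders.
FENCE = "```"  # triple tick marks demarcate long free-text sections

def build_prompt(product_1: dict, product_2: dict) -> str:
    return (
        "Do the following two product descriptions match?\n\n"
        f"Product 1\nBrand: {product_1['brand']}\nTitle: {product_1['title']}\n"
        f"Description:\n{FENCE}\n{product_1['description']}\n{FENCE}\n\n"
        f"Product 2\nBrand: {product_2['brand']}\nTitle: {product_2['title']}\n"
        f"Description:\n{FENCE}\n{product_2['description']}\n{FENCE}\n\n"
        'Respond in JSON: {"match": "yes" or "no"}'
    )
```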

As for the prompting strategies...

First, we have in-context learning. This allows the model to learn from examples embedded directly in each model call. In-context learning can perform well, especially when the examples form a representative sample of what you expect to see. You can give the model many in-context examples, but at a certain point, some models will struggle with too much information, and more tokens also mean higher cost.

Next, there is chain of thought prompting. This requests that the model think through multiple steps to get to an answer. While this is less relevant to product matching, you could ask the model to first compare the similarity of product codes and/or brands before scoring the match.

We can also try instruction-based prompts. These can perform well if there are simple rules to follow.

Let's get started with a couple of prompt options that we pulled from the research and our own experiments. For more prompts, see "Entity Matching using Large Language Models" by Ralph Peeters and Christian Bizer, 2024.

1. Simple Prompt

“Do the following two product descriptions match?”

In our tests, we've seen simple prompts win out over other strategies for some language models. However, they can be a bit ambiguous for a model you're using out-of-the-box. The model needs either context on your requirements or real-world training examples to teach it what counts as a match. Peeters and Bizer tested simple prompts across different models and found that some performed well with simple prompts, but most performed poorly.

2. In-Context Learning

“Do the following two product descriptions match?
Example 1:
Dyno D1 Tape 24mm
Dymo D1 21mm x 7m
Example Answer: No

Product 1: Crest fresh mint toothpaste 8.2 oz w whitening

Product 2: Crest Bspx Whitening8.2 Size 8.2z Crest Baking Soda & Peroxide Toothpaste W/Tartar Control Fresh Mint

Please provide your answer in JSON format…”

Once you preface your request with three or more examples, the model will generalize from them to your specific request, improving performance. In Peeters and Bizer's study, this few-shot method often boosted accuracy significantly, especially for datasets with unusual products. You can hand-pick static examples, or you can randomly choose examples from the same product category.
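
The latter can be sketched as follows, assuming your training pairs carry a (hypothetical) category column alongside the descriptions and labels.

```python
# Sketch of picking few-shot examples from the same product category at call time.
# Assumes train_df has hypothetical columns description_a, description_b,
# is_match, and category.
import pandas as pd

def sample_examples(train_df: pd.DataFrame, category: str, k: int = 3) -> str:
    pool = train_df[train_df["category"] == category]
    if len(pool) < k:  # fall back to the full training set for rare categories
        pool = train_df
    lines = []
    for i, row in enumerate(pool.sample(n=k).itertuples(), start=1):
        answer = "Yes" if row.is_match else "No"
        lines.append(
            f"Example {i}:\n{row.description_a}\n{row.description_b}\nExample Answer: {answer}"
        )
    return "\n\n".join(lines)
```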

3. Score-Based Methods

Instead of forcing a yes or no decision, this method asks the model to give a probability-style score:

"On a scale of 1 to 10, score the similarity between the two products below, where a score out of 10 represents the likelihood that they are the same product. That is, a score of 8 suggests there is an 80% chance they are the same products."

This seems like a strong approach for conveying certainty and match class simultaneously, but we saw poor results with numeric scoring. This is likely because the model cannot reliably distinguish a 60% probability from a 70% probability, and in any case, a 10-category outcome variable is not well suited to fine-tuning.

4. Instruction-Based Methods

If examples aren’t broad enough, clear guidance could help. We tried a few instructions, such as the one below:

“Do the following two product descriptions match?

Instructions: The descriptions below may be missing information. Base your answer on your judgement.
1. Products with matching UPC codes are matches
2. Products with matching manufacturer codes are usually matches, if description details do not differ (e.g., color, size, etc.)
3. Products are definitely different if they have different brands, manufacturers, models, colors, sizes, etc.”

This method may work well when your domain knowledge can be distilled into rules, but make sure your rules are flexible across different product categories and robust to cases where information is missing.

5. Other methods

There are other methods like chain of thought, which might ask the model to compare product codes or brands before comparing the products as a whole. You can also sample the model multiple times and use the agreement across samples as a measure of certainty (sketched below). Most importantly, you can add image data to the mix to improve accuracy. However, image requests can cost roughly 5-10x the price of a small, text-based request, so that trade-off might not be worth it depending on your use case.
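
A minimal sketch of the repeated-sampling idea is below; ask_model stands in for whatever single-call wrapper you already have around your provider's API, and the sample count and temperature are placeholders.

```python
# Sketch of sampling the model several times and using agreement as a certainty signal.
# ask_model is a hypothetical callable that sends one prompt and returns "yes" or "no".
from collections import Counter

def match_with_certainty(prompt: str, ask_model, n_samples: int = 5):
    votes = [ask_model(prompt, temperature=0.7) for _ in range(n_samples)]
    counts = Counter(vote.strip().lower() for vote in votes)
    label, count = counts.most_common(1)[0]
    certainty = count / n_samples  # e.g., 4 of 5 "yes" votes -> 0.8
    return label, certainty
```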

Finally, when training, there are ways to help guide the model toward learning more quickly.

Ultimately, it’s important to check multiple prompting methods for your given model, because different models respond differently.

There is no universally best prompt. Prompt design is a parameter that needs to be tuned and tested efficiently.
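
In practice, that can be as simple as sweeping a dictionary of candidate templates through the same evaluation script; evaluate_prompt below is a hypothetical stub standing in for your own pipeline.

```python
# Sketch of treating prompt design as one more parameter to sweep.
# evaluate_prompt is a hypothetical stand-in for your evaluation script, which
# would run a template over every test pair and return its metrics.
def evaluate_prompt(template: str) -> dict:
    # ...call the model on each test pair with this template, parse the answers,
    # and compute accuracy/precision/recall as in the earlier metrics sketch...
    return {"accuracy": 0.0}

prompt_templates = {
    "simple": "Do the following two product descriptions match?\n{pair}",
    "few_shot": "{examples}\n\nDo the following two product descriptions match?\n{pair}",
    "instructions": "{rules}\n\nDo the following two product descriptions match?\n{pair}",
}

results = {name: evaluate_prompt(tmpl) for name, tmpl in prompt_templates.items()}
best = max(results, key=lambda name: results[name]["accuracy"])
print(f"Best template so far: {best} -> {results[best]}")
```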

Train your model

LLMs still don't perform well enough on product matching out-of-the-box. You'll need to fine-tune your model using the best prompt from your prior experimentation.

The most common training approaches are LoRA and QLoRA. LoRA trains a small set of new neural parameters on top of an existing model, and QLoRA does the same with lower memory usage. Both have plenty of public documentation on how to set them up. There is also a separate set of training parameters to define for the fine-tuning itself (e.g., learning rate, epochs), and these can also be optimized through experimentation.
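
As an illustrative sketch only (the base model and every hyperparameter are placeholders, not our production setup), a LoRA configuration with the Hugging Face peft and transformers libraries might look like this.

```python
# Sketch of a LoRA setup with the Hugging Face peft + transformers libraries.
# The base model and all hyperparameter values are illustrative placeholders.
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which attention projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # LoRA updates only a small fraction of weights

training_args = TrainingArguments(
    output_dir="matcher-lora",
    learning_rate=2e-4,                   # fine-tuning parameters like these are worth sweeping
    num_train_epochs=3,
    per_device_train_batch_size=8,
)
```

From here you would hand training_args, the tokenized training pairs, and the adapter-wrapped model to a trainer (e.g., transformers' Trainer or trl's SFTTrainer), then re-run the evaluation script on the held-out test set.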

In general, you'll train your model on a training dataset and then test it on a separate testing dataset. It is important to keep these separate so you can detect if you're simply overfitting your training data.

Ultimately, achieving a good model is possible with a large dataset of accurate data and the right infrastructure to evaluate and iterate. Each time you edit the prompt or fine-tune another model, you need to evaluate the results and iterate. Additionally, what works for one model may not work for others, and what works in one product category might fail for another.

Use The Product LLM instead

If you want to get the best model and save time, we’ve already battle-tested the best prompts across the top LLMs. We're constantly staying up to date with the latest models and optimizing performance based on a rapidly growing real-world product dataset.

You can try our proprietary model, "theproductllm-mini," directly in our API playground. It delivers top-tier accuracy out-of-the-box, it's free to try, and it's affordable for use at high volumes.