LLaMoT: Redefining Entity Resolution with Open-Source LLMs

Panos Korovesis
Data Engineer at BIG DATA LAB, MSc Data Science

Introduction πŸ”Ž

Entity Resolution (ER) is the critical task of identifying and linking records that refer to the same real-world entity across different data sources. Historically, this has been framed under the Filtering-Verification paradigm: Filtering reduces computational cost by grouping similar records into candidate pairs, while Verification examines each pair analytically to make the final match decision.

In my thesis, LLaMoT (Leveraging Large Language Models for Entity Resolution), I explored how the “5th Generation” of ERβ€”driven by the semantic power of Large Language Models (LLMs)β€”can replace traditional deep learning models that require expensive, labeled datasets. By using open-source LLMs in a zero-shot setting, we can achieve high-accuracy results without the need for annotated data or domain-specific fine-tuning.

Full code, optimized prompts, and experimental results are available on GitHub.

The Evolutionary Context: Five Generations of ER πŸ•°οΈ

To understand the impact of LLaMoT, it is essential to look at how Entity Resolution has evolved:

  1. Generation 1 (Veracity): Focused on structured, schema-aware data with noise only in attribute values.
  2. Generation 2 (Volume): Introduced parallelization (e.g., MapReduce) to handle datasets with millions of records.
  3. Generation 3 (Variety): Addressed heterogeneous and noisy data using schema-agnostic approaches.
  4. Generation 4 (Velocity): Introduced real-time and budget-aware ER, prioritizing response times and limited resources.
  5. Generation 5 (Semantics): The current era, where ER systems leverage external semantic knowledge from LLMs, moving beyond simple character or token matching to understand the actual meaning of data.

Methodology 1: Prompt-based Matching (PbM) πŸ’¬

The PbM approach uses the reasoning capabilities of LLMs to act as the final judge in the verification process. I implemented a robust three-step workflow to ensure both accuracy and reproducibility:

  • Match Step: A binary classification phase that filters candidate pairs to remove obvious non-matches. This significantly reduces the workload for the following, more complex steps.
  • Order Step: Using a Compare prompt, the system applies bubble-sort logic to rank candidate records by their relevance to a “key” entity.
  • Select Step: The LLM is presented with the ordered candidates in a multiple-choice format and tasked with selecting the single best match (or “none” if no match exists).

The Power of “Optimal” Prompts

One of my most significant findings was that LLMs are incredibly sensitive to instructions. I developed Optimal Match prompts that included hard constraints like “Answer ONLY ‘Yes’ or ‘No’” and ended with the prefix “The Answer is:”. This simple refinement suppressed chattiness and unnecessary reasoning tokens, which reduced run-times by up to 45% for models like Phi-4 and eliminated invalid responses.
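An Optimal Match prompt in this spirit might be built like the sketch below; the exact wording used in the thesis may differ, so treat the template as illustrative.

```python
def build_match_prompt(record_a: str, record_b: str) -> str:
    # Hard constraints keep the model from generating reasoning tokens:
    # a strict answer format plus a completion prefix the model must extend.
    return (
        "Do the following two records refer to the same real-world entity?\n"
        f"Record A: {record_a}\n"
        f"Record B: {record_b}\n"
        "Answer ONLY 'Yes' or 'No'. Do not explain your reasoning.\n"
        "The Answer is:"
    )

prompt = build_match_prompt("Sony WH-1000XM4 headphones",
                            "Sony WH1000XM4 wireless headset")
print(prompt)
```

Ending with “The Answer is:” turns the task into a one-token completion, which is also trivial to parse: any response other than “Yes” or “No” can be rejected as invalid.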


Methodology 2: Embedding-based Matching (EbM) πŸ•ΈοΈ

EbM leverages dense, high-dimensional vector representations to measure similarity. Unlike traditional Bag-of-Words or TF-IDF, these embeddings capture deep semantic context.

Mathematical Foundation

The goal of EbM is to convert each record into a vector and find the candidate record $s^{*}$ whose vector $s_{e}$ minimizes the distance $D$ (Cosine or Euclidean) from the key record’s vector $k_{e}$:

$$s^{*}=\arg\min_{s\in S}D(k_{e},s_{e})$$
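A minimal sketch of this selection rule, using made-up toy vectors in place of real model embeddings:

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    # D(u, v) = 1 - cosine similarity
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

key_vec = np.array([0.9, 0.1, 0.0])      # k_e, the key record's embedding
cand_vecs = {                            # s_e for each candidate s in S
    "record-1": np.array([0.1, 0.9, 0.0]),
    "record-2": np.array([0.8, 0.2, 0.1]),
    "record-3": np.array([0.0, 0.0, 1.0]),
}

# s* = argmin over S of D(k_e, s_e)
best = min(cand_vecs, key=lambda s: cosine_distance(key_vec, cand_vecs[s]))
print(best)  # record-2 lies closest to the key vector
```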

I evaluated 10 distinct models, ranging from the lightweight 22.7M-parameter all-MiniLM to massive 7B-parameter models like gte-Qwen2. My research showed that while LLMs are not great at generating embeddings “out of the box,” models specifically fine-tuned for retrieval (like SFR-Embedding-Mistral) approximate or even exceed prompt-based accuracy at a fraction of the runtime.


Experimental Setup & Hardware πŸ› οΈ

A core pillar of this thesis was reproducibility on consumer-grade hardware. I avoided proprietary, closed-source models to ensure the research was accessible to the open-source community.

The Hardware:

  • GPU: NVIDIA GTX 1080 Ti (11GB VRAM).
  • CPU: Intel i7-9700K (8 Cores).
  • RAM: 32GB.

The Software Stack:

  • Ollama: Used to host and run inference on quantized versions of models like Qwen-2.5-14B, Gemma2-9B, and Phi-4-14B.
  • pyJedAI: Utilized for the initial Blocking and Filtering pipeline.
  • FAISS: For high-efficiency similarity search during the blocking stage.
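The similarity-search step of blocking amounts to top-k nearest-neighbor retrieval over the candidate embeddings. FAISS (e.g. an inner-product flat index) performs this at scale; the brute-force NumPy sketch below shows the same retrieval semantics on toy random vectors, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64)).astype("float32")   # candidate embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit norm: inner product = cosine

# A query that is a slightly perturbed copy of candidate 42.
query = corpus[42] + 0.01 * rng.normal(size=64).astype("float32")
query /= np.linalg.norm(query)

k = 5
scores = corpus @ query           # similarity of the query to every candidate
topk = np.argsort(-scores)[:k]    # indices of the k most similar records
print(topk[0])                    # the near-duplicate of record 42 should rank first
```

In FAISS the last three lines would become an index build (`add`) plus a single `search` call, with the flat index replaced by an approximate one once the corpus grows beyond what brute force can handle.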

Key Results & Performance Analysis πŸ†

I evaluated both approaches across four diverse datasets: Abt-Buy (AB), Amazon-Google (AG), DBLP-ACM (DA), and DBPEDIA-IMDB (DI).

1. Effectiveness vs. Baselines

LLaMoT PbM consistently outperformed both supervised (HierGAT) and unsupervised (Sudowoodo) baselines. On the Abt-Buy dataset, PbM with Gemma2 reached an F1 score of 94.96, exceeding the supervised baseline by nearly 20 points.

2. Accuracy vs. Speed (The Great Trade-off)

  • Precision: PbM is the winner. It can “reason” over specific details, such as spotting different serial numbers in two laptops that otherwise share identical text descriptions.
  • Speed: EbM is the winner. In the Abt-Buy dataset, the Stella embedding model completed the task in under 27 minutes, while Phi3.5 required 4 hours just for the PbM Match step.

| Dataset       | Best PbM Model | F1 Score |
|---------------|----------------|----------|
| Abt-Buy       | Gemma2-9B      | 94.96    |
| Amazon-Google | Gemma2-9B      | 84.84    |
| DBLP-ACM      | Qwen-2.5-14B   | 99.43    |

Conclusion: Which approach should you use? βš–οΈ

  • Choose PbM (Prompt-based) when Accuracy is your highest priority and you have the inference time to spare. It is the most robust method for complex, noisy data.
  • Choose EbM (Embedding-based) for Scalability. It offers a “plug-and-play” behavior with minimal coding and is ideal for high-volume, time-sensitive applications.

Future Work: The Hybrid Path πŸš€

The ultimate “Goldilocks” solution for Entity Resolution is a Hybrid framework. By using embeddings (EbM) to quickly filter and rank candidates, and then applying prompt-based reasoning (PbM) only for the final, most uncertain decisions, we can achieve maximum effectiveness without sacrificing speed.
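One possible shape for such a hybrid pipeline is sketched below. Both scoring functions are toy stand-ins I introduce for illustration: a real system would use an embedding model for `ebm_scores` and an LLM Match prompt for `pbm_verify`, and the `confident` threshold is a made-up tuning knob.

```python
import numpy as np

def ebm_scores(key_vec: np.ndarray, cand_vecs: np.ndarray) -> np.ndarray:
    # Cheap embedding similarity to every candidate (cosine on unit vectors).
    return cand_vecs @ key_vec

def pbm_verify(key: str, cand: str) -> bool:
    # Expensive LLM-style check, stubbed here with exact string equality.
    return key == cand

def hybrid_resolve(key, key_vec, cands, cand_vecs, keep=3, confident=0.95):
    scores = ebm_scores(key_vec, cand_vecs)
    order = np.argsort(-scores)[:keep]       # EbM: rank and keep a shortlist
    best = int(order[0])
    if scores[best] >= confident:
        return cands[best]                   # embeddings alone are decisive
    for i in order:                          # PbM: escalate uncertain cases
        if pbm_verify(key, cands[int(i)]):
            return cands[int(i)]
    return None

cands = ["alpha", "beta", "gamma"]
cand_vecs = np.eye(3)                        # toy unit embeddings
key_vec = np.array([0.1, 0.9, 0.1])
key_vec = key_vec / np.linalg.norm(key_vec)
print(hybrid_resolve("beta", key_vec, cands, cand_vecs, confident=0.99))
```

The design point is that the expensive PbM call runs only on the short, embedding-ranked tail of uncertain candidates, so total LLM inference time stays bounded regardless of corpus size.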


Full Thesis Reference: Korovesis, P. P. (2025). LLaMoT: Leveraging Large Language Models for Entity Resolution. MSc Thesis, National and Kapodistrian University of Athens.