LLaMoT: Redefining Entity Resolution with Open-Source LLMs

Panos Korovesis
Data Engineer at BIG DATA LAB, MSc Data Science

Introduction πŸ”Ž

Entity Resolution (ER) is the critical task of identifying and linking records that refer to the same real-world entity across different data sources. Historically, this has been framed under the Filtering-Verification paradigm: Filtering reduces computational cost by grouping similar records into candidate pairs, while Verification examines each pair analytically to make the final match decision.

In my thesis, LLaMoT (Leveraging Large Language Models for Entity Resolution), I explored how the “5th Generation” of ERβ€”driven by the semantic power of Large Language Models (LLMs)β€”can replace traditional deep learning models that require expensive, labeled datasets. By using open-source LLMs in a zero-shot setting, we can achieve high-accuracy results without the need for annotated data or domain-specific fine-tuning.

Full code, optimized prompts, and experimental results are available on GitHub.

The Evolutionary Context: Five Generations of ER πŸ•°οΈ

To understand the impact of LLaMoT, it is essential to look at how Entity Resolution has evolved:

  1. Generation 1 (Veracity): Focused on structured, schema-aware data with noise only in attribute values.
  2. Generation 2 (Volume): Introduced parallelization (e.g., MapReduce) to handle datasets with millions of records.
  3. Generation 3 (Variety): Addressed heterogeneous and noisy data using schema-agnostic approaches.
  4. Generation 4 (Velocity): Introduced real-time and budget-aware ER, prioritizing response times and limited resources.
  5. Generation 5 (Semantics): The current era, where ER systems leverage external semantic knowledge from LLMs, moving beyond simple character or token matching to understand the actual meaning of data.

Methodology 1: Prompt-based Matching (PbM) πŸ’¬

The PbM approach uses the reasoning capabilities of LLMs to act as the final judge in the verification process. I implemented a robust three-step workflow to ensure both accuracy and reproducibility:

  • Match Step: A binary classification phase that filters candidate pairs to remove obvious non-matches. This significantly reduces the workload for the following, more complex steps.
  • Order Step: Using a Compare prompt, the system applies bubble-sort logic to rank candidate records by their relevance to a “key” entity.
  • Select Step: The LLM is presented with the ordered candidates in a multiple-choice format and tasked with selecting the single best match (or “none” if no match exists).

The Power of “Optimal” Prompts

One of my most significant findings was that LLMs are incredibly sensitive to instructions. I developed Optimal Match prompts that included hard constraints like “Answer ONLY ‘Yes’ or ‘No’” and ended with the prefix “The Answer is:”. This simple refinement suppressed chattiness and unnecessary reasoning tokens, which reduced run-times by up to 45% for models like Phi-4 and eliminated invalid responses.
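An Optimal Match prompt in this spirit might be built like the sketch below; the exact wording used in the thesis may differ, so treat the template as illustrative.

```python
def build_match_prompt(record_a: str, record_b: str) -> str:
    # Hard constraints keep the model from generating reasoning tokens:
    # a strict answer format plus a completion prefix the model must extend.
    return (
        "Do the following two records refer to the same real-world entity?\n"
        f"Record A: {record_a}\n"
        f"Record B: {record_b}\n"
        "Answer ONLY 'Yes' or 'No'. Do not explain your reasoning.\n"
        "The Answer is:"
    )

prompt = build_match_prompt("Sony WH-1000XM4 headphones",
                            "Sony WH1000XM4 wireless headset")
print(prompt)
```

Ending with “The Answer is:” turns the task into a one-token completion, which is also trivial to parse: any response other than “Yes” or “No” can be rejected as invalid.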


Methodology 2: Embedding-based Matching (EbM) πŸ•ΈοΈ

EbM leverages dense, high-dimensional vector representations to measure similarity. Unlike traditional Bag-of-Words or TF-IDF, these embeddings capture deep semantic context.

Mathematical Foundation

The goal of EbM is to convert each record into a vector and find the candidate record $s^{*}$ whose vector $s_{e}$ minimizes the distance $D$ (Cosine or Euclidean) from the key record’s vector $k_{e}$:

$$s^{*}=\arg\min_{s\in S}D(k_{e},s_{e})$$
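A minimal sketch of this selection rule, using made-up toy vectors in place of real model embeddings:

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    # D(u, v) = 1 - cosine similarity
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

key_vec = np.array([0.9, 0.1, 0.0])      # k_e, the key record's embedding
cand_vecs = {                            # s_e for each candidate s in S
    "record-1": np.array([0.1, 0.9, 0.0]),
    "record-2": np.array([0.8, 0.2, 0.1]),
    "record-3": np.array([0.0, 0.0, 1.0]),
}

# s* = argmin over S of D(k_e, s_e)
best = min(cand_vecs, key=lambda s: cosine_distance(key_vec, cand_vecs[s]))
print(best)  # record-2 lies closest to the key vector
```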

I evaluated 10 distinct models, ranging from the lightweight 22.7M-parameter all-MiniLM to massive 7B-parameter models like gte-Qwen2. My research showed that while LLMs are not great at generating embeddings “out of the box,” models specifically fine-tuned for retrieval (like SFR-Embedding-Mistral) approximate or even exceed prompt-based accuracy at a fraction of the runtime.


Experimental Setup & Hardware πŸ› οΈ

A core pillar of this thesis was reproducibility on consumer-grade hardware. I avoided proprietary, closed-source models to ensure the research was accessible to the open-source community.

The Hardware:

  • GPU: NVIDIA GTX 1080 Ti (11GB VRAM).
  • CPU: Intel i7-9700K (8 Cores).
  • RAM: 32GB.

The Software Stack:

  • Ollama: Used to host and run inference on quantized versions of models like Qwen-2.5-14B, Gemma2-9B, and Phi-4-14B.
  • pyJedAI: Utilized for the initial Blocking and Filtering pipeline.
  • FAISS: For high-efficiency similarity search during the blocking stage.
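The similarity-search step of blocking amounts to top-k nearest-neighbor retrieval over the candidate embeddings. FAISS (e.g. an inner-product flat index) performs this at scale; the brute-force NumPy sketch below shows the same retrieval semantics on toy random vectors, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64)).astype("float32")   # candidate embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit norm: inner product = cosine

# A query that is a slightly perturbed copy of candidate 42.
query = corpus[42] + 0.01 * rng.normal(size=64).astype("float32")
query /= np.linalg.norm(query)

k = 5
scores = corpus @ query           # similarity of the query to every candidate
topk = np.argsort(-scores)[:k]    # indices of the k most similar records
print(topk[0])                    # the near-duplicate of record 42 should rank first
```

In FAISS the last three lines would become an index build (`add`) plus a single `search` call, with the flat index replaced by an approximate one once the corpus grows beyond what brute force can handle.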

Key Results & Performance Analysis πŸ†

I evaluated both approaches across four diverse datasets: Abt-Buy (AB), Amazon-Google (AG), DBLP-ACM (DA), and DBPEDIA-IMDB (DI).

1. Effectiveness vs. Baselines

LLaMoT PbM consistently outperformed both supervised (HierGAT) and unsupervised (Sudowoodo) baselines. On the Abt-Buy dataset, PbM with Gemma2 reached an F1 score of 94.96, exceeding the supervised baseline by nearly 20 points.

2. Accuracy vs. Speed (The Great Trade-off)

  • Precision: PbM is the winner. It can “reason” over specific details, such as spotting different serial numbers in two laptops that otherwise share identical text descriptions.
  • Speed: EbM is the winner. In the Abt-Buy dataset, the Stella embedding model completed the task in under 27 minutes, while Phi3.5 required 4 hours just for the PbM Match step.

| Dataset       | Best PbM Model | F1 Score |
|---------------|----------------|----------|
| Abt-Buy       | Gemma2-9B      | 94.96    |
| Amazon-Google | Gemma2-9B      | 84.84    |
| DBLP-ACM      | Qwen-2.5-14B   | 99.43    |

Conclusion: Which approach should you use? βš–οΈ

  • Choose PbM (Prompt-based) when Accuracy is your highest priority and you have the inference time to spare. It is the most robust method for complex, noisy data.
  • Choose EbM (Embedding-based) for Scalability. It offers a “plug-and-play” behavior with minimal coding and is ideal for high-volume, time-sensitive applications.

Future Work: The Hybrid Path πŸš€

The ultimate “Goldilocks” solution for Entity Resolution is a Hybrid framework. By using embeddings (EbM) to quickly filter and rank candidates, and then applying prompt-based reasoning (PbM) only for the final, most uncertain decisions, we can achieve maximum effectiveness without sacrificing speed.
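One possible shape for such a hybrid pipeline is sketched below. Both scoring functions are toy stand-ins I introduce for illustration: a real system would use an embedding model for `ebm_scores` and an LLM Match prompt for `pbm_verify`, and the `confident` threshold is a made-up tuning knob.

```python
import numpy as np

def ebm_scores(key_vec: np.ndarray, cand_vecs: np.ndarray) -> np.ndarray:
    # Cheap embedding similarity to every candidate (cosine on unit vectors).
    return cand_vecs @ key_vec

def pbm_verify(key: str, cand: str) -> bool:
    # Expensive LLM-style check, stubbed here with exact string equality.
    return key == cand

def hybrid_resolve(key, key_vec, cands, cand_vecs, keep=3, confident=0.95):
    scores = ebm_scores(key_vec, cand_vecs)
    order = np.argsort(-scores)[:keep]       # EbM: rank and keep a shortlist
    best = int(order[0])
    if scores[best] >= confident:
        return cands[best]                   # embeddings alone are decisive
    for i in order:                          # PbM: escalate uncertain cases
        if pbm_verify(key, cands[int(i)]):
            return cands[int(i)]
    return None

cands = ["alpha", "beta", "gamma"]
cand_vecs = np.eye(3)                        # toy unit embeddings
key_vec = np.array([0.1, 0.9, 0.1])
key_vec = key_vec / np.linalg.norm(key_vec)
print(hybrid_resolve("beta", key_vec, cands, cand_vecs, confident=0.99))
```

The design point is that the expensive PbM call runs only on the short, embedding-ranked tail of uncertain candidates, so total LLM inference time stays bounded regardless of corpus size.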


Full Thesis Reference: Korovesis, P. P. (2025). LLaMoT: Leveraging Large Language Models for Entity Resolution. MSc Thesis, National and Kapodistrian University of Athens.