## Introduction
Entity Resolution (ER) is the critical task of identifying and linking records that refer to the same real-world entity across different data sources. Historically, it has been framed under the Filtering-Verification paradigm: Filtering curbs the computational cost by grouping similar records into candidate pairs, while Verification examines those pairs analytically to make the final match decision.
In my thesis, LLaMoT (Leveraging Large Language Models for Entity Resolution), I explored how the “5th Generation” of ER, driven by the semantic power of Large Language Models (LLMs), can replace traditional deep learning models that require expensive, labeled datasets. By using open-source LLMs in a zero-shot setting, we can achieve high-accuracy results without the need for annotated data or domain-specific fine-tuning.
## The Evolutionary Context: Five Generations of ER
To understand the impact of LLaMoT, it is essential to look at how Entity Resolution has evolved:
- Generation 1 (Veracity): Focused on structured, schema-aware data with noise only in attribute values.
- Generation 2 (Volume): Introduced parallelization (e.g., MapReduce) to handle datasets with millions of records.
- Generation 3 (Variety): Addressed heterogeneous and noisy data using schema-agnostic approaches.
- Generation 4 (Velocity): Introduced real-time and budget-aware ER, prioritizing response times and limited resources.
- Generation 5 (Semantics): The current era, where ER systems leverage external semantic knowledge from LLMs, moving beyond simple character or token matching to understand the actual meaning of data.
## Methodology 1: Prompt-based Matching (PbM)
The PbM approach uses the reasoning capabilities of LLMs to act as the final judge in the verification process. I implemented a robust three-step workflow to ensure both accuracy and reproducibility:
- Match Step: A binary classification phase that filters candidate pairs to remove obvious non-matches. This significantly reduces the workload for the following, more complex steps.
- Order Step: Using a `Compare` prompt, the system executes a bubble-sort logic to rank candidate records based on their relevance to a “key” entity.
- Select Step: The LLM is presented with the ordered candidates in a multiple-choice format and tasked with selecting the single best match (or “none” if no match exists).
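The three-step workflow can be sketched as a pipeline of pluggable decision functions. In the real system each callable wraps an LLM prompt; here they are stubbed with simple token overlap so the sketch runs standalone (the record strings and thresholds are illustrative, not taken from the thesis):

```python
from typing import Callable, List, Optional

def match_step(key: str, candidates: List[str],
               is_match: Callable[[str, str], bool]) -> List[str]:
    """Match step: binary filter that drops obvious non-matches."""
    return [c for c in candidates if is_match(key, c)]

def order_step(key: str, candidates: List[str],
               closer: Callable[[str, str, str], bool]) -> List[str]:
    """Order step: bubble-sort candidates via pairwise Compare decisions.
    closer(key, a, b) is True when a is more relevant to key than b."""
    ranked = list(candidates)
    for i in range(len(ranked)):
        for j in range(len(ranked) - 1 - i):
            if closer(key, ranked[j + 1], ranked[j]):
                ranked[j], ranked[j + 1] = ranked[j + 1], ranked[j]
    return ranked

def select_step(key: str, ranked: List[str],
                choose: Callable[[str, List[str]], Optional[str]]) -> Optional[str]:
    """Select step: multiple-choice pick of the single best match (or None)."""
    return choose(key, ranked)

# Toy stand-in for the LLM: token overlap between two record strings.
def overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))

key = "Apple iPhone 13 128GB"
cands = ["Samsung Galaxy S21", "Apple iPhone 13 (128 GB)", "Apple iPhone 12"]
kept = match_step(key, cands, lambda k, c: overlap(k, c) >= 2)
ranked = order_step(key, kept, lambda k, a, b: overlap(k, a) > overlap(k, b))
best = select_step(key, ranked, lambda k, r: r[0] if r else None)
```

Because the Match step prunes the candidate set first, the quadratic bubble sort in the Order step only ever runs over a handful of survivors.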
### The Power of “Optimal” Prompts
One of my most significant findings was that LLMs are incredibly sensitive to instructions. I developed Optimal Match prompts that add hard constraints such as “Answer ONLY ‘Yes’ or ‘No’” and end with the prefix “The Answer is:”. This simple refinement stopped the models from emitting chatty reasoning tokens, reduced run times by up to 45% for models like Phi-4, and eliminated invalid responses.
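A minimal sketch of what such a constrained prompt and its strict parser might look like; the exact wording used in the thesis may differ, so treat the template text as an assumption:

```python
def build_match_prompt(rec_a: str, rec_b: str) -> str:
    """Assemble a constrained Match prompt (illustrative wording)."""
    return (
        "Do the following two records refer to the same real-world entity?\n"
        f"Record A: {rec_a}\n"
        f"Record B: {rec_b}\n"
        "Answer ONLY 'Yes' or 'No'.\n"
        "The Answer is:"
    )

def parse_answer(raw: str):
    """Map the model's completion onto True/False; anything else is invalid."""
    token = raw.strip().rstrip(".").lower()
    if token == "yes":
        return True
    if token == "no":
        return False
    return None  # invalid response: count it, then retry or discard
```

Ending the prompt with “The Answer is:” means the model's very first generated token should already be the verdict, which is where the run-time savings come from.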
## Methodology 2: Embedding-based Matching (EbM)
EbM leverages dense, high-dimensional vector representations to measure similarity. Unlike traditional Bag-of-Words or TF-IDF, these embeddings capture deep semantic context.
### Mathematical Foundation
The goal of EbM is to convert each record into a vector and find the candidate record $s^{*}$ whose embedding $s_{e}$ minimizes the distance $D$ (Cosine or Euclidean) from the key record’s embedding $k_{e}$:
$$s^{*} = \arg\min_{s \in S} D(k_{e}, s_{e})$$
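The argmin above translates directly into code. This minimal sketch uses cosine distance over toy 2-d vectors; in a real pipeline the vectors would come from one of the embedding models discussed below:

```python
import math

def cosine_distance(u, v):
    """D(u, v) = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return 1.0 - dot / (norm_u * norm_v)

def best_match(key_vec, candidate_vecs):
    """Return the id of the candidate embedding closest to the key embedding."""
    return min(candidate_vecs,
               key=lambda cid: cosine_distance(key_vec, candidate_vecs[cid]))

# Toy example: s1 points almost the same way as the key, s2 is orthogonal.
k_e = [1.0, 0.0]
S_e = {"s1": [0.9, 0.1], "s2": [0.0, 1.0]}
```

`best_match(k_e, S_e)` picks `"s1"`, the candidate with the smallest angular distance to the key.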
I evaluated 10 distinct models, ranging from the lightweight 22.7M-parameter all-MiniLM to massive 7B-parameter models like gte-Qwen2. My research showed that while general-purpose LLMs are not great at generating embeddings “out of the box,” models specifically fine-tuned for retrieval (like SFR-Embedding-Mistral) approximate or even exceed prompt-based accuracy at a fraction of the runtime.
## Experimental Setup & Hardware
A core pillar of this thesis was reproducibility on consumer-grade hardware. I avoided proprietary, closed-source models to ensure the research was accessible to the open-source community.
The Hardware:
- GPU: NVIDIA GTX 1080 Ti (11 GB VRAM).
- CPU: Intel i7-9700K (8 Cores).
- RAM: 32GB.
The Software Stack:
- Ollama: Used to host and run inference on quantized versions of models like Qwen-2.5-14B, Gemma2-9B, and Phi-4-14B.
- pyJedAI: Utilized for the initial Blocking and Filtering pipeline.
- FAISS: For high-efficiency similarity search during the blocking stage.
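In the blocking stage, FAISS answers top-k nearest-neighbor queries over the embedding space. As a rough sketch of what that stage computes, here is the exhaustive search that a FAISS index accelerates; the vectors and `k` are illustrative, not from the thesis:

```python
import math

def top_k_candidates(query_vec, indexed_vecs, k=2):
    """Brute-force top-k Euclidean search. FAISS computes the same result
    with optimized (and optionally approximate) indexes that scale to
    millions of records."""
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query_vec, v)))
    return sorted(indexed_vecs, key=lambda rid: dist(indexed_vecs[rid]))[:k]

# Toy index of record ids -> embeddings.
index = {"r1": [0.0, 0.0], "r2": [1.0, 1.0], "r3": [0.1, 0.0]}
block = top_k_candidates([0.0, 0.1], index, k=2)
```

Each key record's block (its top-k neighbors) is then what the verification step, PbM or EbM, actually examines.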
## Key Results & Performance Analysis
I evaluated both approaches across four diverse datasets: Abt-Buy (AB), Amazon-Google (AG), DBLP-ACM (DA), and DBPEDIA-IMDB (DI).
### 1. Effectiveness vs. Baselines
LLaMoT PbM consistently outperformed both supervised (HierGAT) and unsupervised (Sudowoodo) baselines. On the Abt-Buy dataset, PbM with Gemma2 reached an F1 score of 94.96, exceeding the supervised baseline by nearly 20 points.
### 2. Accuracy vs. Speed (The Great Trade-off)
- Precision: PbM is the winner. It can “reason” over specific details, such as spotting different serial numbers in two laptops that otherwise share identical text descriptions.
- Speed: EbM is the winner. On the Abt-Buy dataset, the `Stella` embedding model completed the task in under 27 minutes, while `Phi3.5` required 4 hours just for the PbM Match step.
| Dataset | Best PbM Model | F1 Score |
|---|---|---|
| Abt-Buy | Gemma2-9B | 94.96 |
| Amazon-Google | Gemma2-9B | 84.84 |
| DBLP-ACM | Qwen-2.5-14B | 99.43 |
## Conclusion: Which approach should you use?
- Choose PbM (Prompt-based) when Accuracy is your highest priority and you have the inference time to spare. It is the most robust method for complex, noisy data.
- Choose EbM (Embedding-based) for Scalability. It offers “plug-and-play” behavior with minimal coding and is ideal for high-volume, time-sensitive applications.
## Future Work: The Hybrid Path
The ultimate “Goldilocks” solution for Entity Resolution is a Hybrid framework. By using embeddings (EbM) to quickly filter and rank candidates, and then applying prompt-based reasoning (PbM) only for the final, most uncertain decisions, we can achieve maximum effectiveness without sacrificing speed.
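A hedged sketch of that hybrid idea: `embed` and `llm_says_match` are hypothetical stand-ins for the EbM encoder and the PbM verdict, stubbed here with token sets and an overlap rule so the example runs end to end:

```python
def hybrid_resolve(key, candidates, embed, distance, llm_says_match, k=2):
    """EbM first: shortlist the k nearest candidates by embedding distance.
    PbM second: spend expensive LLM calls only on that shortlist."""
    shortlist = sorted(candidates,
                       key=lambda c: distance(embed(key), embed(c)))[:k]
    for cand in shortlist:  # cheapest-first, ranked by embedding distance
        if llm_says_match(key, cand):
            return cand
    return None

# Toy stand-ins: token-set "embeddings", Jaccard distance, overlap "LLM".
embed = lambda s: set(s.lower().split())
distance = lambda a, b: 1.0 - len(a & b) / len(a | b)
llm_says_match = lambda k, c: len(embed(k) & embed(c)) >= 3

match = hybrid_resolve("iPhone 13 128GB",
                       ["iphone 13 128gb black", "galaxy s21", "iphone 12"],
                       embed, distance, llm_says_match)
```

The design point is that the LLM's cost now scales with `k` (a small constant) rather than with the full candidate set, which is exactly the trade-off the hybrid framework targets.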
Full Thesis Reference: Korovesis, P. P. (2025). LLaMoT: Leveraging Large Language Models for Entity Resolution. MSc Thesis, National and Kapodistrian University of Athens.