DH.
All projects
LLM-Enhanced Customer Data Matching

2023-02

LLM-Enhanced Customer Data Matching

Automating record reconciliation with AI-powered precision

ClientPhysical security company
RoleFull-Stack Development
PythonLlama 3.2Fuzzy MatchingData EngineeringPrompt Engineering

Overview

When two databases describe the same customer differently — one says "ABC Security LLC", another says "A.B.C. Security, Lancaster" — traditional exact-match queries fail. This project built a multi-layered matching system that uses fuzzy string algorithms as a first pass and Llama 3.2 as the intelligent decision layer, producing confident match/no-match decisions with explicit reasoning.

The system reconciles customer accounts across two internal databases, handling name variations, address formatting differences, and subsidiary relationships that no rule-based system could manage reliably.

Architecture

Layer 1: Fuzzy Pre-filtering

Before involving the LLM, I normalize and fuzzy-match company names and addresses to generate candidate pairs. This narrows the search space and avoids burning LLM calls on obvious non-matches.

from rapidfuzz import fuzz
 
def normalize(name: str) -> str:
    name = name.lower().strip()
    # Remove legal suffixes that vary between systems
    for suffix in ["llc", "inc", "corp", "co.", "ltd"]:
        name = name.replace(suffix, "").strip()
    return re.sub(r"[^a-z0-9\s]", "", name)
 
def get_candidates(record, pool, threshold=65):
    normalized_target = normalize(record["name"])
    return [
        candidate for candidate in pool
        if fuzz.token_sort_ratio(
            normalized_target, normalize(candidate["name"])
        ) >= threshold
    ]

Layer 2: LLM Decision Engine

Each candidate pair gets evaluated by Llama 3.2 using a structured prompt that enforces a seven-step analysis sequence. The prompt is engineered to prevent two common LLM failure modes: overconfident matching on superficial similarity, and decision paralysis on ambiguous cases.

MATCH_PROMPT = """
Analyze whether these two records represent the same business.
 
Record A: {record_a}
Record B: {record_b}
 
Follow these steps in order:
1. Compare normalized business names and note similarity patterns
2. Evaluate address match (exact / partial / different city / no address)
3. Check for known subsidiary or DBA relationships
4. Assess phone/contact overlap if present
5. Consider whether name differences suggest rebranding vs. different entity
6. Assign a confidence level: HIGH (>90%), MEDIUM (70-90%), LOW (<70%)
7. Make a final MATCH or NO_MATCH decision
 
Constraints:
- Analyze only the pair provided — do not reference other records
- A HIGH confidence MATCH requires agreement on at least name + address
- If confidence is LOW, default to NO_MATCH
- Explain your reasoning before the final decision
 
Output format:
DECISION: [MATCH|NO_MATCH]
CONFIDENCE: [HIGH|MEDIUM|LOW]
REASONING: [your analysis]
"""

Layer 3: Post-processing and Validation

LLM output is parsed with regex against the enforced format. Results below a configurable confidence threshold are flagged for human review rather than auto-applied.

Key Design Decisions

Conservative defaults. When in doubt, the system returns NO_MATCH. A false positive (merging two real customers) is far more damaging than a false negative that leaves one account unmatched.

Structured response format. Forcing DECISION: / CONFIDENCE: / REASONING: labels makes parsing reliable and creates an audit trail — you can always explain why any match was made.

Separating concerns. Fuzzy matching is cheap; LLM calls are not. Keeping them as distinct layers means the LLM budget goes only toward genuinely ambiguous pairs.

Outcome

The system processed the full customer database backlog in an overnight batch run. Match quality was validated against a hand-curated test set: precision above 94%, with no false positives in the HIGH confidence bucket.

Manual reconciliation that previously took weeks per quarter now runs automatically as new accounts are created.