Levenshtein

The Levenshtein class provides utilities for calculating edit distances and similarity ratios between strings using the Levenshtein algorithm.

Class Overview

"""
Module providing optimized Levenshtein distance and ratio calculations.
 
This module implements an optimized version of the Levenshtein (edit distance) 
algorithm for measuring the difference between two strings. It provides both 
a raw distance calculation and a normalized similarity ratio, which are useful 
for comparing text strings and identifying potential matches with slight variations.
 
Optimizations include:
- Memory-efficient two-row approach instead of full matrix
- Early termination for highly dissimilar strings (>70% length difference)
- LRU caching for repeated calculations (20K cache size)
- Optimized recursive algorithm for short strings (length <= 3)
"""
 
class Levenshtein:
    """
    Optimized class for computing Levenshtein distance and similarity ratio.
    
    This class provides static methods to calculate the edit distance between
    strings (how many insertions, deletions, or substitutions are needed to
    transform one string into another) and to determine a normalized similarity
    ratio between them, with significant performance optimizations.
    
    These metrics are widely used in fuzzy string matching, spell checking,
    and approximate text similarity measurements.
    """

The Levenshtein class offers methods to measure the difference between two strings and calculate their similarity.

Static Methods

The class exposes two static methods: distance(), which returns the raw edit distance between two strings, and ratio(), which returns their normalized similarity score. Both are cached with functools.lru_cache (20,000 entries), so repeated comparisons of the same pair are served from the cache.

Usage Guide

Distance Calculation

The Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.

Example:

from levenshtein import Levenshtein
 
# Calculate the edit distance between two strings
distance = Levenshtein.distance("kitten", "sitting")
print(f"Levenshtein distance: {distance}")  # Output: 3

Similarity Ratio

The similarity ratio is a normalized measure between 0 and 1, where 1 means the strings are identical and 0 means they are completely different.

Example:

from levenshtein import Levenshtein
 
# Calculate the similarity ratio between two strings
similarity = Levenshtein.ratio("kitten", "sitting")
print(f"Similarity ratio: {similarity:.4f}")  # Output: 0.5714

Algorithm Explanation

The Levenshtein distance algorithm calculates the minimum edit distance between two strings using dynamic programming; a sketch of the core loop follows the numbered steps below. This implementation also includes several optimizations:

Standard Algorithm

  1. Initialize two arrays instead of a full matrix (for memory efficiency)
  2. previous_row holds the distances computed for the previous character of seq1
  3. current_row holds the distances being calculated for the current character of seq1
  4. For each position, take the minimum cost of three operations:
    • Deletion: value from the cell above + 1
    • Insertion: value from the cell to the left + 1
    • Substitution/Match: value from the diagonal + cost (0 if the characters match, 1 otherwise)
  5. After processing all characters, the last value of the final row is the distance
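
A minimal sketch of the two-row core described in steps 1-5 (names other than previous_row and current_row are illustrative, not the library's API):

def two_row_distance(seq1: str, seq2: str) -> int:
    # previous_row[j] = distance between the seq1 prefix processed so far
    # and the first j characters of seq2
    previous_row = list(range(len(seq2) + 1))
    for i, ch1 in enumerate(seq1):
        current_row = [i + 1]  # cost of deleting the first i + 1 characters of seq1
        for j, ch2 in enumerate(seq2):
            deletion = previous_row[j + 1] + 1             # cell above + 1
            insertion = current_row[j] + 1                 # cell to the left + 1
            substitution = previous_row[j] + (ch1 != ch2)  # diagonal + cost
            current_row.append(min(deletion, insertion, substitution))
        previous_row = current_row  # swap rows to reuse memory
    return previous_row[-1]  # last value of the final row

print(two_row_distance("kitten", "sitting"))  # 3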

Optimizations

Memory Efficiency: Uses only two rows instead of full matrix

  • Reduces space complexity from O(m×n) to O(min(m, n))
  • Swaps rows after each iteration to reuse memory

Early Termination: Checks if strings are too different before computing

  • If length difference > 70% of max length, returns max length immediately
  • Avoids unnecessary computation for very dissimilar strings

String Length Optimization: Orders the strings so the shorter one indexes the columns

  • Keeps the two working rows as small as possible
  • Gives the O(min(m, n)) space bound listed under Performance Considerations

Short String Optimization: Uses recursive approach for strings ≤ 3 characters

  • Simple recursive method is faster for very short strings
  • Avoids array allocation overhead

LRU Caching: Both distance() and ratio() methods use @lru_cache

  • Cache size: 20,000 entries
  • Dramatically speeds up repeated comparisons
  • Useful when comparing many keywords against the same text
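
A hedged sketch of how these optimizations could be layered around the two-row core sketched earlier; the decorator, the 70% threshold, and the length cutoff mirror the values documented above, but the exact code and helper names are illustrative:

from functools import lru_cache

def _recursive_distance(a: str, b: str) -> int:
    # Plain recursion; only intended for very short strings (length <= 3)
    if not a:
        return len(b)
    if not b:
        return len(a)
    return min(
        _recursive_distance(a[1:], b) + 1,                   # deletion
        _recursive_distance(a, b[1:]) + 1,                   # insertion
        _recursive_distance(a[1:], b[1:]) + (a[0] != b[0]),  # substitution / match
    )

@lru_cache(maxsize=20000)
def distance(seq1: str, seq2: str) -> int:
    # Order the strings so the shorter one indexes the columns (smaller rows)
    if len(seq2) > len(seq1):
        seq1, seq2 = seq2, seq1
    # Early termination: strings whose length difference exceeds 70% of the
    # longer length are treated as maximally different
    if len(seq1) - len(seq2) > 0.7 * len(seq1):
        return len(seq1)
    # Very short strings: recursion avoids the row-allocation overhead
    if len(seq1) <= 3:
        return _recursive_distance(seq1, seq2)
    # General case: the two-row dynamic programming core (two_row_distance above)
    return two_row_distance(seq1, seq2)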

Complete Usage Example

import time
from levenshtein import Levenshtein
 
# Test strings
string1 = "natural language processing"
string2 = "neural language processing"
 
# Calculate distance and similarity
distance = Levenshtein.distance(string1, string2)
similarity = Levenshtein.ratio(string1, string2)
 
print(f"Strings:\n1: '{string1}'\n2: '{string2}'")
print(f"Levenshtein distance: {distance}")
print(f"Similarity ratio: {similarity:.4f}")
 
# Example output:
# Strings:
# 1: 'natural language processing'
# 2: 'neural language processing'
# Levenshtein distance: 2
# Similarity ratio: 0.9259
 
# Demonstrate the caching benefit
pair = ("test string one", "test string two")

# Cold timing: clear the cache before every call so each ratio is recomputed
start = time.time()
for _ in range(1000):
    Levenshtein.ratio.cache_clear()
    Levenshtein.ratio(*pair)
uncached_time = time.time() - start

# Warm timing: after the first call, every repeat is served from the LRU cache
start = time.time()
for _ in range(1000):
    Levenshtein.ratio(*pair)
cached_time = time.time() - start

print("\nPerformance comparison:")
print(f"Without cache: {uncached_time:.4f}s")
print(f"With cache: {cached_time:.4f}s")
print(f"Speedup: {uncached_time / cached_time:.1f}x faster")

Performance Considerations

  • Time Complexity: O(m×n) where m and n are the lengths of the input strings
  • Space Complexity: O(min(m,n)) due to two-row optimization (not full matrix)
  • Cache Size: 20,000 entries for both distance() and ratio() methods
  • Early Termination: Strings with >70% length difference return immediately
  • Short String Optimization: Strings ≤ 3 characters use faster recursive approach
  • Memory Efficiency: The shorter string indexes the columns, so the two working rows stay small

Performance Tips

  1. Use caching effectively: Repeated comparisons are significantly faster
  2. Pre-normalize strings: Convert to lowercase once before multiple comparisons (see the sketch after these tips)
  3. Batch similar operations: Cache works best with repeated string pairs
  4. Clear cache periodically: Use Levenshtein.ratio.cache_clear() if memory is a concern
  5. Consider alternatives for very long strings: For strings >1000 characters, approximate algorithms may be faster
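
As an illustration of tip 2, the snippet below lowercases the inputs once, outside the comparison loop, so later comparisons of the same normalized pair hit the cache; the query and candidate strings are made up for the example:

from levenshtein import Levenshtein

query = "Machine Learning"
candidates = ["machine learning", "deep learning", "machine translation"]

# Normalize once, outside the comparison loop, so every cached entry
# is keyed on the same lowercase form
normalized_query = query.lower()
normalized_candidates = [c.lower() for c in candidates]

scores = {
    original: Levenshtein.ratio(normalized_query, normalized)
    for original, normalized in zip(candidates, normalized_candidates)
}
best_match = max(scores, key=scores.get)
print(best_match, round(scores[best_match], 4))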

Dependencies

The Levenshtein class relies on:

  • functools: For LRU cache decorator to optimize repeated calculations

Applications

Levenshtein distance is commonly used in:

  • Spell checking and correction
  • DNA sequence alignment
  • Plagiarism detection
  • Fuzzy string matching
  • Natural language processing
  • Record linkage and data deduplication
