Levenshtein

The Levenshtein class provides utilities for calculating edit distances and similarity ratios between strings using the Levenshtein algorithm.

Class Overview

"""
Module providing optimized Levenshtein distance and ratio calculations.
 
This module implements an optimized version of the Levenshtein (edit distance) 
algorithm for measuring the difference between two strings. It provides both 
a raw distance calculation and a normalized similarity ratio, which are useful 
for comparing text strings and identifying potential matches with slight variations.
 
Optimizations include:
- Memory-efficient two-row approach instead of full matrix
- Early termination for highly dissimilar strings (>70% length difference)
- LRU caching for repeated calculations (20K cache size)
- Optimized recursive algorithm for short strings (length <= 3)
"""
 
class Levenshtein:
    """
    Optimized class for computing Levenshtein distance and similarity ratio.
    
    This class provides static methods to calculate the edit distance between
    strings (how many insertions, deletions, or substitutions are needed to
    transform one string into another) and to determine a normalized similarity
    ratio between them, with significant performance optimizations.
    
    These metrics are widely used in fuzzy string matching, spell checking,
    and approximate text similarity measurements.
    """

The Levenshtein class offers methods to measure the difference between two strings and calculate their similarity.

Static Methods

The class exposes two static methods: distance(), which returns the raw edit distance between two strings, and ratio(), which returns their normalized similarity score. Both are cached with functools.lru_cache (20,000 entries), so repeated comparisons of the same pair are served from the cache.

Usage Guide

Distance Calculation

The Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.

Example:

from levenshtein import Levenshtein
 
# Calculate the edit distance between two strings
distance = Levenshtein.distance("kitten", "sitting")
print(f"Levenshtein distance: {distance}")  # Output: 3

Similarity Ratio

The similarity ratio is a normalized measure between 0 and 1, where 1 means the strings are identical and 0 means they are completely different.

Example:

from levenshtein import Levenshtein
 
# Calculate the similarity ratio between two strings
similarity = Levenshtein.ratio("kitten", "sitting")
print(f"Similarity ratio: {similarity:.4f}")  # Output: 0.5714

Algorithm Explanation

The Levenshtein distance algorithm calculates the minimum edit distance between two strings using dynamic programming; a sketch of the core loop follows the numbered steps below. This implementation also includes several optimizations:

Standard Algorithm

  1. Initialize two arrays instead of a full matrix (for memory efficiency)
  2. previous_row holds the distances computed for the previous character of seq1
  3. current_row holds the distances being calculated for the current character of seq1
  4. For each position, take the minimum cost of three operations:
    • Deletion: value from the cell above + 1
    • Insertion: value from the cell to the left + 1
    • Substitution/Match: value from the diagonal + cost (0 if the characters match, 1 otherwise)
  5. After processing all characters, the last value of the final row is the distance
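
A minimal sketch of the two-row core described in steps 1-5 (names other than previous_row and current_row are illustrative, not the library's API):

def two_row_distance(seq1: str, seq2: str) -> int:
    # previous_row[j] = distance between the seq1 prefix processed so far
    # and the first j characters of seq2
    previous_row = list(range(len(seq2) + 1))
    for i, ch1 in enumerate(seq1):
        current_row = [i + 1]  # cost of deleting the first i + 1 characters of seq1
        for j, ch2 in enumerate(seq2):
            deletion = previous_row[j + 1] + 1             # cell above + 1
            insertion = current_row[j] + 1                 # cell to the left + 1
            substitution = previous_row[j] + (ch1 != ch2)  # diagonal + cost
            current_row.append(min(deletion, insertion, substitution))
        previous_row = current_row  # swap rows to reuse memory
    return previous_row[-1]  # last value of the final row

print(two_row_distance("kitten", "sitting"))  # 3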

Optimizations

Memory Efficiency: Uses only two rows instead of full matrix

  • Reduces space complexity from O(m×n) to O(min(m, n))
  • Swaps rows after each iteration to reuse memory

Early Termination: Checks if strings are too different before computing

  • If length difference > 70% of max length, returns max length immediately
  • Avoids unnecessary computation for very dissimilar strings

String Length Optimization: Orders the strings so the shorter one indexes the columns

  • Keeps the two working rows as small as possible
  • Gives the O(min(m, n)) space bound listed under Performance Considerations

Short String Optimization: Uses recursive approach for strings ≤ 3 characters

  • Simple recursive method is faster for very short strings
  • Avoids array allocation overhead

LRU Caching: Both distance() and ratio() methods use @lru_cache

  • Cache size: 20,000 entries
  • Dramatically speeds up repeated comparisons
  • Useful when comparing many keywords against the same text
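
A hedged sketch of how these optimizations could be layered around the two-row core sketched earlier; the decorator, the 70% threshold, and the length cutoff mirror the values documented above, but the exact code and helper names are illustrative:

from functools import lru_cache

def _recursive_distance(a: str, b: str) -> int:
    # Plain recursion; only intended for very short strings (length <= 3)
    if not a:
        return len(b)
    if not b:
        return len(a)
    return min(
        _recursive_distance(a[1:], b) + 1,                   # deletion
        _recursive_distance(a, b[1:]) + 1,                   # insertion
        _recursive_distance(a[1:], b[1:]) + (a[0] != b[0]),  # substitution / match
    )

@lru_cache(maxsize=20000)
def distance(seq1: str, seq2: str) -> int:
    # Order the strings so the shorter one indexes the columns (smaller rows)
    if len(seq2) > len(seq1):
        seq1, seq2 = seq2, seq1
    # Early termination: strings whose length difference exceeds 70% of the
    # longer length are treated as maximally different
    if len(seq1) - len(seq2) > 0.7 * len(seq1):
        return len(seq1)
    # Very short strings: recursion avoids the row-allocation overhead
    if len(seq1) <= 3:
        return _recursive_distance(seq1, seq2)
    # General case: the two-row dynamic programming core (two_row_distance above)
    return two_row_distance(seq1, seq2)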

Complete Usage Example

import time
from levenshtein import Levenshtein
 
# Test strings
string1 = "natural language processing"
string2 = "neural language processing"
 
# Calculate distance and similarity
distance = Levenshtein.distance(string1, string2)
similarity = Levenshtein.ratio(string1, string2)
 
print(f"Strings:\n1: '{string1}'\n2: '{string2}'")
print(f"Levenshtein distance: {distance}")
print(f"Similarity ratio: {similarity:.4f}")
 
# Example output:
# Strings:
# 1: 'natural language processing'
# 2: 'neural language processing'
# Levenshtein distance: 2
# Similarity ratio: 0.9259
 
# Demonstrate the caching benefit
pair = ("test string one", "test string two")

# Cold timing: clear the cache before every call so each ratio is recomputed
start = time.time()
for _ in range(1000):
    Levenshtein.ratio.cache_clear()
    Levenshtein.ratio(*pair)
uncached_time = time.time() - start

# Warm timing: after the first call, every repeat is served from the LRU cache
start = time.time()
for _ in range(1000):
    Levenshtein.ratio(*pair)
cached_time = time.time() - start

print("\nPerformance comparison:")
print(f"Without cache: {uncached_time:.4f}s")
print(f"With cache: {cached_time:.4f}s")
print(f"Speedup: {uncached_time / cached_time:.1f}x faster")

Performance Considerations

  • Time Complexity: O(m×n) where m and n are the lengths of the input strings
  • Space Complexity: O(min(m,n)) due to two-row optimization (not full matrix)
  • Cache Size: 20,000 entries for both distance() and ratio() methods
  • Early Termination: Strings with >70% length difference return immediately
  • Short String Optimization: Strings ≤ 3 characters use faster recursive approach
  • Memory Efficiency: The shorter string indexes the columns, so the two working rows stay small

Performance Tips

  1. Use caching effectively: Repeated comparisons are significantly faster
  2. Pre-normalize strings: Convert to lowercase once before multiple comparisons (see the sketch after these tips)
  3. Batch similar operations: Cache works best with repeated string pairs
  4. Clear cache periodically: Use Levenshtein.ratio.cache_clear() if memory is a concern
  5. Consider alternatives for very long strings: For strings >1000 characters, approximate algorithms may be faster
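
As an illustration of tip 2, the snippet below lowercases the inputs once, outside the comparison loop, so later comparisons of the same normalized pair hit the cache; the query and candidate strings are made up for the example:

from levenshtein import Levenshtein

query = "Machine Learning"
candidates = ["machine learning", "deep learning", "machine translation"]

# Normalize once, outside the comparison loop, so every cached entry
# is keyed on the same lowercase form
normalized_query = query.lower()
normalized_candidates = [c.lower() for c in candidates]

scores = {
    original: Levenshtein.ratio(normalized_query, normalized)
    for original, normalized in zip(candidates, normalized_candidates)
}
best_match = max(scores, key=scores.get)
print(best_match, round(scores[best_match], 4))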

Dependencies

The Levenshtein class relies on:

  • functools: For LRU cache decorator to optimize repeated calculations

Applications

Levenshtein distance is commonly used in:

  • Spell checking and correction
  • DNA sequence alignment
  • Plagiarism detection
  • Fuzzy string matching
  • Natural language processing
  • Record linkage and data deduplication
