Levenshtein
The Levenshtein class provides utilities for calculating edit distances and similarity ratios between strings using the Levenshtein algorithm.
Info: This documentation provides interactive code views for each method. Click on a function name to view its implementation.
Class Overview
The Levenshtein class offers methods to measure the difference between two strings and calculate their similarity.
Static Methods
Usage Guide
Distance Calculation
The Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.
Example:
Similarity Ratio
The similarity ratio is a normalized measure between 0 and 1, where 1 means the strings are identical and 0 means they are completely different.
Example:
Algorithm Explanation
The Levenshtein distance algorithm calculates the minimum edit distance between two strings using dynamic programming. This implementation includes several optimizations:
Standard Algorithm
- Initialize two arrays (instead of full matrix for memory efficiency)
- The
previous_rowstores distances for the previous character of seq1 - The
current_rowstores distances being calculated for current character - For each position, calculate the minimum cost from three operations:
- Deletion: Value from cell above + 1
- Insertion: Value from cell to the left + 1
- Substitution/Match: Value from diagonal + cost (0 if match, 1 if different)
- After processing all characters, the last value contains the final distance
Optimizations
Memory Efficiency: Uses only two rows instead of full matrix
- Reduces space complexity from O(m×n) to O(n)
- Swaps rows after each iteration to reuse memory
Early Termination: Checks if strings are too different before computing
- If length difference > 70% of max length, returns max length immediately
- Avoids unnecessary computation for very dissimilar strings
String Length Optimization: Always processes shorter string first
- Ensures the outer loop is smaller
- Minimizes memory usage and iterations
Short String Optimization: Uses recursive approach for strings ≤ 3 characters
- Simple recursive method is faster for very short strings
- Avoids array allocation overhead
LRU Caching: Both distance() and ratio() methods use @lru_cache
- Cache size: 20,000 entries
- Dramatically speeds up repeated comparisons
- Useful when comparing many keywords against the same text
Complete Usage Example
Performance Considerations
- Time Complexity: O(m×n) where m and n are the lengths of the input strings
- Space Complexity: O(min(m,n)) due to two-row optimization (not full matrix)
- Cache Size: 20,000 entries for both
distance()andratio()methods - Early Termination: Strings with >70% length difference return immediately
- Short String Optimization: Strings ≤ 3 characters use faster recursive approach
- Memory Efficiency: Always processes shorter string as outer loop
Performance Tips
- Use caching effectively: Repeated comparisons are significantly faster
- Pre-normalize strings: Convert to lowercase once before multiple comparisons
- Batch similar operations: Cache works best with repeated string pairs
- Clear cache periodically: Use
Levenshtein.ratio.cache_clear()if memory is a concern - Consider alternatives for very long strings: For strings >1000 characters, approximate algorithms may be faster
Dependencies
The Levenshtein class relies on:
functools: For LRU cache decorator to optimize repeated calculations
Applications
Levenshtein distance is commonly used in:
- Spell checking and correction
- DNA sequence alignment
- Plagiarism detection
- Fuzzy string matching
- Natural language processing
- Record linkage and data deduplication