Getting Started
Quick Start Guide
Get YAKE! up and running in less than 5 minutes. Follow these simple steps to start extracting keywords from your texts.
📦 Installing YAKE!
Installation
YAKE! requires Python 3.7+ and can be installed directly from GitHub.
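A typical install command, assuming the repository is the LIAAD/yake GitHub project, would be:

```shell
# Install YAKE! directly from GitHub (requires Python 3.7+)
pip install git+https://github.com/LIAAD/yake
```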
💡 Optional: Lemmatization Support
To use lemmatization features (aggregate keyword variations like "tree/trees"), install the optional dependencies:
```bash
uv pip install yake[lemmatization]
```

Then download the required language models:

```bash
# For spaCy (recommended)
python -m spacy download en_core_web_sm

# For NLTK (alternative)
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"
```

🚀 Usage (Command Line)
Basic Command Structure
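A minimal invocation, assuming the package installs a `yake` console command, combines an input flag with the options listed in the table below:

```shell
# Inline text (single quotes required), English, trigrams, top 10 keywords
yake -ti 'Sources of competitive advantage in keyword extraction' -l en -n 3 -t 10
```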
Available Options
| Option | Type | Description |
|---|---|---|
| -ti, --text_input | TEXT | Input text (must be surrounded by single quotes) |
| -i, --input_file | TEXT | Path to input file |
| -l, --language | TEXT | Language code (e.g., 'en', 'pt', 'es') |
| -n, --ngram-size | INTEGER | Maximum n-gram size (default: 3) |
| -t, --top | INTEGER | Number of keywords to extract (default: 20) |
| -df, --dedup-func | CHOICE | Deduplication method: leve, jaro, or seqm |
| -dl, --dedup-lim | FLOAT | Deduplication threshold (0.0-1.0) |
| --lemmatize | FLAG | Enable lemmatization (default: False) |
| --lemma-aggregation | CHOICE | Aggregation method: min, mean, max, or harmonic (default: min) |
| --lemmatizer | CHOICE | Backend: spacy or nltk (default: spacy) |
| -v, --verbose | FLAG | Show detailed scores |
💡 Example Commands
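A few sketches built from the options above (assuming a `yake` console command; `document.txt` is a hypothetical input file):

```shell
# Extract from a file, Portuguese, bigrams at most
yake -i document.txt -l pt -n 2 -t 10

# Inline text with Jaro deduplication and detailed scores
yake -ti 'machine learning algorithms for text mining' -df jaro -dl 0.8 -v

# Aggregate morphological variants ("tree"/"trees") via lemmatization
yake -ti 'the tree and the trees in the forest' --lemmatize --lemma-aggregation min --lemmatizer spacy
```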
🔄 Keyword Deduplication Methods
Why Deduplication?
YAKE may extract similar keywords (e.g., "machine learning" and "machine learning algorithm"). Deduplication merges similar keywords to avoid redundancy.
YAKE uses three methods to compute string similarity during keyword deduplication:
| Method | CLI value | How it works | Similarity computed as |
|---|---|---|---|
| Levenshtein | leve | Measures edit distance between strings — operations needed to transform one string into another. | 1 - distance / max_len |
| Jaro | jaro | Measures similarity based on matching characters and their relative positions. | jellyfish.jaro_similarity() |
| Sequence | seqm | Uses Python's built-in difflib to find matching blocks in strings. | 2 * matches / total_len |
1. leve — Levenshtein Similarity
- What it is: Measures the edit distance between two strings — how many operations (insertions, deletions, substitutions) are needed to turn one string into another.
- Formula used: `1 - distance / max_len`, where distance is the edit distance and max_len is the length of the longer string
- Best for: Very accurate for small changes (e.g., "house" vs "horse")
- Performance: Medium speed
2. jaro — Jaro Similarity
- What it is: Measures similarity based on matching characters and their relative positions
- Implementation: Uses the `jellyfish` library (`jellyfish.jaro_similarity()`)
- Best for: More tolerant of transpositions (e.g., "maria" vs "maira")
- Performance: Fast
3. seqm — SequenceMatcher Ratio
- What it is: Uses Python's built-in `difflib.SequenceMatcher`
- Formula: `2 * M / T`, where M is the number of matching characters and T is the total number of characters in both strings
- Best for: Good for detecting shared blocks in longer strings
- Performance: Fast
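The `leve` and `seqm` computations above can be reproduced with a short sketch (Levenshtein implemented by hand here for illustration; the `jaro` method needs the third-party `jellyfish` package, so it is omitted to keep the example self-contained):

```python
import difflib

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def leve_similarity(a: str, b: str) -> float:
    # 1 - distance / max_len, as in the leve method
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def seqm_similarity(a: str, b: str) -> float:
    # difflib's ratio: 2 * matches / total_len, as in the seqm method
    return difflib.SequenceMatcher(None, a, b).ratio()

print(leve_similarity("casa", "caso"))  # 0.75
print(seqm_similarity("casa", "caso"))  # 0.75
```

Both methods agree on this pair because a single substituted character is also the only non-matching block.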
Comparison Table
| Method | Based on | Best for | Performance |
|---|---|---|---|
| leve | Edit operations | Typos and small changes | Medium |
| jaro | Matching positions | Names and short strings with swaps | Fast |
| seqm | Common subsequences | General phrase similarity | Fast |
Practical Examples
| Compared Strings | leve | jaro | seqm |
|---|---|---|---|
| "casa" vs "caso" | 0.75 | 0.83 | 0.75 |
| "machine" vs "mecine" | 0.71 | 0.88 | 0.82 |
| "apple" vs "a pple" | 0.8 | 0.93 | 0.9 |
Recommendation: For general use with a good balance of speed and accuracy, seqm is a solid default (and it is YAKE's default). For stricter lexical similarity, choose leve. For names or when letter swaps are common, go with jaro.
Lemmatization
YAKE supports lemmatization to aggregate keywords with the same lemma, reducing redundancy from morphological variations like "tree"/"trees" or "run"/"running"/"runner".
When lemmatization is enabled, YAKE combines morphological variations and aggregates their scores using one of four methods:
1. min — Best Score (Default)
- What it is: Selects the keyword with the lowest (best) score from all morphological variations
- Formula: `final_score = min(score_tree, score_trees)`
- Best for: Most cases — selects the most relevant form
- Performance: Fast
2. mean — Average Score
- What it is: Averages scores across all morphological variations
- Formula: `final_score = sum(scores) / len(scores)`
- Best for: When all forms are equally important
- Performance: Fast
3. max — Worst Score
- What it is: Uses the highest (worst) score — most conservative approach
- Formula: `final_score = max(score_tree, score_trees)`
- Best for: Conservative filtering to ensure only high-quality keywords
- Performance: Fast
4. harmonic — Harmonic Mean
- What it is: Calculates the harmonic mean of all scores
- Formula: `final_score = n / sum(1/score for score in scores)`
- Best for: Balanced approach between min and mean
- Performance: Fast
Comparison Table
| Method | Based on | Best for | Example: "tree" (0.05) + "trees" (0.08) |
|---|---|---|---|
| min | Lowest score | General use - best variant | 0.05 |
| mean | Average | All forms equally important | 0.065 |
| max | Highest score | Conservative filtering | 0.08 |
| harmonic | Harmonic mean | Balanced combination | ~0.061 |
Practical Example
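The four aggregation methods can be sketched with the hypothetical scores from the table above (lower is better in YAKE):

```python
from statistics import harmonic_mean

# Hypothetical scores for two morphological variants of the same lemma
scores = {"tree": 0.05, "trees": 0.08}
values = list(scores.values())

aggregated = {
    "min": min(values),                 # best variant wins (default)
    "mean": sum(values) / len(values),  # plain average
    "max": max(values),                 # most conservative
    "harmonic": harmonic_mean(values),  # n / sum(1/s for s in values)
}

for method, score in aggregated.items():
    print(f"{method}: {score:.4f}")
```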
Installation: requires spaCy or NLTK; install the optional lemmatization dependencies and language models as shown in the Installation section above.
Recommendation: Use min (default) for general cases as it selects the most relevant form. Use mean or harmonic when you want to consider all variations equally.
Usage (Python)
How to use it from Python:
Simple usage using default parameters
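A minimal sketch (assumes the `yake` package is installed; `extract_keywords` returns a list of `(keyword, score)` pairs):

```python
import yake

text = "YAKE! is a light-weight unsupervised automatic keyword extraction method."

# Default extractor: English, n-grams up to 3, top 20 keywords
kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(text)

for keyword, score in keywords:
    print(keyword, score)  # lower score = more relevant
```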
Specifying custom parameters
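A sketch with custom parameters mirroring the CLI flags above; the keyword-argument names (`lan`, `n`, `dedupLim`, `dedupFunc`, `top`) follow the yake Python API, but check them against your installed version:

```python
import yake

text = "Deep learning models for natural language processing tasks."

kw_extractor = yake.KeywordExtractor(
    lan="en",          # language (-l)
    n=2,               # max n-gram size (-n)
    dedupLim=0.9,      # deduplication threshold (-dl)
    dedupFunc="seqm",  # deduplication method (-df)
    top=10,            # number of keywords (-t)
)
print(kw_extractor.extract_keywords(text))
```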
Output
The lower the score, the more relevant the keyword is.
Copyright ©2018-2026 INESC TEC. Distributed under an INESC TEC license.