-
-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix:voc_match #327
fix:voc_match #327
Conversation
handle variations in accents/punctuation in the .voc files this is common in written chat and in some postprocessing applied to STTs example, ocp pipeline was not matching "noticias" vs "notícias" (News in portuguese)
WalkthroughThe pull request introduces modifications to the Changes
Suggested Reviewers
Suggested Labels
Possibly related PRs
Poem
✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev #327 +/- ##
==========================================
- Coverage 49.27% 49.02% -0.26%
==========================================
Files 37 37
Lines 4426 4439 +13
==========================================
- Hits 2181 2176 -5
- Misses 2245 2263 +18 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
ovos_workshop/skills/ovos.py (1)
2774-2791
: Consider enhancing the utility function before moving to ovos-utils.The implementation looks good but could be improved before moving to ovos-utils:
- Consider adding input validation for None/non-string inputs
- Consider adding optional parameters to:
- Specify which punctuation marks to preserve (currently hardcoded to
{
and}
)- Control whether to preserve whitespace
- Consider adding unit tests before the move
Here's a suggested enhancement:
def remove_accents_and_punct(input_str: str) -> str: + if not isinstance(input_str, str): + raise TypeError("Input must be a string") + if not input_str: + return input_str + rm_chars = [c for c in string.punctuation if c not in ("{", "}")] # Normalize to NFD (Normalization Form Decomposed), which separates characters and diacritical marks nfkd_form = unicodedata.normalize('NFD', input_str) # Remove characters that are not ASCII letters or punctuation we want to keep return ''.join([char for char in nfkd_form if unicodedata.category(char) != 'Mn' and char not in rm_chars])
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
ovos_workshop/skills/ovos.py
(7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: unit_tests (3.9)
🔇 Additional comments (3)
ovos_workshop/skills/ovos.py (3)
7-7
: LGTM! Required imports added.The new imports for
string
andunicodedata
modules are appropriately placed with other standard library imports and are necessary for the new accent/punctuation normalization functionality.Also applies to: 11-11
Line range hint
2215-2252
: LGTM! Enhanced vocabulary matching with accent/punctuation handling.The changes improve the vocabulary matching by:
- Adding the
ensure_ascii
parameter with a sensible default (True) for backward compatibility- Applying accent/punctuation normalization to both the input and vocabulary terms when enabled
This enhancement will help match variations like "noticias" and "notícias" as intended.
Line range hint
2215-2252
: Verify the accent normalization with test cases.The implementation looks correct for handling accent variations in .voc files. To ensure robustness, consider testing with these cases:
- Words with multiple accents
- Words with punctuation and accents
- Template variables with accents inside curly braces
- Empty strings and edge cases
Also applies to: 2774-2791
✅ Verification successful
Implementation correctly handles accents and template variables
The accent normalization implementation is robust and correct:
- Uses Unicode NFD normalization for reliable accent handling
- Properly preserves template variables with
{
and}
- Correctly processes both utterances and vocabulary entries
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Create a test .voc file with accented words echo "notícias noticias {news}" > test.voc # Test the voc_match function with various inputs python3 -c ' from ovos_workshop.skills.ovos import remove_accents_and_punct test_cases = [ "notícias", "noticias", "notícias!", "!noticias!", "{nótícias}", "", "múltiple accénts" ] for test in test_cases: normalized = remove_accents_and_punct(test) print(f"Original: {test}") print(f"Normalized: {normalized}\n") 'Length of output: 792
Script:
#!/bin/bash # Find the implementation of remove_accents_and_punct rg -A 5 "def remove_accents_and_punct" # Look for test cases fd "test.*\.py" --exec rg -l "remove_accents_and_punct" # Check template variable handling rg "{.*?}" "*.voc" || true # Don't fail if no .voc files foundLength of output: 603
Script:
#!/bin/bash # Get the complete implementation of remove_accents_and_punct ast-grep --pattern 'def remove_accents_and_punct($params) { $$$ }' # Also check for any usage examples rg -B 2 -A 5 "remove_accents_and_punct\("Length of output: 1207
Script:
#!/bin/bash # Try to get the complete function implementation with a different pattern rg -A 15 "^def remove_accents_and_punct" # Also check if there are any related utility functions rg "import.*unicodedata|from.*unicodedata"Length of output: 1362
handle variations in accents/punctuation in the .voc files
this is common in written chat and in some postprocessing applied to STTs
example, ocp pipeline was not matching "noticias" vs "notícias" (News in portuguese)
Summary by CodeRabbit
New Features
Improvements
voc_match
method to provide more flexible string comparison