-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[scripts] handle frame_shift and utt2num_frames in utils/ #3323
Merged
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about copying the conf/ dir? Could perhaps just copy one of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually the conf dir thing is not important. But how many different options have you tested with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It cannot be sensibly merged, the features can be potentially extracted differently.
Short answer is quite exhaustively. Depends on the level of changes. For this script, basically tested the rejection if frame_shift is different, and that the file is actually copied when present somewhere (and d-oh, that initially failed, because I misspelled it). For the subset_data_dir, which I've heavily rearranged/rewritten, I had to test it exhaustively with all options (and actually used it already, but that does not cover much) -- this is a very commonly used one. Even with fake ctm files, which it also subsets, and with comments in them, because it keeps the comments; I also ran validate_data_dir on every subset it created. For validate_data_dir, I added an extra line by hand to utt2num_frames, then moved it out of order, then mangled an int in the second column in the third run. For simple cases where I simply had to added utt2num_frames besides utt2dir into the long list of files to be conditionally copied, basically did not do at all (do not remember if there are any changes this simple, tho).
I'm always very careful with stuff that does not have automated testing, really. Remember how I discovered the one-sample discrepancy in utt2dur extraction with and without segments the other day: I actually diffed the extracted files in the extraction scripts. All of them, and each without the segments and with a trivial segments file. Also, I've rewritten these extractions scripts such that when I did e.g.
diff make_mfcc.sh make_fbank.sh
, the only differences shown were related to the actual extractor changes. (This is where I resisted the urge to change $mfcc_dir in one and $fbank_dir in the other to $feats_dir. But maybe I should have). Thendiff make_mfcc.sh make_mfcc_pitch.sh
, and from there ondiff make_mfcc_pitch.sh make_everyother_pitch.sh
. Found some minor discrepancies this way. So... let's say I test our scrips like I would test something that has no automated test.If only we had tests for these! If you look at the front page of Kaldi on GitHub, where it says what is the main language for the project (near the long colored horizontal bar), you'll notice that GitHub thinks Kaldi is written in bash. I was thinking about writing at least some tests, for most commonly used scripts at the very least, but this is an immense task. Just try to think of exhaustive test cases for validate_data_dir...
I quickly looked into bash testing frameworks, they'd help, but the majority of a test it still creating test data and validating it, and this is all manual. If we only had two of me...