Byte-Pair Encoding #26

wwerkk · 2023-05-20T18:40:40Z

It could help a lot to detect commonly occuring sub-sequences and replace them with a single token.

wwerkk · 2023-05-20T18:46:12Z

could also help with #27

wwerkk · 2023-05-27T17:03:43Z

Test implementation using a toy dataset suggests it is an option, but two new problems come to mind:

in picking from the most common sub-sequences that the used BPE implementation returns.
This is fairly easy to solve by ensuring the replacement function starts doing so with the sub-sequence with highest number of occurrences, then goes to the second most common until there is no frames left that match any sub-sequence in the vocabulary.
When tested with real world data, it would often be the case that there is no repeating label sub-sequences. This suggests a bigger problem within the pipeline, likely spread across all three stages of segmentation/analysis/classification and is unlikely to be resolved at this point.

wwerkk added the enhancement New feature or request label May 20, 2023

Provide feedback