Heuristic sortcriteria for discovering "unfavourable" filenames #198
Comments
Maybe a simpler algorithm to achieve the same effect: For each file in a cluster of duplicates, count the number of matching files whose basename is a subset of this file's basename:
String A is a subset of string B if B contains all of A's characters in the same order (in other words, A is a subsequence of B). This can be tested either by regex (eg B matches …) … Although I'm not sure if there are many cases where this gives different results to the current …
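A minimal sketch of the counting idea described above, in Python rather than rmlint's C. The function names `is_subsequence` and `subset_counts` are made up for illustration and are not part of rmlint; the check uses a plain iterator scan instead of a regex.

```python
import os


def is_subsequence(a: str, b: str) -> bool:
    """Return True if b contains all of a's characters in the same order."""
    remaining = iter(b)
    # `ch in remaining` consumes the iterator up to the first match,
    # so later characters of a can only match later positions of b.
    return all(ch in remaining for ch in a)


def subset_counts(paths):
    """Map each path to the number of other paths in the group whose
    basename is a subsequence of this path's basename."""
    names = [os.path.basename(p) for p in paths]
    counts = {}
    for i, (path, name) in enumerate(zip(paths, names)):
        counts[path] = sum(
            1
            for j, other in enumerate(names)
            if j != i and is_subsequence(other, name)
        )
    return counts
```

A higher count would mark a name as more "derived" (eg a copy of a copy). Note that this is quadratic in the size of the duplicate group.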
... continuing the discussion, a related use case would be to discriminate between auto-numbered files and human-readable names:
Edit: maybe just count the number of alpha characters before the first period in the basename, eg
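The alpha-character count suggested above could look roughly like this in Python. The name `alpha_count` is hypothetical, and "alpha" is assumed here to mean whatever `str.isalpha` accepts.

```python
import os


def alpha_count(path: str) -> int:
    """Count alphabetic characters before the first period in the basename."""
    stem = os.path.basename(path).split(".", 1)[0]
    return sum(ch.isalpha() for ch in stem)
```

With this, an auto-numbered name such as `0001.jpg` scores 0, while a descriptive name scores higher, so files could be ranked by this count within a duplicate group.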
The subset approach might be a bit hard to implement and might be expensive for large groups, and it won't work for files like … Other tools like jdupes "solve" this problem by applying a numerical sort, so that shorter names and lower numeric values sort before others. An insane idea would be to check which filename looks more like English (one might also add other languages) by checking the bigrams in the name. But that's really …
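For reference, a numerical sort of the kind mentioned above can be sketched in a few lines of Python. This is a generic "natural sort" key, not jdupes' actual implementation, and it only covers the numeric part (it does not by itself put shorter names first). The name `numeric_sort_key` is made up for this example.

```python
import re


def numeric_sort_key(name: str):
    """Split a name into text and digit runs so digit runs compare as integers."""
    # re.split with a capture group alternates text, digits, text, ...
    # so odd-indexed parts are always the captured digit runs.
    parts = re.split(r"([0-9]+)", name)
    return [int(part) if i % 2 else part for i, part in enumerate(parts)]
```

Sorting with `sorted(names, key=numeric_sort_key)` places `img2.jpg` before `img10.jpg`, which a plain string sort would not.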
After thinking about this a bit on the train today, I will probably also close this.
A new sortcriteria might be introduced that gives each path a score based on the 'look' of a path.
Maybe an example might make this clear:
If we assume the sortcriteria is named f, rmlint would need to calculate the score of each path once and store it in RmFile (or maybe an additional hash table to save memory for the base case).
The calculation would look like this (it's kind of obvious, just to be clear):
(1) before the extension)
The remaining problem is finding a good list of bad patterns to search for.
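To make the scoring idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the pattern list, the weights, and the names `BAD_PATTERNS` and `path_score` are hypothetical, and rmlint itself is written in C and would store the score in RmFile instead of a cache. Only the "(1) before the extension" case is taken from the discussion above.

```python
import os
import re
from functools import lru_cache

# Hypothetical list of (pattern, penalty) pairs; finding a good list
# is exactly the open problem, so treat these as placeholders.
BAD_PATTERNS = [
    (re.compile(r"\(\d+\)\s*$"), 1),              # "(1)" before the extension
    (re.compile(r"^copy of ", re.IGNORECASE), 1),  # assumed copy prefix
]


@lru_cache(maxsize=None)
def path_score(path: str) -> int:
    """Sum the penalties of all bad patterns found in the basename
    (extension removed). Cached so each path is scored only once."""
    stem, _ext = os.path.splitext(os.path.basename(path))
    return sum(penalty for pattern, penalty in BAD_PATTERNS if pattern.search(stem))
```

Files in a duplicate group could then be ordered with `sorted(paths, key=path_score)`, so the path with the fewest bad patterns comes first.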