Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heuristic sortcriteria for discovering "unfavourable" filenames #198

Closed
sahib opened this issue Nov 15, 2016 · 4 comments
Closed

Heuristic sortcriteria for discovering "unfavourable" filenames #198

sahib opened this issue Nov 15, 2016 · 4 comments

Comments

@sahib
Copy link
Owner

sahib commented Nov 15, 2016

A new sortcriteria might be introduced, that gives each path a score based on the 'look' of a path.
Maybe an example might make this clear:

# In a set of these files
file.png
file (1).png
file_1.png
file.png.bak

# ...it's likely that the user wants to pick file.png
# as original file, since it probably was there first.

If we assume the sortcriteria is named f, rmlint would need to calculate the score of each path
once and store it in RmFile (or maybe an additional hash table to save memory for the base case).
The calculation would look like this (it's kind of obvious, just to be clear):

  • A neutral file has a score of 0.
  • A pattern indicating a bad filename increases the score (e.g. a (1) before the extension)
  • Certain patterns might reduce the score again, although I can't imagine one currently.

The remaining problem is finding a good list of bad patterns to search for.

@SeeSpotRun
Copy link
Collaborator

Maybe a simpler algorithm to achieve the same effect:

For each file in a cluster of duplicates, count the number of matching files whose basename is a subset of this file's basename:

Basename     Subset Count
file.png         0
file (1).png     1   # has 'file.png' as a subset
file_1.png       1   # has 'file.png' as a subset
file.png.bak     1   # has 'file.png' as a subset
file_1.png.bak   2   # has both 'file.png' and 'file.png.bak' as a subset

String A is a subset of string B if B contains all of A's characters in the same order. Can test either by regex (eg B matches '*f*i*l*e*.*p*n*g*') or by iterating over the two strings matching characters one at a time.

Although I'm not sure if there are many cases where this gives different results to the current --rank-by l criterion (keep shortest basename).

@SeeSpotRun
Copy link
Collaborator

SeeSpotRun commented Nov 15, 2016

... continuing the discussion, a related usecase would be to discriminate between auto-numbered files and human-readable names:

DSC0123.JPG
Spot with his Xmas tree.jpg
Spot_Christmas_2015.jpg

Edit: maybe just count number of alpha characters before the first period in the basename, eg --rank-by Hl would rank by most alpha characters before the first period in the basename, then by shortest basename, so would rank the following example as follows:

Spot_Christmas_2015.jpg      # 13/23
Spot_Christmas_2015(1).jpg   # 13/26
Spot_Christmas_2015.jpg.bak  # 13/27
DSC0123.jpg                  #  3/11

@sahib
Copy link
Owner Author

sahib commented Nov 15, 2016

The subset approach might be a bit hard to implement and might be expensive for large groups and it won't work for files like DSC0123.JPG and human readable names. I really like the alphanumeric approach though. Maybe just take the ratio alphanumeric(s - extension) / len(s) and take that as base of the score? One could add additionally weight in the total length of the path (assuming the user
usually wants to keep the less nested copy). A quick regex filter might detect things like (1) and the like, causing a worse score for this path. I might tinker around a bit once I have the time.

Other tools like jdupes "solve" this problem by applying a numerical sort, in order to sort shorter files and lower numeric values before others.

An insane idea would be to check which filename looks more like english (one might also add other languages) by checking the bigrams in the name. But that's really
crazy talk and should just remind myself to stick to simple solutions.

@sahib
Copy link
Owner Author

sahib commented Mar 20, 2017

After thinking about this a bit in the train today, I will probably also close this.
This enters the land of probalibistic user interfaces, where I don't like to see a very deterministic tool going.
There is simply no way to get this 100% right and since it depends a lot on the usecase, we should leave this to the user in my opinion.

@sahib sahib closed this as completed Mar 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants