Heuristic sortcriteria for discovering "unfavourable" filenames #198

sahib · 2016-11-15T00:49:35Z

A new sortcriteria might be introduced, that gives each path a score based on the 'look' of a path.
Maybe an example might make this clear:

# In a set of these files
file.png
file (1).png
file_1.png
file.png.bak

# ...it's likely that the user wants to pick file.png
# as original file, since it probably was there first.

If we assume the sortcriteria is named f, rmlint would need to calculate the score of each path
once and store it in RmFile (or maybe an additional hash table to save memory for the base case).
The calculation would look like this (it's kind of obvious, just to be clear):

A neutral file has a score of 0.
A pattern indicating a bad filename increases the score (e.g. a (1) before the extension)
Certain patterns might reduce the score again, although I can't imagine one currently.

The remaining problem is finding a good list of bad patterns to search for.

The text was updated successfully, but these errors were encountered:

SeeSpotRun · 2016-11-15T01:46:54Z

Maybe a simpler algorithm to achieve the same effect:

For each file in a cluster of duplicates, count the number of matching files whose basename is a subset of this file's basename:

Basename     Subset Count
file.png         0
file (1).png     1   # has 'file.png' as a subset
file_1.png       1   # has 'file.png' as a subset
file.png.bak     1   # has 'file.png' as a subset
file_1.png.bak   2   # has both 'file.png' and 'file.png.bak' as a subset

String A is a subset of string B if B contains all of A's characters in the same order. Can test either by regex (eg B matches '*f*i*l*e*.*p*n*g*') or by iterating over the two strings matching characters one at a time.

Although I'm not sure if there are many cases where this gives different results to the current --rank-by l criterion (keep shortest basename).

SeeSpotRun · 2016-11-15T02:03:10Z

... continuing the discussion, a related usecase would be to discriminate between auto-numbered files and human-readable names:

DSC0123.JPG
Spot with his Xmas tree.jpg
Spot_Christmas_2015.jpg

Edit: maybe just count number of alpha characters before the first period in the basename, eg --rank-by Hl would rank by most alpha characters before the first period in the basename, then by shortest basename, so would rank the following example as follows:

Spot_Christmas_2015.jpg      # 13/23
Spot_Christmas_2015(1).jpg   # 13/26
Spot_Christmas_2015.jpg.bak  # 13/27
DSC0123.jpg                  #  3/11

sahib · 2016-11-15T20:12:57Z

The subset approach might be a bit hard to implement and might be expensive for large groups and it won't work for files like DSC0123.JPG and human readable names. I really like the alphanumeric approach though. Maybe just take the ratio alphanumeric(s - extension) / len(s) and take that as base of the score? One could add additionally weight in the total length of the path (assuming the user
usually wants to keep the less nested copy). A quick regex filter might detect things like (1) and the like, causing a worse score for this path. I might tinker around a bit once I have the time.

Other tools like jdupes "solve" this problem by applying a numerical sort, in order to sort shorter files and lower numeric values before others.

An insane idea would be to check which filename looks more like english (one might also add other languages) by checking the bigrams in the name. But that's really
crazy talk and should just remind myself to stick to simple solutions.

sahib · 2017-03-20T09:47:27Z

After thinking about this a bit in the train today, I will probably also close this.
This enters the land of probalibistic user interfaces, where I don't like to see a very deterministic tool going.
There is simply no way to get this 100% right and since it depends a lot on the usecase, we should leave this to the user in my opinion.

sahib added the Discussion label Nov 15, 2016

Awerick mentioned this issue Nov 16, 2016

New sort criteria (for original detection): "maximize link count" #196

Closed

sahib closed this as completed Mar 20, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Heuristic sortcriteria for discovering "unfavourable" filenames #198

Heuristic sortcriteria for discovering "unfavourable" filenames #198

sahib commented Nov 15, 2016

SeeSpotRun commented Nov 15, 2016

SeeSpotRun commented Nov 15, 2016 •

edited

Loading

sahib commented Nov 15, 2016

sahib commented Mar 20, 2017

Heuristic sortcriteria for discovering "unfavourable" filenames #198

Heuristic sortcriteria for discovering "unfavourable" filenames #198

Comments

sahib commented Nov 15, 2016

SeeSpotRun commented Nov 15, 2016

SeeSpotRun commented Nov 15, 2016 • edited Loading

sahib commented Nov 15, 2016

sahib commented Mar 20, 2017

SeeSpotRun commented Nov 15, 2016 •

edited

Loading