New sort criteria (for original detection): "maximize link count" #196
Hey @Awerick, the simple approach sounds very reasonable to me and it should be easy to implement. The advanced approach is probably out of the scope of [...]
I barely tested it, but I think e4a99c0 does what you want.

Edit: Use this to test the feature:

$ rmlint -S H A/ B/
Thanks for the feedback and the fast commit! Regarding the advanced approach: [...] However, I think it could be implemented without the need to traverse any other paths:
If I am not mistaken... ;-)
Ah, it seems I misunderstood your approach then. Thanks for the detailed explanation; looks promising.
If I understood you correctly (that's never guaranteed...), 648c0ac should include the advanced approach as `-S O`:

$ mkdir links/a links/b -p
$ echo x > links/a/foo
$ ln links/a/foo links/b/foo-from-a
$ echo x > links/b/foo
$ ln links/b/foo links/b/foo-link-1
$ ln links/b/foo links/b/foo-link-2
$ tree links
links
├── a
│ └── foo
└── b
├── foo
├── foo-from-a
├── foo-link-1
└── foo-link-2
2 directories, 5 files
$ ./rmlint -S O links/b
# Duplicate(s):
ls '/home/sahib/dev/rmlint/links/b/foo-from-a' # link to highest outlier.
rm '/home/sahib/dev/rmlint/links/b/foo'
rm '/home/sahib/dev/rmlint/links/b/foo-link-1'
rm '/home/sahib/dev/rmlint/links/b/foo-link-2'
[...]
$ ./rmlint -S OH links
# Duplicate(s):
ls '/home/sahib/dev/rmlint/links/b/foo' # link to highest link count (no outliers)
rm '/home/sahib/dev/rmlint/links/b/foo-link-1'
rm '/home/sahib/dev/rmlint/links/b/foo-link-2'
rm '/home/sahib/dev/rmlint/links/a/foo'
rm '/home/sahib/dev/rmlint/links/b/foo-from-a'
[...]
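One way to sanity-check which file `-S O` should keep is to inspect the link counts in the `links` tree from the session above directly (a sketch of mine, assuming GNU coreutils `stat`):

```shell
# In the tree built above, foo-from-a shares its inode with links/a/foo,
# so seen from links/b it is the file with an "outside" hardlink:
stat -c '%h %n' links/b/foo links/b/foo-from-a
# links/b/foo has link count 3 (foo, foo-link-1, foo-link-2);
# links/b/foo-from-a has link count 2, one of them outside links/b.
```

This matches the two runs: restricted to `links/b`, the outside link makes `foo-from-a` the outlier to keep; over the whole tree, `foo` wins on raw link count.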
Wow, you are quick – and everything seems to behave correctly! (I have some more considerations for documenting/describing the behavior that I will comment on Monday or so.)

When looking at your first example, I fooled myself: I thought [...]. In the example you used the option [...]. Interesting: I intuitively thought that in [...].

To understand and test the behavior of the new options I came up with a setup where the options [...]:

- Setup overview
- Create setup
- Show setup
- Test four different sort criteria combinations for rmlint
- Result of the test

Great: When I run the four tests, they all returned what they were supposed to return! :-)
Good to hear it fits your usecase. I allowed myself to copy your testcase as an automated one (see 16254c4). I'm still wondering if OH would be a good default if any link option is given that includes hardlinks. Thanks for your detailed and well-thought-out responses, I'm really not used to that.
I'm glad if the testcase is usable for your testsuite! Using [...] Well, this is my first GitHub issue, so I wanted to be as understandable as possible – if I comment more regularly in the future, my comments might become a little bit more rough ;-)
As promised, some thoughts and suggestions for the manual/documentation: I think the manual now describes the behavior of [...]. Maybe some of the following remarks could be helpful in the manual to understand the effects of [...]:

(In case of the [...]) Some details on point 2: [...]

For dealing with hardlinked files in general (even without criteria [...])
I tried to reword some things in the manpage. See 3584976.

I changed the default to [...].

I added it to the [...].

I extended the [...].

Actually, [...]:

$ echo xxx > hello
$ ln hello world
$ rmlint hello world
...
==> In total 2 files, whereof 1 are duplicates in 1 groups.
==> This equals 0 B of duplicates which could be removed.

However, this does not work (yet) when having outside hardlinks:

$ echo xxx > hello
$ echo xxx > world
$ ln hello hello_link
$ ln world world_link
$ rmlint *_link
==> In total 2 files, whereof 1 are duplicates in 1 groups.
==> This equals 4 B of duplicates which could be removed.

This would require us to take the link count of each removed file into account.
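The accounting described above could look roughly like this (a sketch, not rmlint's actual implementation; the script name and logic are mine, and it assumes GNU `stat` and bash 4 associative arrays). The idea: a removed duplicate only frees space once every link to its inode is inside the removed set:

```shell
#!/usr/bin/env bash
# reclaim.sh -- estimate bytes actually freed by deleting the given files,
# counting a file's size only when the last link to its inode is removed.
declare -A seen
reclaimable=0
for f in "$@"; do
    inode=$(stat -c %i "$f")
    nlink=$(stat -c %h "$f")
    size=$(stat -c %s "$f")
    seen[$inode]=$(( ${seen[$inode]:-0} + 1 ))
    # Only when all nlink links are in the removed set is the data freed.
    if [ "${seen[$inode]}" -eq "$nlink" ]; then
        reclaimable=$(( reclaimable + size ))
    fi
done
echo "$reclaimable"
```

With the second example above, `bash reclaim.sh hello_link world_link` would report 0 bytes, because `hello` and `world` still reference the data from outside the removed set.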
Thanks again for your quick answers!

Is this not a little bit radical? I think [...]. Maybe I misunderstood your previous comment ("I'm still wondering if OH would be a good default if any link option is given that includes hardlinks."). My answer ("I am also not sure about it...") only applied to the hardlink handler:

Ah, very interesting – thanks for digging this up!

Oh yes! Probably I interpreted the result incorrectly when I tested it. A rough idea how outside hardlinks could be taken into account: every duplicate that has at least one outside hardlink ([...])
Maybe. 😄

No, you didn't misunderstand that, I just made up my mind afterwards. I'm still thinking about it, but I kinda like it. The default sort criteria is a heuristic anyways; you can always [...].

The developer's perspective is always a bit biased and different from a user's perspective.

True. 4cc9fa7 implements this approach. I'm not sure either if this approach handles all edge cases (well, I can't really think of one, but that's the problem with edge cases...), but it should be better than before.
Ah ok :-) Well, if the average user will benefit from the new default, this is certainly good. I just thought [...]. But for changes to the defaults, feedback from other users/developers (possibly less biased) would probably be valuable.

Phew! All those different hardlink constellations, sorting criteria, handler modes and discussions of the last days can make one slightly dizzy %)
Better don't open an issue including symbolic links then. 😄 (j/k) I think this issue reached its end of life. Closing now.
For ensuring minimal/decreased disk usage, I would like to request a new sort criterion for the `-S`/`--rank-by` option: "maximize link count". This might decrease the disk usage when running `rmlint` on duplicates which are already hardlinked from another path (see examples below). (The tool `hardlink` (for Debian-based distros) provides this capability somehow with the `-m`/`--maximize` option, see its manpage.)

Simple example

Consider a path `A` that contains a file `foo` (at inode 1001). Path `B` contains the same file `foo` two times: one time hardlinked to the file `foo` at path `A` (i.e. same inode number 1001 and link count 2) and one time separately (say inode 1002, link count 1).

Now consider `rmlint -c sh:hardlink` is executed on path `B`. (I consider `-c sh:hardlink`, i.e. the mode in which duplicates are replaced with hardlinks; however, the other use cases, e.g. removing duplicates, would probably benefit as well.)

Current behavior: It might happen that the separate file (at inode 1002) is treated as the original and both files are hardlinked to it. This is not optimal because in the end the data still exists two times: once for the file in path `A` and once for the files in path `B`.

With the requested criterion: It would be better to treat the file with the highest link count as the original (which is the file at inode 1001), so that all files get hardlinked to this one. By this the data only exists once on the hardware.
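The simple example can be reproduced in a shell like this (a sketch; the file names and the `stat` check are mine, and real inode numbers will differ from 1001/1002):

```shell
mkdir -p A B
echo data > A/foo           # the "inode 1001" file
ln A/foo B/foo-linked       # hardlink into B: link count becomes 2
echo data > B/foo-own       # separate copy: its own inode, link count 1
stat -c '%h %n' B/foo-linked B/foo-own
# Keeping B/foo-linked (higher link count) as the original means the data
# ends up stored only once; keeping B/foo-own stores it twice.
```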
Advanced Approach

The approach above to "keep the highest link count" may still not be optimal in all cases; a better approach would be to "maximize the link count". Advanced example: `B` contains the file `foo` four times: one time hardlinked to the file `foo` at path `A` (link count 2) and three times hardlinked with each other (link count 3).

If all files were hardlinked to the "highest link count" (as suggested above), the data would still exist two times: once for the file in path `A` (link count 1) and once for the files in path `B` (link count 4).

With the advanced approach this will be avoided: link the files in a way that, in the end, the link count is maximized. In our example: the data only exists once for all files in paths `A` and `B` (link count 5).

(Possible workaround with the current `rmlint` version: if all the files with the maximum link count are somehow guaranteed to be in one path, the flag separator `//` can be used to tag these as originals.)

Sorry for the long text, but I hope it made the desired behavior (and the benefits) clear.
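The advanced example can be sketched the same way (file names are mine); the last two lines show what "maximize the link count" would achieve, by relinking the smaller cluster onto the larger one:

```shell
mkdir -p A B
echo data > A/foo
ln A/foo B/foo-a            # cluster 1: A/foo + B/foo-a (link count 2)
echo data > B/foo-1
ln B/foo-1 B/foo-2
ln B/foo-1 B/foo-3          # cluster 2: three mutual links (link count 3)
stat -c %i A/foo B/foo-* | sort -u | wc -l   # 2 inodes: data stored twice
# "Maximize link count": move cluster 2 onto cluster 1's inode.
for f in B/foo-1 B/foo-2 B/foo-3; do ln -f A/foo "$f"; done
stat -c %h A/foo                             # link count is now 5
```

Hardlinking to the cluster-2 file instead (the naive "keep the highest link count") would leave `A/foo` stranded on its own inode, which is exactly the case the advanced approach avoids.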