diff --git a/.gitignore b/.gitignore index ba462df5..10510a06 100644 --- a/.gitignore +++ b/.gitignore @@ -15,9 +15,14 @@ docs/rmlint.1 .scons* .sconf* .rope* +*.pyc docs/_build __pycache__ +lib/config.h lib/formats/py.c -lib/formats/py.c +lib/formats/sh.c +uninstall- +gui/app/resources/app.gresource +*.a diff --git a/.travis.yml b/.travis.yml index a9427f98..4421adeb 100644 --- a/.travis.yml +++ b/.travis.yml @@ -1,16 +1,23 @@ language: c install: - sudo apt-get update - - sudo apt-get install python3-sphinx python3-nose gettext python3-setuptools valgrind - - sudo apt-get install libblkid-dev libelf-dev libglib2.0-dev libjson-glib-dev + - sudo apt-get install python3-sphinx python3-nose gettext python3-setuptools + - sudo apt-get install libblkid-dev libelf-dev libglib2.0-dev libjson-glib-dev - sudo easy_install3 pip - sudo /usr/local/bin/pip install sphinx_bootstrap_theme + compiler: - clang - gcc + notifications: email: - sahib@online.de - thomas_d_j@yahoo.com.au -script: scons VERBOSE=1 && scons config && export USE_VALGRIND=1 && PEDANTIC=1 PRINT_CMD=1 sudo nosetests3 -a '!slow' +script: + - scons VERBOSE=1 + - scons config + - export RM_TS_PRINT_CMD=1 + - export RM_TS_PEDANTIC=0 + - sudo -E nosetests3 -s -v -a '!slow' diff --git a/.version b/.version index 8eb764dd..bb22f4f5 100644 --- a/.version +++ b/.version @@ -1 +1 @@ -2.2.0 Dreary Dropbear +2.4.0 Myopic Micrathene diff --git a/CHANGELOG.md b/CHANGELOG.md index ca168eba..137c72e1 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,7 +4,91 @@ All notable changes to this project will be documented in this file. The format follows [keepachangelog.com]. Please stick to it. -## [2.3.0 (No name yet)] -- [unreleased] +## [2.4.0 Myopic Micrathene] -- 2015-10-25 + +### Fixed + +- ``rmlint`` should compile on Mac OSX now. +- Bugfix: Broken ``chown`` calls in sh script (thanks Shukrat Mukimov) +- Bugfix: memory corruption when specifying ``-T dd`` alone. 
+- Bugfix: Make ``-D`` and ``-k / -K`` play together nicely (thanks phiresky).
+- Smaller compile time troubles fixed.
+- Progressbar uses timeout-based redraws which leads to much smoother drawing
+  and less cpu footprint.
+- ``pretty`` formatter (default) now produces validly escaped commands.
+  It is still intended for visual output only. A note about this was
+  added.
+
+### Added
+
+- A fully working graphical user interface which is installed as a python module
+  by default (can be disabled via compile option, i.e. ``scons --without-gui``).
+  It can be started via ``rmlint --gui``.
+- Support for automatic deduplication on btrfs using ``BTRFS_IOC_FILE_EXTENT_SAME``.
+  The shell script will now contain calls to ``rmlint --btrfs $source $dest``
+  for duplicates on ``btrfs`` filesystems if the user specified ``-c sh:clone``.
+- Benchmark suite that will track the performance of rmlint from release to release.
+  This helps developers detect any speed regressions or improvements and is a tool
+  to help develop and validate optimization strategies.
+- Shell/Python-script now does more sanity checks before removing and can be told to
+  re-compare files byte-by-byte before removing them (``-p`` option when running
+  the ``.sh`` file).
+- Add a new ``--hash`` option so rmlint can be used as a very fast file hashing
+  utility, e.g. ``rmlint --hash`` works like ``sha1sum``, or ``rmlint --hash -d md5``
+  works like ``md5sum``. Also does sha256, sha512, murmur{128}, spooky{32,64,128},
+  city{128}.
+- ``--sort-by`` learned new keys: ``l`` (path length) and ``d`` (path depth).
+- New ``--unmatched-basename`` option only finds twins with differing basenames.
+- Smaller performance and memory optimisations in shredder.
+
+### Changed
+
+- ``-g`` now checks if there is already a ``sh`` and ``json`` formatter before
+  it adds one.
+- ``-PP`` now defaults to ``xxhash`` as hashing algorithm.
+- ``-o / --output`` learned to guess the formatter you want to use from the file ending.
+  For example ``-o /tmp/test.json`` will work like ``-o json:/tmp/test.json``.
+- JSON output contains ``rmlint`` version and revision now.
+- ``--replay`` learned to merge several json files.
+- Internal refactoring (credits go to Daniel) of the scheduler and hashing
+  library. The duplicate finding process has been split into separate modules.
+
+## [2.3.0 Ominous Oscar] -- 2015-06-15
+
+### Fixed
+
+- Compiles on Mac OSX now. See also: https://github.com/sahib/rmlint/issues/139
+- Fix a crash that happened with ``-e``.
+- Protect other lint than duplicates by ``-k`` or ``-K``.
+- ``chown`` in sh script fixed (was ``chmod`` by accident).
+
+### Added
+
+- ``--replay``: Re-output a previously written json file. Allow filtering
+  by using all other standard options (like size or directory filtering).
+- ``--sort-by``: Similar to ``-S``, but sorts groups of files. So showing
+  the group with the biggest size sucker is as easy as ``-y s``.
+
+### Changed
+
+- ``-S``'s long option is ``--rank-by`` now (previously ``--sortcriteria``).
+- ``-o`` can guess the formatter from the filename if given.
+- Remove some optimisations that gave no visible effect.
+- Simplified FIEMAP optimisation to reduce initial delay and reduce memory overhead
+- Improved hashing strategy for large disks (do repeated smaller sweeps across
+  the disk instead of incrementally hashing every file on the disk)
+
+## [2.2.1 Dreary Dropbear Bugfixes]
+
+### Fixed
+
+- Incorrect handling of -W, --no-with-color option
+- Handling of $PKG_CONFIG in SConstruct
+- Failure to build manpage
+- Various BSD compatibility issues
+- Nonstandard header sequence in modules using fts
+- Removed some unnecessary warnings
+
 ## [2.2.0 Dreary Dropbear] -- 2015-05-09
@@ -44,7 +128,7 @@ The format follows [keepachangelog.com]. Please stick to it.
   physical disk to enable fast reading without disk thrash.
The improved algorithm now increases the number of cpu threads used to hash the data as it is read in. Also an improved mutex strategy reduces the wait time - before the hash results can be processed. + before the hash results can be processed. Note the new threading strategy is particularly effective on the "paranoid" (byte-by-byte) file comparison method (option -pp), which is now almost as fast as the default (SHA1 hash) method. @@ -60,7 +144,7 @@ The format follows [keepachangelog.com]. Please stick to it. the core got slower very fast due to linear lookups. Fixed. - performance regression: No SSDs were detected due to two bugs. - commandline aborts also on non-fatal option misuses. -- Some statistic counts were updated wrong sometimes. +- Some statistic counts were updated wrong sometimes. - Fixes in treemerge to respect directories tagges as originals. - Ignore "evil" fs types like bindfs, nullfs completely. - Fix race in file tree traversal. @@ -100,7 +184,8 @@ The format follows [keepachangelog.com]. Please stick to it. Initial release of the rewrite. [unreleased]: https://github.com/sahib/rmlint/compare/master...develop -[2.2.0 Dreary Dropbear]: https://github.com/sahib/rmlint/compare/master...develop +[2.2.1 Dreary Dropbear Bugfixes]: https://github.com/sahib/rmlint/compare/master...develop +[2.2.0 Dreary Dropbear]: https://github.com/sahib/rmlint/releases/tag/v2.2.0 [2.1.0 Malnourished Molly]: https://github.com/sahib/rmlint/releases/tag/v2.1.0 [2.0.0 Personable Pidgeon]: https://github.com/sahib/rmlint/releases/tag/v2.0.0 [keepachangelog.com]: http://keepachangelog.com/ diff --git a/README.rst b/README.rst index 5e474f0a..21360445 100644 --- a/README.rst +++ b/README.rst @@ -1,6 +1,7 @@ ====== + .. 
image:: https://raw.githubusercontent.com/sahib/rmlint/develop/docs/_static/logo.png :align: center @@ -94,8 +95,8 @@ AUTHORS Here's a list of developers to blame: =================================== ============================= =========================================== -*Christopher Pahl* https://github.com/sahib 2010-2014 -*Daniel Thomas* https://github.com/SeeSpotRun 2014-2014 +*Christopher Pahl* https://github.com/sahib 2010-2015 +*Daniel Thomas* https://github.com/SeeSpotRun 2014-2015 =================================== ============================= =========================================== There are some other people that helped us of course. diff --git a/SConstruct b/SConstruct index f7fdcf66..dca0ec4a 100755 --- a/SConstruct +++ b/SConstruct @@ -72,6 +72,10 @@ def check_git_rev(context): rev = subprocess.check_output('git log --pretty=format:"%h" -n 1', shell=True) except subprocess.CalledProcessError: print('Unable to find git revision.') + except AttributeError: + # Patch for some special sandbox permission problems. 
+ # See https://github.com/sahib/rmlint/issues/143#issuecomment-139929733 + print('Not allowed.') rev = rev or 'unknown' conf.env['gitrev'] = rev @@ -85,7 +89,7 @@ def check_libelf(context): if GetOption('with_libelf') is False: rc = 0 - if rc and tests.CheckHeader(context, 'libelf.h'): + if rc and tests.CheckHeader(context, 'libelf.h', header="#include "): rc = 0 if rc and tests.CheckLib(context, ['libelf']): @@ -98,6 +102,19 @@ def check_libelf(context): return rc +def check_uname(context): + rc = 1 + + if rc and tests.CheckHeader(context, 'sys/utsname.h', header=""): + rc = 0 + + conf.env['HAVE_UNAME'] = rc + + context.did_show_result = True + context.Result(rc) + return rc + + def check_gettext(context): rc = 1 @@ -210,6 +227,22 @@ def check_sysctl(context): return rc +def check_posix_fadvise(context): + rc = 1 + + if tests.CheckDeclaration( + context, 'posix_fadvise', + includes='#include ' + ): + rc = 0 + + conf.env['HAVE_POSIX_FADVISE'] = rc + + context.did_show_result = True + context.Result(rc) + return rc + + def check_xattr(context): rc = 1 @@ -260,33 +293,29 @@ def check_c11(context): return rc -def check_sse42(context): - if GetOption('with_sse') is False: +def check_sqlite3(context): + rc = 1 + if tests.CheckHeader(context, 'sqlite3.h'): rc = 0 - else: - rc = 1 - - if tests.CheckDeclaration(context, '__SSE4_2__'): - rc = 0 - else: - conf.env.Prepend(CFLAGS=['-msse4.2']) - conf.env['HAVE_SSE42'] = rc + if tests.CheckLib(context, ['sqlite3']): + rc = 0 + conf.env['HAVE_SQLITE3'] = rc context.did_show_result = True context.Result(rc) return rc -def check_sqlite3(context): +def check_btrfs_h(context): rc = 1 - if tests.CheckHeader(context, 'sqlite3.h'): - rc = 0 - - if tests.CheckLib(context, ['sqlite3']): + if tests.CheckHeader( + context, 'linux/btrfs.h', + header='#include \n#include ' + ): rc = 0 - conf.env['HAVE_SQLITE3'] = rc + conf.env['HAVE_BTRFS_H'] = rc context.did_show_result = True context.Result(rc) return rc @@ -409,7 +438,7 @@ AddOption( 
action='store', metavar='DIR', help='libdir name (lib or lib64)' ) -for suffix in ['libelf', 'gettext', 'fiemap', 'blkid', 'json-glib']: +for suffix in ['libelf', 'gettext', 'fiemap', 'blkid', 'json-glib', 'gui']: AddOption( '--without-' + suffix, action='store_const', default=False, const=False, dest='with_' + suffix @@ -464,16 +493,18 @@ conf = Configure(env, custom_tests={ 'check_libelf': check_libelf, 'check_fiemap': check_fiemap, 'check_xattr': check_xattr, - 'check_sse42': check_sse42, 'check_sha512': check_sha512, 'check_blkid': check_blkid, 'check_sysctl': check_sysctl, + 'check_posix_fadvise': check_posix_fadvise, 'check_sys_block': check_sys_block, 'check_bigfiles': check_bigfiles, 'check_c11': check_c11, 'check_gettext': check_gettext, 'check_sqlite3': check_sqlite3, - 'check_linux_limits': check_linux_limits + 'check_linux_limits': check_linux_limits, + 'check_btrfs_h': check_btrfs_h, + 'check_uname': check_uname }) if not conf.CheckCC(): @@ -549,9 +580,6 @@ if 'clang' in os.path.basename(conf.env['CC']): conf.env.Append(CCFLAGS=['-fcolor-diagnostics']) # Colored warnings conf.env.Append(CCFLAGS=['-Qunused-arguments']) # Hide wrong messages -conf.env.Append(CCFLAGS=['-march=native']) -conf.check_sse42() - # Optional flags: conf.env.Append(CFLAGS=[ '-Wall', '-W', '-Wextra', @@ -567,9 +595,9 @@ env.ParseConfig(pkg_config + ' --cflags --libs ' + ' '.join(packages)) conf.env.Append(_LIBFLAGS=['-lm']) +conf.check_sysctl() conf.check_blkid() conf.check_sys_block() -conf.check_sysctl() conf.check_libelf() conf.check_fiemap() conf.check_xattr() @@ -578,6 +606,9 @@ conf.check_sha512() conf.check_gettext() conf.check_sqlite3() conf.check_linux_limits() +conf.check_posix_fadvise() +conf.check_btrfs_h() +conf.check_uname() if conf.env['HAVE_LIBELF']: conf.env.Append(_LIBFLAGS=['-lelf']) @@ -589,10 +620,14 @@ if conf.env['HAVE_SQLITE3']: env = conf.Finish() library = SConscript('lib/SConscript') -program = SConscript('src/SConscript', exports='library') 
-SConscript('tests/SConscript', exports='program') +programs = SConscript('src/SConscript', exports='library') +env.Default(library) + +SConscript('tests/SConscript', exports='programs') SConscript('po/SConscript') SConscript('docs/SConscript') +SConscript('gui/SConscript') + def build_tar_gz(target=None, source=None, env=None): tarball = 'rmlint-{a}.{b}.{c}.tar.gz'.format( @@ -669,7 +704,6 @@ if 'config' in COMMAND_LINE_TARGETS: Find non-stripped binaries (needs libelf) : {libelf} Optimize using ioctl(FS_IOC_FIEMAP) (needs linux) : {fiemap} Support for SHA512 (needs glib >= 2.31) : {sha512} - Support for SSE4.2 instructions for fast CityHash : {sse42} Support for swapping metadata to disk (needs SQLite3) : {sqlite3} Build manpage from docs/rmlint.1.rst : {sphinx} Support for caching checksums in file's xattr : {xattr} @@ -709,7 +743,6 @@ Type 'scons' to actually compile rmlint now. Good luck. blkid=yesno(env['HAVE_BLKID']), fiemap=yesno(env['HAVE_FIEMAP']), sha512=yesno(env['HAVE_SHA512']), - sse42=yesno(env['HAVE_SSE42']), sqlite3=yesno(env['HAVE_SQLITE3']), bigfiles=yesno(env['HAVE_BIGFILES']), bigofft=yesno(env['HAVE_BIG_OFF_T']), diff --git a/docs/SConscript b/docs/SConscript index 38e754df..eee912f4 100644 --- a/docs/SConscript +++ b/docs/SConscript @@ -33,42 +33,42 @@ def run_sphinx_binary(builder, **kwargs): def gzip_file(target, source, env): + source, dest = source[0].get_abspath(), target[0].get_abspath() try: subprocess.check_call('gzip -c {s} > {t}'.format( - s=source[0].get_abspath(), - t=target[0].get_abspath(), + s=source, t=dest ), shell=True) except Exception as err: print('Warning: could not gzip {s} to {t}: {e}'.format( - s=source[0].get_abspath(), - t=target[0].get_abspath(), - e=err + s=source, t=dest, e=err )) +# Do not use partial(), but a real function. +# Scons uses this to check if the previous action +# differs from the current action. +# Partial actions are always different. 
+def run_sphinx_binary_man(**kwargs):
+    run_sphinx_binary('man', **kwargs)
-sphinx = env.AlwaysBuild(
-    env.Command(
-        '_build/man/rmlint.1', 'rmlint.1.rst',
-        Action(partial(run_sphinx_binary, 'man'), "Building manpage from rst...")
-    )
+
+sphinx = env.Command(
+    '_build/man/rmlint.1', 'rmlint.1.rst',
+    env.Action(run_sphinx_binary_man, "Building manpage from rst...")
 )
-manpage = env.AlwaysBuild(
-    env.Command(
-        'rmlint.1.gz', '_build/man/rmlint.1', gzip_file
-    )
+manpage = env.Command(
+    'rmlint.1.gz', '_build/man/rmlint.1', gzip_file
 )
+env.Default(sphinx)
+env.Default(manpage)
+
 env.Alias('man', env.Depends(manpage, sphinx))
 if 'install' in COMMAND_LINE_TARGETS:
-    manpage[0].build()
-    if os.access(str(manpage[0]), os.R_OK):
-        man_install = env.Install('$PREFIX/share/man/man1', [manpage])
-        env.Alias('install', [man_install])
-    else:
-        print('WARNING: No manpage will be installed!')
+    man_install = env.Install('$PREFIX/share/man/man1', [manpage])
+    target = env.Alias('install', [manpage, man_install])
 if 'uninstall' in COMMAND_LINE_TARGETS:
diff --git a/docs/_static/benchmarks/cpu_usage.svg b/docs/_static/benchmarks/cpu_usage.svg
new file mode 100644
index 00000000..b5679587
--- /dev/null
+++ b/docs/_static/benchmarks/cpu_usage.svg
@@ -0,0 +1,4 @@
+[SVG chart; markup and data points omitted. Title: "CPU usage comparison on ['/usr', '/mnt/music']". Subtitle: "Averaged CPU usage over 2 runs". Groups: Run #1, Run #2, Run #3, Average. Series: baseline.py (1.0), dupd (1.2-dev), fdupes (fdupes 1.51), rdfind (1.3.4), rmlint (2.4.0 rev 7a7a243), rmlint-old (1.0.6), rmlint-paranoid (2.4.0 rev 7a7a243), rmlint-replay (2.4.0 rev 7a7a243), rmlint-v2.2.2 (2.2.0 rev d514de2), rmlint-v2.2.2-paranoid (2.2.0 rev d514de2), rmlint-xxhash (2.4.0 rev 7a7a243).]
\ No newline at end of file
diff --git a/docs/_static/benchmarks/found_items.html b/docs/_static/benchmarks/found_items.html
new file mode 100644
index 00000000..8c3a33dc
--- /dev/null
+++ b/docs/_static/benchmarks/found_items.html
@@ -0,0 +1,31 @@
+
+                         Duplicates  Originals
+rdfind                   0           0
+fdupes                   27.203k     16.115k
+rmlint                   27.203k     16.115k
+rmlint-paranoid          27.203k     16.115k
+rmlint-replay            27.203k     16.115k
+rmlint-v2.2.2            27.203k     16.115k
+rmlint-v2.2.2-paranoid   27.203k     16.115k
+rmlint-xxhash            27.203k     16.115k
+rmlint-old               39.656k     15.133k
+dupd                     43.217k     16.109k
+baseline.py              67.931k     22.848k
\ No newline at end of file
diff --git a/docs/_static/benchmarks/found_items.svg b/docs/_static/benchmarks/found_items.svg
new file mode 100644
index 00000000..34d2565b
--- /dev/null
+++ b/docs/_static/benchmarks/found_items.svg
@@ -0,0 +1,4 @@
+[SVG chart; markup and data points omitted. Title: "Found results comparison on ['/usr', '/mnt/music']". Subtitle: "Averaged Found results over 2 runs". Slice labels: Dupes found, Set of dupes, Dupe variance between runs, Set variance between runs. Series: rdfind, fdupes, rmlint, rmlint-paranoid, rmlint-replay, rmlint-old, dupd, baseline.py.]
\ No newline at end of file
diff --git a/docs/_static/benchmarks/memory.svg b/docs/_static/benchmarks/memory.svg
new file mode 100644
index 00000000..313e88fa
--- /dev/null
+++ b/docs/_static/benchmarks/memory.svg
@@ -0,0 +1,4 @@
+[SVG chart; markup and data points omitted. Title: "Peakmem comparison on ['/usr', '/mnt/music']". Subtitle: "Averaged Peakmem over 2 runs". Groups: Run #1, Run #2, Run #3, Average. Series: baseline.py (1.0), dupd (1.2-dev), fdupes (fdupes 1.51), rdfind (1.3.4), rmlint (2.4.0 rev 7a7a243), rmlint-old (1.0.6), rmlint-paranoid (2.4.0 rev 7a7a243), rmlint-replay (2.4.0 rev 7a7a243), rmlint-v2.2.2 (2.2.0 rev d514de2), rmlint-v2.2.2-paranoid (2.2.0 rev d514de2), rmlint-xxhash (2.4.0 rev 7a7a243).]
\ No newline at end of file
diff --git a/docs/_static/benchmarks/timing.svg b/docs/_static/benchmarks/timing.svg
new file mode 100644
index 00000000..ca7f251d
--- /dev/null
+++ b/docs/_static/benchmarks/timing.svg
@@ -0,0 +1,4 @@
+[SVG chart; markup and data points omitted. Title: "Timing comparison on ['/usr', '/mnt/music']". Subtitle: "Averaged Timing over 2 runs". Groups: Run #1, Run #2, Run #3, Average. Series: baseline.py (1.0), dupd (1.2-dev), fdupes (fdupes 1.51), rdfind (1.3.4), rmlint (2.4.0 rev 7a7a243), rmlint-old (1.0.6), rmlint-paranoid (2.4.0 rev 7a7a243), rmlint-replay (2.4.0 rev 7a7a243), rmlint-v2.2.2 (2.2.0 rev d514de2), rmlint-v2.2.2-paranoid (2.2.0 rev d514de2), rmlint-xxhash (2.4.0 rev 7a7a243).]
\ No newline at end of file
diff --git a/docs/_static/gui_editor.png b/docs/_static/gui_editor.png
new file mode 100644
index 00000000..bb6de835
Binary files /dev/null and b/docs/_static/gui_editor.png differ
diff --git a/docs/_static/gui_locations.png b/docs/_static/gui_locations.png
new file mode 100644
index 
00000000..ab308bd0
Binary files /dev/null and b/docs/_static/gui_locations.png differ
diff --git a/docs/_static/gui_runner.png b/docs/_static/gui_runner.png
new file mode 100644
index 00000000..ea29ee89
Binary files /dev/null and b/docs/_static/gui_runner.png differ
diff --git a/docs/_static/gui_settings.png b/docs/_static/gui_settings.png
new file mode 100644
index 00000000..9801e27a
Binary files /dev/null and b/docs/_static/gui_settings.png differ
diff --git a/docs/_static/logo_boot.png b/docs/_static/logo_boot.png
index f3620628..9479333b 100644
Binary files a/docs/_static/logo_boot.png and b/docs/_static/logo_boot.png differ
diff --git a/docs/_static/shredder.svg b/docs/_static/shredder.svg
new file mode 100644
index 00000000..a3b9b299
--- /dev/null
+++ b/docs/_static/shredder.svg
@@ -0,0 +1,432 @@
+[Figure: shredder.svg. 432 lines of SVG markup; the only surviving text is the metadata string "image/svg+xml"]
diff --git a/docs/benchmarks.rst b/docs/benchmarks.rst
index 6bf44578..a5077bf4 100644
--- a/docs/benchmarks.rst
+++ b/docs/benchmarks.rst
@@ -1,17 +1,115 @@
+.. _benchmark_ref:
+
 Benchmarks
 ==========
 
-We will post some benchmark results here once the respective scripts
-are ready enough. Here, have some early one to see what they look like:
+This page contains the images that our `benchmark suite`_ renders for the current
+release. Inside the benchmark suite, ``rmlint`` is *challenged* against other
+popular and some lesser-known duplicate finders. Apart from that, a very dumb
+duplicate finder called ``baseline.py`` is used to see how slow a program would
+be if it blindly hashed all files it finds. Luckily, none of the programs is
+*that* slow. We'll allow ourselves a few remarks on the plots, although we focus a bit
+on ``rmlint``. You're of course free to interpret them differently or re-run_
+the benchmarks on your own machine. The exact version of each program is given
+in the plots.
+
+It should be noted that it is very hard to compare these tools, since *each*
+tool investigated a slightly different amount of data and produced different
+results on the dataset below. This is partly due to the fact that some tools
+count empty files and hardlinks as duplicates, while ``rmlint`` does not. Partly
+it might also be due to false positives, missed files or, in some tools, `paths that
+contain a ','`_. For ``rmlint`` we verified that no false positives are in the
+set.
+
+.. _`benchmark suite`: https://github.com/sahib/rmlint/tree/develop/tests/test_speed
+.. _re-run: https://github.com/sahib/rmlint/issues/131
+.. _`paths that contain a ','`: https://github.com/jvirkki/dupd/blob/master/src/scan.c#L83
+
+Here are some statistics on the datasets ``/usr`` and ``/mnt/music``. ``/usr``
+is on a ``btrfs`` filesystem that is located on an SSD with many small files,
+while ``/mnt/music`` is located on a rotational disk with ``ext4`` as the
+filesystem. The amount of available memory was *8GB*.
+
+.. code-block:: bash
+
+    $ du -hs /usr
+    7,8G /usr
+    $ du -hs /mnt/music
+    213G /mnt/music
+    $ find /usr -type f ! -empty | wc -l
+    284075
+    $ find /mnt/music -type f ! -empty | wc -l
+    37370
+    $ uname -a
+    Linux werkstatt 3.14.51-1-lts #1 SMP Mon Aug 17 19:21:08 CEST 2015 x86_64 GNU/Linux
+
+.. image:: _static/benchmarks/timing.svg
+    :width: 75%
+    :align: center
+
+*Note:* This plot uses logarithmic scaling for the time.
+
+It should be noted that the first run is the most important run. At least for a
+rather large amount of data (here 211 GB), it is unlikely that the file system
+has all relevant files in its cache. You can see this with the second run of
+``baseline.py`` - when reading all files, the cache won't be useful at such large
+file quantities. The other tools read only a partial set of files and can thus
+benefit from caching on the second run. However, ``rmlint`` (and also ``dupd``)
+supports fast re-running (see ``rmlint-replay``), which makes repeated runs very
+fast. It is interesting to see that ``rmlint-paranoid`` (no hash, incremental
+byte-by-byte comparison) is about as fast as the vanilla ``rmlint``.
+
+.. image:: _static/benchmarks/cpu_usage.svg
+    :width: 75%
+    :align: center
+
+``rmlint`` has the highest CPU footprint here, mostly due to its multithreaded
+nature. Higher CPU usage is not necessarily a bad thing, since it might indicate that the program
+spends more time hashing files instead of switching between hashing and reading.
+``dupd`` seems to be pretty efficient here, especially on re-runs.
+``rmlint-replay`` has high CPU usage here, but keep in mind that it does
+(almost) no IO and only has to repeat previous outputs.
 
-.. image:: _static/benchmark.svg
+.. image:: _static/benchmarks/memory.svg
     :width: 75%
     :align: center
 
+The most memory-efficient program here seems to be ``rdfind``, which uses even
+less than the bare-bones ``baseline.py`` (which does little more than hold a
+hash table). The well-known ``fdupes`` also has a low memory footprint.
+
+Before saying that the paranoid mode of ``rmlint`` is a memory hog, it should be
+noted (since this can't be seen in these plots) that the memory consumption
+scales very well. This is partly because ``rmlint`` saves all paths in a Trie_, making
+it usable for :math:`\geq` `5M files`_. It is also able to limit the amount of
+memory it uses in paranoid mode (``--max-paranoid-mem``). However, due to the large
+number of internal data structures, it has a rather large base memory
+footprint.
+
+``dupd`` uses direct file comparison for groups of two and three files and hash
+functions for the rest. It seems to have a rather high memory footprint in any
+case.
+
+.. _Trie: https://en.wikipedia.org/wiki/Radix_tree
+.. _`5M files`: https://github.com/sahib/rmlint/issues/109
+
+.. raw:: html
+    :file: _static/benchmarks/found_items.html
+
+|
+
+Surprisingly, each tool found a different set of files. As stated above, a direct
+comparison may not be possible here. For most tools except ``rdfind`` and
+``baseline.py``, the number of files is of about the same magnitude. ``fdupes`` seems to
+find about the same number as ``rmlint`` (with small differences).
+The reasons for this are not clear yet, but we're looking into it currently_.
+
+.. _currently: https://github.com/sahib/rmlint/issues/131#issuecomment-143387431
+
 User benchmarks
 ---------------
 
-If you like, you can add your own benchmarks below.
+If you like, you can add your own benchmarks below.
 Maybe include the following information:
 
 - ``rmlint --version``
@@ -30,7 +128,7 @@ If you have longer output you might want to use a pastebin like gist_.