backup a big amount of data #216
Software:
Hardware:
I tested with 5000 directories that all looked like this one:
Aside from the empty sparse file, there were NO duplicate files (I modified the data with a counter value, so there are no [or at least not many] duplicate chunks in them). Here is the script I used:
I decided to exclude the text files because they slowed down the backup significantly and I didn't want to wait that long. Here is the result of the backup script run:
I interrupted the backup once (Ctrl-C) because it appeared slow. Then I removed the os.fsync from the source and it got faster. That is why this checkpoint exists. After the backup had completed, I tried to remove the checkpoint (not needed any more):
Whoops! I ran some checks, then tried again with the USB3 disk directly connected to the host system (no hub): checksum issues GONE! So the USB3 hub seems to have caused these troubles. Happy end:
Some data (this was after successfully deleting the checkpoint):
I didn't measure runtime or memory needs (due to the USB connection, I assumed it wouldn't be super fast), but looking at index file sizes, I'd estimate it used at least 2GB of memory (chunks + files + index.NNNNNN size), likely a bit more.
Here is the script I used to generate big amounts of data:
BTW, I don't have the 15TB of free space any more, so I can't do more big-volume tests. 3rd-party help / confirmation is appreciated.
So I've got 8+ TB of real data, and if it would be helpful I can try to use borg for my daily backups (I've got redundancy for older backups). History: I've been using attic for small amounts of personal data, but it proved unsuitable for large backups. I previously (3 or 4 months ago) tried to back up "everything" with attic and hit two major issues: 1) it would routinely crash once it was using enough memory to start swapping in earnest (but well before it exhausted available swap), and 2) I encountered the corruption issue in jborg/attic#264 (after spending several days running the first backup in stages). I saw the bug (and the lack of response) and didn't bother trying a second time. It looks like the chunking options you've added will let me reduce the memory pressure to a manageable level, which will make it less of a pain to try again. So I'm game to take a stab at running a full backup with borg and using it for my dailies. Let me know if you still want testing on this point; this was a huge thorn in my side with attic, so I'm excited it's getting some love over here.
@alraban yes, more tests are welcome! borg has somewhat better error messages and shows tracebacks, so if you run into an issue like jborg/attic#264 there will likely be more information we can use to analyze it. For 8+TB, try a chunk size of ~2MB (--chunker-params 19,23,21,4095); that will produce up to 32x fewer chunks than the default settings. Also, giving a higher-than-default value for --checkpoint-interval might be useful. Please use the latest release code.
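To make that concrete, here is a minimal sketch of what such an invocation could look like; the repository location, archive name, source path, and checkpoint interval value are placeholders, not taken from this thread:

```
# Illustrative only: repo URL, archive name and data path are made up.
# 19,23,21,4095 = min chunk 2^19, max 2^23, ~2 MiB target size (2^21 mask), 4095-byte window.
borg create \
    --chunker-params 19,23,21,4095 \
    --checkpoint-interval 1800 \
    --stats \
    user@backuphost:/path/to/repo::full-backup-1 \
    /data
```

Fewer (and therefore larger) chunks shrink the chunks/files caches and the repository index, which is where most of borg's memory goes on multi-TB backups.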
@alraban let's dig deeper into https://github.com/borgbackup/borg/blob/master/borg/_chunker.c. What do you think?
Just a status update: I started a test run with borg 0.27 last Thursday. The first backup finished yesterday and I'm running an integrity check (with the --repository-only flag), which has been going for about eight hours and I'd estimate is about half-way done based on disk access patterns. Once I've finished the integrity check, I'll extract some files, check them, and then run a second backup, followed by another check and test extraction. At that point I'll post detailed results (system specs, exact command flags used, memory usage, bottlenecks, whatever else would be useful). At the current rate of speed it seems likely that will be sometime later this week. Preliminary good news: using your chunker settings, borg made it all the way through a 9.65TB (8.6TiB) backup without crashing or throwing errors, which is already a good improvement over my previously observed behavior with attic 16. We'll see soon whether the corruption issue has also been sorted (fingers crossed).
So I finished up, and there appears to be no corruption issue with a large data set. I'll continue doing testing (through regular backups), but I'm fairly comfortable that things are working. You requested info about the testing environment and methodology above, and I'll provide it here.
Software on both the Source and the Destination machines:
Hardware:
After init, I ran:
A few days later I got this result:
I extracted a few files, and even browsed the repo as a fuse mount. Everything worked as expected (albeit slightly slower than a normal borg fuse mount). I then removed a bunch of files, added some large new ones, ran another backup, and was pleasantly surprised to see the following less than an hour later:
I then extracted a directory and browsed the new archive as a fuse mount, and everything worked as expected. I'm running another check now.

Performance notes:

Memory performance: The maximum used on the destination machine was about 2.2GB above the machine's idle memory usage. The machine is a headless, backup-only machine, so it runs almost nothing in the background. Of that 2.2GB, only about 0.4GB was ever registered to the borg process; the rest of the memory appeared to be tied up in slab cache (which steadily grew as the backup went along and then vanished when it ended). Interestingly, the follow-up backup didn't use anywhere near as much memory (only about 0.5GB total). Excellent memory performance for such a large backup, and it never got close to swapping. I'm guessing I could probably tune the chunk parameters to make fuller use of available memory, but unless that would result in a speed gain I'm probably not interested.

Compression and deduplication performance: At first glance the compression performance looks bad, but 99% of the files being backed up are digital media and so are already pre-compressed to a high degree. I only enabled compression to make the test more realistic, and because quick tests didn't show any speed advantage to not compressing in my test case (I think because compression takes place on the Source machine, which in my case has CPU cycles to burn). Deduplication performance was also quite good. I removed a few hundred gigabytes of files, added almost exactly 89GB of new files, and moved around/renamed some old ones. So the deduplicator did a pretty great job, only picking up about 2GB worth of "static" on a 9.4TB dataset in a little less than an hour!

Time performance: The bottleneck in my case was the CPU on the Destination machine. The two machines are connected by a Gigabit link and it never got more than a third saturated, averaging about 35MB/sec. Similarly, disk throughput utilization never got above 60%. One CPU core on the destination machine remained pegged at 100% for the entire backups and checks, but the other core never moved (I'm guessing the process is single-threaded other than for compression).

Let me know if any more information or testing would be useful. I'll keep running regular backups at intervals with release versions and report back if anything untoward happens.
@alraban thanks a lot for the extensive testing and for reporting your results. A few comments:

Compression: I guess one can't do much wrong with enabling lz4 in almost all situations (except maybe when having a crappy CPU and potentially great I/O throughput at the same time). lz4 has great throughput.

100% CPU on the destination machine: that's strange, as it is just doing ssh and borg serve, storing stuff into the repo and updating the repo index. ssh and borg can even use 1 core each. Did you see which process was eating 100%, was it borg? Do you have disk or fs encryption (dmcrypt, ext4 encryption, ...) enabled on the repo device? I'd rather have expected the throughput limit to come from borg currently being single-threaded (waiting for read(), fsync(), not overlapping I/O with computations, no parallel computations).
No encryption, but you're right that I may have misread the evidence. Looking closer at my munin graphs, the I/O wait is a pretty significant part of the picture. No single process was eating 100% CPU; the total system CPU usage stayed at almost exactly ~100% (out of 200%). According to munin, over the course of the backup the total CPU "pie" was on average 11% system, 33% user, 44% I/O wait, 9% misc, and 102% idle. So I may have been led astray by the ~100% CPU usage, which looked suspiciously like one core working by itself, when in reality the story is a little different. If I can provide any other data, let me know.
ok, thanks for the clarification. so I don't need to hunt for cpu eaters in borg serve. :)
I am closing this one. At least 2 multi-TB tests were done and nothing special (that is borgbackup-related) was found. More multi-TB tests are appreciated; either append them to this ticket or send them to the mailing list.
Following setup:
Plus
So far no issues encountered! :-)
Just wanted to check back in; I've now been using borg for daily backups of ~10TB of real data for about 9 months (since my post above from last October). In that time I've had to do a few partial restores, and I do daily pruning. The repo as a whole is just under 12TB total at this point and contains 12 archives at any given time. All the restores have gone perfectly (albeit very slowly, typically about 1/10th the speed of the initial backup). I also did some checksumming for comparison on a few occasions and everything checked out perfectly.

Performance for the daily backups is quite good considering the volume of data. As noted upthread, the initial backup took a few days, but the dailies only take about an hour on average (less if nothing much has changed). The data being backed up contains some smaller borg repositories holding hourly backups from my workstations, and there seem to be no issues with repos inside of other repos (all "borgception" restores were successful). I've been using lz4 compression and haven't been using encryption.

I'm happy to provide any additional performance or other information if that would be helpful, but from where I sit it's mostly a good-news story: no data issues, successful restores, flexible incremental backups with good deduplication, and reasonable backup speed. The only really serious scalability issue is restore speed: at least for the limited partial restores I've done, it is really quite slow. I was seeing between 10GB and 20GB per hour. If that's representative, I expect a full restore of my dataset would take between three and six weeks! That wouldn't fly in production, but for my uses it could potentially be tolerated (especially if the alternative is total data loss). For insurance, I still take a conventional rsync backup at intervals so I can fall back on that in a total catastrophe.

But any backup system that (eventually) returns your data intact is a successful one in my book; and being able to reach back and grab individual files or directories as they existed months ago is fantastic. Thanks for all your good work on this, borg has already saved my bacon more than once :-)
Thanks for your testing ^W usage report! :) 10-20 GB/hour = a couple MB/s -- how did you restore, FUSE or borg extract?
I restored via FUSE because I needed to browse a bit and use some scripts to get the exact files I needed. Do I understand you to be saying that extract is notably faster? In any case it sounds like FUSE performance will improve dramatically in the near future; a 60-fold increase in speed would be quite welcome and would effectively make the restore time more or less symmetrical to the backup time.
Yes. This affects not only things like shaXXXsum, but also e.g. KDE/Dolphin, which also does 32k reads, with a 60x slow-down in 1.0. If you have large files you wish to extract over FUSE, then using dd with a large block size should help. Extraction via the "fixed" or worked-around FUSE (1.1+, not yet released) or borg extract should be as fast as create or faster (since decompression is normally faster than compression, extraction doesn't need to do any chunking, and reading is typically a bit faster than writing from disks).
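As a rough illustration of the dd workaround (repository, archive, mountpoint and file names below are made up, not taken from this thread):

```
# Hypothetical paths; adjust repo, archive and file names to your setup.
borg mount /path/to/repo::big-archive /mnt/borg

# Large sequential reads instead of the 32k reads most tools issue:
dd if=/mnt/borg/data/huge-file.img of=/restore/huge-file.img bs=16M

# Unmount the FUSE filesystem when done:
fusermount -u /mnt/borg
```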
From maltefiala in #1422: """ That's all folks,
Hardware backup server:
Borg:
Total time for the initial snapshot was around 3 days. A check for consistency takes around 12 hours. Actually, I'd like to see a performance gain here in the future, because I'm thinking of verifying the whole backup after each addition. I'm now running the backup every night at 3 AM, which takes between half an hour and several hours if we had a massive data influx. I'm currently using the following command line:

I'm kind of wondering why the stderr file is full of the \r printout from borg. I thought only specifying --stats and --list would give me a nice stats table at the end without any other output.

I was able to mount a snapshot with fuse to recover one accidentally deleted file. I was also able to directly restore a test set of 1.5TB of data in roughly 300 minutes. My back-of-the-envelope calculation for a full restore is around 5-6 days... although I hope I'll never get into that situation.

Overall I have to say I'm really happy with how easy everything went; especially the deduplication function and its performance amaze me (we're storing mostly genomic data, i.e. text files and compressed text files).
Anyway, would the developers recommend doing a check after each run with archives this large? I'd really like to be sure that I'm not writing garbage every night :)
The segment checking performed by borg check verifies a CRC32 over the data. I would expect that in a "good and proper" server like yours - with ECC memory, RAID 6, proper RAID controllers and maybe ZFS(?) - this is relatively unlikely to catch problems (in the sense that all the other checksumming should catch them first). It doesn't hurt, though.

What might be interesting here is an incremental check, i.e. only checking data written since the last time borg check ran. This would allow verifying that everything was written correctly, while still allowing a full check from time to time to detect silent data corruption.

You may also find the new --verify-data option interesting; it uses two rounds of HMAC-SHA-256 (for now, when encrypted) on the data, so it detects tampering or corruption with probability ~1.
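For reference, a rough sketch of the two check variants discussed above; the repository path is a placeholder, and --verify-data may not exist in older borg versions (check borg check --help):

```
# Structural check: repository consistency, archive metadata, CRC32 over segment data.
borg check /path/to/repo

# Much slower but stronger: reads and decrypts/decompresses every data chunk and
# verifies its cryptographic hash/MAC against the chunk id.
borg check --verify-data /path/to/repo
```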
-p enables progress output; depending on how you view the file, this may be non-obvious. Thanks for the report! :) Really nice setup you've got there.
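A hedged sketch of a cron-friendly invocation along those lines (repository, archive name and source paths are placeholders, not the command actually used above): leaving out -p/--progress keeps the \r progress lines out of the redirected log, while --list and --stats still produce the per-file list and the final summary table.

```
# Illustrative only: repo, archive name and source paths are made up.
borg create --list --stats --compression lz4 \
    /backup/repo::nightly-$(date +%Y-%m-%d) \
    /data/projects /data/genomes \
    >> /var/log/borg-nightly.log 2>&1
```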
@tjakobi thanks for the feedback, biggest known borg backup yet. \o/ About check performance: I analyzed it and found that its slowness is caused by accessing chunks in index order (which is kind of random) rather than in sequential on-disk order. I have begun working on optimizing that to work in disk order...
We're using an LSI MegaRAID SAS 2208 controller with a battery backup unit, and the server is secured by a UPS. The file system is actually only a simple ext4 system, for some other reasons. The data is fed into the backup server from a BeeGFS file server cluster through InfiniBand, also secured by identical LSI controllers with battery backups and UPS. I'd say it's a pretty stable setup so far.

An incremental check option seems definitely interesting. I would trust my setup enough to assume that data which has already been checked stays healthy. What would happen in case a chunk turns out to be faulty? Is the whole archive unusable, or just the part of the archive contained inside the broken chunk?

Also: thank you very much for your support; I'll try to contribute insights from my side whenever necessary.
It depends(tm). As always, when metadata is corrupted the effects are often more drastic than simple data corruption; however, metadata is usually small compared to data, so it will be hit less frequently (by random issues).

If check detects a broken data chunk, it will be marked as broken, and that part of a file will read as a string of zeroes. Should the same chunk be seen again during a later borg create operation, a subsequent check will notice that and repair the file with the new copy of the chunk.

When a metadata chunk is corrupted, the check routine will notice that as well and cut that part out, continuing with the next file/item whose metadata is uncorrupted. With the default settings, each metadata chunk would usually contain somewhere between ~500 files at most (short paths, no extra metadata, small files) and as little as one file (very large files, lots of xattrs or ACLs, very long paths).
I created a ticket for it: #1657
Regarding check performance -- larger segments should improve performance. You can change this on the fly in the repository config. We might also improve performance a bit by enabling FADV_SEQUENTIAL when reading segments.
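For reference, a sketch of the repository config in question: the repo's config file is a plain INI file inside the repository directory. The values below are illustrative, not recommendations from this thread; defaults vary by borg version, and a changed max_segment_size only applies to newly written segments.

```
# <repository>/config -- a plain INI file; edit only while no borg process
# is using the repository.
[repository]
version = 1
segments_per_dir = 10000
# Larger segment files mean fewer files to open and seek during check/compaction.
# Example value: 64 MiB (the shipped default depends on the borg version).
max_segment_size = 67108864
# Leave the existing id line untouched.
id = 0123456789abcdef...
```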
an error recovery story (hope that fits here): i've recently had errors similar to those in jborg/attic#264 on a remote repository, operational since summer 2015, with semi-regular backups of about 200gb of developer machine data. problems showed up when pruning (unfortunately i lost the error message); then a check gave dozens of lines similar (only the byte string differing) to
contrary to what's described in jborg/attic#264, a check --repair was able to fix the issue. the borg version currently in use on both ends is 1.0.7. the output of the latest backup run (for purposes of judging size):
@chrysn were these backups made with attic or borg? if the latter: which versions?
all created with borg, possibly as early as 0.23 (although no archives from back then have been in the repository for some time; i've probably used most debian-released versions since then).
@chrysn ok, maybe it was the problem that is described at the start of changes.rst, which was fixed in 1.0.4.
@chrysn hmm, how often do you run check? could it be a pre-1.0.4 problem (and you check infrequently) or did it happen after 1.0.4?
i haven't run check in ages, it can easily be a pre-1.0.4 problem. my main point of reporting this was probably less about the issue still happening than about check --repair nowadays being able to fix the issue.
@chrysn ok, so let's hope it was one of the issues fixed in 1.0.4. :)
https://irregularbyte.otherreality.net/tags/Borg/ - a blog post about a bigger-scale borg deployment.
@enkore Perhaps the metadata here (being important and relatively small) is a great candidate for par2 or some other erasure code with a large redundancy level or something!
My company has decided to adopt borg (and borgmatic) as the backup system for a number of Linux and NetBSD servers. Most servers are "happy"; however, one server holding a very large number of smaller files is struggling to complete its first backup. A session, running borg over ssh, runs for hours and then suddenly breaks down. We have enabled debug logging on both ends (via BORG_LOGGING_CONF on the server side) but struggle to understand what makes the session break. Neither borg nor ssh seems to report much. Any suggestion as to how we may debug this situation better?
@ottojwittner please open a new ticket and provide the borg version, the borg output when it "breaks down", and the other information the issue template asks for. Maybe it is an ssh / network issue. In general, you can just restart using the same borg command and it will progress over time, deduplicating against what it already has in the repo.
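A hedged sketch of the kind of diagnostics that could go into such a ticket (repo URL, archive name and paths are placeholders; the flags shown are standard borg 1.x options):

```
# Version on both ends of the ssh connection:
borg --version
ssh backup@backuphost borg --version

# Re-run the failing backup with more verbose client-side logging,
# capturing stderr (where borg logs) to a file:
borg create --debug --show-rc \
    backup@backuphost:/path/to/repo::retry-1 \
    /data 2> borg-retry.log
```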
In the attic issue tracker there are some old, stale and unclosed issues about big backups and/or consistency / corruption events.
It is rather unclear whether they are really a problem of the software or caused by external things (hardware, OS, ...). It is also unclear whether they were already fixed in attic, in msgpack or in borg.
jborg/attic#264
jborg/attic#176
jborg/attic#166
To increase confidence and to be able to close these issues (for borg and/or for attic), it would be nice if somebody with lots of (real) data could regularly use borg (and/or attic) and do multi-terabyte backup tests.
Please make sure that:
Please give feedback below (after you start testing, and when you have completed testing).
Tell us how much data you have, in how many files, your hardware, how long it took, how much RAM it needed, how large your .cache/borg got, and how large .cache/borg/chunks and .../files are.
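A rough sketch of how one could collect the cache-size numbers asked for above (assumes the default cache location, i.e. no BORG_CACHE_DIR override; the cache has one subdirectory per repository id):

```
# Total local cache size:
du -sh ~/.cache/borg

# chunks and files caches, per repository:
du -sh ~/.cache/borg/*/chunks ~/.cache/borg/*/files
```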