SST files overlap at a level and keys are not ordered in a SST file #7405
One known possibility is a block cache key conflict causing an old block in the cache to be returned instead of a new block. That is consistent with seeing an ordered sequence of keys injected into the middle of another ordered sequence of keys. Another known possibility is faulty hardware.

More details on the key conflict possibility: block cache keys are composed of (inode number, inode generation (4 bytes), file offset). In older filesystems (even older versions of ext4) the inode generation is randomly generated. So if we recycle an inode number enough times, which filesystems tend to do, we can end up reading a stale block from the cache. In terms of the probability of this happening: imagine a year had 2^32 days. The birthday paradox math says a class would need ~70K students for a birthday conflict to be 50% likely. Similarly, we need to recycle an inode number roughly ~70K times to see an (inode number, inode generation) conflict with 50% probability.

But why would RocksDB cycle through files so fast (note the stale block must also not have been evicted from the block cache for the problem to occur)? One reason is bugs. We had a bug before that caused infinitely recompacting the same file: #4575. An easy check here is to look at the SST file numbers (e.g., the "6" in "000006.sst"). If they are very large (into the millions), that makes the existence of this bug more plausible.

Also, how can one detect whether the filesystem picks inode generations sequentially or randomly? I don't know the right way, but I have a diagnostic script that cycles through files, printing each one's inode generation number.

Besides the above possibility (it does sound crazy, but I've repro'd it, a bit artificially), the cases we see tend to involve faulty hardware. In recent memory, machine checks have reported CPU / memory problems on the hosts where we knew corruption was first written. The challenge, though, is how to prevent corruption from happening silently and then spreading.
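The birthday-paradox arithmetic above can be checked with a short calculation (a sketch using the standard birthday-bound approximation, not RocksDB code):

```python
import math

def collision_probability(n_draws: int, space: int) -> float:
    """Approximate probability that n_draws uniform samples from
    `space` possibilities contain at least one repeat, using the
    birthday-bound approximation 1 - exp(-n*(n-1) / (2*space))."""
    return 1.0 - math.exp(-n_draws * (n_draws - 1) / (2.0 * space))

# With a 32-bit random inode generation, recycling the same inode
# number on the order of ~70-80K times gives roughly a 50% chance
# that some (inode number, inode generation) pair repeats:
p = collision_probability(77_000, 2**32)
```

With 77K recycles the approximation comes out at roughly 0.5, matching the "~70K times for a 50% conflict" figure quoted above.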
Here are some prevention suggestions:
Thank you very much. It contains lots of details which are very helpful. We are using ext4, and your tool prints random generation numbers. The SST file numbers are currently around half a million. Our comparator handles the internal keys well and returns the same result as for the user keys, so #4575 should not affect us, right? The block cache conflict and faulty hardware are both possible for our storage, though we do not yet have enough proof to confirm which happened. Does the RocksDB team have a plan to avoid the block cache conflict completely in the future? We have added the order check in iteration, which we run frequently, and are evaluating the other options you suggested. Thank you.
@ajkr Thank you for the detailed explanation of the issue. We hit the cache key issue last week and were about to file an issue. Do you know which kernel versions are affected by the random inode generation issue? Thanks.
Also, is it possible to make the cache key take into account the file checksum as well?
Our relatively new "unique with extremely high probability" identifier for SST files is the saved DB session id + file number. The DB session id is currently only available as a 20-byte human-readable string, with about 100 bits of entropy. File numbers have 10-20+ bits of entropy. To generate efficient but effective cache key prefixes from this, I would generate two 64-bit hashes from the DB session id and add (with integer addition) the file number to one of them. Then you get a 128-bit identifier approaching full entropy as your file numbers approach ~1 billion.

There is also a trick in this design (which I considered in the updated backup names): within the same DB session id, distinct files are guaranteed unique by unique file numbers. This doesn't actually reduce the overall probability that two files from different databases will collide, but it does "cluster" the collisions, such that a single collision between files of two DBs means a much higher likelihood of multiple collisions. Why is that good? Keeping the overall probability of files from two DBs colliding the same, it means the probability of any file collision between any two DBs is much lower than it would be if the file number were hashed. (I believe that in most applications, any corruption in a DB is almost or exactly as bad as widespread corruption.)
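The scheme described above can be sketched as follows. This is an illustration, not RocksDB's actual implementation; the use of SHA-256 as the hash and the session-id string are assumptions for the example:

```python
import hashlib
import struct

def cache_key_prefix(db_session_id: str, file_number: int) -> bytes:
    """Derive a 128-bit cache key prefix from a DB session id and a
    file number: two 64-bit hashes of the session id, with the file
    number added (integer addition, not hashed) into one of them."""
    digest = hashlib.sha256(db_session_id.encode()).digest()
    h1, h2 = struct.unpack_from("<QQ", digest)
    # Adding rather than hashing the file number is the "trick":
    # within one session, distinct file numbers can never collide.
    lo = (h1 + file_number) & 0xFFFFFFFFFFFFFFFF
    return struct.pack("<QQ", lo, h2)

# Within one DB session, distinct file numbers give distinct keys:
k1 = cache_key_prefix("SESSION-ABC", 6)
k2 = cache_key_prefix("SESSION-ABC", 7)
assert k1 != k2
```

Because the file number is added rather than hashed, any two files within one session differ in the low 64 bits by exactly their file-number difference, which is what guarantees within-session uniqueness.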
That sounds awesome. With the file number being part of the key, at least we will not get conflicts within the same DB instance. I'm interested in how the DB session id is generated. Any reference? Also, I'm thinking about how to detect whether cache conflicts do happen. For compaction we can do a key order check, but it looks like there's nothing we can do for point-get requests. Thoughts?
It does cost some extra cache key bytes, at least when you want your cache optimized for the case of multiple DB instances accessing the same SST files. (NB: existing cache key requires shared storage to be considered same SST file; my proposal would generalize to copied files.)
https://github.com/facebook/rocksdb/blob/master/db/db_impl/db_impl.cc#L3704
Compare the data block's key range to the upper/lower bounds from the index binary search?
Thanks. So it takes a UUID and then hashes it to generate a 128-bit random number. I searched a bit, and it looks like the UUID file is based on /dev/urandom. Maybe it could directly read 128 bits from /dev/urandom.
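For comparison, the alternative floated above (reading the 128 bits directly from the kernel CSPRNG instead of generating and hashing a UUID) is a one-liner:

```python
import os

# os.urandom reads from the OS CSPRNG (/dev/urandom or equivalent);
# 16 bytes == 128 bits of entropy, no UUID or hashing step needed.
session_id = os.urandom(16)
```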
This way was kind of a matter of "if it ain't broke, don't fix it" and kind of a matter of only relying on what's already available in Env.
Improvements welcome, but e.g. adding more code to Env to do basically the same thing for this one case might not be an improvement ;)
lol, totally agree. My comment was just a random thought.
Correction: it's actually newer versions of ext4 that would be affected: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=232530680290ba94ca37852ab10d9556ea28badf
Hi. Is there work on getting a fix for the cache key issue? Thanks @ajkr @pdillinger @zhichao-cao |
Summary: It's a known issue, which is tracked in facebook#7405, facebook#7470. Disable it for now. Pull Request resolved: facebook#7511 Reviewed By: zhichao-cao Differential Revision: D24145075 Pulled By: jay-zhuang fbshipit-source-id: 1858497972f2baba617867aaeac30d93b8305c80
Summary: Use DB session ids in SST table properties to make cache keys stable across DB re-open and copy / move / restore / etc. These new cache keys are currently only enabled when FileSystem does not provide GetUniqueId. For now, they are typically larger, so slightly less efficient. Relevant to #7405

This change has a minor regression in PersistentCache functionality: metaindex blocks are no longer cached in PersistentCache. Table properties blocks already were not, but ideally should be. I didn't spend effort to fix & test these issues because we don't believe PersistentCache is used much if at all, and expect SecondaryCache to replace it. (Though PRs are welcome.)

FIXME: there is more to be fixed for stable cache keys on external SST files

Pull Request resolved: #8659

Test Plan: new unit test added, which fails when disabling new functionality

Reviewed By: zhichao-cao

Differential Revision: D30297705

Pulled By: pdillinger

fbshipit-source-id: e8539a5c8802a79340405629870f2e3fb3822d3a
Basic support added in #8659, revised for the same release in #8686. They are in RocksDB 6.24.0. The stable cache keys are only enabled if the FileSystem does not provide file unique ids, and the keys are not very compact, but they work well. I have an enhanced version in progress. Also, if the SST file is older than RocksDB 6.24, then it will use a temporary cache key based on the current DB session id and file number rather than the originals for that file. (This is at least stable across table file re-open, but changes across different DB Opens.)
We are upgrading our RocksDB storage from 5.7.2 to 6.4.6. After upgrading to 6.4.6, we noticed some shards returned unexpected keys during iteration. We then dumped the manifest and SST files, and found that the (smallest key, largest key) ranges of some SST files at the bottommost level overlapped, and that some SST files contained multiple sorted ranges that overlapped each other.
For example, one SST file contains 4 segments of sorted keys. Here are the ranges, showing the first 4 bytes of the keys in hex format (the common prefix is omitted for conciseness) and the key counts:
Then we deployed a detector for misordered keys in compaction and iteration, and found 4 shards with misordered keys among 400k shards.
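The kind of order check such a detector performs can be sketched as follows. This is an illustration, not the actual detector; the default bytewise comparator is an assumption, and a real check must use the column family's configured comparator:

```python
from typing import Callable, Iterable, Optional

def bytewise_compare(a: bytes, b: bytes) -> int:
    """Default comparator for the sketch: -1, 0, or 1, bytewise."""
    return (a > b) - (a < b)

def find_misordered(keys: Iterable[bytes],
                    compare: Callable[[bytes, bytes], int] = bytewise_compare
                    ) -> Optional[int]:
    """Scan keys in the order an iterator or compaction yields them.
    Return the index of the first key that is not strictly greater
    than its predecessor, or None if the sequence is ordered."""
    prev = None
    for i, key in enumerate(keys):
        if prev is not None and compare(prev, key) >= 0:
            return i
        prev = key
    return None

assert find_misordered([b"a", b"b", b"c"]) is None
assert find_misordered([b"a", b"c", b"b"]) == 2  # b"b" after b"c"
```

A cache-key conflict that splices a stale block into a scan shows up exactly this way: a second ordered run beginning in the middle of the first, which this check flags at the first out-of-order key.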
We also checked the shards still running v5.7.2 and found a shard with a similar key-misordering issue, so this problem already existed in the old version. But iteration did not report the issue in v5.7.2 the way it does in v6.4.6, which is why we only noticed it after upgrading.
Then we want to ask:
Thanks. We are still collecting more data from our storage; please let us know if you need additional data.