-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DocDB] Block cache key generation algorithm may return already existing key #20852
Closed
1 task done
Labels
2.20 Backport Required
2024.1 Backport Required
2024.2 Backport Required
area/docdb
YugabyteDB core features
kind/bug
This issue is a bug
priority/high
High Priority
Comments
iSignal
added a commit
that referenced
this issue
Mar 25, 2024
Summary: See #20852 for more details Test Plan: Create an AWS univ, verify etc/fstab ``` yugabyte@ip-10-9-110-14 ~]$ cat /etc/fstab UUID=98eafea5-808c-4734-8e73-5d70e7f08d8d / xfs defaults 0 0 UUID=3c443eec-afbf-4f13-8abd-27fea64ec0d8 /boot xfs defaults 0 0 UUID=96C8-6DB1 /boot/efi vfat defaults,uid=0,gid=0,umask=077,shortname=winnt 0 2 #/mnt/d0 UUID=0864d05b-15d5-4fc0-a953-16f7ca2a0da6 /mnt/d0 xfs defaults,ikeep,noatime,nofail,allocsize=4m 0 2 #/mnt/d1 UUID=628768ad-ec85-43a6-be4c-71e5c9fe8fef /mnt/d1 xfs defaults,ikeep,noatime,nofail,allocsize=4m 0 2 ``` Reviewers: bogdan, timur, yshchetinin Reviewed By: yshchetinin Subscribers: yugaware Differential Revision: https://phorge.dev.yugabyte.com/D33358
iSignal
added a commit
that referenced
this issue
Mar 29, 2024
Summary: See #20852 for more details Test Plan: Create an AWS univ, verify etc/fstab ``` yugabyte@ip-10-9-110-14 ~]$ cat /etc/fstab UUID=98eafea5-808c-4734-8e73-5d70e7f08d8d / xfs defaults 0 0 UUID=3c443eec-afbf-4f13-8abd-27fea64ec0d8 /boot xfs defaults 0 0 UUID=96C8-6DB1 /boot/efi vfat defaults,uid=0,gid=0,umask=077,shortname=winnt 0 2 UUID=0864d05b-15d5-4fc0-a953-16f7ca2a0da6 /mnt/d0 xfs defaults,ikeep,noatime,nofail,allocsize=4m 0 2 UUID=628768ad-ec85-43a6-be4c-71e5c9fe8fef /mnt/d1 xfs defaults,ikeep,noatime,nofail,allocsize=4m 0 2 ``` Original diff: https://phorge.dev.yugabyte.com/D33358 Original commit: c7333c4 Reviewers: bogdan, timur, yshchetinin Reviewed By: yshchetinin Subscribers: yugaware Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D33501
iSignal
added a commit
that referenced
this issue
Apr 5, 2024
…ions Summary: See #20852 for more details Original diff: https://phorge.dev.yugabyte.com/D33358 Test Plan: Create an AWS univ, verify etc/fstab ``` yugabyte@ip-10-9-110-14 ~]$ cat /etc/fstab UUID=98eafea5-808c-4734-8e73-5d70e7f08d8d / xfs defaults 0 0 UUID=3c443eec-afbf-4f13-8abd-27fea64ec0d8 /boot xfs defaults 0 0 UUID=96C8-6DB1 /boot/efi vfat defaults,uid=0,gid=0,umask=077,shortname=winnt 0 2 UUID=0864d05b-15d5-4fc0-a953-16f7ca2a0da6 /mnt/d0 xfs defaults,ikeep,noatime,nofail,allocsize=4m 0 2 UUID=628768ad-ec85-43a6-be4c-71e5c9fe8fef /mnt/d1 xfs defaults,ikeep,noatime,nofail,allocsize=4m 0 2 ``` Reviewers: bogdan, timur, yshchetinin Reviewed By: yshchetinin Subscribers: yugaware Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D33853
huapengy
added a commit
that referenced
this issue
Nov 26, 2024
Summary: Problem: In RocksDB, the file id used for block cache key prefix is generated using file inode information (inode device ID, inode number, and inode generation). However, file systems like XFS and ext4 reuse inodes(inode number, and inode generation), which means the block cache key for new files may point to stale entries from deleted files. Fix: The mtime and the checksum of the meta-block are added to the prefix key. The mtime (4 bytes) is finalized once the data is flushed and remains unchanged even if a hard link is created. The checksum (4 bytes) is calculated based on the meta-block content of the SST file. With the combination of inode, mtime and checksum, a conflict will be extremely unlikely. Jira: DB-9839 Test Plan: 1. existing rocksdb unit test. 2. Added a new unit test: ybd tsan --cxx-test table_test --gtest_filter TableTest.BlockCacheWithHardlink --clang17 Reviewers: timur, arybochkin, rthallam Reviewed By: timur Subscribers: ybase Differential Revision: https://phorge.dev.yugabyte.com/D39729
Closed
1 task
huapengy
added a commit
that referenced
this issue
Dec 31, 2024
…he key prefix Summary: Original commit: 26e2c1fb2913898087f49b52d278d3f06c3612ee/D39729 Problem: In RocksDB, the file id used for block cache key prefix is generated using file inode information (inode device ID, inode number, and inode generation). However, file systems like XFS and ext4 reuse inodes(inode number, and inode generation), which means the block cache key for new files may point to stale entries from deleted files. Fix: The mtime and the checksum of the meta-block are added to the prefix key. The mtime (4 bytes) is finalized once the data is flushed and remains unchanged even if a hard link is created. The checksum (4 bytes) is calculated based on the meta-block content of the SST file. With the combination of inode, mtime and checksum, a conflict will be extremely unlikely. Jira: DB-9839 Test Plan: 1. existing rocksdb unit test. 2. Added a new unit test: ybd tsan --cxx-test table_test --gtest_filter TableTest.BlockCacheWithHardlink --clang17 Reviewers: timur, arybochkin, rthallam Reviewed By: timur Subscribers: ybase Differential Revision: https://phorge.dev.yugabyte.com/D40812
Reactivating for 2024.1 and 2.20 backports. |
huapengy
added a commit
that referenced
this issue
Feb 20, 2025
…he key prefix Summary: Original commit: 26e2c1f/D39729 Problem: In RocksDB, the file id used for block cache key prefix is generated using file inode information (inode device ID, inode number, and inode generation). However, file systems like XFS and ext4 reuse inodes(inode number, and inode generation), which means the block cache key for new files may point to stale entries from deleted files. Fix: The mtime and the checksum of the meta-block are added to the prefix key. The mtime (4 bytes) is finalized once the data is flushed and remains unchanged even if a hard link is created. The checksum (4 bytes) is calculated based on the meta-block content of the SST file. With the combination of inode, mtime and checksum, a conflict will be extremely unlikely. Jira: DB-9839 Test Plan: 1. existing rocksdb unit test. 2. Added a new unit test: ybd tsan --cxx-test table_test --gtest_filter TableTest.BlockCacheWithHardlink --clang17 Reviewers: timur, arybochkin, rthallam Reviewed By: rthallam Subscribers: ybase Differential Revision: https://phorge.dev.yugabyte.com/D41980
huapengy
added a commit
that referenced
this issue
Feb 20, 2025
… key prefix Summary: Original commit: 26e2c1f/D39729 Problem: In RocksDB, the file id used for block cache key prefix is generated using file inode information (inode device ID, inode number, and inode generation). However, file systems like XFS and ext4 reuse inodes(inode number, and inode generation), which means the block cache key for new files may point to stale entries from deleted files. Fix: The mtime and the checksum of the meta-block are added to the prefix key. The mtime (4 bytes) is finalized once the data is flushed and remains unchanged even if a hard link is created. The checksum (4 bytes) is calculated based on the meta-block content of the SST file. With the combination of inode, mtime and checksum, a conflict will be extremely unlikely. Jira: DB-9839 Test Plan: 1. existing rocksdb unit test. 2. Added a new unit test: ybd tsan --cxx-test table_test --gtest_filter TableTest.BlockCacheWithHardlink --clang17 Reviewers: timur, arybochkin, rthallam Reviewed By: rthallam Subscribers: ybase Differential Revision: https://phorge.dev.yugabyte.com/D41981
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
2.20 Backport Required
2024.1 Backport Required
2024.2 Backport Required
area/docdb
YugabyteDB core features
kind/bug
This issue is a bug
priority/high
High Priority
Jira Link: DB-9839
Description
Current implementation of block cache key prefix generation does not provide enough guarantee of returning a unique key for a SST file. It uses the inode information, which can be reused after the file deletion while some old blocks of that deleted file may still be present in the block cache, and the cache will return an old block instead of a new block, where the block handle has the information about the deleted file. Observed effects of such behaviour are: block checksum error during compaction and flush, attempt to read by the offset past the file end, and further crashes in some cases.
Block cache key prefix is composed of (inode device id, inode number, inode generation). In older filesystems (even older versions of ext4) the inode generation is randomly generated. Then if we recycle inode number a bunch of times, which file systems tend to do, we can end up reading a stale block from the cache. Also XFS might generate random number and put it into inode’s generation number instead of reading from disk, when inode info is not in disk cache.
The similar issues were met in the upstream RocksDB.
Potential fixes:
Also from XFS source code:
The following issues could be related to the conflicting keys problem: #10621, #12681, #13109, #14064
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
The text was updated successfully, but these errors were encountered: