Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] Block cache key generation algorithm may return already existing key #20852

Closed
1 task done
arybochkin opened this issue Jan 30, 2024 · 1 comment
Closed
1 task done

Comments

@arybochkin
Copy link
Contributor

arybochkin commented Jan 30, 2024

Jira Link: DB-9839

Description

Current implementation of block cache key prefix generation does not provide enough guarantee of returning a unique key for a SST file. It uses the inode information, which can be reused after the file deletion while some old blocks of that deleted file may still be present in the block cache, and the cache will return an old block instead of a new block, where the block handle has the information about the deleted file. Observed effects of such behaviour are: block checksum error during compaction and flush, attempt to read by the offset past the file end, and further crashes in some cases.

Block cache key prefix is composed of (inode device id, inode number, inode generation). In older filesystems (even older versions of ext4) the inode generation is randomly generated. Then if we recycle inode number a bunch of times, which file systems tend to do, we can end up reading a stale block from the cache. Also XFS might generate random number and put it into inode’s generation number instead of reading from disk, when inode info is not in disk cache.

The similar issues were met in the upstream RocksDB.

Potential fixes:

Fragmentation and things being “scattered around the disk more” cause significant performance problems for traditional spinning-platter disks. However, on an SSD, those things are no problem at all. Conversely, deleting disk blocks is a significant performance problem on SSDs while being no problem for spinning-disks. So while the patch making it the default to delete empty inodes was a performance improvement for spinning-disks, it actually made performance worse on SSDs. Hence the recommendation to use ikeep on SSDs."

Also from XFS source code:

/*
 * Read the disk inode attributes into the in-core inode structure.
 *
 * For version 5 superblocks, if we are initialising a new inode and we are not
 * utilising the XFS_MOUNT_IKEEP inode cluster mode, we can simple build the new
 * inode core with a random generation number. If we are keeping inodes around,
 * we need to read the inode cluster to get the existing generation number off
 * disk. Further, if we are using version 4 superblocks (i.e. v1/v2 inode
 * format) then log recovery is dependent on the di_flushiter field being
 * initialised from the current on-disk value and hence we must also read the
 * inode off disk.
 */
int
xfs_iread(

The following issues could be related to the conflicting keys problem: #10621, #12681, #13109, #14064

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@arybochkin arybochkin added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Jan 30, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue priority/high High Priority and removed priority/medium Medium priority issue labels Jan 30, 2024
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Jan 30, 2024
@yugabyte-ci yugabyte-ci added priority/medium Medium priority issue and removed priority/high High Priority labels Feb 7, 2024
@rthallamko3 rthallamko3 added priority/high High Priority and removed priority/medium Medium priority issue labels Mar 19, 2024
iSignal added a commit that referenced this issue Mar 25, 2024
Summary: See #20852 for more details

Test Plan:
Create an AWS univ, verify etc/fstab

```
yugabyte@ip-10-9-110-14 ~]$ cat /etc/fstab
UUID=98eafea5-808c-4734-8e73-5d70e7f08d8d /                       xfs     defaults        0 0
UUID=3c443eec-afbf-4f13-8abd-27fea64ec0d8 /boot                   xfs     defaults        0 0
UUID=96C8-6DB1  /boot/efi               vfat    defaults,uid=0,gid=0,umask=077,shortname=winnt        0 2
#/mnt/d0
UUID=0864d05b-15d5-4fc0-a953-16f7ca2a0da6 /mnt/d0 xfs defaults,ikeep,noatime,nofail,allocsize=4m 0 2
#/mnt/d1
UUID=628768ad-ec85-43a6-be4c-71e5c9fe8fef /mnt/d1 xfs defaults,ikeep,noatime,nofail,allocsize=4m 0 2
```

Reviewers: bogdan, timur, yshchetinin

Reviewed By: yshchetinin

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D33358
iSignal added a commit that referenced this issue Mar 29, 2024
Summary: See #20852 for more details

Test Plan:
Create an AWS univ, verify etc/fstab

```
yugabyte@ip-10-9-110-14 ~]$ cat /etc/fstab
UUID=98eafea5-808c-4734-8e73-5d70e7f08d8d /                       xfs     defaults        0 0
UUID=3c443eec-afbf-4f13-8abd-27fea64ec0d8 /boot                   xfs     defaults        0 0
UUID=96C8-6DB1  /boot/efi               vfat    defaults,uid=0,gid=0,umask=077,shortname=winnt        0 2
UUID=0864d05b-15d5-4fc0-a953-16f7ca2a0da6 /mnt/d0 xfs defaults,ikeep,noatime,nofail,allocsize=4m 0 2
UUID=628768ad-ec85-43a6-be4c-71e5c9fe8fef /mnt/d1 xfs defaults,ikeep,noatime,nofail,allocsize=4m 0 2
```

Original diff: https://phorge.dev.yugabyte.com/D33358
Original commit: c7333c4

Reviewers: bogdan, timur, yshchetinin

Reviewed By: yshchetinin

Subscribers: yugaware

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33501
iSignal added a commit that referenced this issue Apr 5, 2024
…ions

Summary:
See #20852 for more details

Original diff: https://phorge.dev.yugabyte.com/D33358

Test Plan:
Create an AWS univ, verify etc/fstab

```
yugabyte@ip-10-9-110-14 ~]$ cat /etc/fstab
UUID=98eafea5-808c-4734-8e73-5d70e7f08d8d /                       xfs     defaults        0 0
UUID=3c443eec-afbf-4f13-8abd-27fea64ec0d8 /boot                   xfs     defaults        0 0
UUID=96C8-6DB1  /boot/efi               vfat    defaults,uid=0,gid=0,umask=077,shortname=winnt        0 2
UUID=0864d05b-15d5-4fc0-a953-16f7ca2a0da6 /mnt/d0 xfs defaults,ikeep,noatime,nofail,allocsize=4m 0 2
UUID=628768ad-ec85-43a6-be4c-71e5c9fe8fef /mnt/d1 xfs defaults,ikeep,noatime,nofail,allocsize=4m 0 2
```

Reviewers: bogdan, timur, yshchetinin

Reviewed By: yshchetinin

Subscribers: yugaware

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33853
huapengy added a commit that referenced this issue Nov 26, 2024
Summary:
Problem:
In RocksDB, the file id used for block cache key prefix is generated using file inode information (inode device ID, inode number, and inode generation).
However, file systems like XFS and ext4 reuse inodes(inode number, and inode generation), which means the block cache key for new files may point to stale entries from deleted files.

Fix:
The mtime and the checksum of the meta-block are added to the prefix key. The mtime (4 bytes) is finalized once the data is flushed and
remains unchanged even if a hard link is created. The checksum (4 bytes) is calculated based on the meta-block content of the SST file.
With the combination of inode, mtime and checksum, a conflict will be extremely unlikely.
Jira: DB-9839

Test Plan:
1. existing rocksdb unit test.
2. Added a new unit test: ybd tsan --cxx-test table_test --gtest_filter TableTest.BlockCacheWithHardlink --clang17

Reviewers: timur, arybochkin, rthallam

Reviewed By: timur

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D39729
huapengy added a commit that referenced this issue Dec 31, 2024
…he key prefix

Summary:
Original commit: 26e2c1fb2913898087f49b52d278d3f06c3612ee/D39729

Problem:
In RocksDB, the file id used for block cache key prefix is generated using file inode information (inode device ID, inode number, and inode generation).
However, file systems like XFS and ext4 reuse inodes(inode number, and inode generation), which means the block cache key for new files may point to stale entries from deleted files.

Fix:
The mtime and the checksum of the meta-block are added to the prefix key. The mtime (4 bytes) is finalized once the data is flushed and
remains unchanged even if a hard link is created. The checksum (4 bytes) is calculated based on the meta-block content of the SST file.
With the combination of inode, mtime and checksum, a conflict will be extremely unlikely.
Jira: DB-9839

Test Plan:
1. existing rocksdb unit test.
2. Added a new unit test: ybd tsan --cxx-test table_test --gtest_filter TableTest.BlockCacheWithHardlink --clang17

Reviewers: timur, arybochkin, rthallam

Reviewed By: timur

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D40812
@rthallamko3 rthallamko3 reopened this Feb 16, 2025
@rthallamko3
Copy link
Contributor

Reactivating for 2024.1 and 2.20 backports.

huapengy added a commit that referenced this issue Feb 20, 2025
…he key prefix

Summary:
Original commit: 26e2c1f/D39729

Problem:
In RocksDB, the file id used for block cache key prefix is generated using file inode information (inode device ID, inode number, and inode generation).
However, file systems like XFS and ext4 reuse inodes(inode number, and inode generation), which means the block cache key for new files may point to stale entries from deleted files.

Fix:
The mtime and the checksum of the meta-block are added to the prefix key. The mtime (4 bytes) is finalized once the data is flushed and
remains unchanged even if a hard link is created. The checksum (4 bytes) is calculated based on the meta-block content of the SST file.
With the combination of inode, mtime and checksum, a conflict will be extremely unlikely.
Jira: DB-9839

Test Plan:
1. existing rocksdb unit test.
2. Added a new unit test: ybd tsan --cxx-test table_test --gtest_filter TableTest.BlockCacheWithHardlink --clang17

Reviewers: timur, arybochkin, rthallam

Reviewed By: rthallam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D41980
huapengy added a commit that referenced this issue Feb 20, 2025
… key prefix

Summary:
Original commit: 26e2c1f/D39729

Problem:
In RocksDB, the file id used for block cache key prefix is generated using file inode information (inode device ID, inode number, and inode generation).
However, file systems like XFS and ext4 reuse inodes(inode number, and inode generation), which means the block cache key for new files may point to stale entries from deleted files.

Fix:
The mtime and the checksum of the meta-block are added to the prefix key. The mtime (4 bytes) is finalized once the data is flushed and
remains unchanged even if a hard link is created. The checksum (4 bytes) is calculated based on the meta-block content of the SST file.
With the combination of inode, mtime and checksum, a conflict will be extremely unlikely.
Jira: DB-9839

Test Plan:
1. existing rocksdb unit test.
2. Added a new unit test: ybd tsan --cxx-test table_test --gtest_filter TableTest.BlockCacheWithHardlink --clang17

Reviewers: timur, arybochkin, rthallam

Reviewed By: rthallam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D41981
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants