
Wrong disk size in metrics for btrfs backend instances #15265

Open
edlerd opened this issue Mar 26, 2025 · 8 comments

@edlerd
Contributor

edlerd commented Mar 26, 2025

Distribution

snap

Distribution version

6.3

Output of "snap list --all lxd core20 core22 core24 snapd"

lxd     6.3-d704dcb     32918  latest/stable  canonical✓  -

Issue description

For an instance on a btrfs pool with a limited main disk size, the lxd_filesystem_size_bytes value reported by the GET /1.0/metrics endpoint wrongly contains the total storage pool size.

With other storage pool drivers, such as zfs or dir, the size in the metrics result correctly reflects the instance limit, not the total pool size.

I suspect this is an issue with the btrfs integration.

See also canonical/lxd-ui#1155
Might be related to #8468

Steps to reproduce

  1. Create a storage pool with the btrfs driver and a size of 5G.
  2. Create an instance on the pool.
  3. Restrict the instance's main disk size to 2G.
  4. Call the GET /1.0/metrics endpoint.
  5. See that lxd_filesystem_size_bytes for the instance is 5G, when it should be 2G (a minimal sketch of this query is shown below).
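
For step 4, here is a minimal sketch of querying GET /1.0/metrics over the local unix socket and filtering the metric in question. The socket path is an assumption for the snap install; running lxc query /1.0/metrics should show the same output.

```go
// Sketch only: fetch /1.0/metrics over the LXD unix socket and print the
// lxd_filesystem_size_bytes samples. The socket path is assumed for the snap
// install; adjust it for other installations.
package main

import (
	"bufio"
	"context"
	"fmt"
	"net"
	"net/http"
	"strings"
)

func main() {
	socket := "/var/snap/lxd/common/lxd/unix.socket" // assumed snap socket path

	client := &http.Client{
		Transport: &http.Transport{
			// Dial the unix socket instead of TCP; the host part of the URL is ignored.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", socket)
			},
		},
	}

	resp, err := client.Get("http://lxd/1.0/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Print only the metric discussed in this issue.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "lxd_filesystem_size_bytes") {
			fmt.Println(scanner.Text())
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```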
@tomponline
Member

Yes, I was thinking #8468 sounded similar.

@edlerd
Contributor Author

edlerd commented Mar 27, 2025

Yes, I was thinking #8468 sounded similar.

Though I now realize #8468 is about disk usage, while this one is about the disk total. Both issues might share a similar cause, as both relate to disk reporting on btrfs pools.

gabrielmougard self-assigned this Apr 2, 2025
@gabrielmougard
Contributor

I think I have identified the cause of this issue. The problem stems from how different filesystems expose quota information to the kernel's VFS layer. While ZFS integrates quota limits directly into its filesystem statistics (so statfs calls correctly report the quota-limited size), BTRFS reports the entire pool's statistics regardless of any quotas applied to specific subvolumes. Currently, our metrics code relies on the standard filesystem statistics, which works correctly for ZFS but not for BTRFS.

I'm working on a fix that will specifically handle the BTRFS case by directly querying the BTRFS quota information for the container's volume and using that value for the reported filesystem size instead of the raw pool size (we could parse the output of btrfs qgroup show -f --raw <path>. Should I put this logic directly in the getFSStats() function, or should it be part of an exported BTRFS driver method?). @tomponline what do you think?
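
As a rough sketch of that idea (illustration only, not the actual change): the limit could be read from the max_rfer column that btrfs qgroup show prints when the -r flag is passed. Column names and layout vary between btrfs-progs versions, so this assumes the classic output format.

```go
// Illustrative sketch: read the qgroup limit (max_rfer) for a btrfs subvolume
// by shelling out to btrfs-progs. Assumes the classic column layout with a
// "max_rfer" header; newer btrfs-progs versions may format this differently.
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// btrfsQGroupLimit is a hypothetical helper returning the max_rfer limit in
// bytes for the qgroup of the subvolume at path. A value of "none" (no limit
// set) will fail to parse here and would need extra handling in a real fix.
func btrfsQGroupLimit(path string) (uint64, error) {
	out, err := exec.Command("btrfs", "qgroup", "show", "-r", "--raw", "-f", path).CombinedOutput()
	if err != nil {
		return 0, fmt.Errorf("btrfs qgroup show failed: %w (%s)", err, strings.TrimSpace(string(out)))
	}

	col := -1
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.Fields(line)

		// Locate the max_rfer column from the header row first.
		if col == -1 {
			for i, f := range fields {
				if f == "max_rfer" {
					col = i
				}
			}
			continue
		}

		// Skip the "---- ----" separator row, then read the first data row.
		if len(fields) > col && !strings.HasPrefix(fields[0], "-") {
			return strconv.ParseUint(fields[col], 10, 64)
		}
	}

	return 0, fmt.Errorf("no qgroup limit found for %q", path)
}

func main() {
	// Hypothetical subvolume path, for illustration only.
	size, err := btrfsQGroupLimit("/var/snap/lxd/common/lxd/storage-pools/default/containers/c1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("qgroup max_rfer bytes:", size)
}
```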

@tomponline
Member

@gabrielmougard which code path is currently the problem? Link please :)

@gabrielmougard
Contributor

In the getFSStats() function, when filesystem.StatVFS is called (that would be my strong guess):

https://github.com/canonical/lxd/blob/05ca2853d6b307725da6c3479f23bda347c43ca4/lxd/instance/drivers/driver_lxc.go#L8593C4-L8593C44

@gabrielmougard
Contributor

gabrielmougard commented Apr 7, 2025

Obviously, we have the same issue in getFilesystemMetrics() if the instance is a VM:

statfs, err := filesystem.StatVFS(stats.Mountpoint)

@tomponline
Member

@gabrielmougard can you give me an lxc CLI example of the incorrect output and the related instance config, as I'm not following currently? Thanks

@gabrielmougard
Contributor

Sure! Also, I want to stress that this idea of mine is a guess for now, as I still need to log the output of statfs with the reproducer scenario. But the guess seems to be corroborated by https://lore.kernel.org/linux-btrfs/[email protected]/T/

If I understand correctly, statfs() only allows the kernel to report two numbers to describe space usage: total blocks and free blocks. The only space tracking we have at the subvolume level is qgroups, hence the idea of using btrfs qgroup show ... instead of a statfs call when the pool is backed by btrfs.
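
To illustrate that limitation (a sketch, not LXD code; the path is hypothetical): statfs on a subvolume of a quota-limited btrfs pool still returns the pool-wide block counts.

```go
// Sketch of why statfs cannot see subvolume quotas: it only reports block
// counts for the whole filesystem, so on btrfs the pool-wide totals come back
// for every subvolume regardless of any qgroup limit.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical instance subvolume path, for illustration only.
	path := "/var/snap/lxd/common/lxd/storage-pools/default/containers/c1"

	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		panic(err)
	}

	// Total and free space as seen by statfs: these describe the whole
	// filesystem (the btrfs pool), not the subvolume's qgroup limit.
	total := st.Blocks * uint64(st.Bsize)
	free := st.Bfree * uint64(st.Bsize)
	fmt.Printf("statfs total=%d bytes free=%d bytes\n", total, free)
}
```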
