summaryrefslogtreecommitdiff
path: root/fs/btrfs/qgroup.h
AgeCommit message (Collapse)AuthorFilesLines
2024-07-11btrfs: qgroup: preallocate memory before adding a relationDavid Sterba1-1/+10
There's a transaction joined in the qgroup relation add/remove ioctl and any error will lead to abort/error. We could lift the allocation from btrfs_add_qgroup_relation() and move it outside of the transaction context. The relation deletion does not need that. The ownership of the structure is moved to the add relation handler. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: constify pointer parameters where applicableDavid Sterba1-6/+6
We can add const to many parameters, this is for clarity and minor addition to safety. There are some minor effects, in the assembly code and .ko measured on release config. This patch does not cover all possible conversions. Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: qgroup: delete a TODO about using kmem cache to allocate structuresJunchao Sun1-1/+0
Generic slab works fine allocating btrfs_qgroup_extent_record structures. It's not necessary to create a dedicated kmem cache that would be created but unused if quotas were not enabled. Let's delete the TODO line. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Junchao Sun <sunjunchao2870@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: automatically remove the subvolume qgroupQu Wenruo1-0/+1
Currently if we fully clean a subvolume (not only delete its directory, but fully clean all it's related data and root item), the associated qgroup would not be removed. We have "btrfs qgroup clear-stale" to handle such 0 level qgroups. Change the behavior to automatically removie the qgroup of a fully cleaned subvolume when possible: - Full qgroup but still consistent We can and should remove the qgroup. The qgroup numbers should be 0, without any rsv. - Full qgroup but inconsistent Can happen with drop_subtree_threshold feature (skip accounting and mark qgroup inconsistent). We can and should remove the qgroup. Higher level qgroup numbers will be incorrect, but since qgroup is already inconsistent, it should not be a problem. - Squota mode This is the special case, we can only drop the qgroup if its numbers are all 0. This would be handled by can_delete_qgroup(), so we only need to check the return value and ignore the -EBUSY error. Link: https://bugzilla.suse.com/show_bug.cgi?id=1222847 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: qgroup: validate btrfs_qgroup_inherit parameterQu Wenruo1-0/+3
[BUG] Currently btrfs can create subvolume with an invalid qgroup inherit without triggering any error: # mkfs.btrfs -O quota -f $dev # mount $dev $mnt # btrfs subvolume create -i 2/0 $mnt/subv1 # btrfs qgroup show -prce --sync $mnt Qgroupid Referenced Exclusive Path -------- ---------- --------- ---- 0/5 16.00KiB 16.00KiB <toplevel> 0/256 16.00KiB 16.00KiB subv1 [CAUSE] We only do a very basic size check for btrfs_qgroup_inherit structure, but never really verify if the values are correct. Thus in btrfs_qgroup_inherit() function, we have to skip non-existing qgroups, and never return any error. [FIX] Fix the behavior and introduce extra checks: - Introduce early check for btrfs_qgroup_inherit structure Not only the size, but also all the qgroup ids would be verified. And the timing is very early, so we can return error early. This early check is very important for snapshot creation, as snapshot is delayed to transaction commit. - Drop support for btrfs_qgroup_inherit::num_ref_copies and num_excl_copies Those two members are used to specify to copy refr/excl numbers from other qgroups. This would definitely mark qgroup inconsistent, and btrfs-progs has dropped the support for them for a long time. It's time to drop the support for kernel. - Verify the supported btrfs_qgroup_inherit::flags Just in case we want to add extra flags for btrfs_qgroup_inherit. Now above subvolume creation would fail with -ENOENT other than silently ignore the non-existing qgroup. CC: stable@vger.kernel.org # 6.7+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: add forward declarations and headers, part 3David Sterba1-4/+13
Do a cleanup in the rest of the headers: - add forward declarations for types referenced by pointers - add includes when types need them This fixes potential compilation problems if the headers are reordered or the missing includes are not provided indirectly. Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-07btrfs: ensure releasing squota reserve on head refsBoris Burkov1-2/+1
A reservation goes through a 3 step lifetime: - generated during delalloc - released/counted by ordered_extent allocation - freed by running delayed ref That third step depends on must_insert_reserved on the head ref, so the head ref with that field set owns the reservation. Once you prepare to run the head ref, must_insert_reserved is unset, which means that running the ref must free the reservation, whether or not it succeeds, or else the reservation is leaked. That results in either a risk of spurious ENOSPC if the fs stays writeable or a warning on unmount if it is readonly. The existing squota code was aware of these invariants, but missed a few cases. Improve it by adding a helper function to use in the cleanup paths and call it from the existing early returns in running delayed refs. This also simplifies btrfs_record_squota_delta and struct btrfs_quota_delta. This fixes (or at least improves the reliability of) generic/475 with "mkfs -O squota". On my machine, that test failed ~4/10 times without this patch and passed 100/100 times with it. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-07btrfs: fix qgroup_free_reserved_data int overflowBoris Burkov1-2/+2
The reserved data counter and input parameter is a u64, but we inadvertently accumulate it in an int. Overflowing that int results in freeing the wrong amount of data and breaking reserve accounting. Unfortunately, this overflow rot spreads from there, as the qgroup release/free functions rely on returning an int to take advantage of negative values for error codes. Therefore, the full fix is to return the "released" or "freed" amount by a u64 argument and to return 0 or negative error code via the return value. Most of the call sites simply ignore the return value, though some of them handle the error and count the returned bytes. Change all of them accordingly. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: check generation when recording simple quota deltaBoris Burkov1-0/+2
Simple quotas count extents only from the moment the feature is enabled. Therefore, if we do something like: 1. create subvol S 2. write F in S 3. enable quotas 4. remove F 5. write G in S then after 3. and 4. we would expect the simple quota usage of S to be 0 (putting aside some metadata extents that might be written) and after 5., it should be the size of G plus metadata. Therefore, we need to be able to determine whether a particular quota delta we are processing predates simple quota enablement. To do this, store the transaction id when quotas were enabled. In fs_info for immediate use and in the quota status item to make it recoverable on mount. When we see a delta, check if the generation of the extent item is less than that of quota enablement. If so, we should ignore the delta from this extent. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: simple quota auto hierarchy for nested subvolumesBoris Burkov1-3/+3
Consider the following sequence: - enable quotas - create subvol S id 256 at dir outer/ - create a qgroup 1/100 - add 0/256 (S's auto qgroup) to 1/100 - create subvol T id 257 at dir outer/inner/ With full qgroups, there is no relationship between 0/257 and either of 0/256 or 1/100. There is an inherit feature that the creator of inner/ can use to specify it ought to be in 1/100. Simple quotas are targeted at container isolation, where such automatic inheritance for not necessarily trusted/controlled nested subvol creation would be quite helpful. Therefore, add a new default behavior for simple quotas: when you create a nested subvol, automatically inherit as parents any parents of the qgroup of the subvol the new inode is going in. In our example, 257/0 would also be under 1/100, allowing easy control of a total quota over an arbitrary hierarchy of subvolumes. I think this _might_ be a generally useful behavior, so it could be interesting to put it behind a new inheritance flag that simple quotas always use while traditional quotas let the user specify, but this is a minimally intrusive change to start. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: add helper for recording simple quota deltasBoris Burkov1-0/+15
Rather than re-computing shared/exclusive ownership based on backrefs and walking roots for implicit backrefs, simple quotas does an increment when creating an extent and a decrement when deleting it. Add the API for the extent item code to use to track those events. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: add new quota mode for simple quotasBoris Burkov1-3/+14
Add a new quota mode called "simple quotas". It can be enabled by the existing quota enable ioctl via a new command, and sets an incompat bit, as the implementation of simple quotas will make backwards incompatible changes to the disk format of the extent tree. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: introduce quota modeBoris Burkov1-0/+6
In preparation for introducing simple quotas, change from a binary setting for quotas to an enum based mode. Initially, the possible modes are disabled/full. Full quotas is normal btrfs qgroups. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: move functions comments from qgroup.h to qgroup.cDavid Sterba1-76/+0
We keep the comments next to the implementation, there were some left to move. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: use qgroup_iterator_nested to in qgroup_update_refcnt()Qu Wenruo1-0/+18
The ulist @qgroups is utilized to record all involved qgroups from both old and new roots inside btrfs_qgroup_account_extent(). Due to the fact that qgroup_update_refcnt() itself is already utilizing qgroup_iterator, here we have to introduce another list_head, btrfs_qgroup::nested_iterator, allowing nested iteration. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: iterate qgroups without memory allocation for qgroup_reserve()Qu Wenruo1-0/+9
Qgroup heavily relies on ulist to go through all the involved qgroups, but since we're using ulist inside fs_info->qgroup_lock spinlock, this means we're doing a lot of GFP_ATOMIC allocations. This patch reduces the GFP_ATOMIC usage for qgroup_reserve() by eliminating the memory allocation completely. This is done by moving the needed memory to btrfs_qgroup::iterator list_head, so that we can put all the involved qgroup into a on-stack list, thus eliminating the need to allocate memory while holding spinlock. The only cost is the slightly higher memory usage, but considering the reduce GFP_ATOMIC during a hot path, it should still be acceptable. Function qgroup_reserve() is the perfect start point for this conversion. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: sink gfp_t parameter to btrfs_qgroup_trace_extentDavid Sterba1-1/+1
All callers pass GFP_NOFS, we can drop the parameter and use it directly. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: convert QGROUP_* defines to enum bitsDavid Sterba1-3/+6
The defines/enums are used only for tracepoints and are not part of the on-disk format. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup ↵Qu Wenruo1-0/+1
accounting The new flag will make btrfs qgroup skip all its time consuming qgroup accounting. The lifespan is the same as BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN, only get cleared after a new rescan. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCANQu Wenruo1-0/+2
Introduce a new runtime flag, BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN, which will inform qgroup rescan to cancel its work asynchronously. This is to address the window when an operation makes qgroup numbers inconsistent (like qgroup inheriting) while a qgroup rescan is running. In that case, qgroup inconsistent flag will be cleared when qgroup rescan finishes. But we changed the ownership of some extents, which means the rescan is already meaningless, and the qgroup inconsistent flag should not be cleared. With the new flag, each time we set INCONSISTENT flag, we also set this new flag to inform any running qgroup rescan to exit immediately, and leaving the INCONSISTENT flag there. The new runtime flag can only be cleared when a new rescan is started. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: avoid blocking on space revervation when doing nowait dio writesFilipe Manana1-4/+8
When doing a NOWAIT direct IO write, if we can NOCOW then it means we can proceed with the non-blocking, NOWAIT path. However reserving the metadata space and qgroup meta space can often result in blocking - flushing delalloc, wait for ordered extents to complete, trigger transaction commits, etc, going against the semantics of a NOWAIT write. So make the NOWAIT write path to try to reserve all the metadata it needs without resulting in a blocking behaviour - if we get -ENOSPC or -EDQUOT then return -EAGAIN to make the caller fallback to a blocking direct IO write. This is part of a patchset comprised of the following patches: btrfs: avoid blocking on page locks with nowait dio on compressed range btrfs: avoid blocking nowait dio when locking file range btrfs: avoid double nocow check when doing nowait dio writes btrfs: stop allocating a path when checking if cross reference exists btrfs: free path at can_nocow_extent() before checking for checksum items btrfs: release path earlier at can_nocow_extent() btrfs: avoid blocking when allocating context for nowait dio read/write btrfs: avoid blocking on space revervation when doing nowait dio writes The following test was run before and after applying this patchset: $ cat io-uring-nodatacow-test.sh #!/bin/bash DEV=/dev/sdc MNT=/mnt/sdc MOUNT_OPTIONS="-o ssd -o nodatacow" MKFS_OPTIONS="-R free-space-tree -O no-holes" NUM_JOBS=4 FILE_SIZE=8G RUN_TIME=300 cat <<EOF > /tmp/fio-job.ini [io_uring_rw] rw=randrw fsync=0 fallocate=posix group_reporting=1 direct=1 ioengine=io_uring iodepth=64 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 filesize=$FILE_SIZE runtime=$RUN_TIME time_based filename=foobar directory=$MNT numjobs=$NUM_JOBS thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The test was run a 12 cores box with 64G of ram, using a non-debug kernel config (Debian's default config) and a spinning disk. Result before the patchset: READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec Result after the patchset: READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec That's about +7.2% throughput for reads and +6.9% for writes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-22btrfs: fix lock inversion problem when doing qgroup extent tracingFilipe Manana1-1/+1
At btrfs_qgroup_trace_extent_post() we call btrfs_find_all_roots() with a NULL value as the transaction handle argument, which makes that function take the commit_root_sem semaphore, which is necessary when we don't hold a transaction handle or any other mechanism to prevent a transaction commit from wiping out commit roots. However btrfs_qgroup_trace_extent_post() can be called in a context where we are holding a write lock on an extent buffer from a subvolume tree, namely from btrfs_truncate_inode_items(), called either during truncate or unlink operations. In this case we end up with a lock inversion problem because the commit_root_sem is a higher level lock, always supposed to be acquired before locking any extent buffer. Lockdep detects this lock inversion problem since we switched the extent buffer locks from custom locks to semaphores, and when running btrfs/158 from fstests, it reported the following trace: [ 9057.626435] ====================================================== [ 9057.627541] WARNING: possible circular locking dependency detected [ 9057.628334] 5.14.0-rc2-btrfs-next-93 #1 Not tainted [ 9057.628961] ------------------------------------------------------ [ 9057.629867] kworker/u16:4/30781 is trying to acquire lock: [ 9057.630824] ffff8e2590f58760 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 9057.632542] but task is already holding lock: [ 9057.633551] ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs] [ 9057.635255] which lock already depends on the new lock. [ 9057.636292] the existing dependency chain (in reverse order) is: [ 9057.637240] -> #1 (&fs_info->commit_root_sem){++++}-{3:3}: [ 9057.638138] down_read+0x46/0x140 [ 9057.638648] btrfs_find_all_roots+0x41/0x80 [btrfs] [ 9057.639398] btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs] [ 9057.640283] btrfs_add_delayed_data_ref+0x418/0x490 [btrfs] [ 9057.641114] btrfs_free_extent+0x35/0xb0 [btrfs] [ 9057.641819] btrfs_truncate_inode_items+0x424/0xf70 [btrfs] [ 9057.642643] btrfs_evict_inode+0x454/0x4f0 [btrfs] [ 9057.643418] evict+0xcf/0x1d0 [ 9057.643895] do_unlinkat+0x1e9/0x300 [ 9057.644525] do_syscall_64+0x3b/0xc0 [ 9057.645110] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 9057.645835] -> #0 (btrfs-tree-00){++++}-{3:3}: [ 9057.646600] __lock_acquire+0x130e/0x2210 [ 9057.647248] lock_acquire+0xd7/0x310 [ 9057.647773] down_read_nested+0x4b/0x140 [ 9057.648350] __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 9057.649175] btrfs_read_lock_root_node+0x31/0x40 [btrfs] [ 9057.650010] btrfs_search_slot+0x537/0xc00 [btrfs] [ 9057.650849] scrub_print_warning_inode+0x89/0x370 [btrfs] [ 9057.651733] iterate_extent_inodes+0x1e3/0x280 [btrfs] [ 9057.652501] scrub_print_warning+0x15d/0x2f0 [btrfs] [ 9057.653264] scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs] [ 9057.654295] scrub_bio_end_io_worker+0x101/0x2e0 [btrfs] [ 9057.655111] btrfs_work_helper+0xf8/0x400 [btrfs] [ 9057.655831] process_one_work+0x247/0x5a0 [ 9057.656425] worker_thread+0x55/0x3c0 [ 9057.656993] kthread+0x155/0x180 [ 9057.657494] ret_from_fork+0x22/0x30 [ 9057.658030] other info that might help us debug this: [ 9057.659064] Possible unsafe locking scenario: [ 9057.659824] CPU0 CPU1 [ 9057.660402] ---- ---- [ 9057.660988] lock(&fs_info->commit_root_sem); [ 9057.661581] lock(btrfs-tree-00); [ 9057.662348] lock(&fs_info->commit_root_sem); [ 9057.663254] lock(btrfs-tree-00); [ 9057.663690] *** DEADLOCK *** [ 9057.664437] 4 locks held by kworker/u16:4/30781: [ 9057.665023] #0: ffff8e25922a1148 ((wq_completion)btrfs-scrub){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0 [ 9057.666260] #1: ffffabb3451ffe70 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0 [ 9057.667639] #2: ffff8e25922da198 (&ret->mutex){+.+.}-{3:3}, at: scrub_handle_errored_block.isra.0+0x5d2/0x1640 [btrfs] [ 9057.669017] #3: ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs] [ 9057.670408] stack backtrace: [ 9057.670976] CPU: 7 PID: 30781 Comm: kworker/u16:4 Not tainted 5.14.0-rc2-btrfs-next-93 #1 [ 9057.672030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 9057.673492] Workqueue: btrfs-scrub btrfs_work_helper [btrfs] [ 9057.674258] Call Trace: [ 9057.674588] dump_stack_lvl+0x57/0x72 [ 9057.675083] check_noncircular+0xf3/0x110 [ 9057.675611] __lock_acquire+0x130e/0x2210 [ 9057.676132] lock_acquire+0xd7/0x310 [ 9057.676605] ? __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 9057.677313] ? lock_is_held_type+0xe8/0x140 [ 9057.677849] down_read_nested+0x4b/0x140 [ 9057.678349] ? __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 9057.679068] __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 9057.679760] btrfs_read_lock_root_node+0x31/0x40 [btrfs] [ 9057.680458] btrfs_search_slot+0x537/0xc00 [btrfs] [ 9057.681083] ? _raw_spin_unlock+0x29/0x40 [ 9057.681594] ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs] [ 9057.682336] scrub_print_warning_inode+0x89/0x370 [btrfs] [ 9057.683058] ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs] [ 9057.683834] ? scrub_write_block_to_dev_replace+0xb0/0xb0 [btrfs] [ 9057.684632] iterate_extent_inodes+0x1e3/0x280 [btrfs] [ 9057.685316] scrub_print_warning+0x15d/0x2f0 [btrfs] [ 9057.685977] ? ___ratelimit+0xa4/0x110 [ 9057.686460] scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs] [ 9057.687316] scrub_bio_end_io_worker+0x101/0x2e0 [btrfs] [ 9057.688021] btrfs_work_helper+0xf8/0x400 [btrfs] [ 9057.688649] ? lock_is_held_type+0xe8/0x140 [ 9057.689180] process_one_work+0x247/0x5a0 [ 9057.689696] worker_thread+0x55/0x3c0 [ 9057.690175] ? process_one_work+0x5a0/0x5a0 [ 9057.690731] kthread+0x155/0x180 [ 9057.691158] ? set_kthread_struct+0x40/0x40 [ 9057.691697] ret_from_fork+0x22/0x30 Fix this by making btrfs_find_all_roots() never attempt to lock the commit_root_sem when it is called from btrfs_qgroup_trace_extent_post(). We can't just pass a non-NULL transaction handle to btrfs_find_all_roots() from btrfs_qgroup_trace_extent_post(), because that would make backref lookup not use commit roots and acquire read locks on extent buffers, and therefore could deadlock when btrfs_qgroup_trace_extent_post() is called from the btrfs_truncate_inode_items() code path which has acquired a write lock on an extent buffer of the subvolume btree. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-02btrfs: export and rename qgroup_reserve_metaNikolay Borisov1-0/+2
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: qgroup: export qgroups in sysfsQu Wenruo1-0/+11
This patch will add the following sysfs interface: /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/referenced /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/exclusive /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_referenced /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_exclusive /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/limit_flags Which is also available in output of "btrfs qgroup show". /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_data /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_pertrans /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_prealloc The last 3 rsv related members are not visible to users, but can be very useful to debug qgroup limit related bugs. Also, to avoid '/' used in <qgroup_id>, the separator between qgroup level and qgroup id is changed to '_'. The interface is not hidden behind 'debug' as we want this interface to be included into production build and to provide another way to read the qgroup information besides the ioctls. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make btrfs_qgroup_check_reserved_leak take btrfs_inodeNikolay Borisov1-1/+1
vfs_inode is used only for the inode number everything else requires btrfs_inode. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ use btrfs_ino ] Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make btrfs_qgroup_reserve_data take btrfs_inodeNikolay Borisov1-1/+1
There's only a single use of vfs_inode in a tracepoint so let's take btrfs_inode directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make btrfs_qgroup_release_data take btrfs_inodeNikolay Borisov1-1/+1
It just forwards its argument to __btrfs_qgroup_release_data. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make btrfs_qgroup_free_data take btrfs_inodeNikolay Borisov1-3/+3
It passes btrfs_inode to its callee so change the interface. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: qgroup: catch reserved space leaks at unmount timeQu Wenruo1-0/+1
Before this patch, qgroup completely relies on per-inode extent io tree to detect reserved data space leak. However previous bug has already shown how release page before btrfs_finish_ordered_io() could lead to leak, and since it's QGROUP_RESERVED bit cleared without triggering qgroup rsv, it can't be detected by per-inode extent io tree. So this patch adds another (and hopefully the final) safety net to catch qgroup data reserved space leak. At least the new safety net catches all the leaks during development, so it should be pretty useful in the real world. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-19btrfs: destroy qgroup extent records on transaction abortJeff Mahoney1-0/+1
We clean up the delayed references when we abort a transaction but we leave the pending qgroup extent records behind, leaking memory. This patch destroys the extent records when we destroy the delayed refs and makes sure ensure they're gone before releasing the transaction. Fixes: 3368d001ba5d ("btrfs: qgroup: Record possible quota-related extent for qgroup.") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> [ Rebased to latest upstream, remove to_qgroup() helper, use rbtree_postorder_for_each_entry_safe() wrapper ] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: rename btrfs_block_group_cacheDavid Sterba1-1/+1
The type name is misleading, a single entry is named 'cache' while this normally means a collection of objects. Rename that everywhere. Also the identifier was quite long, making function prototypes harder to format. Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to ↵Qu Wenruo1-9/+11
btrfs_qgroup_extent_record [BUG] Btrfs/139 will fail with a high probability if the testing machine (VM) has only 2G RAM. Resulting the final write success while it should fail due to EDQUOT, and the fs will have quota exceeding the limit by 16K. The simplified reproducer will be: (needs a 2G ram VM) $ mkfs.btrfs -f $dev $ mount $dev $mnt $ btrfs subv create $mnt/subv $ btrfs quota enable $mnt $ btrfs quota rescan -w $mnt $ btrfs qgroup limit -e 1G $mnt/subv $ for i in $(seq -w 1 8); do xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null echo "file $i written" > /dev/kmsg done $ sync $ btrfs qgroup show -pcre --raw $mnt The last pwrite will not trigger EDQUOT and final 'qgroup show' will show something like: qgroupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 16384 16384 none none --- --- 0/256 1073758208 1073758208 none 1073741824 --- --- And 1073758208 is larger than > 1073741824. [CAUSE] It's a bug in btrfs qgroup data reserved space management. For quota limit, we must ensure that: reserved (data + metadata) + rfer/excl <= limit Since rfer/excl is only updated at transaction commmit time, reserved space needs to be taken special care. One important part of reserved space is data, and for a new data extent written to disk, we still need to take the reserved space until rfer/excl numbers get updated. Originally when an ordered extent finishes, we migrate the reserved qgroup data space from extent_io tree to delayed ref head of the data extent, expecting delayed ref will only be cleaned up at commit transaction time. However for small RAM machine, due to memory pressure dirty pages can be flushed back to disk without committing a transaction. The related events will be something like: file 1 written btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840 btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096 btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344 btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192 btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344 cleanup_ref_head: num_bytes=54947840 cleanup_ref_head: num_bytes=5636096 cleanup_ref_head: num_bytes=569344 cleanup_ref_head: num_bytes=57344 cleanup_ref_head: num_bytes=8192 ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space file 2 written ... file 8 written cleanup_ref_head: num_bytes=8192 ... btrfs_commit_transaction <<< the only transaction committed during the test When file 2 is written, we have already freed 128M reserved qgroup data space for ino 258. Thus later write won't trigger EDQUOT. This allows us to write more data beyond qgroup limit. In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT. [FIX] By moving reserved qgroup data space from btrfs_delayed_ref_head to btrfs_qgroup_extent_record, we can ensure that reserved qgroup data space won't be freed half way before commit transaction, thus fix the problem. Fixes: f64d5ca86821 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: qgroup: Cleanup old subtree swap codeQu Wenruo1-6/+0
Since it's replaced by new delayed subtree swap code, remove the original code. The cleanup is small since most of its core function is still used by delayed subtree swap trace. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: qgroup: Use delayed subtree rescan for balanceQu Wenruo1-0/+2
Before this patch, qgroup code traces the whole subtree of subvolume and reloc trees unconditionally. This makes qgroup numbers consistent, but it could cause tons of unnecessary extent tracing, which causes a lot of overhead. However for subtree swap of balance, just swap both subtrees because they contain the same contents and tree structure, so qgroup numbers won't change. It's the race window between subtree swap and transaction commit could cause qgroup number change. This patch will delay the qgroup subtree scan until COW happens for the subtree root. So if there is no other operations for the fs, balance won't cause extra qgroup overhead. (best case scenario) Depending on the workload, most of the subtree scan can still be avoided. Only for worst case scenario, it will fall back to old subtree swap overhead. (scan all swapped subtrees) [[Benchmark]] Hardware: VM 4G vRAM, 8 vCPUs, disk is using 'unsafe' cache mode, backing device is SAMSUNG 850 evo SSD. Host has 16G ram. Mkfs parameter: --nodesize 4K (To bump up tree size) Initial subvolume contents: 4G data copied from /usr and /lib. (With enough regular small files) Snapshots: 16 snapshots of the original subvolume. each snapshot has 3 random files modified. balance parameter: -m So the content should be pretty similar to a real world root fs layout. And after file system population, there is no other activity, so it should be the best case scenario. | v4.20-rc1 | w/ patchset | diff ----------------------------------------------------------------------- relocated extents | 22615 | 22457 | -0.1% qgroup dirty extents | 163457 | 121606 | -25.6% time (sys) | 22.884s | 18.842s | -17.6% time (real) | 27.724s | 22.884s | -17.5% Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: qgroup: Introduce per-root swapped blocks infrastructureQu Wenruo1-0/+92
To allow delayed subtree swap rescan, btrfs needs to record per-root information about which tree blocks get swapped. This patch introduces the required infrastructure. The designed workflow will be: 1) Record the subtree root block that gets swapped. During subtree swap: O = Old tree blocks N = New tree blocks reloc tree subvolume tree X Root Root / \ / \ NA OB OA OB / | | \ / | | \ NC ND OE OF OC OD OE OF In this case, NA and OA are going to be swapped, record (NA, OA) into subvolume tree X. 2) After subtree swap. reloc tree subvolume tree X Root Root / \ / \ OA OB NA OB / | | \ / | | \ OC OD OE OF NC ND OE OF 3a) COW happens for OB If we are going to COW tree block OB, we check OB's bytenr against tree X's swapped_blocks structure. If it doesn't fit any, nothing will happen. 3b) COW happens for NA Check NA's bytenr against tree X's swapped_blocks, and get a hit. Then we do subtree scan on both subtrees OA and NA. Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND). Then no matter what we do to subvolume tree X, qgroup numbers will still be correct. Then NA's record gets removed from X's swapped_blocks. 4) Transaction commit Any record in X's swapped_blocks gets removed, since there is no modification to swapped subtrees, no need to trigger heavy qgroup subtree rescan for them. This will introduce 128 bytes overhead for each btrfs_root even qgroup is not enabled. This is to reduce memory allocations and potential failures. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Fix typos in comments and stringsAndrea Gelmini1-2/+2
The typos accumulate over time so once in a while time they get fixed in a large patch. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: drop extra enum initialization where using defaultsDavid Sterba1-1/+1
The first auto-assigned value to enum is 0, we can use that and not initialize all members where the auto-increment does the same. This is used for values that are not part of on-disk format. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabledQu Wenruo1-0/+2
Some qgroup trace events like btrfs_qgroup_release_data() and btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is not enabled. This is caused by the lack of qgroup status check before calling some qgroup functions. Thankfully the functions can handle quota disabled case well and just do nothing for qgroup disabled case. This patch will do earlier check before triggering related trace events. And for enabled <-> disabled race case: 1) For enabled->disabled case Disable will wipe out all qgroups data including reservation and excl/rfer. Even if we leak some reservation or numbers, it will still be cleared, so nothing will go wrong. 2) For disabled -> enabled case Current btrfs_qgroup_release_data() will use extent_io tree to ensure we won't underflow reservation. And for delayed_ref we use head->qgroup_reserved to record the reserved space, so in that case head->qgroup_reserved should be 0 and we won't underflow. CC: stable@vger.kernel.org # 4.14+ Reported-by: Chris Murphy <lists@colorremedies.com> Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: qgroup: Only trace data extents in leaves if we're relocating data ↵Qu Wenruo1-0/+1
block group For qgroup_trace_extent_swap(), if we find one leaf that needs to be traced, we will also iterate all file extents and trace them. This is OK if we're relocating data block groups, but if we're relocating metadata block groups, balance code itself has ensured that both subtree of file tree and reloc tree contain the same contents. That's to say, if we're relocating metadata block groups, all file extents in reloc and file tree should match, thus no need to trace them. This should reduce the total number of dirty extents processed in metadata block group balance. [[Benchmark]] (with all previous enhancement) Hardware: VM 4G vRAM, 8 vCPUs, disk is using 'unsafe' cache mode, backing device is SAMSUNG 850 evo SSD. Host has 16G ram. Mkfs parameter: --nodesize 4K (To bump up tree size) Initial subvolume contents: 4G data copied from /usr and /lib. (With enough regular small files) Snapshots: 16 snapshots of the original subvolume. each snapshot has 3 random files modified. balance parameter: -m So the content should be pretty similar to a real world root fs layout. | v4.19-rc1 | w/ patchset | diff (*) --------------------------------------------------------------- relocated extents | 22929 | 22851 | -0.3% qgroup dirty extents | 227757 | 140886 | -38.1% time (sys) | 65.253s | 37.464s | -42.6% time (real) | 74.032s | 44.722s | -39.6% Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: qgroup: Use generation-aware subtree swap to mark dirty extentsQu Wenruo1-0/+5
Before this patch, with quota enabled during balance, we need to mark the whole subtree dirty for quota. E.g. OO = Old tree blocks (from file tree) NN = New tree blocks (from reloc tree) File tree (src) Reloc tree (dst) OO (a) NN (a) / \ / \ (b) OO OO (c) (b) NN NN (c) / \ / \ / \ / \ OO OO OO OO (d) OO OO OO NN (d) For old balance + quota case, quota will mark the whole src and dst tree dirty, including all the 3 old tree blocks in reloc tree. It's doable for small file tree or new tree blocks are all located at lower level. But for large file tree or new tree blocks are all located at higher level, this will lead to mark the whole tree dirty, and be unbelievably slow. This patch will change how we handle such balance with quota enabled case. Now we will search from (b) and (c) for any new tree blocks whose generation is equal to @last_snapshot, and only mark them dirty. In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d). (NN(a) will be traced when COW happens for nodeptr modification). And also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW happens for nodeptr modification.) For above case, we could skip 3 tree blocks, but for larger tree, we can skip tons of unmodified tree blocks, and hugely speed up balance. This patch will introduce a new function, btrfs_qgroup_trace_subtree_swap(), which will do the following main work: 1) Read out real root eb And setup basic dst_path for later calls 2) Call qgroup_trace_new_subtree_blocks() To trace all new tree blocks in reloc tree and their counter parts in the file tree. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_inheritLu Fengqi1-3/+2
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_run_qgroupsLu Fengqi1-2/+1
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_account_extentLu Fengqi1-5/+3
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop root parameter from btrfs_qgroup_trace_subtreeLu Fengqi1-1/+0
The fs_info can be fetched from the transaction handle directly. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_leaf_itemsLu Fengqi1-1/+0
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_extentLu Fengqi1-3/+2
It can be fetched from the transaction handle. In addition, remove the WARN_ON(trans == NULL) because it's not possible to hit this condition. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_limit_qgroupLu Fengqi1-2/+1
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_remove_qgroupLu Fengqi1-2/+1
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_create_qgroupLu Fengqi1-2/+1
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: qgroup: Drop fs_info parameter from btrfs_del_qgroup_relationLu Fengqi1-2/+2
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>