summaryrefslogtreecommitdiff
path: root/fs/btrfs/delayed-ref.c
AgeCommit message (Collapse)AuthorFilesLines
2024-05-07btrfs: replace btrfs_delayed_*_ref with btrfs_*_refJosef Bacik1-6/+4
Now that these two structs are the same, move the btrfs_data_ref and btrfs_tree_ref up and use these in the btrfs_delayed_ref_node. Then remove the btrfs_delayed_*_ref structs. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: rename btrfs_data_ref->ino to ->objectidJosef Bacik1-2/+2
This is how we refer to it in the rest of the extent reference related code, make it consistent. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_nodeJosef Bacik1-56/+26
These two members are shared by both the tree refs and data refs, so move them into btrfs_delayed_ref_node proper. This allows us to greatly simplify the comparison code, as the shared refs always only sort on parent, and the non shared refs always sort first on ref_root, and then only data refs sort on their specific fields. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: rename ->len to ->num_bytes in btrfs_refJosef Bacik1-4/+4
We consistently use ->num_bytes everywhere through the delayed ref code, except in btrfs_ref. Rename btrfs_ref to match all the other code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: unify the btrfs_add_delayed_*_ref helpers into one helperJosef Bacik1-77/+25
Now that these helpers are identical, create a helper function that handles everything properly and strip the individual helpers down to use just the common helper. This cleans up a significant amount of duplicated code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: simplify delayed ref tracepointsJosef Bacik1-12/+2
Now that all of the delayed ref information is in the delayed ref node, drastically simplify the delayed ref tracepoints by simply passing in the btrfs_delayed_ref_node and populating the tracepoints with the values from the structure itself. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: move ref specific initialization into init_delayed_ref_commonJosef Bacik1-14/+11
Now that the btrfs_delayed_ref_node contains a union of the data and metadata specific information we can move the initialization into init_delayed_ref_common and just use the btrfs_ref to initialize the correct fields of the reference. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: initialize btrfs_delayed_ref_head with btrfs_refJosef Bacik1-28/+25
We are calling init_delayed_ref_head with all of the elements from btrfs_ref, clean this up to simply pass in the btrfs_ref and initialize the btrfs_delayed_ref_head with the values from the btrfs_ref directly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: pass btrfs_ref to init_delayed_ref_commonJosef Bacik1-21/+8
We're extracting all of these values from the btrfs_ref we passed in already, just pass the btrfs_ref through to init_delayed_ref_common and get the values directly from the struct. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: move ref_root into btrfs_refJosef Bacik1-16/+13
We have this in both btrfs_tree_ref and btrfs_data_ref, which is just wasting space and making the code more complicated. Move this into btrfs_ref proper and update all the call sites to do the assignment in btrfs_ref. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: do not use a function to initialize btrfs_refJosef Bacik1-10/+0
btrfs_ref currently has ->owning_root, and ->ref_root is shared between the tree ref and data ref, so in order to move that into btrfs_ref proper I would need to add another root parameter to the initialization function. This function has too many arguments, and adding another root will make it easy to make mistakes about which root goes where. Drop the generic ref init function and statically initialize the btrfs_ref in every usage. This makes the code easier to read because we can see what elements we're assigning, and will make the upcoming change moving the ref_root into the btrfs_ref more clear and less error prone than adding a new element to the initialization function. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: embed data_ref and tree_ref in btrfs_delayed_ref_nodeJosef Bacik1-34/+17
We have been embedding btrfs_delayed_ref_node in the btrfs_delayed_data_ref and btrfs_delayed_tree_ref, and then we have two sets of cachep's and a variety of handling that is awkward because of this separation. Instead union these two members inside of btrfs_delayed_ref_node and make that the first class object. This allows us to go down to one cachep for our delayed ref nodes instead of two. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: add a helper to get the delayed ref node from the data/tree refJosef Bacik1-9/+19
We have several different ways we refer to references throughout the code and it's not consistent and there's a bit of duplication. In order to clean this up I want to have one structure we use to define reference information, and one structure we use for the delayed reference information. Start this process by adding a helper to get from the btrfs_delayed_data_ref/btrfs_delayed_tree_ref to the btrfs_delayed_ref_node so that it'll make moving these structures around simpler. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: remove SLAB_MEM_SPREAD flag useChengming Zhou1-8/+4
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1, so it became a dead flag since the commit 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the series[1] went on to mark it obsolete to avoid confusion for users. Here we can just remove all its users, which has no functional change. [1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: use KMEM_CACHE() to create delayed ref cachesKunwu Chan1-16/+8
Use the KMEM_CACHE() macro instead of kmem_cache_create() to simplify the creation of SLAB caches related to delayed refs when the default values are used. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: uninline some static inline helpers from delayed-ref.hDavid Sterba1-0/+65
The helpers are doing an initialization or release work, none of which is performance critical that it would require a static inline, so move them to the .c file. Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-09btrfs: fix qgroup record leaks when using simple quotasFilipe Manana1-2/+2
When using simple quotas we are not supposed to allocate qgroup records when adding delayed references. However we allocate them if either mode of quotas is enabled (the new simple one or the old one), but then we never free them because running the accounting, which frees the records, is only run when using the old quotas (at btrfs_qgroup_account_extents()), resulting in a memory leak of the records allocated when adding delayed references. Fix this by allocating the records only if the old quotas mode is enabled. Also fix btrfs_qgroup_trace_extent_nolock() to return 1 if the old quotas mode is not enabled - meaning the caller has to free the record. Fixes: 182940f4f4db ("btrfs: qgroup: add new quota mode for simple quotas") Reported-by: syzbot+d3ddc6dcc6386dea398b@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000004769106097f9a34@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: stop reserving excessive space for block group item insertionsFilipe Manana1-0/+35
Space for block group item insertions, necessary after allocating a new block group, is reserved in the delayed refs block reserve. Currently we do this by incrementing the transaction handle's delayed_ref_updates counter and then calling btrfs_update_delayed_refs_rsv(), which will increase the size of the delayed refs block reserve by an amount that corresponds to the same amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes(). That is an excessive amount because it corresponds to the amount of space needed to insert one item in a btree (btrfs_calc_insert_metadata_size()) times 2 when the free space tree feature is enabled. All we need is an amount as given by btrfs_calc_insert_metadata_size(), since we only need to insert a block group item in the extent tree (or block group tree if this feature is enabled). By using btrfs_calc_insert_metadata_size() we will need to reserve 2 times less space when using the free space tree, putting less pressure on space reservation. So use helpers to reserve and release space for block group item insertions that use btrfs_calc_insert_metadata_size() for calculation of the space. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: stop reserving excessive space for block group item updatesFilipe Manana1-0/+35
Space for block group item updates, necessary after allocating or deallocating an extent from a block group, is reserved in the delayed refs block reserve. Currently we do this by incrementing the transaction handle's delayed_ref_updates counter and then calling btrfs_update_delayed_refs_rsv(), which will increase the size of the delayed refs block reserve by an amount that corresponds to the same amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes(). That is an excessive amount because it corresponds to the amount of space needed to insert one item in a btree (btrfs_calc_insert_metadata_size()) times 2 when the free space tree feature is enabled. All we need is an amount as given by btrfs_calc_metadata_size(), since we only need to update an existing block group item in the extent tree (or block group tree if this feature is enabled). By using btrfs_calc_metadata_size() we will need to reserve 4 times less space when using the free space tree and 2 times less space when not using it, putting less pressure on space reservation. So use helpers to reserve and release space for block group item updates that use btrfs_calc_metadata_size() for calculation of the space. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: record simple quota deltas in delayed refsBoris Burkov1-0/+1
At the moment that we run delayed refs, we make the final ref-count based decision on creating/removing extent (and metadata) items. Therefore, it is exactly the spot to hook up simple quotas. There are a few important subtleties to the fields we must collect to accurately track simple quotas, particularly when removing an extent. When removing a data extent, the ref could be in any tree (due to reflink, for example) and so we need to recover the owning root id from the owner ref item. When removing a metadata extent, we know the owning root from the owner field in the header when we create the delayed ref, so we can recover it from there. We must also be careful to handle reservations properly to not leaked reserved space. The happy path is freeing the reservation when the simple quota delta runs on a data extent. If that doesn't happen, due to refs canceling out or some error, the ref head already has the must_insert_reserved machinery to handle this, so we piggy back on that and use it to clean up the reserved data. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: track original extent owner in head_refBoris Burkov1-4/+15
Simple quotas requires tracking the original creating root of any given extent. This gets complicated when multiple subvolumes create overlapping/contradictory refs in the same transaction. For example, due to modifying or deleting an extent while also snapshotting it. To resolve this in a general way, take advantage of the fact that we are essentially already tracking this for handling releasing reservations. The head ref coalesces the various refs and uses must_insert_reserved to check if it needs to create an extent/free reservation. Store the ref that set must_insert_reserved as the owning ref on the head ref. Note that this can result in writing an extent for the very first time with an owner different from its only ref, but it will look the same as if you first created it with the original owning ref, then added the other ref, then removed the owning ref. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: rename tree_ref and data_ref owning_rootBoris Burkov1-5/+5
commit 113479d5b8eb ("btrfs: rename root fields in delayed refs structs") changed these from ref_root to owning_root. However, there are many circumstances where that name is not really accurate and the root on the ref struct _is_ the referring root. In general, these are not the owning root, though it does happen in some ref merging cases involving overwrites during snapshots and similar. Simple quotas cares quite a bit about tracking the original owner of an extent through delayed refs, so rename these back to free up the name for the real owning root (which will live on the generic btrfs_ref and the head ref) Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: add new quota mode for simple quotasBoris Burkov1-4/+2
Add a new quota mode called "simple quotas". It can be enabled by the existing quota enable ioctl via a new command, and sets an incompat bit, as the implementation of simple quotas will make backwards incompatible changes to the disk format of the extent tree. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: always reserve space for delayed refs when starting transactionFilipe Manana1-1/+20
When starting a transaction (or joining an existing one with btrfs_start_transaction()), we reserve space for the number of items we want to insert in a btree, but we don't do it for the delayed refs we will generate while using the transaction to modify (COW) extent buffers in a btree or allocate new extent buffers. Basically how it works: 1) When we start a transaction we reserve space for the number of items the caller wants to be inserted/modified/deleted in a btree. This space goes to the transaction block reserve; 2) If the delayed refs block reserve is not full, its size is greater than the amount of its reserved space, and the flush method is BTRFS_RESERVE_FLUSH_ALL, then we attempt to reserve more space for it corresponding to the number of items the caller wants to insert/modify/delete in a btree; 3) The size of the delayed refs block reserve is increased when a task creates delayed refs after COWing an extent buffer, allocating a new one or deleting (freeing) an extent buffer. This happens after the the task started or joined a transaction, whenever it calls btrfs_update_delayed_refs_rsv(); 4) The delayed refs block reserve is then refilled by anyone calling btrfs_delayed_refs_rsv_refill(), either during unlink/truncate operations or when someone else calls btrfs_start_transaction() with a 0 number of items and flush method BTRFS_RESERVE_FLUSH_ALL; 5) As a task COWs or allocates extent buffers, it consumes space from the transaction block reserve. When the task releases its transaction handle (btrfs_end_transaction()) or it attempts to commit the transaction, it releases any remaining space in the transaction block reserve that it did not use, as not all space may have been used (due to pessimistic space calculation) by calling btrfs_block_rsv_release() which will try to add that unused space to the delayed refs block reserve (if its current size is greater than its reserved space). That transferred space may not be enough to completely fulfill the delayed refs block reserve. Plus we have some tasks that will attempt do modify as many leaves as they can before getting -ENOSPC (and then reserving more space and retrying), such as hole punching and extent cloning which call btrfs_replace_file_extents(). Such tasks can generate therefore a high number of delayed refs, for both metadata and data (we can't know in advance how many file extent items we will find in a range and therefore how many delayed refs for dropping references on data extents we will generate); 6) If a transaction starts its commit before the delayed refs block reserve is refilled, for example by the transaction kthread or by someone who called btrfs_join_transaction() before starting the commit, then when running delayed references if we don't have enough reserved space in the delayed refs block reserve, we will consume space from the global block reserve. Now this doesn't make a lot of sense because: 1) We should reserve space for delayed references when starting the transaction, since we have no guarantees the delayed refs block reserve will be refilled; 2) If no refill happens then we will consume from the global block reserve when running delayed refs during the transaction commit; 3) If we have a bunch of tasks calling btrfs_start_transaction() with a number of items greater than zero and at the time the delayed refs reserve is full, then we don't reserve any space at btrfs_start_transaction() for the delayed refs that will be generated by a task, and we can therefore end up using a lot of space from the global reserve when running the delayed refs during a transaction commit; 4) There are also other operations that result in bumping the size of the delayed refs reserve, such as creating and deleting block groups, as well as the need to update a block group item because we allocated or freed an extent from the respective block group; 5) If we have a significant gap between the delayed refs reserve's size and its reserved space, two very bad things may happen: 1) The reserved space of the global reserve may not be enough and we fail the transaction commit with -ENOSPC when running delayed refs; 2) If the available space in the global reserve is enough it may result in nearly exhausting it. If the fs has no more unallocated device space for allocating a new block group and all the available space in existing metadata block groups is not far from the global reserve's size before we started the transaction commit, we may end up in a situation where after the transaction commit we have too little available metadata space, and any future transaction commit will fail with -ENOSPC, because although we were able to reserve space to start the transaction, we were not able to commit it, as running delayed refs generates some more delayed refs (to update the extent tree for example) - this includes not even being able to commit a transaction that was started with the goal of unlinking a file, removing an empty data block group or doing reclaim/balance, so there's no way to release metadata space. In the worst case the next time we mount the filesystem we may also fail with -ENOSPC due to failure to commit a transaction to cleanup orphan inodes. This later case was reported and hit by someone running a SLE (SUSE Linux Enterprise) distribution for example - where the fs had no more unallocated space that could be used to allocate a new metadata block group, and the available metadata space was about 1.5M, not enough to commit a transaction to cleanup an orphan inode (or do relocation of data block groups that were far from being full). So improve on this situation by always reserving space for delayed refs when calling start_transaction(), and if the flush method is BTRFS_RESERVE_FLUSH_ALL, also try to refill the delayed refs block reserve if it's not full. The space reserved for the delayed refs is added to a local block reserve that is part of the transaction handle, and when a task updates the delayed refs block reserve size, after creating a delayed ref, the space is transferred from that local reserve to the global delayed refs reserve (fs_info->delayed_refs_rsv). In case the local reserve does not have enough space, which may happen for tasks that generate a variable and potentially large number of delayed refs (such as the hole punching and extent cloning cases mentioned before), we transfer any available space and then rely on the current behaviour of hoping some other task refills the delayed refs reserve or fallback to the global block reserve. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: stop doing excessive space reservation for csum deletionFilipe Manana1-13/+20
Currently when reserving space for deleting the csum items for a data extent, when adding or updating a delayed ref head, we determine how many leaves of csum items we can have and then pass that number to the helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating space for all tree modifications we need when running delayed references, however the amount of space it computes is excessive for deleting csum items because: 1) It uses btrfs_calc_insert_metadata_size() which is excessive because we only need to delete csum items from the csum tree, we don't need to insert any items, so btrfs_calc_metadata_size() is all we need (as it computes space needed to delete an item); 2) If the free space tree is enabled, it doubles the amount of space, which is pointless for csum deletion since we don't need to touch the free space tree or any other tree other than the csum tree. So improve on this by tracking how many csum deletions we have and using a new helper to calculate space for csum deletions (just a wrapper around btrfs_calc_metadata_size() with a comment). This reduces the amount of space we need to reserve for csum deletions by a factor of 4, and it helps reduce the number of times we have to block space reservations and have the reclaim task enter the space flushing algorithm (flush delayed items, flush delayed refs, etc) in order to satisfy tickets. For example this results in a total time decrease when unlinking (or truncating) files with many extents, as we end up having to block on space metadata reservations less often. Example test: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/test umount $DEV &> /dev/null mkfs.btrfs -f $DEV # Use compression to quickly create files with a lot of extents # (each with a size of 128K). mount -o compress=lzo $DEV $MNT # 100G gives at least 983040 extents with a size of 128K. xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar # Flush all delalloc and clear all metadata from memory. umount $MNT mount -o compress=lzo $DEV $MNT start=$(date +%s%N) rm -f $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "rm took $dur milliseconds" umount $MNT Before this change rm took: 7504 milliseconds After this change rm took: 6574 milliseconds (-12.4%) Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: remove pointless initialization at btrfs_delayed_refs_rsv_release()Filipe Manana1-1/+1
There's no point in initializing to 0 the local variable 'released' as we don't use it before the next assignment to it. So remove the initialization. This may help avoid some warnings with clang tools such as the one reported/fixed by commit 966de47ff0c9 ("btrfs: remove redundant initialization of variables in log_new_ancestors"). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: reserve space for delayed refs on a per ref basisFilipe Manana1-10/+22
Currently when reserving space for delayed refs we do it on a per ref head basis. This is generally enough because most back refs for an extent end up being inlined in the extent item - with the default leaf size of 16K we can have at most 33 inline back refs (this is calculated by the macro BTRFS_MAX_EXTENT_ITEM_SIZE()). The amount of bytes reserved for each ref head is given by btrfs_calc_delayed_ref_bytes(), which basically corresponds to a single path for insertion into the extent tree plus another path for insertion into the free space tree if it's enabled. However if we have reached the limit of inline refs or we have a mix of inline and non-inline refs, then we will need to insert a non-inline ref and update the existing extent item to update the total number of references for the extent. This implies we need reserved space for two insertion paths in the extent tree, but we only reserved for one path. The extent item and the non-inline ref item may be located in different leaves, or even if they are located in the same leaf, after updating the extent item and before inserting the non-inline ref item, the extent buffers in the btree path may have been written (due to memory pressure for e.g.), in which case we need to COW the entire path again. In this case since we have not reserved enough space for the delayed refs block reserve, we will use the global block reserve. If we are in a situation where the fs has no more unallocated space enough to allocate a new metadata block group and available space in the existing metadata block groups is close to the maximum size of the global block reserve (512M), we may end up consuming too much of the free metadata space to the point where we can't commit any future transaction because it will fail, with -ENOSPC, during its commit when trying to allocate an extent for some COW operation (running delayed refs generated by running delayed refs or COWing the root tree's root node at commit_cowonly_roots() for example). Such dramatic scenario can happen if we have many delayed refs that require the insertion of non-inline ref items, due to too many reflinks or snapshots. We also have situations where we use the global block reserve because we could not in advance know that we will need space to update some trees (block group creation for example), so this all adds up to increase the chances of exhausting the global block reserve and making any future transaction commit to fail with -ENOSPC and turn the fs into RO mode, or fail the mount operation in case the mount needs to start and commit a transaction, such as when we have orphans to cleanup for example - such case was reported and hit by someone running a SLE (SUSE Linux Enterprise) distribution for example - where the fs had no more unallocated space that could be used to allocate a new metadata block group, and the available metadata space was about 1.5M, not enough to commit a transaction to cleanup an orphan inode (or do relocation of data block groups that were far from being full). So reserve space for delayed refs by individual refs and not by ref heads, as we may need to COW multiple extent tree paths due to non-inline ref items. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: pass a space_info argument to btrfs_reserve_metadata_bytes()Filipe Manana1-3/+3
We are passing a block reserve argument to btrfs_reserve_metadata_bytes() which is not really used, all we need is to pass the space_info associated to the block reserve, we don't change the block reserve at all. Not only it's pointless to pass the block reserve, it's also confusing as one might think that the reserved bytes will end up being added to the passed block reserve, when that's not the case. The pattern for reserving space and adding it to a block reserve is to first reserve space with btrfs_reserve_metadata_bytes() and if that succeeds, then add the space to a block reserve by calling btrfs_block_rsv_add_bytes(). Also the reverse of btrfs_reserve_metadata_bytes(), which is btrfs_space_info_free_bytes_may_use(), takes a space_info argument and not a block reserve, so one more reason to pass a space_info and not a block reserve to btrfs_reserve_metadata_bytes(). So change btrfs_reserve_metadata_bytes() and its callers to pass a space_info argument instead of a block reserve argument. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: reformat remaining kdoc style commentsDavid Sterba1-2/+1
Function name in the comment does not bring much value to code not exposed as API and we don't stick to the kdoc format anymore. Update formatting of parameter descriptions. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-20btrfs: prevent transaction block reserve underflow when starting transactionFilipe Manana1-8/+1
When starting a transaction, with a non-zero number of items, we reserve metadata space for that number of items and for delayed refs by doing a call to btrfs_block_rsv_add(), with the transaction block reserve passed as the block reserve argument. This reserves metadata space and adds it to the transaction block reserve. Later we migrate the space we reserved for delayed references from the transaction block reserve into the delayed refs block reserve, by calling btrfs_migrate_to_delayed_refs_rsv(). btrfs_migrate_to_delayed_refs_rsv() decrements the number of bytes to migrate from the source block reserve, and this however may result in an underflow in case the space added to the transaction block reserve ended up being used by another task that has not reserved enough space for its own use - examples are tasks doing reflinks or hole punching because they end up calling btrfs_replace_file_extents() -> btrfs_drop_extents() and may need to modify/COW a variable number of leaves/paths, so they keep trying to use space from the transaction block reserve when they need to COW an extent buffer, and may end up trying to use more space then they have reserved (1 unit/path only for removing file extent items). This can be avoided by simply reserving space first without adding it to the transaction block reserve, then add the space for delayed refs to the delayed refs block reserve and finally add the remaining reserved space to the transaction block reserve. This also makes the code a bit shorter and simpler. So just do that. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-20btrfs: fix race when refilling delayed refs block reserveFilipe Manana1-3/+34
If we have two (or more) tasks attempting to refill the delayed refs block reserve we can end up with the delayed block reserve being over reserved, that is, with a reserved space greater than its size. If this happens, we are holding to more reserved space than necessary for a while. The race happens like this: 1) The delayed refs block reserve has a size of 8M and a reserved space of 6M for example; 2) Task A calls btrfs_delayed_refs_rsv_refill(); 3) Task B also calls btrfs_delayed_refs_rsv_refill(); 4) Task A sees there's a 2M difference between the size and the reserved space of the delayed refs rsv, so it will reserve 2M of space by calling btrfs_reserve_metadata_bytes(); 5) Task B also sees that 2M difference, and like task A, it reserves another 2M of metadata space; 6) Both task A and task B increase the reserved space of block reserve by 2M, by calling btrfs_block_rsv_add_bytes(), so the block reserve ends up with a size of 8M and a reserved space of 10M; 7) The extra, over reserved space will eventually be freed by some task calling btrfs_delayed_refs_rsv_release() -> btrfs_block_rsv_release() -> block_rsv_release_bytes(), as there we will detect the over reserve and release that space. So fix this by checking if we still need to add space to the delayed refs block reserve after reserving the metadata space, and if we don't, just release that space immediately. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: use a single switch statement when initializing delayed ref headFilipe Manana1-20/+24
At init_delayed_ref_head(), we are using two separate if statements to check the delayed ref head action, and initializing 'must_insert_reserved' to false twice, once when the variable is declared and once again in an else branch. Make this simpler and more straightforward by having a single switch statement, also moving the comment about a drop action to the corresponding switch case to make it more clear and eliminating the duplicated initialization of 'must_insert_reserved' to false. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: use bool type for delayed ref head fields that are used as booleansFilipe Manana1-6/+6
There's no point in have several fields defined as 1 bit unsigned int in struct btrfs_delayed_ref_head, we can instead use a bool type, it makes the code a bit more readable and it doesn't change the structure size. So switch them to proper booleans. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: assert correct lock is held at btrfs_select_ref_head()Filipe Manana1-0/+1
The function btrfs_select_ref_head() iterates over the red black tree of delayed reference heads, which is protected by the spinlock in the delayed refs root. The function doesn't take the lock, it's taken by its single caller, btrfs_obtain_ref_head(), because it needs to call that function and btrfs_delayed_ref_lock() in the same critical section (delimited by that spinlock). So assert at btrfs_select_ref_head() that we are holding the expected lock. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: get rid of label and goto at insert_delayed_ref()Filipe Manana1-11/+8
At insert_delayed_ref() there's no point of having a label and goto in the case we were able to insert the delayed ref head. We can just add the code under label to the if statement's body and return immediately, and also there is no need to track the return value in a variable, we can just return a literal true or false value directly. So do those changes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: make insert_delayed_ref() return a bool instead of an intFilipe Manana1-13/+14
insert_delayed_ref() can only return 0 or 1, to indicate if the given delayed reference was added to the head reference or if it was merged into an existing delayed ref, respectively. So just make it return a boolean instead. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: use a bool to track qgroup record insertion when adding ref headFilipe Manana1-5/+5
We are using an integer as a boolean to track the qgroup record insertion status when adding a delayed reference head. Since all we need is a boolean, switch the type from int to bool to make it more obvious. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: remove pointless in_tree field from struct btrfs_delayed_ref_nodeFilipe Manana1-2/+0
The 'in_tree' field is really not needed in struct btrfs_delayed_ref_node, as we can check whether a reference is in the tree or not simply by checking its red black tree node member with RB_EMPTY_NODE(), as when we remove it from the tree we always call RB_CLEAR_NODE(). So remove that field and use RB_EMPTY_NODE(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: remove unused is_head field from struct btrfs_delayed_ref_nodeFilipe Manana1-1/+0
The 'is_head' field of struct btrfs_delayed_ref_node is no longer after commit d278850eff30 ("btrfs: remove delayed_ref_node from ref_head"), so remove it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: add helper to calculate space for delayed referencesFilipe Manana1-36/+4
Instead of duplicating the logic for calculating how much space is required for a given number of delayed references, add an inline helper to encapsulate that logic and use it everywhere we are calculating the space required. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: calculate the right space for a single delayed ref when refillingFilipe Manana1-0/+11
When refilling the delayed block reserve we are incorrectly computing the amount of bytes for a single delayed reference if the free space tree is being used. In that case we should double the calculated amount. Everywhere else we compute the correct amount, like when updating the delayed block reserve, at btrfs_update_delayed_refs_rsv(), or when releasing space from the delayed block reserve, at btrfs_delayed_refs_rsv_release(). So fix btrfs_delayed_refs_rsv_refill() to multiply the amount of bytes for a single delayed reference by two in case the free space tree is used. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: remove obsolete delayed ref throttling logic when truncating itemsFilipe Manana1-16/+0
We have this logic encapsulated in btrfs_should_throttle_delayed_refs() where we try to estimate if running the current amount of delayed references we have will take more than half a second, and if so, the caller btrfs_should_throttle_delayed_refs() should do something to prevent more and more delayed refs from being accumulated. This logic was added in commit 0a2b2a844af6 ("Btrfs: throttle delayed refs better") and then further refined in commit a79b7d4b3e81 ("Btrfs: async delayed refs"). The idea back then was that the caller of btrfs_should_throttle_delayed_refs() would release its transaction handle (by calling btrfs_end_transaction()) when that function returned true, then btrfs_end_transaction() would trigger an async job to run delayed references in a workqueue, and later start/join a transaction again and do more work. However we don't run delayed references asynchronously anymore, that was removed in commit db2462a6ad3d ("btrfs: don't run delayed refs in the end transaction logic"). That makes the logic that tries to estimate how long we will take to run our current delayed references, at btrfs_should_throttle_delayed_refs(), pointless as we don't take any action to run delayed references anymore. We do have other type of throttling, which consists of checking the size and reserved space of the delayed and global block reserves, as well as if fluhsing delayed references for the current transaction was already started, etc - this is all done by btrfs_should_end_transaction(), and the only user of btrfs_should_throttle_delayed_refs() does periodically call btrfs_should_end_transaction(). So remove btrfs_should_throttle_delayed_refs() and the infrastructure that keeps track of the average time used for running delayed references, as well as adapting btrfs_truncate_inode_items() to call btrfs_check_space_for_delayed_refs() instead. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: simplify btrfs_should_throttle_delayed_refs()Filipe Manana1-4/+2
Currently btrfs_should_throttle_delayed_refs() returns 1 or 2 in case the delayed refs should be throttled, however the only caller (inode eviction and truncation path) does not care about those two different conditions, it treats the return value as a boolean. This allows us to remove one of the conditions in btrfs_should_throttle_delayed_refs() and change its return value from 'int' to 'bool'. So just do that. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: pass a bool size update argument to btrfs_block_rsv_add_bytes()Filipe Manana1-1/+1
At btrfs_delayed_refs_rsv_refill(), we are passing a value of 0 to the 'update_size' argument of btrfs_block_rsv_add_bytes(), which is defined as a boolean. Functionally this is fine because a 0 is, implicitly, converted to a boolean false value. However it's easier to read an explicit 'false' value, so just pass 'false' instead of 0. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: directly pass in fs_info to btrfs_merge_delayed_refsJohannes Thumshirn1-2/+1
Now that none of the functions called by btrfs_merge_delayed_refs() needs a btrfs_trans_handle, directly pass in a btrfs_fs_info to btrfs_merge_delayed_refs(). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: drop trans parameter of insert_delayed_refJohannes Thumshirn1-4/+3
Now that drop_delayed_ref() doesn't need a btrfs_trans_handle, drop it from insert_delayed_ref() as well. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: remove trans parameter of merge_refJohannes Thumshirn1-3/+2
Now that drop_delayed_ref() doesn't get the btrfs_trans_handle passed in anymore, we can get rid of it in merge_ref() as well. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: drop unused trans parameter of drop_delayed_refJohannes Thumshirn1-5/+4
drop_delayed_ref() doesn't use the btrfs_trans_handle it gets passed in, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: update function commentsDavid Sterba1-10/+9
Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing. Changes made: - remove kdoc markers - reformat the brief description to be a proper sentence - reword to imperative voice - align parameter list - fix typos Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move mount option definitions to fs.hJosef Bacik1-0/+1
These are fs wide definitions and helpers, move them out of ctree.h and into fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>