summaryrefslogtreecommitdiff
path: root/fs/btrfs/space-info.c
AgeCommit message (Collapse)AuthorFilesLines
2024-03-04btrfs: remove unused included headersDavid Sterba1-1/+0
With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-22btrfs: fix data races when accessing the reserved amount of block reservesFilipe Manana1-13/+13
At space_info.c we have several places where we access the ->reserved field of a block reserve without taking the block reserve's spinlock first, which makes KCSAN warn about a data race since that field is always updated while holding the spinlock. The reports from KCSAN are like the following: [117.193526] BUG: KCSAN: data-race in btrfs_block_rsv_release [btrfs] / need_preemptive_reclaim [btrfs] [117.195148] read to 0x000000017f587190 of 8 bytes by task 6303 on cpu 3: [117.195172] need_preemptive_reclaim+0x222/0x2f0 [btrfs] [117.195992] __reserve_bytes+0xbb0/0xdc8 [btrfs] [117.196807] btrfs_reserve_metadata_bytes+0x4c/0x120 [btrfs] [117.197620] btrfs_block_rsv_add+0x78/0xa8 [btrfs] [117.198434] btrfs_delayed_update_inode+0x154/0x368 [btrfs] [117.199300] btrfs_update_inode+0x108/0x1c8 [btrfs] [117.200122] btrfs_dirty_inode+0xb4/0x140 [btrfs] [117.200937] btrfs_update_time+0x8c/0xb0 [btrfs] [117.201754] touch_atime+0x16c/0x1e0 [117.201789] filemap_read+0x674/0x728 [117.201823] btrfs_file_read_iter+0xf8/0x410 [btrfs] [117.202653] vfs_read+0x2b6/0x498 [117.203454] ksys_read+0xa2/0x150 [117.203473] __s390x_sys_read+0x68/0x88 [117.203495] do_syscall+0x1c6/0x210 [117.203517] __do_syscall+0xc8/0xf0 [117.203539] system_call+0x70/0x98 [117.203579] write to 0x000000017f587190 of 8 bytes by task 11 on cpu 0: [117.203604] btrfs_block_rsv_release+0x2e8/0x578 [btrfs] [117.204432] btrfs_delayed_inode_release_metadata+0x7c/0x1d0 [btrfs] [117.205259] __btrfs_update_delayed_inode+0x37c/0x5e0 [btrfs] [117.206093] btrfs_async_run_delayed_root+0x356/0x498 [btrfs] [117.206917] btrfs_work_helper+0x160/0x7a0 [btrfs] [117.207738] process_one_work+0x3b6/0x838 [117.207768] worker_thread+0x75e/0xb10 [117.207797] kthread+0x21a/0x230 [117.207830] __ret_from_fork+0x6c/0xb8 [117.207861] ret_from_fork+0xa/0x30 So add a helper to get the reserved amount of a block reserve while holding the lock. The value may be not be up to date anymore when used by need_preemptive_reclaim() and btrfs_preempt_reclaim_metadata_space(), but that's ok since the worst it can do is cause more reclaim work do be done sooner rather than later. Reading the field while holding the lock instead of using the data_race() annotation is used in order to prevent load tearing. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: adjust overcommit logic when very close to fullJosef Bacik1-0/+32
A user reported some unpleasant behavior with very small file systems. The reproducer is this $ mkfs.btrfs -f -m single -b 8g /dev/vdb $ mount /dev/vdb /mnt/test $ dd if=/dev/zero of=/mnt/test/testfile bs=512M count=20 This will result in usage that looks like this Overall: Device size: 8.00GiB Device allocated: 8.00GiB Device unallocated: 1.00MiB Device missing: 0.00B Device slack: 2.00GiB Used: 5.47GiB Free (estimated): 2.52GiB (min: 2.52GiB) Free (statfs, df): 0.00B Data ratio: 1.00 Metadata ratio: 1.00 Global reserve: 5.50MiB (used: 0.00B) Multiple profiles: no Data,single: Size:7.99GiB, Used:5.46GiB (68.41%) /dev/vdb 7.99GiB Metadata,single: Size:8.00MiB, Used:5.77MiB (72.07%) /dev/vdb 8.00MiB System,single: Size:4.00MiB, Used:16.00KiB (0.39%) /dev/vdb 4.00MiB Unallocated: /dev/vdb 1.00MiB As you can see we've gotten ourselves quite full with metadata, with all of the disk being allocated for data. On smaller file systems there's not a lot of time before we get full, so our overcommit behavior bites us here. Generally speaking data reservations result in chunk allocations as we assume reservation == actual use for data. This means at any point we could end up with a chunk allocation for data, and if we're very close to full we could do this before we have a chance to figure out that we need another metadata chunk. Address this by adjusting the overcommit logic. Simply put we need to take away 1 chunk from the available chunk space in case of a data reservation. This will allow us to stop overcommitting before we potentially lose this space to a data allocation. With this fix in place we properly allocate a metadata chunk before we're completely full, allowing for enough slack space in metadata. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: allow to run delayed refs by bytes to be released instead of countFilipe Manana1-15/+2
When running delayed references, through btrfs_run_delayed_refs(), we can specify how many to run, run all existing delayed references and keep running delayed references while we can find any. This is controlled with the value of the 'count' argument, where a value of 0 means to run all delayed references that exist by the time btrfs_run_delayed_refs() is called, (unsigned long)-1 means to keep running delayed references while we are able find any, and any other value to run that exact number of delayed references. Typically a specific value other than 0 or -1 is used when flushing space to try to release a certain amount of bytes for a ticket. In this case we just simply calculate how many delayed reference heads correspond to a specific amount of bytes, with calc_delayed_refs_nr(). However that only takes into account the space reserved for the reference heads themselves, and does not account for the space reserved for deleting checksums from the csum tree (see add_delayed_ref_head() and update_existing_head_ref()) in case we are going to delete a data extent. This means we may end up running more delayed references than necessary in case we process delayed references for deleting a data extent. So change the logic of btrfs_run_delayed_refs() to take a bytes argument to specify how many bytes of delayed references to run/release, using the special values of 0 to mean all existing delayed references and U64_MAX (or (u64)-1) to keep running delayed references while we can find any. This prevents running more delayed references than necessary, when we have delayed references for deleting data extents, but also makes the upcoming changes/patches simpler and it's preparatory work for them. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: pass a space_info argument to btrfs_reserve_metadata_bytes()Filipe Manana1-7/+5
We are passing a block reserve argument to btrfs_reserve_metadata_bytes() which is not really used, all we need is to pass the space_info associated to the block reserve, we don't change the block reserve at all. Not only it's pointless to pass the block reserve, it's also confusing as one might think that the reserved bytes will end up being added to the passed block reserve, when that's not the case. The pattern for reserving space and adding it to a block reserve is to first reserve space with btrfs_reserve_metadata_bytes() and if that succeeds, then add the space to a block reserve by calling btrfs_block_rsv_add_bytes(). Also the reverse of btrfs_reserve_metadata_bytes(), which is btrfs_space_info_free_bytes_may_use(), takes a space_info argument and not a block reserve, so one more reason to pass a space_info and not a block reserve to btrfs_reserve_metadata_bytes(). So change btrfs_reserve_metadata_bytes() and its callers to pass a space_info argument instead of a block reserve argument. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: reformat remaining kdoc style commentsDavid Sterba1-1/+2
Function name in the comment does not bring much value to code not exposed as API and we don't stick to the kdoc format anymore. Update formatting of parameter descriptions. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: zoned: re-enable metadata over-commit for zoned modeNaohiro Aota1-5/+1
Now that, we can re-enable metadata over-commit. As we moved the activation from the reservation time to the write time, we no longer need to ensure all the reserved bytes is properly activated. Without the metadata over-commit, it suffers from lower performance because it needs to flush the delalloc items more often and allocate more block groups. Re-enabling metadata over-commit will solve the issue. Fixes: 79417d040f4f ("btrfs: zoned: disable metadata overcommit for zoned") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: zoned: don't activate non-DATA BG on allocationNaohiro Aota1-28/+0
Now that a non-DATA block group is activated at write time, don't activate it on allocation time. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: avoid starting and committing empty transaction when flushing spaceFilipe Manana1-1/+10
When flushing space and we are in the COMMIT_TRANS state, we join a transaction with btrfs_join_transaction() and then commit the returned transaction. However btrfs_join_transaction() starts a new transaction if there is none currently open, which is pointless since comitting a new, empty transaction, doesn't achieve anything, it only wastes time, IO and creates an unnecessary rotation of the backup roots. So use btrfs_attach_transaction_barrier() to avoid starting a new transaction. This also waits for any ongoing transaction that is committing (state >= TRANS_STATE_COMMIT_DOING) to fully complete, and therefore wait for all the extents that were pinned during the transaction's lifetime to be unpinned. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: avoid starting new transaction when flushing delayed items and refsFilipe Manana1-2/+6
When flushing space we join a transaction to flush delayed items and delayed references, in order to try to release space. However using btrfs_join_transaction() not only joins an existing transaction as well as it starts a new transaction if there is none open. If there is no transaction open, we don't have neither delayed items nor delayed references, so creating a new transaction is a waste of time, IO and creates an unnecessary rotation of the backup roots without gaining any benefits (including releasing space). So use btrfs_join_transaction_nostart() when attempting to flush delayed items and references. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: fail priority metadata ticket with real fs errorFilipe Manana1-5/+5
At priority_reclaim_metadata_space(), if we were not able to satisfy the the ticket after going through the various flushing states and we notice the fs went into an error state, likely due to a transaction abort during the flushing, set the ticket's error to the error that caused the transaction abort instead of an unconditional -EROFS. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: don't steal space from global rsv after a transaction abortFilipe Manana1-2/+12
When doing a priority metadata space reclaim, while we are going through the flush states and running their respective operations, it's possible that a transaction abort happened, for example when running delayed refs we hit -ENOSPC or in the critical section of transaction commit we failed with -ENOSPC or some other error. In these cases a transaction was aborted and the fs turned into error state. If that happened, then it makes no sense to steal from the global block reserve and return success to the caller if the stealing was successful - the caller will later get an error when attempting to modify the fs. Instead make the ticket fail if we have the fs in error state and don't attempt to steal from the global rsv, as it's not only it's pointless, it also simplifies debugging some -ENOSPC problems. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: print available space across all block groups when dumping space infoFilipe Manana1-0/+4
When dumping a space info also sum the available space for all block groups and then print it. This often useful for debugging -ENOSPC related problems. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: print available space for a block group when dumping a space infoFilipe Manana1-2/+7
When dumping a space info, we iterate over all its block groups and then print their size and the amounts of bytes used, reserved, pinned, etc. When debugging -ENOSPC problems it's also useful to know how much space is available (free), so calculate that and print it as well. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: print block group super and delalloc bytes when dumping space infoFilipe Manana1-4/+5
When dumping a space info's block groups, also print the number of bytes used for super blocks and delalloc. This is often useful for debugging -ENOSPC problems. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: add helper to calculate space for delayed referencesFilipe Manana1-13/+1
Instead of duplicating the logic for calculating how much space is required for a given number of delayed references, add an inline helper to encapsulate that logic and use it everywhere we are calculating the space required. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: constify fs_info argument for the reclaim items calculation helpersFilipe Manana1-2/+2
Now that btrfs_calc_insert_metadata_size() can take a const fs_info argument, make the fs_info argument of calc_reclaim_items_nr() and of calc_delayed_refs_nr() const as well. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: accurately calculate number of delayed refs when flushingFilipe Manana1-1/+25
When flushing a limited number of delayed references (FLUSH_DELAYED_REFS_NR state), we are assuming each delayed reference is holding a number of bytes matching the needed space for inserting for a single metadata item (the result of btrfs_calc_insert_metadata_size()). That is not correct when using the free space tree, as in that case we have to multiply that value by 2 since we need to touch the free space tree as well. This is the same computation as we do at btrfs_update_delayed_refs_rsv() and at btrfs_delayed_refs_rsv_release(). So correct the computation for the amount of delayed references we need to flush in case we have the free space tree. This does not fix a functional issue, instead it makes the flush code flush less delayed references, only the minimum necessary to satisfy a ticket. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: initialize ret to -ENOSPC at __reserve_bytes()Filipe Manana1-2/+1
At space-info.c:__reserve_bytes(), instead of initializing 'ret' to 0 when it's declared and then shortly after set it to -ENOSPC under the space info's spinlock, initialize it to -ENOSPC when declaring it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: update flush method assertion when reserving spaceFilipe Manana1-1/+12
When reserving space, at space-info.c:__reserve_bytes(), we assert that either the current task is not holding a transacion handle, or, if it is, that the flush method is not BTRFS_RESERVE_FLUSH_ALL. This is because that flush method can trigger transaction commits, and therefore could lead to a deadlock. However there are other 2 flush methods that can trigger transaction commits: 1) BTRFS_RESERVE_FLUSH_ALL_STEAL 2) BTRFS_RESERVE_FLUSH_EVICT So update the assertion to check the flush method is also not one those two methods if the current task is holding a transaction handle. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-15btrfs: zoned: drop space_info->active_total_bytesNaohiro Aota1-31/+9
The space_info->active_total_bytes is no longer necessary as we now count the region of newly allocated block group as zone_unusable. Drop its usage. Fixes: 6a921de58992 ("btrfs: zoned: introduce space_info->active_total_bytes") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-15btrfs: rename BTRFS_FS_NO_OVERCOMMIT to BTRFS_FS_ACTIVE_ZONE_TRACKINGJosef Bacik1-1/+1
This flag only gets set when we're doing active zone tracking, and we're going to need to use this flag for things related to this behavior. Rename the flag to represent what it actually means for the file system so it can be used in other ways and still make sense. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11btrfs: zoned: enable metadata over-commit for non-ZNS setupNaohiro Aota1-1/+2
The commit 79417d040f4f ("btrfs: zoned: disable metadata overcommit for zoned") disabled the metadata over-commit to track active zones properly. However, it also introduced a heavy overhead by allocating new metadata block groups and/or flushing dirty buffers to release the space reservations. Specifically, a workload (write only without any sync operations) worsen its performance from 343.77 MB/sec (v5.19) to 182.89 MB/sec (v6.0). The performance is still bad on current misc-next which is 187.95 MB/sec. And, with this patch applied, it improves back to 326.70 MB/sec (+73.82%). This patch introduces a new fs_info->flag BTRFS_FS_NO_OVERCOMMIT to indicate it needs to disable the metadata over-commit. The flag is enabled when a device with max active zones limit is loaded into a file-system. Fixes: 79417d040f4f ("btrfs: zoned: disable metadata overcommit for zoned") CC: stable@vger.kernel.org # 6.0+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: simplify percent calculation helpers, rename div_factorDavid Sterba1-2/+2
The div_factor* helpers calculate fraction or percentage fraction. The name is a bit confusing, we use it only for percentage calculations and there are two helpers. There's a helper mult_frac that's for general fractions, that tries to be accurate but we multiply and divide by small numbers so we can use the div_u64 helper. Rename the div_factor* helpers and use 1..100 percentage range, also drop the case checking for percentage == 100, it's never hit. The conversions: * div_factor calculates tenths and the numbers need to be adjusted * div_factor_fine is direct replacement Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: update function commentsDavid Sterba1-8/+8
Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing. Changes made: - remove kdoc markers - reformat the brief description to be a proper sentence - reword to imperative voice - align parameter list - fix typos Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move extent-tree helpers into their own header fileJosef Bacik1-0/+1
Move all the extent tree related prototypes to extent-tree.h out of ctree.h, and then go include it everywhere needed so everything compiles. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move btrfs_account_ro_block_groups_free_space into space-info.cJosef Bacik1-0/+34
This was prototyped in ctree.h and the code existed in extent-tree.c, but it's space-info related so move it into space-info.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move accessor helpers into accessors.hJosef Bacik1-0/+1
This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move fs wide helpers out of ctree.hJosef Bacik1-0/+1
We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: introduce BTRFS_RESERVE_FLUSH_EMERGENCYJosef Bacik1-2/+27
Inside of FB, as well as some user reports, we've had a consistent problem of occasional ENOSPC transaction aborts. Inside FB we were seeing ~100-200 ENOSPC aborts per day in the fleet, which is a really low occurrence rate given the size of our fleet, but it's not nothing. There are two causes of this particular problem. First is delayed allocation. The reservation system for delalloc assumes that contiguous dirty ranges will result in 1 file extent item. However if there is memory pressure that results in fragmented writeout, or there is fragmentation in the block groups, this won't necessarily be true. Consider the case where we do a single 256MiB write to a file and then close it. We will have 1 reservation for the inode update, the reservations for the checksum updates, and 1 reservation for the file extent item. At some point later we decide to write this entire range out, but we're so fragmented that we break this into 100 different file extents. Since we've already closed the file and are no longer writing to it there's nothing to trigger a refill of the delalloc block rsv to satisfy the 99 new file extent reservations we need. At this point we exhaust our delalloc reservation, and we begin to steal from the global reserve. If you have enough of these cases going in parallel you can easily exhaust the global reserve, get an ENOSPC at btrfs_alloc_tree_block() time, and then abort the transaction. The other case is the delayed refs reserve. The delayed refs reserve updates its size based on outstanding delayed refs and dirty block groups. However we only refill this block reserve when returning excess reservations and when we call btrfs_start_transaction(root, X). We will reserve 2*X credits at transaction start time, and fill in X into the delayed refs reserve to make sure it stays topped off. Generally this works well, but clearly has downsides. If we do a particularly delayed ref heavy operation we may never catch up in our reservations. Additionally running delayed refs generates more delayed refs, and at that point we may be committing the transaction and have no way to trigger a refill of our delayed refs rsv. Then a similar thing occurs with the delalloc reserve. Generally speaking we well over-reserve in all of our block rsvs. If we reserve 1 credit we're usually reserving around 264k of space, but we'll often not use any of that reservation, or use a few blocks of that reservation. We can be reasonably sure that as long as you were able to reserve space up front for your operation you'll be able to find space on disk for that reservation. So introduce a new flushing state, BTRFS_RESERVE_FLUSH_EMERGENCY. This gets used in the case that we've exhausted our reserve and the global reserve. It simply forces a reservation if we have enough actual space on disk to make the reservation, which is almost always the case. This keeps us from hitting ENOSPC aborts in these odd occurrences where we've not kept up with the delayed work. Fixing this in a complete way is going to be relatively complicated and time consuming. This patch is what I discussed with Filipe earlier this year, and what I put into our kernels inside FB. With this patch we're down to 1-2 ENOSPC aborts per week, which is a significant reduction. This is a decent stop gap until we can work out a more wholistic solution to these two corner cases. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: add the ability to use NO_FLUSH for data reservationsJosef Bacik1-1/+2
In order to accommodate NOWAIT IOCB's we need to be able to do NO_FLUSH data reservations, so plumb this through the delalloc reservation system. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: remove useless used space increment during space reservationFilipe Manana1-1/+0
At space-info.c:__reserve_bytes(), we increment the 'used' variable, but then we don't use the variable anymore, making the increment pointless. The increment became useless with commit 2e294c60497f29 ("btrfs: simplify the logic in need_preemptive_flushing"), so just remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: dump all space infos if we abort transaction due to ENOSPCQu Wenruo1-7/+24
We have hit some transaction abort due to -ENOSPC internally. Normally we should always reserve enough space for metadata for every transaction, thus hitting -ENOSPC should really indicate some cases we didn't expect. But unfortunately current error reporting will only give a kernel warning and stack trace, not really helpful to debug what's causing the problem. And mount option debug_enospc can only help when user can reproduce the problem, but under most cases, such transaction abort by -ENOSPC is really hard to reproduce. So this patch will dump all space infos (data, metadata, system) when we abort the first transaction with -ENOSPC. This should at least provide some clue to us. The example of a dump would look like this: BTRFS: Transaction aborted (error -28) WARNING: CPU: 8 PID: 3366 at fs/btrfs/transaction.c:2137 btrfs_commit_transaction+0xf81/0xfb0 [btrfs] <call trace skipped> ---[ end trace 0000000000000000 ]--- BTRFS info (device dm-1: state A): dumping space info: BTRFS info (device dm-1: state A): space_info DATA has 6791168 free, is not full BTRFS info (device dm-1: state A): space_info total=8388608, used=1597440, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0 BTRFS info (device dm-1: state A): space_info METADATA has 257114112 free, is not full BTRFS info (device dm-1: state A): space_info total=268435456, used=131072, pinned=180224, reserved=65536, may_use=10878976, readonly=65536 zone_unusable=0 BTRFS info (device dm-1: state A): space_info SYSTEM has 8372224 free, is not full BTRFS info (device dm-1: state A): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0 BTRFS info (device dm-1: state A): global_block_rsv: size 3670016 reserved 3670016 BTRFS info (device dm-1: state A): trans_block_rsv: size 0 reserved 0 BTRFS info (device dm-1: state A): chunk_block_rsv: size 0 reserved 0 BTRFS info (device dm-1: state A): delayed_block_rsv: size 4063232 reserved 4063232 BTRFS info (device dm-1: state A): delayed_refs_rsv: size 3145728 reserved 3145728 BTRFS: error (device dm-1: state A) in btrfs_commit_transaction:2137: errno=-28 No space left BTRFS info (device dm-1: state EA): forced readonly Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: output human readable space info flagQu Wenruo1-3/+20
For btrfs_space_info, its flags has only 4 possible values: - BTRFS_BLOCK_GROUP_SYSTEM - BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA - BTRFS_BLOCK_GROUP_METADATA - BTRFS_BLOCK_GROUP_DATA Make the output more human readable, now it looks like: BTRFS info (device dm-1: state A): space_info METADATA has 251494400 free, is not full Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: convert block group bit field to use bit helpersJosef Bacik1-1/+1
We use a bit field in the btrfs_block_group for different flags, however this is awkward because we have to hold the block_group->lock for any modification of any of these fields, and makes the code clunky for a few of these flags. Convert these to a properly flags setup so we can utilize the bit helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: handle space_info setting of bg in btrfs_add_bg_to_space_infoJosef Bacik1-4/+9
We previously had the pattern of btrfs_update_space_info(all, the, bg, fields, &space_info); link_block_group(bg); bg->space_info = space_info; Now that we're passing the bg into btrfs_add_bg_to_space_info we can do the linking in that function, transforming this to simply btrfs_add_bg_to_space_info(fs_info, bg); and put the link_block_group() and bg->space_info assignment directly in btrfs_add_bg_to_space_info. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: simplify arguments of btrfs_update_space_info and renameJosef Bacik1-15/+14
This function has grown a bunch of new arguments, and it just boils down to passing in all the block group fields as arguments. Simplify this by passing in the block group itself and updating the space_info fields based on the block group fields directly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-06btrfs: fix the max chunk size and stripe length calculationQu Wenruo1-1/+1
[BEHAVIOR CHANGE] Since commit f6fca3917b4d ("btrfs: store chunk size in space-info struct"), btrfs no longer can create larger data chunks than 1G: mkfs.btrfs -f -m raid1 -d raid0 $dev1 $dev2 $dev3 $dev4 mount $dev1 $mnt btrfs balance start --full $mnt btrfs balance start --full $mnt umount $mnt btrfs ins dump-tree -t chunk $dev1 | grep "DATA|RAID0" -C 2 Before that offending commit, what we got is a 4G data chunk: item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176 length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0 io_align 65536 io_width 65536 sector_size 4096 num_stripes 4 sub_stripes 1 Now what we got is only 1G data chunk: item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 6271533056) itemoff 15491 itemsize 176 length 1073741824 owner 2 stripe_len 65536 type DATA|RAID0 io_align 65536 io_width 65536 sector_size 4096 num_stripes 4 sub_stripes 1 This will increase the number of data chunks by the number of devices, not only increase system chunk usage, but also greatly increase mount time. Without a proper reason, we should not change the max chunk size. [CAUSE] Previously, we set max data chunk size to 10G, while max data stripe length to 1G. Commit f6fca3917b4d ("btrfs: store chunk size in space-info struct") completely ignored the 10G limit, but use 1G max stripe limit instead, causing above shrink in max data chunk size. [FIX] Fix the max data chunk size to 10G, and in decide_stripe_size_regular() we limit stripe_size to 1G manually. This should only affect data chunks, as for metadata chunks we always set the max stripe size the same as max chunk size (256M or 1G depending on fs size). Now the same script result the same old result: item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176 length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0 io_align 65536 io_width 65536 sector_size 4096 num_stripes 4 sub_stripes 1 Reported-by: Wang Yugui <wangyugui@e16-tech.com> Fixes: f6fca3917b4d ("btrfs: store chunk size in space-info struct") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: zoned: activate metadata block group on flush_spaceNaohiro Aota1-0/+30
For metadata space on zoned filesystem, reaching ALLOC_CHUNK{,_FORCE} means we don't have enough space left in the active_total_bytes. Before allocating a new chunk, we can try to activate an existing block group in this case. Also, allocating a chunk is not enough to grant a ticket for metadata space on zoned filesystem we need to activate the block group to increase the active_total_bytes. btrfs_zoned_activate_one_bg() implements the activation feature. It will activate a block group by (maybe) finishing a block group. It will give up activating a block group if it cannot finish any block group. CC: stable@vger.kernel.org # 5.16+ Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: zoned: disable metadata overcommit for zonedNaohiro Aota1-1/+4
The metadata overcommit makes the space reservation flexible but it is also harmful to active zone tracking. Since we cannot finish a block group from the metadata allocation context, we might not activate a new block group and might not be able to actually write out the overcommit reservations. So, disable metadata overcommit for zoned filesystems. We will ensure the reservations are under active_total_bytes in the following patches. CC: stable@vger.kernel.org # 5.16+ Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: zoned: introduce space_info->active_total_bytesNaohiro Aota1-9/+32
The active_total_bytes, like the total_bytes, accounts for the total bytes of active block groups in the space_info. With an introduction of active_total_bytes, we can check if the reserved bytes can be written to the block groups without activating a new block group. The check is necessary for metadata allocation on zoned filesystem. We cannot finish a block group, which may require waiting for the current transaction, from the metadata allocation context. Instead, we need to ensure the ongoing allocation (reserved bytes) fits in active block groups. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: store chunk size in space-info structStefan Roesch1-0/+32
The chunk size is stored in the btrfs_space_info structure. It is initialized at the start and is then used. A new API is added to update the current chunk size. This API is used to be able to expose the chunk_size as a sysfs setting. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename and merge helpers, switch atomic type to u64, style fixes ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: fix typos in commentsDavid Sterba1-1/+1
Codespell has found a few typos. Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: make the bg_reclaim_threshold per-space infoJosef Bacik1-0/+9
For non-zoned file systems it's useful to have the auto reclaim feature, however there are different use cases for non-zoned, for example we may not want to reclaim metadata chunks ever, only data chunks. Move this sysfs flag to per-space_info. This won't affect current users because this tunable only ever did anything for zoned, and that is currently hidden behind BTRFS_CONFIG_DEBUG. Tested-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ jth restore global bg_reclaim_threshold ] Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: remove unnecessary type castsYu Zhe1-1/+1
Explicit type casts are not necessary when it's void* to another pointer type. Signed-off-by: Yu Zhe <yuzhe@nfschina.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: add lockdep_assert_held to need_preemptive_reclaimNiels Dossche1-0/+2
In a previous patch ("btrfs: extend locking to all space_info members accesses") the locking for the space_info members was extended in btrfs_preempt_reclaim_metadata_space because not all the member accesses that needed locks were actually locked (bytes_pinned et al). It was then suggested to also add a call to lockdep_assert_held to need_preemptive_reclaim. This function also works with space_info members. As of now, it has only two call sites which both hold the lock. Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: extend locking to all space_info members accessesNiels Dossche1-1/+2
bytes_pinned is always accessed under space_info->lock, except in btrfs_preempt_reclaim_metadata_space, however the other members are accessed under that lock. The reserved member of the rsv's are also partially accessed under a lock and partially not. Move all these accesses into the same lock to ensure consistency. This could potentially race and lead to a flush instead of a commit but it's not a big problem as it's only for preemptive flush. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Niels Dossche <niels.dossche@ugent.be> Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: fix argument list that the kdoc format and script verifiedYang Li1-1/+1
The warnings were found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/btrfs/extent_io.c:3210: warning: Function parameter or member 'bio_ctrl' not described in 'btrfs_bio_add_page' fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'bio' description in 'btrfs_bio_add_page' fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'prev_bio_flags' description in 'btrfs_bio_add_page' fs/btrfs/space-info.c:1602: warning: Excess function parameter 'root' description in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1602: warning: Function parameter or member 'fs_info' not described in 'btrfs_reserve_metadata_bytes' Note: this is fixing only the warnings regarding parameter list, the first line is not strictly conforming to the kdoc format as the btrfs codebase does not stick to that and keeps the first line more free form (because it's only for internal use). Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: don't use the extent_root in flush_spaceJosef Bacik1-1/+1
We only need the root to start a transaction, and since it's a global root we can pick anything, change to the tree_root as we'll have a lot of extent roots in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: change root to fs_info for btrfs_reserve_metadata_bytesJosef Bacik1-2/+1
We used to need the root for btrfs_reserve_metadata_bytes to check the orphan cleanup state, but we no longer need that, we simply need the fs_info. Change btrfs_reserve_metadata_bytes() to use the fs_info, and change both btrfs_block_rsv_refill() and btrfs_block_rsv_add() to do the same as they simply call btrfs_reserve_metadata_bytes() and then manipulate the block_rsv that is being used. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>