path: root/fs/btrfs/zoned.c
Age | Commit message | Author | Files | Lines
2022-05-16 | btrfs: zoned: make auto-reclaim less aggressive | Johannes Thumshirn | 1 | -0/+27
The current auto-reclaim algorithm starts reclaiming all block groups with a zone_unusable value above a configured threshold. This causes a lot of reclaim IO even if there are enough free zones on the device. Instead of only accounting a block group's zone_unusable value, also take the ratio of free and not usable (written as well as zone_unusable) bytes a device has into account. Tested-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
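The ratio check described above can be sketched in user-space C. This is only an illustration under stated assumptions: the struct, its field names, and the function name are hypothetical and are not taken from the kernel source; the threshold is assumed to be a percentage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device accounting; the field names are illustrative. */
struct dev_usage {
	uint64_t total_bytes; /* size of the device */
	uint64_t used_bytes;  /* bytes that are not free: written or zone_unusable */
};

/*
 * Return true when the share of non-free bytes across all devices has
 * reached threshold_pct (0..100), i.e. free space is actually scarce.
 */
static bool should_reclaim(const struct dev_usage *devs, size_t ndevs,
			   unsigned int threshold_pct)
{
	uint64_t total = 0, used = 0;

	assert(threshold_pct <= 100);
	for (size_t i = 0; i < ndevs; i++) {
		total += devs[i].total_bytes;
		used += devs[i].used_bytes;
	}
	if (total == 0)
		return false;
	/* Integer percentage; adequate for a sketch with moderate byte counts. */
	return used * 100 / total >= threshold_pct;
}
```

With a 75% threshold, a device that is 80% non-free would trigger reclaim, while one that is 30% non-free would not, regardless of how much zone_unusable space individual block groups carry.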
2022-05-05 | btrfs: zoned: activate block group properly on unlimited active zone device | Naohiro Aota | 1 | -14/+8
btrfs_zone_activate() checks if it activated all the underlying zones in the loop. However, that check never hit on an unlimited active zone device (max_active_zones == 0). Fortunately, it still works without ENOSPC because btrfs_zone_activate() returns true in the end, even if block_group->zone_is_active == 0. But it is confusing to have a block group without zone_is_active set still usable for allocation. Also, we are wasting CPU time iterating the loop every time btrfs_zone_activate() is called for the block groups. Since the error case in the loop is handled by out_unlock, we can just set zone_is_active and do the list handling after the loop. Fixes: f9a912a3c45f ("btrfs: zoned: make zone activation multi stripe capable") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-05 | btrfs: zoned: move non-changing condition check out of the loop | Naohiro Aota | 1 | -6/+6
btrfs_zone_activate() checks if block_group->alloc_offset == block_group->zone_capacity every time it iterates the loop. But the condition does not depend on the loop index. Move the check out of the loop and do it only once. Fixes: f9a912a3c45f ("btrfs: zoned: make zone activation multi stripe capable") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-24 | btrfs: zoned: remove left over ASSERT checking for single profile | Johannes Thumshirn | 1 | -4/+0
With commit dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block groups") we started allowing DUP on metadata block groups, so the ASSERT()s in btrfs_can_activate_zone() and btrfs_zoned_get_device() are no longer valid and in fact even harmful. Fixes: dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block groups") CC: stable@vger.kernel.org # 5.17 Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-24 | btrfs: zoned: traverse devices under chunk_mutex in btrfs_can_activate_zone | Johannes Thumshirn | 1 | -4/+5
btrfs_can_activate_zone() can be called with the device_list_mutex already held, which will lead to a deadlock:

  insert_dev_extents() // Takes device_list_mutex
  `-> insert_dev_extent()
   `-> btrfs_insert_empty_item()
    `-> btrfs_insert_empty_items()
     `-> btrfs_search_slot()
      `-> btrfs_cow_block()
       `-> __btrfs_cow_block()
        `-> btrfs_alloc_tree_block()
         `-> btrfs_reserve_extent()
          `-> find_free_extent()
           `-> find_free_extent_update_loop()
            `-> can_allocate_chunk()
             `-> btrfs_can_activate_zone() // Takes device_list_mutex again

Instead of using the RCU on fs_devices->device_list we can use fs_devices->alloc_list, protected by the chunk_mutex, to traverse the list of active devices.

We are in the chunk allocation thread. The newer chunk allocation happens from the devices in the fs_devices->alloc_list protected by the chunk_mutex.

  btrfs_create_chunk()
    lockdep_assert_held(&info->chunk_mutex);
    gather_device_info
      list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)

Also, a device that reappears after the mount won't join the alloc_list yet; it will be in the dev_list, which we don't want to consider in the context of the chunk alloc.
  [15.166572] WARNING: possible recursive locking detected
  [15.167117] 5.17.0-rc6-dennis #79 Not tainted
  [15.167487] --------------------------------------------
  [15.167733] kworker/u8:3/146 is trying to acquire lock:
  [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: find_free_extent+0x15a/0x14f0 [btrfs]
  [15.167733]
  [15.167733] but task is already holding lock:
  [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
  [15.167733]
  [15.167733] other info that might help us debug this:
  [15.167733]  Possible unsafe locking scenario:
  [15.167733]
  [15.171834]        CPU0
  [15.171834]        ----
  [15.171834]   lock(&fs_devs->device_list_mutex);
  [15.171834]   lock(&fs_devs->device_list_mutex);
  [15.171834]
  [15.171834]  *** DEADLOCK ***
  [15.171834]
  [15.171834]  May be due to missing lock nesting notation
  [15.171834]
  [15.171834] 5 locks held by kworker/u8:3/146:
  [15.171834]  #0: ffff888100050938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
  [15.171834]  #1: ffffc9000067be80 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
  [15.176244]  #2: ffff88810521e620 (sb_internal){.+.+}-{0:0}, at: flush_space+0x335/0x600 [btrfs]
  [15.176244]  #3: ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
  [15.176244]  #4: ffff8881152e4b78 (btrfs-dev-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x130 [btrfs]
  [15.179641]
  [15.179641] stack backtrace:
  [15.179641] CPU: 1 PID: 146 Comm: kworker/u8:3 Not tainted 5.17.0-rc6-dennis #79
  [15.179641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
  [15.179641] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
  [15.179641] Call Trace:
  [15.179641]  <TASK>
  [15.179641]  dump_stack_lvl+0x45/0x59
  [15.179641]  __lock_acquire.cold+0x217/0x2b2
  [15.179641]  lock_acquire+0xbf/0x2b0
  [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  __mutex_lock+0x8e/0x970
  [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  ? lock_is_held_type+0xd7/0x130
  [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  ? _raw_spin_unlock+0x24/0x40
  [15.183838]  ? btrfs_get_alloc_profile+0x106/0x230 [btrfs]
  [15.187601]  btrfs_reserve_extent+0x131/0x260 [btrfs]
  [15.187601]  btrfs_alloc_tree_block+0xb5/0x3b0 [btrfs]
  [15.187601]  __btrfs_cow_block+0x138/0x600 [btrfs]
  [15.187601]  btrfs_cow_block+0x10f/0x230 [btrfs]
  [15.187601]  btrfs_search_slot+0x55f/0xbc0 [btrfs]
  [15.187601]  ? lock_is_held_type+0xd7/0x130
  [15.187601]  btrfs_insert_empty_items+0x2d/0x60 [btrfs]
  [15.187601]  btrfs_create_pending_block_groups+0x2b3/0x560 [btrfs]
  [15.187601]  __btrfs_end_transaction+0x36/0x2a0 [btrfs]
  [15.192037]  flush_space+0x374/0x600 [btrfs]
  [15.192037]  ? find_held_lock+0x2b/0x80
  [15.192037]  ? btrfs_async_reclaim_data_space+0x49/0x180 [btrfs]
  [15.192037]  ? lock_release+0x131/0x2b0
  [15.192037]  btrfs_async_reclaim_data_space+0x70/0x180 [btrfs]
  [15.192037]  process_one_work+0x24c/0x5a0
  [15.192037]  worker_thread+0x4a/0x3d0

Fixes: a85f05e59bc1 ("btrfs: zoned: avoid chunk allocation if active block group has enough space") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 | btrfs: zoned: remove redundant assignment in btrfs_check_zoned_mode | Pankaj Raghav | 1 | -2/+1
Remove the redundant assignment to zone_info variable in btrfs_check_zoned_mode function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 | btrfs: zoned: allow DUP on meta-data block groups | Johannes Thumshirn | 1 | -0/+36
Allow creating or reading block-groups on a zoned device with DUP as a meta-data profile. This works because we're using the zoned_meta_io_lock and REQ_OP_WRITE operations for meta-data on zoned btrfs, so all writes to meta-data zones are aligned to the zone's write-pointer. Upon loading of the block-group, it is ensured both zones do have the same zone capacity and write-pointer offsets, so no extra machinery is needed to keep the write-pointers in sync for the meta-data zones. If this prerequisite is not met, loading of the block-group is refused. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
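The load-time prerequisite for DUP can be pictured with a small user-space C sketch. The struct, the function name, and the choice of -EIO as the refusal code are assumptions made for illustration; the commit message only states that loading is refused when the two zones disagree.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical snapshot of one zone backing a DUP block group. */
struct zone_state {
	uint64_t capacity;  /* usable bytes in the zone */
	uint64_t wp_offset; /* write pointer offset from the zone start */
};

/*
 * Return 0 when both copies can be kept in lockstep, or a negative
 * errno (assumed here to be -EIO) when loading should be refused.
 */
static int check_dup_zones(const struct zone_state *a,
			   const struct zone_state *b)
{
	assert(a && b);
	if (a->capacity != b->capacity)
		return -EIO;
	if (a->wp_offset != b->wp_offset)
		return -EIO;
	return 0;
}
```

Because every later metadata write lands at the write pointer of both zones, checking equality once at load time is enough under this scheme; no extra synchronization state is carried afterwards.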
2022-03-14 | btrfs: zoned: prepare for allowing DUP on zoned | Johannes Thumshirn | 1 | -9/+16
Allow for a block-group to be placed on more than one physical zone. This is a preparation for allowing DUP profiles for meta-data on a zoned file-system. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 | btrfs: zoned: make zone finishing multi stripe capable | Johannes Thumshirn | 1 | -22/+24
Currently finishing of a zone only works if the block group isn't spanning more than one zone. This limitation is purely artificial and can be easily expanded to block groups being placed across multiple zones. This is a preparation for allowing DUP and later more complex block-group profiles on zoned btrfs. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 | btrfs: zoned: make zone activation multi stripe capable | Johannes Thumshirn | 1 | -26/+31
Currently activation of a zone only works if the block group isn't spanning more than one zone. This limitation is purely artificial and can be easily expanded to block groups being placed across multiple zones. This is a preparation for allowing DUP and later more complex block-group profiles on zoned btrfs. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 | btrfs: zoned: fix chunk allocation condition for zoned allocator | Naohiro Aota | 1 | -3/+2
The ZNS specification defines a limit on the number of "active" zones. That limit forces us to limit the number of block groups which can be used for an allocation at the same time. To stay within the limit, commit a85f05e59bc1 ("btrfs: zoned: avoid chunk allocation if active block group has enough space") reuses the existing active block groups as much as possible when we can't activate any other zones without sacrificing an already activated block group.

However, the check is wrong in two ways. First, it checks the condition for every raid index (ffe_ctl->index). Even if it reaches the condition and "ffe_ctl->max_extent_size >= ffe_ctl->min_alloc_size" is met, there can be other block groups having enough space to hold ffe_ctl->num_bytes. (Actually, this won't happen in the current zoned code as it only supports SINGLE profile. But, it can happen once it enables other RAID types.) Second, it checks the active zone availability depending on the raid index. The raid index is just an index for space_info->block_groups, so it has nothing to do with chunk allocation.

These mistakes cause a faulty allocation in a certain situation. Consider we are running zoned btrfs on a device whose max_active_zones == 0 (no limit). And suppose no block group has room to fit ffe_ctl->num_bytes, but some have room to meet ffe_ctl->min_alloc_size (i.e. num_bytes > max_extent_size >= min_alloc_size). In this situation, the following occurs:

- With SINGLE raid_index, it reaches the chunk allocation checking code
- The check returns true because we can activate a new zone (no limit)
- But, before allocating the chunk, it iterates to the next raid index (RAID5)
- Since there are no RAID5 block groups on zoned mode, it again reaches the check code
- The check returns false because of btrfs_can_activate_zone()'s "if (raid_index != BTRFS_RAID_SINGLE)" part
- That results in returning -ENOSPC without allocating a new chunk

As a result, we end up hitting -ENOSPC too early.

Move the check to the right place in the can_allocate_chunk() hook, and do the active zone check depending on the allocation flag, not on the raid index. CC: stable@vger.kernel.org # 5.16 Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 | btrfs: zoned: simplify btrfs_check_meta_write_pointer | Johannes Thumshirn | 1 | -18/+8
btrfs_check_meta_write_pointer() will always be called with a NULL 'cache_ret' argument. As there's no need to check if we have a valid block_group passed in, remove these checks. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 | btrfs: stop accessing ->extent_root directly | Josef Bacik | 1 | -1/+2
When we start having multiple extent roots we'll need to use a helper to get to the correct extent_root. Rename fs_info->extent_root to _extent_root and convert all of the users of the extent root to using the btrfs_extent_root() helper. This will allow us to easily clean up the remaining direct accesses in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 | btrfs: zoned: cache reported zone during mount | Naohiro Aota | 1 | -9/+77
When mounting a device, we are reporting the zones twice: once for checking the zone attributes in btrfs_get_dev_zone_info() and once for loading block groups' zone info in btrfs_load_block_group_zone_info(). With a lot of block groups, that leads to a lot of REPORT ZONE commands and slows down the mount process. This patch introduces a zone info cache in struct btrfs_zoned_device_info. The cache is populated while in btrfs_get_dev_zone_info() and used for btrfs_load_block_group_zone_info() to reduce the number of REPORT ZONE commands. The zone cache is then released after loading the block groups, as it is of little use at run time.

Benchmark: Mount an HDD with 57,007 block groups
Before patch: 171.368 seconds
After patch: 64.064 seconds

While it still takes a minute due to the slowness of loading all the block groups, the patch reduces the mount time to roughly 1/3 of its previous value (about 37%). Link: https://lore.kernel.org/linux-btrfs/CAHQ7scUiLtcTqZOMMY5kbWUBOhGRwKo6J6wYPT5WY+C=cD49nQ@mail.gmail.com/ Fixes: 5b316468983d ("btrfs: get zone information of zoned block devices") CC: stable@vger.kernel.org Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-08 | btrfs: zoned: clear data relocation bg on zone finish | Johannes Thumshirn | 1 | -0/+2
When finishing a zone that is used by a dedicated data relocation block group, also remove its reference from fs_info, so we're not trying to use a full block group for allocations during data relocation, which will always fail. The result is we're not making any forward progress and end up in a deadlock situation. Fixes: c2707a255623 ("btrfs: zoned: add a dedicated data relocation block group") Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 | btrfs: zoned: use kmemdup() to replace kmalloc + memcpy | Kai Song | 1 | -3/+1
Fix memdup.cocci warning: fs/btrfs/zoned.c:1198:23-30: WARNING opportunity for kmemdup Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Kai Song <songkai01@inspur.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 | btrfs: rename btrfs_bio to btrfs_io_context | Qu Wenruo | 1 | -8/+8
The structure btrfs_bio is used by two different sites:

- bio->bi_private for mirror based profiles
  For those profiles (SINGLE/DUP/RAID1*/RAID10), this structure records how many mirrors are still pending, and saves the original endio function of the bio.

- RAID56 code
  In that case, RAID56 only utilizes the stripes info, and no longer uses that to trace the pending mirrors.

So btrfs_bio is not always bound to a bio, and contains more info for IO context, thus renaming it will make the naming less confusing. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 | btrfs: zoned: use regular writes for relocation | Johannes Thumshirn | 1 | -0/+11
Now that we have a dedicated block group for relocation, we can use REQ_OP_WRITE instead of REQ_OP_ZONE_APPEND for writing out the data on relocation. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 | btrfs: zoned: add a dedicated data relocation block group | Johannes Thumshirn | 1 | -0/+10
Relocation in a zoned filesystem can fail with a transaction abort with error -22 (EINVAL). This happens because the relocation code assumes that the extents we relocated the data to have the same size the source extents had and ensures this by preallocating the extents. But in a zoned filesystem we currently can't preallocate the extents as this would break the sequential write required rule. Therefore it can happen that the writeback process kicks in while we're still adding pages to a delalloc range and starts writing out dirty pages. This then creates destination extents that are smaller than the source extents, triggering the following safety check in get_new_location():

	if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {
		ret = -EINVAL;
		goto out;
	}

Temporarily create a dedicated block group for the relocation process, so no non-relocation data writes can interfere with the relocation writes. This is needed so that we can switch the relocation process on a zoned filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a scheme like in a non-zoned filesystem using REQ_OP_WRITE and preallocation. Fixes: 32430c614844 ("btrfs: zoned: enable relocation on a zoned filesystem") Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 | btrfs: zoned: finish fully written block group | Naohiro Aota | 1 | -0/+50
If we have written to the zone capacity, the device automatically deactivates the zone. Sync up block group side (the active BG list and zone_is_active flag) with it. We need to do it both on data BGs and metadata BGs. On data side, we add a hook to btrfs_finish_ordered_io(). On metadata side, we use end_extent_buffer_writeback(). To reduce excess lookup of a block group, we mark the last extent buffer in a block group with EXTENT_BUFFER_ZONE_FINISH flag. This cannot be done for data (ordered_extent), because the address may change due to REQ_OP_ZONE_APPEND. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 | btrfs: zoned: avoid chunk allocation if active block group has enough space | Naohiro Aota | 1 | -0/+31
The current extent allocator tries to allocate a new block group when the existing block groups do not have enough space. On a ZNS device, a new block group means a new active zone. If the number of active zones has already reached the max_active_zones, activating a new zone needs to finish an existing zone, leading to wasting the free space there. So, instead, it should reuse the existing active block groups as much as possible when we can't activate any other zones without sacrificing an already activated block group. While at it, I converted find_free_extent_update_loop() to check the found_extent() case early and made the other conditions simpler. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 | btrfs: zoned: load active zone info for block group | Naohiro Aota | 1 | -0/+24
Load activeness of underlying zones of a block group. When underlying zones are active, we add the block group to the fs_info->zone_active_bgs list. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 | btrfs: zoned: implement active zone tracking | Naohiro Aota | 1 | -0/+193
Add zone_is_active flag to btrfs_block_group. This flag indicates the underlying zones are all active. Such zone active block groups are tracked by fs_info->active_bg_list. btrfs_dev_{set,clear}_active_zone() take responsibility for the underlying device part. They set/clear the bitmap to indicate zone activeness and count the number of zones we can still activate. btrfs_zone_{activate,finish}() take responsibility for the logical part and the list management. In addition, btrfs_zone_finish() waits for any writes on it and sends REQ_OP_ZONE_FINISH to the zone. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
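The device-side bookkeeping (a bitmap of active zones plus a counter of zones that can still be activated) can be sketched in user-space C as follows. The struct layout, the 64-zone limit, and the function names are assumptions for illustration only; the kernel helpers named in the commit message are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device state; a single 64-bit word limits this sketch to 64 zones. */
struct zoned_dev {
	uint64_t active_bitmap;         /* bit n set: zone n is active */
	unsigned int max_active_zones;  /* 0 means the device has no limit */
	unsigned int active_left;       /* zones that can still be activated */
};

/* Try to mark a zone active; returns false when the budget is exhausted. */
static bool dev_set_active_zone(struct zoned_dev *dev, unsigned int zone)
{
	assert(zone < 64);
	if (dev->max_active_zones == 0)
		return true; /* unlimited device: nothing to track */
	if (dev->active_bitmap & (1ULL << zone))
		return true; /* already counted */
	if (dev->active_left == 0)
		return false;
	dev->active_bitmap |= 1ULL << zone;
	dev->active_left--;
	return true;
}

/* Mark a zone inactive again and give its slot back to the budget. */
static void dev_clear_active_zone(struct zoned_dev *dev, unsigned int zone)
{
	assert(zone < 64);
	if (dev->max_active_zones == 0)
		return;
	if (dev->active_bitmap & (1ULL << zone)) {
		dev->active_bitmap &= ~(1ULL << zone);
		dev->active_left++;
	}
}
```

On a device with a budget of two active zones, a third activation fails until one of the first two zones is finished and cleared. A real implementation would also need atomic updates, which this single-threaded sketch leaves out.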
2021-10-26 | btrfs: zoned: introduce physical_map to btrfs_block_group | Naohiro Aota | 1 | -2/+14
We will use a block group's physical location to track active zones and finish fully written zones in the following commits. Since the zone activation is done in the extent allocation context, which already holds the tree locks, we can't query the chunk tree for the physical locations. So, copy the location info into a block group and use it for activation. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 | btrfs: zoned: load active zone information from devices | Naohiro Aota | 1 | -1/+57
The ZNS specification defines a limit on the number of zones that can be in the implicit open, explicit open or closed conditions. Any zone with such condition is defined as an active zone and corresponds to any zone that is being written or that has been only partially written. If the maximum number of active zones is reached, we must either reset or finish some active zones before being able to choose other zones for storing data. Load queue_max_active_zones() and track the number of active zones left on the device. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 | btrfs: zoned: finish superblock zone once no space left for new SB | Naohiro Aota | 1 | -16/+36
If there is no more space left for a new superblock in a superblock zone, then it is better to ZONE_FINISH the zone and free up the active zone count. Since btrfs_advance_sb_log() can now issue REQ_OP_ZONE_FINISH, we also need to convert it to return int for the error case. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 | btrfs: zoned: locate superblock position using zone capacity | Naohiro Aota | 1 | -2/+13
sb_write_pointer() returns the write position of next superblock. For READ, we need a previous location. When the pointer is at the head, the previous one is the last one of the other zone. Calculate the last one's position from zone capacity. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 | btrfs: zoned: consider zone as full when no more SB can be written | Naohiro Aota | 1 | -8/+15
We cannot write beyond zone capacity. So, we should consider a zone as "full" when the write pointer goes beyond capacity - the size of super info. Also, take this opportunity to replace a subtle duplicated code with a loop and fix a typo in comment. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 | btrfs: zoned: calculate free space from zone capacity | Naohiro Aota | 1 | -2/+3
Now that we introduced capacity in a block group, we need to calculate free space using the capacity instead of the length. Thus, we account the bytes in capacity - alloc_pointer as free, and the bytes in [capacity, length] as zone unusable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
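The accounting rule can be written out as two small helpers. This is a user-space sketch; the struct and function names are hypothetical, and only the two subtractions come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of a zoned block group; all values are in bytes. */
struct bg_geometry {
	uint64_t length;       /* full zone size covered by the block group */
	uint64_t capacity;     /* usable part of the zone, capacity <= length */
	uint64_t alloc_offset; /* allocation pointer, alloc_offset <= capacity */
};

/* Free space is what remains between the allocation pointer and the capacity. */
static uint64_t bg_free_bytes(const struct bg_geometry *bg)
{
	assert(bg->alloc_offset <= bg->capacity);
	return bg->capacity - bg->alloc_offset;
}

/* The tail [capacity, length] can never be written and counts as zone unusable. */
static uint64_t bg_unusable_tail(const struct bg_geometry *bg)
{
	assert(bg->capacity <= bg->length);
	return bg->length - bg->capacity;
}
```

For a block group of length 256 with capacity 200 and an allocation pointer at 50, this yields 150 free bytes and 56 bytes of zone unusable tail; using the length instead of the capacity would overstate the free space by exactly that tail.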
2021-10-26 | btrfs: zoned: move btrfs_free_excluded_extents out of btrfs_calc_zone_unusable | Naohiro Aota | 1 | -3/+0
btrfs_free_excluded_extents() is not necessary for btrfs_calc_zone_unusable() and it makes btrfs_calc_zone_unusable() difficult to reuse. Move it out and call btrfs_free_excluded_extents() in proper context. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 | btrfs: zoned: load zone capacity information from devices | Naohiro Aota | 1 | -1/+23
The ZNS specification introduces the concept of a Zone Capacity. A zone capacity is an additional per-zone attribute that indicates the number of usable logical blocks within each zone, starting from the first logical block of each zone. It is always smaller than or equal to the zone size. With the SINGLE profile, we can set a block group's "capacity" as the same as the underlying zone's Zone Capacity. We will limit the allocation not to exceed it in a following commit. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25 | btrfs: rename btrfs_alloc_chunk to btrfs_create_chunk | Nikolay Borisov | 1 | -1/+1
The user facing function used to allocate new chunks is btrfs_chunk_alloc. Unfortunately, there is yet another similar sounding function, btrfs_alloc_chunk. This creates confusion, especially since the latter function can be considered "private" in the sense that it implements the first stage of chunk creation and as such is called by btrfs_chunk_alloc. To avoid the awkwardness that comes with having similarly named functions with distinctly different purposes, rename btrfs_alloc_chunk to btrfs_create_chunk, given that the main purpose of this function is to orchestrate the whole process of allocating a chunk - reserving space into devices, deciding on characteristics of the stripe size and creating the in-memory structures. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 | btrfs: use btrfs_next_leaf instead of btrfs_next_item when slots > nritems | Marcos Paulo de Souza | 1 | -1/+1
After calling btrfs_search_slot it is a common practice to check if the slot found isn't bigger than the number of slots in the current leaf, and if so, search for the same key in the next leaf by calling btrfs_next_leaf, which calls btrfs_next_old_leaf to do the job.

Calling btrfs_next_item in the same situation would end up in the same code flow, since

* btrfs_next_item
  * btrfs_next_old_item
    * if slot >= nritems(curr_leaf)
        btrfs_next_old_leaf

Change btrfs_verify_dev_extents and calculate_emulated_zone_size functions to use btrfs_next_leaf in the same situation. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 | btrfs: zoned: remove max_zone_append_size logic | Johannes Thumshirn | 1 | -20/+0
There used to be a patch in the original series for zoned support which limited the extent size to max_zone_append_size, but this patch has been dropped somewhere around v9. We've decided to go the opposite direction, instead of limiting extents in the first place we split them before submission to comply with the device's limits. Remove the related code, btrfs_fs_info::max_zone_append_size and btrfs_zoned_device_info::max_zone_append_size. This also removes the workaround for dm-crypt introduced in 1d68128c107a ("btrfs: zoned: fail mount if the device does not support zone append") because the fix has been merged as f34ee1dce642 ("dm crypt: Fix zoned block device support"). Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
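The "split before submission" approach mentioned above can be illustrated with a short user-space C sketch. The function name, its parameters, and the byte-granular splitting are assumptions for illustration; they do not mirror the kernel's bio splitting code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Split a write of len bytes into pieces no larger than max_bytes (the
 * assumed device limit for a single zone append). The piece sizes are
 * stored in pieces[]; at most max_pieces entries are written. Returns
 * the number of entries stored.
 */
static size_t split_extent(uint64_t len, uint64_t max_bytes,
			   uint64_t *pieces, size_t max_pieces)
{
	size_t n = 0;

	assert(max_bytes > 0);
	while (len > 0 && n < max_pieces) {
		uint64_t chunk = len < max_bytes ? len : max_bytes;

		pieces[n++] = chunk;
		len -= chunk;
	}
	return n;
}
```

The extent itself stays as large as the allocator made it; only the I/O requests are cut down to the device limit, which is the opposite of capping the extent size up front.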
2021-07-22 | btrfs: store a block_device in struct btrfs_ordered_extent | Christoph Hellwig | 1 | -8/+4
Store the block device instead of the gendisk in the btrfs_ordered_extent structure instead of acquiring a reference to it later. Note: this is from series removing bdgrab/bdput, btrfs is one of the last users. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22 | btrfs: fix typos in comments | David Sterba | 1 | -2/+2
Fix typos that have snuck in since the last round. Found by codespell. Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 | btrfs: zoned: factor out zoned device lookup | Johannes Thumshirn | 1 | -0/+21
To be able to construct a zone append bio we need to look up the btrfs_device. The code doing the chunk map lookup to get the device is present in btrfs_submit_compressed_write and submit_extent_page. Factor out the lookup calls into a helper and use it in the submission paths. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 | btrfs: zoned: bail out if we can't read a reliable write pointer | Johannes Thumshirn | 1 | -0/+14
If we can't read a reliable write pointer from a sequential zone, fail creating the block group with an I/O error. Also, if the read write pointer is beyond the end of the respective zone, fail the creation of the block group on this zone with an I/O error. While this could also happen in real world scenarios with misbehaving drives, this change addresses a problem uncovered by fstests' test case generic/475. CC: stable@vger.kernel.org # 5.12+ Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
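The two rejection conditions can be expressed as a single validation helper. This is a user-space sketch: the function name and parameters are hypothetical, and treating a write pointer exactly at the zone end as valid (a full zone) is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Validate a write pointer reported for a sequential zone.
 * wp_reliable is false when the zone report could not be trusted.
 * Returns 0 on success or -EIO when block group creation should fail.
 */
static int check_write_pointer(uint64_t zone_start, uint64_t zone_len,
			       uint64_t wp, bool wp_reliable)
{
	assert(zone_len > 0);
	if (!wp_reliable)
		return -EIO;
	if (wp > zone_start + zone_len)
		return -EIO; /* pointer lies beyond the end of the zone */
	return 0;
}
```

Failing early with an I/O error keeps a bogus pointer from being turned into an allocation offset that later code would trust.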
2021-06-21 | btrfs: zoned: print message when zone sanity check type fails | Naohiro Aota | 1 | -0/+4
This extends patch 784daf2b9628 ("btrfs: zoned: sanity check zone type"), the message was supposed to be there but was lost during merge. We want to make the error noticeable so add it. Fixes: 784daf2b9628 ("btrfs: zoned: sanity check zone type") CC: stable@vger.kernel.org # 5.12+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-04 | btrfs: zoned: fix zone number to sector/physical calculation | Naohiro Aota | 1 | -5/+18
In btrfs_get_dev_zone_info(), we have "u32 sb_zone" and calculate "sector_t sector" by shifting it. But, this "sector" is calculated in 32bit, leading it to be 0 for the 2nd superblock copy. Since zone number is u32, shifting it to sector (sector_t) or physical address (u64) can easily trigger a missing cast bug like this. This commit introduces helpers to convert zone number to sector/LBA, so we won't fall into the same pitfall again. Reported-by: Dmitry Fomichev <Dmitry.Fomichev@wdc.com> Fixes: 12659251ca5d ("btrfs: implement log-structured superblock for ZONED mode") CC: stable@vger.kernel.org # 5.11+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
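The missing-cast pitfall is easy to reproduce in user-space C. In the sketch below the function names, the zone number 16384, and the shift of 19 are illustrative assumptions, not values or helpers taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy pattern: the shift is evaluated in 32 bits and only then widened. */
static uint64_t zone_to_sector_truncating(uint32_t zone, unsigned int shift)
{
	assert(shift < 32);
	uint32_t narrow = zone << shift; /* high bits are lost here */

	return narrow;
}

/* Fixed pattern: widen first, then shift, so no bits are dropped. */
static uint64_t zone_to_sector(uint32_t zone, unsigned int shift)
{
	assert(shift < 64);
	return (uint64_t)zone << shift;
}
```

With zone number 16384 (2^14) and a shift of 19, the true result is 2^33, which does not fit in 32 bits: the truncating variant yields 0 while the widened variant yields 8589934592. Wrapping the conversion in a helper means the cast cannot be forgotten at individual call sites.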
2021-05-20 | btrfs: zoned: pass start block to btrfs_use_zone_append | Johannes Thumshirn | 1 | -2/+2
btrfs_use_zone_append only needs the passed in extent_map's block_start member, so there's no need to pass in the full extent map. This also enables the use of btrfs_use_zone_append in places where we only have a start byte but no extent_map. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-04btrfs: zoned: sanity check zone typeNaohiro Aota1-0/+5
The fstests test case generic/475 creates a dm-linear device that gets changed to a dm-error device. This leads to errors in loading the block group's zone information when running on a zoned file system, ultimately resulting in a list corruption. When running on a kernel with list debugging enabled this leads to the following crash. BTRFS: error (device dm-2) in cleanup_transaction:1953: errno=-5 IO failure kernel BUG at lib/list_debug.c:54! invalid opcode: 0000 [#1] SMP PTI CPU: 1 PID: 2433 Comm: umount Tainted: G W 5.12.0+ #1018 RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47 RSP: 0018:ffffc90001473df0 EFLAGS: 00010296 RAX: 0000000000000054 RBX: ffff8881038fd000 RCX: ffffc90001473c90 RDX: 0000000100001a31 RSI: 0000000000000003 RDI: 0000000000000003 RBP: ffff888308871108 R08: 0000000000000003 R09: 0000000000000001 R10: 3961373532383838 R11: 6666666620736177 R12: ffff888308871000 R13: ffff8881038fd088 R14: ffff8881038fdc78 R15: dead000000000100 FS: 00007f353c9b1540(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f353cc2c710 CR3: 000000018e13c000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_free_block_groups+0xc9/0x310 [btrfs] close_ctree+0x2ee/0x31a [btrfs] ? call_rcu+0x8f/0x270 ? mutex_lock+0x1c/0x40 generic_shutdown_super+0x67/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x90 cleanup_mnt+0x13e/0x1b0 task_work_run+0x63/0xb0 exit_to_user_mode_loop+0xd9/0xe0 exit_to_user_mode_prepare+0x3e/0x60 syscall_exit_to_user_mode+0x1d/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xae As dm-error has no support for zones, btrfs will run its zone emulation mode on this device. 
The zone emulation mode emulates conventional zones, so bail out if the zone bitmap that gets populated on mount sees the zone as sequential while we're thinking it's a conventional zone when creating a block group. Note: this scenario is unlikely in a real world application and can only happen by this (ab)use of device-mapper targets. CC: stable@vger.kernel.org # 5.12+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: zoned: fail mount if the device does not support zone appendJohannes Thumshirn1-0/+7
For zoned btrfs, zone append is mandatory to write to a sequential write only zone, otherwise parallel writes to the same zone could result in unaligned write errors. If a zoned block device does not support zone append (e.g. a dm-crypt zoned device using a non-NULL IV cypher), fail to mount. CC: stable@vger.kernel.org # 5.12 Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-11Merge tag 'for-5.12-rc6-tag' of ↵Linus Torvalds1-11/+42
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more patch that we'd like to get to 5.12 before release. It's changing where and how the superblock is stored in the zoned mode. It is an on-disk format change but so far there are no implications for users as the proper mkfs support hasn't been merged and is waiting for the kernel side to settle. Until now, the superblocks were derived from the zone index, but zone size can differ per device. This is changed to be based on fixed offset values, to make it independent of the device zone size. The work on that got a bit delayed, we discussed the exact locations to support potential device sizes and usecases. (Partially delayed also due to my vacation.) Having that in the same release where the zoned mode is declared usable is highly desired, there are userspace projects that need to be updated to recognize the feature. Pushing that to the next release would make things harder to test" * tag 'for-5.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: move superblock logging zone location
2021-04-10btrfs: zoned: move superblock logging zone locationNaohiro Aota1-11/+42
Moves the location of the superblock logging zones. The new locations of the logging zones are now determined based on fixed block addresses instead of on fixed zone numbers. The old placement method based on fixed zone numbers causes problems when one needs to inspect a file system image without access to the drive zone information. In such case, the super block locations cannot be reliably determined as the zone size is unknown. By locating the superblock logging zones using fixed addresses, we can scan a dumped file system image without the zone information since a super block copy will always be present at or after the fixed known locations. Introduce the following three pairs of zones containing fixed offset locations, regardless of the device zone size. - primary superblock: offset 0B (and the following zone) - first copy: offset 512G (and the following zone) - Second copy: offset 4T (4096G, and the following zone) If a logging zone is outside of the disk capacity, we do not record the superblock copy. The first copy position is much larger than for a non-zoned filesystem, which is at 64M. This is to avoid overlapping with the log zones for the primary superblock. This higher location is arbitrary but allows supporting devices with very large zone sizes, plus some space around in between. Such large zone size is unrealistic and very unlikely to ever be seen in real devices. Currently, SMR disks have a zone size of 256MB, and we are expecting ZNS drives to be in the 1-4GB range, so this limit gives us room to breathe. For now, we only allow zone sizes up to 8GB. The maximum zone size that would still fit in the space is 256G. The fixed location addresses are somewhat arbitrary, with the intent of maintaining superblock reliability for smaller and larger devices, with the preference for the latter. For this reason, there are two superblocks under the first 1T. This should cover use cases for physical devices and for emulated/device-mapper devices. 
The superblock logging zones are reserved for superblock logging and never used for data or metadata blocks. Note that we only reserve the two zones per primary/copy actually used for superblock logging. We do not reserve the ranges of zones possibly containing superblocks with the largest supported zone size (0-16GB, 512G-528GB, 4096G-4112G). The zones containing the fixed location offsets used to store superblocks on a non-zoned volume are also reserved to avoid confusion. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
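The fixed-offset placement described in this commit can be sketched as follows. It is a userspace approximation under stated assumptions: the macro and function names are hypothetical, the zone size is taken to be a power of two, and "outside of the disk capacity" is interpreted as the pair of logging zones not fitting within the device's zone count.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fixed byte offsets from the commit message: 0B, 512G and 4T. */
#define SB_LOG_PRIMARY_OFFSET	0ULL
#define SB_LOG_FIRST_OFFSET	(512ULL << 30)
#define SB_LOG_SECOND_OFFSET	(4096ULL << 30)
#define SB_LOG_NR_ZONES		2	/* the zone at the offset and the next one */
#define SB_NR_MIRRORS		3

/*
 * Compute the first logging zone for a superblock mirror.
 * zone_size_shift is log2(zone size in bytes), nr_zones the device's zone
 * count. Returns false if the mirror index is invalid or if the logging
 * zones would fall outside the device, in which case no copy is recorded.
 */
static bool sb_log_first_zone(int mirror, unsigned int zone_size_shift,
			      uint64_t nr_zones, uint64_t *zone_ret)
{
	static const uint64_t offsets[SB_NR_MIRRORS] = {
		SB_LOG_PRIMARY_OFFSET,
		SB_LOG_FIRST_OFFSET,
		SB_LOG_SECOND_OFFSET,
	};
	uint64_t zone;

	if (mirror < 0 || mirror >= SB_NR_MIRRORS)
		return false;

	/* The zone containing the fixed offset, independent of zone numbering. */
	zone = offsets[mirror] >> zone_size_shift;
	if (zone + SB_LOG_NR_ZONES > nr_zones)
		return false;

	*zone_ret = zone;
	return true;
}
```

With 256 MiB zones (shift 28), the first copy lands in zone 2048 and the second in zone 16384. A 1 TiB device with 4096 such zones therefore carries the primary and the first copy but not the second.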
2021-03-05Merge tag 'for-5.12-rc1-tag' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "More regression fixes and stabilization. Regressions: - zoned mode - count zone sizes in wider int types - fix space accounting for read-only block groups - subpage: fix page tail zeroing Fixes: - fix spurious warning when remounting with free space tree - fix warning when creating a directory with smack enabled - ioctl checks for qgroup inheritance when creating a snapshot - qgroup - fix missing unlock on error path in zero range - fix amount of released reservation on error - fix flushing from unsafe context with open transaction, potentially deadlocking - minor build warning fixes" * tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: do not account freed region of read-only block group as zone_unusable btrfs: zoned: use sector_t for zone sectors btrfs: subpage: fix the false data csum mismatch error btrfs: fix warning when creating a directory with smack enabled btrfs: don't flush from btrfs_delayed_inode_reserve_metadata btrfs: export and rename qgroup_reserve_meta btrfs: free correct amount of space in btrfs_delayed_inode_reserve_metadata btrfs: fix spurious free_space_tree remount warning btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl btrfs: unlock extents in btrfs_zero_range in case of quota reservation errors btrfs: ref-verify: use 'inline void' keyword ordering
2021-03-04btrfs: zoned: use sector_t for zone sectorsNaohiro Aota1-2/+2
We need to use sector_t for zone_sectors, or it would set the zone size to zero when the size >= 4GB (= 2^23 sectors) by shifting the zone_sectors value by SECTOR_SHIFT. We're assuming zone sizes up to 8GiB. Fixes: 5b316468983d ("btrfs: get zone information of zoned block devices") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-21Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-blockLinus Torvalds1-2/+2
Pull core block updates from Jens Axboe: "Another nice round of removing more code than what is added, mostly due to Christoph's relentless pursuit of tech debt removal/cleanups. This pull request contains: - Two series of BFQ improvements (Paolo, Jan, Jia) - Block iov_iter improvements (Pavel) - bsg error path fix (Pan) - blk-mq scheduler improvements (Jan) - -EBUSY discard fix (Jan) - bvec allocation improvements (Ming, Christoph) - bio allocation and init improvements (Christoph) - Store bdev pointer in bio instead of gendisk + partno (Christoph) - Block trace point cleanups (Christoph) - hard read-only vs read-only split (Christoph) - Block based swap cleanups (Christoph) - Zoned write granularity support (Damien) - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)" * tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits) mm: simplify swapdev_block sd_zbc: clear zone resources for non-zoned case block: introduce blk_queue_clear_zone_settings() zonefs: use zone write granularity as block size block: introduce zone_write_granularity limit block: use blk_queue_set_zoned in add_partition() nullb: use blk_queue_set_zoned() to setup zoned devices nvme: cleanup zone information initialization block: document zone_append_max_bytes attribute block: use bi_max_vecs to find the bvec pool md/raid10: remove dead code in reshape_request block: mark the bio as cloned in bio_iov_bvec_set block: set BIO_NO_PAGE_REF in bio_iov_bvec_set block: remove a layer of indentation in bio_iov_iter_get_pages block: turn the nr_iovecs argument to bio_alloc* into an unsigned short block: remove the 1 and 4 vec bvec_slabs entries block: streamline bvec_alloc block: factor out a bvec_alloc_gfp helper block: move struct biovec_slab to bio.c block: reuse BIO_INLINE_VECS for integrity bvecs ...
2021-02-09btrfs: zoned: support dev-replace in zoned filesystemsNaohiro Aota1-0/+74
This is 4/4 patch to implement device-replace on zoned filesystems. Even after the copying is done, the write pointers of the source device and the destination device may not be synchronized. For example, when the last allocated extent is freed before device-replace process, the extent is not copied, leaving a hole there. Synchronize the write pointers by writing zeroes to the destination device. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: implement copying for zoned device-replaceNaohiro Aota1-0/+9
This is 3/4 patch to implement device-replace on zoned filesystems. This commit implements copying. To do this, it tracks the write pointer during the device replace process. As device-replace's copy process is smart enough to only copy used extents on the source device, we have to fill the gap to honor the sequential write requirement in the target device. The device-replace process on zoned filesystems must copy or clone all the extents in the source device exactly once. So, we need to ensure allocations started just before the dev-replace process to have their corresponding extent information in the B-trees. finish_extent_writes_for_zoned() implements that functionality, which basically is the removed code in the commit 042528f8d840 ("Btrfs: fix block group remaining RO forever after error during device replace"). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
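The gap filling during the copy (3/4) and the final write pointer synchronization (4/4) described in the two dev-replace commits above can be modelled with a small userspace sketch. All names here are hypothetical, extents are assumed to be sorted by physical address within one zone, and the `-EINVAL` return for a position behind the write pointer is an assumption, not necessarily what the kernel returns.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one used extent on the source device. */
struct extent {
	uint64_t physical;	/* start address in bytes */
	uint64_t len;		/* length in bytes */
};

/*
 * Bytes of zeroes needed at the target's write pointer so that the next
 * sequential write lands exactly at `physical`.
 */
static int zero_fill_len(uint64_t target_wp, uint64_t physical, uint64_t *len)
{
	if (physical < target_wp)
		return -EINVAL;	/* a sequential zone cannot be written behind wp */
	*len = physical - target_wp;
	return 0;
}

/*
 * Replay the copy of one zone: zero-fill each hole before an extent, then
 * pad up to the source's write pointer so both devices end up in sync.
 * Reports the total number of zeroed bytes and the final target wp.
 */
static int replay_zone_copy(const struct extent *ext, size_t nr,
			    uint64_t zone_start, uint64_t source_wp,
			    uint64_t *zeroed, uint64_t *target_wp)
{
	uint64_t wp = zone_start, total = 0, gap;
	int ret;

	for (size_t i = 0; i < nr; i++) {
		ret = zero_fill_len(wp, ext[i].physical, &gap);
		if (ret)
			return ret;
		total += gap;				/* fill the hole */
		wp = ext[i].physical + ext[i].len;	/* copy the extent */
	}

	/* Final step: account for freed extents at the tail of the zone. */
	ret = zero_fill_len(wp, source_wp, &gap);
	if (ret)
		return ret;

	*zeroed = total + gap;
	*target_wp = source_wp;
	return 0;
}
```

For two 4 KiB extents at offsets 0 and 8192 and a source write pointer at 16384, the replay zero-fills 4 KiB between the extents and another 4 KiB at the tail.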