summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)AuthorFilesLines
2024-05-03Use bdev_is_paritition() instead of open-coding itAl Viro2-3/+4
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-03make set_blocksize() fail unless block device is opened exclusiveAl Viro1-0/+3
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-03set_blocksize(): switch to passing struct file *Al Viro2-13/+19
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-01block: Cleanup blk_revalidate_zone_cb()Damien Le Moal1-52/+77
Define the code for checking conventional and sequential write required zones suing the functions blk_revalidate_conv_zone() and blk_revalidate_seq_zone() respectively. This simplifies the zone type switch-case in blk_revalidate_zone_cb(). No functional changes. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240501110907.96950-15-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Simplify zone write plug BIO abortDamien Le Moal1-8/+7
When BIOs plugged in a zone write plug are aborted, blk_zone_wplug_bio_io_error() clears the BIO BIO_ZONE_WRITE_PLUGGING flag so that bio_io_error(bio) does not end up calling blk_zone_write_plug_bio_endio() and we thus need to manually drop the reference on the zone write plug held by the aborted BIO. Move the call to disk_put_zone_wplug() that is alwasy following the call to blk_zone_wplug_bio_io_error() inside that function to simplify the code. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-14-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Simplify blk_zone_write_plug_bio_endio()Damien Le Moal1-2/+1
We already have the disk variable obtained from the bio when calling disk_get_zone_wplug(). So use that variable instead of dereferencing the bio bdev again for the disk argument of disk_get_zone_wplug(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-13-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Improve zone write request completion handlingDamien Le Moal3-13/+12
blk_zone_complete_request() must be called to handle the completion of a zone write request handled with zone write plugging. This function is called from blk_complete_request(), blk_update_request() and also in blk_mq_submit_bio() error path. Improve this by moving this function call into blk_mq_finish_request() as all requests are processed with this function when they complete as well as when they are freed without being executed. This also improves blk_update_request() used by scsi devices as these may repeatedly call this function to handle partial completions. To be consistent with this change, blk_zone_complete_request() is renamed to blk_zone_finish_request() and blk_zone_write_plug_complete_request() is renamed to blk_zone_write_plug_finish_request(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-12-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Improve blk_zone_write_plug_bio_merged()Damien Le Moal1-2/+7
Improve blk_zone_write_plug_bio_merged() to check that we succefully get a reference on the zone write plug of the merged BIO, as expected since for a merge we already have at least one request and one BIO referencing the zone write plug. Comments in this function are also improved to better explain the references to the BIO zone write plug. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-11-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Fix handling of non-empty flush write requests to zonesDamien Le Moal3-9/+13
Zone write plugging ignores empty (no data) flush operations but handles flush BIOs that have data to ensure that the flush machinery generated write is processed in order. However, the call to blk_zone_write_plug_attempt_merge() which sets a request RQF_ZONE_WRITE_PLUGGING flag is called after blk_insert_flush(), thus missing indicating that a non empty flush request completion needs handling by zone write plugging. Fix this by moving the call to blk_zone_write_plug_attempt_merge() before blk_insert_flush(). And while at it, rename that function as blk_zone_write_plug_init_request() to be clear that it is not just about merging plugged BIOs in the request. While at it, also add a WARN_ONCE() check that the zone write plug for the request is not NULL. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-10-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Fix flush request sector restoreDamien Le Moal1-1/+2
Make sure that a request bio is not NULL before trying to restore the request start sector. Reported-by: Yi Zhang <yi.zhang@redhat.com> Fixes: 6f8fd758de63 ("block: Restore sector of flush requests") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-9-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Do not remove zone write plugs still in useDamien Le Moal1-8/+31
Large write BIOs that span a zone boundary are split in blk_mq_submit_bio() before being passed to blk_zone_plug_bio() for zone write plugging. Such split BIO will be chained with one fragment targeting one zone and the remainder of the BIO targeting the next zone. The two BIOs can be executed in parallel, without a predetermine order relative to eachother and their completion may be reversed: the remainder first completing and the first fragment then completing. In such case, bio_endio() will not immediately execute blk_zone_write_plug_bio_endio() for the parent BIO (the remainder of the split BIO) as the BIOs are chained. blk_zone_write_plug_bio_endio() for the parent BIO will be executed only once the first fragment completes. In the case of a device with small zones and very large BIOs, uch completion pattern can lead to disk_should_remove_zone_wplug() to return true for the zone of the parent BIO when the parent BIO request completes and blk_zone_write_plug_complete_request() is executed. This triggers the removal of the zone write plug from the hash table using disk_remove_zone_wplug(). With the zone write plug of the parent BIO missing, the call to disk_get_zone_wplug() in blk_zone_write_plug_bio_endio() returns NULL and triggers a warning. This patterns can be recreated fairly easily using a scsi_debug device with small zone and btrfs. E.g. modprobe scsi_debug delay=0 dev_size_mb=1024 sector_size=4096 \ zbc=host-managed zone_cap_mb=3 zone_nr_conv=0 zone_size_mb=4 mkfs.btrfs -f -O zoned /dev/sda mount -t btrfs /dev/sda /mnt fio --name=wrtest --rw=randwrite --direct=1 --ioengine=libaio \ --bs=4k --iodepth=16 --size=1M --directory=/mnt --time_based \ --runtime=10 umount /dev/sda Will result in the warning: [ 29.035538] WARNING: CPU: 3 PID: 37 at block/blk-zoned.c:1207 blk_zone_write_plug_bio_endio+0xee/0x1e0 ... [ 29.058682] Call Trace: [ 29.059095] <TASK> [ 29.059473] ? __warn+0x80/0x120 [ 29.059983] ? blk_zone_write_plug_bio_endio+0xee/0x1e0 [ 29.060728] ? report_bug+0x160/0x190 [ 29.061283] ? handle_bug+0x36/0x70 [ 29.061830] ? exc_invalid_op+0x17/0x60 [ 29.062399] ? asm_exc_invalid_op+0x1a/0x20 [ 29.063025] ? blk_zone_write_plug_bio_endio+0xee/0x1e0 [ 29.063760] bio_endio+0xb7/0x150 [ 29.064280] btrfs_clone_write_end_io+0x2b/0x60 [btrfs] [ 29.065049] blk_update_request+0x17c/0x500 [ 29.065666] scsi_end_request+0x27/0x1a0 [scsi_mod] [ 29.066356] scsi_io_completion+0x5b/0x690 [scsi_mod] [ 29.067077] blk_complete_reqs+0x3a/0x50 [ 29.067692] __do_softirq+0xcf/0x2b3 [ 29.068248] ? sort_range+0x20/0x20 [ 29.068791] run_ksoftirqd+0x1c/0x30 [ 29.069339] smpboot_thread_fn+0xcc/0x1b0 [ 29.069936] kthread+0xcf/0x100 [ 29.070438] ? kthread_complete_and_exit+0x20/0x20 [ 29.071314] ret_from_fork+0x31/0x50 [ 29.071873] ? kthread_complete_and_exit+0x20/0x20 [ 29.072563] ret_from_fork_asm+0x11/0x20 [ 29.073146] </TASK> either when fio executes or when unmount is executed. Fix this by modifying disk_should_remove_zone_wplug() to check that the reference count to a zone write plug is not larger than 2, that is, that the only references left on the zone are the caller held reference (blk_zone_write_plug_complete_request()) and the initial extra reference for the zone write plug taken when it was initialized (and that is dropped when the zone write plug is removed from the hash table). To be consistent with this change, make sure to drop the request or BIO held reference to the zone write plug before calling disk_zone_wplug_unplug_bio(). All references are also dropped using disk_put_zone_wplug() instead of atomic_dec() to ensure that the zone write plug is freed if it needs to be. Comments are also improved to clarify zone write plugs reference handling. Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240501110907.96950-8-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Unhash a zone write plug only if neededDamien Le Moal1-23/+32
Fix disk_remove_zone_wplug() to ensure that a zone write plug already removed from a disk hash table of zone write plugs is not removed again. Do this by checking the BLK_ZONE_WPLUG_UNHASHED flag of the plug and calling hlist_del_init_rcu() only if the flag is not set. Furthermore, since BIO completions can happen at any time, that is, decrementing of the zone write plug reference count can happen at any time, make sure to use disk_put_zone_wplug() instead of atomic_dec() to ensure that the zone write plug is freed when its last reference is dropped. In order to do this, disk_remove_zone_wplug() is moved after the definition of disk_put_zone_wplug(). disk_should_remove_zone_wplug() is moved as well to keep it together with disk_remove_zone_wplug(). To be consistent with this change, add a check in disk_put_zone_wplug() to ensure that a zone write plug being freed was already removed from the disk hash table. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-7-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Hold a reference on zone write plugs to schedule submissionDamien Le Moal1-5/+21
Since a zone write plug BIO work is a field of struct blk_zone_wplug, we must ensure that a zone write plug is never freed when its BIO submission work is queued or running. Do this by holding a reference on the zone write plug when the submission work is scheduled for execution with queue_work() and releasing the reference at the end of the execution of the work function blk_zone_wplug_bio_work(). The helper function disk_zone_wplug_schedule_bio_work() is introduced to get a reference on a zone write plug and queue its work. This helper is used in disk_zone_wplug_unplug_bio() and disk_zone_wplug_handle_error(). Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-6-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Fix reference counting for zone write plugs in error stateDamien Le Moal1-26/+49
When zone is reset or finished, disk_zone_wplug_set_wp_offset() is called to update the zone write plug write pointer offset and to clear the zone error state (BLK_ZONE_WPLUG_ERROR flag) if it is set. However, this processing is missing dropping the reference to the zone write plug that was taken in disk_zone_wplug_set_error() when the error flag was first set. Furthermore, the error state handling must release the zone write plug lock to first execute a report zones command. When the report zone races with a reset or finish operation that clears the error, we can end up decrementing the zone write plug reference count twice: once in disk_zone_wplug_set_wp_offset() for the reset/finish operation and one more time in disk_zone_wplugs_work() once disk_zone_wplug_handle_error() completes. Fix this by introducing disk_zone_wplug_clear_error() as the symmetric function of disk_zone_wplug_set_error(). disk_zone_wplug_clear_error() decrements the zone write plug reference count obtained in disk_zone_wplug_set_error() only if the error handling has not started yet, that is, only if disk_zone_wplugs_work() has not yet taken the zone write plug off the error list. This ensure that either disk_zone_wplug_clear_error() or disk_zone_wplugs_work() drop the zone write plug reference count. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Fix zone write plug initialization from blk_revalidate_zone_cb()Damien Le Moal1-3/+4
When revalidating the zones of a zoned block device, blk_revalidate_zone_cb() must allocate a zone write plug for any sequential write required zone that is not empty nor full. However, the current code tests the latter case by comparing the zone write pointer offset to the zone size instead of the zone capacity. Furthermore, disk_get_and_lock_zone_wplug() is called with a sector argument equal to the zone start instead of the current zone write pointer position. This commit fixes both issues by calling disk_get_and_lock_zone_wplug() for a zone that is not empty and with a write pointer offset lower than the zone capacity and use the zone capacity sector as the sector argument for disk_get_and_lock_zone_wplug(). Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Exclude conventional zones when faking max open limitDamien Le Moal1-10/+28
For a device that has no limits for the maximum number of open and active zones, we default to using the number of zones, limited to BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE (128), for the maximum number of open zones indicated to the user. However, for a device that has conventional zones and less zones than BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE, we should not account conventional zones and set the limit to the number of sequential write required zones. Furthermore, for cases where the limit is equal to the number of sequential write required zones, we can advertize a limit of 0 to indicate "no limits". Fix this by moving the zone write plug mempool resizing from disk_revalidate_zone_resources() to disk_update_zone_resources() where we can safely compute the number of conventional zones and update the limits. Fixes: 843283e96e5a ("block: Fake max open zones limit when there is no limit") Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-26Merge tag 'vfs-6.9-rc6.fixes' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "This contains a few small fixes for this merge window and the attempt to handle the ntfs removal regression that was reported a little while ago: - After the removal of the legacy ntfs driver we received reports about regressions for some people that do mount "ntfs" explicitly and expect the driver to be available. Since ntfs3 is a drop-in for legacy ntfs we alias legacy ntfs to ntfs3 just like ext3 is aliased to ext4. We also enforce legacy ntfs is always mounted read-only and give it custom file operations to ensure that ioctl()'s can't be abused to perform write operations. - Fix an unbalanced module_get() in bdev_open(). - Two smaller fixes for the netfs work done earlier in this cycle. - Fix the errno returned from the new FS_IOC_GETUUID and FS_IOC_GETFSSYSFSPATH ioctls. Both commands just pull information out of the superblock so there's no need to call into the actual ioctl handlers. So instead of returning ENOIOCTLCMD to indicate to fallback we just return ENOTTY directly avoiding that indirection" * tag 'vfs-6.9-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: netfs: Fix the pre-flush when appending to a file in writethrough mode netfs: Fix writethrough-mode error handling ntfs3: add legacy ntfs file operations ntfs3: enforce read-only when used as legacy ntfs driver ntfs3: serve as alias for the legacy ntfs driver block: fix module reference leakage from bdev_open_by_dev error path fs: Return ENOTTY directly if FS_IOC_GETUUID or FS_IOC_GETFSSYSFSPATH fail
2024-04-26block/partitions/ldm: convert strncpy() to strscpy()Arnd Bergmann1-4/+2
The strncpy() here can cause a non-terminated string, which older gcc versions such as gcc-9 warn about: In function 'ldm_parse_tocblock', inlined from 'ldm_validate_tocblocks' at block/partitions/ldm.c:386:7, inlined from 'ldm_partition' at block/partitions/ldm.c:1457:7: block/partitions/ldm.c:134:2: error: 'strncpy' specified bound 16 equals destination size [-Werror=stringop-truncation] 134 | strncpy (toc->bitmap1_name, data + 0x24, sizeof (toc->bitmap1_name)); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ block/partitions/ldm.c:145:2: error: 'strncpy' specified bound 16 equals destination size [-Werror=stringop-truncation] 145 | strncpy (toc->bitmap2_name, data + 0x46, sizeof (toc->bitmap2_name)); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ New versions notice that the code is correct after all because of the following termination, but replacing the strncpy() with strscpy_pad() or strcpy() avoids the warning and simplifies the code at the same time. Use the padding version here to keep the existing behavior, in case the code relies on not including uninitialized data. Link: https://lkml.kernel.org/r/20240409140059.3806717-4-arnd@kernel.org Reviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Alexey Starikovskiy <astarikovskiy@suse.de> Cc: Bob Moore <robert.moore@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Len Brown <lenb@kernel.org> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Richard Russon (FlatCap)" <ldm@flatcap.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25block: check if zone_wplugs_hash exists in queue_zone_wplugs_showJohannes Thumshirn1-0/+3
Changhui reported a kernel crash when running this simple shell reproducer: # cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \; The above results in a NULL pointer dereference if a device does not have a zone_wplugs_hash allocated. To fix this, return early if we don't have a zone_wplugs_hash. Reported-by: Changhui Zhong <czhong@redhat.com> Fixes: a98b05b02f0f ("block: Replace zone_wlock debugfs entry with zone_wplugs entry") Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/e5fec079dfca448cc21c425cfa5d7b291f5faa67.1714046443.git.johannes.thumshirn@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-23block: use a per disk workqueue for zone write pluggingDamien Le Moal1-8/+24
A zone write plug BIO work function blk_zone_wplug_bio_work() calls submit_bio_noacct_nocheck() to execute the next unplugged BIO. This function may block. So executing zone plugs BIO works using the block layer global kblockd workqueue can potentially lead to preformance or latency issues as the number of concurrent work for a workqueue is limited to WQ_DFL_ACTIVE (256). 1) For a system with a large number of zoned disks, issuing write requests to otherwise unused zones may be delayed wiating for a work thread to become available. 2) Requeue operations which use kblockd but are independent of zone write plugging may alsoi end up being delayed. To avoid these potential performance issues, create a workqueue per zoned device to execute zone plugs BIO work. The workqueue max active parameter is set to the maximum number of zone write plugs allocated with the zone write plug mempool. This limit is equal to the maximum number of open zones of the disk and defaults to 128 for disks that do not have a limit on the number of open zones. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240420075811.1276893-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-20Merge tag 'block-6.9-20240420' of git://git.kernel.dk/linuxLinus Torvalds3-13/+26
Pull block fixes from Jens Axboe: "Just two minor fixes that should go into the 6.9 kernel release, one fixing a regression with partition scanning errors, and one fixing a WARN_ON() that can get triggered if we race with a timer" * tag 'block-6.9-20240420' of git://git.kernel.dk/linux: blk-iocost: do not WARN if iocg was already offlined block: propagate partition scanning errors to the BLKRRPART ioctl
2024-04-19block/mq-deadline: Remove some unused functionsJiapeng Chong1-28/+0
These functions are defined in the mq-deadline.c file, but not called elsewhere, so delete these unused functions. block/mq-deadline.c:134:1: warning: unused function 'deadline_earlier_request'. block/mq-deadline.c:148:1: warning: unused function 'deadline_latter_request'. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8803 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Link: https://lore.kernel.org/r/20240419025610.34298-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-19blk-iocost: do not WARN if iocg was already offlinedLi Nan1-2/+5
In iocg_pay_debt(), warn is triggered if 'active_list' is empty, which is intended to confirm iocg is active when it has debt. However, warn can be triggered during a blkcg or disk removal, if iocg_waitq_timer_fn() is run at that time: WARNING: CPU: 0 PID: 2344971 at block/blk-iocost.c:1402 iocg_pay_debt+0x14c/0x190 Call trace: iocg_pay_debt+0x14c/0x190 iocg_kick_waitq+0x438/0x4c0 iocg_waitq_timer_fn+0xd8/0x130 __run_hrtimer+0x144/0x45c __hrtimer_run_queues+0x16c/0x244 hrtimer_interrupt+0x2cc/0x7b0 The warn in this situation is meaningless. Since this iocg is being removed, the state of the 'active_list' is irrelevant, and 'waitq_timer' is canceled after removing 'active_list' in ioc_pd_free(), which ensures iocg is freed after iocg_waitq_timer_fn() returns. Therefore, add the check if iocg was already offlined to avoid warn when removing a blkcg or disk. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240419093257.3004211-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-18block: propagate partition scanning errors to the BLKRRPART ioctlChristoph Hellwig2-11/+21
Commit 4601b4b130de ("block: reopen the device in blkdev_reread_part") lost the propagation of I/O errors from the low-level read of the partition table to the user space caller of the BLKRRPART. Apparently some user space relies on, so restore the propagation. This isn't exactly pretty as other block device open calls explicitly do not are about these errors, so add a new BLK_OPEN_STRICT_SCAN to opt into the error propagation. Fixes: 4601b4b130de ("block: reopen the device in blkdev_reread_part") Reported-by: Saranya Muruganandam <saranyamohan@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20240417144743.2277601-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Do not special-case plugging of zone write operationsDamien Le Moal4-45/+2
With the block layer zone write plugging being automatically done for any write operation to a zone of a zoned block device, a regular request plugging handled through current->plug can only ever see at most a single write request per zone. In such case, any potential reordering of the plugged requests will be harmless. We can thus remove the special casing for write operations to zones and have these requests plugged as well. This allows removing the function blk_mq_plug and instead directly using current->plug where needed. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-29-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONEDDamien Le Moal1-1/+0
Now that zone block device write ordering control does not depend anymore on mq-deadline and zone write locking, there is no need to force select the mq-deadline scheduler when CONFIG_BLK_DEV_ZONED is enabled. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-28-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Remove zone write lockingDamien Le Moal2-64/+3
Zone write locking is now unused and replaced with zone write plugging. Remove all code that was implementing zone write locking, that is, the various helper functions controlling request zone write locking and the gendisk attached zone bitmaps. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-27-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Replace zone_wlock debugfs entry with zone_wplugs entryDamien Le Moal3-10/+27
In preparation to completely remove zone write locking, replace the "zone_wlock" mq-debugfs entry that was listing zones that are write-locked with the zone_wplugs entry which lists the zones that currently have a write plug allocated. The write plug information provided is: the zone number, the zone write plug flags, the zone write plug write pointer offset and the number of BIOs currently waiting for execution in the zone write plug BIO list. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-26-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Move zone related debugfs attribute to blk-zoned.cDamien Le Moal5-28/+21
block/blk-mq-debugfs-zone.c contains a single debugfs attribute function. Defining this outside of block/blk-zoned.c does not really help in any way, so move this zone related debugfs attribute to block/blk-zoned.c and delete block/blk-mq-debugfs-zone.c. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-25-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Do not check zone type in blk_check_zone_append()Damien Le Moal1-2/+1
Zone append operations are only allowed to target sequential write required zones. blk_check_zone_append() uses bio_zone_is_seq() to check this. However, this check is not necessary because: 1) For NVMe ZNS namespace devices, only sequential write required zones exist, making the zone type check useless. 2) For null_blk, the driver will fail the request anyway, thus notifying the user that a conventional zone was targeted. 3) For all other zoned devices, zone append is now emulated using zone write plugging, which checks that a zone append operation does not target a conventional zone. In preparation for the removal of zone write locking and its conventional zone bitmap (used by bio_zone_is_seq()), remove the bio_zone_is_seq() call from blk_check_zone_append(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-24-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Remove elevator required featuresDamien Le Moal3-58/+5
The only elevator feature ever implemented is ELEVATOR_F_ZBD_SEQ_WRITE for signaling that a scheduler implements zone write locking to tightly control the dispatching order of write operations to zoned block devices. With the removal of zone write locking support in mq-deadline and the reliance of all block device drivers on the block layer zone write plugging to control ordering of write operations to zones, the elevator feature ELEVATOR_F_ZBD_SEQ_WRITE is completely unused. Remove it, and also remove the now unused code for filtering the possible schedulers for a block device based on required features. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-23-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: mq-deadline: Remove support for zone write lockingDamien Le Moal1-170/+6
With the block layer generic plugging of write operations for zoned block devices, mq-deadline, or any other scheduler, can only ever see at most one write operation per zone at any time. There is thus no sequentiality requirements for these writes and thus no need to tightly control the dispatching of write requests using zone write locking. Remove all the code that implement this control in the mq-deadline scheduler and remove advertizing support for the ELEVATOR_F_ZBD_SEQ_WRITE elevator feature. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-22-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Simplify blk_revalidate_disk_zones() interfaceDamien Le Moal1-14/+7
The only user of blk_revalidate_disk_zones() second argument was the SCSI disk driver (sd). Now that this driver does not require this update_driver_data argument, remove it to simplify the interface of blk_revalidate_disk_zones(). Also update the function kdoc comment to be more accurate (i.e. there is no gendisk ->revalidate method). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-21-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Remove BLK_STS_ZONE_RESOURCEDamien Le Moal1-26/+0
The zone append emulation of the scsi disk driver was the only driver using BLK_STS_ZONE_RESOURCE. With this code removed, BLK_STS_ZONE_RESOURCE is now unused. Remove this macro definition and simplify blk_mq_dispatch_rq_list() where this status code was handled. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-20-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Allow BIO-based drivers to use blk_revalidate_disk_zones()Damien Le Moal1-6/+24
In preparation for allowing BIO based device drivers to use zone write plugging and its zone append emulation, allow these drivers to call blk_revalidate_disk_zones() so that all zone resources necessary to zone write plugging can be initialized. To do so, remove the check in blk_revalidate_disk_zones() restricting the use of this function to mq request-based drivers to allow also BIO-based drivers to use it. This is safe to do as long as the BIO-based block device queue is already setup and usable, as it should, and can be safely frozen. The helper function disk_need_zone_resources() is added to control the allocation and initialization of the zone write plug hash table and of the conventional zone bitmap only for mq devices and for BIO-based devices that require zone append emulation. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-12-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Implement zone append emulationDamien Le Moal3-15/+66
Given that zone write plugging manages all writes to zones of a zoned block device and tracks the write pointer position of all zones that are not full nor empty, emulating zone append operations using regular writes can be implemented generically, without relying on the underlying device driver to implement such emulation. This is needed for devices that do not natively support the zone append command (e.g. SMR hard-disks). A device may request zone append emulation by setting its max_zone_append_sectors queue limit to 0. For such device, the function blk_zone_wplug_prepare_bio() changes zone append BIOs into non-mergeable regular write BIOs. Modified zone append BIOs are flagged with the new BIO flag BIO_EMULATES_ZONE_APPEND. This flag is checked on completion of the BIO in blk_zone_write_plug_bio_endio() to restore the original REQ_OP_ZONE_APPEND operation code of the BIO. The block layer internal inline helper function bio_is_zone_append() is added to test if a BIO is either a native zone append operation (REQ_OP_ZONE_APPEND operation code) or if it is flagged with BIO_EMULATES_ZONE_APPEND. Given that both native and emulated zone append BIO completion handling should be similar, The functions blk_update_request() and blk_zone_complete_request_bio() are modified to use bio_is_zone_append() to execute blk_zone_update_request_bio() for both native and emulated zone append operations. This commit contains contributions from Christoph Hellwig <hch@lst.de>. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-11-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Allow zero value of max_zone_append_sectors queue limitDamien Le Moal4-14/+22
In preparation for adding a generic zone append emulation using zone write plugging, allow device drivers supporting zoned block device to set a the max_zone_append_sectors queue limit of a device to 0 to indicate the lack of native support for zone append operations and that the block layer should emulate these operations using regular write operations. blk_queue_max_zone_append_sectors() is modified to allow passing 0 as the max_zone_append_sectors argument. The function queue_max_zone_append_sectors() is also modified to ensure that the minimum of the max_hw_sectors and chunk_sectors limit is used whenever the max_zone_append_sectors limit is 0. This minimum is consistent with the value set for the max_zone_append_sectors limit by the function blk_validate_zoned_limits() when limits for a queue are validated. The helper functions queue_emulates_zone_append() and bdev_emulates_zone_append() are added to test if a queue (or block device) emulates zone append operations. In order for blk_revalidate_disk_zones() to accept zoned block devices relying on zone append emulation, the direct check to the max_zone_append_sectors queue limit of the disk is replaced by a check using the value returned by queue_max_zone_append_sectors(). Similarly, queue_zone_append_max_show() is modified to use the same accessor so that the sysfs attribute advertizes the non-zero limit that will be used, regardless if it is for native or emulated commands. For stacking drivers, a top device should not need to care if the underlying devices have native or emulated zone append operations. blk_stack_limits() is thus modified to set the top device max_zone_append_sectors limit using the new accessor queue_limits_max_zone_append_sectors(). queue_max_zone_append_sectors() is modified to use this function as well. Stacking drivers that require zone append emulation, e.g. dm-crypt, can still request this feature by calling blk_queue_max_zone_append_sectors() with a 0 limit. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-10-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Fake max open zones limit when there is no limitDamien Le Moal1-6/+35
For a zoned block device that has no limit on the number of open zones and no limit on the number of active zones, the zone write plug mempool is created with a size of 128 zone write plugs. For such case, set the device max_open_zones queue limit to this value to indicate to the user the potential performance penalty that may happen when writing simultaneously to more zones than the mempool size. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-9-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Introduce zone write pluggingDamien Le Moal6-10/+1181
Zone write plugging implements a per-zone "plug" for write operations to control the submission and execution order of write operations to sequential write required zones of a zoned block device. Per-zone plugging guarantees that at any time there is at most only one write request per zone being executed. This mechanism is intended to replace zone write locking which implements a similar per-zone write throttling at the scheduler level, but is implemented only by mq-deadline. Unlike zone write locking which operates on requests, zone write plugging operates on BIOs. A zone write plug is simply a BIO list that is atomically manipulated using a spinlock and a kblockd submission work. A write BIO to a zone is "plugged" to delay its execution if a write BIO for the same zone was already issued, that is, if a write request for the same zone is being executed. The next plugged BIO is unplugged and issued once the write request completes. This mechanism allows to: - Untangle zone write ordering from block IO schedulers. This allows removing the restriction on using mq-deadline for writing to zoned block devices. Any block IO scheduler, including "none" can be used. - Zone write plugging operates on BIOs instead of requests. Plugged BIOs waiting for execution thus do not hold scheduling tags and thus are not preventing other BIOs from executing (reads or writes to other zones). Depending on the workload, this can significantly improve the device use (higher queue depth operation) and performance. - Both blk-mq (request based) zoned devices and BIO-based zoned devices (e.g. device mapper) can use zone write plugging. It is mandatory for the former but optional for the latter. BIO-based drivers can use zone write plugging to implement write ordering guarantees, or the drivers can implement their own if needed. - The code is less invasive in the block layer and is mostly limited to blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and bio.c. Zone write plugging is implemented using struct blk_zone_wplug. This structure includes a spinlock, a BIO list and a work structure to handle the submission of plugged BIOs. Zone write plugs structures are managed using a per-disk hash table. Plugging of zone write BIOs is done using the function blk_zone_write_plug_bio() which returns false if a BIO execution does not need to be delayed and true otherwise. This function is called from blk_mq_submit_bio() after a BIO is split to avoid large BIOs spanning multiple zones which would cause mishandling of zone write plugs. This ichange enables by default zone write plugging for any mq request-based block device. BIO-based device drivers can also use zone write plugging by expliclty calling blk_zone_write_plug_bio() in their ->submit_bio method. For such devices, the driver must ensure that a BIO passed to blk_zone_write_plug_bio() is already split and not straddling zone boundaries. Only write and write zeroes BIOs are plugged. Zone write plugging does not introduce any significant overhead for other operations. A BIO that is being handled through zone write plugging is flagged using the new BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag. The completion of BIOs and requests flagged trigger respectively calls to the functions blk_zone_write_bio_endio() and blk_zone_write_complete_request(). The latter function is used to trigger submission of the next plugged BIO using the zone plug work. blk_zone_write_bio_endio() does the same for BIO-based devices. This ensures that at any time, at most one request (blk-mq devices) or one BIO (BIO-based devices) is being executed for any zone. The handling of zone write plugs using a per-zone plug spinlock maximizes parallelism and device usage by allowing multiple zones to be writen simultaneously without lock contention. Zone write plugging ignores flush BIOs without data. Hovever, any flush BIO that has data is always plugged so that the write part of the flush sequence is serialized with other regular writes. Given that any BIO handled through zone write plugging will be the only BIO in flight for the target zone when it is executed, the unplugging and submission of a BIO will have no chance of successfully merging with plugged requests or requests in the scheduler. To overcome this potential performance degradation, blk_mq_submit_bio() calls the function blk_zone_write_plug_attempt_merge() to try to merge other plugged BIOs with the one just unplugged and submitted. Successful merging is signaled using blk_zone_write_plug_bio_merged(), called from bio_attempt_back_merge(). Furthermore, to avoid recalculating the number of segments of plugged BIOs to attempt merging, the number of segments of a plugged BIO is saved using the new struct bio field __bi_nr_segments. To avoid growing the size of struct bio, this field is added as a union with the bio_cookie field. This is safe to do as polling is always disabled for plugged BIOs. When BIOs are plugged in a zone write plug, the device request queue usage counter is always incremented. This reference is kept and reused for blk-mq devices when the plugged BIO is unplugged and submitted again using submit_bio_noacct_nocheck(). For this case, the unplugged BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and blk_mq_submit_bio() proceeds directly to allocating a new request for the BIO, re-using the usage reference count taken when the BIO was plugged. This extra reference count is dropped in blk_zone_write_plug_attempt_merge() for any plugged BIO that is successfully merged. Given that BIO-based devices will not take this path, the extra reference is dropped after a plugged BIO is unplugged and submitted. Zone write plugs are dynamically allocated and managed using a hash table (an array of struct hlist_head) with RCU protection. A zone write plug is allocated when a write BIO is received for the zone and not freed until the zone is fully written, reset or finished. To detect when a zone write plug can be freed, the write state of each zone is tracked using a write pointer offset which corresponds to the offset of a zone write pointer relative to the zone start. Write operations always increment this write pointer offset. Zone reset operations set it to 0 and zone finish operations set it to the zone size. If a write error happens, the wp_offset value of a zone write plug may become incorrect and out of sync with the device managed write pointer. This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR. The function blk_zone_wplug_handle_error() is called from the new disk zone write plug work when this flag is set. This function executes a report zone to update the zone write pointer offset to the current value as indicated by the device. The disk zone write plug work is scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes with an error or when bio_zone_wplug_prepare_bio() detects an unaligned write. Once scheduled, the disk zone write plugs work keeps running until all zone errors are handled. To match the new data structures used for zoned disks, the function disk_free_zone_bitmaps() is renamed to the more generic disk_free_zone_resources(). The function disk_init_zone_resources() is also introduced to initialize zone write plugs resources when a gendisk is allocated. In order to guarantee that the user can simultaneously write up to a number of zones equal to a device max active zone limit or max open zone limit, zone write plugs are allocated using a mempool sized to the maximum of these 2 device limits. For a device that does not have active and open zone limits, 128 is used as the default mempool size. If a change to the device active and open zone limits is detected, the disk mempool is resized when blk_revalidate_disk_zones() is executed. This commit contains contributions from Christoph Hellwig <hch@lst.de>. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240408014128.205141-8-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Remember zone capacity when revalidating zonesDamien Le Moal1-0/+26
In preparation for adding zone write plugging, modify blk_revalidate_disk_zones() to get the capacity of zones of a zoned block device. This capacity value as a number of 512B sectors is stored in the gendisk zone_capacity field. Given that host-managed SMR disks (including zoned UFS drives) and all known NVMe ZNS devices have the same zone capacity for all zones blk_revalidate_disk_zones() returns an error if different capacities are detected for different zones. This also adds check to verify that the values reported by the device for zone capacities are correct, that is, that the zone capacity is never 0, does not exceed the zone size and is equal to the zone size for conventional zones. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-7-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Allow using bio_attempt_back_merge() internallyDamien Le Moal2-7/+9
Remove "static" from the definition of bio_attempt_back_merge() and declare this function in block/blk.h to allow using it internally from other block layer files. The definition of enum bio_merge_status is also moved to block/blk.h. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-6-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Introduce blk_zone_update_request_bio()Damien Le Moal2-7/+23
On completion of a zone append request, the request sector indicates the location of the written data. This value must be returned to the user through the BIO iter sector. This is done in 2 places: in blk_complete_request() and in blk_update_request(). Introduce the inline helper function blk_zone_update_request_bio() to avoid duplicating this BIO update for zone append requests, and to compile out this helper call when CONFIG_BLK_DEV_ZONED is not enabled. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Remove req_bio_endio()Damien Le Moal1-30/+28
Moving req_bio_endio() code into its only caller, blk_update_request(), allows reducing accesses to and tests of bio and request fields. Also, given that partial completions of zone append operations is not possible and that zone append operations cannot be merged, the update of the BIO sector using the request sector for these operations can be moved directly before the call to bio_endio(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Restore sector of flush requestsDamien Le Moal1-0/+1
On completion of a flush sequence, blk_flush_restore_request() restores the bio of a request to the original submitted BIO. However, the last use of the request in the flush sequence may have been for a POSTFLUSH which does not have a sector. So make sure to restore the request sector using the iter sector of the original BIO. This BIO has not changed yet since the completions of the flush sequence intermediate steps use requeueing of the request until all steps are completed. Restoring the request sector ensures that blk_mq_end_request() will see a valid sector as originally set when the flush BIO was submitted. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15block: Call blkdev_dio_unaligned() from blkdev_direct_IO()John Garry1-17/+12
blkdev_dio_unaligned() is called from __blkdev_direct_IO(), __blkdev_direct_IO_simple(), and __blkdev_direct_IO_async(), and all these are only called from blkdev_direct_IO(). Move the blkdev_dio_unaligned() call to the common callsite, blkdev_direct_IO(). Pass those functions the bdev pointer from blkdev_direct_IO(), as it is non-trivial to look up. Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240415122020.1541594-1-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-12Merge tag 'block-6.9-20240412' of git://git.kernel.dk/linuxLinus Torvalds5-16/+21
Pull block fixes from Jens Axboe: - MD pull request via Song: - UAF fix (Yu) - Avoid out-of-bounds shift in blk-iocost (Rik) - Fix for q->blkg_list corruption (Ming) - Relax virt boundary mask/size segment checking (Ming) * tag 'block-6.9-20240412' of git://git.kernel.dk/linux: block: fix that blk_time_get_ns() doesn't update time after schedule block: allow device to have both virt_boundary_mask and max segment size block: fix q->blkg_list corruption during disk rebind blk-iocost: avoid out of bounds shift raid1: fix use-after-free for original bio in raid1_write_request()
2024-04-12block: fix that blk_time_get_ns() doesn't update time after scheduleYu Kuai1-0/+1
While monitoring the throttle time of IO from iocost, it's found that such time is always zero after the io_schedule() from ioc_rqos_throttle, for example, with the following debug patch: + printk("%s-%d: %s enter %llu\n", current->comm, current->pid, __func__, blk_time_get_ns()); while (true) { set_current_state(TASK_UNINTERRUPTIBLE); if (wait.committed) break; io_schedule(); } + printk("%s-%d: %s exit %llu\n", current->comm, current->pid, __func__, blk_time_get_ns()); It can be observerd that blk_time_get_ns() always return the same time: [ 1068.096579] fio-1268: ioc_rqos_throttle enter 1067901962288 [ 1068.272587] fio-1268: ioc_rqos_throttle exit 1067901962288 [ 1068.274389] fio-1268: ioc_rqos_throttle enter 1067901962288 [ 1068.472690] fio-1268: ioc_rqos_throttle exit 1067901962288 [ 1068.474485] fio-1268: ioc_rqos_throttle enter 1067901962288 [ 1068.672656] fio-1268: ioc_rqos_throttle exit 1067901962288 [ 1068.674451] fio-1268: ioc_rqos_throttle enter 1067901962288 [ 1068.872655] fio-1268: ioc_rqos_throttle exit 1067901962288 And I think the root cause is that 'PF_BLOCK_TS' is always cleared by blk_flush_plug() before scheduel(), hence blk_plug_invalidate_ts() will never be called: blk_time_get_ns plug->cur_ktime = ktime_get_ns(); current->flags |= PF_BLOCK_TS; io_schedule: io_schedule_prepare blk_flush_plug __blk_flush_plug /* the flag is cleared, while time is not */ current->flags &= ~PF_BLOCK_TS; schedule sched_update_worker /* the flag is not set, hence plug->cur_ktime is not cleared */ if (tsk->flags & PF_BLOCK_TS) blk_plug_invalidate_ts() blk_time_get_ns /* got the time stashed before schedule */ return plug->cur_ktime; Fix the problem by clearing cached time in __blk_flush_plug(). Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240411032349.3051233-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-12scsi: block: Remove now unused queue limits helpersChristoph Hellwig1-245/+0
Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240409143748.980206-24-hch@lst.de Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2024-04-12scsi: bsg: Pass queue_limits to bsg_setup_queue()Christoph Hellwig1-2/+4
This allows bsg_setup_queue() to pass them to blk_mq_alloc_queue() and thus set up the limits at queue allocation time. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240409143748.980206-3-hch@lst.de Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2024-04-11block: fix module reference leakage from bdev_open_by_dev error pathYu Kuai1-1/+1
At the time bdev_may_open() is called, module reference is grabbed already, hence module reference should be released if bdev_may_open() failed. This problem is found by code review. Fixes: ed5cc702d311 ("block: Add config option to not allow writing to mounted devices") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240406090930.2252838-22-yukuai1@huaweicloud.com Signed-off-by: Christian Brauner <brauner@kernel.org>