summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-05-09block: fix that util can be greater than 100%Yu Kuai1-10/+2
util means the percentage that disk has IO, and theoretically it should not be greater than 100%. However, there is a gap for rq-based disk: io_ticks will be updated when rq is allocated, however, before such rq dispatch to driver, it will not be account as inflight from blk_mq_start_request() hence diskstats_show()/part_stat_show() will not update io_ticks. For example: 1) at t0, issue a new IO, rq is allocated, and blk_account_io_start() update io_ticks; 2) something is wrong with drivers, and the rq can't be dispatched; 3) at t0 + 10s, drivers recovers and rq is dispatched and done, io_ticks is updated; Then if user is using "iostat 1" to monitor "util", between t0 - t0+9s, util will be zero, and between t0+9s - t0+10s, util will be 1000%. Fix this problem by updating io_ticks from diskstats_show() and part_stat_show() if there are rq allocated. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509123717.3223892-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09block: support to account io_ticks preciselyYu Kuai5-5/+13
Currently, io_ticks is accounted based on sampling, specifically update_io_ticks() will always account io_ticks by 1 jiffies from bdev_start_io_acct()/blk_account_io_start(), and the result can be inaccurate, for example(HZ is 250): Test script: fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms Test result: util is about 90%, while the disk is really idle. This behaviour is introduced by commit 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting"), however, there was a key point that is missed that this patch also improve performance a lot: Before the commit: part_round_stats: if (part->stamp != now) stats |= 1; part_in_flight() -> there can be lots of task here in 1 jiffies. part_round_stats_single() __part_stat_add() part->stamp = now; After the commit: update_io_ticks: stamp = part->bd_stamp; if (time_after(now, stamp)) if (try_cmpxchg()) __part_stat_add() -> only one task can reach here in 1 jiffies. Hence in order to account io_ticks precisely, we only need to know if there are IO inflight at most once in one jiffies. Noted that for rq-based device, iterating tags should not be used here because 'tags->lock' is grabbed in blk_mq_find_and_get_req(), hence part_stat_lock_inc/dec() and part_in_flight() is used to trace inflight. The additional overhead is quite little: - per cpu add/dec for each IO for rq-based device; - per cpu sum for each jiffies; And it's verified by null-blk that there are no performance degration under heavy IO pressure. Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509123717.3223892-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09block: add plug while submitting IOYu Kuai1-0/+6
So that if caller didn't use plug, for example, __blkdev_direct_IO_simple() and __blkdev_direct_IO_async(), block layer can still benefit from caching nsec time in the plug. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09bcache: fix variable length array abuse in btree_iterMatthew Mirvish6-59/+70
btree_iter is used in two ways: either allocated on the stack with a fixed size MAX_BSETS, or from a mempool with a dynamic size based on the specific cache set. Previously, the struct had a fixed-length array of size MAX_BSETS which was indexed out-of-bounds for the dynamically-sized iterators, which causes UBSAN to complain. This patch uses the same approach as in bcachefs's sort_iter and splits the iterator into a btree_iter with a flexible array member and a btree_iter_stack which embeds a btree_iter as well as a fixed-length data array. Cc: stable@vger.kernel.org Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2039368 Signed-off-by: Matthew Mirvish <matthew@mm12.xyz> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20240509011117.2697-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09bcache: Remove usage of the deprecated ida_simple_xx() APIChristophe JAILLET1-5/+5
ida_alloc() and ida_free() should be preferred to the deprecated ida_simple_get() and ida_simple_remove(). Note that the upper limit of ida_simple_get() is exclusive, but the one of ida_alloc_max() is inclusive. So a -1 has been added when needed. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20240509011117.2697-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07md: Revert "md: Fix overflow in is_mddev_idle"Li Nan3-7/+6
This reverts commit 3f9f231236ce7e48780d8a4f1f8cb9fae2df1e4e. Using 64bit for 'sync_io' is unnecessary from the gendisk side. This overflow will not cause any functional impact, except for a UBSAN warning. Solving this overflow requires introducing additional calculations and checks which are not necessary. So just keep using 32bit for 'sync_io'. Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20240507023103.781816-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07blk-lib: check for kill signal in ioctl BLKDISCARDChristoph Hellwig1-3/+30
Discards can access a significant capacity and take longer than the user expected. A user may change their mind about wanting to run that command and attempt to kill the process and do something else with their device. But since the task is uninterruptable, they have to wait for it to finish, which could be many hours. Open code blkdev_issue_discard in the BLKDISCARD ioctl handler and check for a fatal signal at each iteration so the user doesn't have to wait for their regretted operation to complete naturally. Heavily based on an earlier patch from Keith Busch. Reported-by: Conrad Meyer <conradmeyer@meta.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07block: add a bio_await_chain helperKeith Busch2-0/+21
Add a helper to wait for an entire chain of bios to complete. [hch: split from a larger patch, moved and changed the name now that it is non-static] Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07block: add a blk_alloc_discard_bio helperChristoph Hellwig2-21/+32
Factor out a helper from __blkdev_issue_discard that chews off as much as possible from a discard range and allocates a bio for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07block: add a bio_chain_and_submit helperChristoph Hellwig2-8/+20
This is basically blk_next_bio just with the bio allocation moved to the caller to allow for more flexible bio handling in the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07block: move discard checks into the ioctl handlerChristoph Hellwig2-15/+5
Most bio operations get basic sanity checking in submit_bio and anything more complicated than that is done in the callers. Discards are a bit different from that in that a lot of checking is done in __blkdev_issue_discard, and the specific errnos for that are returned to userspace. Move the checks that require specific errnos to the ioctl handler instead, and just leave the basic sanity checking in submit_bio for the other handlers. This introduces two changes in behavior: 1) the logical block size alignment check of the start and len is lost for non-ioctl callers. This matches what is done for other operations including reads and writes. We should probably verify this for all bios, but for now make discards match the normal flow. 2) for non-ioctl callers all errors are reported on I/O completion now instead of synchronously. Callers in general mostly ignore or log errors so this will actually simplify the code once cleaned up Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07block: remove the discard_granularity check in __blkdev_issue_discardChristoph Hellwig1-7/+0
We now set a default granularity in the queue limits API, so don't bother with this extra check. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07block/ioctl: prefer different overflow checkJustin Stitt1-1/+1
Running syzkaller with the newly reintroduced signed integer overflow sanitizer shows this report: [ 62.982337] ------------[ cut here ]------------ [ 62.985692] cgroup: Invalid name [ 62.986211] UBSAN: signed-integer-overflow in ../block/ioctl.c:36:46 [ 62.989370] 9pnet_fd: p9_fd_create_tcp (7343): problem connecting socket to 127.0.0.1 [ 62.992992] 9223372036854775807 + 4095 cannot be represented in type 'long long' [ 62.997827] 9pnet_fd: p9_fd_create_tcp (7345): problem connecting socket to 127.0.0.1 [ 62.999369] random: crng reseeded on system resumption [ 63.000634] GUP no longer grows the stack in syz-executor.2 (7353): 20002000-20003000 (20001000) [ 63.000668] CPU: 0 PID: 7353 Comm: syz-executor.2 Not tainted 6.8.0-rc2-00035-gb3ef86b5a957 #1 [ 63.000677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 63.000682] Call Trace: [ 63.000686] <TASK> [ 63.000731] dump_stack_lvl+0x93/0xd0 [ 63.000919] __get_user_pages+0x903/0xd30 [ 63.001030] __gup_longterm_locked+0x153e/0x1ba0 [ 63.001041] ? _raw_read_unlock_irqrestore+0x17/0x50 [ 63.001072] ? try_get_folio+0x29c/0x2d0 [ 63.001083] internal_get_user_pages_fast+0x1119/0x1530 [ 63.001109] iov_iter_extract_pages+0x23b/0x580 [ 63.001206] bio_iov_iter_get_pages+0x4de/0x1220 [ 63.001235] iomap_dio_bio_iter+0x9b6/0x1410 [ 63.001297] __iomap_dio_rw+0xab4/0x1810 [ 63.001316] iomap_dio_rw+0x45/0xa0 [ 63.001328] ext4_file_write_iter+0xdde/0x1390 [ 63.001372] vfs_write+0x599/0xbd0 [ 63.001394] ksys_write+0xc8/0x190 [ 63.001403] do_syscall_64+0xd4/0x1b0 [ 63.001421] ? arch_exit_to_user_mode_prepare+0x3a/0x60 [ 63.001479] entry_SYSCALL_64_after_hwframe+0x6f/0x77 [ 63.001535] RIP: 0033:0x7f7fd3ebf539 [ 63.001551] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 [ 63.001562] RSP: 002b:00007f7fd32570c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 63.001584] RAX: ffffffffffffffda RBX: 00007f7fd3ff3f80 RCX: 00007f7fd3ebf539 [ 63.001590] RDX: 4db6d1e4f7e43360 RSI: 0000000020000000 RDI: 0000000000000004 [ 63.001595] RBP: 00007f7fd3f1e496 R08: 0000000000000000 R09: 0000000000000000 [ 63.001599] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 63.001604] R13: 0000000000000006 R14: 00007f7fd3ff3f80 R15: 00007ffd415ad2b8 ... [ 63.018142] ---[ end trace ]--- Historically, the signed integer overflow sanitizer did not work in the kernel due to its interaction with `-fwrapv` but this has since been changed [1] in the newest version of Clang; It was re-enabled in the kernel with Commit 557f8c582a9ba8ab ("ubsan: Reintroduce signed overflow sanitizer"). Let's rework this overflow checking logic to not actually perform an overflow during the check itself, thus avoiding the UBSAN splat. [1]: https://github.com/llvm/llvm-project/pull/82432 Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240507-b4-sio-block-ioctl-v3-1-ba0c2b32275e@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-06null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION()Zhu Yanjun1-0/+1
No functional changes intended. Fixes: f2298c0403b0 ("null_blk: multi queue aware block test driver") Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240506075538.6064-1-yanjun.zhu@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03block: fix and simplify blkdevparts= cmdline parsingINAGAKI Hiroshi1-37/+12
Fix the cmdline parsing of the "blkdevparts=" parameter using strsep(), which makes the code simpler. Before commit 146afeb235cc ("block: use strscpy() to instead of strncpy()"), we used a strncpy() to copy a block device name and partition names. The commit simply replaced a strncpy() and NULL termination with a strscpy(). It did not update calculations of length passed to strscpy(). While the length passed to strncpy() is just a length of valid characters without NULL termination ('\0'), strscpy() takes it as a length of the destination buffer, including a NULL termination. Since the source buffer is not necessarily NULL terminated, the current code copies "length - 1" characters and puts a NULL character in the destination buffer. It replaces the last character with NULL and breaks the parsing. As an example, that buffer will be passed to parse_parts() and breaks parsing sub-partitions due to the missing ')' at the end, like the following. example (Check Point V-80 & OpenWrt): - Linux Kernel 6.6 [ 0.000000] Kernel command line: console=ttyS0,115200 earlycon=uart8250,mmio32,0xf0512000 crashkernel=30M mvpp2x.queue_mode=1 blkdevparts=mmcblk1:48M@10M(kernel-1),1M(dtb-1),720M(rootfs-1),48M(kernel-2),1M(dtb-2),720M(rootfs-2),300M(default_sw),650M(logs),1M(preset_cfg),1M(adsl),-(storage) maxcpus=4 ... [ 0.884016] mmc1: new HS200 MMC card at address 0001 [ 0.889951] mmcblk1: mmc1:0001 004GA0 3.69 GiB [ 0.895043] cmdline partition format is invalid. [ 0.895704] mmcblk1: p1 [ 0.903447] mmcblk1boot0: mmc1:0001 004GA0 2.00 MiB [ 0.908667] mmcblk1boot1: mmc1:0001 004GA0 2.00 MiB [ 0.913765] mmcblk1rpmb: mmc1:0001 004GA0 512 KiB, chardev (248:0) 1. "48M@10M(kernel-1),..." is passed to strscpy() with length=17 from parse_parts() 2. strscpy() returns -E2BIG and the destination buffer has "48M@10M(kernel-1\0" 3. "48M@10M(kernel-1\0" is passed to parse_subpart() 4. parse_subpart() fails to find ')' when parsing a partition name, and returns error - Linux Kernel 6.1 [ 0.000000] Kernel command line: console=ttyS0,115200 earlycon=uart8250,mmio32,0xf0512000 crashkernel=30M mvpp2x.queue_mode=1 blkdevparts=mmcblk1:48M@10M(kernel-1),1M(dtb-1),720M(rootfs-1),48M(kernel-2),1M(dtb-2),720M(rootfs-2),300M(default_sw),650M(logs),1M(preset_cfg),1M(adsl),-(storage) maxcpus=4 ... [ 0.953142] mmc1: new HS200 MMC card at address 0001 [ 0.959114] mmcblk1: mmc1:0001 004GA0 3.69 GiB [ 0.964259] mmcblk1: p1(kernel-1) p2(dtb-1) p3(rootfs-1) p4(kernel-2) p5(dtb-2) 6(rootfs-2) p7(default_sw) p8(logs) p9(preset_cfg) p10(adsl) p11(storage) [ 0.979174] mmcblk1boot0: mmc1:0001 004GA0 2.00 MiB [ 0.984674] mmcblk1boot1: mmc1:0001 004GA0 2.00 MiB [ 0.989926] mmcblk1rpmb: mmc1:0001 004GA0 512 KiB, chardev (248:0 By the way, strscpy() takes a length of destination buffer and it is often confusing when copying characters with a specified length. Using strsep() helps to separate the string by the specified character. Then, we can use strscpy() naturally with the size of the destination buffer. Separating the string on the fly is also useful to omit the redundant string copy, reducing memory usage and improve the code readability. Fixes: 146afeb235cc ("block: use strscpy() to instead of strncpy()") Suggested-by: Naohiro Aota <naota@elisp.net> Signed-off-by: INAGAKI Hiroshi <musashino.open@gmail.com> Reviewed-by: Daniel Golle <daniel@makrotopia.org> Link: https://lore.kernel.org/r/20240421074005.565-1-musashino.open@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03block: refine the EOF check in blkdev_iomap_beginChristoph Hellwig1-1/+1
blkdev_iomap_begin rounds down the offset to the logical block size before stashing it in iomap->offset and checking that it still is inside the inode size. Check the i_size check to the raw pos value so that we don't try a zero size write if iter->pos is unaligned. Fixes: 487c607df790 ("block: use iomap for writes to block devices") Reported-by: syzbot+0a3683a0a6fecf909244@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: syzbot+0a3683a0a6fecf909244@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/20240503081042.2078062-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03block: add a partscan sysfs attribute for disksChristoph Hellwig2-0/+18
Userspace had been unknowingly relying on a non-stable interface of kernel internals to determine if partition scanning is enabled for a given disk. Provide a stable interface for this purpose instead. Cc: stable@vger.kernel.org # 6.3+ Depends-on: 140ce28dd3be ("block: add a disk_has_partscan helper") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-block/ZhQJf8mzq_wipkBH@gardel-login/ Link: https://lore.kernel.org/r/20240502130033.1958492-3-hch@lst.de [axboe: add links and commit message from Keith] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03block: add a disk_has_partscan helperChristoph Hellwig3-9/+16
Add a helper to check if partition scanning is enabled instead of open coding the check in a few places. This now always checks for the hidden flag even if all but one of the callers are never reachable for hidden gendisks. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240502130033.1958492-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03Merge tag 'md-6.10-20240502' of ↵Jens Axboe1-3/+3
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.10/block Pull MD fix from Song: "This fixes an issue observed with dm-raid." * tag 'md-6.10-20240502' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: fix resync softlockup when bitmap size is less than array size
2024-05-03md: fix resync softlockup when bitmap size is less than array sizeYu Kuai1-3/+3
Is is reported that for dm-raid10, lvextend + lvchange --syncaction will trigger following softlockup: kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [mdX_resync:6976] CPU: 7 PID: 3588 Comm: mdX_resync Kdump: loaded Not tainted 6.9.0-rc4-next-20240419 #1 RIP: 0010:_raw_spin_unlock_irq+0x13/0x30 Call Trace: <TASK> md_bitmap_start_sync+0x6b/0xf0 raid10_sync_request+0x25c/0x1b40 [raid10] md_do_sync+0x64b/0x1020 md_thread+0xa7/0x170 kthread+0xcf/0x100 ret_from_fork+0x30/0x50 ret_from_fork_asm+0x1a/0x30 And the detailed process is as follows: md_do_sync j = mddev->resync_min while (j < max_sectors) sectors = raid10_sync_request(mddev, j, &skipped) if (!md_bitmap_start_sync(..., &sync_blocks)) // md_bitmap_start_sync set sync_blocks to 0 return sync_blocks + sectors_skippe; // sectors = 0; j += sectors; // j never change Root cause is that commit 301867b1c168 ("md/raid10: check slab-out-of-bounds in md_bitmap_get_counter") return early from md_bitmap_get_counter(), without setting returned blocks. Fix this problem by always set returned blocks from md_bitmap_get_counter"(), as it used to be. Noted that this patch just fix the softlockup problem in kernel, the case that bitmap size doesn't match array size still need to be fixed. Fixes: 301867b1c168 ("md/raid10: check slab-out-of-bounds in md_bitmap_get_counter") Reported-and-tested-by: Nigel Croxon <ncroxon@redhat.com> Closes: https://lore.kernel.org/all/71ba5272-ab07-43ba-8232-d2da642acb4e@redhat.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240422065824.2516-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-05-01block: Cleanup blk_revalidate_zone_cb()Damien Le Moal1-52/+77
Define the code for checking conventional and sequential write required zones suing the functions blk_revalidate_conv_zone() and blk_revalidate_seq_zone() respectively. This simplifies the zone type switch-case in blk_revalidate_zone_cb(). No functional changes. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240501110907.96950-15-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Simplify zone write plug BIO abortDamien Le Moal1-8/+7
When BIOs plugged in a zone write plug are aborted, blk_zone_wplug_bio_io_error() clears the BIO BIO_ZONE_WRITE_PLUGGING flag so that bio_io_error(bio) does not end up calling blk_zone_write_plug_bio_endio() and we thus need to manually drop the reference on the zone write plug held by the aborted BIO. Move the call to disk_put_zone_wplug() that is alwasy following the call to blk_zone_wplug_bio_io_error() inside that function to simplify the code. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-14-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Simplify blk_zone_write_plug_bio_endio()Damien Le Moal1-2/+1
We already have the disk variable obtained from the bio when calling disk_get_zone_wplug(). So use that variable instead of dereferencing the bio bdev again for the disk argument of disk_get_zone_wplug(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-13-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Improve zone write request completion handlingDamien Le Moal3-13/+12
blk_zone_complete_request() must be called to handle the completion of a zone write request handled with zone write plugging. This function is called from blk_complete_request(), blk_update_request() and also in blk_mq_submit_bio() error path. Improve this by moving this function call into blk_mq_finish_request() as all requests are processed with this function when they complete as well as when they are freed without being executed. This also improves blk_update_request() used by scsi devices as these may repeatedly call this function to handle partial completions. To be consistent with this change, blk_zone_complete_request() is renamed to blk_zone_finish_request() and blk_zone_write_plug_complete_request() is renamed to blk_zone_write_plug_finish_request(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-12-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Improve blk_zone_write_plug_bio_merged()Damien Le Moal1-2/+7
Improve blk_zone_write_plug_bio_merged() to check that we succefully get a reference on the zone write plug of the merged BIO, as expected since for a merge we already have at least one request and one BIO referencing the zone write plug. Comments in this function are also improved to better explain the references to the BIO zone write plug. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-11-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Fix handling of non-empty flush write requests to zonesDamien Le Moal3-9/+13
Zone write plugging ignores empty (no data) flush operations but handles flush BIOs that have data to ensure that the flush machinery generated write is processed in order. However, the call to blk_zone_write_plug_attempt_merge() which sets a request RQF_ZONE_WRITE_PLUGGING flag is called after blk_insert_flush(), thus missing indicating that a non empty flush request completion needs handling by zone write plugging. Fix this by moving the call to blk_zone_write_plug_attempt_merge() before blk_insert_flush(). And while at it, rename that function as blk_zone_write_plug_init_request() to be clear that it is not just about merging plugged BIOs in the request. While at it, also add a WARN_ONCE() check that the zone write plug for the request is not NULL. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-10-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Fix flush request sector restoreDamien Le Moal1-1/+2
Make sure that a request bio is not NULL before trying to restore the request start sector. Reported-by: Yi Zhang <yi.zhang@redhat.com> Fixes: 6f8fd758de63 ("block: Restore sector of flush requests") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-9-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Do not remove zone write plugs still in useDamien Le Moal1-8/+31
Large write BIOs that span a zone boundary are split in blk_mq_submit_bio() before being passed to blk_zone_plug_bio() for zone write plugging. Such split BIO will be chained with one fragment targeting one zone and the remainder of the BIO targeting the next zone. The two BIOs can be executed in parallel, without a predetermine order relative to eachother and their completion may be reversed: the remainder first completing and the first fragment then completing. In such case, bio_endio() will not immediately execute blk_zone_write_plug_bio_endio() for the parent BIO (the remainder of the split BIO) as the BIOs are chained. blk_zone_write_plug_bio_endio() for the parent BIO will be executed only once the first fragment completes. In the case of a device with small zones and very large BIOs, uch completion pattern can lead to disk_should_remove_zone_wplug() to return true for the zone of the parent BIO when the parent BIO request completes and blk_zone_write_plug_complete_request() is executed. This triggers the removal of the zone write plug from the hash table using disk_remove_zone_wplug(). With the zone write plug of the parent BIO missing, the call to disk_get_zone_wplug() in blk_zone_write_plug_bio_endio() returns NULL and triggers a warning. This patterns can be recreated fairly easily using a scsi_debug device with small zone and btrfs. E.g. modprobe scsi_debug delay=0 dev_size_mb=1024 sector_size=4096 \ zbc=host-managed zone_cap_mb=3 zone_nr_conv=0 zone_size_mb=4 mkfs.btrfs -f -O zoned /dev/sda mount -t btrfs /dev/sda /mnt fio --name=wrtest --rw=randwrite --direct=1 --ioengine=libaio \ --bs=4k --iodepth=16 --size=1M --directory=/mnt --time_based \ --runtime=10 umount /dev/sda Will result in the warning: [ 29.035538] WARNING: CPU: 3 PID: 37 at block/blk-zoned.c:1207 blk_zone_write_plug_bio_endio+0xee/0x1e0 ... [ 29.058682] Call Trace: [ 29.059095] <TASK> [ 29.059473] ? __warn+0x80/0x120 [ 29.059983] ? blk_zone_write_plug_bio_endio+0xee/0x1e0 [ 29.060728] ? report_bug+0x160/0x190 [ 29.061283] ? handle_bug+0x36/0x70 [ 29.061830] ? exc_invalid_op+0x17/0x60 [ 29.062399] ? asm_exc_invalid_op+0x1a/0x20 [ 29.063025] ? blk_zone_write_plug_bio_endio+0xee/0x1e0 [ 29.063760] bio_endio+0xb7/0x150 [ 29.064280] btrfs_clone_write_end_io+0x2b/0x60 [btrfs] [ 29.065049] blk_update_request+0x17c/0x500 [ 29.065666] scsi_end_request+0x27/0x1a0 [scsi_mod] [ 29.066356] scsi_io_completion+0x5b/0x690 [scsi_mod] [ 29.067077] blk_complete_reqs+0x3a/0x50 [ 29.067692] __do_softirq+0xcf/0x2b3 [ 29.068248] ? sort_range+0x20/0x20 [ 29.068791] run_ksoftirqd+0x1c/0x30 [ 29.069339] smpboot_thread_fn+0xcc/0x1b0 [ 29.069936] kthread+0xcf/0x100 [ 29.070438] ? kthread_complete_and_exit+0x20/0x20 [ 29.071314] ret_from_fork+0x31/0x50 [ 29.071873] ? kthread_complete_and_exit+0x20/0x20 [ 29.072563] ret_from_fork_asm+0x11/0x20 [ 29.073146] </TASK> either when fio executes or when unmount is executed. Fix this by modifying disk_should_remove_zone_wplug() to check that the reference count to a zone write plug is not larger than 2, that is, that the only references left on the zone are the caller held reference (blk_zone_write_plug_complete_request()) and the initial extra reference for the zone write plug taken when it was initialized (and that is dropped when the zone write plug is removed from the hash table). To be consistent with this change, make sure to drop the request or BIO held reference to the zone write plug before calling disk_zone_wplug_unplug_bio(). All references are also dropped using disk_put_zone_wplug() instead of atomic_dec() to ensure that the zone write plug is freed if it needs to be. Comments are also improved to clarify zone write plugs reference handling. Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240501110907.96950-8-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Unhash a zone write plug only if neededDamien Le Moal1-23/+32
Fix disk_remove_zone_wplug() to ensure that a zone write plug already removed from a disk hash table of zone write plugs is not removed again. Do this by checking the BLK_ZONE_WPLUG_UNHASHED flag of the plug and calling hlist_del_init_rcu() only if the flag is not set. Furthermore, since BIO completions can happen at any time, that is, decrementing of the zone write plug reference count can happen at any time, make sure to use disk_put_zone_wplug() instead of atomic_dec() to ensure that the zone write plug is freed when its last reference is dropped. In order to do this, disk_remove_zone_wplug() is moved after the definition of disk_put_zone_wplug(). disk_should_remove_zone_wplug() is moved as well to keep it together with disk_remove_zone_wplug(). To be consistent with this change, add a check in disk_put_zone_wplug() to ensure that a zone write plug being freed was already removed from the disk hash table. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-7-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Hold a reference on zone write plugs to schedule submissionDamien Le Moal1-5/+21
Since a zone write plug BIO work is a field of struct blk_zone_wplug, we must ensure that a zone write plug is never freed when its BIO submission work is queued or running. Do this by holding a reference on the zone write plug when the submission work is scheduled for execution with queue_work() and releasing the reference at the end of the execution of the work function blk_zone_wplug_bio_work(). The helper function disk_zone_wplug_schedule_bio_work() is introduced to get a reference on a zone write plug and queue its work. This helper is used in disk_zone_wplug_unplug_bio() and disk_zone_wplug_handle_error(). Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-6-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Fix reference counting for zone write plugs in error stateDamien Le Moal1-26/+49
When zone is reset or finished, disk_zone_wplug_set_wp_offset() is called to update the zone write plug write pointer offset and to clear the zone error state (BLK_ZONE_WPLUG_ERROR flag) if it is set. However, this processing is missing dropping the reference to the zone write plug that was taken in disk_zone_wplug_set_error() when the error flag was first set. Furthermore, the error state handling must release the zone write plug lock to first execute a report zones command. When the report zone races with a reset or finish operation that clears the error, we can end up decrementing the zone write plug reference count twice: once in disk_zone_wplug_set_wp_offset() for the reset/finish operation and one more time in disk_zone_wplugs_work() once disk_zone_wplug_handle_error() completes. Fix this by introducing disk_zone_wplug_clear_error() as the symmetric function of disk_zone_wplug_set_error(). disk_zone_wplug_clear_error() decrements the zone write plug reference count obtained in disk_zone_wplug_set_error() only if the error handling has not started yet, that is, only if disk_zone_wplugs_work() has not yet taken the zone write plug off the error list. This ensure that either disk_zone_wplug_clear_error() or disk_zone_wplugs_work() drop the zone write plug reference count. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Fix zone write plug initialization from blk_revalidate_zone_cb()Damien Le Moal1-3/+4
When revalidating the zones of a zoned block device, blk_revalidate_zone_cb() must allocate a zone write plug for any sequential write required zone that is not empty nor full. However, the current code tests the latter case by comparing the zone write pointer offset to the zone size instead of the zone capacity. Furthermore, disk_get_and_lock_zone_wplug() is called with a sector argument equal to the zone start instead of the current zone write pointer position. This commit fixes both issues by calling disk_get_and_lock_zone_wplug() for a zone that is not empty and with a write pointer offset lower than the zone capacity and use the zone capacity sector as the sector argument for disk_get_and_lock_zone_wplug(). Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Exclude conventional zones when faking max open limitDamien Le Moal1-10/+28
For a device that has no limits for the maximum number of open and active zones, we default to using the number of zones, limited to BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE (128), for the maximum number of open zones indicated to the user. However, for a device that has conventional zones and less zones than BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE, we should not account conventional zones and set the limit to the number of sequential write required zones. Furthermore, for cases where the limit is equal to the number of sequential write required zones, we can advertize a limit of 0 to indicate "no limits". Fix this by moving the zone write plug mempool resizing from disk_revalidate_zone_resources() to disk_update_zone_resources() where we can safely compute the number of conventional zones and update the limits. Fixes: 843283e96e5a ("block: Fake max open zones limit when there is no limit") Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01dm: Check that a zoned table leads to a valid mapped deviceDamien Le Moal2-1/+59
Using targets such as dm-linear, a mapped device can be created to contain only conventional zones. Such device should not be treated as zoned as it does not contain any mandatory sequential write required zone. Since such device can be randomly written, we can modify dm_set_zones_restrictions() to set the mapped device zoned queue limit to false to expose it as a regular block device. The function dm_check_zoned() does this after counting the number of conventional zones of the mapped device and comparing it to the total number of zones reported. The special dm_check_zoned_cb() report zones callback function is used to count conventional zones. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Link: https://lore.kernel.org/r/20240501110907.96950-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-26sbitmap: use READ_ONCE to access map->wordlinke li1-4/+4
In __sbitmap_queue_get_batch(), map->word is read several times, and update atomically using atomic_long_try_cmpxchg(). But the first two read of map->word is not protected. This patch moves the statement val = READ_ONCE(map->word) forward, eliminating unprotected accesses to map->word within the function. It is aimed at reducing the number of benign races reported by KCSAN in order to focus future debugging effort on harmful races. Signed-off-by: linke li <lilinke99@qq.com> Link: https://lore.kernel.org/r/tencent_0B517C25E519D3D002194E8445E86C04AD0A@qq.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-25null_blk: Fix missing mutex_destroy() at module removalZhu Yanjun1-0/+2
When a mutex lock is not used any more, the function mutex_destroy should be called to mark the mutex lock uninitialized. Fixes: f2298c0403b0 ("null_blk: multi queue aware block test driver") Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Link: https://lore.kernel.org/r/20240425171635.4227-1-yanjun.zhu@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-25Merge tag 'md-6.10-20240425' of ↵Jens Axboe4-19/+17
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.10/block Pull MD fixes from Song: "These changes contain various fixes by Yu Kuai, Li Nan, and Florian-Ewald Mueller." * tag 'md-6.10-20240425' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: don't account sync_io if iostats of the disk is disabled md: Fix overflow in is_mddev_idle md: add check for sleepers in md_wakeup_thread() md/raid5: fix deadlock that raid5d() wait for itself to clear MD_SB_CHANGE_PENDING
2024-04-25block: check if zone_wplugs_hash exists in queue_zone_wplugs_showJohannes Thumshirn1-0/+3
Changhui reported a kernel crash when running this simple shell reproducer: # cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \; The above results in a NULL pointer dereference if a device does not have a zone_wplugs_hash allocated. To fix this, return early if we don't have a zone_wplugs_hash. Reported-by: Changhui Zhong <czhong@redhat.com> Fixes: a98b05b02f0f ("block: Replace zone_wlock debugfs entry with zone_wplugs entry") Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/e5fec079dfca448cc21c425cfa5d7b291f5faa67.1714046443.git.johannes.thumshirn@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-23block: use a per disk workqueue for zone write pluggingDamien Le Moal2-8/+25
A zone write plug BIO work function blk_zone_wplug_bio_work() calls submit_bio_noacct_nocheck() to execute the next unplugged BIO. This function may block. So executing zone plugs BIO works using the block layer global kblockd workqueue can potentially lead to preformance or latency issues as the number of concurrent work for a workqueue is limited to WQ_DFL_ACTIVE (256). 1) For a system with a large number of zoned disks, issuing write requests to otherwise unused zones may be delayed wiating for a work thread to become available. 2) Requeue operations which use kblockd but are independent of zone write plugging may alsoi end up being delayed. To avoid these potential performance issues, create a workqueue per zoned device to execute zone plugs BIO work. The workqueue max active parameter is set to the maximum number of zone write plugs allocated with the zone write plug mempool. This limit is equal to the maximum number of open zones of the disk and defaults to 128 for disks that do not have a limit on the number of open zones. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240420075811.1276893-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-19block/mq-deadline: Remove some unused functionsJiapeng Chong1-28/+0
These functions are defined in the mq-deadline.c file, but not called elsewhere, so delete these unused functions. block/mq-deadline.c:134:1: warning: unused function 'deadline_earlier_request'. block/mq-deadline.c:148:1: warning: unused function 'deadline_latter_request'. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8803 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Link: https://lore.kernel.org/r/20240419025610.34298-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17null_blk: Simplify null_zone_write()Damien Le Moal1-20/+16
In null_zone_write, we do not need to first check if the target zone condition is FULL, READONLY or OFFLINE: for theses conditions, the check of the command sector against the zone write pointer will always result in the command failing. Remove these checks. We still however need to check that the target zone write pointer is not invalid for zone append operations. To do so, add the macro NULL_ZONE_INVALID_WP and use it in null_set_zone_cond() when changing a zone to READONLY or OFFLINE condition. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240411085502.728558-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17null_blk: Do zone resource management only if necessaryDamien Le Moal1-146/+165
For zoned null_blk devices setup without any limit on the maximum number of open and active zones, there is no need to count the number of zones that are implicitly open, explicitly open and closed. This is indicated by the boolean field need_zone_res_mgmt of sturct nullb_device. Modify the zone management functions null_reset_zone(), null_finish_zone(), null_open_zone() and null_close_zone() to manage the zone condition counters only if the device need_zone_res_mgmt field is true. With this change, the function __null_close_zone() is removed and integrated into the 2 caller sites directly, with the null_close_imp_open_zone() call site greatly simplified as this function closes zones that are known to be in the implicit open condition. null_zone_write() is modified in a similar manner to do zone condition accouting only when the device need_zone_res_mgmt field is true. With these changes, the inline helpers null_lock_zone_res() and null_unlock_zone_res() are removed and replaced with direct calls to spin_lock()/spin_unlock(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240411085502.728558-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17null_blk: Have all null_handle_xxx() return a blk_status_tDamien Le Moal1-10/+8
Modify the null_handle_flush() and null_handle_rq() functions to return a blk_status_t instead of an errno to simplify the call sites of these functions and to be consistant with other null_handle_xxx() functions. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240411085502.728558-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Do not special-case plugging of zone write operationsDamien Le Moal5-57/+2
With the block layer zone write plugging being automatically done for any write operation to a zone of a zoned block device, a regular request plugging handled through current->plug can only ever see at most a single write request per zone. In such case, any potential reordering of the plugged requests will be harmless. We can thus remove the special casing for write operations to zones and have these requests plugged as well. This allows removing the function blk_mq_plug and instead directly using current->plug where needed. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-29-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONEDDamien Le Moal1-1/+0
Now that zone block device write ordering control does not depend anymore on mq-deadline and zone write locking, there is no need to force select the mq-deadline scheduler when CONFIG_BLK_DEV_ZONED is enabled. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-28-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Remove zone write lockingDamien Le Moal5-179/+7
Zone write locking is now unused and replaced with zone write plugging. Remove all code that was implementing zone write locking, that is, the various helper functions controlling request zone write locking and the gendisk attached zone bitmaps. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-27-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Replace zone_wlock debugfs entry with zone_wplugs entryDamien Le Moal3-10/+27
In preparation to completely remove zone write locking, replace the "zone_wlock" mq-debugfs entry that was listing zones that are write-locked with the zone_wplugs entry which lists the zones that currently have a write plug allocated. The write plug information provided is: the zone number, the zone write plug flags, the zone write plug write pointer offset and the number of BIOs currently waiting for execution in the zone write plug BIO list. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-26-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Move zone related debugfs attribute to blk-zoned.cDamien Le Moal5-28/+21
block/blk-mq-debugfs-zone.c contains a single debugfs attribute function. Defining this outside of block/blk-zoned.c does not really help in any way, so move this zone related debugfs attribute to block/blk-zoned.c and delete block/blk-mq-debugfs-zone.c. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-25-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Do not check zone type in blk_check_zone_append()Damien Le Moal1-2/+1
Zone append operations are only allowed to target sequential write required zones. blk_check_zone_append() uses bio_zone_is_seq() to check this. However, this check is not necessary because: 1) For NVMe ZNS namespace devices, only sequential write required zones exist, making the zone type check useless. 2) For null_blk, the driver will fail the request anyway, thus notifying the user that a conventional zone was targeted. 3) For all other zoned devices, zone append is now emulated using zone write plugging, which checks that a zone append operation does not target a conventional zone. In preparation for the removal of zone write locking and its conventional zone bitmap (used by bio_zone_is_seq()), remove the bio_zone_is_seq() call from blk_check_zone_append(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-24-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Remove elevator required featuresDamien Le Moal4-68/+5
The only elevator feature ever implemented is ELEVATOR_F_ZBD_SEQ_WRITE for signaling that a scheduler implements zone write locking to tightly control the dispatching order of write operations to zoned block devices. With the removal of zone write locking support in mq-deadline and the reliance of all block device drivers on the block layer zone write plugging to control ordering of write operations to zones, the elevator feature ELEVATOR_F_ZBD_SEQ_WRITE is completely unused. Remove it, and also remove the now unused code for filtering the possible schedulers for a block device based on required features. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-23-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>