summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-10-30blk-mq: fix redundant check of !e expressionJean Sacren1-1/+1
In the if branch, e is checked. In the else branch, ->dispatch_busy is merely a number and has no effect on !e. We should remove the check of !e since it is always true. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Link: https://lore.kernel.org/r/20211029202945.3052-1-sakiwit@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-29blk-mq-debugfs: Show active requests per queue for shared tagsJohn Garry1-1/+1
Currently we show the hctx.active value for the per-hctx "active" file. However this is not maintained for shared tags, and we instead keep a record of the number active requests per request queue - see commit f1b49fdc1c64 ("blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap). Change for the case of shared tags to show the active requests per request queue by using __blk_mq_active_requests() helper. Signed-off-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/1635496823-33515-1-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-28block: improve readability of blk_mq_end_request_batch()Jens Axboe1-6/+6
It's faster and easier to read if we tolerate cur_hctx being NULL in the "when to flush" condition. Rename last_hctx to cur_hctx while at it, as it better describes the role of that variable. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27virtio-blk: Use blk_validate_block_size() to validate block sizeXie Yongji1-2/+10
The block layer can't support a block size larger than page size yet. And a block size that's too small or not a power of two won't work either. If a misconfigured device presents an invalid block size in configuration space, it will result in the kernel crash something like below: [ 506.154324] BUG: kernel NULL pointer dereference, address: 0000000000000008 [ 506.160416] RIP: 0010:create_empty_buffers+0x24/0x100 [ 506.174302] Call Trace: [ 506.174651] create_page_buffers+0x4d/0x60 [ 506.175207] block_read_full_page+0x50/0x380 [ 506.175798] ? __mod_lruvec_page_state+0x60/0xa0 [ 506.176412] ? __add_to_page_cache_locked+0x1b2/0x390 [ 506.177085] ? blkdev_direct_IO+0x4a0/0x4a0 [ 506.177644] ? scan_shadow_nodes+0x30/0x30 [ 506.178206] ? lru_cache_add+0x42/0x60 [ 506.178716] do_read_cache_page+0x695/0x740 [ 506.179278] ? read_part_sector+0xe0/0xe0 [ 506.179821] read_part_sector+0x36/0xe0 [ 506.180337] adfspart_check_ICS+0x32/0x320 [ 506.180890] ? snprintf+0x45/0x70 [ 506.181350] ? read_part_sector+0xe0/0xe0 [ 506.181906] bdev_disk_changed+0x229/0x5c0 [ 506.182483] blkdev_get_whole+0x6d/0x90 [ 506.183013] blkdev_get_by_dev+0x122/0x2d0 [ 506.183562] device_add_disk+0x39e/0x3c0 [ 506.184472] virtblk_probe+0x3f8/0x79b [virtio_blk] [ 506.185461] virtio_dev_probe+0x15e/0x1d0 [virtio] So let's use a block layer helper to validate the block size. Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20211026144015.188-5-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27loop: Use blk_validate_block_size() to validate block sizeXie Yongji1-15/+2
Remove loop_validate_block_size() and use the block layer helper to validate block size. Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Link: https://lore.kernel.org/r/20211026144015.188-4-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27nbd: Use blk_validate_block_size() to validate block sizeXie Yongji1-1/+2
Use the block layer helper to validate block size instead of open coding it. Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Link: https://lore.kernel.org/r/20211026144015.188-3-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27block: Add a helper to validate the block sizeXie Yongji1-0/+8
There are some duplicated codes to validate the block size in block drivers. This limitation actually comes from block layer, so this patch tries to add a new block layer helper for that. Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Link: https://lore.kernel.org/r/20211026144015.188-2-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27block: re-flow blk_mq_rq_ctx_init()Jens Axboe1-12/+12
Now that we have flags passed in, we can do a final re-arrange of the flow of blk_mq_rq_ctx_init() so we're always writing request in the order in which it is laid out. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20211019153300.623322-5-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27block: prefetch request to be initializedJens Axboe1-0/+1
Now we have the tags available in __blk_mq_alloc_requests_batch(), we can start fetching the first request cacheline before calling into the request initialization. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20211019153300.623322-4-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27block: pass in blk_mq_tags to blk_mq_rq_ctx_init()Jens Axboe1-11/+14
Instead of getting this from data for every invocation of request initialization, pass it in as an argument instead. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20211019153300.623322-3-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27block: add rq_flags to struct blk_mq_alloc_dataJens Axboe2-14/+14
There's a hole here we can use, and it's faster to set this earlier rather than need to check q->elevator multiple times. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20211019153300.623322-2-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27block: add async version of bio_set_polledPavel Begunkov1-4/+3
If we know that a iocb is async we can optimise bio_set_polled() a bit, add a new helper bio_set_polled_async(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8fa137885164a5d05fadcff4c3521da8d5a83d00.1635337135.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27block: kill DIO_MULTI_BIOPavel Begunkov1-21/+12
Now __blkdev_direct_IO() serves only multi-bio I/O, thus remove not used anymore single bio refcounting optimisations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/88eb488aae9ed4852a30f3a7132f296f56e43b80.1635337135.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27block: kill unused polling bits in __blkdev_direct_IO()Pavel Begunkov1-17/+3
With addition of __blkdev_direct_IO_async(), __blkdev_direct_IO() now serves only multio-bio I/O, which we don't poll. Now we can remove anything related to I/O polling from it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b8c597a6b7ee612df394853bfd24726aee5b898e.1635337135.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27block: avoid extra iter advance with async iocbPavel Begunkov3-6/+17
Nobody cares about iov iterators state if we return -EIOCBQUEUED, so as the we now have __blkdev_direct_IO_async(), which gets pages only once, we can skip expensive iov_iter_advance(). It's around 1-2% of all CPU spent. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a6158edfbfa2ae3bc24aed29a72f035df18fad2f.1635337135.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27block: Add independent access ranges supportDamien Le Moal5-9/+410
The Concurrent Positioning Ranges VPD page (for SCSI) and data log page (for ATA) contain parameters describing the set of contiguous LBAs that can be served independently by a single LUN multi-actuator hard-disk. Similarly, a logically defined block device composed of multiple disks can in some cases execute requests directed at different sector ranges in parallel. A dm-linear device aggregating 2 block devices together is an example. This patch implements support for exposing a block device independent access ranges to the user through sysfs to allow optimizing device accesses to increase performance. To describe the set of independent sector ranges of a device (actuators of a multi-actuator HDDs or table entries of a dm-linear device), The type struct blk_independent_access_ranges is introduced. This structure describes the sector ranges using an array of struct blk_independent_access_range structures. This range structure defines the start sector and number of sectors of the access range. The ranges in the array cannot overlap and must contain all sectors within the device capacity. The function disk_set_independent_access_ranges() allows a device driver to signal to the block layer that a device has multiple independent access ranges. In this case, a struct blk_independent_access_ranges is attached to the device request queue by the function disk_set_independent_access_ranges(). The function disk_alloc_independent_access_ranges() is provided for drivers to allocate this structure. struct blk_independent_access_ranges contains kobjects (struct kobject) to expose to the user through sysfs the set of independent access ranges supported by a device. When the device is initialized, sysfs registration of the ranges information is done from blk_register_queue() using the block layer internal function disk_register_independent_access_ranges(). If a driver calls disk_set_independent_access_ranges() for a registered queue, e.g. when a device is revalidated, disk_set_independent_access_ranges() will execute disk_register_independent_access_ranges() to update the sysfs attribute files. The sysfs file structure created starts from the independent_access_ranges sub-directory and contains the start sector and number of sectors of each range, with the information for each range grouped in numbered sub-directories. E.g. for a dual actuator HDD, the user sees: $ tree /sys/block/sdk/queue/independent_access_ranges/ /sys/block/sdk/queue/independent_access_ranges/ |-- 0 | |-- nr_sectors | `-- sector `-- 1 |-- nr_sectors `-- sector For a regular device with a single access range, the independent_access_ranges sysfs directory does not exist. Device revalidation may lead to changes to this structure and to the attribute values. When manipulated, the queue sysfs_lock and sysfs_dir_lock mutexes are held for atomicity, similarly to how the blk-mq and elevator sysfs queue sub-directories are protected. The code related to the management of independent access ranges is added in the new file block/blk-ia-ranges.c. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20211027022223.183838-2-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-26blk-mq: don't issue request directly in case that current is to be blockedMing Lei1-1/+1
When flushing plug list in case that current will be blocked, we can't issue request directly because ->queue_rq() may sleep, otherwise scheduler may complain. Fixes: dc5fc361d891 ("block: attempt direct issue of plug list") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211026082257.2889890-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25sbitmap: silence data race warningJens Axboe1-1/+1
KCSAN complaints about the sbitmap hint update: ================================================================== BUG: KCSAN: data-race in sbitmap_queue_clear / sbitmap_queue_clear write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 1: sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606 blk_mq_put_tag+0x82/0x90 __blk_mq_free_request+0x114/0x180 block/blk-mq.c:507 blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541 __blk_mq_end_request+0x214/0x230 block/blk-mq.c:565 blk_mq_end_request+0x37/0x50 block/blk-mq.c:574 lo_complete_rq+0xca/0x170 drivers/block/loop.c:541 blk_complete_reqs block/blk-mq.c:584 [inline] blk_done_softirq+0x69/0x90 block/blk-mq.c:589 __do_softirq+0x12c/0x26e kernel/softirq.c:558 run_ksoftirqd+0x13/0x20 kernel/softirq.c:920 smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164 kthread+0x262/0x280 kernel/kthread.c:319 ret_from_fork+0x1f/0x30 write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 0: sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606 blk_mq_put_tag+0x82/0x90 __blk_mq_free_request+0x114/0x180 block/blk-mq.c:507 blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541 __blk_mq_end_request+0x214/0x230 block/blk-mq.c:565 blk_mq_end_request+0x37/0x50 block/blk-mq.c:574 lo_complete_rq+0xca/0x170 drivers/block/loop.c:541 blk_complete_reqs block/blk-mq.c:584 [inline] blk_done_softirq+0x69/0x90 block/blk-mq.c:589 __do_softirq+0x12c/0x26e kernel/softirq.c:558 run_ksoftirqd+0x13/0x20 kernel/softirq.c:920 smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164 kthread+0x262/0x280 kernel/kthread.c:319 ret_from_fork+0x1f/0x30 value changed: 0x00000035 -> 0x00000044 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 10 Comm: ksoftirqd/0 Not tainted 5.15.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 ================================================================== which is a data race, but not an important one. This is just updating the percpu alloc hint, and the reader of that hint doesn't ever require it to be valid. Just annotate it with data_race() to silence this one. Reported-by: syzbot+4f8bfd804b4a1f95b8f6@syzkaller.appspotmail.com Acked-by: Marco Elver <elver@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25blk-cgroup: synchronize blkg creation against policy deactivationYu Kuai1-0/+10
Our test reports a null pointer dereference: [ 168.534653] ================================================================== [ 168.535614] Disabling lock debugging due to kernel taint [ 168.536346] BUG: kernel NULL pointer dereference, address: 0000000000000008 [ 168.537274] #PF: supervisor read access in kernel mode [ 168.537964] #PF: error_code(0x0000) - not-present page [ 168.538667] PGD 0 P4D 0 [ 168.539025] Oops: 0000 [#1] PREEMPT SMP KASAN [ 168.539656] CPU: 13 PID: 759 Comm: bash Tainted: G B 5.15.0-rc2-next-202100 [ 168.540954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_0738364 [ 168.542736] RIP: 0010:bfq_pd_init+0x88/0x1e0 [ 168.543318] Code: 98 00 00 00 e8 c9 e4 5b ff 4c 8b 65 00 49 8d 7c 24 08 e8 bb e4 5b ff 4d0 [ 168.545803] RSP: 0018:ffff88817095f9c0 EFLAGS: 00010002 [ 168.546497] RAX: 0000000000000001 RBX: ffff888101a1c000 RCX: 0000000000000000 [ 168.547438] RDX: 0000000000000003 RSI: 0000000000000002 RDI: ffff888106553428 [ 168.548402] RBP: ffff888106553400 R08: ffffffff961bcaf4 R09: 0000000000000001 [ 168.549365] R10: ffffffffa2e16c27 R11: fffffbfff45c2d84 R12: 0000000000000000 [ 168.550291] R13: ffff888101a1c098 R14: ffff88810c7a08c8 R15: ffffffffa55541a0 [ 168.551221] FS: 00007fac75227700(0000) GS:ffff88839ba80000(0000) knlGS:0000000000000000 [ 168.552278] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 168.553040] CR2: 0000000000000008 CR3: 0000000165ce7000 CR4: 00000000000006e0 [ 168.554000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 168.554929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 168.555888] Call Trace: [ 168.556221] <TASK> [ 168.556510] blkg_create+0x1c0/0x8c0 [ 168.556989] blkg_conf_prep+0x574/0x650 [ 168.557502] ? stack_trace_save+0x99/0xd0 [ 168.558033] ? blkcg_conf_open_bdev+0x1b0/0x1b0 [ 168.558629] tg_set_conf.constprop.0+0xb9/0x280 [ 168.559231] ? kasan_set_track+0x29/0x40 [ 168.559758] ? kasan_set_free_info+0x30/0x60 [ 168.560344] ? tg_set_limit+0xae0/0xae0 [ 168.560853] ? do_sys_openat2+0x33b/0x640 [ 168.561383] ? do_sys_open+0xa2/0x100 [ 168.561877] ? __x64_sys_open+0x4e/0x60 [ 168.562383] ? __kasan_check_write+0x20/0x30 [ 168.562951] ? copyin+0x48/0x70 [ 168.563390] ? _copy_from_iter+0x234/0x9e0 [ 168.563948] tg_set_conf_u64+0x17/0x20 [ 168.564467] cgroup_file_write+0x1ad/0x380 [ 168.565014] ? cgroup_file_poll+0x80/0x80 [ 168.565568] ? __mutex_lock_slowpath+0x30/0x30 [ 168.566165] ? pgd_free+0x100/0x160 [ 168.566649] kernfs_fop_write_iter+0x21d/0x340 [ 168.567246] ? cgroup_file_poll+0x80/0x80 [ 168.567796] new_sync_write+0x29f/0x3c0 [ 168.568314] ? new_sync_read+0x410/0x410 [ 168.568840] ? __handle_mm_fault+0x1c97/0x2d80 [ 168.569425] ? copy_page_range+0x2b10/0x2b10 [ 168.570007] ? _raw_read_lock_bh+0xa0/0xa0 [ 168.570622] vfs_write+0x46e/0x630 [ 168.571091] ksys_write+0xcd/0x1e0 [ 168.571563] ? __x64_sys_read+0x60/0x60 [ 168.572081] ? __kasan_check_write+0x20/0x30 [ 168.572659] ? do_user_addr_fault+0x446/0xff0 [ 168.573264] __x64_sys_write+0x46/0x60 [ 168.573774] do_syscall_64+0x35/0x80 [ 168.574264] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 168.574960] RIP: 0033:0x7fac74915130 [ 168.575456] Code: 73 01 c3 48 8b 0d 58 ed 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 444 [ 168.577969] RSP: 002b:00007ffc3080e288 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 168.578986] RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007fac74915130 [ 168.579937] RDX: 0000000000000009 RSI: 000056007669f080 RDI: 0000000000000001 [ 168.580884] RBP: 000056007669f080 R08: 000000000000000a R09: 00007fac75227700 [ 168.581841] R10: 000056007655c8f0 R11: 0000000000000246 R12: 0000000000000009 [ 168.582796] R13: 0000000000000001 R14: 00007fac74be55e0 R15: 00007fac74be08c0 [ 168.583757] </TASK> [ 168.584063] Modules linked in: [ 168.584494] CR2: 0000000000000008 [ 168.584964] ---[ end trace 2475611ad0f77a1a ]--- This is because blkg_alloc() is called from blkg_conf_prep() without holding 'q->queue_lock', and elevator is exited before blkg_create(): thread 1 thread 2 blkg_conf_prep spin_lock_irq(&q->queue_lock); blkg_lookup_check -> return NULL spin_unlock_irq(&q->queue_lock); blkg_alloc blkcg_policy_enabled -> true pd = ->pd_alloc_fn blkg->pd[i] = pd blk_mq_exit_sched bfq_exit_queue blkcg_deactivate_policy spin_lock_irq(&q->queue_lock); __clear_bit(pol->plid, q->blkcg_pols); spin_unlock_irq(&q->queue_lock); q->elevator = NULL; spin_lock_irq(&q->queue_lock); blkg_create if (blkg->pd[i]) ->pd_init_fn -> q->elevator is NULL spin_unlock_irq(&q->queue_lock); Because blkcg_deactivate_policy() requires queue to be frozen, we can grab q_usage_counter to synchoronize blkg_conf_prep() against blkcg_deactivate_policy(). Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20211020014036.2141723-1-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25block: refactor bio_iov_bvec_set()Pavel Begunkov1-23/+14
Combine bio_iov_bvec_set() and bio_iov_bvec_set_append() and let the caller to do iov_iter_advance(). Also get rid of __bio_iov_bvec_set(), which was duplicated in the final binary, and replace a weird iov_iter_truncate() of a temporal iter copy with min() better reflecting the intention. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/bcf1ac36fce769a514e19475f3623cd86a1d8b72.1635006010.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25block: add single bio async direct IO helperPavel Begunkov1-3/+84
As with __blkdev_direct_IO_simple(), we can implement direct IO more efficiently if there is only one bio. Add __blkdev_direct_IO_async() and blkdev_bio_end_io_async(). This patch brings me from 4.45-4.5 MIOPS with nullblk to 4.7+. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f0ae4109b7a6934adede490f84d188d53b97051b.1635006010.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-23sched: make task_struct->plug always definedJens Axboe1-2/+0
If CONFIG_BLOCK isn't set, then it's an empty struct anyway. Just make it generally available, so we don't break the compile: kernel/sched/core.c: In function ‘sched_submit_work’: kernel/sched/core.c:6346:35: error: ‘struct task_struct’ has no member named ‘plug’ 6346 | blk_flush_plug(tsk->plug, true); | ^~ kernel/sched/core.c: In function ‘io_schedule_prepare’: kernel/sched/core.c:8357:20: error: ‘struct task_struct’ has no member named ‘plug’ 8357 | if (current->plug) | ^~ kernel/sched/core.c:8358:39: error: ‘struct task_struct’ has no member named ‘plug’ 8358 | blk_flush_plug(current->plug, true); | ^~ Reported-by: Nathan Chancellor <nathan@kernel.org> Fixes: 008f75a20e70 ("block: cleanup the flush plug helpers") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22blk-mq-sched: Don't reference queue tagset in blk_mq_sched_tags_teardown()John Garry1-1/+1
We should not reference the queue tagset in blk_mq_sched_tags_teardown() (see function comment) for the blk-mq flags, so use the passed flags instead. This solves a use-after-free, similarly fixed earlier (and since broken again) in commit f0c1c4d2864e ("blk-mq: fix use-after-free in blk_mq_exit_sched"). Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Anders Roxell <anders.roxell@linaro.org> Fixes: e155b0c238b2 ("blk-mq: Use shared tags for shared sbitmap support") Signed-off-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/1634890340-15432-1-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22block: fix req_bio_endio append error handlingPavel Begunkov1-1/+1
Shinichiro Kawasaki reports that there is a bug in a recent req_bio_endio() patch causing problems with zonefs. As Shinichiro suggested, inverse the condition in zone append path to resemble how it was before: fail when it's not fully completed. Fixes: 478eb72b815f3 ("block: optimise req_bio_endio()") Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/344ea4e334aace9148b41af5f2426da38c8aa65a.1634914228.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21blk-crypto: update inline encryption documentationEric Biggers1-206/+245
Rework most of inline-encryption.rst to be easier to follow, to correct some information, to add some important details and remove some unimportant details, and to take into account the renaming from blk_keyslot_manager to blk_crypto_profile. Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20211018180453.40441-5-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21blk-crypto: rename blk_keyslot_manager to blk_crypto_profileEric Biggers18-522/+555
blk_keyslot_manager is misnamed because it doesn't necessarily manage keyslots. It actually does several different things: - Contains the crypto capabilities of the device. - Provides functions to control the inline encryption hardware. Originally these were just for programming/evicting keyslots; however, new functionality (hardware-wrapped keys) will require new functions here which are unrelated to keyslots. Moreover, device-mapper devices already (ab)use "keyslot_evict" to pass key eviction requests to their underlying devices even though device-mapper devices don't have any keyslots themselves (so it really should be "evict_key", not "keyslot_evict"). - Sometimes (but not always!) it manages keyslots. Originally it always did, but device-mapper devices don't have keyslots themselves, so they use a "passthrough keyslot manager" which doesn't actually manage keyslots. This hack works, but the terminology is unnatural. Also, some hardware doesn't have keyslots and thus also uses a "passthrough keyslot manager" (support for such hardware is yet to be upstreamed, but it will happen eventually). Let's stop having keyslot managers which don't actually manage keyslots. Instead, rename blk_keyslot_manager to blk_crypto_profile. This is a fairly big change, since for consistency it also has to update keyslot manager-related function names, variable names, and comments -- not just the actual struct name. However it's still a fairly straightforward change, as it doesn't change any actual functionality. Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20211018180453.40441-4-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21blk-crypto: rename keyslot-manager files to blk-crypto-profileEric Biggers10-9/+9
In preparation for renaming struct blk_keyslot_manager to struct blk_crypto_profile, rename the keyslot-manager.h and keyslot-manager.c source files. Renaming these files separately before making a lot of changes to their contents makes it easier for git to understand that they were renamed. Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20211018180453.40441-3-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21blk-crypto-fallback: properly prefix function and struct namesEric Biggers1-29/+30
For clarity, avoid using just the "blk_crypto_" prefix for functions and structs that are specific to blk-crypto-fallback. Instead, use "blk_crypto_fallback_". Some places already did this, but others didn't. This is also a prerequisite for using "struct blk_crypto_keyslot" to mean a generic blk-crypto keyslot (which is what it sounds like). Rename the fallback one to "struct blk_crypto_fallback_keyslot". No change in behavior. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20211018180453.40441-2-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21nbd: Use invalidate_disk() helper on disconnectXie Yongji1-9/+3
When a nbd device encounters a writeback error, that error will get propagated to the bd_inode's wb_err field. Then if this nbd device's backend is disconnected and another is attached, we will get back the previous writeback error on fsync, which is unexpected. To fix it, let's use invalidate_disk() helper to invalidate the disk on disconnect instead of just setting disk's capacity to zero. Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210922123711.187-5-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21loop: Remove the unnecessary bdev checks and unused bdev variableXie Yongji1-6/+3
The lo->lo_device can't be null if the lo->lo_backing_file is set. So let's remove the unnecessary bdev checks and the entire bdev variable in __loop_clr_fd() since the lo->lo_backing_file is already checked before. Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210922123711.187-4-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21loop: Use invalidate_disk() helper to invalidate gendiskXie Yongji1-5/+1
Use invalidate_disk() helper to simplify the code for gendisk invalidation. Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210922123711.187-3-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21block: Add invalidate_disk() helper to invalidate the gendiskXie Yongji2-0/+22
To hide internal implementation and simplify some driver code, this adds a helper to invalidate the gendisk. It will clean the gendisk's associated buffer/page caches and reset its internal states. Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210922123711.187-2-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21block: kill extra rcu lock/unlock in queue enterPavel Begunkov1-1/+1
blk_try_enter_queue() already takes rcu_read_lock/unlock, so we can avoid the second pair in percpu_ref_tryget_live(), use a newly added percpu_ref_tryget_live_rcu(). As rcu_read_lock/unlock imply barrier()s, it's pretty noticeable, especially for for !CONFIG_PREEMPT_RCU (default for some distributions), where __rcu_read_lock/unlock() are not inlined. 3.20% io_uring [kernel.vmlinux] [k] __rcu_read_unlock 3.05% io_uring [kernel.vmlinux] [k] __rcu_read_lock 2.52% io_uring [kernel.vmlinux] [k] __rcu_read_unlock 2.28% io_uring [kernel.vmlinux] [k] __rcu_read_lock Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6b11c67ea495ed9d44f067622d852de4a510ce65.1634822969.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21percpu_ref: percpu_ref_tryget_live() version holding RCUPavel Begunkov1-10/+23
Add percpu_ref_tryget_live_rcu(), which is a version of percpu_ref_tryget_live() but the user is responsible for enclosing it in a RCU read lock section. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Dennis Zhou <dennis@kernel.org> Link: https://lore.kernel.org/r/3066500d7a6eb3e03f10adf98b87fdb3b1c49db8.1634822969.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21block: convert fops.c magic constants to SHIFT_SECTORPavel Begunkov1-8/+10
Don't use shifting by a magic number 9 but replace with a more descriptive SHIFT_SECTOR. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/068782b9f7e97569fb59a99529b23bb17ea4c5e2.1634755800.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21block: clean up blk_mq_submit_bio() mergingPavel Begunkov3-20/+9
Combine blk_mq_sched_bio_merge() and blk_attempt_plug_merge() under a common if, so we don't check it twice. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/daedc90d4029a5d1d73344771632b1faca3aaf81.1634755800.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21block: optimise boundary blkdev_read_iter's checksPavel Begunkov1-8/+11
Combine pos and len checks and mark unlikely. Also, don't reexpand if it's not truncated. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fff34e613aeaae1ad12977dc4592cb1a1f5d3190.1634755800.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21fs: bdev: fix conflicting comment from lookup_bdevJackie Liu1-3/+5
We switched to directly use dev_t to get block device, lookup changed the meaning of use, now we fix this conflicting comment. Fixes: 4e7b5671c6a8 ("block: remove i_bdev") Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211021071344.1600362-1-liu.yun@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21blk-mq: Fix blk_mq_tagset_busy_iter() for shared tagsJohn Garry1-2/+5
Since it is now possible for a tagset to share a single set of tags, the iter function should not re-iter the tags for the count of #hw queues in that case. Rather it should just iter once. Fixes: e155b0c238b2 ("blk-mq: Use shared tags for shared sbitmap support") Reported-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Link: https://lore.kernel.org/r/1634550083-202815-1-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20block: cleanup the flush plug helpersChristoph Hellwig4-36/+16
Consolidate the various helpers into a single blk_flush_plug helper that takes a plk_plug and the from_scheduler bool and switch all callsites to call it directly. Checks that the plug is non-NULL must be performed by the caller, something that most already do anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211020144119.142582-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20block: optimise blk_flush_plug_listPavel Begunkov1-2/+2
Don't call flush_plug_callbacks if there are no plug callbacks. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211020144119.142582-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20blk-mq: move blk_mq_flush_plug_list to block/blk-mq.hChristoph Hellwig2-2/+1
This helper is internal to the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211020144119.142582-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20blk-mq: only flush requests from the plug in blk_mq_submit_bioChristoph Hellwig1-1/+1
Replace the call to blk_flush_plug_list in blk_mq_submit_bio with a direct call to blk_mq_flush_plug_list. This means we do not flush plug callback from stackable devices, which doesn't really help with the accumulated requests anyway, and it also means the cached requests aren't freed here as they can still be used later on. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211020144119.142582-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20block: remove inaccurate requeue checkJens Axboe1-1/+0
This check is meant to catch cases where a requeue is attempted on a request that is still inserted. It's never really been useful to catch any misuse, and now it's actively wrong. Outside of that, this should not be a BUG_ON() to begin with. Remove the check as it's now causing active harm, as requeue off the plug path will trigger it even though the request state is just fine. Reported-by: Yi Zhang <yi.zhang@redhat.com> Link: https://lore.kernel.org/linux-block/CAHj4cs80zAUc2grnCZ015-2Rvd-=gXRfB_dFKy=RTm+wRo09HQ@mail.gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20block: inline a part of bio_release_pages()Pavel Begunkov2-6/+9
Inline BIO_NO_PAGE_REF check of bio_release_pages() to avoid function call. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20block: don't bloat enter_queue with percpu_refPavel Begunkov1-1/+1
percpu_ref_put() are inlined for performance and bloat the binary, we don't care about the fail case of blk_try_enter_queue(), so we can replace it with a call to blk_queue_exit(). Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20block: optimise req_bio_endio()Pavel Begunkov1-9/+7
First, get rid of an extra branch and chain error checks. Also reshuffle it with bio_advance(), so it goes closer to the final check, with that the compiler loads rq->rq_flags only once, and also doesn't reload bio->bi_iter.bi_size if bio_advance() didn't actually advanced the iter. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20block: convert leftovers to bdev_get_queuePavel Begunkov2-2/+2
Convert bdev->bd_disk->queue to bdev_get_queue(), which is faster. Apparently, there are a few such spots in block that got lost during rebases. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20block: turn macro helpers into inline functionsPavel Begunkov1-16/+16
Replace bio_set_dev() with an identical inline helper and move it further to fix a dependency problem with bio_associate_blkg(). Do the same for bio_copy_dev(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20blk-mq: support concurrent queue quiesce/unquiesceMing Lei2-3/+21
blk_mq_quiesce_queue() has been used a bit wide now, so far we don't support concurrent/nested quiesce. One biggest issue is that unquiesce can happen unexpectedly in case that quiesce/unquiesce are run concurrently from more than one context. This patch introduces q->mq_quiesce_depth to deal concurrent quiesce, and we only unquiesce queue when it is the last/outer-most one of all contexts. Several kernel panic issue has been reported[1][2][3] when running stress quiesce test. And this patch has been verified in these reports. [1] https://lore.kernel.org/linux-block/9b21c797-e505-3821-4f5b-df7bf9380328@huawei.com/T/#m1fc52431fad7f33b1ffc3f12c4450e4238540787 [2] https://lore.kernel.org/linux-block/9b21c797-e505-3821-4f5b-df7bf9380328@huawei.com/T/#m10ad90afeb9c8cc318334190a7c24c8b5c5e0722 [3] https://listman.redhat.com/archives/dm-devel/2021-September/msg00189.html Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211014081710.1871747-7-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>