path: root/block/blk-mq.c
2022-06-21  block: pop cached rq before potentially blocking rq_qos_throttle()  (Jens Axboe, 1 file, -3/+8)
If rq_qos_throttle() ends up blocking, then we will have invalidated and flushed our current plug. Since blk_mq_get_cached_request() hasn't popped the cached request off the plug list just yet, we end up holding a pointer to a request that is no longer valid. This insta-crashes with rq->mq_hctx being NULL in the validity checks just after. Pop the request off the cached list before doing rq_qos_throttle() to avoid using a potentially stale request.

Fixes: 0a5aa8d161d1 ("block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection")
Reported-by: Dylan Yudaken <dylany@fb.com>
Tested-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

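A minimal sketch of the reordering (illustrative only; the helper below is a simplified stand-in for blk_mq_get_cached_request(), not the actual patch):

    /*
     * Sketch: detach the cached request from the plug *before* calling
     * rq_qos_throttle(). If the throttle blocks and the plug is flushed,
     * we no longer hold a pointer into the invalidated cached list.
     */
    static struct request *get_cached_rq(struct blk_plug *plug, struct bio *bio)
    {
            struct request *rq = rq_list_peek(&plug->cached_rq);

            if (!rq)
                    return NULL;

            /* pop first, so a blocking throttle cannot leave 'rq' stale */
            plug->cached_rq = rq_list_next(rq);

            rq_qos_throttle(rq->q, bio);    /* may block and flush the plug */
            return rq;
    }
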
2022-06-16  blk-mq: don't clear flush_rq from tags->rqs[]  (Ming Lei, 1 file, -2/+3)
Commit 364b61818f65 ("blk-mq: clearing flush request reference in tags->rqs[]") was added to clear the to-be-freed flush request from tags->rqs[], avoiding a use-after-free on the flush rq. Yu Kuai reported that blk_mq_clear_flush_rq_mapping() slows down boot time by ~8s, because the scsi probe may create and remove lots of non-present LUNs on megaraid-sas, which uses BLK_MQ_F_TAG_HCTX_SHARED, and each request queue has lots of hw queues. Improve the situation by not running blk_mq_clear_flush_rq_mapping() if the disk was never added, in which case no flush request can have been issued.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220616014401.817001-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-06-16  blk-mq: avoid to touch q->elevator without any protection  (Ming Lei, 1 file, -16/+2)
q->elevator is referenced in blk_mq_has_sqsched() without any protection: no .q_usage_counter is held, and no queue srcu or rcu read lock is held, so a use-after-free may be triggered. Fix the issue by adding one queue flag for checking if the elevator uses single-queue style dispatch. Meantime the ELEVATOR_F_MQ_AWARE elevator feature flag isn't needed any more.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220616014401.817001-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

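A sketch of the flag-based check (the flag name follows the upstream pattern but should be read as an assumption here):

    /*
     * Sketch: test an atomic queue flag (set once when a single-queue
     * style elevator is attached) instead of dereferencing q->elevator,
     * which may be freed concurrently.
     */
    static bool blk_mq_has_sqsched(struct request_queue *q)
    {
            return test_bit(QUEUE_FLAG_SQ_SCHED, &q->queue_flags);
    }
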
2022-06-16  blk-mq: protect q->elevator by ->sysfs_lock in blk_mq_elv_switch_none  (Ming Lei, 1 file, -1/+3)
The elevator can be torn down by the sysfs switch interface or by disk release, so hold ->sysfs_lock before referring to q->elevator; that way a potential use-after-free can be avoided.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220616014401.817001-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-06-16  block: Fix handling of offline queues in blk_mq_alloc_request_hctx()  (Bart Van Assche, 1 file, -0/+2)
This patch prevents test nvme/004 from triggering the following:

UBSAN: array-index-out-of-bounds in block/blk-mq.h:135:9
index 512 is out of range for type 'long unsigned int [512]'
Call Trace:
 show_stack+0x52/0x58
 dump_stack_lvl+0x49/0x5e
 dump_stack+0x10/0x12
 ubsan_epilogue+0x9/0x3b
 __ubsan_handle_out_of_bounds.cold+0x44/0x49
 blk_mq_alloc_request_hctx+0x304/0x310
 __nvme_submit_sync_cmd+0x70/0x200 [nvme_core]
 nvmf_connect_io_queue+0x23e/0x2a0 [nvme_fabrics]
 nvme_loop_connect_io_queues+0x8d/0xb0 [nvme_loop]
 nvme_loop_create_ctrl+0x58e/0x7d0 [nvme_loop]
 nvmf_create_ctrl+0x1d7/0x4d0 [nvme_fabrics]
 nvmf_dev_write+0xae/0x111 [nvme_fabrics]
 vfs_write+0x144/0x560
 ksys_write+0xb7/0x140
 __x64_sys_write+0x42/0x50
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Fixes: 20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220615210004.1031820-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

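The guard amounts to refusing an hctx that has no online CPU before indexing per-CPU data; a sketch of the check (a fragment from the allocation path, where 'data' is the blk_mq_alloc_data used by blk_mq_alloc_request_hctx()):

    /*
     * Sketch: only pick a CPU that is both mapped to this hctx and
     * online; if none exists, fail the allocation instead of indexing
     * the per-cpu context array out of bounds.
     */
    cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
    if (cpu >= nr_cpu_ids)
            goto out_queue_exit;    /* caller sees an error, not UBSAN */
    data.ctx = __blk_mq_get_ctx(q, cpu);
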
2022-06-03  Merge tag 'for-5.19/block-exec-2022-06-02' of git://git.kernel.dk/linux-block  (Linus Torvalds, 1 file, -64/+49)
Pull block request execute cleanups from Jens Axboe:
"This change was advertised in the initial core block pull request, but didn't actually make that branch as we deferred it to a post-merge pull request to avoid a bunch of cross branch issues. This series cleans up the block execute path quite nicely."

* tag 'for-5.19/block-exec-2022-06-02' of git://git.kernel.dk/linux-block:
  blk-mq: remove the done argument to blk_execute_rq_nowait
  blk-mq: avoid a mess of casts for blk_end_sync_rq
  blk-mq: remove __blk_execute_rq_nowait

2022-06-03  Merge tag 'for-5.19/block-2022-06-02' of git://git.kernel.dk/linux-block  (Linus Torvalds, 1 file, -5/+5)
Pull block fixes from Jens Axboe:
"Just a collection of fixes that have been queued up since the initial merge window pull request, the majority of which are targeted for stable as well. One bio_set fix that fixes an issue with the dm adoption of cached bio structs that got introduced in this merge window."

* tag 'for-5.19/block-2022-06-02' of git://git.kernel.dk/linux-block:
  block: Fix potential deadlock in blk_ia_range_sysfs_show()
  block: fix bio_clone_blkg_association() to associate with proper blkcg_gq
  block: remove useless BUG_ON() in blk_mq_put_tag()
  blk-mq: do not update io_ticks with passthrough requests
  block: make bioset_exit() fully resilient against being called twice
  block: use bio_queue_enter instead of blk_queue_enter in bio_poll
  block: document BLK_STS_AGAIN usage
  block: take destination bvec offsets into account in bio_copy_data_iter
  blk-iolatency: Fix inflight count imbalances and IO hangs on offline
  blk-mq: don't touch ->tagset in blk_mq_get_sq_hctx

2022-05-30  blk-mq: do not update io_ticks with passthrough requests  (Haisu Wang, 1 file, -1/+2)
Flush and passthrough requests are not accounted as normal IO on completion. To make iostat reflect slow IO, io_ticks is updated based on inflight numbers when stats are shown, which may yield inconsistent io_ticks results. So only account non-passthrough requests when checking inflight.

Fixes: 86d7331299fd ("block: update io_ticks when io hang")
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: samuelliao <samuelliao@tencent.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220530064059.1120058-1-haisuwang@tencent.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-05-28  blk-mq: remove the done argument to blk_execute_rq_nowait  (Christoph Hellwig, 1 file, -4/+1)
Let the caller set it together with the end_io_data instead of passing a pointless argument. Note that the target code did in fact already set it and then just overrode it again by calling blk_execute_rq_nowait.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-05-28  blk-mq: avoid a mess of casts for blk_end_sync_rq  (Christoph Hellwig, 1 file, -23/+20)
Instead of trying to cast a __bitwise 32-bit integer to a larger integer and then to a pointer, just use a struct with the blk_status_t and the completion on the stack, and set the end_io_data to that. Use the opportunity to move the code to where it belongs and drop the rather confusing comments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

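A sketch of the on-stack pattern in the spirit of this change (exact struct and field names are assumptions):

    /* completion state lives on the caller's stack; no casts needed */
    struct blk_rq_wait {
            struct completion done;
            blk_status_t ret;
    };

    static void blk_end_sync_rq(struct request *rq, blk_status_t ret)
    {
            struct blk_rq_wait *wait = rq->end_io_data;

            wait->ret = ret;
            complete(&wait->done);
    }

The caller then declares a struct blk_rq_wait on its own stack, points rq->end_io_data at it, and waits with wait_for_completion_io(&wait.done) before reading wait.ret.
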
2022-05-28  blk-mq: remove __blk_execute_rq_nowait  (Christoph Hellwig, 1 file, -39/+30)
We don't want to plug for synchronous execution where we immediately wait for the request. Once that is done, not a whole lot of code is shared, so just remove __blk_execute_rq_nowait.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-05-23  Merge tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block  (Linus Torvalds, 1 file, -1/+1)
Pull block updates from Jens Axboe:
"Here are the core block changes for 5.19. This contains:

 - blk-throttle accounting fix (Laibin)
 - Series removing redundant assignments (Michal)
 - Expose bio cache via the bio_set, so that DM can use it (Mike)
 - Finish off the bio allocation interface cleanups by dealing with the weirdest member of the family. bio_kmalloc combines a kmalloc for the bio and bio_vecs with a hidden bio_init call and magic cleanup semantics (Christoph)
 - Clean up the block layer API so that APIs consumed by file systems are (almost) only struct block_device based, so that file systems don't have to poke into block layer internals like the request_queue (Christoph)
 - Clean up the blk_execute_rq* API (Christoph)
 - Clean up various loose ends in the blk-cgroup code to make it easier to follow in preparation of reworking the blkcg assignment for bios (Christoph)
 - Fix use-after-free issues in BFQ when processes with merged queues get moved to different cgroups (Jan)
 - BFQ fixes (Jan)
 - Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming, Wolfgang, me)"

* tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits)
  blk-mq: fix typo in comment
  bfq: Remove bfq_requeue_request_body()
  bfq: Remove superfluous conversion from RQ_BIC()
  bfq: Allow current waker to defend against a tentative one
  bfq: Relax waker detection for shared queues
  blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE()
  blk-throttle: Set BIO_THROTTLED when bio has been throttled
  blk-cgroup: Remove unnecessary rcu_read_lock/unlock()
  blk-cgroup: always terminate io.stat lines
  block, bfq: make bfq_has_work() more accurate
  block, bfq: protect 'bfqd->queued' by 'bfqd->lock'
  block: cleanup the VM accounting in submit_bio
  block: Fix the bio.bi_opf comment
  block: reorder the REQ_ flags
  blk-iocost: combine local_stat and desc_stat to stat
  block: improve the error message from bio_check_eod
  block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone
  block: remove superfluous calls to blkcg_bio_issue_init
  kthread: unexport kthread_blkcg
  blk-cgroup: cleanup blkcg_maybe_throttle_current
  ...

2022-05-23  blk-mq: don't touch ->tagset in blk_mq_get_sq_hctx  (Ming Lei, 1 file, -4/+3)
blk_mq_run_hw_queues() could be run when there is no queued request, and after the queue is cleaned up. At that time the tagset is freed, because the tagset lifetime is covered by the driver, and it is often freed after blk_cleanup_queue() returns. So don't touch ->tagset for figuring out the current default hctx; use the mapping built in the request queue instead, so a use-after-free on the tagset can be avoided. Meantime this way should be faster than retrieving the mapping from the tagset.

Cc: "yukuai (C)" <yukuai3@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Fixes: b6e68ee82585 ("blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220522122350.743103-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-05-21  blk-mq: fix typo in comment  (Julia Lawall, 1 file, -1/+1)
Spelling mistake (triple letters) in comment. Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20220521111145.81697-29-Julia.Lawall@inria.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-05-12  blk-mq: fix passthrough plugging  (Ming Lei, 1 file, -51/+63)
First, we can't add a request to the plug list in blk_mq_request_bypass_insert(), which may be called when flushing the plug list, as that would cause nested plugging. Second, if a polled passthrough request is inserted via blk_execute_rq(), it can't be added to the plug list either, since io polling needs the request to be issued to the driver. Fix the two by moving plugging into blk_execute_rq_nowait().

Cc: Christoph Hellwig <hch@lst.de>
Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220512140010.1458645-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-05-11  block: wire-up support for passthrough plugging  (Jens Axboe, 1 file, -34/+39)
Add support for plugging in the passthrough path. When plugging is enabled, the requests are added to a plug instead of getting dispatched to the driver. And when the plug is finished, the whole batch gets dispatched via ->queue_rqs, which turns out to be more efficient. Otherwise dispatching used to happen via ->queue_rq, one request at a time.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220511054750.20432-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-04-27  Revert "block: inherit request start time from bio for BLK_CGROUP"  (Tejun Heo, 1 file, -8/+1)
This reverts commit 0006707723233cb2a9a23ca19fc3d0864835704c. It has a couple of problems:

 * bio_issue_time() is stored in bio->bi_issue truncated to 51 bits. This overflows in slightly over 26 days. Setting rq->io_start_time_ns with it means that io duration calculation would yield >26 days after 26 days of uptime. This, for example, confuses kyber, making it cause high IO latencies.

 * rq->io_start_time_ns should record the time that the IO is issued to the device so that on-device latency can be measured. However, bio_issue_time() is set before the bio goes through the rq-qos controllers (wbt, iolatency, iocost), so when the bio gets throttled in any of those mechanisms, the measured latencies make no sense: on-device latencies end up higher than request-alloc-to-completion latencies.

We'll need a smarter way to avoid calling ktime_get_ns() repeatedly back-to-back. For now, let's revert the commit.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v5.16+
Link: https://lore.kernel.org/r/YmmeOLfo5lzc+8yI@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

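For reference, the 26-day figure follows directly from the 51-bit truncation: 2^51 ns = 2,251,799,813,685,248 ns ≈ 2.25 × 10^6 s, and dividing by 86,400 s per day gives ≈ 26.06 days.
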
2022-04-15  block: don't print I/O error warning for dead disks  (Christoph Hellwig, 1 file, -1/+2)
When a disk has been marked dead, don't print warnings for I/O errors as they are very much expected.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220323163815.1526998-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-04-02  Merge tag 'for-5.18/block-2022-04-01' of git://git.kernel.dk/linux-block  (Linus Torvalds, 1 file, -9/+16)
Pull block fixes from Jens Axboe:
"Either fixes or a few additions that got missed in the initial merge window pull. In detail:

 - List iterator fix to avoid leaking value post loop (Jakob)
 - One-off fix in minor count (Christophe)
 - Fix for a regression in how io priority setting works for an exiting task (Jiri)
 - Fix a regression in this merge window with blkg_free() being called in an inappropriate context (Ming)
 - Misc fixes (Ming, Tom)"

* tag 'for-5.18/block-2022-04-01' of git://git.kernel.dk/linux-block:
  blk-wbt: remove wbt_track stub
  block: use dedicated list iterator variable
  block: Fix the maximum minor value is blk_alloc_ext_minor()
  block: restore the old set_task_ioprio() behaviour wrt PF_EXITING
  block: avoid calling blkg_free() in atomic context
  lib/sbitmap: allocate sb->map via kvzalloc_node

2022-03-31  block: use dedicated list iterator variable  (Jakob Koschel, 1 file, -9/+16)
To move the list iterator variable into the list_for_each_entry_*() macros in the future, use of the list iterator variable after the loop body should be avoided. To *never* use the list iterator variable after the loop, it was concluded that a separate iterator variable should be used instead of a found boolean [1].

Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1]
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Link: https://lore.kernel.org/r/20220331091218.641532-1-jakobkoschel@gmail.com
[axboe: move lookup to where return value is checked]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

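A sketch of the resulting pattern (the list name and predicate are hypothetical):

    LIST_HEAD(pending);                     /* example request list */
    struct request *found = NULL, *iter;

    /* the iterator stays confined to the loop body ... */
    list_for_each_entry(iter, &pending, queuelist) {
            if (rq_matches(iter)) {         /* hypothetical predicate */
                    found = iter;
                    break;
            }
    }

    /* ... and only the separate 'found' variable is used afterwards */
    if (found)
            blk_mq_free_request(found);
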
2022-03-26  Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block  (Linus Torvalds, 1 file, -1/+0)
Pull NVMe write streams removal from Jens Axboe:
"This removes the write streams support in NVMe. No vendor ever really shipped working support for this, and they are not interested in supporting it. With the NVMe support gone, we have nothing in the tree that supports this. Remove passing around of the hints.

The only discussion point in this patchset imho is the fact that the file specific write hint setting/getting fcntl helpers will now return -1/EINVAL like they did before we supported write hints. No known applications use these functions; I only know of one prototype that I helped do for RocksDB, and that's not used. That said, with a change like this, it's always a bit controversial. Alternatively, we could just make them return 0 and pretend it worked. It's placement based hints after all."

* tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block:
  fs: remove fs.f_write_hint
  fs: remove kiocb.ki_hint
  block: remove the per-bio/request write hint
  nvme: remove support or stream based temperature hint

2022-03-22  Merge tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block  (Linus Torvalds, 1 file, -151/+152)
Pull block updates from Jens Axboe:

 - BFQ cleanups and fixes (Yu, Zhang, Yahu, Paolo)
 - blk-rq-qos completion fix (Tejun)
 - blk-cgroup merge fix (Tejun)
 - Add offline error return value to distinguish it from an IO error on the device (Song)
 - IO stats fixes (Zhang, Christoph)
 - blkcg refcount fixes (Ming, Yu)
 - Fix for indefinite dispatch loop softlockup (Shin'ichiro)
 - blk-mq hardware queue management improvements (Ming)
 - sbitmap dead code removal (Ming, John)
 - Plugging merge improvements (me)
 - Show blk-crypto capabilities in sysfs (Eric)
 - Multiple delayed queue run improvement (David)
 - Block throttling fixes (Ming)
 - Start deprecating auto module loading based on dev_t (Christoph)
 - bio allocation improvements (Christoph, Chaitanya)
 - Get rid of bio_devname (Christoph)
 - bio clone improvements (Christoph)
 - Block plugging improvements (Christoph)
 - Get rid of genhd.h header (Christoph)
 - Ensure drivers use appropriate flush helpers (Christoph)
 - Refcounting improvements (Christoph)
 - Queue initialization and teardown improvements (Ming, Christoph)
 - Misc fixes/improvements (Barry, Chaitanya, Colin, Dan, Jiapeng, Lukas, Nian, Yang, Eric, Chengming)

* tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block: (127 commits)
  block: cancel all throttled bios in del_gendisk()
  block: let blkcg_gq grab request queue's refcnt
  block: avoid use-after-free on throttle data
  block: limit request dispatch loop duration
  block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative"
  sr: simplify the local variable initialization in sr_block_open()
  block: don't merge across cgroup boundaries if blkcg is enabled
  block: fix rq-qos breakage from skipping rq_qos_done_bio()
  block: flush plug based on hardware and software queue order
  block: ensure plug merging checks the correct queue at least once
  block: move rq_qos_exit() into disk_release()
  block: do more work in elevator_exit
  block: move blk_exit_queue into disk_release
  block: move q_usage_counter release into blk_queue_release
  block: don't remove hctx debugfs dir from blk_mq_exit_queue
  block: move blkcg initialization/destroy into disk allocation/release handler
  sr: implement ->free_disk to simplify refcounting
  sd: implement ->free_disk to simplify refcounting
  sd: delay calling free_opal_dev
  sd: call sd_zbc_release_disk before releasing the scsi_device reference
  ...

2022-03-11  block: flush plug based on hardware and software queue order  (Jens Axboe, 1 file, -31/+28)
We used to sort the plug list if we had multiple queues before dispatching requests to the IO scheduler. This usually isn't needed, but for certain workloads that interleave requests to disks, it's less efficient to process the plug list one-by-one if everything is interleaved. Don't sort the list; instead skip through it and flush out entries that have the same target at the same time.

Fixes: df87eb0fce8f ("block: get rid of plug list sorting")
Reported-and-tested-by: Song Liu <song@kernel.org>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-03-09  block: don't remove hctx debugfs dir from blk_mq_exit_queue  (Ming Lei, 1 file, -1/+0)
The queue's top debugfs dir is removed from blk_release_queue(), so all hctx debugfs dirs are removed from there. Given blk_mq_exit_queue() is only called from blk_cleanup_queue(), it isn't necessary to remove the hctx debugfs dirs from blk_mq_exit_queue(). So remove it from blk_mq_exit_queue().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220308055200.735835-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-03-09  blk-mq: handle already freed tags gracefully in blk_mq_free_rqs  (Ming Lei, 1 file, -0/+3)
To simplify further changes, allow blk_mq_free_rqs to be called twice on a queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
[hch: split out from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220308055200.735835-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-03-09  blk-mq: do not include passthrough requests in I/O accounting  (Christoph Hellwig, 1 file, -3/+8)
I/O accounting buckets I/O into the read/write/discard categories, into which passthrough I/O does not fit at all. It also accounts to the block_device, which may not even exist for passthrough I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220308055200.735835-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-03-09  blk-mq: manage hctx map via xarray  (Ming Lei, 1 file, -38/+24)
First, the code becomes cleaner by switching from a plain array to an xarray. Second, a use-after-free on q->queue_hw_ctx can be fixed, because queue_for_each_hw_ctx() may be run while an update of nr_hw_queues is in progress. With this patch, q->hctx_table is defined as an xarray, and this structure shares its lifetime with the request queue, so queue_for_each_hw_ctx() can use q->hctx_table to look up hctxs reliably.

Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220308073219.91173-7-ming.lei@redhat.com
[axboe: fix blk_mq_hw_ctx forward declaration]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

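A sketch of what iteration looks like once the table is an xarray (simplified, illustrative):

    /*
     * Sketch: q->hctx_table shares the request queue's lifetime, so
     * walking it is safe even while nr_hw_queues is being updated;
     * note xa_for_each() requires an 'unsigned long' index.
     */
    static void walk_hw_queues(struct request_queue *q)
    {
            struct blk_mq_hw_ctx *hctx;
            unsigned long i;

            xa_for_each(&q->hctx_table, i, hctx)
                    pr_info("hctx %lu: queue_num %u\n", i, hctx->queue_num);
    }
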
2022-03-09  blk-mq: prepare for implementing hctx table via xarray  (Ming Lei, 1 file, -14/+16)
A use-after-free on q->queue_hw_ctx between queue_for_each_hw_ctx() and blk_mq_update_nr_hw_queues() is otherwise inevitable, and converting to an xarray can fix the uaf; meantime the code gets cleaner. Prepare for converting q->queue_hw_ctx into an xarray: since xa_for_each() can only accept 'unsigned long' as index, change the type of the hctx index in queue_for_each_hw_ctx() to 'unsigned long'.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220308073219.91173-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-03-09  blk-mq: reconfigure poll after queue map is changed  (Ming Lei, 1 file, -3/+13)
The queue map can be changed when updating nr_hw_queues, so we need to reconfigure the queue's poll capability. Add one helper for doing this job.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220308073219.91173-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-03-09  blk-mq: simplify reallocation of hw ctxs a bit  (Ming Lei, 1 file, -19/+14)
blk_mq_alloc_and_init_hctx() has already taken reuse into account, so there is no need to do it outside; then we can simplify blk_mq_realloc_hw_ctxs().

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220308073219.91173-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-03-09  blk-mq: figure out correct numa node for hw queue  (Ming Lei, 1 file, -6/+30)
The current code always uses the default queue map and hw queue index for figuring out the numa node for a hw queue. This isn't correct because blk-mq supports three queue maps, and the correct queue map should be used for the specified hw queue.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220308073219.91173-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-03-09  block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection  (Shin'ichiro Kawasaki, 1 file, -12/+23)
Commit 9d497e2941c3 ("block: don't protect submit_bio_checks by q_usage_counter") moved blk_mq_attempt_bio_merge and rq_qos_throttle calls out of q_usage_counter protection. However, these functions require q_usage_counter protection. The blk_mq_attempt_bio_merge call without the protection resulted in blktests block/005 failure with KASAN null-ptr-deref or use-after-free at bio merge. The rq_qos_throttle call without the protection caused kernel hang at qos throttle. To fix the failures, move the blk_mq_attempt_bio_merge and rq_qos_throttle calls back to q_usage_counter protection.

Fixes: 9d497e2941c3 ("block: don't protect submit_bio_checks by q_usage_counter")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20220308080915.3473689-1-shinichiro.kawasaki@wdc.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-03-07  block: remove the per-bio/request write hint  (Christoph Hellwig, 1 file, -1/+0)
With the NVMe support for this gone, there are no consumers of these hints left, so remove them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-03-07  Merge branch 'for-5.18/block' into for-5.18/write-streams  (Jens Axboe, 1 file, -42/+22)
* for-5.18/block: (96 commits)
  block: remove bio_devname
  ext4: stop using bio_devname
  raid5-ppl: stop using bio_devname
  raid1: stop using bio_devname
  md-multipath: stop using bio_devname
  dm-integrity: stop using bio_devname
  dm-crypt: stop using bio_devname
  pktcdvd: remove a pointless debug check in pkt_submit_bio
  block: remove handle_bad_sector
  block: fix and cleanup bio_check_ro
  bfq: fix use-after-free in bfq_dispatch_request
  blk-crypto: show crypto capabilities in sysfs
  block: don't delete queue kobject before its children
  block: simplify calling convention of elv_unregister_queue()
  block: remove redundant semicolon
  block: default BLOCK_LEGACY_AUTOLOAD to y
  block: update io_ticks when io hang
  block, bfq: don't move oom_bfqq
  block, bfq: avoid moving bfqq to it's parent bfqg
  block, bfq: cleanup bfq_bfqq_to_bfqg()
  ...

2022-02-17  blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues  (David Jeffery, 1 file, -0/+8)
When blk_mq_delay_run_hw_queues sets an hctx to run in the future, it can reset the delay length for an already pending delayed work run_work. This creates a scenario where multiple hctxs may have their queues set to run, but if one runs first and finds nothing to do, it can reset the delay of another hctx and stall the other hctx's ability to run requests. To avoid this I/O stall, when an hctx's run_work is already pending, leave it untouched to run at its current designated time rather than extending its delay. The work will still run, which keeps closed the race that blk_mq_delay_run_hw_queues is needed for, while also avoiding the I/O stall.

Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220131203337.GA17666@redhat
Signed-off-by: Jens Axboe <axboe@kernel.dk>

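A sketch of the guard (a simplified fragment, not the verbatim patch):

    /*
     * Sketch: if a delayed run is already pending for this hctx, leave
     * its (earlier) deadline untouched instead of pushing it out.
     */
    queue_for_each_hw_ctx(q, hctx, i) {
            if (delayed_work_pending(&hctx->run_work))
                    continue;       /* keep the earlier designated time */
            blk_mq_delay_run_hw_queue(hctx, msecs);
    }
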
2022-02-17  block: move blk_crypto_bio_prep() out of blk-mq.c  (Ming Lei, 1 file, -3/+0)
blk_crypto_bio_prep() is called for both bio based and blk-mq drivers, so move it out of blk-mq.c; then we can unify this kind of handling.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-02-17  blk-mq: remove the request_queue argument to blk_insert_cloned_request  (Christoph Hellwig, 1 file, -5/+4)
The request must be submitted to the queue it was allocated for, so remove the extra request_queue argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-02-17  blk-mq: fold blk_cloned_rq_check_limits into blk_insert_cloned_request  (Christoph Hellwig, 1 file, -33/+5)
Fold blk_cloned_rq_check_limits into its only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-02-17  blk-mq: make the blk-mq stacking code optional  (Christoph Hellwig, 1 file, -0/+2)
The code to stack blk-mq drivers is only used by dm-multipath, and will preferably stay that way. Make it optional and only selected by device mapper, so that the buildbots more easily catch abuses like the one that slipped into the ufs driver in the last merge window. Another positive side effect is that kernel builds without device mapper shrink a little bit as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-02-11  block: introduce block_rq_error tracepoint  (Yang Shi, 1 file, -1/+3)
Currently, rasdaemon uses the existing tracepoint block_rq_complete and filters out non-error cases in order to capture block disk errors. But there are a few problems with this approach:

 1. Even though the kernel trace filter can do the filtering work, there is still some overhead after we enable this tracepoint.
 2. The filter is merely based on errno, which does not align with the kernel logic that checks the errors for print_req_error().
 3. block_rq_complete only provides dev major and minor to identify the block device, which is not convenient to use in user-space.

So introduce a new tracepoint, block_rq_error, just for the error case. With this patch, rasdaemon can switch to block_rq_error. Since the new tracepoint has a similar implementation to block_rq_complete, move the existing code from TRACE_EVENT block_rq_complete() into a new event class, block_rq_completion(), then add events for block_rq_complete and block_rq_error respectively from the newly created event class, per the suggestion from Chaitanya Kulkarni.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220210225222.260069-1-shy828301@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-02-11  block: Add handling for zone append command in blk_complete_request  (Pankaj Raghav, 1 file, -0/+4)
The zone append command needs special handling to update the bi_sector field in the bio struct with the actual position of the data on the device; it is stored in the __sector field of the request struct.

Fixes: 5581a5ddfe8d ("block: add completion handler for fast path")
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220211093425.43262-2-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

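The fix-up itself is small; a sketch (a fragment, simplified to the single-bio case):

    /*
     * Sketch: for zone append, the drive chose the actual write
     * position; copy it from rq->__sector back into the bio so the
     * submitter sees where the data landed.
     */
    if (req_op(rq) == REQ_OP_ZONE_APPEND)
            rq->bio->bi_iter.bi_sector = rq->__sector;
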
2022-02-04  block: pass a block_device to bio_clone_fast  (Christoph Hellwig, 1 file, -2/+2)
Pass a block_device to bio_clone_fast and __bio_clone_fast and give the functions more suitable names.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-01-26  blk-mq: fix missing blk_account_io_done() in error path  (Yu Kuai, 1 file, -0/+2)
If blk_mq_request_issue_directly() fails when called from blk_insert_cloned_request(), the request will already have been accounted as started. Currently, blk_insert_cloned_request() is only called by dm, and such a request won't be accounted as done by dm. In the normal path, io is accounted as started from blk_mq_bio_to_request() when the request is allocated, and such io is accounted as done from __blk_mq_end_request_acct() whether it succeeded or failed. Thus add blk_account_io_done() to fix the problem.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220126012132.3111551-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-01-23  Merge tag 'bitmap-5.17-rc1' of git://github.com/norov/linux  (Linus Torvalds, 1 file, -1/+1)
Pull bitmap updates from Yury Norov:

 - introduce for_each_set_bitrange()
 - use find_first_*_bit() instead of find_next_*_bit() where possible
 - unify for_each_bit() macros

* tag 'bitmap-5.17-rc1' of git://github.com/norov/linux:
  vsprintf: rework bitmap_list_string
  lib: bitmap: add performance test for bitmap_print_to_pagebuf
  bitmap: unify find_bit operations
  mm/percpu: micro-optimize pcpu_is_populated()
  Replace for_each_*_bit_from() with for_each_*_bit() where appropriate
  find: micro-optimize for_each_{set,clear}_bit()
  include/linux: move for_each_bit() macros from bitops.h to find.h
  cpumask: replace cpumask_next_* with cpumask_first_* where appropriate
  tools: sync tools/bitmap with mother linux
  all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate
  cpumask: use find_first_and_bit()
  lib: add find_first_and_bit()
  arch: remove GENERIC_FIND_FIRST_BIT entirely
  include: move find.h from asm_generic to linux
  bitops: move find_bit_*_le functions from le.h to find.h
  bitops: protect find_first_{,zero}_bit properly

2022-01-18  block: assign bi_bdev for cloned bios in blk_rq_prep_clone  (Christoph Hellwig, 1 file, -0/+1)
bio_clone_fast() sets the cloned bio to have the same ->bi_bdev as the source bio. This means that when request-based dm called setup_clone(), the cloned bio had its ->bi_bdev pointing to the dm device. After commit 0b6e522cdc4a ("blk-mq: use ->bi_bdev for I/O accounting"), __blk_account_io_start() started using the request's ->bio->bi_bdev for I/O accounting, if it was set. This caused IO going to the underlying devices to use the dm device for their I/O accounting. Set up the proper ->bi_bdev in blk_rq_prep_clone, based on the whole-device bdev of the queue the request is cloned onto.

Fixes: 0b6e522cdc4a ("blk-mq: use ->bi_bdev for I/O accounting")
Reported-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[hch: the commit message is mostly from a different patch from Benjamin]
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
Link: https://lore.kernel.org/r/20220118070444.1241739-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-01-15  cpumask: replace cpumask_next_* with cpumask_first_* where appropriate  (Yury Norov, 1 file, -1/+1)
cpumask_first() is a more effective analogue of the 'next' version when n == -1 (which means start == 0). This patch replaces 'next' with 'first' where things look trivial. There's no cpumask_first_zero() function, so create it.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>

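The trivial replacement pattern looks like this (illustrative fragment; 'mask' is any struct cpumask pointer):

    /* before: scan starting after bit -1, i.e. from bit 0 */
    cpu = cpumask_next(-1, mask);

    /* after: same result, stated directly */
    cpu = cpumask_first(mask);
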
2022-01-10  block: don't protect submit_bio_checks by q_usage_counter  (Ming Lei, 1 file, -26/+13)
Commit cc9c884dd7f4 ("block: call submit_bio_checks under q_usage_counter") uses q_usage_counter to protect submit_bio_checks, for avoiding IO after the disk is deleted by del_gendisk(). It turns out the protection isn't necessary, because once blk_mq_freeze_queue_wait() in del_gendisk() returns:

 1) all in-flight IO has been done;
 2) all new IO will be failed in __bio_queue_enter(), because q_usage_counter is dead and GD_DEAD is set;
 3) both the disk and the request queue instance are safe, since the caller of submit_bio() guarantees that the disk can't be closed.

Once submit_bio_checks() doesn't need the protection of q_usage_counter, we can move submit_bio_checks before calling blk_mq_submit_bio() and ->submit_bio(). With this change, we needn't throttle the queue while holding one allocated request, so a precise driver tag or request won't be wasted in throttling. Meantime we can unify the bio check for both bio based and request based drivers.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220104134223.590803-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2021-12-21  blk-mq: blk-mq: check quiesce state before queue_rqs  (Keith Busch, 1 file, -1/+9)
The low level drivers don't expect to see new requests after a successful quiesce completes. Check the queue quiesce state within the rcu protected area prior to calling the driver's queue_rqs().

Fixes: 3c67d44de787 ("block: add mq_ops->queue_rqs hook")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20211220205919.180191-1-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

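A sketch of the check (a simplified fragment; the real code sits inside the plug-flush path):

    /*
     * Sketch: re-check the quiesce flag inside the RCU critical section,
     * so a concurrent quiesce cannot have new requests slipped to the
     * driver after it believes the quiesce completed.
     */
    rcu_read_lock();
    if (!blk_queue_quiesced(q))
            q->mq_ops->queue_rqs(&plug->mq_list);
    rcu_read_unlock();
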
2021-12-16  block: add mq_ops->queue_rqs hook  (Jens Axboe, 1 file, -3/+23)
If we have a list of requests in our plug list, send it to the driver in one go, if possible. The driver must set mq_ops->queue_rqs() to support this; if not, the usual one-by-one path is used.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

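A sketch of the dispatch decision (simplified and illustrative; the rq_list helpers are the kernel's singly-linked request list API, and the fallback in the real code carries more state):

    /*
     * Sketch: hand the whole plugged list to ->queue_rqs() when the
     * driver provides it; otherwise fall back to one-at-a-time issue.
     */
    static void plug_dispatch(struct request_queue *q, struct request **rqlist)
    {
            if (q->mq_ops->queue_rqs) {
                    q->mq_ops->queue_rqs(rqlist);   /* batch submission */
                    return;
            }

            while (!rq_list_empty(*rqlist)) {
                    struct request *rq = rq_list_pop(rqlist);

                    blk_mq_request_issue_directly(rq, true);
            }
    }
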
2021-12-16  block: add completion handler for fast path  (Jens Axboe, 1 file, -1/+42)
The batched completions only deal with non-partial requests anyway, and they don't deal with any requests that have errors. Add a completion handler that assumes it's a full request and that it's all being ended successfully.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>