path: root/block
Age         Commit message  (Author; files changed, lines -/+)

2023-02-06  blk-mq: Fix potential io hung for shared sbitmap per tagset  (Kemeng Shi; 1 file, -2/+4)

Commit f906a6a0f4268 ("blk-mq: improve tag waiting setup for non-shared tags") marks restart only for unshared tags as an optimization. At that time, tags were only shared between queues, and whether tags are shared could be checked by testing BLK_MQ_F_TAG_SHARED. Later, commit 32bc15afed04b ("blk-mq: Facilitate a shared sbitmap per tagset") enabled tag sharing between hctxs inside a queue. We only mark restart for shared hctxs inside a queue, which may cause an IO hang if the hctx about to be marked for restart currently has no tag allocated. Fix this by waiting on the sbitmap_queue instead of marking restart in the shared-hctxs case.

Fixes: 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-06  blk-mq: wait on correct sbitmap_queue in blk_mq_mark_tag_wait  (Kemeng Shi; 1 file, -1/+5)

In the shared-queues case, we only wait on bitmap_tags when we fail to get a driver tag. However, the rq could have come from breserved_tags, in which case two problems occur:

1. An IO hang if no tag is currently allocated from bitmap_tags.
2. An unnecessary wakeup when a tag is freed to bitmap_tags while none is freed to breserved_tags.

Fix this by waiting on the bitmap the rq was allocated from.

Fixes: f906a6a0f426 ("blk-mq: improve tag waiting setup for non-shared tags")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-06  blk-mq: remove stale comment for blk_mq_sched_mark_restart_hctx  (Kemeng Shi; 1 file, -2/+1)

Commit 97889f9ac24f8 ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()") removed the handling of TAG_SHARED in restart, and with it shared_hctx_restart, which counted how many hardware queues were marked for restart. Remove the stale comment claiming that we still count hardware queues that need restart.

Fixes: 97889f9ac24f ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-06  blk-mq: avoid sleep in blk_mq_alloc_request_hctx  (Kemeng Shi; 1 file, -1/+2)

Commit 1f5bd336b9150 ("blk-mq: add blk_mq_alloc_request_hctx") added blk_mq_alloc_request_hctx to send commands to a specific queue. If BLK_MQ_REQ_NOWAIT is not set in the tag allocation, we may migrate to a different hctx after sleeping and get a tag from an unexpected hctx, so BLK_MQ_REQ_NOWAIT must be set in the flags for blk_mq_alloc_request_hctx.

After commit 600c3b0cea784 ("blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx"), blk_mq_alloc_request_hctx returns -EINVAL only if neither BLK_MQ_REQ_NOWAIT nor BLK_MQ_REQ_RESERVED is set, instead of whenever BLK_MQ_REQ_NOWAIT is not set. So if BLK_MQ_REQ_NOWAIT is not set but BLK_MQ_REQ_RESERVED is, blk_mq_alloc_request_hctx could allocate a tag from an unexpected hctx. What we need here is to return -EINVAL if either BLK_MQ_REQ_NOWAIT or BLK_MQ_REQ_RESERVED is not set.

Currently both BLK_MQ_REQ_NOWAIT and BLK_MQ_REQ_RESERVED are set when a specific hctx is needed, in nvme_auth_submit, nvmf_connect_io_queue and nvmf_connect_admin_queue. Fix the potential missing-BLK_MQ_REQ_NOWAIT case for future callers.

Fixes: 600c3b0cea78 ("blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

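As a rough userspace model of the stricter check (the flag bit values here are placeholders, not the kernel's actual definitions, and -EINVAL is shown as -22):

  #include <stdio.h>

  #define BLK_MQ_REQ_NOWAIT   (1u << 0)  /* placeholder bit */
  #define BLK_MQ_REQ_RESERVED (1u << 1)  /* placeholder bit */

  /* Old check: rejects only when *both* flags are missing, so
   * RESERVED-without-NOWAIT slips through and may sleep and migrate. */
  static int old_check(unsigned int flags)
  {
      return !(flags & (BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED)) ? -22 : 0;
  }

  /* Stricter check described above: each flag must be set individually. */
  static int new_check(unsigned int flags)
  {
      if (!(flags & BLK_MQ_REQ_NOWAIT) || !(flags & BLK_MQ_REQ_RESERVED))
          return -22;  /* -EINVAL */
      return 0;
  }

  int main(void)
  {
      unsigned int flags = BLK_MQ_REQ_RESERVED;  /* NOWAIT missing */
      printf("old=%d new=%d\n", old_check(flags), new_check(flags));
      return 0;
  }
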
2023-02-06  block: stub out and deprecated the capability attribute on the gendisk  (Christoph Hellwig; 1 file, -3/+2)

The capability attribute was added in 2017 to expose the kernel-internal GENHD_FL_MEDIA_CHANGE_NOTIFY to userspace, without ever adding a value to a UAPI header, and without any driver ever setting it until it was finally removed in Linux 5.7. Deprecate the file and always return 0 instead of exposing the remaining internal and frequently renumbered gendisk flags.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230203150209.3199115-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

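A minimal sketch of what such a deprecated-attribute stub can look like; the function name and warning text here are assumptions, not a quote of the actual patch:

  static ssize_t disk_capability_show(struct device *dev,
                                      struct device_attribute *attr, char *buf)
  {
      pr_warn_once("the capability attribute has been deprecated.\n");
      return sprintf(buf, "0\n");  /* always report 0 */
  }
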
2023-02-06  blk-cgroup: fix freeing NULL blkg in blkg_create  (Christoph Hellwig; 1 file, -1/+2)

new_blkg can be NULL if the caller didn't pass in a pre-allocated blkg. Don't try to free it in that case.

Fixes: 27b642b07a4a ("blk-cgroup: simplify blkg freeing from initialization failure paths")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230206150201.3438972-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  Merge tag 'block-6.2-2023-02-03' of git://git.kernel.dk/linux  (Linus Torvalds; 4 files, -4/+11)

Pull block fixes from Jens Axboe:
 "A bit bigger than I'd like at this point, but mostly a bunch of
  little fixes. In detail:

  - NVMe pull request via Christoph:
      - Fix a missing queue put in nvmet_fc_ls_create_association
        (Amit Engel)
      - Clear queue pointers on tag_set initialization failure
        (Maurizio Lombardi)
      - Use workqueue dedicated to authentication (Shin'ichiro
        Kawasaki)

  - Fix for an overflow in ublk (Liu)

  - Fix for leaking a queue reference in block cgroups (Ming)

  - Fix for a use-after-free in BFQ (Yu)"

* tag 'block-6.2-2023-02-03' of git://git.kernel.dk/linux:
  blk-cgroup: don't update io stat for root cgroup
  nvme-auth: use workqueue dedicated to authentication
  nvme: clear the request_queue pointers on failure in nvme_alloc_io_tag_set
  nvme: clear the request_queue pointers on failure in nvme_alloc_admin_tag_set
  nvme-fc: fix a missing queue put in nvmet_fc_ls_create_association
  block: Fix the blk_mq_destroy_queue() documentation
  block: ublk: extending queue_size to fix overflow
  block, bfq: fix uaf for bfqq in bic_set_bfqq()

2023-02-03  block: factor out a bvec_set_page helper  (Christoph Hellwig; 2 files, -16/+3)

Add a helper to initialize a bvec based on a page pointer. This will help remove various open-coded bvec initializations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

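The helper amounts to three stores; here is a self-contained sketch with a stand-in page type (the real definition lives in the bvec header, so treat this as illustrative):

  #include <stdio.h>

  struct page;  /* opaque stand-in */
  struct bio_vec {
      struct page *bv_page;
      unsigned int bv_len;
      unsigned int bv_offset;
  };

  /* One call replaces three open-coded field assignments. */
  static void bvec_set_page(struct bio_vec *bv, struct page *page,
                            unsigned int len, unsigned int offset)
  {
      bv->bv_page = page;
      bv->bv_len = len;
      bv->bv_offset = offset;
  }

  int main(void)
  {
      struct bio_vec bv;

      bvec_set_page(&bv, NULL, 4096, 0);  /* placeholder page pointer */
      printf("len=%u offset=%u\n", bv.bv_len, bv.bv_offset);
      return 0;
  }
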
2023-02-03  blk-cgroup: move the cgroup information to struct gendisk  (Christoph Hellwig; 5 files, -44/+48)

cgroup information only makes sense on a live gendisk that allows file system I/O (which includes the raw block device). So move the cgroup-related members over.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-cgroup: pass a gendisk to blkg_lookup  (Christoph Hellwig; 2 files, -18/+18)

Pass a gendisk to blkg_lookup and use that to find the match as part of phasing out usage of the request_queue in the blk-cgroup code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-cgroup: pass a gendisk to pd_alloc_fn  (Christoph Hellwig; 7 files, -22/+21)

No need for the request_queue here; pass a gendisk and extract the node ids from that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-cgroup: pass a gendisk to blkcg_{de,}activate_policy  (Christoph Hellwig; 8 files, -25/+25)

Prepare for storing the blkcg information in the gendisk instead of the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-rq-qos: store a gendisk instead of request_queue in struct rq_qos  (Christoph Hellwig; 6 files, -31/+27)

This is what about half of the users already want, and the number is only going to grow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-rq-qos: constify rq_qos_ops  (Christoph Hellwig; 5 files, -6/+6)

These op vectors are constant, so mark them const.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-rq-qos: make rq_qos_add and rq_qos_del more useful  (Christoph Hellwig; 5 files, -29/+21)

Switch to passing a gendisk, make rq_qos_add initialize all required fields, and drop the no longer required q argument from rq_qos_del.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-rq-qos: move rq_qos_add and rq_qos_del out of line  (Christoph Hellwig; 2 files, -59/+62)

These two functions are rather large and not in a fast path, so move them out of line.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-wbt: open code wbt_queue_depth_changed in wbt_init  (Christoph Hellwig; 1 file, -2/+2)

wbt_queue_depth_changed just updates a field and calls another function. Open code it in wbt_init, so that the local queue variable can be used instead of the one stored in the rq_qos. This will allow delaying that rq_qos->queue assignment in a subsequent patch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-wbt: move private information from blk-wbt.h to blk-wbt.c  (Christoph Hellwig; 4 files, -86/+79)

A large part of blk-wbt.h is only used in blk-wbt.c, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-wbt: pass a gendisk to wbt_init  (Christoph Hellwig; 3 files, -5/+6)

Pass a gendisk to wbt_init to prepare for phasing out usage of the request_queue in the blk-cgroup code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-wbt: pass a gendisk to wbt_{enable,disable}_default  (Christoph Hellwig; 5 files, -12/+13)

Pass a gendisk to wbt_enable_default and wbt_disable_default to prepare for phasing out usage of the request_queue in the blk-cgroup code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-cgroup: store a gendisk to throttle in struct task_struct  (Christoph Hellwig; 1 file, -17/+15)

Switch from a request_queue pointer and reference to a gendisk once for the throttle information in struct task_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Link: https://lore.kernel.org/r/20230203150400.3199230-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-cgroup: pin the gendisk in struct blkcg_gq  (Christoph Hellwig; 7 files, -33/+31)

Currently each blkcg_gq holds a request_queue reference, which is what the policies use. But a lot of these interfaces will move over to use a gendisk, so store a disk in struct blkcg_gq and hold a reference to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-cgroup: remove the !bdi->dev check in blkg_dev_name  (Christoph Hellwig; 1 file, -1/+1)

bdi_dev_name already performs the same check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-cgroup: simplify blkg freeing from initialization failure paths  (Christoph Hellwig; 1 file, -20/+7)

There is no need to delay freeing a blkg to a workqueue when freeing it after an initialization failure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  blk-cgroup: improve error unwinding in blkg_alloc  (Christoph Hellwig; 1 file, -20/+19)

Unwind only the previous initialization steps that happened in blkg_alloc, using goto-based unwinding. This avoids the need for the !queue special case in blkg_free and thus ensures that any blkg seen outside of blkg_alloc is always fully constructed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

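The general shape of goto-based unwinding, as a self-contained sketch with generic allocations standing in for the blkg initialization steps: each failure label releases exactly the steps that already succeeded, so a partially constructed object never escapes.

  #include <stdlib.h>

  struct obj { void *a, *b, *c; };

  static struct obj *obj_alloc(void)
  {
      struct obj *o = malloc(sizeof(*o));

      if (!o)
          return NULL;
      o->a = malloc(16);
      if (!o->a)
          goto out_free_obj;
      o->b = malloc(16);
      if (!o->b)
          goto out_free_a;
      o->c = malloc(16);
      if (!o->c)
          goto out_free_b;
      return o;  /* fully constructed or NULL, nothing in between */

  out_free_b:
      free(o->b);
  out_free_a:
      free(o->a);
  out_free_obj:
      free(o);
      return NULL;
  }

  int main(void)
  {
      struct obj *o = obj_alloc();

      if (o) {
          free(o->c);
          free(o->b);
          free(o->a);
          free(o);
      }
      return 0;
  }
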
2023-02-03  blk-cgroup: delay blk-cgroup initialization until add_disk  (Christoph Hellwig; 1 file, -8/+9)

There is no need to initialize the cgroup code before the disk is marked live. Moving the cgroup initialization later will help to have a fully initialized struct device in the gendisk for the cgroup code to use in the future. Similarly, tear the cgroup information down in del_gendisk to be symmetric, and because none of the cgroup tracking is needed once non-passthrough I/O stops.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  block: don't call blk_throtl_stat_add for non-READ/WRITE commands  (Christoph Hellwig; 1 file, -1/+2)

blk_throtl_stat_add is called from blk_stat_add explicitly, unlike the other stats that go through q->stats->callbacks. To prepare for cgroup data moving to the gendisk, ensure blk_throtl_stat_add is only called for the plain READ and WRITE commands that it actually handles internally, as blk_stat_add can also be called for passthrough commands on queues that do not have a gendisk associated with them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-02-03  block: remove ->rw_page  (Christoph Hellwig; 1 file, -78/+0)

The ->rw_page method is a special-purpose bypass of the usual bio handling path that is limited to single-page reads and writes, and is synchronous, which causes a lot of extra code in the drivers, callers and the block layer. The only remaining user is the MM swap code. Switch that swap code to simply submit a single-vec on-stack bio and synchronously wait on it, based on a newly added QUEUE_FLAG_SYNCHRONOUS flag set by the drivers that currently implement ->rw_page. While this touches one extra cache line and executes extra code, it simplifies the block layer and drivers and ensures that all features are properly supported by all drivers; e.g. right now ->rw_page bypassed cgroup writeback entirely.

[akpm@linux-foundation.org: fix comment typo, per Dan]
Link: https://lkml.kernel.org/r/20230125133436.447864-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

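A hedged sketch of the replacement pattern the commit describes, a synchronous single-page read built from an on-stack bio; the exact call sequence is an assumption based on the current bio API, not a quote of the patch:

  /* kernel-style fragment, not standalone userspace code */
  static int sync_read_page(struct block_device *bdev, sector_t sector,
                            struct page *page)
  {
      struct bio_vec bv;
      struct bio bio;

      bio_init(&bio, bdev, &bv, 1, REQ_OP_READ);
      bio.bi_iter.bi_sector = sector;
      __bio_add_page(&bio, page, PAGE_SIZE, 0);
      return submit_bio_wait(&bio);  /* sleeps until the bio completes */
  }
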
2023-02-02  blk-cgroup: don't update io stat for root cgroup  (Ming Lei; 1 file, -0/+4)

We source root cgroup stats from the system-wide stats (see blkcg_print_stat and blkcg_rstat_flush), so don't update io stat for the root cgroup. This fixes the blkg leak introduced in commit 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()"), which started grabbing a blkg reference when adding iostat_cpu into the percpu blkcg list; for the root cgroup that stat is never consumed by blkcg_rstat_flush(), where the blkg reference would be dropped.

Tested-by: Bart van Assche <bvanassche@acm.org>
Reported-by: Bart van Assche <bvanassche@acm.org>
Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230202021804.278582-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-31  block: Fix the blk_mq_destroy_queue() documentation  (Bart Van Assche; 1 file, -2/+3)

Commit 2b3f056f72e5 moved a blk_put_queue() call from blk_mq_destroy_queue() into its callers. Reflect this change in the documentation block above blk_mq_destroy_queue().

Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Keith Busch <kbusch@kernel.org>
Fixes: 2b3f056f72e5 ("blk-mq: move the call to blk_put_queue out of blk_mq_destroy_queue")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230130211233.831613-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-30  block: Default to use cgroup support for BFQ  (Ulf Hansson; 1 file, -0/+1)

Assuming that both Kconfig options, BLK_CGROUP and IOSCHED_BFQ, are set, we most likely want cgroup support for BFQ too (BFQ_GROUP_IOSCHED), so let's make it default y.

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20230130121240.159456-1-ulf.hansson@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-30  block, bfq: remove unused bfq_wr_max_time in struct bfq_data  (Kemeng Shi; 2 files, -6/+0)

bfqd->bfq_wr_max_time is set to 0 in bfq_init_queue and is never changed. It is only used in bfq_wr_duration when bfq_wr_max_time > 0, which never holds, so bfqd->bfq_wr_max_time is never actually used. Just remove it.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-9-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-30  block, bfq: remove unnecessary goto tag in bfq_dispatch_rq_from_bfqq  (Kemeng Shi; 1 file, -6/+3)

We jump to the label only to return the current rq. Return directly so the label can be removed.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116095153.3810101-8-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-30  block, bfq: remove redundant check in bfq_put_cooperator  (Kemeng Shi; 1 file, -2/+0)

We have already avoided a circular list in bfq_setup_merge (see the comments in bfq_setup_merge() for details), so a bfq_queue will not appear in its own new_bfqq list. Just remove this check.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-7-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-30  block, bfq: remove unnecessary dereference to get async_bfqq  (Kemeng Shi; 1 file, -1/+1)

async_bfqq is assigned bfqq->bic->bfqq[0]; use it directly.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-30  block, bfq: use helper macro RQ_BFQQ to get bfqq of request  (Kemeng Shi; 1 file, -3/+3)

Use the helper macro RQ_BFQQ to get the bfqq of a request.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-5-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-30  block, bfq: initialize bfqq->decrease_time_jif correctly  (Kemeng Shi; 1 file, -0/+2)

The inject limit is updated or reset when time_is_before_eq_jiffies(decrease_time_jif + several msecs) holds or when the think-time state changes. decrease_time_jif is initialized to 0 and is set to the current jiffies when the inject limit is updated or reset. If jiffies is slightly greater than LONG_MAX, time_is_after_eq_jiffies(0) will stay true for a long time, and so will time_is_after_eq_jiffies(decrease_time_jif + several msecs). If the think-time state never changes, the injection will then not work as expected for a long time.

To be more specific: bfq_update_inject_limit may be triggered when jiffies passes decrease_time_jif + msecs_to_jiffies(10) in bfq_add_request, by setting bfqd->wait_dispatch to true. bfq_reset_inject_limit is called under two conditions:
1. jiffies passes bfqq->decrease_time_jif + msecs_to_jiffies(1000) in bfq_add_request.
2. jiffies passes bfqq->decrease_time_jif + msecs_to_jiffies(100), or the bfq think-time state changes from short to long.

Fix this by initializing bfqq->decrease_time_jif to the current jiffies, so that service injection is triggered soon after the injection conditions are met.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-4-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

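The wraparound can be modeled in userspace with the same signed-difference trick the jiffies helpers use (the values below are illustrative, not the kernel's):

  #include <limits.h>
  #include <stdio.h>

  /* mirrors the kernel's time_after_eq(a, b) */
  #define time_after_eq(a, b) ((long)((a) - (b)) >= 0)

  int main(void)
  {
      unsigned long jiffies = (unsigned long)LONG_MAX + 1000; /* just wrapped */
      unsigned long decrease_time_jif = 0;  /* the uninitialized default */
      unsigned long deadline = decrease_time_jif + 10;

      /* "deadline still ahead of jiffies?" wrongly answers yes (1), and
       * stays yes for roughly LONG_MAX ticks, deferring the update. */
      printf("deadline ahead: %d\n", time_after_eq(deadline, jiffies));

      /* the fix: start from the current jiffies, so the deadline is a
       * real, near-future instant that expires after 10 ticks */
      decrease_time_jif = jiffies;
      deadline = decrease_time_jif + 10;
      printf("deadline ahead: %d\n", time_after_eq(deadline, jiffies));
      return 0;
  }
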
2023-01-30  block, bfq: remove unsed parameter reason in bfq_bfqq_is_slow  (Kemeng Shi; 1 file, -3/+2)

The parameter reason is never used, so just remove it.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-30  block, bfq: correctly raise inject limit in bfq_choose_bfqq_for_injection  (Kemeng Shi; 1 file, -6/+4)

bfq_choose_bfqq_for_injection may temporarily raise the inject limit to one request if the current inject_limit is 0, before searching for a source queue for injection. However, the search below resets the limit to that of bfqd->in_service_queue, which is zero for the temporarily raised limit, so the raised limit never works as expected.

Assigning the limit from bfqd->in_service_queue in the search is needed because the limit may be overwritten to min_t(unsigned int, 1, limit) when a large in-flight request is on a non-rotational device in the found queue; so we need to reset the limit to that of bfqd->in_service_queue for the normal case. Actually, we have already made sure bfqd->rq_in_driver < limit before the search, so:
- limit is >= 1, since bfqd->rq_in_driver >= 0, and min_t(unsigned int, 1, limit) is therefore always 1. So we can simply compare bfqd->rq_in_driver with 1 instead of the result of min_t(unsigned int, 1, limit) in the large-request-on-non-rotational-device case, which avoids overwriting the limit, and the bug is gone.
- For the normal case, we have already checked bfqd->rq_in_driver < limit, so we can return the found bfqq unconditionally and remove the unnecessary check.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-2-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-30  block, bfq: fix uaf for bfqq in bic_set_bfqq()  (Yu Kuai; 2 files, -2/+4)

After commit 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'"), bic->bfqq is accessed in bic_set_bfqq(); however, in some contexts bic->bfqq will have been freed, so bic_set_bfqq() is called with a freed bic->bfqq. Fix the problem by always freeing bfqq after bic_set_bfqq().

Fixes: 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'")
Reported-and-tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230130014136.591038-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

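The fix is an ordering change; a sketch of its shape (context elided and the argument list simplified, so treat the details as assumptions):

  /* old: bfqq may already be freed when bic_set_bfqq() runs */
  bfq_release_process_ref(bfqd, bfqq);   /* may free bfqq */
  bic_set_bfqq(bic, NULL, is_sync);      /* can touch freed memory */

  /* new: update the bic first, drop the potentially-final ref last */
  bic_set_bfqq(bic, NULL, is_sync);
  bfq_release_process_ref(bfqd, bfqq);
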
2023-01-30  blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()  (Yu Kuai; 1 file, -6/+29)

Currently the parent pd can be freed before the child pd:

t1: remove cgroup C1
    blkcg_destroy_blkgs
     blkg_destroy
      list_del_init(&blkg->q_node)  // remove blkg from queue list
      percpu_ref_kill(&blkg->refcnt)
       blkg_release
        call_rcu

t2: from t1
    __blkg_release
     blkg_free
      schedule_work

t4: deactivate policy
    blkcg_deactivate_policy
     pd_free_fn  // parent of C1 is freed first

t3: from t2
    blkg_free_workfn
     pd_free_fn

If a policy (for example, ioc_timer_fn() from iocost) accesses the parent pd from the child pd after pd_offline_fn(), a UAF can be triggered. Fix the problem by delaying 'list_del_init(&blkg->q_node)' from blkg_destroy() to blkg_free_workfn(), and by using a new disk-level mutex to synchronize blkg_free_workfn() and blkcg_deactivate_policy().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-30  blk-cgroup: support to track if policy is online  (Yu Kuai; 2 files, -7/+18)

A new field 'online' is added to blkg_policy_data to fix the following two problems:

1. In blkcg_activate_policy(), if pd_alloc_fn() with 'GFP_NOWAIT' fails, 'queue_lock' will be dropped and pd_alloc_fn() will try again without 'GFP_NOWAIT'. In the meantime, cgroup removal can race with it, and pd_offline_fn() will be called without pd_init_fn() and pd_online_fn() having run. This way a null-ptr-dereference can be triggered.

2. In order to synchronize pd_free_fn() between blkg_free_workfn() and blkcg_deactivate_policy(), 'list_del_init(&blkg->q_node)' is delayed to blkg_free_workfn(); hence pd_offline_fn() can be called first in blkg_destroy() and then again from blkcg_deactivate_policy(), which we must prevent.

The new field 'online' is set after pd_online_fn() and cleared after pd_offline_fn(), and pd_offline_fn() is only called if 'online' is set.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

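A sketch of the guard this enables (names follow blk-cgroup, but the surrounding context is elided and the details are assumptions):

  /* set only once the pd is fully initialized and online */
  pd->online = true;   /* after pd_init_fn() and pd_online_fn() */

  /* offline path: runs pd_offline_fn() at most once per pd */
  if (pd->online) {
      pd->online = false;
      if (pol->pd_offline_fn)
          pol->pd_offline_fn(pd);
  }
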
2023-01-30  blk-cgroup: dropping parent refcount after pd_free_fn() is done  (Yu Kuai; 1 file, -2/+2)

Some cgroup policies will access the parent pd through the child pd even after pd_offline_fn() is done. If pd_free_fn() for the parent is called before the child's, a UAF can be triggered. Hence it's better to guarantee the order of pd_free_fn() calls.

Currently the refcount of the parent blkg is dropped in __blkg_release(), which is before pd_free_fn() is called in blkg_free_workfn(), and blkg_free_workfn() runs asynchronously. This patch makes sure pd_free_fn() called from cgroup removal is ordered, by delaying the drop of the parent refcount until after pd_free_fn() for the child has run.

BTW, pd_free_fn() will also be called from blkcg_deactivate_policy() when deleting a device; the following patches will guarantee that ordering.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-30  blk-mq: cleanup unused methods: blk_mq_hw_sysfs_store  (Zhong Jinghua; 1 file, -24/+0)

We found that the blk_mq_hw_sysfs_store interface is never used: the default_hw_ctx_attrs object using blk_mq_hw_sysfs_ops only uses the show method. Since commit 4a46f05ebf99 ("blk-mq: move hctx and ctx counters from sysfs to debugfs") moved those counters to debugfs, the store method has been unused. Clean up the dead code.

Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230128030419.2780298-1-zhongjinghua@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-30  block: treat poll queue enter similarly to timeouts  (Jens Axboe; 1 file, -1/+10)

We ran into an issue where a production workload would randomly grind to a halt and not continue until the pending IO had timed out. This turned out to be a complicated interaction between queue freezing and polled IO:

1) You have an application that does polled IO. At any point in time, there may be polled IO pending.

2) You have a monitoring application that issues a passthrough command, which is marked with side effects such that it needs to freeze the queue.

3) Passthrough command is started, which calls blk_freeze_queue_start() on the device. At this point the queue is marked frozen, and any attempt to enter the queue will fail (for non-blocking) or block.

4) Now the driver calls blk_mq_freeze_queue_wait(), which will return when the queue is quiesced and pending IO has completed.

5) The pending IO is polled IO, but any attempt to poll IO through the normal iocb_bio_iopoll() -> bio_poll() will fail when it gets to bio_queue_enter() as the queue is frozen. Rather than poll and complete IO, the polling threads will sit in a tight loop attempting to poll, but failing to enter the queue to do so.

The end result is that progress for either application will be stalled until all pending polled IO has timed out. This causes obvious huge latency issues for the application doing polled IO, but also long delays for the passthrough command.

Fix this by treating queue enter for polled IO just like we do for timeouts. This allows quick quiesce of the queue as we still poll and complete this IO, while still disallowing queueing up new IO.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-30  blk-iocost: change div64_u64 to DIV64_U64_ROUND_UP in ioc_refresh_params()  (Li Nan; 1 file, -2/+2)

vrate_min is calculated with DIV64_U64_ROUND_UP, but vrate_max is calculated with div64_u64. vrate_min may thus end up 1 greater than vrate_max if the input min and max values of cost.qos are equal. Change div64_u64 to DIV64_U64_ROUND_UP so both bounds round the same way.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

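The rounding mismatch is easy to reproduce in userspace (the constants below are made up; only the mixed round-up/round-down division matters):

  #include <stdio.h>
  #include <stdint.h>

  #define MILLION 1000000ULL
  /* mirrors the kernel's DIV64_U64_ROUND_UP */
  #define DIV64_U64_ROUND_UP(x, y) (((x) + (y) - 1) / (y))

  int main(void)
  {
      /* equal qos min/max inputs; the product is not divisible by MILLION */
      uint64_t qos_min = 1250001, qos_max = 1250001, scale = 100;

      uint64_t vrate_min = DIV64_U64_ROUND_UP(qos_min * scale, MILLION);
      uint64_t vrate_max = (qos_max * scale) / MILLION;  /* div64_u64-style */

      /* prints vrate_min=126 vrate_max=125: min exceeds max */
      printf("vrate_min=%llu vrate_max=%llu\n",
             (unsigned long long)vrate_min, (unsigned long long)vrate_max);
      return 0;
  }
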
2023-01-30  blk-iocost: fix divide by 0 error in calc_lcoefs()  (Li Nan; 1 file, -3/+8)

Echoing the max of u64 to cost.model can cause a divide by 0 error:

  # echo 8:0 rbps=18446744073709551615 > /sys/fs/cgroup/io.cost.model

  divide error: 0000 [#1] PREEMPT SMP
  RIP: 0010:calc_lcoefs+0x4c/0xc0
  Call Trace:
   <TASK>
   ioc_refresh_params+0x2b3/0x4f0
   ioc_cost_model_write+0x3cb/0x4c0
   ? _copy_from_iter+0x6d/0x6c0
   ? kernfs_fop_write_iter+0xfc/0x270
   cgroup_file_write+0xa0/0x200
   kernfs_fop_write_iter+0x17d/0x270
   vfs_write+0x414/0x620
   ksys_write+0x73/0x160
   __x64_sys_write+0x1e/0x30
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

calc_lcoefs() uses the input value of cost.model in DIV_ROUND_UP_ULL; overflow happens if bps plus IOC_PAGE_SIZE is greater than ULLONG_MAX, and that can cause the divide by 0 error. Fix the problem by setting basecost

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

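The overflow itself can be shown in a few lines of userspace C (IOC_PAGE_SIZE is assumed to be 4096 here; whatever the actual kernel granule, the wraparound is the same):

  #include <stdio.h>
  #include <stdint.h>

  /* mirrors the kernel's DIV_ROUND_UP_ULL */
  #define DIV_ROUND_UP_ULL(x, y) (((x) + (y) - 1) / (y))
  #define IOC_PAGE_SIZE 4096ULL  /* assumed granule */

  int main(void)
  {
      uint64_t bps = UINT64_MAX;  /* rbps=18446744073709551615 */

      /* bps + IOC_PAGE_SIZE - 1 wraps past ULLONG_MAX, so the
       * rounded-up quotient collapses to 0 ... */
      uint64_t pages_per_sec = DIV_ROUND_UP_ULL(bps, IOC_PAGE_SIZE);
      printf("pages_per_sec=%llu\n", (unsigned long long)pages_per_sec);

      /* ... and any later division by this value is a divide by 0 */
      if (pages_per_sec == 0)
          printf("later cost-per-page division would fault\n");
      return 0;
  }
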
2023-01-30  blk-iocost: read params inside lock in sysfs apis  (Yu Kuai; 1 file, -0/+4)

Otherwise, users might read abnormal values if the params are updated concurrently.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2023-01-30  blk-iocost: don't allow to configure bio based device  (Yu Kuai; 1 file, -0/+10)

iocost is based on rq_qos, which only works for request-based devices, so it doesn't make sense to configure iocost for a bio-based device.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

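A hedged sketch of the gate this adds in the iocost configuration path (the exact placement and error label are assumptions; queue_is_mq() is the standard request-based test):

  if (!queue_is_mq(disk->queue)) {
      ret = -EOPNOTSUPP;  /* bio-based device: no rq_qos, no iocost */
      goto err;
  }
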
2023-01-30  blk-iocost: check return value of match_u64()  (Yu Kuai; 1 file, -1/+2)

This patch fixes that the return value of match_u64() in ioc_qos_write() is not checked.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230117070806.3857142-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

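A sketch of the shape of such a fix (context elided; to my understanding match_u64() returns 0 on success, so a nonzero result must be treated as a parse error):

  u64 v;

  if (match_u64(&args[0], &v))
      goto einval;  /* previously the parse error was silently ignored */
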