summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)AuthorFilesLines
2021-01-25bio: don't copy bvec for direct IOPavel Begunkov1-38/+29
The block layer spends quite a while in blkdev_direct_IO() to copy and initialise bio's bvec. However, if we've already got a bvec in the input iterator it might be reused in some cases, i.e. when new ITER_BVEC_FLAG_FIXED flag is set. Simple tests show considerable performance boost, and it also reduces memory footprint. Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block/psi: remove PSI annotations from direct IOPavel Begunkov1-0/+6
Direct IO does not operate on the current working set of pages managed by the kernel, so it should not be accounted as memory stall to PSI infrastructure. The block layer and iomap direct IO use bio_iov_iter_get_pages() to build bios, and they are the only users of it, so to avoid PSI tracking for them clear out BIO_WORKINGSET flag. Do same for dio_bio_submit() because fs/direct_io constructs bios by hand directly calling bio_add_page(). Reported-by: Christoph Hellwig <hch@infradead.org> Suggested-by: Christoph Hellwig <hch@infradead.org> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: remove unnecessary argument from blk_execute_rqGuoqing Jiang3-6/+5
We can remove 'q' from blk_execute_rq as well after the previous change in blk_execute_rq_nowait. And more importantly it never really was needed to start with given that we can trivial derive it from struct request. Cc: linux-scsi@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Cc: linux-ide@vger.kernel.org Cc: linux-mmc@vger.kernel.org Cc: linux-nvme@lists.infradead.org Cc: linux-nfs@vger.kernel.org Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: remove unnecessary argument from blk_execute_rq_nowaitGuoqing Jiang1-6/+4
The 'q' is not used since commit a1ce35fa4985 ("block: remove dead elevator code"), also update the comment of the function. And more importantly it never really was needed to start with given that we can trivial derive it from struct request. Cc: target-devel@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Cc: linux-ide@vger.kernel.org Cc: linux-mmc@vger.kernel.org Cc: linux-nvme@lists.infradead.org Cc: linux-nfs@vger.kernel.org Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25bsg: free the request before return error codePan Bian1-1/+3
Free the request rq before returning error code. Fixes: 972248e9111e ("scsi: bsg-lib: handle bidi requests without block layer help") Signed-off-by: Pan Bian <bianpan2016@163.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: Fix an error handling in add_partitionDinghao Liu1-1/+1
Once we have called device_initialize(), we should use put_device() to give up the reference on error, just like what we have done on failure of device_add(). Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queueMing Lei1-1/+1
In case of blk_mq_is_sbitmap_shared(), we should test QUEUE_FLAG_HCTX_ACTIVE against q->queue_flags instead of BLK_MQ_S_TAG_ACTIVE. So fix it. Cc: John Garry <john.garry@huawei.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Fixes: f1b49fdc1c64 ("blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: move three bvec helpers declaration into private helperMing Lei1-0/+4
bvec_alloc(), bvec_free() and bvec_nr_vecs() are only used inside block layer core functions, no need to declare them in public header. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: set .bi_max_vecs as actual allocated vector numberMing Lei1-1/+2
bvec_alloc() may allocate more bio vectors than requested, so set .bi_max_vecs as actual allocated vector number, instead of the requested number. This way can help fs build bigger bio because new bio often won't be allocated until the current one becomes full. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: don't allocate inline bvecs if this bioset needn't bvecsMing Lei1-2/+5
The inline bvecs won't be used if user needn't bvecs by not passing BIOSET_NEED_BVECS, so don't allocate bvecs in this situation. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: don't pass BIOSET_NEED_BVECS for q->bio_splitMing Lei1-1/+1
q->bio_split is only used by bio_split() for fast cloning bio, and no need to allocate bvecs, so remove this flag. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: manage bio slab cache by xarrayMing Lei1-67/+49
Managing bio slab cache via xarray by using slab cache size as xarray index, and storing 'struct bio_slab' instance into xarray. So code is simplified a lot, meantime it becomes more readable than before. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25bfq: don't duplicate code for different pathshuhai1-3/+0
As we can see, returns parent_sched_may_change whether sd->next_in_service changes or not, so remove this judgment. Signed-off-by: huhai <huhai@tj.kylinos.cn> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25blk-mq: Improve performance of non-mq IO schedulers with multiple HW queuesJan Kara2-6/+61
Currently when non-mq aware IO scheduler (BFQ, mq-deadline) is used for a queue with multiple HW queues, the performance it rather bad. The problem is that these IO schedulers use queue-wide locking and their dispatch function does not respect the hctx it is passed in and returns any request it finds appropriate. Thus locality of request access is broken and dispatch from multiple CPUs just contends on IO scheduler locks. For these IO schedulers there's little point in dispatching from multiple CPUs. Instead dispatch always only from a single CPU to limit contention. Below is a comparison of dbench runs on XFS filesystem where the storage is a raid card with 64 HW queues and to it attached a single rotating disk. BFQ is used as IO scheduler: clients MQ SQ MQ-Patched Amean 1 39.12 (0.00%) 43.29 * -10.67%* 36.09 * 7.74%* Amean 2 128.58 (0.00%) 101.30 * 21.22%* 96.14 * 25.23%* Amean 4 577.42 (0.00%) 494.47 * 14.37%* 508.49 * 11.94%* Amean 8 610.95 (0.00%) 363.86 * 40.44%* 362.12 * 40.73%* Amean 16 391.78 (0.00%) 261.49 * 33.25%* 282.94 * 27.78%* Amean 32 324.64 (0.00%) 267.71 * 17.54%* 233.00 * 28.23%* Amean 64 295.04 (0.00%) 253.02 * 14.24%* 242.37 * 17.85%* Amean 512 10281.61 (0.00%) 10211.16 * 0.69%* 10447.53 * -1.61%* Numbers are times so lower is better. MQ is stock 5.10-rc6 kernel. SQ is the same kernel with megaraid_sas.host_tagset_enable=0 so that the card advertises just a single HW queue. MQ-Patched is a kernel with this patch applied. You can see multiple hardware queues heavily hurt performance in combination with BFQ. The patch restores the performance. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25Revert "blk-mq, elevator: Count requests per hctx to improve performance"Jan Kara3-12/+0
This reverts commit b445547ec1bbd3e7bf4b1c142550942f70527d95. Since both mq-deadline and BFQ completely ignore hctx they are passed to their dispatch function and dispatch whatever request they deem fit checking whether any request for a particular hctx is queued is just pointless since we'll very likely get a request from a different hctx anyway. In the following commit we'll deal with lock contention in these IO schedulers in presence of multiple HW queues in a different way. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: do not expire a queue when it is the only busy onePaolo Valente1-2/+20
This commits preserves I/O-dispatch plugging for a special symmetric case that may suddenly turn into asymmetric: the case where only one bfq_queue, say bfqq, is busy. In this case, not expiring bfqq does not cause any harm to any other queues in terms of service guarantees. In contrast, it avoids the following unlucky sequence of events: (1) bfqq is expired, (2) a new queue with a lower weight than bfqq becomes busy (or more queues), (3) the new queue is served until a new request arrives for bfqq, (4) when bfqq is finally served, there are so many requests of the new queue in the drive that the pending requests for bfqq take a lot of time to be served. In particular, event (2) may case even already dispatched requests of bfqq to be delayed, inside the drive. So, to avoid this series of events, the scenario is preventively declared as asymmetric also if bfqq is the only busy queues. By doing so, I/O-dispatch plugging is performed for bfqq. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: avoid spurious switches to soft_rt of interactive queuesPaolo Valente1-20/+37
BFQ tags some bfq_queues as interactive or soft_rt if it deems that these bfq_queues contain the I/O of, respectively, interactive or soft real-time applications. BFQ privileges both these special types of bfq_queues over normal bfq_queues. To privilege a bfq_queue, BFQ mainly raises the weight of the bfq_queue. In particular, soft_rt bfq_queues get a higher weight than interactive bfq_queues. A bfq_queue may turn from interactive to soft_rt. And this leads to a tricky issue. Soft real-time applications usually start with an I/O-bound, interactive phase, in which they load themselves into main memory. BFQ correctly detects this phase, and keeps the bfq_queues associated with the application in interactive mode for a while. Problems arise when the I/O pattern of the application finally switches to soft real-time. One of the conditions for a bfq_queue to be deemed as soft_rt is that the bfq_queue does not consume too much bandwidth. But the bfq_queues associated with a soft real-time application consume as much bandwidth as they can in the loading phase of the application. So, after the application becomes truly soft real-time, a lot of time should pass before the average bandwidth consumed by its bfq_queues finally drops to a value acceptable for soft_rt bfq_queues. As a consequence, there might be a time gap during which the application is not privileged at all, because its bfq_queues are not interactive any longer, but cannot be deemed as soft_rt yet. To avoid this problem, BFQ pretends that an interactive bfq_queue consumes zero bandwidth, and allows an interactive bfq_queue to switch to soft_rt. Yet, this fake zero-bandwidth consumption easily causes the bfq_queue to often switch to soft_rt deceptively, during its loading phase. As in soft_rt mode, the bfq_queue gets its bandwidth correctly computed, and therefore soon switches back to interactive. Then it switches again to soft_rt, and so on. These spurious fluctuations usually cause losses of throughput, because they deceive BFQ's mechanisms for boosting throughput (injection, I/O-plugging avoidance, ...). This commit addresses this issue as follows: 1) It does compute actual bandwidth consumption also for interactive bfq_queues. This avoids the above false positives. 2) When a bfq_queue switches from interactive to normal mode, the consumed bandwidth is reset (forgotten). This allows the bfq_queue to enjoy soft_rt very quickly. In particular, two alternatives are possible in this switch: - the bfq_queue still has backlog, and therefore there is a budget already scheduled to serve the bfq_queue; in this case, the scheduling of the current budget of the bfq_queue is not hindered, because only the scheduling of the next budget will be affected by the weight drop. After that, if the bfq_queue is actually in a soft_rt phase, and becomes empty during the service of its current budget, which is the natural behavior of a soft_rt bfq_queue, then the bfq_queue will be considered as soft_rt when its next I/O arrives. If, in contrast, the bfq_queue remains constantly non-empty, then its next budget will be scheduled with a low weight, which is the natural treatment for an I/O-bound (non soft_rt) bfq_queue. - the bfq_queue is empty; in this case, the bfq_queue may be considered unjustly soft_rt when its new I/O arrives. Yet the problem is now much smaller than before, because it is unlikely that more than one spurious fluctuation occurs. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: do not raise non-default weightsPaolo Valente1-3/+7
BFQ heuristics try to detect interactive I/O, and raise the weight of the queues containing such an I/O. Yet, if also the user changes the weight of a queue (i.e., the user changes the ioprio of the process associated with that queue), then it is most likely better to prevent BFQ heuristics from silently changing the same weight. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: increase time window for waker detectionPaolo Valente1-1/+1
Tests on slower machines showed current window to be way too small. This commit increases it. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: set next_rq to waker_bfqq->next_rq in waker injectionJia Cheng Hu1-1/+1
Since commit c5089591c3ba ("block, bfq: detect wakers and unconditionally inject their I/O"), when the in-service bfq_queue, say Q, is temporarily empty, BFQ checks whether there are I/O requests to inject (also) from the waker bfq_queue for Q. To this goal, the value pointed by bfqq->waker_bfqq->next_rq must be controlled. However, the current implementation mistakenly looks at bfqq->next_rq, which instead points to the next request of the currently served queue. This mistake evidently causes losses of throughput in scenarios with waker bfq_queues. This commit corrects this mistake. Fixes: c5089591c3ba ("block, bfq: detect wakers and unconditionally inject their I/O") Signed-off-by: Jia Cheng Hu <jia.jiachenghu@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: use half slice_idle as a threshold to check short ttimePaolo Valente1-3/+4
The value of the I/O plugging (idling) timeout is used also as the think-time threshold to decide whether a process has a short think time. In this respect, a good value of this timeout for rotational drives is un the order of several ms. Yet, this is often too long a time interval to be effective as a think-time threshold. This commit mitigates this problem (by a lot, according to tests), by halving the threshold. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: use an xarray for disk->part_tblChristoph Hellwig4-178/+19
Now that no fast path lookups in the partition table are left, there is no point in micro-optimizing the data structure for it. Just use a bog standard xarray. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: remove DISK_PITER_REVERSEChristoph Hellwig1-30/+7
There is good reason to iterate backwards when deleting all partitions in del_gendisk, just like we don't in blk_drop_partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: add a disk_uevent helperChristoph Hellwig1-13/+14
Add a helper to call kobject_uevent for the disk and all partitions, and unexport the disk_part_iter_* helpers that are now only used in the core block code. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25blk-mq: use ->bi_bdev for I/O accountingChristoph Hellwig3-51/+5
Remove the reverse map from a sector to a partition for I/O accounting by simply using ->bi_bdev. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: use ->bi_bdev for bio based I/O accountingChristoph Hellwig1-10/+13
Rework the I/O accounting for bio based drivers to use ->bi_bdev. This means all drivers can now simply use bio_start_io_acct to start accounting, and it will take partitions into account automatically. To end I/O account either bio_end_io_acct can be used if the driver never remaps I/O to a different device, or bio_end_io_acct_remapped if the driver did remap the I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: do not reassig ->bi_bdev when partition remappingChristoph Hellwig1-2/+3
There is no good reason to reassign ->bi_bdev when remapping the partition-relative block number to the device wide one, as all the information required by the drivers comes from the gendisk anyway. Keeping the original ->bi_bdev alive will allow to greatly simplify the partition-away I/O accounting. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: simplify submit_bio_checks a bitChristoph Hellwig1-25/+14
Merge a few checks for whole devices vs partitions to streamline the sanity checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: store a block_device pointer in struct bioChristoph Hellwig12-69/+55
Replace the gendisk pointer in struct bio with a pointer to the newly improved struct block device. From that the gendisk can be trivially accessed with an extra indirection, but it also allows to directly look up all information related to partition remapping. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: propagate BLKROSET on the whole device to all partitionsChristoph Hellwig1-2/+1
Change the policy so that a BLKROSET on the whole device also affects partitions. To quote Martin K. Petersen: It's very common for database folks to twiddle the read-only state of block devices and partitions. I know that our users will find it very counter-intuitive that setting /dev/sda read-only won't prevent writes to /dev/sda1. The existing behavior is inconsistent in the sense that doing: # blockdev --setro /dev/sda # echo foo > /dev/sda1 permits writes. But: # blockdev --setro /dev/sda <something triggers revalidate> # echo foo > /dev/sda1 doesn't. And a subsequent: # blockdev --setrw /dev/sda # echo foo > /dev/sda1 doesn't work either since sda1's read-only policy has been inherited from the whole-disk device. You need to do: # blockdev --rereadpt after setting the whole-disk device rw to effectuate the same change on the partitions, otherwise they are stuck being read-only indefinitely. However, setting the read-only policy on a partition does *not* require the revalidate step. As a matter of fact, doing the revalidate will blow away the policy setting you just made. So the user needs to take different actions depending on whether they are trying to read-protect a whole-disk device or a partition. Despite using the same ioctl. That is really confusing. I have lost count how many times our customers have had data clobbered because of ambiguity of the existing whole-disk device policy. The current behavior violates the principle of least surprise by letting the user think they write protected the whole disk when they actually didn't. Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: add a hard-readonly flag to struct gendiskChristoph Hellwig3-19/+21
Commit 20bd1d026aac ("scsi: sd: Keep disk read-only when re-reading partition") addressed a long-standing problem with user read-only policy being overridden as a result of a device-initiated revalidate. The commit has since been reverted due to a regression that left some USB devices read-only indefinitely. To fix the underlying problems with revalidate we need to keep track of hardware state and user policy separately. The gendisk has been updated to reflect the current hardware state set by the device driver. This is done to allow returning the device to the hardware state once the user clears the BLKROSET flag. The resulting semantics are as follows: - If BLKROSET sets a given partition read-only, that partition will remain read-only even if the underlying storage stack initiates a revalidate. However, the BLKRRPART ioctl will cause the partition table to be dropped and any user policy on partitions will be lost. - If BLKROSET has not been set, both the whole disk device and any partitions will reflect the current write-protect state of the underlying device. Based on a patch from Martin K. Petersen <martin.petersen@oracle.com>. Reported-by: Oleksii Kurochko <olkuroch@cisco.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201221 Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block: remove the NULL bdev check in bdev_read_onlyChristoph Hellwig1-3/+0
Only a single caller can end up in bdev_read_only, so move the check there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-10Merge tag 'block-5.11-2021-01-10' of git://git.kernel.dk/linux-blockLinus Torvalds4-13/+23
Pull block fixes from Jens Axboe: - Missing CRC32 selections (Arnd) - Fix for a merge window regression with bdev inode init (Christoph) - bcache fixes - rnbd fixes - NVMe pull request from Christoph: - fix a race in the nvme-tcp send code (Sagi Grimberg) - fix a list corruption in an nvme-rdma error path (Israel Rukshin) - avoid a possible double fetch in nvme-pci (Lalithambika Krishnakumar) - add the susystem NQN quirk for a Samsung driver (Gopal Tiwari) - fix two compiler warnings in nvme-fcloop (James Smart) - don't call sleeping functions from irq context in nvme-fc (James Smart) - remove an unused argument (Max Gurtovoy) - remove unused exports (Minwoo Im) - Use-after-free fix for partition iteration (Ming) - Missing blk-mq debugfs flag annotation (John) - Bdev freeze regression fix (Satya) - blk-iocost NULL pointer deref fix (Tejun) * tag 'block-5.11-2021-01-10' of git://git.kernel.dk/linux-block: (26 commits) bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET bcache: introduce BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE for large bucket bcache: check unsupported feature sets for bcache register bcache: fix typo from SUUP to SUPP in features.h bcache: set pdev_set_uuid before scond loop iteration blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHARED block/rnbd-clt: avoid module unload race with close confirmation block/rnbd: Adding name to the Contributors List block/rnbd-clt: Fix sg table use after free block/rnbd-srv: Fix use after free in rnbd_srv_sess_dev_force_close block/rnbd: Select SG_POOL for RNBD_CLIENT block: pre-initialize struct block_device in bdev_alloc_inode fs: Fix freeze_bdev()/thaw_bdev() accounting of bd_fsfreeze_sb nvme: remove the unused status argument from nvme_trace_bio_complete nvmet-rdma: Fix list_del corruption on queue establishment failure nvme: unexport functions with no external caller nvme: avoid possible double fetch in handling CQE nvme-tcp: Fix possible race of io_work and direct send nvme-pci: mark Samsung PM1725a as IGNORE_DEV_SUBNQN nvme-fcloop: Fix sscanf type and list_first_entry_or_null warnings ...
2021-01-08blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHAREDJohn Garry1-0/+1
Showing the hctx flags for when BLK_MQ_F_TAG_HCTX_SHARED is set gives something like: root@debian:/home/john# more /sys/kernel/debug/block/sda/hctx0/flags alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED|3 Add the decoding for that flag. Fixes: 32bc15afed04b ("blk-mq: Facilitate a shared sbitmap per tagset") Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-05block: fix use-after-free in disk_part_iter_nextMing Lei1-4/+7
Make sure that bdgrab() is done on the 'block_device' instance before referring to it for avoiding use-after-free. Cc: <stable@vger.kernel.org> Reported-by: syzbot+825f0f9657d4e528046e@syzkaller.appspotmail.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-05bfq: Fix computation of shallow depthJan Kara1-4/+4
BFQ computes number of tags it allows to be allocated for each request type based on tag bitmap. However it uses 1 << bitmap.shift as number of available tags which is wrong. 'shift' is just an internal bitmap value containing logarithm of how many bits bitmap uses in each bitmap word. Thus number of tags allowed for some request types can be far to low. Use proper bitmap.depth which has the number of tags instead. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-05blk-iocost: fix NULL iocg deref from racing against initializationTejun Heo1-5/+11
When initializing iocost for a queue, its rqos should be registered before the blkcg policy is activated to allow policy data initiailization to lookup the associated ioc. This unfortunately means that the rqos methods can be called on bios before iocgs are attached to all existing blkgs. While the race is theoretically possible on ioc_rqos_throttle(), it mostly happened in ioc_rqos_merge() due to the difference in how they lookup ioc. The former determines it from the passed in @rqos and then bails before dereferencing iocg if the looked up ioc is disabled, which most likely is the case if initialization is still in progress. The latter looked up ioc by dereferencing the possibly NULL iocg making it a lot more prone to actually triggering the bug. * Make ioc_rqos_merge() use the same method as ioc_rqos_throttle() to look up ioc for consistency. * Make ioc_rqos_throttle() and ioc_rqos_merge() test for NULL iocg before dereferencing it. * Explain the danger of NULL iocgs in blk_iocost_init(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jonathan Lemon <bsd@fb.com> Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-01Merge tag 'scsi-fixes' of ↵Linus Torvalds5-20/+27
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "This is a load of driver fixes (12 ufs, 1 mpt3sas, 1 cxgbi). The big core two fixes are for power management ("block: Do not accept any requests while suspended" and "block: Fix a race in the runtime power management code") which finally sorts out the resume problems we've occasionally been having. To make the resume fix, there are seven necessary precursors which effectively renames REQ_PREEMPT to REQ_PM, so every "special" request in block is automatically a power management exempt one. All of the non-PM preempt cases are removed except for the one in the SCSI Parallel Interface (spi) domain validation which is a genuine case where we have to run requests at high priority to validate the bus so this becomes an autopm get/put protected request" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (22 commits) scsi: cxgb4i: Fix TLS dependency scsi: ufs: Un-inline ufshcd_vops_device_reset function scsi: ufs: Re-enable WriteBooster after device reset scsi: ufs-mediatek: Use correct path to fix compile error scsi: mpt3sas: Signedness bug in _base_get_diag_triggers() scsi: block: Do not accept any requests while suspended scsi: block: Remove RQF_PREEMPT and BLK_MQ_REQ_PREEMPT scsi: core: Only process PM requests if rpm_status != RPM_ACTIVE scsi: scsi_transport_spi: Set RQF_PM for domain validation commands scsi: ide: Mark power management requests with RQF_PM instead of RQF_PREEMPT scsi: ide: Do not set the RQF_PREEMPT flag for sense requests scsi: block: Introduce BLK_MQ_REQ_PM scsi: block: Fix a race in the runtime power management code scsi: ufs-pci: Enable UFSHCD_CAP_RPM_AUTOSUSPEND for Intel controllers scsi: ufs-pci: Fix recovery from hibernate exit errors for Intel controllers scsi: ufs-pci: Ensure UFS device is in PowerDown mode for suspend-to-disk ->poweroff() scsi: ufs-pci: Fix restore from S4 for Intel controllers scsi: ufs-mediatek: Keep VCC always-on for specific devices scsi: ufs: Allow regulators being always-on scsi: ufs: Clear UAC for RPMB after ufshcd resets ...
2020-12-30block: add debugfs stanza for QUEUE_FLAG_NOWAITAndres Freund1-0/+1
This was missed in 021a24460dc2. Leads to the numeric value of QUEUE_FLAG_NOWAIT (i.e. 29) showing up in /sys/kernel/debug/block/*/state. Fixes: 021a24460dc28e7412aecfae89f60e1847e685c0 Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-22block: update some copyrightsChristoph Hellwig2-0/+3
Update copyrights for files that have gotten some major rewrites lately. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17blk-mq: Don't complete on a remote CPU in force threaded modeSebastian Andrzej Siewior1-0/+8
With force threaded interrupts enabled, raising softirq from an SMP function call will always result in waking the ksoftirqd thread. This is not optimal given that the thread runs at SCHED_OTHER priority. Completing the request in hard IRQ-context on PREEMPT_RT (which enforces the force threaded mode) is bad because the completion handler may acquire sleeping locks which violate the locking context. Disable request completing on a remote CPU in force threaded mode. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17blk-iocost: Add iocg idle state tracepointBaolin Wang1-0/+3
It will be helpful to trace the iocg's whole state, including active and idle state. And we can easily expand the original iocost_iocg_activate trace event to support a state trace class, including active and idle state tracing. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17blk-mq: Remove 'running from the wrong CPU' warningDaniel Wagner1-25/+0
It's guaranteed that no request is in flight when a hctx is going offline. This warning is only triggered when the wq's CPU is hot plugged and the blk-mq is not synced up yet. As this state is temporary and the request is still processed correctly, better remove the warning as this is the fast path. Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds1-1/+1
Pull SCSI updates from James Bottomley: "This consists of the usual driver updates (ufs, qla2xxx, smartpqi, target, zfcp, fnic, mpt3sas, ibmvfc) plus a load of cleanups, a major power management rework and a load of assorted minor updates. There are a few core updates (formatting fixes being the big one) but nothing major this cycle" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (279 commits) scsi: mpt3sas: Update driver version to 36.100.00.00 scsi: mpt3sas: Handle trigger page after firmware update scsi: mpt3sas: Add persistent MPI trigger page scsi: mpt3sas: Add persistent SCSI sense trigger page scsi: mpt3sas: Add persistent Event trigger page scsi: mpt3sas: Add persistent Master trigger page scsi: mpt3sas: Add persistent trigger pages support scsi: mpt3sas: Sync time periodically between driver and firmware scsi: qla2xxx: Update version to 10.02.00.104-k scsi: qla2xxx: Fix device loss on 4G and older HBAs scsi: qla2xxx: If fcport is undergoing deletion complete I/O with retry scsi: qla2xxx: Fix the call trace for flush workqueue scsi: qla2xxx: Fix flash update in 28XX adapters on big endian machines scsi: qla2xxx: Handle aborts correctly for port undergoing deletion scsi: qla2xxx: Fix N2N and NVMe connect retry failure scsi: qla2xxx: Fix FW initialization error on big endian machines scsi: qla2xxx: Fix crash during driver load on big endian machines scsi: qla2xxx: Fix compilation issue in PPC systems scsi: qla2xxx: Don't check for fw_started while posting NVMe command scsi: qla2xxx: Tear down session if FW say it is down ...
2020-12-17Merge tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-blockLinus Torvalds2-17/+18
Pull block driver updates from Jens Axboe: "Nothing major in here: - NVMe pull request from Christoph: - nvmet passthrough improvements (Chaitanya Kulkarni) - fcloop error injection support (James Smart) - read-only support for zoned namespaces without Zone Append (Javier González) - improve some error message (Minwoo Im) - reject I/O to offline fabrics namespaces (Victor Gladkov) - PCI queue allocation cleanups (Niklas Schnelle) - remove an unused allocation in nvmet (Amit Engel) - a Kconfig spelling fix (Colin Ian King) - nvme_req_qid simplication (Baolin Wang) - MD pull request from Song: - Fix race condition in md_ioctl() (Dae R. Jeong) - Initialize read_slot properly for raid10 (Kevin Vigor) - Code cleanup (Pankaj Gupta) - md-cluster resync/reshape fix (Zhao Heming) - Move null_blk into its own directory (Damien Le Moal) - null_blk zone and discard improvements (Damien Le Moal) - bcache race fix (Dongsheng Yang) - Set of rnbd fixes/improvements (Gioh Kim, Guoqing Jiang, Jack Wang, Lutz Pogrell, Md Haris Iqbal) - lightnvm NULL pointer deref fix (tangzhenhao) - sr in_interrupt() removal (Sebastian Andrzej Siewior) - FC endpoint security support for s390/dasd (Jan Höppner, Sebastian Ott, Vineeth Vijayan). From the s390 arch guys, arch bits included as it made it easier for them to funnel the feature through the block driver tree. - Follow up fixes (Colin Ian King)" * tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block: (64 commits) block: drop dead assignments in loop_init() sr: Remove in_interrupt() usage in sr_init_command(). sr: Switch the sector size back to 2048 if sr_read_sector() changed it. cdrom: Reset sector_size back it is not 2048. drivers/lightnvm: fix a null-ptr-deref bug in pblk-core.c null_blk: Move driver into its own directory null_blk: Allow controlling max_hw_sectors limit null_blk: discard zones on reset null_blk: cleanup discard handling null_blk: Improve implicit zone close null_blk: improve zone locking block: Align max_hw_sectors to logical blocksize null_blk: Fail zone append to conventional zones null_blk: Fix zone size initialization bcache: fix race between setting bdev state to none and new write request direct to backing block/rnbd: fix a null pointer dereference on dev->blk_symlink_name block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name block/rnbd: call kobject_put in the failure path Documentation/ABI/rnbd-srv: add document for force_close block/rnbd-srv: close a mapped device from server side. ...
2020-12-16Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-blockLinus Torvalds18-943/+583
Pull block updates from Jens Axboe: "Another series of killing more code than what is being added, again thanks to Christoph's relentless cleanups and tech debt tackling. This contains: - blk-iocost improvements (Baolin Wang) - part0 iostat fix (Jeffle Xu) - Disable iopoll for split bios (Jeffle Xu) - block tracepoint cleanups (Christoph Hellwig) - Merging of struct block_device and hd_struct (Christoph Hellwig) - Rework/cleanup of how block device sizes are updated (Christoph Hellwig) - Simplification of gendisk lookup and removal of block device aliasing (Christoph Hellwig) - Block device ioctl cleanups (Christoph Hellwig) - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig) - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig) - sbitmap improvements (Pavel Begunkov) - Hybrid polling fix (Pavel Begunkov) - bvec iteration improvements (Pavel Begunkov) - Zone revalidation fixes (Damien Le Moal) - blk-throttle limit fix (Yu Kuai) - Various little fixes" * tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits) blk-mq: fix msec comment from micro to milli seconds blk-mq: update arg in comment of blk_mq_map_queue blk-mq: add helper allocating tagset->tags Revert "block: Fix a lockdep complaint triggered by request queue flushing" nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class blk-mq: add new API of blk_mq_hctx_set_fq_lock_class block: disable iopoll for split bio block: Improve blk_revalidate_disk_zones() checks sbitmap: simplify wrap check sbitmap: replace CAS with atomic and sbitmap: remove swap_lock sbitmap: optimise sbitmap_deferred_clear() blk-mq: skip hybrid polling if iopoll doesn't spin blk-iocost: Factor out the base vrate change into a separate function blk-iocost: Factor out the active iocgs' state check into a separate function blk-iocost: Move the usage ratio calculation to the correct place blk-iocost: Remove unnecessary advance declaration blk-iocost: Fix some typos in comments blktrace: fix up a kerneldoc comment block: remove the request_queue to argument request based tracepoints ...
2020-12-15Merge tag 'sched-core-2020-12-14' of ↵Linus Torvalds1-3/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Thomas Gleixner: - migrate_disable/enable() support which originates from the RT tree and is now a prerequisite for the new preemptible kmap_local() API which aims to replace kmap_atomic(). - A fair amount of topology and NUMA related improvements - Improvements for the frequency invariant calculations - Enhanced robustness for the global CPU priority tracking and decision making - The usual small fixes and enhancements all over the place * tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits) sched/fair: Trivial correction of the newidle_balance() comment sched/fair: Clear SMT siblings after determining the core is not idle sched: Fix kernel-doc markup x86: Print ratio freq_max/freq_base used in frequency invariance calculations x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC x86, sched: Calculate frequency invariance for AMD systems irq_work: Optimize irq_work_single() smp: Cleanup smp_call_function*() irq_work: Cleanup sched: Limit the amount of NUMA imbalance that can exist at fork time sched/numa: Allow a floating imbalance between NUMA nodes sched: Avoid unnecessary calculation of load imbalance at clone time sched/numa: Rename nr_running and break out the magic number sched: Make migrate_disable/enable() independent of RT sched/topology: Condition EAS enablement on FIE support arm64: Rebuild sched domains on invariance status changes sched/topology,schedutil: Wrap sched domains rebuild sched/uclamp: Allow to reset a task uclamp constraint value sched/core: Fix typos in comments Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug ...
2020-12-12blk-mq: fix msec comment from micro to milli secondsMinwoo Im1-3/+3
Delay to wait for queue running is milli second unit which is passed to delayed work via msecs_to_jiffies() which is to convert milliseconds to jiffies. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12blk-mq: update arg in comment of blk_mq_map_queueMinwoo Im1-1/+1
Update mis-named argument description of blk_mq_map_queue(). This patch also updates description that argument to software queue percpu context. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12blk-mq: add helper allocating tagset->tagsMinwoo Im1-1/+7
tagset->set is allocated from blk_mq_alloc_tag_set() rather than being reallocated. This patch added a helper to make its meaning explicitly which is to allocate rather than to reallocate. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>