path: root/block
Age | Commit message | Author | Files | Lines
2021-03-25 | block, bfq: merge bursts of newly-created queues | Paolo Valente | 3 | -10/+266
Many throughput-sensitive workloads are made of several parallel I/O flows, with all flows generated by the same application, or more generically by the same task (e.g., system boot). The most counterproductive action with these workloads is plugging I/O dispatch when one of the bfq_queues associated with these flows remains temporarily empty. To avoid this plugging, BFQ has been using a burst-handling mechanism for years now. This mechanism has proven effective for throughput, and not detrimental for service guarantees. This commit pushes this mechanism a little bit further, based on the following two facts. First, all the I/O flows of the same application or task contribute to the execution/completion of that common application or task. So the performance figures that matter are total throughput of the flows and task-wide I/O latency. In particular, these flows do not need to be protected from each other, in terms of individual bandwidth or latency. Second, the above fact holds regardless of the number of flows. Putting these two facts together, this commit stably merges the bfq_queues associated with these I/O flows, i.e., with the processes that generate these I/O flows, regardless of how many processes are involved. To decide whether a set of bfq_queues is actually associated with the I/O flows of a common application or task, and to merge these queues stably, this commit operates as follows: given a bfq_queue, say Q2, currently being created, and the last bfq_queue, say Q1, created before Q2, Q2 is merged stably with Q1 if
- very little time has elapsed since Q1 was created
- Q2 has the same ioprio as Q1
- Q2 belongs to the same group as Q1
Merging bfq_queues also reduces scheduling overhead. A fio test with ten random readers on /dev/nullb shows a throughput boost of 40%, with a quadcore.
Since BFQ's execution time amounts to ~50% of the total per-request processing time, the above throughput boost implies that BFQ's overhead is reduced by more than 50%. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-7-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25 | block, bfq: keep shared queues out of the waker mechanism | Paolo Valente | 1 | -1/+11
Shared queues are likely to receive I/O at a high rate. This may deceptively let them be considered as wakers of other queues. But a false waker will unjustly steal bandwidth from its supposedly woken queue. So also considering shared queues in the waker mechanism may cause more control troubles than throughput benefits. This commit keeps shared queues out of the waker-detection mechanism. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-6-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25 | block, bfq: fix weight-raising resume with !low_latency | Paolo Valente | 1 | -2/+8
When the low_latency heuristic is off, bfq_queues must not start to be weight-raised. Unfortunately, by mistake, this may happen when the state of a previously weight-raised bfq_queue is resumed after a queue split. This commit fixes this error. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-5-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25 | block, bfq: make shared queues inherit wakers | Paolo Valente | 1 | -3/+39
Consider a bfq_queue bfqq that is about to be merged with another bfq_queue new_bfqq. The processes associated with bfqq are cooperators of the processes associated with new_bfqq. So, if bfqq has a waker, then it is reasonable (and beneficial for throughput) to assume that all these processes will be happy to let bfqq's waker freely inject I/O when they have no I/O. So this commit makes new_bfqq inherit bfqq's waker. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-4-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25 | block, bfq: put reqs of waker and woken in dispatch list | Paolo Valente | 1 | -1/+43
Consider a new I/O request that arrives for a bfq_queue bfqq. If, when this happens, the only active bfq_queues are bfqq and either its waker bfq_queue or one of its woken bfq_queues, then there is no point in queueing this new I/O request in bfqq for service. In fact, the in-service queue and bfqq agree on serving this new I/O request as soon as possible. So this commit puts this new I/O request directly into the dispatch list. Tested-by: Jan Kara <jack@suse.cz> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-3-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25 | block, bfq: always inject I/O of queues blocked by wakers | Paolo Valente | 2 | -5/+35
Suppose that I/O dispatch is plugged, to wait for new I/O for the in-service bfq_queue, say bfqq. Suppose then that there is a further bfq_queue woken by bfqq, and that this woken queue has pending I/O. A woken queue does not steal bandwidth from bfqq, because it soon runs out of I/O if bfqq is not served. So there is virtually no risk of loss of bandwidth for bfqq if this woken queue has I/O dispatched while bfqq is waiting for new I/O. In contrast, this extra I/O injection boosts throughput. This commit performs this extra injection. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25 | blk-mq: Sentence reconstruct for better readability | Bhaskar Chowdhury | 1 | -2/+2
Sentence reconstruction for better readability. Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-24 | block: support zone append bvecs | Johannes Thumshirn | 1 | -3/+18
Christoph reported that we'll likely trigger the WARN_ON_ONCE() checking that we're not submitting a bvec with REQ_OP_ZONE_APPEND in bio_iov_iter_get_pages() some time ago using zoned btrfs, but I couldn't reproduce it back then. Now Naohiro was able to trigger the bug as well with xfstests generic/095 on a zoned btrfs. There is nothing that prevents bvec submissions via REQ_OP_ZONE_APPEND if the hardware's zone append limit is met. Reported-by: Naohiro Aota <naohiro.aota@wdc.com> Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/10bd414d9326c90cd69029077db63b363854eee5.1616600835.git.johannes.thumshirn@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-23 | block: recalculate segment count for multi-segment discards correctly | David Jeffery | 1 | -0/+8
When a stacked block device inserts a request into another block device using blk_insert_cloned_request, the request's nr_phys_segments field gets recalculated by a call to blk_recalc_rq_segments in blk_cloned_rq_check_limits. But blk_recalc_rq_segments does not know how to handle multi-segment discards. For disk types which can handle multi-segment discards like nvme, this results in discard requests which claim a single segment when they should report several, triggering a warning in nvme and causing nvme to fail the discard because of the invalid state.
 WARNING: CPU: 5 PID: 191 at drivers/nvme/host/core.c:700 nvme_setup_discard+0x170/0x1e0 [nvme_core]
 ...
 nvme_setup_cmd+0x217/0x270 [nvme_core]
 nvme_loop_queue_rq+0x51/0x1b0 [nvme_loop]
 __blk_mq_try_issue_directly+0xe7/0x1b0
 blk_mq_request_issue_directly+0x41/0x70
 ? blk_account_io_start+0x40/0x50
 dm_mq_queue_rq+0x200/0x3e0
 blk_mq_dispatch_rq_list+0x10a/0x7d0
 ? __sbitmap_queue_get+0x25/0x90
 ? elv_rb_del+0x1f/0x30
 ? deadline_remove_request+0x55/0xb0
 ? dd_dispatch_request+0x181/0x210
 __blk_mq_do_dispatch_sched+0x144/0x290
 ? bio_attempt_discard_merge+0x134/0x1f0
 __blk_mq_sched_dispatch_requests+0x129/0x180
 blk_mq_sched_dispatch_requests+0x30/0x60
 __blk_mq_run_hw_queue+0x47/0xe0
 __blk_mq_delay_run_hw_queue+0x15b/0x170
 blk_mq_sched_insert_requests+0x68/0xe0
 blk_mq_flush_plug_list+0xf0/0x170
 blk_finish_plug+0x36/0x50
 xlog_cil_committed+0x19f/0x290 [xfs]
 xlog_cil_process_committed+0x57/0x80 [xfs]
 xlog_state_do_callback+0x1e0/0x2a0 [xfs]
 xlog_ioend_work+0x2f/0x80 [xfs]
 process_one_work+0x1b6/0x350
 worker_thread+0x53/0x3e0
 ? process_one_work+0x350/0x350
 kthread+0x11b/0x140
 ? __kthread_bind_mask+0x60/0x60
 ret_from_fork+0x22/0x30
This patch fixes blk_recalc_rq_segments to be aware of devices which can have multi-segment discards. It calculates the correct discard segment count by counting the number of bios, as each discard bio is considered its own segment.
Fixes: 1e739730c5b9 ("block: optionally merge discontiguous discard bios into a single request") Signed-off-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Laurence Oberman <loberman@redhat.com> Link: https://lore.kernel.org/r/20210211143807.GA115624@redhat Signed-off-by: Jens Axboe <axboe@kernel.dk>
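The counting rule stated in the commit message above (each discard bio is its own segment) can be illustrated with a small user-space sketch. The `struct sketch_bio` type and `count_discard_segments()` are hypothetical stand-ins for illustration; they are not the kernel's `struct bio` or the code added to blk_recalc_rq_segments.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical minimal bio: only the link to the next bio of the request. */
struct sketch_bio {
    struct sketch_bio *next;
};

/* Count discard segments by walking the bio chain: one segment per bio. */
static unsigned int count_discard_segments(const struct sketch_bio *head)
{
    unsigned int nr = 0;

    for (const struct sketch_bio *b = head; b != NULL; b = b->next) {
        assert(nr < UINT_MAX); /* guard against counter overflow */
        nr++;
    }
    return nr;
}
```

A request made of three merged discard bios is reported as three segments by this sketch, rather than the single segment described as the bug.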
2021-03-11 | block: Discard page cache of zone reset target range | Shin'ichiro Kawasaki | 1 | -2/+36
When a zone reset ioctl and a data read race for the same zone on zoned block devices, the data read leaves a stale page cache even though the zone reset ioctl zero-clears all the zone data on the device. To avoid reading non-zero data from the stale page cache after zone reset, discard the page cache of reset target zones in blkdev_zone_mgmt_ioctl(). Introduce the helper function blkdev_truncate_zone_range() to discard the page cache. Ensure the page cache is discarded by calling the helper function before and after zone reset, in the same manner as fallocate does. This patch can be applied back to the stable kernel version v5.10.y. Rework is needed for older stable kernels. Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls") Cc: <stable@vger.kernel.org> # 5.10+ Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210311072546.678999-1-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-11 | block: Suppress uevent for hidden device when removed | Daniel Wagner | 1 | -3/+1
register_disk() suppresses uevents for devices with the GENHD_FL_HIDDEN flag but enables uevents again at the end in order to announce the disk after possible partitions are created. When the device is removed the uevents are still on and userland sees 'remove' messages for devices which were never 'add'ed to the system.
 KERNEL[95481.571887] remove /devices/virtual/nvme-fabrics/ctl/nvme5/nvme0c5n1 (block)
Let's suppress the uevents for GENHD_FL_HIDDEN by not enabling the uevents at all. Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin Wilck <mwilck@suse.com> Link: https://lore.kernel.org/r/20210311151917.136091-1-dwagner@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-11 | block: rename BIO_MAX_PAGES to BIO_MAX_VECS | Christoph Hellwig | 5 | -13/+13
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop confusing users of the bio API. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10 | block: Fix REQ_OP_ZONE_RESET_ALL handling | Damien Le Moal | 1 | -1/+1
Similarly to a single zone reset operation (REQ_OP_ZONE_RESET), execute REQ_OP_ZONE_RESET_ALL operations with REQ_SYNC set. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-05 | blk-cgroup: Fix the recursive blkg rwstat | Xunlei Pang | 1 | -1/+2
The current blkio.throttle.io_service_bytes_recursive doesn't work correctly. As an example, for the following blkcg hierarchy (1GB READ made in test1, 512MB READ in test2):
      test
     /    \
  test1   test2
 $ head -n 1 test/test1/blkio.throttle.io_service_bytes_recursive
 8:0 Read 1073684480
 $ head -n 1 test/test2/blkio.throttle.io_service_bytes_recursive
 8:0 Read 537448448
 $ head -n 1 test/blkio.throttle.io_service_bytes_recursive
 8:0 Read 537448448
Clearly, the above data for "test" reflects "test2", not "test1"+"test2". Compute the correct sum in blkg_rwstat_recursive_sum(). Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
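The result expected from a recursive statistic can be sketched with a small tree walk. This is a user-space illustration only: `struct cg_node` and `recursive_read_bytes()` are hypothetical, and the sketch does not reproduce how blkg_rwstat_recursive_sum() is implemented or how it was fixed. It only shows that each child's total should be accumulated into the parent's sum rather than replacing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 4

/* Hypothetical cgroup node; not the kernel's blkcg structures. */
struct cg_node {
    uint64_t read_bytes;                     /* bytes read by this node itself */
    struct cg_node *children[MAX_CHILDREN];  /* direct children */
    size_t nr_children;
};

/* Sum the node's own counter and the recursive totals of all its children. */
static uint64_t recursive_read_bytes(const struct cg_node *n)
{
    assert(n != NULL && n->nr_children <= MAX_CHILDREN);

    uint64_t sum = n->read_bytes;
    for (size_t i = 0; i < n->nr_children; i++)
        sum += recursive_read_bytes(n->children[i]); /* accumulate, never overwrite */
    return sum;
}
```

With the figures from the commit message (1073684480 bytes in test1, 537448448 bytes in test2, and assuming no I/O charged directly to test), the sketch returns 1611132928 for test.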
2021-03-05 | scsi: blk-mq: Return budget token from .get_budget callback | Ming Lei | 3 | -19/+59
SCSI uses a global atomic variable to track queue depth for each LUN/request queue. This doesn't scale well when there are lots of CPU cores and the disk is very fast. It has been observed that IOPS is affected a lot by tracking queue depth via sdev->device_busy in the I/O path. Return budget token from .get_budget callback. The budget token can be passed to driver so that we can replace the atomic variable with sbitmap_queue and alleviate the scaling problems that way. Link: https://lore.kernel.org/r/20210122023317.687987-9-ming.lei@redhat.com Cc: Omar Sandoval <osandov@fb.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com> Cc: Ewan D. Milne <emilne@redhat.com> Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-05 | scsi: sbitmap: Move allocation hint into sbitmap | Ming Lei | 2 | -2/+2
Allocation hint should have belonged to sbitmap. Also, when sbitmap's depth is high and there is no need to use multiple wakeup queues, users can benefit from the percpu allocation hint too. Move allocation hint into sbitmap, then SCSI device queue can benefit from allocation hint when converting to plain sbitmap. Convert vhost/scsi.c to use sbitmap allocation with percpu alloc hint. This is more efficient than the previous approach. Link: https://lore.kernel.org/r/20210122023317.687987-5-ming.lei@redhat.com Cc: Omar Sandoval <osandov@fb.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com> Cc: Ewan D. Milne <emilne@redhat.com> Cc: Mike Christie <michael.christie@oracle.com> Cc: virtualization@lists.linux-foundation.org Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-05 | scsi: sbitmap: Maintain allocation round_robin in sbitmap | Ming Lei | 2 | -2/+3
Currently the allocation round_robin info is maintained by sbitmap_queue. However, bit allocation really belongs to sbitmap. Move it there. Link: https://lore.kernel.org/r/20210122023317.687987-3-ming.lei@redhat.com Cc: Omar Sandoval <osandov@fb.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com> Cc: Ewan D. Milne <emilne@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: virtualization@lists.linux-foundation.org Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-02 | block/bfq: update comments and default value in docs for fifo_expire | Joseph Qi | 1 | -1/+1
Correct the comments, since bfq_fifo_expire[0] is for async requests, while bfq_fifo_expire[1] is for sync requests. Also update the docs: according to the source code, the default fifo_expire_async is 250ms, and fifo_expire_sync is 125ms. Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-01 | block: Drop leftover references to RQF_SORTED | Jean Delvare | 2 | -6/+1
Commit a1ce35fa49852db60fc6e268038530be533c5b15 ("block: remove dead elevator code") removed all users of RQF_SORTED. However it is still defined, and there is one reference left to it (which in effect is dead code). Clear it all up. Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-01 | block: revert "block: fix bd_size_lock use" | Damien Le Moal | 2 | -7/+4
With the removal of the skd driver, using IRQ safe locking of a bdev bd_size_lock spinlock to protect the bdev inode size is not necessary anymore as there is no other known driver using this lock under an IRQ disabled context (e.g. calling set_capacity() with IRQ disabled). Revert commit 0fe37724f8e7 ("block: fix bd_size_lock use") which introduced the IRQ safe change. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-28 | Merge tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block | Linus Torvalds | 14 | -92/+79
Pull more block updates from Jens Axboe: "A few stragglers (and one due to me missing it originally), and fixes for changes in this merge window mostly. In particular: - blktrace cleanups (Chaitanya, Greg) - Kill dead blk_pm_* functions (Bart) - Fixes for the bio alloc changes (Christoph) - Fix for the partition changes (Christoph, Ming) - Fix for turning off iopoll with polled IO inflight (Jeffle) - nbd disconnect fix (Josef) - loop fsync error fix (Mauricio) - kyber update depth fix (Yang) - max_sectors alignment fix (Mikulas) - Add bio_max_segs helper (Matthew)" * tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block: (21 commits) block: Add bio_max_segs blktrace: fix documentation for blk_fill_rw() block: memory allocations in bounce_clone_bio must not fail block: remove the gfp_mask argument to bounce_clone_bio block: fix bounce_clone_bio for passthrough bios block-crypto-fallback: use a bio_set for splitting bios block: fix logging on capacity change blk-settings: align max_sectors on "logical_block_size" boundary block: reopen the device in blkdev_reread_part block: don't skip empty device in in disk_uevent blktrace: remove debugfs file dentries from struct blk_trace nbd: handle device refs for DESTROY_ON_DISCONNECT properly kyber: introduce kyber_depth_updated() loop: fix I/O error on fsync() in detached loop devices block: fix potential IO hang when turning off io_poll block: get rid of the trace rq insert wrapper blktrace: fix blk_rq_merge documentation blktrace: fix blk_rq_issue documentation blktrace: add blk_fill_rwbs documentation comment block: remove superfluous param in blk_fill_rwbs() ...
2021-02-27 | block: Add bio_max_segs | Matthew Wilcox (Oracle) | 1 | -3/+1
It's often inconvenient to use BIO_MAX_PAGES due to min() requiring the sign to be the same. Introduce bio_max_segs() and change BIO_MAX_PAGES to be unsigned to make it easier for the users. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-24 | block: memory allocations in bounce_clone_bio must not fail | Christoph Hellwig | 1 | -3/+2
The caller can't cope with a failure from bounce_clone_bio, so use __GFP_NOFAIL for the passthrough case. bio_alloc_bioset already won't fail due to the use of mempools. And yes, we need to get rid of this block layer bouncing code entirely sooner or later. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-24 | block: remove the gfp_mask argument to bounce_clone_bio | Christoph Hellwig | 1 | -6/+6
The only caller always passes GFP_NOIO. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-24 | block: fix bounce_clone_bio for passthrough bios | Christoph Hellwig | 1 | -8/+9
Now that bio_alloc_bioset does not fall back to kmalloc for a NULL bio_set, handle that case explicitly and simplify the calling conventions. Based on an earlier patch from Chaitanya Kulkarni. Fixes: 3175199ab0ac ("block: split bio_kmalloc from bio_alloc_bioset") Reported-by: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-24 | block-crypto-fallback: use a bio_set for splitting bios | Christoph Hellwig | 1 | -2/+10
bio_split with a NULL bs argument used to fall back to kmalloc the bio, which does not guarantee forward progress and could lead to deadlocks. Now that the overloading of the NULL bs argument to bio_alloc_bioset has been removed it crashes instead. Fix all that by using a specially crafted bioset. Fixes: 3175199ab0ac ("block: split bio_kmalloc from bio_alloc_bioset") Reported-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-24 | block: fix logging on capacity change | Ming Lei | 1 | -1/+1
The local variable 'capacity' stores the previous disk capacity, and the 'size' variable records the latest disk capacity, so swap them to fix the logging on capacity change. Cc: Christoph Hellwig <hch@lst.de> Fixes: a782483cc1f8 ("block: remove the nr_sects field in struct hd_struct") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-24 | blk-settings: align max_sectors on "logical_block_size" boundary | Mikulas Patocka | 1 | -0/+12
We get I/O errors when we run md-raid1 on top of dm-integrity on top of a ramdisk.
 device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
 device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
 device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
 device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
 device-mapper: integrity: Bio not aligned on 8 sectors: 0x8048, 0xff
 device-mapper: integrity: Bio not aligned on 8 sectors: 0x8147, 0xff
 device-mapper: integrity: Bio not aligned on 8 sectors: 0x8246, 0xff
 device-mapper: integrity: Bio not aligned on 8 sectors: 0x8345, 0xbb
The ramdisk device has logical_block_size 512 and max_sectors 255. The dm-integrity device uses logical_block_size 4096 and it doesn't affect the "max_sectors" value - thus, it inherits 255 from the ramdisk. So, we have a device with max_sectors not aligned on logical_block_size. The md-raid device sees that the underlying leg has max_sectors 255 and it will split the bios on a 255-sector boundary, making the bios unaligned on logical_block_size. In order to fix the bug, we round down max_sectors to a multiple of logical_block_size. Cc: stable@vger.kernel.org Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
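The arithmetic behind the fix described above can be worked through in a short sketch. `align_max_sectors()` and `SKETCH_SECTOR_SHIFT` are hypothetical names chosen for illustration, and the sketch assumes 512-byte sectors and a power-of-two logical block size; it is not the code added to blk-settings and ignores any edge cases the kernel handles (for example a max_sectors smaller than one logical block).

```c
#include <assert.h>

/* Assumption: a sector is 512 bytes (2^9). */
#define SKETCH_SECTOR_SHIFT 9

/* Round max_sectors down to a multiple of the logical block size,
 * where the logical block size is given in bytes. */
static unsigned int align_max_sectors(unsigned int max_sectors,
                                      unsigned int logical_block_size)
{
    /* The block size must be a power of two and at least one sector. */
    assert(logical_block_size >= 512 &&
           (logical_block_size & (logical_block_size - 1)) == 0);

    unsigned int sectors_per_block = logical_block_size >> SKETCH_SECTOR_SHIFT;

    return max_sectors - (max_sectors % sectors_per_block);
}
```

With the values from the commit message, a 4096-byte logical block spans 8 sectors, so max_sectors 255 is rounded down to 248, and bios split on a 248-sector boundary stay aligned on 8 sectors. With logical_block_size 512 the value 255 is left unchanged.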
2021-02-24 | block: reopen the device in blkdev_reread_part | Christoph Hellwig | 1 | -7/+14
Historically the BLKRRPART ioctls called into the now defunct ->revalidate method, which caused the sd driver to check if any media is present. When the ->revalidate method was removed this revalidation was lost, leading to lots of I/O errors when using the eject command. Fix this by reopening the device to rescan the partitions, and thus calling the revalidation logic in the sd driver. Fixes: 471bd0af544b ("sd: use bdev_check_media_change") Reported-by: Tom Seewald <tseewald@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Tom Seewald <tseewald@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-24 | block: don't skip empty device in in disk_uevent | Christoph Hellwig | 1 | -1/+1
Restore the previous behavior by using the correct flag for the whole device ("part0"). Fixes: 99dfc43ecbf6 ("block: use ->bi_bdev for bio based I/O accounting") Reported-by: John Stultz <john.stultz@linaro.org> Tested-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-22 | kyber: introduce kyber_depth_updated() | Yang Yang | 1 | -16/+13
A hang occurs when the user changes the scheduler queue depth by writing to the 'nr_requests' sysfs file of that device. The details of the environment in which we found the problem are as follows:
  an eMMC block device
  total driver tags: 16
  default queue_depth: 32
  kqd->async_depth initialized in kyber_init_sched() with queue_depth=32
Then we change queue_depth to 256, by writing to the 'nr_requests' sysfs file. But kqd->async_depth is not updated after queue_depth changes. Now the value of async depth is too small for queue_depth=256, which may cause a hang. This patch introduces kyber_depth_updated(), so that kyber can update async depth when queue depth changes. Signed-off-by: Yang Yang <yang.yang@vivo.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-22 | Merge tag 'for-5.12/block-ipi-2021-02-21' of git://git.kernel.dk/linux-block | Linus Torvalds | 1 | -68/+41
Pull block IPI updates from Jens Axboe: "Avoid IRQ locking for the block IPI handling (Sebastian Andrzej Siewior)" * tag 'for-5.12/block-ipi-2021-02-21' of git://git.kernel.dk/linux-block: blk-mq: Use llist_head for blk_cpu_done blk-mq: Always complete remote completions requests in softirq smp: Process pending softirqs in flush_smp_call_function_from_idle()
2021-02-22 | Merge tag 'for-5.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm | Linus Torvalds | 2 | -0/+147
Pull device mapper updates from Mike Snitzer: - Fix DM integrity's HMAC support to provide enhanced security of internal_hash and journal_mac capabilities. - Various DM writecache fixes to address performance, fix table output to match what was provided at table creation, fix writing beyond end of device when shrinking underlying data device, and a couple other small cleanups. - Add DM crypt support for using trusted keys. - Fix deadlock when swapping to DM crypt device by throttling number of in-flight REQ_SWAP bios. Implemented in DM core so that other bio-based targets can opt-in by setting ti->limit_swap_bios. - Fix various inverted logic bugs in the .iterate_devices callout functions that are used to assess if specific feature or capability is supported across all devices being combined/stacked by DM. - Fix DM era target bugs that exposed users to lost writes or memory leaks. - Add DM core support for passing through inline crypto support of underlying devices. Includes block/keyslot-manager changes that enable extending this support to DM. - Various small fixes and cleanups (spelling fixes, front padding calculation cleanup, cleanup conditional zoned support in targets, etc).
* tag 'for-5.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (31 commits) dm: fix deadlock when swapping to encrypted device dm: simplify target code conditional on CONFIG_BLK_DEV_ZONED dm: set DM_TARGET_PASSES_CRYPTO feature for some targets dm: support key eviction from keyslot managers of underlying devices dm: add support for passing through inline crypto support block/keyslot-manager: Introduce functions for device mapper support block/keyslot-manager: Introduce passthrough keyslot manager dm era: only resize metadata in preresume dm era: Use correct value size in equality function of writeset tree dm era: Fix bitset memory leaks dm era: Verify the data block size hasn't changed dm era: Reinitialize bitset cache before digesting a new writeset dm era: Update in-core bitset after committing the metadata dm era: Recover committed writeset after crash dm writecache: use bdev_nr_sectors() instead of open-coded equivalent dm writecache: fix writing beyond end of underlying device when shrinking dm table: remove needless request_queue NULL pointer checks dm table: fix zoned iterate_devices based device capability checks dm table: fix DAX iterate_devices based device capability checks dm table: fix iterate_devices based device capability checks ...
2021-02-22 | Merge tag 'mmc-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc | Linus Torvalds | 1 | -0/+29
Pull MMC updates from Ulf Hansson: "MMC core: - Add support for eMMC inline encryption - Add a helper function to parse DT properties for clock phases - Some improvements and cleanups for the mmc_test module MMC host: - android-goldfish: Remove driver - cqhci: Add support for eMMC inline encryption - dw_mmc-zx: Remove driver - meson-gx: Extend support for scatter-gather to allow SD_IO_RW_EXTENDED - mmci: Add support for probing bus voltage level translator - mtk-sd: Address race condition for request timeouts - sdhci_am654: Add Support for the variant on TI's AM64 SoC - sdhci-esdhc-imx: Prevent kernel panic at ->remove() - sdhci-iproc: Add ACPI bindings for the RPi to enable SD and WiFi on RPi4 - sdhci-msm: Add Inline Crypto Engine support - sdhci-msm: Use actual_clock to improve timeout calculations - sdhci-of-aspeed: Add Andrew Jeffery as maintainer - sdhci-of-aspeed: Extend clock support for the AST2600 variant - sdhci-pci-gli: Increase idle period for low power state for GL9763E - sdhci-pci-o2micro: Make tuning for SDR104 HW more robust - sdhci-sirf: Remove driver - sdhci-xenon: Add support for the AP807 variant - sunxi-mmc: Add support for the A100 variant - sunxi-mmc: Ensure host is suspended during system sleep - tmio: Add detection of data timeout errors - tmio/renesas_sdhi: Extend support for retuning - renesas_sdhi_internal_dmac: Add support for the ->pre|post_req() ops" * tag 'mmc-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (86 commits) mmc: sdhci-esdhc-imx: fix kernel panic when remove module mmc: host: Retire MMC_GOLDFISH mmc: cb710: Use new tasklet API mmc: sdhci-pci-o2micro: Bug fix for SDR104 HW tuning failure mmc: mmc_test: use erase_arg for mmc_erase command mmc: wbsd: Use new tasklet API mmc: via-sdmmc: Use new tasklet API mmc: uniphier-sd: Use new tasklet API mmc: tifm_sd: Use new tasklet API mmc: s3cmci: Use new tasklet API mmc: omap: Use new tasklet API mmc: dw_mmc: Use new tasklet API mmc: au1xmmc: Use new tasklet API 
mmc: atmel-mci: Use new tasklet API mmc: cavium: Replace spin_lock_irqsave with spin_lock in hard IRQ mmc: queue: Remove unused define mmc: core: Drop redundant bouncesz from struct mmc_card mmc: core: Drop redundant member in struct mmc host mmc: core: Use host instead of card argument to mmc_spi_send_csd() mmc: core: Exclude unnecessary header file ...
2021-02-22 | block: fix potential IO hang when turning off io_poll | Jeffle Xu | 1 | -2/+5
The QUEUE_FLAG_POLL flag will be cleared when turning off 'io_poll', while at that moment there may be IOs stuck in the hw queue uncompleted. The following polling routine won't help reap these IOs, since blk_poll() will return immediately because of the cleared QUEUE_FLAG_POLL flag. Thus these IOs will hang until they finally time out. The hang can be observed by 'fio --engine=io_uring iodepth=1', while turning off 'io_poll' at the same time. To fix this, freeze and flush the request queue first when turning off 'io_poll'. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-22 | block: get rid of the trace rq insert wrapper | Chaitanya Kulkarni | 6 | -10/+10
Get rid of the wrapper for trace_block_rq_insert() and call the function directly. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-22 | block: Remove unused blk_pm_*() function definitions | Bart Van Assche | 1 | -38/+0
Commit a1ce35fa4985 ("block: remove dead elevator code") removed the last callers of blk_pm_requeue_request(), blk_pm_add_request() and blk_pm_put_request(). Hence remove the definitions of these functions. Removing these functions removes all users of the struct request nr_pending member. Hence also remove 'nr_pending'. Note: 'nr_pending' is no longer used since commit 7cedffec8e75 ("block: Make blk_get_request() block for non-PM requests while suspended"). Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 | Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block | Linus Torvalds | 26 | -930/+848
Pull core block updates from Jens Axboe:

"Another nice round of removing more code than what is added, mostly due to Christoph's relentless pursuit of tech debt removal/cleanups. This pull request contains:
 - Two series of BFQ improvements (Paolo, Jan, Jia)
 - Block iov_iter improvements (Pavel)
 - bsg error path fix (Pan)
 - blk-mq scheduler improvements (Jan)
 - -EBUSY discard fix (Jan)
 - bvec allocation improvements (Ming, Christoph)
 - bio allocation and init improvements (Christoph)
 - Store bdev pointer in bio instead of gendisk + partno (Christoph)
 - Block trace point cleanups (Christoph)
 - hard read-only vs read-only split (Christoph)
 - Block based swap cleanups (Christoph)
 - Zoned write granularity support (Damien)
 - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)"

* tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits)
  mm: simplify swapdev_block
  sd_zbc: clear zone resources for non-zoned case
  block: introduce blk_queue_clear_zone_settings()
  zonefs: use zone write granularity as block size
  block: introduce zone_write_granularity limit
  block: use blk_queue_set_zoned in add_partition()
  nullb: use blk_queue_set_zoned() to setup zoned devices
  nvme: cleanup zone information initialization
  block: document zone_append_max_bytes attribute
  block: use bi_max_vecs to find the bvec pool
  md/raid10: remove dead code in reshape_request
  block: mark the bio as cloned in bio_iov_bvec_set
  block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
  block: remove a layer of indentation in bio_iov_iter_get_pages
  block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
  block: remove the 1 and 4 vec bvec_slabs entries
  block: streamline bvec_alloc
  block: factor out a bvec_alloc_gfp helper
  block: move struct biovec_slab to bio.c
  block: reuse BIO_INLINE_VECS for integrity bvecs
  ...
2021-02-12 | blk-mq: Use llist_head for blk_cpu_done | Sebastian Andrzej Siewior | 1 | -56/+41
With llist_head it is possible to avoid the locking (the irq-off region) when items are added. This makes it possible to add items on a remote CPU without additional locking. llist_add() returns true if the list was previously empty. This can be used to invoke the SMP function call / raise the softirq only if the first item was added (otherwise it is already pending). This simplifies the code a little and reduces the IRQ-off regions. blk_mq_raise_softirq() needs a preempt-disable section to ensure the request is enqueued on the same CPU on which the softirq is raised. Some callers (USB-storage) invoke this path in preemptible context. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
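The "returns true if the list was previously empty" behaviour can be illustrated with a small userspace sketch. This is not the kernel's llist implementation: struct lnode, struct lhead, lhead_add() and lhead_del_all() are hypothetical names, and C11 atomics stand in for the kernel's primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel's llist types. */
struct lnode {
    struct lnode *next;
};

struct lhead {
    _Atomic(struct lnode *) first;
};

/*
 * Push a node onto the list without taking a lock.
 * Returns true if the list was empty before the push, which means the
 * caller is the one who has to notify the consumer (in the commit's
 * terms: raise the softirq or send the SMP function call).
 */
static bool lhead_add(struct lhead *head, struct lnode *node)
{
    struct lnode *old = atomic_load_explicit(&head->first,
                                             memory_order_relaxed);

    do {
        /* On CAS failure 'old' is refreshed, so relink and retry. */
        node->next = old;
    } while (!atomic_compare_exchange_weak_explicit(&head->first, &old, node,
                                                    memory_order_release,
                                                    memory_order_relaxed));
    return old == NULL;
}

/* Detach the whole list in one step; the consumer walks it privately. */
static struct lnode *lhead_del_all(struct lhead *head)
{
    return atomic_exchange_explicit(&head->first, (struct lnode *)NULL,
                                    memory_order_acquire);
}
```

A producer would call lhead_add() and notify the consumer only when it returns true; later producers see false and can skip the notification because one is already pending.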
2021-02-12 | blk-mq: Always complete remote completions requests in softirq | Sebastian Andrzej Siewior | 1 | -13/+1
Controllers with multiple queues have their IRQ handlers pinned to a CPU. The core shouldn't need to complete the request on a remote CPU. Remove this case and always raise the softirq to complete the request. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-11 | dm: support key eviction from keyslot managers of underlying devices | Satya Tangirala | 1 | -0/+1
Now that device mapper supports inline encryption, add the ability to evict keys from all underlying devices. When an upper layer requests a key eviction, we simply iterate through all underlying devices and evict that key from each device. Co-developed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Satya Tangirala <satyat@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-02-11 | block/keyslot-manager: Introduce functions for device mapper support | Satya Tangirala | 1 | -0/+107
Introduce blk_ksm_update_capabilities() to update the capabilities of a keyslot manager (ksm) in-place. The pointer to a ksm in a device's request queue may not be easily replaced, because upper layers like the filesystem might access it (e.g. for programming keys/checking capabilities) at the same time the device wants to replace that request queue's ksm (and free the old ksm's memory). This function allows the device to update the capabilities of the ksm in its request queue directly. Devices can safely update the ksm this way without any synchronization with upper layers *only* if the updated (new) ksm continues to support all the crypto capabilities that the old ksm did (see description below for blk_ksm_is_superset() for why this is so). Also introduce blk_ksm_is_superset() which checks whether one ksm's capabilities are a (not necessarily strict) superset of another ksm's. The blk-crypto framework requires that crypto capabilities that were advertised when a bio was created continue to be supported by the device until that bio is ended - in practice this probably means that a device's advertised crypto capabilities can *never* "shrink" (since there's no synchronization between bio creation and when a device may want to change its advertised capabilities) - so a previously advertised crypto capability must always continue to be supported. This function can be used to check that a new ksm is a valid replacement for an old ksm. Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
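The superset requirement can be sketched as a bitmask comparison. The structure below is a simplified, hypothetical model and not the kernel's struct blk_keyslot_manager: NUM_MODES, struct ksm_caps, its fields and caps_is_superset() are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_MODES 4 /* hypothetical number of crypto modes */

/* Hypothetical, simplified view of a keyslot manager's capabilities. */
struct ksm_caps {
    unsigned int max_dun_bytes;              /* largest DUN size supported */
    unsigned int data_unit_sizes[NUM_MODES]; /* per-mode bitmask of sizes */
};

/*
 * True if every capability advertised by 'subset' is also advertised by
 * 'superset'. The superset need not be strict: equal sets also pass.
 */
static bool caps_is_superset(const struct ksm_caps *superset,
                             const struct ksm_caps *subset)
{
    for (size_t i = 0; i < NUM_MODES; i++) {
        /* A bit set in subset but clear in superset is a lost capability. */
        if (subset->data_unit_sizes[i] & ~superset->data_unit_sizes[i])
            return false;
    }
    return subset->max_dun_bytes <= superset->max_dun_bytes;
}
```

Under this model, an in-place update would only be accepted when caps_is_superset(new, old) holds, so a capability that was advertised when a bio was created never disappears while that bio is in flight.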
2021-02-11 | block/keyslot-manager: Introduce passthrough keyslot manager | Satya Tangirala | 1 | -0/+39
The device mapper may map over devices that have inline encryption capabilities, and to make use of those capabilities, the DM device must itself advertise those inline encryption capabilities. One way to do this would be to have the DM device set up a keyslot manager with a "sufficiently large" number of keyslots, but that would use a lot of memory. Also, the DM device itself has no "keyslots", and it doesn't make much sense to talk about "programming a key into a DM device's keyslot manager", so all that extra memory used to represent those keyslots is just wasted. All a DM device really needs to be able to do is advertise the crypto capabilities of the underlying devices in a coherent manner and expose a way to evict keys from the underlying devices. There are also devices with inline encryption hardware that do not have a limited number of keyslots. One can send a raw encryption key along with a bio to these devices (as opposed to typical inline encryption hardware that requires users to first program a raw encryption key into a keyslot, and send the index of that keyslot along with the bio). These devices also only need the same things from the keyslot manager that DM devices need - a way to advertise crypto capabilities and potentially a way to expose a function to evict keys from hardware. So we introduce a "passthrough" keyslot manager that provides a way to represent a keyslot manager that doesn't have just a limited number of keyslots, and that does not require keys to be programmed into keyslots. DM devices can set up a passthrough keyslot manager in their request queues, and advertise appropriate crypto capabilities based on those of the underlying devices. Blk-crypto does not attempt to program keys into any keyslots in the passthrough keyslot manager. Instead, if/when the bio is resubmitted to the underlying device, blk-crypto will try to program the key into the underlying device's keyslot manager.
Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-02-10 | block: introduce blk_queue_clear_zone_settings() | Damien Le Moal | 3 | -0/+21
Introduce the internal function blk_queue_clear_zone_settings() to clean up all limits and resources related to zoned block devices. This new function is called from blk_queue_set_zoned() when a disk's zoned model is set to BLK_ZONED_NONE. This particular case can happen when a partition is created on a host-aware SCSI disk. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | block: introduce zone_write_granularity limit | Damien Le Moal | 2 | -1/+44
Per the ZBC and ZAC specifications, host-managed SMR hard disks mandate that all writes into sequential write required zones be aligned to the device physical block size. However, NVMe ZNS does not have this constraint and allows write operations into sequential zones to be aligned to the device logical block size. This inconsistency does not help with software portability across device types. To solve this, introduce the zone_write_granularity queue limit to indicate the alignment constraint, in bytes, of write operations into zones of a zoned block device. This new limit is exported as a read-only sysfs queue attribute, and the helper blk_queue_zone_write_granularity() is introduced for drivers to set this limit. The function blk_queue_set_zoned() is modified to set this new limit to the device logical block size by default. NVMe ZNS devices as well as zoned nullb devices use this default value as is. The scsi disk driver is modified to execute the blk_queue_zone_write_granularity() helper to set the zone write granularity of host-managed SMR disks to the disk physical block size. The accessor functions queue_zone_write_granularity() and bdev_zone_write_granularity() are also introduced. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
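How a consumer might use such a limit can be sketched with a plain alignment check. The helper below is hypothetical userspace code, not part of the kernel API; the block sizes mentioned afterwards are illustrative values only.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if a write of 'len' bytes at byte
 * offset 'pos' respects a zone write granularity of 'granularity' bytes.
 */
static bool zone_write_is_aligned(uint64_t pos, uint64_t len,
                                  uint32_t granularity)
{
    if (granularity == 0)
        return false; /* no valid granularity reported */

    /* Both the start offset and the length must be multiples. */
    return pos % granularity == 0 && len % granularity == 0;
}
```

For example, assuming an SMR disk with 4096-byte physical blocks, a write at offset 512 would be rejected, while the same write could be valid on a device whose granularity equals a 512-byte logical block size. Software that checks against the single limit does not need to know which device type it is talking to.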
2021-02-10 | block: use blk_queue_set_zoned in add_partition() | Damien Le Moal | 1 | -1/+1
When changing the zoned model of host-aware zoned block devices, use blk_queue_set_zoned() instead of directly assigning the gendisk queue zoned limit. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-09 | block: add bio_add_zone_append_page | Johannes Thumshirn | 1 | -0/+33
Add bio_add_zone_append_page(), a wrapper around bio_add_hw_page() which is intended to be used by file systems that directly add pages to a bio instead of using bio_iov_iter_get_pages(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 | block: use bi_max_vecs to find the bvec pool | Christoph Hellwig | 3 | -71/+50
Instead of encoding the bvec pool using magic bio flags, just use a helper to find the pool based on the max_vecs value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
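The idea of deriving the pool from the vector count, rather than storing a pool index in flags, can be sketched as a lookup over pool capacities. This is a simplified illustration, not the kernel helper: the pool_sizes values and the pool_index_for() name are assumptions.

```c
#include <stddef.h>

/* Hypothetical pool capacities, in bio_vecs, ordered smallest first. */
static const unsigned short pool_sizes[] = { 16, 64, 128, 256 };

/* Return the index of the smallest pool that can hold max_vecs, or -1. */
static int pool_index_for(unsigned short max_vecs)
{
    for (size_t i = 0; i < sizeof(pool_sizes) / sizeof(pool_sizes[0]); i++) {
        if (max_vecs <= pool_sizes[i])
            return (int)i;
    }
    return -1; /* larger than any pool */
}
```

Because the mapping is a pure function of max_vecs, the freeing side can recompute the pool from the same value that the allocation used, so no extra state has to be carried in the bio.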
2021-02-08 | block: mark the bio as cloned in bio_iov_bvec_set | Christoph Hellwig | 1 | -1/+1
bio_iov_bvec_set clones the bio_vecs from the iter, and thus should be treated like a cloned bio in every respect. That also includes not touching bi_max_vecs as that is a property of the bio allocation and not its current payload. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-08 | block: set BIO_NO_PAGE_REF in bio_iov_bvec_set | Christoph Hellwig | 1 | -3/+2
bio_iov_bvec_set assigns the foreign bvec, so setting the NO_PAGE_REF directly there seems like the best fit. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>