summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2020-05-14block: Inline encryption support for blk-mqSatya Tangirala14-7/+684
We must have some way of letting a storage device driver know what encryption context it should use for en/decrypting a request. However, it's the upper layers (like the filesystem/fscrypt) that know about and manages encryption contexts. As such, when the upper layer submits a bio to the block layer, and this bio eventually reaches a device driver with support for inline encryption, the device driver will need to have been told the encryption context for that bio. We want to communicate the encryption context from the upper layer to the storage device along with the bio, when the bio is submitted to the block layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can represent an encryption context (note that we can't use the bi_private field in struct bio to do this because that field does not function to pass information across layers in the storage stack). We also introduce various functions to manipulate the bio_crypt_ctx and make the bio/request merging logic aware of the bio_crypt_ctx. We also make changes to blk-mq to make it handle bios with encryption contexts. blk-mq can merge many bios into the same request. These bios need to have contiguous data unit numbers (the necessary changes to blk-merge are also made to ensure this) - as such, it suffices to keep the data unit number of just the first bio, since that's all a storage driver needs to infer the data unit number to use for each data block in each bio in a request. blk-mq keeps track of the encryption context to be used for all the bios in a request with the request's rq_crypt_ctx. When the first bio is added to an empty request, blk-mq will program the encryption context of that bio into the request_queue's keyslot manager, and store the returned keyslot in the request's rq_crypt_ctx. All the functions to operate on encryption contexts are in blk-crypto.c. Upper layers only need to call bio_crypt_set_ctx with the encryption key, algorithm and data_unit_num; they don't have to worry about getting a keyslot for each encryption context, as blk-mq/blk-crypto handles that. Blk-crypto also makes it possible for request-based layered devices like dm-rq to make use of inline encryption hardware by cloning the rq_crypt_ctx and programming a keyslot in the new request_queue when necessary. Note that any user of the block layer can submit bios with an encryption context, such as filesystems, device-mapper targets, etc. Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14block: Keyslot Manager for Inline EncryptionSatya Tangirala6-0/+550
Inline Encryption hardware allows software to specify an encryption context (an encryption key, crypto algorithm, data unit num, data unit size) along with a data transfer request to a storage device, and the inline encryption hardware will use that context to en/decrypt the data. The inline encryption hardware is part of the storage device, and it conceptually sits on the data path between system memory and the storage device. Inline Encryption hardware implementations often function around the concept of "keyslots". These implementations often have a limited number of "keyslots", each of which can hold a key (we say that a key can be "programmed" into a keyslot). Requests made to the storage device may have a keyslot and a data unit number associated with them, and the inline encryption hardware will en/decrypt the data in the requests using the key programmed into that associated keyslot and the data unit number specified with the request. As keyslots are limited, and programming keys may be expensive in many implementations, and multiple requests may use exactly the same encryption contexts, we introduce a Keyslot Manager to efficiently manage keyslots. We also introduce a blk_crypto_key, which will represent the key that's programmed into keyslots managed by keyslot managers. The keyslot manager also functions as the interface that upper layers will use to program keys into inline encryption hardware. For more information on the Keyslot Manager, refer to documentation found in block/keyslot-manager.c and linux/keyslot-manager.h. Co-developed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14Documentation: Document the blk-crypto frameworkSatya Tangirala2-0/+264
The blk-crypto framework adds support for inline encryption. There are numerous changes throughout the storage stack. This patch documents the main design choices in the block layer, the API presented to users of the block layer (like fscrypt or layered devices) and the API presented to drivers for adding support for inline encryption. Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14iocost: don't let vrate run wild while there's no saturation signalTejun Heo1-4/+24
When the QoS targets are met and nothing is being throttled, there's no way to tell how saturated the underlying device is - it could be almost entirely idle, at the cusp of saturation or anywhere inbetween. Given that there's no information, it's best to keep vrate as-is in this state. Before 7cd806a9a953 ("iocost: improve nr_lagging handling"), this was the case - if the device isn't missing QoS targets and nothing is being throttled, busy_level was reset to zero. While fixing nr_lagging handling, 7cd806a9a953 ("iocost: improve nr_lagging handling") broke this. Now, while the device is hitting QoS targets and nothing is being throttled, vrate keeps getting adjusted according to the existing busy_level. This led to vrate keeping climing till it hits max when there's an IO issuer with limited request concurrency if the vrate started low. vrate starts getting adjusted upwards until the issuer can issue IOs w/o being throttled. From then on, QoS targets keeps getting met and nothing on the system needs throttling and vrate keeps getting increased due to the existing busy_level. This patch makes the following changes to the busy_level logic. * Reset busy_level if nr_shortages is zero to avoid the above scenario. * Make non-zero nr_lagging block lowering nr_level but still clear positive busy_level if there's clear non-saturation signal - QoS targets are met and nr_shortages is non-zero. nr_lagging's role is preventing adjusting vrate upwards while there are long-running commands and it shouldn't keep busy_level positive while there's clear non-saturation signal. * Restructure code for clarity and add comments. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Andy Newell <newella@fb.com> Fixes: 7cd806a9a953 ("iocost: improve nr_lagging handling") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14block: move blk_io_schedule() out of header fileMing Lei2-12/+15
blk_io_schedule() isn't called from performance sensitive code path, and it is easier to maintain by exporting it as symbol. Also blk_io_schedule() is only called by CONFIG_BLOCK code, so it is safe to do this way. Meantime fixes build failure when CONFIG_BLOCK is off. Cc: Christoph Hellwig <hch@infradead.org> Fixes: e6249cdd46e4 ("block: add blk_io_schedule() for avoiding task hung in sync dio") Reported-by: Satya Tangirala <satyat@google.com> Tested-by: Satya Tangirala <satyat@google.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13zonefs: use REQ_OP_ZONE_APPEND for sync DIOJohannes Thumshirn1-8/+72
Synchronous direct I/O to a sequential write only zone can be issued using the new REQ_OP_ZONE_APPEND request operation. As dispatching multiple BIOs can potentially result in reordering, we cannot support asynchronous IO via this interface. We also can only dispatch up to queue_max_zone_append_sectors() via the new zone-append method and have to return a short write back to user-space in case an IO larger than queue_max_zone_append_sectors() has been issued. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13block: export bio_release_pages and bio_iov_iter_get_pagesJohannes Thumshirn1-0/+2
Export bio_release_pages and bio_iov_iter_get_pages, so they can be used from modular code. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13null_blk: Support REQ_OP_ZONE_APPENDDamien Le Moal1-8/+29
Support REQ_OP_ZONE_APPEND requests for null_blk devices with zoned mode enabled. Use the internally tracked zone write pointer position as the actual write position and return it using the command request __sector field in the case of an mq device and using the command BIO sector in the case of a BIO device. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13scsi: sd_zbc: emulate ZONE_APPEND commandsJohannes Thumshirn3-30/+392
Emulate ZONE_APPEND for SCSI disks using a regular WRITE(16) command with a start LBA set to the target zone write pointer position. In order to always know the write pointer position of a sequential write zone, the write pointer of all zones is tracked using an array of 32bits zone write pointer offset attached to the scsi disk structure. Each entry of the array indicate a zone write pointer position relative to the zone start sector. The write pointer offsets are maintained in sync with the device as follows: 1) the write pointer offset of a zone is reset to 0 when a REQ_OP_ZONE_RESET command completes. 2) the write pointer offset of a zone is set to the zone size when a REQ_OP_ZONE_FINISH command completes. 3) the write pointer offset of a zone is incremented by the number of 512B sectors written when a write, write same or a zone append command completes. 4) the write pointer offset of all zones is reset to 0 when a REQ_OP_ZONE_RESET_ALL command completes. Since the block layer does not write lock zones for zone append commands, to ensure a sequential ordering of the regular write commands used for the emulation, the target zone of a zone append command is locked when the function sd_zbc_prepare_zone_append() is called from sd_setup_read_write_cmnd(). If the zone write lock cannot be obtained (e.g. a zone append is in-flight or a regular write has already locked the zone), the zone append command dispatching is delayed by returning BLK_STS_ZONE_RESOURCE. To avoid the need for write locking all zones for REQ_OP_ZONE_RESET_ALL requests, use a spinlock to protect accesses and modifications of the zone write pointer offsets. This spinlock is initialized from sd_probe() using the new function sd_zbc_init(). Co-developed-by: Damien Le Moal <Damien.LeMoal@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13scsi: sd_zbc: factor out sanity checks for zoned commandsJohannes Thumshirn1-11/+25
Factor sanity checks for zoned commands from sd_zbc_setup_zone_mgmt_cmnd(). This will help with the introduction of an emulated ZONE_APPEND command. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13block: Modify revalidate zonesDamien Le Moal3-3/+11
Modify the interface of blk_revalidate_disk_zones() to add an optional driver callback function that a driver can use to extend processing done during zone revalidation. The callback, if defined, is executed with the device request queue frozen, after all zones have been inspected. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13block: introduce blk_req_zone_write_trylockJohannes Thumshirn2-0/+15
Introduce blk_req_zone_write_trylock(), which either grabs the write-lock for a sequential zone or returns false, if the zone is already locked. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13block: Introduce REQ_OP_ZONE_APPENDKeith Busch8-4/+207
Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned block device. This is a no-merge write operation. A zone append write BIO must: * Target a zoned block device * Have a sector position indicating the start sector of the target zone * The target zone must be a sequential write zone * The BIO must not cross a zone boundary * The BIO size must not be split to ensure that a single range of LBAs is written with a single command. Implement these checks in generic_make_request_checks() using the helper function blk_check_zone_append(). To avoid write append BIO splitting, introduce the new max_zone_append_sectors queue limit attribute and ensure that a BIO size is always lower than this limit. Export this new limit through sysfs and check these limits in bio_full(). Also when a LLDD can't dispatch a request to a specific zone, it will return BLK_STS_ZONE_RESOURCE indicating this request needs to be delayed, e.g. because the zone it will be dispatched to is still write-locked. If this happens set the request aside in a local list to continue trying dispatching requests such as READ requests or a WRITE/ZONE_APPEND requests targetting other zones. This way we can still keep a high queue depth without starving other requests even if one request can't be served due to zone write-locking. Finally, make sure that the bio sector position indicates the actual write position as indicated by the device on completion. Signed-off-by: Keith Busch <kbusch@kernel.org> [ jth: added zone-append specific add_page and merge_page helpers ] Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13block: rename __bio_add_pc_page to bio_add_hw_pageChristoph Hellwig3-29/+45
Rename __bio_add_pc_page() to bio_add_hw_page() and explicitly pass in a max_sectors argument. This max_sectors argument can be used to specify constraints from the hardware. Signed-off-by: Christoph Hellwig <hch@lst.de> [ jth: rebased and made public for blk-map.c ] Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_noJohannes Thumshirn1-0/+10
blk_queue_zone_is_seq() and blk_queue_zone_no() have not been called with CONFIG_BLK_DEV_ZONED disabled until now. The introduction of REQ_OP_ZONE_APPEND will change this, so we need to provide noop fallbacks for the !CONFIG_BLK_DEV_ZONED case. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13block: add blk_io_schedule() for avoiding task hung in sync dioMing Lei4-4/+16
Sync dio could be big, or may take long time in discard or in case of IO failure. We have prevented task hung in submit_bio_wait() and blk_execute_rq(), so apply the same trick for prevent task hung from happening in sync dio. Add helper of blk_io_schedule() and use io_schedule_timeout() to prevent task hung warning. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Cc: Salman Qazi <sqazi@google.com> Cc: Jesse Barnes <jsbarnes@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13block: don't hold part0's refcount in IO pathMing Lei2-9/+8
gendisk can't be gone when there is IO activity, so not hold part0's refcount in IO path. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Cc: Yufen Yu <yuyufen@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13block: re-organize fields of 'struct hd_part'Ming Lei1-7/+8
Put all fields accessed in IO path together at the beginning of the struct, so that all can be fetched in single cacheline. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Cc: Yufen Yu <yuyufen@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13block: only define 'nr_sects_seq' in hd_part for 32bit SMPMing Lei3-2/+11
The seqcount of 'nr_sects_seq' is only needed in case of 32bit SMP, so define it just for 32bit SMP. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Cc: Yufen Yu <yuyufen@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13block: fix use-after-free on cached last_lookup partitionMing Lei3-16/+23
delete_partition() clears the cached last_lookup partition. However the .last_lookup cache may be overwritten by one IO path after it is cleared from delete_partition(). Then another IO path may use the cached deleting partition after hd_struct_free() is called, then use-after-free is triggered on the cached partition. Fixes the issue by the following approach: 1) always get the partition's refcount via hd_struct_try_get() before setting .last_lookup 2) move clearing .last_lookup from delete_partition() to hd_struct_free() which is the release handle of the partition's percpu-refcount, so that no IO path can cache deleteing partition via .last_lookup. It is one candidate approach of Yufen's patch[1] which adds overhead in fast path by indirect lookup which may introduce one extra cacheline in IO path. Also this patch relies on percpu-refcount's protection, and it is easier to understand and verify. [1] https://lore.kernel.org/linux-block/20200109013551.GB9655@ming.t460p/T/#t Reported-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13block: reset mapping if failed to update hardware queue countWeiping Zhang1-1/+1
When we increase hardware queue count, blk_mq_update_queue_map will reset the mapping between cpu and hardware queue base on the hardware queue count(set->nr_hw_queues). The mapping cannot be reset if it encounters error in blk_mq_realloc_hw_ctxs, but the fallback flow will continue using it, then blk_mq_map_swqueue will touch a invalid memory, because the mapping points to a wrong hctx. blktest block/030: null_blk: module loaded Increasing nr_hw_queues to 8 fails, fallback to 1 ================================================================== BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x2f2/0x830 Read of size 8 at addr 0000000000000128 by task nproc/8541 CPU: 5 PID: 8541 Comm: nproc Not tainted 5.7.0-rc4-dbg+ #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014 Call Trace: dump_stack+0xa5/0xe6 __kasan_report.cold+0x65/0xbb kasan_report+0x45/0x60 check_memory_region+0x15e/0x1c0 __kasan_check_read+0x15/0x20 blk_mq_map_swqueue+0x2f2/0x830 __blk_mq_update_nr_hw_queues+0x3df/0x690 blk_mq_update_nr_hw_queues+0x32/0x50 nullb_device_submit_queues_store+0xde/0x160 [null_blk] configfs_write_file+0x1c4/0x250 [configfs] __vfs_write+0x4c/0x90 vfs_write+0x14b/0x2d0 ksys_write+0xdd/0x180 __x64_sys_write+0x47/0x50 do_syscall_64+0x6f/0x310 entry_SYSCALL_64_after_hwframe+0x49/0xb3 Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Tested-by: Bart van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-11bdi: fix up for "remove the name field in struct backing_dev_info"Stephen Rothwell1-1/+0
Fixes: 1cd925d58385 ("bdi: remove the name field in struct backing_dev_info") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10hfs: stop using ioctl_by_bdevChristoph Hellwig1-13/+19
Instead just call the CDROM layer functionality directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10bdi: remove the name field in struct backing_dev_infoChristoph Hellwig6-8/+1
The name is only printed for a not registered bdi in writeback. Use the device name there as is more useful anyway for the unlike case that the warning triggers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10bdi: simplify bdi_allocChristoph Hellwig5-12/+7
Merge the _node vs normal version and drop the superflous gfp_t argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10bdi: remove bdi_register_ownerChristoph Hellwig3-14/+8
Split out a new bdi_set_owner helper to set the owner, and move the policy for creating the bdi name back into genhd.c, where it belongs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10bdi: unexport bdi_register_vaChristoph Hellwig1-1/+0
bdi_register_va is only used by super.c, which can't be modular. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10driver core: remove device_create_vargsChristoph Hellwig2-39/+2
All external users of device_create_vargs are gone, so remove it and open code it in the only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10block: rename blk_mq_alloc_rq_mapsWeiping Zhang1-2/+2
rename blk_mq_alloc_rq_maps to blk_mq_alloc_map_and_requests, this function allocs both map and request, make function name align with funtion. Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10block: rename __blk_mq_alloc_rq_mapWeiping Zhang1-3/+4
rename __blk_mq_alloc_rq_map to __blk_mq_alloc_map_and_request, actually it alloc both map and request, make function name align with function. Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10block: alloc map and request for new hardware queueMing Lei1-12/+12
Alloc new map and request for new hardware queue when increse hardware queue count. Before this patch, it will show a warning for each new hardware queue, but it's not enough, these hctx have no maps and reqeust, when a bio was mapped to these hardware queue, it will trigger kernel panic when get request from these hctx. Test environment: * A NVMe disk supports 128 io queues * 96 cpus in system A corner case can always trigger this panic, there are 96 io queues allocated for HCTX_TYPE_DEFAULT type, the corresponding kernel log: nvme nvme0: 96/0/0 default/read/poll queues. Now we set nvme write queues to 96, then nvme will alloc others(32) queues for read, but blk_mq_update_nr_hw_queues does not alloc map and request for these new added io queues. So when process read nvme disk, it will trigger kernel panic when get request from these hardware context. Reproduce script: nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1) echo $nr > /sys/module/nvme/parameters/write_queues echo 1 > /sys/block/nvme0n1/device/reset_controller dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1 [ 8040.805626] ------------[ cut here ]------------ [ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0 [ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_f ib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfne tlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_ cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm d rm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops [ 8040.805637] ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core] [ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2 [ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019 [ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme] [ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0 [ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 <0f> 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56 [ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246 [ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d [ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90 [ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d [ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008 [ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f [ 8040.805645] FS: 0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000 [ 8040.805646] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0 [ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 8040.805647] PKRU: 55555554 [ 8040.805647] Call Trace: [ 8040.805649] blk_mq_update_nr_hw_queues+0x31b/0x390 [ 8040.805650] nvme_reset_work+0xb4b/0xeab [nvme] [ 8040.805651] process_one_work+0x1a7/0x370 [ 8040.805652] worker_thread+0x1c9/0x380 [ 8040.805653] ? max_active_store+0x80/0x80 [ 8040.805655] kthread+0x112/0x130 [ 8040.805656] ? __kthread_parkme+0x70/0x70 [ 8040.805657] ret_from_fork+0x35/0x40 [ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]--- [ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004 [ 8229.365165] #PF: supervisor read access in kernel mode [ 8229.365178] #PF: error_code(0x0000) - not-present page [ 8229.365191] PGD 0 P4D 0 [ 8229.365201] Oops: 0000 [#1] SMP PTI [ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2 [ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019 [ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250 [ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef <44> 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff [ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246 [ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998 [ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38 [ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198 [ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000 [ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001 [ 8229.365397] FS: 00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000 [ 8229.365415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0 [ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 8229.365476] PKRU: 55555554 [ 8229.365484] Call Trace: [ 8229.365498] ? finish_wait+0x80/0x80 [ 8229.365512] blk_mq_get_request+0xcb/0x3f0 [ 8229.365525] blk_mq_make_request+0x143/0x5d0 [ 8229.365538] generic_make_request+0xcf/0x310 [ 8229.365553] ? scan_shadow_nodes+0x30/0x30 [ 8229.365564] submit_bio+0x3c/0x150 [ 8229.365576] mpage_readpages+0x163/0x1a0 [ 8229.365588] ? blkdev_direct_IO+0x490/0x490 [ 8229.365601] read_pages+0x6b/0x190 [ 8229.365612] __do_page_cache_readahead+0x1c1/0x1e0 [ 8229.365626] ondemand_readahead+0x182/0x2f0 [ 8229.365639] generic_file_buffered_read+0x590/0xab0 [ 8229.365655] new_sync_read+0x12a/0x1c0 [ 8229.365666] vfs_read+0x8a/0x140 [ 8229.365676] ksys_read+0x59/0xd0 [ 8229.365688] do_syscall_64+0x55/0x1d0 [ 8229.365700] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Tested-by: Weiping Zhang <zhangweiping@didiglobal.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10block: save previous hardware queue count before udpateWeiping Zhang1-1/+1
blk_mq_realloc_tag_set_tags will update set->nr_hw_queues, so save old set->nr_hw_queues before call this function. Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10block: free both rq_map and requestWeiping Zhang1-1/+1
Allocation: __blk_mq_alloc_rq_map blk_mq_alloc_rq_map blk_mq_alloc_rq_map tags = blk_mq_init_tags : kzalloc_node: tags->rqs = kcalloc_node tags->static_rqs = kcalloc_node blk_mq_alloc_rqs p = alloc_pages_node tags->static_rqs[i] = p + offset; Free: blk_mq_free_rq_map kfree(tags->rqs); kfree(tags->static_rqs); blk_mq_free_tags kfree(tags); The page allocated in blk_mq_alloc_rqs cannot be released, so we should use blk_mq_free_map_and_requests here. blk_mq_free_map_and_requests blk_mq_free_rqs __free_pages : cleanup for blk_mq_alloc_rqs blk_mq_free_rq_map : cleanup for blk_mq_alloc_rq_map Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10Merge branch 'block-5.7' into for-5.8/blockJens Axboe20-131/+209
Pull in block-5.7 fixes for 5.8. Mostly to resolve a conflict with the blk-iocost changes, but we also need the base of the bdi use-after-free as well as we build on top of it. * block-5.7: nvme: fix possible hang when ns scanning fails during error recovery nvme-pci: fix "slimmer CQ head update" bdi: add a ->dev_name field to struct backing_dev_info bdi: use bdi_dev_name() to get device name bdi: move bdi_dev_name out of line vboxsf: don't use the source name in the bdi name iocost: protect iocg->abs_vdebt with iocg->waitq.lock block: remove the bd_openers checks in blk_drop_partitions nvme: prevent double free in nvme_alloc_ns() error handling null_blk: Cleanup zoned device initialization null_blk: Fix zoned command handling block: remove unused header blk-iocost: Fix error on iocost_ioc_vrate_adj bdev: Reduce time holding bd_mutex in sync in blkdev_close() buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10nvme: fix possible hang when ns scanning fails during error recoverySagi Grimberg1-1/+1
When the controller is reconnecting, the host fails I/O and admin commands as the host cannot reach the controller. ns scanning may revalidate namespaces during that period and it is wrong to remove namespaces due to these failures as we may hang (see 205da2434301). One command that may fail is nvme_identify_ns_descs. Since we return success due to having ns identify descriptor list optional, we continue to compare ns identifiers in nvme_revalidate_disk, obviously fail and return -ENODEV to nvme_validate_ns, which will remove the namespace. Exactly what we don't want to happen. Fixes: 22802bf742c2 ("nvme: Namepace identification descriptor list is optional") Tested-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10nvme-pci: fix "slimmer CQ head update"Alexey Dobriyan1-1/+5
Pre-incrementing ->cq_head can't be done in memory because OOB value can be observed by another context. This devalues space savings compared to original code :-\ $ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-32 (-32) Function old new delta nvme_poll_irqdisable 464 456 -8 nvme_poll 455 447 -8 nvme_irq 388 380 -8 nvme_dev_disable 955 947 -8 But the code is minimal now: one read for head, one read for q_depth, one increment, one comparison, single instruction phase bit update and one write for new head. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: John Garry <john.garry@huawei.com> Tested-by: John Garry <john.garry@huawei.com> Fixes: e2a366a4b0feaeb ("nvme-pci: slimmer CQ head update") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10bdi: add a ->dev_name field to struct backing_dev_infoChristoph Hellwig2-2/+4
Cache a copy of the name for the life time of the backing_dev_info structure so that we can reference it even after unregistering. Fixes: 68f23b89067f ("memcg: fix a crash in wb_workfn when a device disappears") Reported-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-10bdi: use bdi_dev_name() to get device nameYufen Yu4-8/+10
Use the common interface bdi_dev_name() to get device name. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Add missing <linux/backing-dev.h> include BFQ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-07bdi: move bdi_dev_name out of lineChristoph Hellwig2-9/+10
bdi_dev_name is not a fast path function, move it out of line. This prepares for using it from modular callers without having to export an implementation detail like bdi_unknown_name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-07vboxsf: don't use the source name in the bdi nameChristoph Hellwig1-1/+1
Simplify the bdi name to mirror what we are doing elsewhere, and drop them name in favor of just using a number. This avoids a potentially very long bdi name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-05iocost: protect iocg->abs_vdebt with iocg->waitq.lockTejun Heo2-47/+77
abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup is and controls the activation of use_delay mechanism. Once a cgroup goes over budget from forced IOs, it has to pay it back with its future budget. The progress guarantee on debt paying comes from the iocg being active - active iocgs are processed by the periodic timer, which ensures that as time passes the debts dissipate and the iocg returns to normal operation. However, both iocg activation and vdebt handling are asynchronous and a sequence like the following may happen. 1. The iocg is in the process of being deactivated by the periodic timer. 2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns without anything because it still sees that the iocg is already active. 3. The iocg is deactivated. 4. The bio from #2 is over budget but needs to be forced. It increases abs_vdebt and goes over the threshold and enables use_delay. 5. IO control is enabled for the iocg's subtree and now IOs are attributed to the descendant cgroups and the iocg itself no longer issues IOs. This leaves the iocg with stuck abs_vdebt - it has debt but inactive and no further IOs which can activate it. This can end up unduly punishing all the descendants cgroups. The usual throttling path has the same issue - the iocg must be active while throttled to ensure that future event will wake it up - and solves the problem by synchronizing the throttling path with a spinlock. abs_vdebt handling is another form of overage handling and shares a lot of characteristics including the fact that it isn't in the hottest path. This patch fixes the above and other possible races by strictly synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Vlad Dmitriev <vvd@fb.com> Cc: stable@vger.kernel.org # v5.4+ Fixes: e1518f63f246 ("blk-iocost: Don't let merges push vtime into the future") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-04udf: stop using ioctl_by_bdevChristoph Hellwig1-16/+13
Instead just call the CDROM layer functionality directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-04isofs: stop using ioctl_by_bdevChristoph Hellwig1-28/+26
Instead just call the CDROM layer functionality directly, and turn the hot mess in isofs_get_last_session into remotely readable code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-04hfsplus: stop using ioctl_by_bdevChristoph Hellwig1-15/+18
Instead just call the CDROM layer functionality directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-04cdrom: factor out a cdrom_multisession helperChristoph Hellwig2-16/+27
Factor out a version of the CDROMMULTISESSION ioctl handler that can be called directly from kernel space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-04cdrom: factor out a cdrom_read_tocentry helperChristoph Hellwig2-17/+25
Factor out a version of the CDROMREADTOCENTRY ioctl handler that can be called directly from kernel space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-04ide-cd: rename cdrom_read_tocentryChristoph Hellwig1-7/+7
Give the cdrom_read_tocentry function and ide_ prefix to not conflict with the soon to be added generic function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-04block: add a cdrom_device_info pointer to struct gendiskChristoph Hellwig7-8/+18
Add a pointer to the CDROM information structure to struct gendisk. This will allow various removable media file systems to call directly into the CDROM layer instead of abusing ioctls with kernel pointers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-01iocost_monitor: drop string wrap around numbers when outputting jsonTejun Heo1-21/+21
Wrapping numbers in strings is used by some to work around bit-width issues in some enviroments. The problem isn't innate to json and the workaround seems to cause more integration problems than help. Let's drop the string wrapping. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-01iocost_monitor: exit successfully if interval is zeroTejun Heo1-1/+5
This is to help external tools to decide whether iocost_monitor has all its requirements met or not based on the exit status of an -i0 run. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>