summaryrefslogtreecommitdiff
path: root/include/linux/blk_types.h
AgeCommit message (Collapse)AuthorFilesLines
2023-06-30Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds1-2/+8
Pull SCSI updates from James Bottomley: "Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi, lpfc, qla2xxx). We have a couple of major core changes impacting other systems: - Command Duration Limits, which spills into block and ATA - block level Persistent Reservation Operations, which touches block, nvme, target and dm Both of these are added with merge commits containing a cover letter explaining what's going on" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (187 commits) scsi: core: Improve warning message in scsi_device_block() scsi: core: Replace scsi_target_block() with scsi_block_targets() scsi: core: Don't wait for quiesce in scsi_device_block() scsi: core: Don't wait for quiesce in scsi_stop_queue() scsi: core: Merge scsi_internal_device_block() and device_block() scsi: sg: Increase number of devices scsi: bsg: Increase number of devices scsi: qla2xxx: Remove unused nvme_ls_waitq wait queue scsi: ufs: ufs-pci: Add support for Intel Arrow Lake scsi: sd: sd_zbc: Use PAGE_SECTORS_SHIFT scsi: ufs: wb: Add explicit flush_threshold sysfs attribute scsi: ufs: ufs-qcom: Switch to the new ICE API scsi: ufs: dt-bindings: qcom: Add ICE phandle scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_RTC quirk scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_INTR quirk scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_RTC scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_INTR scsi: ufs: core: Remove dedicated hwq for dev command scsi: ufs: core: mcq: Fix the incorrect OCS value for the device command scsi: ufs: dt-bindings: samsung,exynos: Drop unneeded quotes ...
2023-06-16block: remove BIO_PAGE_REFFEDChristoph Hellwig1-1/+0
Now that all block direct I/O helpers use page pinning, this flag is unused. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230614140341.521331-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: introduce holder opsChristoph Hellwig1-0/+2
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and installed in the block_device for exclusive claims. It will be used to allow the block layer to call back into the user of the block device for thing like notification of a removed device or a device resize. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24block: Add BIO_PAGE_PINNED and associated infrastructureDavid Howells1-0/+1
Add BIO_PAGE_PINNED to indicate that the pages in a bio are pinned (FOLL_PIN) and that the pin will need removing. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Jan Kara <jack@suse.cz> cc: Matthew Wilcox <willy@infradead.org> cc: Logan Gunthorpe <logang@deltatee.com> cc: linux-block@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230522205744.2825689-5-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logicChristoph Hellwig1-1/+1
Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted meaning is only set when a page reference has been acquired that needs to be released by bio_release_pages(). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Jan Kara <jack@suse.cz> cc: Matthew Wilcox <willy@infradead.org> cc: Logan Gunthorpe <logang@deltatee.com> cc: linux-block@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230522205744.2825689-4-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-23Merge patch series "Add Command Duration Limits support"Martin K. Petersen1-0/+6
Niklas Cassel <nks@flawful.org> says: This series adds support for Command Duration Limits. The series is based on linux tag: v6.4-rc1 The series can also be found in git: https://github.com/floatious/linux/commits/cdl-v7 ================= CDL in ATA / SCSI ================= Command Duration Limits is defined in: T13 ATA Command Set - 5 (ACS-5) and T10 SCSI Primary Commands - 6 (SPC-6) respectively (a simpler version of CDL is defined in T10 SPC-5). CDL defines Duration Limits Descriptors (DLD). 7 DLDs for read commands and 7 DLDs for write commands. Simply put, a DLD contains a limit and a policy. A command can specify that a certain limit should be applied by setting the DLD index field (3 bits, so 0-7) in the command itself. The DLD index points to one of the 7 DLDs. DLD index 0 means no descriptor, so no limit. DLD index 1-7 means DLD 1-7. A DLD can have a few different policies, but the two major ones are: -Policy 0xF (abort), command will be completed with command aborted error (ATA) or status CHECK CONDITION (SCSI), with sense data indicating that the command timed out. -Policy 0xD (complete-unavailable), command will be completed without error (ATA) or status GOOD (SCSI), with sense data indicating that the command timed out. Note that the command will not have transferred any data to/from the device when the command timed out, even though the command returned success. Regardless of the CDL policy, in case of a CDL timeout, the I/O will result in a -ETIME error to user-space. The DLDs are defined in the CDL log page(s) and are readable and writable. Reading and writing the CDL DLDs are outside the scope of the kernel. If a user wants to read or write the descriptors, they can do so using a user-space application that sends passthrough commands, such as cdl-tools: https://github.com/westerndigitalcorporation/cdl-tools ================================ The introduction of ioprio hints ================================ What the kernel does provide, is a method to let I/O use one of the CDL DLDs defined in the device. Note that the kernel will simply forward the DLD index to the device, so the kernel currently does not know, nor does it need to know, how the DLDs are defined inside the device. The way that the CDL DLD index is supplied to the kernel is by introducing a new 10 bit "ioprio hint" field within the existing 16 bit ioprio definition. Currently, only 6 out of the 16 ioprio bits are in use, the remaining 10 bits are unused, and are currently explicitly disallowed to be set by the kernel. For now, we only add ioprio hints representing CDL DLD index 1-7. Additional ioprio hints for other QoS features could be defined in the future. A theoretical future work could be to make an I/O scheduler aware of these hints. E.g. for CDL, an I/O scheduler could make use of the duration limit in each descriptor, and take that information into account while scheduling commands. Right now, the ioprio hints will be ignored by the I/O schedulers. ============================== How to use CDL from user-space ============================== Since CDL is mutually exclusive with NCQ priority (see ncq_prio_enable and sas_ncq_prio_enable in Documentation/ABI/testing/sysfs-block-device), CDL has to be explicitly enabled using: echo 1 > /sys/block/$bdev/device/cdl_enable Since the ioprio hints are supplied through the existing I/O priority API, it should be simple for an application to make use of the ioprio hints. It simply has to reuse one of the new macros defined in include/uapi/linux/ioprio.h: IOPRIO_PRIO_HINT() or IOPRIO_PRIO_VALUE_HINT(), and supply one of the new hints defined in include/uapi/linux/ioprio.h: IOPRIO_HINT_DEV_DURATION_LIMIT_[1-7], which indicates that the I/O should use the corresponding CDL DLD index 1-7. By reusing the I/O priority API, the user can both define a DLD to use per AIO (io_uring sqe->ioprio or libaio iocb->aio_reqprio) or per-thread (ioprio_set()). ======= Testing ======= With the following fio patches: https://github.com/floatious/fio/commits/cdl fio adds support for ioprio hints, such that CDL can be tested using e.g.: fio --ioengine=io_uring --cmdprio_percentage=10 --cmdprio_hint=DLD_index A simple way to test is to use a DLD with a very short duration limit, and send large reads. Regardless of the CDL policy, in case of a CDL timeout, the I/O will result in a -ETIME error to user-space. We also provide a CDL test suite located in the cdl-tools repo, see: https://github.com/westerndigitalcorporation/cdl-tools#testing-a-system-command-duration-limits-support We have tested this patch series using: -real hardware -the following QEMU implementation: https://github.com/floatious/qemu/tree/cdl (NOTE: the QEMU implementation requires you to define the CDL policy at compile time, so you currently need to recompile QEMU when switching between policies.) =================== Further information =================== For further information about CDL, see Damien's slides: Presented at SDC 2021: https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-LeMoal-Be-On-Time-command-duration-limits-Feature-Support-in%20Linux.pdf Presented at Lund Linux Con 2022: https://drive.google.com/file/d/1I6ChFc0h4JY9qZdO1bY5oCAdYCSZVqWw/view?usp=sharing ================ Changes since V6 ================ -Rebased series on v6.4-rc1. -Picked up Reviewed-by tags from Hannes (Thank you Hannes!) -Picked up Reviewed-by tag from Christoph (Thank you Christoph!) -Changed KernelVersion from 6.4 to 6.5 for new sysfs attributes. For older change logs, see previous patch series versions: https://lore.kernel.org/linux-scsi/20230406113252.41211-1-nks@flawful.org/ https://lore.kernel.org/linux-scsi/20230404182428.715140-1-nks@flawful.org/ https://lore.kernel.org/linux-scsi/20230309215516.3800571-1-niklas.cassel@wdc.com/ https://lore.kernel.org/linux-scsi/20230124190308.127318-1-niklas.cassel@wdc.com/ https://lore.kernel.org/linux-scsi/20230112140412.667308-1-niklas.cassel@wdc.com/ https://lore.kernel.org/linux-scsi/20221208105947.2399894-1-niklas.cassel@wdc.com/ Link: https://lore.kernel.org/r/20230511011356.227789-1-nks@flawful.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-05-23scsi: block: Introduce BLK_STS_DURATION_LIMITDamien Le Moal1-0/+6
Introduce the new block I/O status BLK_STS_DURATION_LIMIT for LLDDs to report command that failed due to a command duration limit being exceeded. This new status is mapped to the ETIME error code to allow users to differentiate "soft" duration limit failures from other more serious hardware related errors. If we compare BLK_STS_DURATION_LIMIT with BLK_STS_TIMEOUT: -BLK_STS_DURATION_LIMIT means that the drive gave a reply indicating that the command duration limit was exceeded before the command could be completed. This I/O status is mapped to ETIME for user space. -BLK_STS_TIMEOUT means that the drive never gave a reply at all. This I/O status is mapped to ETIMEDOUT for user space. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Link: https://lore.kernel.org/r/20230511011356.227789-4-nks@flawful.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-05-22Merge patch series "Use block pr_ops in LIO"Martin K. Petersen1-2/+2
Mike Christie <michael.christie@oracle.com> says: The patches in this thread allow us to use the block pr_ops with LIO's target_core_iblock module to support cluster applications in VMs. They were built over Linus's tree. They also apply over linux-next and Martin's tree and Jens's trees. Currently, to use windows clustering or linux clustering (pacemaker + cluster labs scsi fence agents) in VMs with LIO and vhost-scsi, you have to use tcmu or pscsi or use a cluster aware FS/framework for the LIO pr file. Setting up a cluster FS/framework is pain and waste when your real backend device is already a distributed device, and pscsi and tcmu are nice for specific use cases, but iblock gives you the best performance and allows you to use stacked devices like dm-multipath. So these patches allow iblock to work like pscsi/tcmu where they can pass a PR command to the backend module. And then iblock will use the pr_ops to pass the PR command to the real devices similar to what we do for unmap today. The patches are separated in the following groups: Patch 1 - 2: - Add block layer callouts for reading reservations and rename reservation error code. Patch 3 - 5: - SCSI support for new callouts. Patch 6: - DM support for new callouts. Patch 7 - 13: - NVMe support for new callouts. Patch 14 - 18: - LIO support for new callouts. This patchset has been tested with the libiscsi PGR ops and with window's failover cluster verification test. Note that for scsi backend devices we need this patchset: https://lore.kernel.org/linux-scsi/20230123221046.125483-1-michael.christie@oracle.com/T/#m4834a643ffb5bac2529d65d40906d3cfbdd9b1b7 to handle UAs. To reduce the size of this patchset that's being done separately to make reviewing easier. And to make merging easier this patchset and the one above do not have any conflicts so can be merged in different trees. Link: https://lore.kernel.org/r/20230407200551.12660-1-michael.christie@oracle.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-04-26Merge tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linuxLinus Torvalds1-8/+13
Pull block updates from Jens Axboe: - drbd patches, bringing us closer to unifying the out-of-tree version and the in tree one (Andreas, Christoph) - support for auto-quiesce for the s390 dasd driver (Stefan) - MD pull request via Song: - md/bitmap: Optimal last page size (Jon Derrick) - Various raid10 fixes (Yu Kuai, Li Nan) - md: add error_handlers for raid0 and linear (Mariusz Tkaczyk) - NVMe pull request via Christoph: - Drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas) - Validate nvmet module parameters (Chaitanya Kulkarni) - Fence TCP socket on receive error (Chris Leech) - Fix async event trace event (Keith Busch) - Minor cleanups (Chaitanya Kulkarni, zhenwei pi) - Fix and cleanup nvmet Identify handling (Damien Le Moal, Christoph Hellwig) - Fix double blk_mq_complete_request race in the timeout handler (Lei Yin) - Fix irq locking in nvme-fcloop (Ming Lei) - Remove queue mapping helper for rdma devices (Sagi Grimberg) - use structured request attribute checks for nbd (Jakub) - fix blk-crypto race conditions between keyslot management (Eric) - add sed-opal support for reading read locking range attributes (Ondrej) - make fault injection configurable for null_blk (Akinobu) - clean up the request insertion API (Christoph) - clean up the queue running API (Christoph) - blkg config helper cleanups (Tejun) - lazy init support for blk-iolatency (Tejun) - various fixes and tweaks to ublk (Ming) - remove hybrid polling. It hasn't really been useful since we got async polled IO support, and these days we don't support sync polled IO at all (Keith) - misc fixes, cleanups, improvements (Zhong, Ondrej, Colin, Chengming, Chaitanya, me) * tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux: (118 commits) nbd: fix incomplete validation of ioctl arg ublk: don't return 0 in case of any failure sed-opal: geometry feature reporting command null_blk: Always check queue mode setting from configfs block: ublk: switch to ioctl command encoding blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush block, bfq: Fix division by zero error on zero wsum fault-inject: fix build error when FAULT_INJECTION_CONFIGFS=y and CONFIGFS_FS=m block: store bdev->bd_disk->fops->submit_bio state in bdev block: re-arrange the struct block_device fields for better layout md/raid5: remove unused working_disks variable md/raid10: don't call bio_start_io_acct twice for bio which experienced read error md/raid10: fix memleak of md thread md/raid10: fix memleak for 'conf->bio_split' md/raid10: fix leak of 'r10bio->remaining' for recovery md/raid10: don't BUG_ON() in raise_barrier() md: fix soft lockup in status_resync md: add error_handlers for raid0 and linear md: Use optimal I/O size for last bitmap page md: Fix types in sb writer ...
2023-04-17btrfs, block: move REQ_CGROUP_PUNT to btrfsChristoph Hellwig1-13/+5
REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds a branch to the bio submission hot path. To fix this, export blkcg_punt_bio_submit and let btrfs call it directly. Add a new REQ_FS_PRIVATE flag for btrfs to indicate to it's own low-level bio submission code that a punt to the cgroup submission helper is required. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-16block: store bdev->bd_disk->fops->submit_bio state in bdevJens Axboe1-0/+1
We have a long chain of memory dereferencing just to whether or not this disk has a special submit_bio helper. As that's not necessarily the common case, add a bd_has_submit_bio state in the bdev to avoid traversing this memory dependency chain if we don't need to. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-16block: re-arrange the struct block_device fields for better layoutJens Axboe1-8/+12
This moves struct device out-of-line as it's just used at open/close time, so we can keep some of the commonly used fields closer together. On a standard setup, it also reduces the size from 864 bytes to 848 bytes. Yes, struct device is a pig... Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-12block: Rename BLK_STS_NEXUS to BLK_STS_RESV_CONFLICTMike Christie1-2/+2
BLK_STS_NEXUS is used for NVMe/SCSI reservation conflicts and DASD's locking feature which works similar to NVMe/SCSI reservations where a host can get a lock on a device and when the lock is taken it will get failures. This patch renames BLK_STS_NEXUS so it better reflects this type of use. Signed-off-by: Mike Christie <michael.christie@oracle.com> Link: https://lore.kernel.org/r/20230407200551.12660-3-michael.christie@oracle.com Acked-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-12-07block: remove bio_set_op_attrsChristoph Hellwig1-7/+0
This macro is obsolete, so replace the last few uses with open coded bi_opf assignments. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@suse.de <mailto:colyli@suse.de>> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20221206144057.720846-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20block: remove PSI accounting from the bio layerChristoph Hellwig1-1/+0
PSI accounting is now done by the VM code, where it should have been since the beginning. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220915094200.139713-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-12blk-throttle: fix that io throttle can only work for single bioYu Kuai1-1/+1
Test scripts: cd /sys/fs/cgroup/blkio/ echo "8:0 1024" > blkio.throttle.write_bps_device echo $$ > cgroup.procs dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct & dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct & Test result: 10240 bytes (10 kB, 10 KiB) copied, 10.0134 s, 1.0 kB/s 10240 bytes (10 kB, 10 KiB) copied, 10.0135 s, 1.0 kB/s The problem is that the second bio is finished after 10s instead of 20s. Root cause: 1) second bio will be flagged: __blk_throtl_bio while (true) { ... if (sq->nr_queued[rw]) -> some bio is throttled already break }; bio_set_flag(bio, BIO_THROTTLED); -> flag the bio 2) flagged bio will be dispatched without waiting: throtl_dispatch_tg tg_may_dispatch tg_with_in_bps_limit if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED)) *wait = 0; -> wait time is zero return true; commit 9f5ede3c01f9 ("block: throttle split bio in case of iops limit") support to count split bios for iops limit, thus it adds flagged bio checking in tg_with_in_bps_limit() so that split bios will only count once for bps limit, however, it introduce a new problem that io throttle won't work if multiple bios are throttled. In order to fix the problem, handle iops/bps limit in different ways: 1) for iops limit, there is no flag to record if the bio is throttled, and iops is always applied. 2) for bps limit, original bio will be flagged with BIO_BPS_THROTTLED, and io throttle will ignore bio with the flag. Noted this patch also remove the code to set flag in __bio_clone(), it's introduced in commit 111be8839817 ("block-throttle: avoid double charge"), and author thinks split bio can be resubmited and throttled again, which is wrong because split bio will continue to dispatch from caller. Fixes: 9f5ede3c01f9 ("block: throttle split bio in case of iops limit") Cc: <stable@vger.kernel.org> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220829022240.3348319-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14block: Introduce the type blk_opf_tBart Van Assche1-46/+51
Introduce the type blk_opf_t for the request operation and flags (REQ_OP_* and REQ_*). This type will be used to improve documentation of the block layer code and also to allow sparse to verify whether request flags are used correctly. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-6-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14block: Change the type of req_op() and bio_op() into enum req_opBart Van Assche1-2/+4
Improve static type checking by changing the type of the value returned by req_op() and bio_op() from unsigned int into enum req_op. Insert 'default: break;' in switch statements on the enum req_op type to prevent that the compiler warns about these switch statements. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Tim Waugh <tim@cyberelk.net> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-5-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14block: Use enum req_op where appropriateBart Van Assche1-1/+1
Change the type of the arguments that are used to pass a REQ_OP_* value from int or unsigned int into enum req_op to improve static type checking. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14treewide: Rename enum req_opf into enum req_opBart Van Assche1-8/+8
The type name enum req_opf is misleading since it suggests that values of this type include both an operation type and flags. Since values of this type represent an operation only, change the type name into enum req_op. Convert the enum req_op documentation into kernel-doc format. Move a few definitions such that the enum req_op documentation occurs just above the enum req_op definition. The name "req_opf" was introduced by commit ef295ecf090d ("block: better op and flags encoding"). Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-03Merge tag 'for-5.19/block-2022-06-02' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+4
Pull block fixes from Jens Axboe: "Just a collection of fixes that have been queued up since the initial merge window pull request, the majority of which are targeted for stable as well. One bio_set fix that fixes an issue with the dm adoption of cached bio structs that got introduced in this merge window" * tag 'for-5.19/block-2022-06-02' of git://git.kernel.dk/linux-block: block: Fix potential deadlock in blk_ia_range_sysfs_show() block: fix bio_clone_blkg_association() to associate with proper blkcg_gq block: remove useless BUG_ON() in blk_mq_put_tag() blk-mq: do not update io_ticks with passthrough requests block: make bioset_exit() fully resilient against being called twice block: use bio_queue_enter instead of blk_queue_enter in bio_poll block: document BLK_STS_AGAIN usage block: take destination bvec offsets into account in bio_copy_data_iter blk-iolatency: Fix inflight count imbalances and IO hangs on offline blk-mq: don't touch ->tagset in blk_mq_get_sq_hctx
2022-05-28block: document BLK_STS_AGAIN usageHannes Reinecke1-0/+4
BLK_STS_AGAIN should only be used if RQF_NOWAIT is set and the bio would block. So we'd better document that to avoid accidental misuse. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220524055631.85480-2-hare@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-24Merge tag 'for-5.19/drivers-2022-05-22' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull block driver updates from Jens Axboe: "Here are the driver updates queued up for 5.19. This contains: - NVMe pull requests via Christoph: - tighten the PCI presence check (Stefan Roese) - fix a potential NULL pointer dereference in an error path (Kyle Miller Smith) - fix interpretation of the DMRSL field (Tom Yan) - relax the data transfer alignment (Keith Busch) - verbose error logging improvements (Max Gurtovoy, Chaitanya Kulkarni) - misc cleanups (Chaitanya Kulkarni, Christoph) - set non-mdts limits in nvme_scan_work (Chaitanya Kulkarni) - add support for TP4084 - Time-to-Ready Enhancements (Christoph) - MD pull request via Song: - Improve annotation in raid5 code, by Logan Gunthorpe - Support MD_BROKEN flag in raid-1/5/10, by Mariusz Tkaczyk - Other small fixes/cleanups - null_blk series making the configfs side much saner (Damien) - Various minor drbd cleanups and fixes (Haowen, Uladzislau, Jiapeng, Arnd, Cai) - Avoid using the system workqueue (and hence flushing it) in rnbd (Jack) - Avoid using the system workqueue (and hence flushing it) in aoe (Tetsuo) - Series fixing discard_alignment issues in drivers (Christoph) - Small series fixing drivers poking at disk->part0 for openers information (Christoph) - Series fixing deadlocks in loop (Christoph, Tetsuo) - Remove loop.h and add SPDX headers (Christoph) - Various fixes and cleanups (Julia, Xie, Yu)" * tag 'for-5.19/drivers-2022-05-22' of git://git.kernel.dk/linux-block: (72 commits) mtip32xx: fix typo in comment nvme: set non-mdts limits in nvme_scan_work nvme: add support for TP4084 - Time-to-Ready Enhancements nvme: split the enum used for various register constants nbd: Fix hung on disconnect request if socket is closed before nvme-fabrics: add a request timeout helper nvme-pci: harden drive presence detect in nvme_dev_disable() nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags nvme: mark internal passthru request RQF_QUIET nvme: remove unneeded include from constants file nvme: add missing status values to verbose logging nvme: set dma alignment to dword nvme: fix interpretation of DMRSL loop: remove most the top-of-file boilerplate comment from the UAPI header loop: remove most the top-of-file boilerplate comment loop: add a SPDX header loop: remove loop.h block: null_blk: Improve device creation with configfs block: null_blk: Cleanup messages block: null_blk: Cleanup device creation and deletion ...
2022-05-12block: Fix the bio.bi_opf commentBart Van Assche1-3/+2
Commit ef295ecf090d modified the Linux kernel such that the bottom bits of the bi_opf member contain the operation instead of the topmost bits. That commit did not update the comment next to bi_opf. Hence this patch. From commit ef295ecf090d: -#define bio_op(bio) ((bio)->bi_opf >> BIO_OP_SHIFT) +#define bio_op(bio) ((bio)->bi_opf & REQ_OP_MASK) Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Fixes: ef295ecf090d ("block: better op and flags encoding") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220511235152.1082246-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-12block: reorder the REQ_ flagsChristoph Hellwig1-7/+8
Keep the op-specific flag last so that they are clearly separate from the generic flags. Various recent commits just kept adding new flags at the end. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220512061408.1826595-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-18block: turn bdev->bd_openers into an atomic_tChristoph Hellwig1-1/+1
All manipulation of bd_openers is under disk->open_mutex and will remain so for the foreseeable future. But at least one place reads it without the lock (blkdev_get) and there are more to be added. So make sure the compiler does not do turn the increments and decrements into non-atomic sequences by using an atomic_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220330052917.2566582-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-18block: allow using the per-cpu bio cache from bio_alloc_biosetMike Snitzer1-1/+2
Replace the BIO_PERCPU_CACHE bio-internal flag with a REQ_ALLOC_CACHE one that can be passed to bio_alloc / bio_alloc_bioset, and implement the percpu cache allocation logic in a helper called from bio_alloc_bioset. This allows any bio_alloc_bioset user to use the percpu caches instead of having the functionality tied to struct kiocb. Signed-off-by: Mike Snitzer <snitzer@kernel.org> [hch: refactored a bit] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220324203526.62306-2-snitzer@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-01dm: fix dm_io and dm_target_io flags race condition on AlphaMikulas Patocka1-0/+2
Early alpha processors cannot write a single byte or short; they read 8 bytes, modify the value in registers and write back 8 bytes. This could cause race condition in the structure dm_io - if the fields flags and io_count are modified simultaneously. Fix this bug by using 32-bit flags if we are on Alpha and if we are compiling for a processor that doesn't have the byte-word-extension. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: bd4a6dd241ae ("dm: reduce size of dm_io and dm_target_io structs") [snitzer: Jens allowed this change since Mikulas owns a relevant Alpha!] Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-03-26Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+0
Pull NVMe write streams removal from Jens Axboe: "This removes the write streams support in NVMe. No vendor ever really shipped working support for this, and they are not interested in supporting it. With the NVMe support gone, we have nothing in the tree that supports this. Remove passing around of the hints. The only discussion point in this patchset imho is the fact that the file specific write hint setting/getting fcntl helpers will now return -1/EINVAL like they did before we supported write hints. No known applications use these functions, I only know of one prototype that I help do for RocksDB, and that's not used. That said, with a change like this, it's always a bit controversial. Alternatively, we could just make them return 0 and pretend it worked. It's placement based hints after all" * tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block: fs: remove fs.f_write_hint fs: remove kiocb.ki_hint block: remove the per-bio/request write hint nvme: remove support or stream based temperature hint
2022-03-25Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds1-2/+0
Pull SCSI updates from James Bottomley: "This series consists of the usual driver updates (qla2xxx, pm8001, libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates and bug fixes. The high blast radius core update is the removal of write same, which affects block and several non-SCSI devices. The other big change, which is more local, is the removal of the SCSI pointer" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (281 commits) scsi: scsi_ioctl: Drop needless assignment in sg_io() scsi: bsg: Drop needless assignment in scsi_bsg_sg_io_fn() scsi: lpfc: Copyright updates for 14.2.0.0 patches scsi: lpfc: Update lpfc version to 14.2.0.0 scsi: lpfc: SLI path split: Refactor BSG paths scsi: lpfc: SLI path split: Refactor Abort paths scsi: lpfc: SLI path split: Refactor SCSI paths scsi: lpfc: SLI path split: Refactor CT paths scsi: lpfc: SLI path split: Refactor misc ELS paths scsi: lpfc: SLI path split: Refactor VMID paths scsi: lpfc: SLI path split: Refactor FDISC paths scsi: lpfc: SLI path split: Refactor LS_RJT paths scsi: lpfc: SLI path split: Refactor LS_ACC paths scsi: lpfc: SLI path split: Refactor the RSCN/SCR/RDF/EDC/FARPR paths scsi: lpfc: SLI path split: Refactor PLOGI/PRLI/ADISC/LOGO paths scsi: lpfc: SLI path split: Refactor base ELS paths and the FLOGI path scsi: lpfc: SLI path split: Introduce lpfc_prep_wqe scsi: lpfc: SLI path split: Refactor fast and slow paths to native SLI4 scsi: lpfc: SLI path split: Refactor lpfc_iocbq scsi: lpfc: Use kcalloc() ...
2022-03-14block: fix rq-qos breakage from skipping rq_qos_done_bio()Tejun Heo1-1/+2
a647a524a467 ("block: don't call rq_qos_ops->done_bio if the bio isn't tracked") made bio_endio() skip rq_qos_done_bio() if BIO_TRACKED is not set. While this fixed a potential oops, it also broke blk-iocost by skipping the done_bio callback for merged bios. Before, whether a bio goes through rq_qos_throttle() or rq_qos_merge(), rq_qos_done_bio() would be called on the bio on completion with BIO_TRACKED distinguishing the former from the latter. rq_qos_done_bio() is not called for bios which wenth through rq_qos_merge(). This royally confuses blk-iocost as the merged bios never finish and are considered perpetually in-flight. One reliably reproducible failure mode is an intermediate cgroup geting stuck active preventing its children from being activated due to the leaf-only rule, leading to loss of control. The following is from resctl-bench protection scenario which emulates isolating a web server like workload from a memory bomb run on an iocost configuration which should yield a reasonable level of protection. # cat /sys/block/nvme2n1/device/model Samsung SSD 970 PRO 512GB # cat /sys/fs/cgroup/io.cost.model 259:0 ctrl=user model=linear rbps=834913556 rseqiops=93622 rrandiops=102913 wbps=618985353 wseqiops=72325 wrandiops=71025 # cat /sys/fs/cgroup/io.cost.qos 259:0 enable=1 ctrl=user rpct=95.00 rlat=18776 wpct=95.00 wlat=8897 min=60.00 max=100.00 # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1 ... Memory Hog Summary ================== IO Latency: R p50=242u:336u/2.5m p90=794u:1.4m/7.5m p99=2.7m:8.0m/62.5m max=8.0m:36.4m/350m W p50=221u:323u/1.5m p90=709u:1.2m/5.5m p99=1.5m:2.5m/9.5m max=6.9m:35.9m/350m Isolation and Request Latency Impact Distributions: min p01 p05 p10 p25 p50 p75 p90 p95 p99 max mean stdev isol% 15.90 15.90 15.90 40.05 57.24 59.07 60.01 74.63 74.63 90.35 90.35 58.12 15.82 lat-imp% 0 0 0 0 0 4.55 14.68 15.54 233.5 548.1 548.1 53.88 143.6 Result: isol=58.12:15.82% lat_imp=53.88%:143.6 work_csv=100.0% missing=3.96% The isolation result of 58.12% is close to what this device would show without any IO control. Fix it by introducing a new flag BIO_QOS_MERGED to mark merged bios and calling rq_qos_done_bio() on them too. For consistency and clarity, rename BIO_TRACKED to BIO_QOS_THROTTLED. The flag checks are moved into rq_qos_done_bio() so that it's next to the code paths that set the flags. With the patch applied, the above same benchmark shows: # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1 ... Memory Hog Summary ================== IO Latency: R p50=123u:84.4u/985u p90=322u:256u/2.5m p99=1.6m:1.4m/9.5m max=11.1m:36.0m/350m W p50=429u:274u/995u p90=1.7m:1.3m/4.5m p99=3.4m:2.7m/11.5m max=7.9m:5.9m/26.5m Isolation and Request Latency Impact Distributions: min p01 p05 p10 p25 p50 p75 p90 p95 p99 max mean stdev isol% 84.91 84.91 89.51 90.73 92.31 94.49 96.36 98.04 98.71 100.0 100.0 94.42 2.81 lat-imp% 0 0 0 0 0 2.81 5.73 11.11 13.92 17.53 22.61 4.10 4.68 Result: isol=94.42:2.81% lat_imp=4.10%:4.68 work_csv=58.34% missing=0% Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: a647a524a467 ("block: don't call rq_qos_ops->done_bio if the bio isn't tracked") Cc: stable@vger.kernel.org # v5.15+ Cc: Ming Lei <ming.lei@redhat.com> Cc: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/Yi7rdrzQEHjJLGKB@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-07block: remove the per-bio/request write hintChristoph Hellwig1-1/+0
With the NVMe support for this gone, there are no consumers of these hints left, so remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-23scsi: block: Remove REQ_OP_WRITE_SAME supportChristoph Hellwig1-2/+0
No more users of REQ_OP_WRITE_SAME or drivers implementing it are left, so remove the infrastructure. [mkp: fold in and tweak sysfs reporting fix] Link: https://lore.kernel.org/r/20220209082828.2629273-8-hch@lst.de Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-04block: introduce BLK_STS_OFFLINESong Liu1-0/+7
Currently, drivers reports BLK_STS_IOERR for devices that are not full online or being removed. This behavior could cause confusion for users, as they are not really I/O errors from the device. Solve this issue with a new state BLK_STS_OFFLINE, which reports "device offline error" in dmesg instead of "I/O error". EIO is intentionally kept to not change user visible return value. Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220203192827.1370270-2-song@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: cache inode size in bdevJens Axboe1-0/+1
Reading the inode size brings in a new cacheline for IO submit, and it's in the hot path being checked for every single IO. When doing millions of IOs per core per second, this is noticeable overhead. Cache the nr_sectors in the bdev itself. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: move the SECTOR_SIZE related definitions to blk_types.hChristoph Hellwig1-0/+17
Ensure these are always available for inlines in the various block layer headers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20211018101130.1838532-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: cache request queue in bdevPavel Begunkov1-0/+1
There are tons of places where we need to get a request_queue only having bdev, which turns into bdev->bd_disk->queue. There are probably a hundred of such places considering inline helpers, and enough of them are in hot paths. Cache queue pointer in struct block_device and make use of it in bdev_get_queue(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a3bfaecdd28956f03629d0ca5c63ebc096e1c809.1634219547.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: switch polling to be bio basedChristoph Hellwig1-7/+5
Replace the blk_poll interface that requires the caller to keep a queue and cookie from the submissions with polling based on the bio. Polling for the bio itself leads to a few advantages: - the cookie construction can made entirely private in blk-mq.c - the caller does not need to remember the request_queue and cookie separately and thus sidesteps their lifetime issues - keeping the device and the cookie inside the bio allows to trivially support polling BIOs remapping by stacking drivers - a lot of code to propagate the cookie back up the submission path can be removed entirely. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: rename REQ_HIPRI to REQ_POLLEDChristoph Hellwig1-2/+2
Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague, the bio flag is specific to the polling implementation, so rename and document it properly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18blk-mq: remove blk_qc_t_validChristoph Hellwig1-5/+0
Move the trivial check into the only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18blk-mq: remove blk_qc_t_to_tag and blk_qc_t_is_internalChristoph Hellwig1-10/+0
Merge both functions into their only caller to keep the blk-mq tag to blk_qc_t mapping as private as possible in blk-mq.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18blk-mq: factor out a blk_qc_to_hctx helperChristoph Hellwig1-5/+0
Add a helper to get the hctx from a request_queue and cookie, and fold the blk_qc_t_to_queue_num helper into it as no other callers are left. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: move struct request to blk-mq.hChristoph Hellwig1-2/+0
struct request is only used by blk-mq drivers, so move it and all related declarations to blk-mq.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-31Merge tag 'for-5.15-tag' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The highlights of this round are integrations with fs-verity and idmapped mounts, the rest is usual mix of minor improvements, speedups and cleanups. There are some patches outside of btrfs, namely updating some VFS interfaces, all straightforward and acked. Features: - fs-verity support, using standard ioctls, backward compatible with read-only limitation on inodes with previously enabled fs-verity - idmapped mount support - make mount with rescue=ibadroots more tolerant to partially damaged trees - allow raid0 on a single device and raid10 on two devices, degenerate cases but might be useful as an intermediate step during conversion to other profiles - zoned mode block group auto reclaim can be disabled via sysfs knob Performance improvements: - continue readahead of node siblings even if target node is in memory, could speed up full send (on sample test +11%) - batching of delayed items can speed up creating many files - fsync/tree-log speedups - avoid unnecessary work (gains +2% throughput, -2% run time on sample load) - reduced lock contention on renames (on dbench +4% throughput, up to -30% latency) Fixes: - various zoned mode fixes - preemptive flushing threshold tuning, avoid excessive work on almost full filesystems Core: - continued subpage support, preparation for implementing remaining features like compression and defragmentation; with some limitations, write is now enabled on 64K page systems with 4K sectors, still considered experimental - no readahead on compressed reads - inline extents disabled - disabled raid56 profile conversion and mount - improved flushing logic, fixing early ENOSPC on some workloads - inode flags have been internally split to read-only and read-write incompat bit parts, used by fs-verity - new tree items for fs-verity - descriptor item - Merkle tree item - inode operations extended to be namespace-aware - cleanups and refactoring Generic code changes: - fs: new export filemap_fdatawrite_wbc - fs: removed sync_inode - block: bio_trim argument type fixups - vfs: add namespace-aware lookup" * tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits) btrfs: reset replace target device to allocation state on close btrfs: zoned: fix ordered extent boundary calculation btrfs: do not do preemptive flushing if the majority is global rsv btrfs: reduce the preemptive flushing threshold to 90% btrfs: tree-log: check btrfs_lookup_data_extent return value btrfs: avoid unnecessarily logging directories that had no changes btrfs: allow idmapped mount btrfs: handle ACLs on idmapped mounts btrfs: allow idmapped INO_LOOKUP_USER ioctl btrfs: allow idmapped SUBVOL_SETFLAGS ioctl btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids btrfs: allow idmapped SNAP_DESTROY ioctls btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls btrfs: check whether fsgid/fsuid are mapped during subvolume creation btrfs: allow idmapped permission inode op btrfs: allow idmapped setattr inode op btrfs: allow idmapped tmpfile inode op btrfs: allow idmapped symlink inode op btrfs: allow idmapped mkdir inode op ...
2021-08-31Merge tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+1
Pull support for struct bio recycling from Jens Axboe: "This adds bio recycling support for polled IO, allowing quick reuse of a bio for high IOPS scenarios via a percpu bio_set list. It's good for almost a 10% improvement in performance, bumping our per-core IO limit from ~3.2M IOPS to ~3.5M IOPS" * tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block: bio: improve kerneldoc documentation for bio_alloc_kiocb() block: provide bio_clear_hipri() helper block: use the percpu bio cache in __blkdev_direct_IO io_uring: enable use of bio alloc cache block: clear BIO_PERCPU_CACHE flag if polling isn't supported bio: add allocation cache abstraction fs: add kiocb alloc cache flag bio: optimize initialization of a bio
2021-08-23bio: add allocation cache abstractionJens Axboe1-0/+1
Add a per-cpu bio_set cache for bio allocations, enabling us to quickly recycle them instead of going through the slab allocator. This cache isn't IRQ safe, and hence is only really suitable for polled IO. Very simple - keeps a count of bio's in the cache, and maintains a max of 512 with a slack of 64. If we get above max + slack, we drop slack number of bio's. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23block: fix argument type of bio_trim()Chaitanya Kulkarni1-0/+1
The function bio_trim has offset and size arguments that are declared as int. The callers of this function use sector_t type when passing the offset and size, e.g. drivers/md/raid1.c:narrow_write_error() and drivers/md/raid1.c:narrow_write_error(). Change offset and size arguments to sector_t type for bio_trim(). Also, add WARN_ON_ONCE() to catch their overflow. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-09block: remove the bd_bdi in struct block_deviceChristoph Hellwig1-1/+0
Just retrieve the bdi from the disk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210809141744.1203023-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09block: look up holders by bdevChristoph Hellwig1-3/+0
Invert they way the holder relations are tracked. This very slightly reduces the memory overhead for partitioned devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210804094147.459763-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09block: make the block holder code optionalChristoph Hellwig1-1/+1
Move the block holder code into a separate file as it is not in any way related to the other block_dev.c code, and add a new selectable config option for it so that we don't have to build it without any remapped drivers selected. The Kconfig symbol contains a _DEPRECATED suffix to match the comments added in commit 49731baa41df ("block: restore multiple bd_link_disk_holder() support"). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20210804094147.459763-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>