summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2023-11-03Merge tag 'mm-stable-2023-11-01-14-33' of ↵Linus Torvalds8-49/+69
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2023-11-02Merge tag 'sysctl-6.7-rc1' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "To help make the move of sysctls out of kernel/sysctl.c not incur a size penalty sysctl has been changed to allow us to not require the sentinel, the final empty element on the sysctl array. Joel Granados has been doing all this work. On the v6.6 kernel we got the major infrastructure changes required to support this. For v6.7-rc1 we have all arch/ and drivers/ modified to remove the sentinel. Both arch and driver changes have been on linux-next for a bit less than a month. It is worth re-iterating the value: - this helps reduce the overall build time size of the kernel and run time memory consumed by the kernel by about ~64 bytes per array - the extra 64-byte penalty is no longer inncurred now when we move sysctls out from kernel/sysctl.c to their own files For v6.8-rc1 expect removal of all the sentinels and also then the unneeded check for procname == NULL. The last two patches are fixes recently merged by Krister Johansen which allow us again to use softlockup_panic early on boot. This used to work but the alias work broke it. This is useful for folks who want to detect softlockups super early rather than wait and spend money on cloud solutions with nothing but an eventual hung kernel. Although this hadn't gone through linux-next it's also a stable fix, so we might as well roll through the fixes now" * tag 'sysctl-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (23 commits) watchdog: move softlockup_panic back to early_param proc: sysctl: prevent aliased sysctls from getting passed to init intel drm: Remove now superfluous sentinel element from ctl_table array Drivers: hv: Remove now superfluous sentinel element from ctl_table array raid: Remove now superfluous sentinel element from ctl_table array fw loader: Remove the now superfluous sentinel element from ctl_table array sgi-xp: Remove the now superfluous sentinel element from ctl_table array vrf: Remove the now superfluous sentinel element from ctl_table array char-misc: Remove the now superfluous sentinel element from ctl_table array infiniband: Remove the now superfluous sentinel element from ctl_table array macintosh: Remove the now superfluous sentinel element from ctl_table array parport: Remove the now superfluous sentinel element from ctl_table array scsi: Remove now superfluous sentinel element from ctl_table array tty: Remove now superfluous sentinel element from ctl_table array xen: Remove now superfluous sentinel element from ctl_table array hpet: Remove now superfluous sentinel element from ctl_table array c-sky: Remove now superfluous sentinel element from ctl_talbe array powerpc: Remove now superfluous sentinel element from ctl_table arrays riscv: Remove now superfluous sentinel element from ctl_table array x86/vdso: Remove now superfluous sentinel element from ctl_table array ...
2023-11-02Merge tag 'for-6.7/dm-changes' of ↵Linus Torvalds12-107/+320
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Update DM core to directly call the map function for both the linear and stripe targets; which are provided by DM core - Various updates to use new safer string functions - Update DM core to respect REQ_NOWAIT flag in normal bios so that memory allocations are always attempted with GFP_NOWAIT - Add Mikulas Patocka to MAINTAINERS as a DM maintainer! - Improve DM delay target's handling of short delays (< 50ms) by using a kthread to check expiration of IOs rather than timers and a wq - Update the DM error target so that it works with zoned storage. This helps xfstests to provide proper IO error handling coverage when testing a filesystem with native zoned storage support - Update both DM crypt and integrity targets to improve performance by using crypto_shash_digest() rather than init+update+final sequence - Fix DM crypt target by backfilling missing memory allocation accounting for compound pages * tag 'for-6.7/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm crypt: account large pages in cc->n_allocated_pages dm integrity: use crypto_shash_digest() in sb_mac() dm crypt: use crypto_shash_digest() in crypt_iv_tcw_whitening() dm error: Add support for zoned block devices dm delay: for short delays, use kthread instead of timers and wq MAINTAINERS: add Mikulas Patocka as a DM maintainer dm: respect REQ_NOWAIT flag in normal bios issued to DM dm: enhance alloc_multiple_bios() to be more versatile dm: make __send_duplicate_bios return unsigned int dm log userspace: replace deprecated strncpy with strscpy dm ioctl: replace deprecated strncpy with strscpy_pad dm crypt: replace open-coded kmemdup_nul dm cache metadata: replace deprecated strncpy with strscpy dm: shortcut the calls to linear_map and stripe_map
2023-11-02Merge tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linuxLinus Torvalds12-573/+622
Pull block updates from Jens Axboe: - Improvements to the queue_rqs() support, and adding null_blk support for that as well (Chengming) - Series improving badblocks support (Coly) - Key store support for sed-opal (Greg) - IBM partition string handling improvements (Jan) - Make number of ublk devices supported configurable (Mike) - Cancelation improvements for ublk (Ming) - MD pull requests via Song: - Handle timeout in md-cluster, by Denis Plotnikov - Cleanup pers->prepare_suspend, by Yu Kuai - Rewrite mddev_suspend(), by Yu Kuai - Simplify md_seq_ops, by Yu Kuai - Reduce unnecessary locking array_state_store(), by Mariusz Tkaczyk - Make rdev add/remove independent from daemon thread, by Yu Kuai - Refactor code around quiesce() and mddev_suspend(), by Yu Kuai - NVMe pull request via Keith: - nvme-auth updates (Mark) - nvme-tcp tls (Hannes) - nvme-fc annotaions (Kees) - Misc cleanups and improvements (Jiapeng, Joel) * tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux: (95 commits) block: ublk_drv: Remove unused function md: cleanup pers->prepare_suspend() nvme-auth: allow mixing of secret and hash lengths nvme-auth: use transformed key size to create resp nvme-auth: alloc nvme_dhchap_key as single buffer nvmet-tcp: use 'spin_lock_bh' for state_lock() powerpc/pseries: PLPKS SED Opal keystore support block: sed-opal: keystore access for SED Opal keys block:sed-opal: SED Opal keystore ublk: simplify aborting request ublk: replace monitor with cancelable uring_cmd ublk: quiesce request queue when aborting queue ublk: rename mm_lock as lock ublk: move ublk_cancel_dev() out of ub->mutex ublk: make sure io cmd handled in submitter task context ublk: don't get ublk device reference in ublk_abort_queue() ublk: Make ublks_max configurable ublk: Limit dev_id/ub_number values md-cluster: check for timeout while a new disk adding nvme: rework NVME_AUTH Kconfig selection ...
2023-10-31dm crypt: account large pages in cc->n_allocated_pagesMikulas Patocka1-3/+12
The commit 5054e778fcd9c ("dm crypt: allocate compound pages if possible") changed dm-crypt to use compound pages to improve performance. Unfortunately, there was an oversight: the allocation of compound pages was not accounted at all. Normal pages are accounted in a percpu counter cc->n_allocated_pages and dm-crypt is limited to allocate at most 2% of memory. Because compound pages were not accounted at all, dm-crypt could allocate memory over the 2% limit. Fix this by adding the accounting of compound pages, so that memory consumption of dm-crypt is properly limited. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 5054e778fcd9c ("dm crypt: allocate compound pages if possible") Cc: stable@vger.kernel.org # v6.5+ Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-31dm integrity: use crypto_shash_digest() in sb_mac()Eric Biggers1-20/+10
Simplify sb_mac() by using crypto_shash_digest() instead of an init+update+final sequence. This should also improve performance. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-31dm crypt: use crypto_shash_digest() in crypt_iv_tcw_whitening()Eric Biggers1-7/+1
Simplify crypt_iv_tcw_whitening() by using crypto_shash_digest() instead of an init+update+final sequence. This should also improve performance. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-31dm error: Add support for zoned block devicesDamien Le Moal2-4/+125
dm-error is used in several test cases in the xfstests test suite to check the handling of IO errors in file systems. However, with several file systems getting native support for zoned block devices (e.g. btrfs and f2fs), dm-error's lack of zoned block device support creates problems as the file system attempts executing zone commands (e.g. a zone append operation) against a dm-error non-zoned block device, which causes various issues in the block layer (e.g. WARN_ON triggers). This commit adds supports for zoned block devices to dm-error, allowing a DM device table containing an error target to be exposed as a zoned block device (if all targets have a compatible zoned model support and mapping). This is done as follows: 1) Allow passing 2 arguments to an error target, similar to dm-linear: a backing device and a start sector. These arguments are optional and dm-error retains its characteristics if the arguments are not specified. 2) Implement the iterate_devices method so that dm-core can normally check the zone support and restrictions (e.g. zone alignment of the targets). When the backing device arguments are not specified, the iterate_devices method never calls the fn() argument. When no backing device is specified, as before, we assume that the DM device is not zoned. When the backing device arguments are specified, the zoned model of the DM device will depend on the backing device type: - If the backing device is zoned and its model and mapping is compatible with other targets of the device, the resulting device will be zoned, with the dm-error mapped portion always returning errors (similar to the default non-zoned case). - If the backing device is not zoned, then the DM device will not be either. This zone support for dm-error requires the definition of a functional report_zones operation so that dm_revalidate_zones() can operate correctly and resources for emulating zone append operations initialized. This is necessary for cases where dm-error is used to partially map a device and have an overall correct handling of zone append. This means that dm-error does not fail report zones operations. Two changes that are not obvious are included to avoid issues: 1) dm_table_supports_zoned_model() is changed to directly check if the backing device of a wildcard target (= dm-error target) is zoned. Otherwise, we wouldn't be able to catch the invalid setup of dm-error without a backing device (non zoned case) being combined with zoned targets. 2) dm_table_supports_dax() is modified to return false if the wildcard target is found. Otherwise, when dm-error is set without a backing device, we end up with a NULL pointer dereference in set_dax_synchronous (dax_dev is NULL). This is consistent with the current behavior because dm_table_supports_dax() always returned false for targets that do not define the iterate_devices method. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-31dm delay: for short delays, use kthread instead of timers and wqChristian Loehle1-15/+88
DM delay's current design of using timers and wq to realize the delays is insufficient for delays below ~50ms. This commit enhances the design to use a kthread to flush the expired delays, trading some CPU time (in some cases) for better delay accuracy and delays closer to what the user requested for smaller delays. The new design is chosen as long as all the delays are below 50ms. Since bios can't be completed in interrupt context using a kthread is probably the most reasonable way to approach this. Testing with echo "0 2097152 zero" | dmsetup create dm-zeros for i in $(seq 0 20); do echo "0 2097152 delay /dev/mapper/dm-zeros 0 $i" | dmsetup create dm-delay-${i}ms; done Some performance numbers for comparison, on beaglebone black (single core) CONFIG_HZ_1000=y: fio --name=1msread --rw=randread --bs=4k --runtime=60 --time_based \ --filename=/dev/mapper/dm-delay-1ms Theoretical maximum: 1000 IOPS Previous: 250 IOPS Kthread: 500 IOPS fio --name=10msread --rw=randread --bs=4k --runtime=60 --time_based \ --filename=/dev/mapper/dm-delay-10ms Theoretical maximum: 100 IOPS Previous: 45 IOPS Kthread: 50 IOPS fio --name=1mswrite --rw=randwrite --direct=1 --bs=4k --runtime=60 \ --time_based --filename=/dev/mapper/dm-delay-1ms Theoretical maximum: 1000 IOPS Previous: 498 IOPS Kthread: 1000 IOPS fio --name=10mswrite --rw=randwrite --direct=1 --bs=4k --runtime=60 \ --time_based --filename=/dev/mapper/dm-delay-10ms Theoretical maximum: 100 IOPS Previous: 90 IOPS Kthread: 100 IOPS (This one is just to prove the new design isn't impacting throughput, not really about delays): fio --name=10mswriteasync --rw=randwrite --direct=1 --bs=4k \ --runtime=60 --time_based --filename=/dev/mapper/dm-delay-10ms \ --numjobs=32 --iodepth=64 --ioengine=libaio --group_reporting Previous: 13.3k IOPS Kthread: 13.3k IOPS Signed-off-by: Christian Loehle <christian.loehle@arm.com> [Harshit: kthread_create error handling fix in delay_ctr] Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-31Merge tag 'hardening-v6.7-rc1' of ↵Linus Torvalds5-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: "One of the more voluminous set of changes is for adding the new __counted_by annotation[1] to gain run-time bounds checking of dynamically sized arrays with UBSan. - Add LKDTM test for stuck CPUs (Mark Rutland) - Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo) - Refactor more 1-element arrays into flexible arrays (Gustavo A. R. Silva) - Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem Shaikh) - Convert group_info.usage to refcount_t (Elena Reshetova) - Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva) - Add Kconfig fragment for basic hardening options (Kees Cook, Lukas Bulwahn) - Fix randstruct GCC plugin performance mode to stay in groups (Kees Cook) - Fix strtomem() compile-time check for small sources (Kees Cook)" * tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (56 commits) hwmon: (acpi_power_meter) replace open-coded kmemdup_nul reset: Annotate struct reset_control_array with __counted_by kexec: Annotate struct crash_mem with __counted_by virtio_console: Annotate struct port_buffer with __counted_by ima: Add __counted_by for struct modsig and use struct_size() MAINTAINERS: Include stackleak paths in hardening entry string: Adjust strtomem() logic to allow for smaller sources hardening: x86: drop reference to removed config AMD_IOMMU_V2 randstruct: Fix gcc-plugin performance mode to stay in group mailbox: zynqmp: Annotate struct zynqmp_ipi_pdata with __counted_by drivers: thermal: tsens: Annotate struct tsens_priv with __counted_by irqchip/imx-intmux: Annotate struct intmux_data with __counted_by KVM: Annotate struct kvm_irq_routing_table with __counted_by virt: acrn: Annotate struct vm_memory_region_batch with __counted_by hwmon: Annotate struct gsc_hwmon_platform_data with __counted_by sparc: Annotate struct cpuinfo_tree with __counted_by isdn: kcapi: replace deprecated strncpy with strscpy_pad isdn: replace deprecated strncpy with strscpy NFS/flexfiles: Annotate struct nfs4_ff_layout_segment with __counted_by nfs41: Annotate struct nfs4_file_layout_dsaddr with __counted_by ...
2023-10-31Merge tag 'bcachefs-2023-10-30' of https://evilpiepirate.org/git/bcachefsLinus Torvalds7-600/+5
Pull initial bcachefs updates from Kent Overstreet: "Here's the bcachefs filesystem pull request. One new patch since last week: the exportfs constants ended up conflicting with other filesystems that are also getting added to the global enum, so switched to new constants picked by Amir. The only new non fs/bcachefs/ patch is the objtool patch that adds bcachefs functions to the list of noreturns. The patch that exports osq_lock() has been dropped for now, per Ingo" * tag 'bcachefs-2023-10-30' of https://evilpiepirate.org/git/bcachefs: (2781 commits) exportfs: Change bcachefs fid_type enum to avoid conflicts bcachefs: Refactor memcpy into direct assignment bcachefs: Fix drop_alloc_keys() bcachefs: snapshot_create_lock bcachefs: Fix snapshot skiplists during snapshot deletion bcachefs: bch2_sb_field_get() refactoring bcachefs: KEY_TYPE_error now counts towards i_sectors bcachefs: Fix handling of unknown bkey types bcachefs: Switch to unsafe_memcpy() in a few places bcachefs: Use struct_size() bcachefs: Correctly initialize new buckets on device resize bcachefs: Fix another smatch complaint bcachefs: Use strsep() in split_devs() bcachefs: Add iops fields to bch_member bcachefs: Rename bch_sb_field_members -> bch_sb_field_members_v1 bcachefs: New superblock section members_v2 bcachefs: Add new helper to retrieve bch_member from sb bcachefs: bucket_lock() is now a sleepable lock bcachefs: fix crc32c checksum merge byte order problem bcachefs: Fix bch2_inode_delete_keys() ...
2023-10-28bcache: Fixup error handling in register_cache()Jan Kara1-13/+10
Coverity has noticed that the printing of error message in register_cache() uses already freed bdev_handle to get to bdev. In fact the problem has been there even before commit "bcache: Convert to bdev_open_by_path()" just a bit more subtle one - cache object itself could have been freed by the time we looked at ca->bdev and we don't hold any reference to bdev either so even that could in principle go away (due to device unplug or similar). Fix all these problems by printing the error message before closing the bdev. Fixes: dc893f51d24a ("bcache: Convert to bdev_open_by_path()") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231004093757.11560-1-jack@suse.cz Asked-by: Coly Li <colyli@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28md: Convert to bdev_open_by_dev()Jan Kara2-18/+9
Convert md to use bdev_open_by_dev() and pass the handle around. We also don't need the 'Holder' flag anymore so remove it. CC: linux-raid@vger.kernel.org CC: Song Liu <song@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-11-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28dm: Convert to bdev_open_by_dev()Jan Kara1-9/+11
Convert device mapper to use bdev_open_by_dev() and pass the handle around. CC: Alasdair Kergon <agk@redhat.com> CC: Mike Snitzer <snitzer@kernel.org> CC: dm-devel@redhat.com Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-10-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28bcache: Convert to bdev_open_by_path()Jan Kara2-37/+43
Convert bcache to use bdev_open_by_path() and pass the handle around. CC: linux-bcache@vger.kernel.org CC: Coly Li <colyli@suse.de> CC: Kent Overstreet <kent.overstreet@gmail.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-9-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-27dm: respect REQ_NOWAIT flag in normal bios issued to DMMike Snitzer1-11/+29
Update DM core's normal IO submission to allocate required memory using GFP_NOWAIT if REQ_NOWAIT is set. Tested with simple test provided in commit a9ce385344f916 ("dm: don't attempt to queue IO under RCU protection") that was enhanced to check error codes. Also tested using fio's pvsync2 with nowait=1. But testing with induced GFP_NOWAIT allocation failures wasn't performed (yet). Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-27dm: enhance alloc_multiple_bios() to be more versatileMike Snitzer1-34/+34
alloc_multiple_bios() has the useful ability to try allocating bios with GFP_NOWAIT but will fallback to using GFP_NOIO. The callers service both empty flush bios and abnormal bios (e.g. discard). alloc_multiple_bios() enhancements offered in this commit: - don't require table_devices_lock if num_bios = 1 - allow caller to pass GFP_NOWAIT to do usual GFP_NOWAIT with GFP_NOIO fallback - allow caller to pass GFP_NOIO to _only_ allocate using GFP_NOIO Flush bios with data may be issued to DM with REQ_NOWAIT, as such it makes sense to attempt servicing them with GFP_NOWAIT allocations. But abnormal IO should never be issued using REQ_NOWAIT (if that changes in the future that's fine, but no sense supporting it now). While at it, rename __send_changing_extent_only() to __send_abnormal_io(). [Thanks to both Ming and Mikulas for help with translating known possible IO scenarios to requirements.] Suggested-by: Ming Lei <ming.lei@redhat.com> Suggested-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-23dm: make __send_duplicate_bios return unsigned intMikulas Patocka1-2/+2
All the callers cast the value returned by __send_duplicate_bios to unsigned int type, so we can return unsigned int as well. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-23dm log userspace: replace deprecated strncpy with strscpyJustin Stitt1-1/+1
`strncpy` is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. `lc` is already zero-allocated: | lc = kzalloc(sizeof(*lc), GFP_KERNEL); ... as such, any future NUL-padding is superfluous. A suitable replacement is `strscpy` [2] due to the fact that it guarantees NUL-termination on the destination buffer without unnecessarily NUL-padding. Let's also go with the more idiomatic `dest, src, sizeof(dest)` pattern for destination buffers that the compiler can calculate the size for. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-23dm ioctl: replace deprecated strncpy with strscpy_padJustin Stitt1-2/+2
`strncpy` is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. We expect `spec->target_type` to be NUL-terminated based on its use with a format string after `dm_table_add_target()` is called | r = dm_table_add_target(table, spec->target_type, | (sector_t) spec->sector_start, | (sector_t) spec->length, | target_params); ... wherein `spec->target_type` is passed as parameter `type` and later printed with DMERR: | DMERR("%s: %s: unknown target type", dm_device_name(t->md), type); It appears that `spec` is not zero-allocated and thus NUL-padding may be required in this ioctl context. Considering the above, a suitable replacement is `strscpy_pad` due to the fact that it guarantees NUL-termination whilst maintaining the NUL-padding behavior that strncpy provides. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://github.com/KSPP/linux/issues/90 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-23dm crypt: replace open-coded kmemdup_nulJustin Stitt1-2/+1
kzalloc() followed by strncpy() on an expected NUL-terminated string is just kmemdup_nul(). Let's simplify this code (while also dropping a deprecated strncpy() call [1]). Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://github.com/KSPP/linux/issues/90 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-23dm cache metadata: replace deprecated strncpy with strscpyJustin Stitt1-3/+3
`strncpy` is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. It seems `cmd->policy_name` is intended to be NUL-terminated based on a now changed line of code from Commit (c6b4fcbad044e6ff "dm: add cache target"): | if (strcmp(cmd->policy_name, policy_name)) { // ... However, now a length-bounded strncmp is used: | if (strncmp(cmd->policy_name, policy_name, sizeof(cmd->policy_name))) ... which means NUL-terminated may not strictly be required. However, I believe the intent of the code is clear and we should maintain NUL-termination of policy_names. Moreover, __begin_transaction_flags() zero-allocates `cmd` before calling read_superblock_fields(): | cmd = kzalloc(sizeof(*cmd), GFP_KERNEL); Also, `disk_super->policy_name` is zero-initialized | memset(disk_super->policy_name, 0, sizeof(disk_super->policy_name)); ... therefore any NUL-padding is redundant. Considering the above, a suitable replacement is `strscpy` [2] due to the fact that it guarantees NUL-termination on the destination buffer without unnecessarily NUL-padding. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-19bcache: move closures to lib/Kent Overstreet7-600/+5
Prep work for bcachefs - being a fork of bcache it also uses closures Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Acked-by: Coly Li <colyli@suse.de> Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
2023-10-18md: cleanup pers->prepare_suspend()Yu Kuai3-62/+17
pers->prepare_suspend() is not used anymore and can be removed. Reverts following three commit: - commit 431e61257d63 ("md: export md_is_rdwr() and is_md_suspended()") - commit 3e00777d5157 ("md: add a new api prepare_suspend() in md_personality") - commit 868bba54a3bc ("md/raid5: fix a deadlock in the case that reshape is interrupted") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231016100240.540474-1-yukuai1@huaweicloud.com
2023-10-12md-cluster: check for timeout while a new disk addingDenis Plotnikov1-4/+11
A new disk adding may end up with timeout and a new disk won't be added. Add returning the error in that case. Found by Linux Verification Center (linuxtesting.org) with SVACE Signed-off-by: Denis Plotnikov <den-plotnikov@yandex-team.ru> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230925125940.1542506-1-den-plotnikov@yandex-team.ru
2023-10-11raid: Remove now superfluous sentinel element from ctl_table arrayJoel Granados1-1/+0
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel from raid_table Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-10-11md: rename __mddev_suspend/resume() back to mddev_suspend/resume()Yu Kuai4-19/+19
Now that the old apis are removed, __mddev_suspend/resume() can be renamed to their original names. This is done by: sed -i "s/__mddev_suspend/mddev_suspend/g" *.[ch] sed -i "s/__mddev_resume/mddev_resume/g" *.[ch] Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-20-yukuai1@huaweicloud.com
2023-10-11md: remove old apis to suspend the arrayYu Kuai2-87/+3
Now that mddev_suspend() and mddev_resume() is not used anywhere, remove them, and remove 'MD_ALLOW_SB_UPDATE' and 'MD_UPDATING_SB' as well. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-19-yukuai1@huaweicloud.com
2023-10-11md: suspend array in md_start_sync() if array need reconfigurationYu Kuai1-3/+8
So that io won't concurrent with array reconfiguration, and it's safe to suspend the array directly because normal io won't rely on md_start_sync(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-18-yukuai1@huaweicloud.com
2023-10-11md/raid5: replace suspend with quiesce() callbackYu Kuai1-9/+9
raid5 is the only personality to suspend array in check_reshape() and start_reshape() callback, suspend and quiesce() callback can both wait for all normal io to be done, and prevent new io to be dispatched, the difference is that suspend is implemented in common layer, and quiesce() callback is implemented in raid5. In order to cleanup all the usage of mddev_suspend(), the new apis __mddev_suspend() need to be called before 'reconfig_mutex' is held, and it's not good to affect all the personalities in common layer just for raid5. Hence replace suspend with quiesce() callaback, prepare to reomove all the users of mddev_suspend(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-17-yukuai1@huaweicloud.com
2023-10-11md/md-linear: cleanup linear_add()Yu Kuai1-2/+0
Now that caller already suspend the array, there is no need to suspend array in liner_add(). Note that mddev_suspend/resume() is not used anymore. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-16-yukuai1@huaweicloud.com
2023-10-11md: cleanup mddev_create/destroy_serial_pool()Yu Kuai3-31/+17
Now that except for stopping the array, all the callers already suspend the array, there is no need to suspend anymore, hence remove the second parameter. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-15-yukuai1@huaweicloud.com
2023-10-11md: use new apis to suspend array before mddev_create/destroy_serial_poolYu Kuai3-16/+18
mddev_create/destroy_serial_pool() will be called from several places where mddev_suspend() will be called later. Prepare to remove the mddev_suspend() from mddev_create/destroy_serial_pool(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-14-yukuai1@huaweicloud.com
2023-10-11md: use new apis to suspend array for ioctls involed array reconfigurationYu Kuai1-10/+20
'reconfig_mutex' will be grabbed before these ioctls, suspend array before holding the lock, so that io won't concurrent with array reconfiguration through ioctls. This is not hot path, so performance is not concerned. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-13-yukuai1@huaweicloud.com
2023-10-11md: use new apis to suspend array for adding/removing rdev from state_store()Yu Kuai1-8/+11
User can write 'remove' and 're-add' to trigger array reconfiguration through sysfs, suspend array in this case so that io won't concurrent with array reconfiguration. And now that all the caller of add_bound_rdev() alread suspend the array, remove mddev_suspend/resume() from add_bound_rdev() as well. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-12-yukuai1@huaweicloud.com
2023-10-11md: use new apis to suspend array for sysfs apisYu Kuai1-16/+8
Convert to use new apis in following sysfs apis: - level_store - suspend_lo_store - suspend_hi_store - serialize_policy_store These are not hot path, so performance is not concerned. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-11-yukuai1@huaweicloud.com
2023-10-11md/raid5: use new apis to suspend arrayYu Kuai1-26/+12
Convert to use new apis, the old apis will be removed eventually. These are not hot path, so performance is not concerned. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-10-yukuai1@huaweicloud.com
2023-10-11md/raid5-cache: use new apis to suspend arrayYu Kuai1-11/+8
Convert to use new apis, the old apis will be removed eventually. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-9-yukuai1@huaweicloud.com
2023-10-11md/md-bitmap: use new apis to suspend array for location_store()Yu Kuai1-4/+2
Convert to use new apis, the old apis will be removed eventually. This is not hot path, so performance is not concerned. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-8-yukuai1@huaweicloud.com
2023-10-11md/dm-raid: use new apis to suspend arrayYu Kuai1-8/+4
Convert to use new apis, the old apis will be removed eventually. These are not hot path, so performance is not concerned. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-7-yukuai1@huaweicloud.com
2023-10-11md: add new helpers to suspend/resume and lock/unlock arrayYu Kuai1-0/+27
The new helpers suspend the array first and then lock the array, Prepare to refactor from: mddev_lock/lock_nointr mddev_suspend ... mddev_resuem mddev_lock With: mddev_suspend_and_lock/lock_nointr ... mddev_unlock_and_resume After all the use cases is refactored, mddev_suspend/resume() will be removed. And mddev_suspend_and_lock() will also replace mddev_lock() for the case that the array will be reconfigured, in order to synchronize with io to prevent problems in many corner cases. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-6-yukuai1@huaweicloud.com
2023-10-11md: add new helpers to suspend/resume arrayYu Kuai2-2/+103
Advantages for new apis: - reconfig_mutex is not required; - the weird logical that suspend array hold 'reconfig_mutex' for mddev_check_recovery() to update superblock is not needed; - the specail handling, 'pers->prepare_suspend', for raid456 is not needed; - It's safe to be called at any time once mddev is allocated, and it's designed to be used from slow path where array configuration is changed; - the new helpers is designed to be called before mddev_lock(), hence it support to be interrupted by user as well. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-5-yukuai1@huaweicloud.com
2023-10-11md: replace is_md_suspended() with 'mddev->suspended' in md_check_recovery()Yu Kuai1-1/+1
Prepare to cleanup pers->prepare_suspend(), which is used to fix a deadlock in raid456 by returning error for io that is waiting for reshape to make progress in mddev_suspend(). This change will allow reshape to make progress while waiting for io to be done in mddev_suspend() in following patches. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-4-yukuai1@huaweicloud.com
2023-10-11md/raid5-cache: use READ_ONCE/WRITE_ONCE for 'conf->log'Yu Kuai1-22/+25
'conf->log' is set with 'reconfig_mutex' grabbed, however, readers are not procted, hence protect it with READ_ONCE/WRITE_ONCE to prevent reading abnormal values. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-3-yukuai1@huaweicloud.com
2023-10-11md: use READ_ONCE/WRITE_ONCE for 'suspend_lo' and 'suspend_hi'Yu Kuai1-7/+9
Protect 'suspend_lo' and 'suspend_hi' with READ_ONCE/WRITE_ONCE to prevent reading abnormal values. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-2-yukuai1@huaweicloud.com
2023-10-10Merge tag 'v6.6-p4' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fix from Herbert Xu: "Fix a regression in dm-crypt" * tag 'v6.6-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: dm crypt: Fix reqsize in crypt_iv_eboiv_gen
2023-10-10md/raid1: don't split discard io for write behindYu Kuai1-1/+2
Currently, discad io is treated the same as normal write io, and for write behind case, io size is limited to: BIO_MAX_VECS * (PAGE_SIZE >> 9) For 0.5KB sector size and 4KB PAGE_SIZE, this is just 1MB. For consequence, if 'WriteMostly' is set to one of the underlying disks, then diskcard io will be splited into 1MB and it will take a long time for the diskcard to finish. Fix this problem by disable write behind for discard io. Reported-by: Roman Mamedov <rm@romanrm.net> Closes: https://lore.kernel.org/all/6a1165f7-c792-c054-b8f0-1ad4f7b8ae01@ultracoder.org/ Reported-and-tested-by: Kirill Kirilenko <kirill@ultracoder.org> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231007112105.407449-1-yukuai1@huaweicloud.com
2023-10-07Merge tag 'for-6.6/dm-fixes-2' of ↵Linus Torvalds1-8/+7
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix memory leak when freeing dm zoned target device - Update dm-devel mailing list address in MAINTAINERS * tag 'for-6.6/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: MAINTAINERS: update the dm-devel mailing list dm zoned: free dmz->ddev array in dmz_put_zoned_devices
2023-10-07dm: shortcut the calls to linear_map and stripe_mapMikulas Patocka4-4/+13
Shortcut the calls to linear_map and stripe_map, so that they don't suffer the overhead of retpolines used for indirect calls. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-07Merge tag 'block-6.6-2023-10-06' of git://git.kernel.dk/linuxLinus Torvalds1-0/+7
Pull block fixes from Jens Axboe: "Just two minor fixes, for nbd and md" * tag 'block-6.6-2023-10-06' of git://git.kernel.dk/linux: nbd: don't call blk_mark_disk_dead nbd_clear_sock_ioctl md/raid5: release batch_last before waiting for another stripe_head