summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2024-03-13mm, slab: remove last vestiges of SLAB_MEM_SPREADLinus Torvalds35-57/+46
Yes, yes, I know the slab people were planning on going slow and letting every subsystem fight this thing on their own. But let's just rip off the band-aid and get it over and done with. I don't want to see a number of unnecessary pull requests just to get rid of a flag that no longer has any meaning. This was mainly done with a couple of 'sed' scripts and then some manual cleanup of the end result. Link: https://lore.kernel.org/all/CAHk-=wji0u+OOtmAOD-5JV3SXcRJF___k_+8XNKmak0yd5vW1Q@mail.gmail.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-13Merge tag 'lsm-pr-20240312' of ↵Linus Torvalds7-22/+16
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull lsm updates from Paul Moore: - Promote IMA/EVM to a proper LSM This is the bulk of the diffstat, and the source of all the changes in the VFS code. Prior to the start of the LSM stacking work it was important that IMA/EVM were separate from the rest of the LSMs, complete with their own hooks, infrastructure, etc. as it was the only way to enable IMA/EVM at the same time as a LSM. However, now that the bulk of the LSM infrastructure supports multiple simultaneous LSMs, we can simplify things greatly by bringing IMA/EVM into the LSM infrastructure as proper LSMs. This is something I've wanted to see happen for quite some time and Roberto was kind enough to put in the work to make it happen. - Use the LSM hook default values to simplify the call_int_hook() macro Previously the call_int_hook() macro required callers to supply a default return value, despite a default value being specified when the LSM hook was defined. This simplifies the macro by using the defined default return value which makes life easier for callers and should also reduce the number of return value bugs in the future (we've had a few pop up recently, hence this work). - Use the KMEM_CACHE() macro instead of kmem_cache_create() The guidance appears to be to use the KMEM_CACHE() macro when possible and there is no reason why we can't use the macro, so let's use it. - Fix a number of comment typos in the LSM hook comment blocks Not much to say here, we fixed some questionable grammar decisions in the LSM hook comment blocks. * tag 'lsm-pr-20240312' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (28 commits) cred: Use KMEM_CACHE() instead of kmem_cache_create() lsm: use default hook return value in call_int_hook() lsm: fix typos in security/security.c comment headers integrity: Remove LSM ima: Make it independent from 'integrity' LSM evm: Make it independent from 'integrity' LSM evm: Move to LSM infrastructure ima: Move IMA-Appraisal to LSM infrastructure ima: Move to LSM infrastructure integrity: Move integrity_kernel_module_request() to IMA security: Introduce key_post_create_or_update hook security: Introduce inode_post_remove_acl hook security: Introduce inode_post_set_acl hook security: Introduce inode_post_create_tmpfile hook security: Introduce path_post_mknod hook security: Introduce file_release hook security: Introduce file_post_open hook security: Introduce inode_post_removexattr hook security: Introduce inode_post_setattr hook security: Align inode_setattr hook definition with EVM ...
2024-03-13Merge tag 'net-next-6.9' of ↵Linus Torvalds2-9/+126
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Large effort by Eric to lower rtnl_lock pressure and remove locks: - Make commonly used parts of rtnetlink (address, route dumps etc) lockless, protected by RCU instead of rtnl_lock. - Add a netns exit callback which already holds rtnl_lock, allowing netns exit to take rtnl_lock once in the core instead of once for each driver / callback. - Remove locks / serialization in the socket diag interface. - Remove 6 calls to synchronize_rcu() while holding rtnl_lock. - Remove the dev_base_lock, depend on RCU where necessary. - Support busy polling on a per-epoll context basis. Poll length and budget parameters can be set independently of system defaults. - Introduce struct net_hotdata, to make sure read-mostly global config variables fit in as few cache lines as possible. - Add optional per-nexthop statistics to ease monitoring / debug of ECMP imbalance problems. - Support TCP_NOTSENT_LOWAT in MPTCP. - Ensure that IPv6 temporary addresses' preferred lifetimes are long enough, compared to other configured lifetimes, and at least 2 sec. - Support forwarding of ICMP Error messages in IPSec, per RFC 4301. - Add support for the independent control state machine for bonding per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled control state machine. - Add "network ID" to MCTP socket APIs to support hosts with multiple disjoint MCTP networks. - Re-use the mono_delivery_time skbuff bit for packets which user space wants to be sent at a specified time. Maintain the timing information while traversing veth links, bridge etc. - Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets. - Simplify many places iterating over netdevs by using an xarray instead of a hash table walk (hash table remains in place, for use on fastpaths). - Speed up scanning for expired routes by keeping a dedicated list. - Speed up "generic" XDP by trying harder to avoid large allocations. - Support attaching arbitrary metadata to netconsole messages. Things we sprinkled into general kernel code: - Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce VM_SPARSE kind and vm_area_[un]map_pages (used by bpf_arena). - Rework selftest harness to enable the use of the full range of ksft exit code (pass, fail, skip, xfail, xpass). Netfilter: - Allow userspace to define a table that is exclusively owned by a daemon (via netlink socket aliveness) without auto-removing this table when the userspace program exits. Such table gets marked as orphaned and a restarting management daemon can re-attach/regain ownership. - Speed up element insertions to nftables' concatenated-ranges set type. Compact a few related data structures. BPF: - Add BPF token support for delegating a subset of BPF subsystem functionality from privileged system-wide daemons such as systemd through special mount options for userns-bound BPF fs to a trusted & unprivileged application. - Introduce bpf_arena which is sparse shared memory region between BPF program and user space where structures inside the arena can have pointers to other areas of the arena, and pointers work seamlessly for both user-space programs and BPF programs. - Introduce may_goto instruction that is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it. - Extend the BPF verifier to enable static subprog calls in spin lock critical sections. - Support registration of struct_ops types from modules which helps projects like fuse-bpf that seeks to implement a new struct_ops type. - Add support for retrieval of cookies for perf/kprobe multi links. - Support arbitrary TCP SYN cookie generation / validation in the TC layer with BPF to allow creating SYN flood handling in BPF firewalls. - Add code generation to inline the bpf_kptr_xchg() helper which improves performance when stashing/popping the allocated BPF objects. Wireless: - Add SPP (signaling and payload protected) AMSDU support. - Support wider bandwidth OFDMA, as required for EHT operation. Driver API: - Major overhaul of the Energy Efficient Ethernet internals to support new link modes (2.5GE, 5GE), share more code between drivers (especially those using phylib), and encourage more uniform behavior. Convert and clean up drivers. - Define an API for querying per netdev queue statistics from drivers. - IPSec: account in global stats for fully offloaded sessions. - Create a concept of Ethernet PHY Packages at the Device Tree level, to allow parameterizing the existing PHY package code. - Enable Rx hashing (RSS) on GTP protocol fields. Misc: - Improvements and refactoring all over networking selftests. - Create uniform module aliases for TC classifiers, actions, and packet schedulers to simplify creating modprobe policies. - Address all missing MODULE_DESCRIPTION() warnings in networking. - Extend the Netlink descriptions in YAML to cover message encapsulation or "Netlink polymorphism", where interpretation of nested attributes depends on link type, classifier type or some other "class type". Drivers: - Ethernet high-speed NICs: - Add a new driver for Marvell's Octeon PCI Endpoint NIC VF. - Intel (100G, ice, idpf): - support E825-C devices - nVidia/Mellanox: - support devices with one port and multiple PCIe links - Broadcom (bnxt): - support n-tuple filters - support configuring the RSS key - Wangxun (ngbe/txgbe): - implement irq_domain for TXGBE's sub-interrupts - Pensando/AMD: - support XDP - optimize queue submission and wakeup handling (+17% bps) - optimize struct layout, saving 28% of memory on queues - Ethernet NICs embedded and virtual: - Google cloud vNIC: - refactor driver to perform memory allocations for new queue config before stopping and freeing the old queue memory - Synopsys (stmmac): - obey queueMaxSDU and implement counters required by 802.1Qbv - Renesas (ravb): - support packet checksum offload - suspend to RAM and runtime PM support - Ethernet switches: - nVidia/Mellanox: - support for nexthop group statistics - Microchip: - ksz8: implement PHY loopback - add support for KSZ8567, a 7-port 10/100Mbps switch - PTP: - New driver for RENESAS FemtoClock3 Wireless clock generator. - Support OCP PTP cards designed and built by Adva. - CAN: - Support recvmsg() flags for own, local and remote traffic on CAN BCM sockets. - Support for esd GmbH PCIe/402 CAN device family. - m_can: - Rx/Tx submission coalescing - wake on frame Rx - WiFi: - Intel (iwlwifi): - enable signaling and payload protected A-MSDUs - support wider-bandwidth OFDMA - support for new devices - bump FW API to 89 for AX devices; 90 for BZ/SC devices - MediaTek (mt76): - mt7915: newer ADIE version support - mt7925: radio temperature sensor support - Qualcomm (ath11k): - support 6 GHz station power modes: Low Power Indoor (LPI), Standard Power) SP and Very Low Power (VLP) - QCA6390 & WCN6855: support 2 concurrent station interfaces - QCA2066 support - Qualcomm (ath12k): - refactoring in preparation for Multi-Link Operation (MLO) support - 1024 Block Ack window size support - firmware-2.bin support - support having multiple identical PCI devices (firmware needs to have ATH12K_FW_FEATURE_MULTI_QRTR_ID) - QCN9274: support split-PHY devices - WCN7850: enable Power Save Mode in station mode - WCN7850: P2P support - RealTek: - rtw88: support for more rtw8811cu and rtw8821cu devices - rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL - rtlwifi: speed up USB firmware initialization - rtwl8xxxu: - RTL8188F: concurrent interface support - Channel Switch Announcement (CSA) support in AP mode - Broadcom (brcmfmac): - per-vendor feature support - per-vendor SAE password setup - DMI nvram filename quirk for ACEPC W5 Pro" * tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits) nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y nexthop: Fix out-of-bounds access during attribute validation nexthop: Only parse NHA_OP_FLAGS for dump messages that require it nexthop: Only parse NHA_OP_FLAGS for get messages that require it bpf: move sleepable flag from bpf_prog_aux to bpf_prog bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes() selftests/bpf: Add kprobe multi triggering benchmarks ptp: Move from simple ida to xarray vxlan: Remove generic .ndo_get_stats64 vxlan: Do not alloc tstats manually devlink: Add comments to use netlink gen tool nfp: flower: handle acti_netdevs allocation failure net/packet: Add getsockopt support for PACKET_COPY_THRESH net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID selftests/bpf: Add bpf_arena_htab test. selftests/bpf: Add bpf_arena_list test. selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages bpf: Add helper macro bpf_addr_space_cast() libbpf: Recognize __arena global variables. bpftool: Recognize arena map type ...
2024-03-13Merge tag 'Smack-for-6.9' of https://github.com/cschaufler/smack-nextLinus Torvalds1-1/+31
Pull smack updates from Casey Schaufler: - Improvements to the initialization of in-memory inodes - A fix in ramfs to propery ensure the initialization of in-memory inodes - Removal of duplicated code in smack_cred_transfer() * tag 'Smack-for-6.9' of https://github.com/cschaufler/smack-next: Smack: use init_task_smack() in smack_cred_transfer() ramfs: Initialize security of in-memory inodes smack: Initialize the in-memory inode in smack_inode_init_security() smack: Always determine inode labels in smack_inode_init_security() smack: Handle SMACK64TRANSMUTE in smack_inode_setsecurity() smack: Set SMACK64TRANSMUTE only for dirs in smack_inode_setxattr()
2024-03-13Merge tag 'hardening-v6.9-rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: "As is pretty normal for this tree, there are changes all over the place, especially for small fixes, selftest improvements, and improved macro usability. Some header changes ended up landing via this tree as they depended on the string header cleanups. Also, a notable set of changes is the work for the reintroduction of the UBSAN signed integer overflow sanitizer so that we can continue to make improvements on the compiler side to make this sanitizer a more viable future security hardening option. Summary: - string.h and related header cleanups (Tanzir Hasan, Andy Shevchenko) - VMCI memcpy() usage and struct_size() cleanups (Vasiliy Kovalev, Harshit Mogalapalli) - selftests/powerpc: Fix load_unaligned_zeropad build failure (Michael Ellerman) - hardened Kconfig fragment updates (Marco Elver, Lukas Bulwahn) - Handle tail call optimization better in LKDTM (Douglas Anderson) - Use long form types in overflow.h (Andy Shevchenko) - Add flags param to string_get_size() (Andy Shevchenko) - Add Coccinelle script for potential struct_size() use (Jacob Keller) - Fix objtool corner case under KCFI (Josh Poimboeuf) - Drop 13 year old backward compat CAP_SYS_ADMIN check (Jingzi Meng) - Add str_plural() helper (Michal Wajdeczko, Kees Cook) - Ignore relocations in .notes section - Add comments to explain how __is_constexpr() works - Fix m68k stack alignment expectations in stackinit Kunit test - Convert string selftests to KUnit - Add KUnit tests for fortified string functions - Improve reporting during fortified string warnings - Allow non-type arg to type_max() and type_min() - Allow strscpy() to be called with only 2 arguments - Add binary mode to leaking_addresses scanner - Various small cleanups to leaking_addresses scanner - Adding wrapping_*() arithmetic helper - Annotate initial signed integer wrap-around in refcount_t - Add explicit UBSAN section to MAINTAINERS - Fix UBSAN self-test warnings - Simplify UBSAN build via removal of CONFIG_UBSAN_SANITIZE_ALL - Reintroduce UBSAN's signed overflow sanitizer" * tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (51 commits) selftests/powerpc: Fix load_unaligned_zeropad build failure string: Convert helpers selftest to KUnit string: Convert selftest to KUnit sh: Fix build with CONFIG_UBSAN=y compiler.h: Explain how __is_constexpr() works overflow: Allow non-type arg to type_max() and type_min() VMCI: Fix possible memcpy() run-time warning in vmci_datagram_invoke_guest_handler() lib/string_helpers: Add flags param to string_get_size() x86, relocs: Ignore relocations in .notes section objtool: Fix UNWIND_HINT_{SAVE,RESTORE} across basic blocks overflow: Use POD in check_shl_overflow() lib: stackinit: Adjust target string to 8 bytes for m68k sparc: vdso: Disable UBSAN instrumentation kernel.h: Move lib/cmdline.c prototypes to string.h leaking_addresses: Provide mechanism to scan binary files leaking_addresses: Ignore input device status lines leaking_addresses: Use File::Temp for /tmp files MAINTAINERS: Update LEAKING_ADDRESSES details fortify: Improve buffer overflow reporting fortify: Add KUnit tests for runtime overflows ...
2024-03-13Merge tag 'execve-v6.9-rc1' of ↵Linus Torvalds2-9/+4
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: - Drop needless error path code in remove_arg_zero() (Li kunyu, Kees Cook) - binfmt_elf_efpic: Don't use missing interpreter's properties (Max Filippov) - Use /bin/bash for execveat selftests * tag 'execve-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: exec: Simplify remove_arg_zero() error path selftests/exec: Perform script checks with /bin/bash exec: Delete unnecessary statements in remove_arg_zero() fs: binfmt_elf_efpic: don't use missing interpreter's properties
2024-03-13Merge tag 'pstore-v6.9-rc1' of ↵Linus Torvalds3-8/+6
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore updates from Kees Cook: - Make PSTORE_RAM available by default on arm64 (Nícolas F R A Prado) - Allow for dynamic initialization in modular build (Guilherme G Piccoli) - Add missing allocation failure check (Kunwu Chan) - Avoid duplicate memory zeroing (Christophe JAILLET) - Avoid potential double-free during pstorefs umount * tag 'pstore-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/zone: Don't clear memory twice pstore/zone: Add a null pointer check to the psz_kmsg_read efi: pstore: Allow dynamic initialization based on module parameter arm64: defconfig: Enable PSTORE_RAM pstore/ram: Register to module device table pstore: inode: Only d_invalidate() is needed
2024-03-13Merge tag 'nfsd-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds29-557/+1344
Pull nfsd updates from Chuck Lever: "The bulk of the patches for this release are optimizations, code clean-ups, and minor bug fixes. One new feature to mention is that NFSD administrators now have the ability to revoke NFSv4 open and lock state. NFSD's NFSv3 support has had this capability for some time. As always I am grateful to NFSD contributors, reviewers, and testers" * tag 'nfsd-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (75 commits) NFSD: Clean up nfsd4_encode_replay() NFSD: send OP_CB_RECALL_ANY to clients when number of delegations reaches its limit NFSD: Document nfsd_setattr() fill-attributes behavior nfsd: Fix NFSv3 atomicity bugs in nfsd_setattr() nfsd: Fix a regression in nfsd_setattr() NFSD: OP_CB_RECALL_ANY should recall both read and write delegations NFSD: handle GETATTR conflict with write delegation NFSD: add support for CB_GETATTR callback NFSD: Document the phases of CREATE_SESSION NFSD: Fix the NFSv4.1 CREATE_SESSION operation nfsd: clean up comments over nfs4_client definition svcrdma: Add Write chunk WRs to the RPC's Send WR chain svcrdma: Post WRs for Write chunks in svc_rdma_sendto() svcrdma: Post the Reply chunk and Send WR together svcrdma: Move write_info for Reply chunks into struct svc_rdma_send_ctxt svcrdma: Post Send WR chain svcrdma: Fix retry loop in svc_rdma_send() svcrdma: Prevent a UAF in svc_rdma_send() svcrdma: Fix SQ wake-ups svcrdma: Increase the per-transport rw_ctx count ...
2024-03-12Merge tag 'erofs-for-6.9-rc1' of ↵Linus Torvalds9-317/+335
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "In this cycle, we introduce compressed inode support over fscache since a lot of native EROFS images are explicitly compressed so that EROFS over fscache can be more widely used even without Dragonfly Nydus [1]. Apart from that, there are some folio conversions for compressed inodes available as well as a lockdep false positive fix. Summary: - Some folio conversions for compressed inodes; - Add compressed inode support over fscache; - Fix lockdep false positives of erofs_pseudo_mnt" Link: https://nydus.dev [1] * tag 'erofs-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: support compressed inodes over fscache erofs: make iov_iter describe target buffers over fscache erofs: fix lockdep false positives on initializing erofs_pseudo_mnt erofs: refine managed cache operations to folios erofs: convert z_erofs_submissionqueue_endio() to folios erofs: convert z_erofs_fill_bio_vec() to folios erofs: get rid of `justfound` debugging tag erofs: convert z_erofs_do_read_page() to folios erofs: convert z_erofs_onlinepage_.* to folios
2024-03-12Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linuxLinus Torvalds3-26/+24
Pull fsverity update from Eric Biggers: "Slightly improve data verification performance by eliminating an unnecessary lock" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux: fsverity: remove hash page spin lock
2024-03-12Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linuxLinus Torvalds3-12/+21
Pull fscrypt updates from Eric Biggers: "Fix flakiness in a test by releasing the quota synchronously when a key is removed, and other minor cleanups" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux: fscrypt: shrink the size of struct fscrypt_inode_info slightly fscrypt: write CBC-CTS instead of CTS-CBC fscrypt: clear keyring before calling key_put() fscrypt: explicitly require that inode->i_blkbits be set
2024-03-12Merge tag 'affs-for-6.9' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull affs update from David Sterba: "One change to AFFS that removes use of SLAB_MEM_SPREAD, which is going to be removed from MM code" * tag 'affs-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: affs: remove SLAB_MEM_SPREAD flag usage
2024-03-12Merge tag 'for-6.9-tag' of ↵Linus Torvalds118-1117/+2131
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Mostly stabilization, refactoring and cleanup changes. There rest are minor performance optimizations due to caching or lock contention reduction and a few notable fixes. Performance improvements: - minor speedup in logging when repeatedly allocated structure is preallocated only once, improves latency and decreases lock contention - minor throughput increase (+6%), reduced lock contention after clearing delayed allocation bits, applies to several common workload types - skip full quota rescan if a new relation is added in the same transaction Fixes: - zstd fix for inline compressed file in subpage mode, updated version from the 6.8 time - proper qgroup inheritance ioctl parameter validation - more fiemap followup fixes after reduced locking done in 6.8: - fix race when detecting delalloc ranges Core changes: - more debugging code: - added assertions for a very rare crash in raid56 calculation - tree-checker dumps page state to give more insights into possible reference counting issues - add checksum calculation offloading sysfs knob, for now enabled under DEBUG only to determine a good heuristic for deciding the offload or synchronous, depends on various factors (block group profile, device speed) and is not as clear as initially thought (checksum type) - error handling improvements, added assertions - more page to folio conversion (defrag, truncate), cached size and shift - preparation for more fine grained locking of sectors in subpage mode - cleanups and refactoring: - include cleanups, forward declarations - pointer-to-structure helpers - redundant argument removals - removed unused code - slab cache updates, last use of SLAB_MEM_SPREAD removed" * tag 'for-6.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits) btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations btrfs: fix race when detecting delalloc ranges during fiemap btrfs: fix off-by-one chunk length calculation at contains_pending_extent() btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent btrfs: qgroup: validate btrfs_qgroup_inherit parameter btrfs: include device major and minor numbers in the device scan notice btrfs: mark btrfs_put_caching_control() static btrfs: remove SLAB_MEM_SPREAD flag use btrfs: qgroup: always free reserved space for extent records btrfs: tree-checker: dump the page status if hit something wrong btrfs: compression: remove dead comments in btrfs_compress_heuristic() btrfs: subpage: make writer lock utilize bitmap btrfs: subpage: make reader lock utilize bitmap btrfs: unexport btrfs_subpage_start_writer() and btrfs_subpage_end_and_test_writer() btrfs: pass a valid extent map cache pointer to __get_extent_map() btrfs: merge btrfs_del_delalloc_inode() helpers btrfs: pass btrfs_device to btrfs_scratch_superblocks() btrfs: handle transaction commit errors in flush_reservations() btrfs: use KMEM_CACHE() to create btrfs_free_space cache btrfs: use KMEM_CACHE() to create delayed ref caches ...
2024-03-12Merge tag 'zonefs-6.9-rc1' of ↵Linus Torvalds1-72/+95
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs Pull zonefs update from Damien Le Moal: - A single change for this cycle to convert zonefs to use the new mount API * tag 'zonefs-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: zonefs: convert zonefs to use the new mount api
2024-03-11Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linuxLinus Torvalds3-14/+38
Pull block updates from Jens Axboe: - MD pull requests via Song: - Cleanup redundant checks (Yu Kuai) - Remove deprecated headers (Marc Zyngier, Song Liu) - Concurrency fixes (Li Lingfeng) - Memory leak fix (Li Nan) - Refactor raid1 read_balance (Yu Kuai, Paul Luse) - Clean up and fix for md_ioctl (Li Nan) - Other small fixes (Gui-Dong Han, Heming Zhao) - MD atomic limits (Christoph) - NVMe pull request via Keith: - RDMA target enhancements (Max) - Fabrics fixes (Max, Guixin, Hannes) - Atomic queue_limits usage (Christoph) - Const use for class_register (Ricardo) - Identification error handling fixes (Shin'ichiro, Keith) - Improvement and cleanup for cached request handling (Christoph) - Moving towards atomic queue limits. Core changes and driver bits so far (Christoph) - Fix UAF issues in aoeblk (Chun-Yi) - Zoned fix and cleanups (Damien) - s390 dasd cleanups and fixes (Jan, Miroslav) - Block issue timestamp caching (me) - noio scope guarding for zoned IO (Johannes) - block/nvme PI improvements (Kanchan) - Ability to terminate long running discard loop (Keith) - bdev revalidation fix (Li) - Get rid of old nr_queues hack for kdump kernels (Ming) - Support for async deletion of ublk (Ming) - Improve IRQ bio recycling (Pavel) - Factor in CPU capacity for remote vs local completion (Qais) - Add shared_tags configfs entry for null_blk (Shin'ichiro - Fix for a regression in page refcounts introduced by the folio unification (Tony) - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid, Ricardo, Roman, Tang, Uwe) * tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits) block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC block/swim: Convert to platform remove callback returning void cdrom: gdrom: Convert to platform remove callback returning void block: remove disk_stack_limits md: remove mddev->queue md: don't initialize queue limits md/raid10: use the atomic queue limit update APIs md/raid5: use the atomic queue limit update APIs md/raid1: use the atomic queue limit update APIs md/raid0: use the atomic queue limit update APIs md: add queue limit helpers md: add a mddev_is_dm helper md: add a mddev_add_trace_msg helper md: add a mddev_trace_remap helper bcache: move calculation of stripe_size and io_opt into bcache_device_init virtio_blk: Do not use disk_set_max_open/active_zones() aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts block: move capacity validation to blkpg_do_ioctl() block: prevent division by zero in blk_rq_stat_sum() drbd: atomically update queue limits in drbd_reconsider_queue_parameters ...
2024-03-11Merge tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linuxLinus Torvalds2-25/+29
Pull io_uring updates from Jens Axboe: - Make running of task_work internal loops more fair, and unify how the different methods deal with them (me) - Support for per-ring NAPI. The two minor networking patches are in a shared branch with netdev (Stefan) - Add support for truncate (Tony) - Export SQPOLL utilization stats (Xiaobing) - Multishot fixes (Pavel) - Fix for a race in manipulating the request flags via poll (Pavel) - Cleanup the multishot checking by making it generic, moving it out of opcode handlers (Pavel) - Various tweaks and cleanups (me, Kunwu, Alexander) * tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux: (53 commits) io_uring: Fix sqpoll utilization check racing with dying sqpoll io_uring/net: dedup io_recv_finish req completion io_uring: refactor DEFER_TASKRUN multishot checks io_uring: fix mshot io-wq checks io_uring/net: add io_req_msg_cleanup() helper io_uring/net: simplify msghd->msg_inq checking io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io io_uring/net: correctly handle multishot recvmsg retry setup io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler io_uring: fix io_queue_proc modifying req->flags io_uring: fix mshot read defer taskrun cqe posting io_uring/net: fix overflow check in io_recvmsg_mshot_prep() io_uring/net: correct the type of variable io_uring/sqpoll: statistics of the true utilization of sq threads io_uring/net: move recv/recvmsg flags out of retry loop io_uring/kbuf: flag request if buffer pool is empty after buffer pick io_uring/net: improve the usercopy for sendmsg/recvmsg io_uring/net: move receive multishot out of the generic msghdr path io_uring/net: unify how recvmsg and sendmsg copy in the msghdr ...
2024-03-11Merge tag 'vfs-6.9.uuid' of ↵Linus Torvalds10-15/+59
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs uuid updates from Christian Brauner: "This adds two new ioctl()s for getting the filesystem uuid and retrieving the sysfs path based on the path of a mounted filesystem. Getting the filesystem uuid has been implemented in filesystem specific code for a while it's now lifted as a generic ioctl" * tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: xfs: add support for FS_IOC_GETFSSYSFSPATH fs: add FS_IOC_GETFSSYSFSPATH fat: Hook up sb->s_uuid fs: FS_IOC_GETUUID ovl: convert to super_set_uuid() fs: super_set_uuid()
2024-03-11Merge tag 'vfs-6.9.super' of ↵Linus Torvalds30-272/+319
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull block handle updates from Christian Brauner: "Last cycle we changed opening of block devices, and opening a block device would return a bdev_handle. This allowed us to implement support for restricting and forbidding writes to mounted block devices. It was accompanied by converting and adding helpers to operate on bdev_handles instead of plain block devices. That was already a good step forward but ultimately it isn't necessary to have special purpose helpers for opening block devices internally that return a bdev_handle. Fundamentally, opening a block device internally should just be equivalent to opening files. So now all internal opens of block devices return files just as a userspace open would. Instead of introducing a separate indirection into bdev_open_by_*() via struct bdev_handle bdev_file_open_by_*() is made to just return a struct file. Opening and closing a block device just becomes equivalent to opening and closing a file. This all works well because internally we already have a pseudo fs for block devices and so opening block devices is simple. There's a few places where we needed to be careful such as during boot when the kernel is supposed to mount the rootfs directly without init doing it. Here we need to take care to ensure that we flush out any asynchronous file close. That's what we already do for opening, unpacking, and closing the initramfs. So nothing new here. The equivalence of opening and closing block devices to regular files is a win in and of itself. But it also has various other advantages. We can remove struct bdev_handle completely. Various low-level helpers are now private to the block layer. Other helpers were simply removable completely. A follow-up series that is already reviewed build on this and makes it possible to remove bdev->bd_inode and allows various clean ups of the buffer head code as well. All places where we stashed a bdev_handle now just stash a file and use simple accessors to get to the actual block device which was already the case for bdev_handle" * tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) block: remove bdev_handle completely block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access bdev: remove bdev pointer from struct bdev_handle bdev: make struct bdev_handle private to the block layer bdev: make bdev_{release, open_by_dev}() private to block layer bdev: remove bdev_open_by_path() reiserfs: port block device access to file ocfs2: port block device access to file nfs: port block device access to files jfs: port block device access to file f2fs: port block device access to files ext4: port block device access to file erofs: port device access to file btrfs: port device access to file bcachefs: port block device access to file target: port block device access to file s390: port block device access to file nvme: port block device access to file block2mtd: port device access to files bcache: port block device access to files ...
2024-03-11Merge tag 'vfs-6.9.file' of ↵Linus Torvalds42-868/+956
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull file locking updates from Christian Brauner: "A few years ago struct file_lock_context was added to allow for separate lists to track different types of file locks instead of using a singly-linked list for all of them. Now leases no longer need to be tracked using struct file_lock. However, a lot of the infrastructure is identical for leases and locks so separating them isn't trivial. This splits a group of fields used by both file locks and leases into a new struct file_lock_core. The new core struct is embedded in struct file_lock. Coccinelle was used to convert a lot of the callers to deal with the move, with the remaining 25% or so converted by hand. Afterwards several internal functions in fs/locks.c are made to work with struct file_lock_core. Ultimately this allows to split struct file_lock into struct file_lock and struct file_lease. The file lease APIs are then converted to take struct file_lease" * tag 'vfs-6.9.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (51 commits) filelock: fix deadlock detection in POSIX locking filelock: always define for_each_file_lock() smb: remove redundant check filelock: don't do security checks on nfsd setlease calls filelock: split leases out of struct file_lock filelock: remove temporary compatibility macros smb/server: adapt to breakup of struct file_lock smb/client: adapt to breakup of struct file_lock ocfs2: adapt to breakup of struct file_lock nfsd: adapt to breakup of struct file_lock nfs: adapt to breakup of struct file_lock lockd: adapt to breakup of struct file_lock fuse: adapt to breakup of struct file_lock gfs2: adapt to breakup of struct file_lock dlm: adapt to breakup of struct file_lock ceph: adapt to breakup of struct file_lock afs: adapt to breakup of struct file_lock 9p: adapt to breakup of struct file_lock filelock: convert seqfile handling to use file_lock_core filelock: convert locks_translate_pid to take file_lock_core ...
2024-03-11Merge tag 'vfs-6.9.pidfd' of ↵Linus Torvalds7-84/+486
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pdfd updates from Christian Brauner: - Until now pidfds could only be created for thread-group leaders but not for threads. There was no technical reason for this. We simply had no users that needed support for this. Now we do have users that need support for this. This introduces a new PIDFD_THREAD flag for pidfd_open(). If that flag is set pidfd_open() creates a pidfd that refers to a specific thread. In addition, we now allow clone() and clone3() to be called with CLONE_PIDFD | CLONE_THREAD which wasn't possible before. A pidfd that refers to an individual thread differs from a pidfd that refers to a thread-group leader: (1) Pidfds are pollable. A task may poll a pidfd and get notified when the task has exited. For thread-group leader pidfds the polling task is woken if the thread-group is empty. In other words, if the thread-group leader task exits when there are still threads alive in its thread-group the polling task will not be woken when the thread-group leader exits but rather when the last thread in the thread-group exits. For thread-specific pidfds the polling task is woken if the thread exits. (2) Passing a thread-group leader pidfd to pidfd_send_signal() will generate thread-group directed signals like kill(2) does. Passing a thread-specific pidfd to pidfd_send_signal() will generate thread-specific signals like tgkill(2) does. The default scope of the signal is thus determined by the type of the pidfd. Since use-cases exist where the default scope of the provided pidfd needs to be overriden the following flags are added to pidfd_send_signal(): - PIDFD_SIGNAL_THREAD Send a thread-specific signal. - PIDFD_SIGNAL_THREAD_GROUP Send a thread-group directed signal. - PIDFD_SIGNAL_PROCESS_GROUP Send a process-group directed signal. The scope change will only work if the struct pid is actually used for this scope. For example, in order to send a thread-group directed signal the provided pidfd must be used as a thread-group leader and similarly for PIDFD_SIGNAL_PROCESS_GROUP the struct pid must be used as a process group leader. - Move pidfds from the anonymous inode infrastructure to a tiny pseudo filesystem. This will unblock further work that we weren't able to do simply because of the very justified limitations of anonymous inodes. Moving pidfds to a tiny pseudo filesystem allows for statx on pidfds to become useful for the first time. They can now be compared by inode number which are unique for the system lifetime. Instead of stashing struct pid in file->private_data we can now stash it in inode->i_private. This makes it possible to introduce concepts that operate on a process once all file descriptors have been closed. A concrete example is kill-on-last-close. Another side-effect is that file->private_data is now freed up for per-file options for pidfds. Now, each struct pid will refer to a different inode but the same struct pid will refer to the same inode if it's opened multiple times. In contrast to now where each struct pid refers to the same inode. The tiny pseudo filesystem is not visible anywhere in userspace exactly like e.g., pipefs and sockfs. There's no lookup, there's no complex inode operations, nothing. Dentries and inodes are always deleted when the last pidfd is closed. We allocate a new inode and dentry for each struct pid and we reuse that inode and dentry for all pidfds that refer to the same struct pid. The code is entirely optional and fairly small. If it's not selected we fallback to anonymous inodes. Heavily inspired by nsfs. The dentry and inode allocation mechanism is moved into generic infrastructure that is now shared between nsfs and pidfs. The path_from_stashed() helper must be provided with a stashing location, an inode number, a mount, and the private data that is supposed to be used and it will provide a path that can be passed to dentry_open(). The helper will try retrieve an existing dentry from the provided stashing location. If a valid dentry is found it is reused. If not a new one is allocated and we try to stash it in the provided location. If this fails we retry until we either find an existing dentry or the newly allocated dentry could be stashed. Subsequent openers of the same namespace or task are then able to reuse it. - Currently it is only possible to get notified when a task has exited, i.e., become a zombie and userspace gets notified with EPOLLIN. We now also support waiting until the task has been reaped, notifying userspace with EPOLLHUP. - Ensure that ESRCH is reported for getfd if a task is exiting instead of the confusing EBADF. - Various smaller cleanups to pidfd functions. * tag 'vfs-6.9.pidfd' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (23 commits) libfs: improve path_from_stashed() libfs: add stashed_dentry_prune() libfs: improve path_from_stashed() helper pidfs: convert to path_from_stashed() helper nsfs: convert to path_from_stashed() helper libfs: add path_from_stashed() pidfd: add pidfs pidfd: move struct pidfd_fops pidfd: allow to override signal scope in pidfd_send_signal() pidfd: change pidfd_send_signal() to respect PIDFD_THREAD signal: fill in si_code in prepare_kill_siginfo() selftests: add ESRCH tests for pidfd_getfd() pidfd: getfd should always report ESRCH if a task is exiting pidfd: clone: allow CLONE_THREAD | CLONE_PIDFD together pidfd: exit: kill the no longer used thread_group_exited() pidfd: change do_notify_pidfd() to use __wake_up(poll_to_key(EPOLLIN)) pid: kill the obsolete PIDTYPE_PID code in transfer_pid() pidfd: kill the no longer needed do_notify_pidfd() in de_thread() pidfd_poll: report POLLHUP when pid_task() == NULL pidfd: implement PIDFD_THREAD flag for pidfd_open() ...
2024-03-11Merge tag 'vfs-6.9.iomap' of ↵Linus Torvalds12-331/+392
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull iomap updates from Christian Brauner: - Restore read-write hints in struct bio through the bi_write_hint member for the sake of UFS devices in mobile applications. This can result in up to 40% lower write amplification in UFS devices. The patch series that builds on this will be coming in via the SCSI maintainers (Bart) - Overhaul the iomap writeback code. Afterwards ->map_blocks() is able to map multiple blocks at once as long as they're in the same folio. This reduces CPU usage for buffered write workloads on e.g., xfs on systems with lots of cores (Christoph) - Record processed bytes in iomap_iter() trace event (Kassey) - Extend iomap_writepage_map() trace event after Christoph's ->map_block() changes to map mutliple blocks at once (Zhang) * tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits) iomap: Add processed for iomap_iter iomap: add pos and dirty_len into trace_iomap_writepage_map block, fs: Restore the per-bio/request data lifetime fields fs: Propagate write hints to the struct block_device inode fs: Move enum rw_hint into a new header file fs: Split fcntl_rw_hint() fs: Verify write lifetime constants at compile time fs: Fix rw_hint validation iomap: pass the length of the dirty region to ->map_blocks iomap: map multiple blocks at a time iomap: submit ioends immediately iomap: factor out a iomap_writepage_map_block helper iomap: only call mapping_set_error once for each failed bio iomap: don't chain bios iomap: move the iomap_sector sector calculation out of iomap_add_to_ioend iomap: clean up the iomap_alloc_ioend calling convention iomap: move all remaining per-folio logic into iomap_writepage_map iomap: factor out a iomap_writepage_handle_eof helper iomap: move the PF_MEMALLOC check to iomap_writepages iomap: move the io_folios field out of struct iomap_ioend ...
2024-03-11Merge tag 'vfs-6.9.ntfs' of ↵Linus Torvalds48-28826/+0
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull ntfs update from Christian Brauner: "This removes the old ntfs driver. The new ntfs3 driver is a full replacement that was merged over two years ago. We've went through various userspace and either they use ntfs3 or they use the fuse version of ntfs and thus build neither ntfs nor ntfs3. I think that's a clear sign that we should risk removing the legacy ntfs driver. Quoting from Arch Linux and Debian: - Debian does neither build the legacy ntfs nor the new ntfs3: "Not currently built with Debian's kernel packages, 'ntfs' has been symlinked to 'ntfs-3g' as it relates to fstab and mount commands. Debian kernels are built without support of the ntfs3 driver developed by Paragon Software." (cf. [2]) - Archlinux provides ntfs3 as their default since 5.15: "All officially supported kernels with versions 5.15 or newer are built with CONFIG_NTFS3_FS=m and thus support it. Before 5.15, NTFS read and write support is provided by the NTFS-3G FUSE file system." (cf. [1]). It's unmaintained apart from various odd fixes as well. Worst case we have to reintroduce it if someone really has a valid dependency on it. But it's worth trying to see whether we can remove it" Link: https://wiki.archlinux.org/title/NTFS [1] Link: https://wiki.debian.org/NTFS [2] * tag 'vfs-6.9.ntfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: remove NTFS classic from docum. index fs: Remove NTFS classic
2024-03-11Merge tag 'vfs-6.9.misc' of ↵Linus Torvalds45-390/+494
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Misc features, cleanups, and fixes for vfs and individual filesystems. Features: - Support idmapped mounts for hugetlbfs. - Add RWF_NOAPPEND flag for pwritev2(). This allows us to fix a bug where the passed offset is ignored if the file is O_APPEND. The new flag allows a caller to enforce that the offset is honored to conform to posix even if the file was opened in append mode. - Move i_mmap_rwsem in struct address_space to avoid false sharing between i_mmap and i_mmap_rwsem. - Convert efs, qnx4, and coda to use the new mount api. - Add a generic is_dot_dotdot() helper that's used by various filesystems and the VFS code instead of open-coding it multiple times. - Recently we've added stable offsets which allows stable ordering when iterating directories exported through NFS on e.g., tmpfs filesystems. Originally an xarray was used for the offset map but that caused slab fragmentation issues over time. This switches the offset map to the maple tree which has a dense mode that handles this scenario a lot better. Includes tests. - Finally merge the case-insensitive improvement series Gabriel has been working on for a long time. This cleanly propagates case insensitive operations through ->s_d_op which in turn allows us to remove the quite ugly generic_set_encrypted_ci_d_ops() operations. It also improves performance by trying a case-sensitive comparison first and then fallback to case-insensitive lookup if that fails. This also fixes a bug where overlayfs would be able to be mounted over a case insensitive directory which would lead to all sort of odd behaviors. Cleanups: - Make file_dentry() a simple accessor now that ->d_real() is simplified because of the backing file work we did the last two cycles. - Use the dedicated file_mnt_idmap helper in ntfs3. - Use smp_load_acquire/store_release() in the i_size_read/write helpers and thus remove the hack to handle i_size reads in the filemap code. - The SLAB_MEM_SPREAD is a nop now. Remove it from various places in fs/ - It's no longer necessary to perform a second built-in initramfs unpack call because we retain the contents of the previous extraction. Remove it. - Now that we have removed various allocators kfree_rcu() always works with kmem caches and kmalloc(). So simplify various places that only use an rcu callback in order to handle the kmem cache case. - Convert the pipe code to use a lockdep comparison function instead of open-coding the nesting making lockdep validation easier. - Move code into fs-writeback.c that was located in a header but can be made static as it's only used in that one file. - Rewrite the alignment checking iterators for iovec and bvec to be easier to read, and also significantly more compact in terms of generated code. This saves 270 bytes of text on x86-64 (with clang-18) and 224 bytes on arm64 (with gcc-13). In profiles it also saves a bit of time for the same workload. - Switch various places to use KMEM_CACHE instead of kmem_cache_create(). - Use inode_set_ctime_to_ts() in inode_set_ctime_current() - Use kzalloc() in name_to_handle_at() to avoid kernel infoleak. - Various smaller cleanups for eventfds. Fixes: - Fix various comments and typos, and unneeded initializations. - Fix stack allocation hack for clang in the select code. - Improve dump_mapping() debug code on a best-effort basis. - Fix build errors in various selftests. - Avoid wrap-around instrumentation in various places. - Don't allow user namespaces without an idmapping to be used for idmapped mounts. - Fix sysv sb_read() call. - Fix fallback implementation of the get_name() export operation" * tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (70 commits) hugetlbfs: support idmapped mounts qnx4: convert qnx4 to use the new mount api fs: use inode_set_ctime_to_ts to set inode ctime to current time libfs: Drop generic_set_encrypted_ci_d_ops ubifs: Configure dentry operations at dentry-creation time f2fs: Configure dentry operations at dentry-creation time ext4: Configure dentry operations at dentry-creation time libfs: Add helper to choose dentry operations at mount-time libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops fscrypt: Drop d_revalidate once the key is added fscrypt: Drop d_revalidate for valid dentries during lookup fscrypt: Factor out a helper to configure the lookup dentry ovl: Always reject mounting over case-insensitive directories libfs: Attempt exact-match comparison first during casefolded lookup efs: remove SLAB_MEM_SPREAD flag usage jfs: remove SLAB_MEM_SPREAD flag usage minix: remove SLAB_MEM_SPREAD flag usage openpromfs: remove SLAB_MEM_SPREAD flag usage proc: remove SLAB_MEM_SPREAD flag usage qnx6: remove SLAB_MEM_SPREAD flag usage ...
2024-03-10erofs: support compressed inodes over fscacheJingbo Xu4-20/+77
Since fscache can utilize iov_iter to write dest buffers, bio_vec can be used in this way too. To simplify this, pseudo bios are prepared and bio_vec will be filled with bio_add_page(). And a common .bi_end_io will be called directly to handle I/O completions. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240308094159.40547-2-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-03-10erofs: make iov_iter describe target buffers over fscacheJingbo Xu1-112/+123
So far the fscache mode supports uncompressed data only, and the data read from fscache is put directly into the target page cache. As the support for compressed data in fscache mode is going to be introduced, rework the fscache internals so that the following compressed part could make the raw data read from fscache be directed to the target buffer it wants, decompress the raw data, and finally fill the page cache with the decompressed data. As the first step, a new structure, i.e. erofs_fscache_io (io), is introduced to describe a generic read request from the fscache, while the caller can specify the target buffer it wants in the iov_iter structure (io->iter). Besides, the caller can also specify its completion callback and private data through erofs_fscache_io, which will be called to make further handling, e.g. unlocking the page cache for uncompressed data or decompressing the read raw data, when the read request from the fscache completes. Now erofs_fscache_read_io_async() serves as a generic interface for reading raw data from fscache for both compressed and uncompressed data. The erofs_fscache_rq structure is kept to describe a request to fill the page cache in the specified range. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240308094159.40547-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-03-10erofs: fix lockdep false positives on initializing erofs_pseudo_mntBaokun Li3-31/+15
Lockdep reported the following issue when mounting erofs with a domain_id: ============================================ WARNING: possible recursive locking detected 6.8.0-rc7-xfstests #521 Not tainted -------------------------------------------- mount/396 is trying to acquire lock: ffff907a8aaaa0e0 (&type->s_umount_key#50/1){+.+.}-{3:3}, at: alloc_super+0xe3/0x3d0 but task is already holding lock: ffff907a8aaa90e0 (&type->s_umount_key#50/1){+.+.}-{3:3}, at: alloc_super+0xe3/0x3d0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&type->s_umount_key#50/1); lock(&type->s_umount_key#50/1); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by mount/396: #0: ffff907a8aaa90e0 (&type->s_umount_key#50/1){+.+.}-{3:3}, at: alloc_super+0xe3/0x3d0 #1: ffffffffc00e6f28 (erofs_domain_list_lock){+.+.}-{3:3}, at: erofs_fscache_register_fs+0x3d/0x270 [erofs] stack backtrace: CPU: 1 PID: 396 Comm: mount Not tainted 6.8.0-rc7-xfstests #521 Call Trace: <TASK> dump_stack_lvl+0x64/0xb0 validate_chain+0x5c4/0xa00 __lock_acquire+0x6a9/0xd50 lock_acquire+0xcd/0x2b0 down_write_nested+0x45/0xd0 alloc_super+0xe3/0x3d0 sget_fc+0x62/0x2f0 vfs_get_super+0x21/0x90 vfs_get_tree+0x2c/0xf0 fc_mount+0x12/0x40 vfs_kern_mount.part.0+0x75/0x90 kern_mount+0x24/0x40 erofs_fscache_register_fs+0x1ef/0x270 [erofs] erofs_fc_fill_super+0x213/0x380 [erofs] This is because the file_system_type of both erofs and the pseudo-mount point of domain_id is erofs_fs_type, so two successive calls to alloc_super() are considered to be using the same lock and trigger the warning above. Therefore add a nodev file_system_type called erofs_anon_fs_type in fscache.c to silence this complaint. Because kern_mount() takes a pointer to struct file_system_type, not its (string) name. So we don't need to call register_filesystem(). In addition, call init_pseudo() in erofs_anon_init_fs_context() as suggested by Al Viro, so that we can remove erofs_fc_fill_pseudo_super(), erofs_fc_anon_get_tree(), and erofs_anon_context_ops. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: a9849560c55e ("erofs: introduce a pseudo mnt to manage shared cookies") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-and-tested-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Yang Erkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20240307101018.2021925-1-libaokun1@huawei.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-03-10erofs: refine managed cache operations to foliosGao Xiang6-48/+34
Convert erofs_try_to_free_all_cached_pages() and z_erofs_cache_release_folio(). Besides, erofs_page_is_managed() is moved to zdata.c and renamed as erofs_folio_is_managed(). Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-6-hsiangkao@linux.alibaba.com
2024-03-10erofs: convert z_erofs_submissionqueue_endio() to foliosGao Xiang1-11/+11
Use bio_for_each_folio() to iterate over each folio in the bio and there is no large folios for now. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-5-hsiangkao@linux.alibaba.com
2024-03-10erofs: convert z_erofs_fill_bio_vec() to foliosGao Xiang1-35/+36
Introduce a folio member to `struct z_erofs_bvec` and convert most of z_erofs_fill_bio_vec() to folios, which is still straight-forward. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-4-hsiangkao@linux.alibaba.com
2024-03-10erofs: get rid of `justfound` debugging tagGao Xiang1-17/+3
`justfound` is introduced to identify cached folios that are just added to compressed bvecs so that more checks can be applied in the I/O submission path. EROFS is quite now stable compared to the codebase at that stage. `justfound` becomes a burden for upcoming features. Drop it. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-3-hsiangkao@linux.alibaba.com
2024-03-10erofs: convert z_erofs_do_read_page() to foliosGao Xiang1-16/+15
It is a straight-forward conversion. Besides, it's renamed as z_erofs_scan_folio(). Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-2-hsiangkao@linux.alibaba.com
2024-03-10erofs: convert z_erofs_onlinepage_.* to foliosGao Xiang1-28/+22
Online folios are locked file-backed folios which will eventually keep decoded (e.g. decompressed) data of each inode for end users to utilize. It may belong to a few pclusters and contain other data (e.g. compressed data for inplace I/Os) temporarily in a time-sharing manner to reduce memory footprints for low-ended storage devices with high latencies under heary I/O pressure. Apart from folio_end_read() usage, it's a straight-forward conversion. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-1-hsiangkao@linux.alibaba.com
2024-03-10exec: Simplify remove_arg_zero() error pathKees Cook1-7/+3
We don't need the "out" label any more, so remove "ret" and return directly on error. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Kees Cook <keescook@chromium.org> --- Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: linux-mm@kvack.org Cc: linux-fsdevel@vger.kernel.org
2024-03-09pstore/zone: Don't clear memory twiceChristophe JAILLET1-1/+0
There is no need to call memset(..., 0, ...) on memory allocated by kcalloc(). It is already zeroed. Remove the redundant call. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/fa2597400051c18c6ca11187b0e4b906729991b2.1709972649.git.christophe.jaillet@wanadoo.fr Signed-off-by: Kees Cook <keescook@chromium.org>
2024-03-09NFSD: Clean up nfsd4_encode_replay()Chuck Lever2-16/+31
Replace open-coded encoding logic with the use of conventional XDR utility functions. Add a tracepoint to make replays observable in field troubleshooting situations. The WARN_ON is removed. A stack trace is of little use, as there is only one call site for nfsd4_encode_replay(), and a buffer length shortage here is unlikely. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski16-78/+236
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: net/core/page_pool_user.c 0b11b1c5c320 ("netdev: let netlink core handle -EMSGSIZE errors") 429679dcf7d9 ("page_pool: fix netlink dump stop/resume") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07Merge tag 'erofs-for-6.8-fixes' of ↵Linus Torvalds2-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "The main one is a KMSAN fix which addresses an issue introduced in this cycle so it'd be much better to fix before releasing, and the remaining one fixes VMA alignment for THP. Summary: - Fix a KMSAN uninit-value issue triggered by a crafted image - Fix VMA alignment for memory mapped files on THP" * tag 'erofs-for-6.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: apply proper VMA alignment for memory mapped files on THP erofs: fix uninitialized page cache reported by KMSAN
2024-03-07Merge tag 'for-next-6.9' of ↵Christian Brauner9-78/+59
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/krisman/unicode into vfs.misc Merge case-insensitive updates from Gabriel Krisman Bertazi: - Patch case-insensitive lookup by trying the case-exact comparison first, before falling back to costly utf8 casefolded comparison. - Fix to forbid using a case-insensitive directory as part of an overlayfs mount. - Patchset to ensure d_op are set at d_alloc time for fscrypt and casefold volumes, ensuring filesystem dentries will all have the correct ops, whether they come from a lookup or not. * tag 'for-next-6.9' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/krisman/unicode: libfs: Drop generic_set_encrypted_ci_d_ops ubifs: Configure dentry operations at dentry-creation time f2fs: Configure dentry operations at dentry-creation time ext4: Configure dentry operations at dentry-creation time libfs: Add helper to choose dentry operations at mount-time libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops fscrypt: Drop d_revalidate once the key is added fscrypt: Drop d_revalidate for valid dentries during lookup fscrypt: Factor out a helper to configure the lookup dentry ovl: Always reject mounting over case-insensitive directories libfs: Attempt exact-match comparison first during casefolded lookup Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-07erofs: apply proper VMA alignment for memory mapped files on THPGao Xiang1-0/+1
There are mainly two reasons that thp_get_unmapped_area() should be used for EROFS as other filesystems: - It's needed to enable PMD mappings as a FSDAX filesystem, see commit 74d2fad1334d ("thp, dax: add thp_get_unmapped_area for pmd mappings"); - It's useful together with large folios and CONFIG_READ_ONLY_THP_FOR_FS which enable THPs for mmapped files (e.g. shared libraries) even without FSDAX. See commit 1854bc6e2420 ("mm/readahead: Align file mappings for non-DAX"). Fixes: 06252e9ce05b ("erofs: dax support for non-tailpacking regular file") Fixes: ce529cc25b18 ("erofs: enable large folios for iomap mode") Fixes: e6687b89225e ("erofs: enable large folios for fscache mode") Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240306053138.2240206-1-hsiangkao@linux.alibaba.com
2024-03-07erofs: fix uninitialized page cache reported by KMSANGao Xiang1-1/+2
syzbot reports a KMSAN reproducer [1] which generates a crafted filesystem image and causes IMA to read uninitialized page cache. Later, (rq->outputsize > rq->inputsize) will be formally supported after either large uncompressed pclusters (> block size) or big lclusters are landed. However, currently there is no way to generate such filesystems by using mkfs.erofs. Thus, let's mark this condition as unsupported for now. [1] https://lore.kernel.org/r/0000000000002be12a0611ca7ff8@google.com Reported-and-tested-by: syzbot+7bc44a489f0ef0670bd5@syzkaller.appspotmail.com Fixes: 1ca01520148a ("erofs: refine z_erofs_transform_plain() for sub-page block support") Reviewed-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240304035339.425857-1-hsiangkao@linux.alibaba.com
2024-03-06iov_iter: get rid of 'copy_mc' flagLinus Torvalds1-3/+42
This flag is only set by one single user: the magical core dumping code that looks up user pages one by one, and then writes them out using their kernel addresses (by using a BVEC_ITER). That actually ends up being a huge problem, because while we do use copy_mc_to_kernel() for this case and it is able to handle the possible machine checks involved, nothing else is really ready to handle the failures caused by the machine check. In particular, as reported by Tong Tiangen, we don't actually support fault_in_iov_iter_readable() on a machine check area. As a result, the usual logic for writing things to a file under a filesystem lock, which involves doing a copy with page faults disabled and then if that fails trying to fault pages in without holding the locks with fault_in_iov_iter_readable() does not work at all. We could decide to always just make the MC copy "succeed" (and filling the destination with zeroes), and that would then create a core dump file that just ignores any machine checks. But honestly, this single special case has been problematic before, and means that all the normal iov_iter code ends up slightly more complex and slower. See for example commit c9eec08bac96 ("iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()") where David Howells re-organized the code just to avoid having to check the 'copy_mc' flags inside the inner iov_iter loops. So considering that we have exactly one user, and that one user is a non-critical special case that doesn't actually ever trigger in real life (Tong found this with manual error injection), the sane solution is to just decide that the onus on handling the machine check lines on that user instead. Ergo, do the copy_mc_to_kernel() in the core dump logic itself, copying the user data to a stable kernel page before writing it out. Fixes: f1982740f5e7 ("iov_iter: Convert iterate*() to inline funcs") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Tong Tiangen <tongtiangen@huawei.com> Link: https://lore.kernel.org/r/20240305133336.3804360-1-tongtiangen@huawei.com Link: https://lore.kernel.org/all/4e80924d-9c85-f13a-722a-6a5d2b1c225a@huawei.com/ Tested-by: David Howells <dhowells@redhat.com> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reported-by: Tong Tiangen <tongtiangen@huawei.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-05btrfs: reuse cloned extent buffer during fiemap to avoid re-allocationsFilipe Manana1-8/+24
During fiemap we may have to visit multiple leaves of the subvolume's inode tree, and each time we are freeing and allocating an extent buffer to use as a clone of each visited leaf. Optimize this by reusing cloned extent buffers, to avoid the freeing and re-allocation both of the extent buffer structure itself and more importantly of the pages attached to the extent buffer. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: fix race when detecting delalloc ranges during fiemapFilipe Manana1-61/+160
For fiemap we recently stopped locking the target extent range for the whole duration of the fiemap call, in order to avoid a deadlock in a scenario where the fiemap buffer happens to be a memory mapped range of the same file. This use case is very unlikely to be useful in practice but it may be triggered by fuzz testing (syzbot, etc). This however introduced a race that makes us miss delalloc ranges for file regions that are currently holes, so the caller of fiemap will not be aware that there's data for some file regions. This can be quite serious for some use cases - for example in coreutils versions before 9.0, the cp program used fiemap to detect holes and data in the source file, copying only regions with data (extents or delalloc) from the source file to the destination file in order to preserve holes (see the documentation for its --sparse command line option). This means that if cp was used with a source file that had delalloc in a hole, the destination file could end up without that data, which is effectively a data loss issue, if it happened to hit the race described below. The race happens like this: 1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that has delalloc in the file range [64M, 65M[, which is currently a hole; 2) Fiemap locks the inode in shared mode, then starts iterating the inode's subvolume tree searching for file extent items, without having the whole fiemap target range locked in the inode's io tree - the change introduced recently by commit b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking"). It only locks ranges in the io tree when it finds a hole or prealloc extent since that commit; 3) Note that fiemap clones each leaf before using it, and this is to avoid deadlocks when locking a file range in the inode's io tree and the fiemap buffer is memory mapped to some file, because writing to the page with btrfs_page_mkwrite() will wait on any ordered extent for the page's range and the ordered extent needs to lock the range and may need to modify the same leaf, therefore leading to a deadlock on the leaf; 4) While iterating the file extent items in the cloned leaf before finding the hole in the range [64M, 65M[, the delalloc in that range is flushed and its ordered extent completes - meaning the corresponding file extent item is in the inode's subvolume tree, but not present in the cloned leaf that fiemap is iterating over; 5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in the cloned leaf (or a file extent item with disk_bytenr == 0 in case the NO_HOLES feature is not enabled), it will lock that file range in the inode's io tree and then search for delalloc by checking for the EXTENT_DELALLOC bit in the io tree for that range and ordered extents (with btrfs_find_delalloc_in_range()). But it finds nothing since the delalloc in that range was already flushed and the ordered extent completed and is gone - as a result fiemap will not report that there's delalloc or an extent for the range [64M, 65M[, so user space will be mislead into thinking that there's a hole in that range. This could actually be sporadically triggered with test case generic/094 from fstests, which reports a missing extent/delalloc range like this: generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad) --- tests/generic/094.out 2020-06-10 19:29:03.830519425 +0100 +++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad 2024-02-28 11:00:00.381071525 +0000 @@ -1,3 +1,9 @@ QA output created by 094 fiemap run with sync fiemap run without sync +ERROR: couldn't find extent at 7 +map is 'HHDDHPPDPHPH' +logical: [ 5.. 6] phys: 301517.. 301518 flags: 0x800 tot: 2 +logical: [ 8.. 8] phys: 301520.. 301520 flags: 0x800 tot: 1 ... (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad' to see the entire diff) So in order to fix this, while still avoiding deadlocks in the case where the fiemap buffer is memory mapped to the same file, change fiemap to work like the following: 1) Always lock the whole range in the inode's io tree before starting to iterate the inode's subvolume tree searching for file extent items, just like we did before commit b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking"); 2) Now instead of writing to the fiemap buffer every time we have an extent to report, write instead to a temporary buffer (1 page), and when that buffer becomes full, stop iterating the file extent items, unlock the range in the io tree, release the search path, submit all the entries kept in that buffer to the fiemap buffer, and then resume the search for file extent items after locking again the remainder of the range in the io tree. The buffer having a size of a page, allows for 146 entries in a system with 4K pages. This is a large enough value to have a good performance by avoiding too many restarts of the search for file extent items. In other words this preserves the huge performance gains made in the last two years to fiemap, while avoiding the deadlocks in case the fiemap buffer is memory mapped to the same file (useless in practice, but possible and exercised by fuzz testing and syzbot). Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: fix off-by-one chunk length calculation at contains_pending_extent()Filipe Manana1-1/+1
At contains_pending_extent() the value of the end offset of a chunk we found in the device's allocation state io tree is inclusive, so when we calculate the length we pass to the in_range() macro, we must sum 1 to the expression "physical_end - physical_offset". In practice the wrong calculation should be harmless as chunks sizes are never 1 byte and we should never have 1 byte ranges of unallocated space. Nevertheless fix the wrong calculation. Reported-by: Alex Lyakas <alex.lyakas@zadara.com> Link: https://lore.kernel.org/linux-btrfs/CAOcd+r30e-f4R-5x-S7sV22RJPe7+pgwherA6xqN2_qe7o4XTg@mail.gmail.com/ Fixes: 1c11b63eff2a ("btrfs: replace pending/pinned chunks lists with io tree") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: qgroup: allow quick inherit if snapshot is created and added to the ↵Qu Wenruo1-7/+72
same parent Currently "btrfs subvolume snapshot -i <qgroupid>" would always mark the qgroup inconsistent. This can be annoying if the fs has a lot of snapshots, and needs qgroup to get the accounting for the amount of bytes it can free for each snapshot. Although we have the new simple quote as a solution, there is also a case where we can skip the full scan, if all the following conditions are met: - The source subvolume belongs to a higher level parent qgroup - The parent qgroup already owns all its bytes exclusively - The new snapshot is also added to the same parent qgroup In that case, we only need to add nodesize to the parent qgroup and avoid a full rescan. This patch would add the extra quick accounting update for such inherit. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: qgroup: validate btrfs_qgroup_inherit parameterQu Wenruo3-13/+57
[BUG] Currently btrfs can create subvolume with an invalid qgroup inherit without triggering any error: # mkfs.btrfs -O quota -f $dev # mount $dev $mnt # btrfs subvolume create -i 2/0 $mnt/subv1 # btrfs qgroup show -prce --sync $mnt Qgroupid Referenced Exclusive Path -------- ---------- --------- ---- 0/5 16.00KiB 16.00KiB <toplevel> 0/256 16.00KiB 16.00KiB subv1 [CAUSE] We only do a very basic size check for btrfs_qgroup_inherit structure, but never really verify if the values are correct. Thus in btrfs_qgroup_inherit() function, we have to skip non-existing qgroups, and never return any error. [FIX] Fix the behavior and introduce extra checks: - Introduce early check for btrfs_qgroup_inherit structure Not only the size, but also all the qgroup ids would be verified. And the timing is very early, so we can return error early. This early check is very important for snapshot creation, as snapshot is delayed to transaction commit. - Drop support for btrfs_qgroup_inherit::num_ref_copies and num_excl_copies Those two members are used to specify to copy refr/excl numbers from other qgroups. This would definitely mark qgroup inconsistent, and btrfs-progs has dropped the support for them for a long time. It's time to drop the support for kernel. - Verify the supported btrfs_qgroup_inherit::flags Just in case we want to add extra flags for btrfs_qgroup_inherit. Now above subvolume creation would fail with -ENOENT other than silently ignore the non-existing qgroup. CC: stable@vger.kernel.org # 6.7+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: include device major and minor numbers in the device scan noticeAnand Jain1-7/+12
To better debug issues surrounding device scans, include the device's major and minor numbers in the device scan notice for btrfs. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: mark btrfs_put_caching_control() staticLijuan Li2-2/+1
btrfs_put_caching_control() is only used in block-group.c, so mark it static. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Lijuan Li <lilijuan@iscas.ac.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: remove SLAB_MEM_SPREAD flag useChengming Zhou12-29/+18
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1, so it became a dead flag since the commit 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the series[1] went on to mark it obsolete to avoid confusion for users. Here we can just remove all its users, which has no functional change. [1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: qgroup: always free reserved space for extent recordsQu Wenruo1-5/+5
[BUG] If qgroup is marked inconsistent (e.g. caused by operations needing full subtree rescan, like creating a snapshot and assign to a higher level qgroup), btrfs would immediately start leaking its data reserved space. The following script can easily reproduce it: mkfs.btrfs -O quota -f $dev mount $dev $mnt btrfs subvolume create $mnt/subv1 btrfs qgroup create 1/0 $mnt # This snapshot creation would mark qgroup inconsistent, # as the ownership involves different higher level qgroup, thus # we have to rescan both source and snapshot, which can be very # time consuming, thus here btrfs just choose to mark qgroup # inconsistent, and let users to determine when to do the rescan. btrfs subv snapshot -i 1/0 $mnt/subv1 $mnt/snap1 # Now this write would lead to qgroup rsv leak. xfs_io -f -c "pwrite 0 64k" $mnt/file1 # And at unmount time, btrfs would report 64K DATA rsv space leaked. umount $mnt And we would have the following dmesg output for the unmount: BTRFS info (device dm-1): last unmount of filesystem 14a3d84e-f47b-4f72-b053-a8a36eef74d3 BTRFS warning (device dm-1): qgroup 0/5 has unreleased space, type 0 rsv 65536 [CAUSE] Since commit e15e9f43c7ca ("btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting"), we introduce a mode for btrfs qgroup to skip the timing consuming backref walk, if the qgroup is already inconsistent. But this skip also covered the data reserved freeing, thus the qgroup reserved space for each newly created data extent would not be freed, thus cause the leakage. [FIX] Make the data extent reserved space freeing mandatory. The qgroup reserved space handling is way cheaper compared to the backref walking part, and we always have the super sensitive leak detector, thus it's definitely worth to always free the qgroup reserved data space. Reported-by: Fabian Vogt <fvogt@suse.com> Fixes: e15e9f43c7ca ("btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting") CC: stable@vger.kernel.org # 6.1+ Link: https://bugzilla.suse.com/show_bug.cgi?id=1216196 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>