summaryrefslogtreecommitdiff
path: root/fs/xfs
AgeCommit message (Collapse)AuthorFilesLines
2023-08-30Merge tag 'mm-stable-2023-08-28-18-26' of ↵Linus Torvalds2-26/+18
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages. - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()"). - Sergey Senozhatsky has improved te performance of zsmalloc during compaction (zsmalloc: small compaction improvements"). - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program"). - xu xin has doe some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages"). - Jeff Xu has fixes the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED"). - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"). - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD"). - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check"). - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup"). - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU"). - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages"). - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check"). - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio"). - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext"). - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way"). - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration"). - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree"). - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade"). - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64"). - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction"). - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock"). - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64"). - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header"). - Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups"). - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan"). - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()"). - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets"). - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction"). - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy"). - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()"). - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64"). - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype"). - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page"). - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec"). - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h"). - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output"). - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized"). - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order"). - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults"). - pagetable handling cleanups from Matthew Wilcox ("New page table range API"). - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups"). - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault"). - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation"). * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits) maple_tree: shrink struct maple_tree maple_tree: clean up mas_wr_append() secretmem: convert page_is_secretmem() to folio_is_secretmem() nios2: fix flush_dcache_page() for usage from irq context hugetlb: add documentation for vma_kernel_pagesize() mm: add orphaned kernel-doc to the rst files. mm: fix clean_record_shared_mapping_range kernel-doc mm: fix get_mctgt_type() kernel-doc mm: fix kernel-doc warning from tlb_flush_rmaps() mm: remove enum page_entry_size mm: allow ->huge_fault() to be called without the mmap_lock held mm: move PMD_ORDER to pgtable.h mm: remove checks for pte_index memcg: remove duplication detection for mem_cgroup_uncharge_swap mm/huge_memory: work on folio->swap instead of page->private when splitting folio mm/swap: inline folio_set_swap_entry() and folio_swap_entry() mm/swap: use dedicated entry for swap in folio mm/swap: stop using page->private on tail pages for THP_SWAP selftests/mm: fix WARNING comparing pointer to 0 selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check ...
2023-08-28Merge tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-1/+1
Pull iomap updates from Darrick Wong: "We've got some big changes for this release -- I'm very happy to be landing willy's work to enable large folios for the page cache for general read and write IOs when the fs can make contiguous space allocations, and Ritesh's work to track sub-folio dirty state to eliminate the write amplification problems inherent in using large folios. As a bonus, io_uring can now process write completions in the caller's context instead of bouncing through a workqueue, which should reduce io latency dramatically. IOWs, XFS should see a nice performance bump for both IO paths. Summary: - Make large writes to the page cache fill sparse parts of the cache with large folios, then use large memcpy calls for the large folio. - Track the per-block dirty state of each large folio so that a buffered write to a single byte on a large folio does not result in a (potentially) multi-megabyte writeback IO. - Allow some directio completions to be performed in the initiating task's context instead of punting through a workqueue. This will reduce latency for some io_uring requests" * tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits) iomap: support IOCB_DIO_CALLER_COMP io_uring/rw: add write support for IOCB_DIO_CALLER_COMP fs: add IOCB flags related to passing back dio completions iomap: add IOMAP_DIO_INLINE_COMP iomap: only set iocb->private for polled bio iomap: treat a write through cache the same as FUA iomap: use an unsigned type for IOMAP_DIO_* defines iomap: cleanup up iomap_dio_bio_end_io() iomap: Add per-block dirty state tracking to improve performance iomap: Allocate ifs in ->write_begin() early iomap: Refactor iomap_write_delalloc_punch() function out iomap: Use iomap_punch_t typedef iomap: Fix possible overflow condition in iomap_write_delalloc_scan iomap: Add some uptodate state handling helpers for ifs state bitmap iomap: Drop ifs argument from iomap_set_range_uptodate() iomap: Rename iomap_page to iomap_folio_state and others iomap: Copy larger chunks from userspace iomap: Create large folios in the buffered write path filemap: Allow __filemap_get_folio to allocate large folios filemap: Add fgf_t typedef ...
2023-08-28Merge tag 'v6.6-vfs.super' of ↵Linus Torvalds6-96/+268
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull superblock updates from Christian Brauner: "This contains the super rework that was ready for this cycle. The first part changes the order of how we open block devices and allocate superblocks, contains various cleanups, simplifications, and a new mechanism to wait on superblock state changes. This unblocks work to ultimately limit the number of writers to a block device. Jan has already scheduled follow-up work that will be ready for v6.7 and allows us to restrict the number of writers to a given block device. That series builds on this work right here. The second part contains filesystem freezing updates. Overview: The generic superblock changes are rougly organized as follows (ignoring additional minor cleanups): (1) Removal of the bd_super member from struct block_device. This was a very odd back pointer to struct super_block with unclear rules. For all relevant places we have other means to get the same information so just get rid of this. (2) Simplify rules for superblock cleanup. Roughly, everything that is allocated during fs_context initialization and that's stored in fs_context->s_fs_info needs to be cleaned up by the fs_context->free() implementation before the superblock allocation function has been called successfully. After sget_fc() returned fs_context->s_fs_info has been transferred to sb->s_fs_info at which point sb->kill_sb() if fully responsible for cleanup. Adhering to these rules means that cleanup of sb->s_fs_info in fill_super() is to be avoided as it's brittle and inconsistent. Cleanup shouldn't be duplicated between sb->put_super() as sb->put_super() is only called if sb->s_root has been set aka when the filesystem has been successfully born (SB_BORN). That complexity should be avoided. This also means that block devices are to be closed in sb->kill_sb() instead of sb->put_super(). More details in the lower section. (3) Make it possible to lookup or create a superblock before opening block devices There's a subtle dependency on (2) as some filesystems did rely on fill_super() to be called in order to correctly clean up sb->s_fs_info. All these filesystems have been fixed. (4) Switch most filesystem to follow the same logic as the generic mount code now does as outlined in (3). (5) Use the superblock as the holder of the block device. We can now easily go back from block device to owning superblock. (6) Export and extend the generic fs_holder_ops and use them as holder ops everywhere and remove the filesystem specific holder ops. (7) Call from the block layer up into the filesystem layer when the block device is removed, allowing to shut down the filesystem without risk of deadlocks. (8) Get rid of get_super(). We can now easily go back from the block device to owning superblock and can call up from the block layer into the filesystem layer when the device is removed. So no need to wade through all registered superblock to find the owning superblock anymore" Link: https://lore.kernel.org/lkml/20230824-prall-intakt-95dbffdee4a0@brauner/ * tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (47 commits) super: use higher-level helper for {freeze,thaw} super: wait until we passed kill super super: wait for nascent superblocks super: make locking naming consistent super: use locking helpers fs: simplify invalidate_inodes fs: remove get_super block: call into the file system for ioctl BLKFLSBUF block: call into the file system for bdev_mark_dead block: consolidate __invalidate_device and fsync_bdev block: drop the "busy inodes on changed media" log message dasd: also call __invalidate_device when setting the device offline amiflop: don't call fsync_bdev in FDFMTBEG floppy: call disk_force_media_change when changing the format block: simplify the disk_force_media_change interface nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl xfs use fs_holder_ops for the log and RT devices xfs: drop s_umount over opening the log and RT devices ext4: use fs_holder_ops for the log device ext4: drop s_umount over opening the log device ...
2023-08-28Merge tag 'v6.6-vfs.ctime' of ↵Linus Torvalds9-24/+31
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs timestamp updates from Christian Brauner: "This adds VFS support for multi-grain timestamps and converts tmpfs, xfs, ext4, and btrfs to use them. This carries acks from all relevant filesystems. The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot of metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g., backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This introduces fine-grained timestamps that are used when they are actively queried. This uses the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one. As POSIX generally mandates that when the mtime changes, the ctime must also change the kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used. Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps. Various preparatory changes, fixes and cleanups are included: - Fixup all relevant places where POSIX requires updating ctime together with mtime. This is a wide-range of places and all maintainers provided necessary Acks. - Add new accessors for inode->i_ctime directly and change all callers to rely on them. Plain accesses to inode->i_ctime are now gone and it is accordingly rename to inode->__i_ctime and commented as requiring accessors. - Extend generic_fillattr() to pass in a request mask mirroring in a sense the statx() uapi. This allows callers to pass in a request mask to only get a subset of attributes filled in. - Rework timestamp updates so it's possible to drop the @now parameter the update_time() inode operation and associated helpers. - Add inode_update_timestamps() and convert all filesystems to it removing a bunch of open-coding" * tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits) btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps tmpfs: add support for multigrain timestamps fs: add infrastructure for multigrain timestamps fs: drop the timespec64 argument from update_time xfs: have xfs_vn_update_time gets its own timestamp fat: make fat_update_time get its own timestamp fat: remove i_version handling from fat_update_time ubifs: have ubifs_update_time use inode_update_timestamps btrfs: have it use inode_update_timestamps fs: drop the timespec64 arg from generic_update_time fs: pass the request_mask to generic_fillattr fs: remove silly warning from current_time gfs2: fix timestamp handling on quota inodes fs: rename i_ctime field to __i_ctime selinux: convert to ctime accessor functions security: convert to ctime accessor functions apparmor: convert to ctime accessor functions sunrpc: convert to ctime accessor functions ...
2023-08-25mm: remove enum page_entry_sizeMatthew Wilcox (Oracle)2-26/+18
Remove the unnecessary encoding of page order into an enum and pass the page order directly. That lets us get rid of pe_order(). The switch constructs have to be changed to if/else constructs to prevent GCC from warning on builds with 3-level page tables where PMD_ORDER and PUD_ORDER have the same value. If you are looking at this commit because your driver stopped compiling, look at the previous commit as well and audit your driver to be sure it doesn't depend on mmap_lock being held in its ->huge_fault method. [willy@infradead.org: use "order %u" to match the (non dev_t) style] Link: https://lkml.kernel.org/r/ZOUYekbtTv+n8hYf@casper.infradead.org Link: https://lkml.kernel.org/r/20230818202335.2739663-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-23Merge tag 'vfs-6.6-merge-3' of ↵Christian Brauner4-38/+183
ssh://gitolite.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs online fsck update from Darrick Wong: New code for 6.6: * Allow the kernel to initiate a freeze of a filesystem. The kernel and userspace can both hold a freeze on a filesystem at the same time; the freeze is not lifted until /both/ holders lift it. This will enable us to fix a longstanding bug in XFS online fsck. * Use kernel-initated fsfreeze to fix some longstanding false negatives in online fsck of the free space and inode counters. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Message-Id: <20230822182604.GB11286@frogsfrogsfrogs> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11xfs use fs_holder_ops for the log and RT devicesChristoph Hellwig1-15/+4
Use the generic fs_holder_ops to shut down the file system when the log or RT device goes away instead of duplicating the logic. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230802154131.2221419-13-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11xfs: drop s_umount over opening the log and RT devicesChristoph Hellwig1-4/+14
Just like get_tree_bdev needs to drop s_umount when opening the main device, we need to do the same for the xfs log and RT devices to avoid a potential lock order reversal with s_unmount for the mark_dead path. It might be preferable to just drop s_umount over ->fill_super entirely, but that will require a fairly massive audit first, so we'll do the easy version here first. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230802154131.2221419-12-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11fs: use the super_block as holder when mounting file systemsChristoph Hellwig1-1/+1
The file system type is not a very useful holder as it doesn't allow us to go back to the actual file system instance. Pass the super_block instead which is useful when passed back to the file system driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner <brauner@kernel.org> Message-Id: <20230802154131.2221419-7-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11xfs: switch to multigrain timestampsJeff Layton3-7/+7
Enable multigrain timestamps, which should ensure that there is an apparent change to the timestamp whenever it has been written after being actively observed via getattr. Also, anytime the mtime changes, the ctime must also change, and those are now the only two options for xfs_trans_ichgtime. Have that function unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is always set. Acked-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Message-Id: <20230807-mgctime-v7-11-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11fs: drop the timespec64 argument from update_timeJeff Layton1-1/+0
Now that all of the update_time operations are prepared for it, we can drop the timespec64 argument from the update_time operation. Do that and remove it from some associated functions like inode_update_time and inode_needs_update_time. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-8-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11xfs: have xfs_vn_update_time gets its own timestampJeff Layton1-5/+9
In later patches we're going to drop the "now" parameter from the update_time operation. Prepare XFS for this by reworking how it fetches timestamps and sets them in the inode. Ensure that we update the ctime even if only S_MTIME is set. Signed-off-by: Jeff Layton <jlayton@kernel.org> Acked-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-7-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-10xfs: document the invalidate_bdev call in invalidate_bdevChristoph Hellwig1-0/+26
Copy and paste the commit message from Darrick into a comment to explain the seemingly odd invalidate_bdev in xfs_shutdown_devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230809220545.1308228-8-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-10xfs: close the external block devices in xfs_mount_freeChristoph Hellwig2-16/+22
blkdev_put must not be called under sb->s_umount to avoid a lock order reversal with disk->open_mutex. Move closing the buftargs into ->kill_sb to archive that. Note that the flushing of the disk caches and block device mapping invalidated needs to stay in ->put_super as the main block device is closed in kill_block_super already. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230809220545.1308228-7-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-10xfs: close the RT and log block devices in xfs_free_buftargChristoph Hellwig1-0/+5
Closing the block devices logically belongs into xfs_free_buftarg, So instead of open coding it in the caller move it there and add a check for the s_bdev so that the main device isn't close as that's done by the VFS helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230809220545.1308228-6-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-10xfs: remove xfs_blkdev_putChristoph Hellwig1-13/+5
There isn't much use for this trivial wrapper, especially as the NULL check is only needed in a single call site. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230809220545.1308228-5-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-10xfs: free the xfs_mount in ->kill_sbChristoph Hellwig1-9/+11
As a rule of thumb everything allocated to the fs_context and moved into the super_block should be freed by ->kill_sb so that the teardown handling doesn't need to be duplicated between the fill_super error path and put_super. Implement a XFS-specific kill_sb method to do that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230809220545.1308228-4-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-10xfs: remove a superfluous s_fs_info NULL check in xfs_fs_put_superChristoph Hellwig1-4/+0
->put_super is only called when sb->s_root is set, and thus when fill_super succeeds. Thus drop the NULL check that can't happen in xfs_fs_put_super. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230809220545.1308228-3-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-10xfs: reformat the xfs_fs_free prototypeChristoph Hellwig1-1/+2
The xfs_fs_free prototype formatting is a weird mix of the classic XFS style and the Linux style. Fix it up to be consistent. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230809220545.1308228-2-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09fs: drop the timespec64 arg from generic_update_timeJeff Layton1-2/+4
In future patches we're going to change how the ctime is updated to keep track of when it has been queried. The way that the update_time operation works (and a lot of its callers) make this difficult, since they grab a timestamp early and then pass it down to eventually be copied into the inode. All of the existing update_time callers pass in the result of current_time() in some fashion. Drop the "time" parameter from generic_update_time, and rework it to fetch its own timestamp. This change means that an update_time could fetch a different timestamp than was seen in inode_needs_update_time. update_time is only ever called with one of two flag combinations: Either S_ATIME is set, or S_MTIME|S_CTIME|S_VERSION are set. With this change we now treat the flags argument as an indicator that some value needed to be updated when last checked, rather than an indication to update specific timestamps. Rework the logic for updating the timestamps and put it in a new inode_update_timestamps helper that other update_time routines can use. S_ATIME is as treated as we always have, but if any of the other three are set, then we attempt to update all three. Also, some callers of generic_update_time need to know what timestamps were actually updated. Change it to return an S_* flag mask to indicate that and rework the callers to expect it. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-3-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-04xfs: stabilize fs summary counters for online fsckDarrick J. Wong4-38/+183
If the fscounters scrubber notices incorrect summary counters, it's entirely possible that scrub is simply racing with other threads that are updating the incore counters. There isn't a good way to stabilize percpu counters or set ourselves up to observe live updates with hooks like we do for the quotacheck or nlinks scanners, so we instead choose to freeze the filesystem long enough to walk the incore per-AG structures. Past me thought that it was going to be commonplace to have to freeze the filesystem to perform some kind of repair and set up a whole separate infrastructure to freeze the filesystem in such a way that userspace could not unfreeze while we were running. This involved adding a mutex and freeze_super/thaw_super functions and dealing with the fact that the VFS freeze/thaw functions can free the VFS superblock references on return. This was all very overwrought, since fscounters turned out to be the only user of scrub freezes, and it doesn't require the log to quiesce, only the incore superblock counters. We prevent other threads from changing the freeze level by calling freeze_super_excl with a custom freeze cookie to keep everyone else out of the filesystem. The end result is that fscounters should be much more efficient. When we're checking a busy system and we can't stabilize the counters, the custom freeze will do less work, which should result in less downtime. Repair should be similarly speedy, but that's in a later patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-07-25iomap: Add per-block dirty state tracking to improve performanceRitesh Harjani (IBM)1-1/+1
When filesystem blocksize is less than folio size (either with mapping_large_folio_support() or with blocksize < pagesize) and when the folio is uptodate in pagecache, then even a byte write can cause an entire folio to be written to disk during writeback. This happens because we currently don't have a mechanism to track per-block dirty state within struct iomap_folio_state. We currently only track uptodate state. This patch implements support for tracking per-block dirty state in iomap_folio_state->state bitmap. This should help improve the filesystem write performance and help reduce write amplification. Performance testing of below fio workload reveals ~16x performance improvement using nvme with XFS (4k blocksize) on Power (64K pagesize) FIO reported write bw scores improved from around ~28 MBps to ~452 MBps. 1. <test_randwrite.fio> [global] ioengine=psync rw=randwrite overwrite=1 pre_read=1 direct=0 bs=4k size=1G dir=./ numjobs=8 fdatasync=1 runtime=60 iodepth=64 group_reporting=1 [fio-run] 2. Also our internal performance team reported that this patch improves their database workload performance by around ~83% (with XFS on Power) Reported-by: Aravinda Herle <araherle@in.ibm.com> Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-07-24xfs: convert to ctime accessor functionsJeff Layton8-13/+15
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-80-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-17xfs: convert flex-array declarations in xfs attr shortform objectsDarrick J. Wong2-1/+2
As of 6.5-rc1, UBSAN trips over the ondisk extended attribute shortform definitions using an array length of 1 to pretend to be a flex array. Kernel compilers have to support unbounded array declarations, so let's correct this. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org>
2023-07-17xfs: convert flex-array declarations in xfs attr leaf blocksDarrick J. Wong2-10/+67
As of 6.5-rc1, UBSAN trips over the ondisk extended attribute leaf block definitions using an array length of 1 to pretend to be a flex array. Kernel compilers have to support unbounded array declarations, so let's correct this. ================================================================================ UBSAN: array-index-out-of-bounds in fs/xfs/libxfs/xfs_attr_leaf.c:2535:24 index 2 is out of range for type '__u8 [1]' Call Trace: <TASK> dump_stack_lvl+0x33/0x50 __ubsan_handle_out_of_bounds+0x9c/0xd0 xfs_attr3_leaf_getvalue+0x2ce/0x2e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_leaf_get+0x148/0x1c0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_get_ilocked+0xae/0x110 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_get+0xee/0x150 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_xattr_get+0x7d/0xc0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] __vfs_getxattr+0xa3/0x100 vfs_getxattr+0x87/0x1d0 do_getxattr+0x17a/0x220 getxattr+0x89/0xf0 Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org>
2023-07-17xfs: convert flex-array declarations in struct xfs_attrlist*Darrick J. Wong1-2/+2
As of 6.5-rc1, UBSAN trips over the attrlist ioctl definitions using an array length of 1 to pretend to be a flex array. Kernel compilers have to support unbounded array declarations, so let's correct this. This may cause friction with userspace header declarations, but suck is life. ================================================================================ UBSAN: array-index-out-of-bounds in fs/xfs/xfs_ioctl.c:345:18 index 1 is out of range for type '__s32 [1]' Call Trace: <TASK> dump_stack_lvl+0x33/0x50 __ubsan_handle_out_of_bounds+0x9c/0xd0 xfs_ioc_attr_put_listent+0x413/0x420 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_list_ilocked+0x170/0x850 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_list+0xb7/0x120 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_ioc_attr_list+0x13b/0x2e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attrlist_by_handle+0xab/0x120 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_file_ioctl+0x1ff/0x15e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] vfs_ioctl+0x1f/0x60 The kernel and xfsprogs code that uses these structures will not have problems, but the long tail of external user programs might. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org>
2023-07-09Merge tag 'xfs-6.5-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-1/+1
Pull xfs fix from Darrick Wong: "Nothing exciting here, just getting rid of a gcc warning that I got tired of seeing when I turn on gcov" * tag 'xfs-6.5-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix uninit warning in xfs_growfs_data
2023-07-08xfs: fix uninit warning in xfs_growfs_dataDarrick J. Wong1-1/+1
Quiet down this gcc warning: fs/xfs/xfs_fsops.c: In function ‘xfs_growfs_data’: fs/xfs/xfs_fsops.c:219:21: error: ‘lastag_extended’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 219 | if (lastag_extended) { | ^~~~~~~~~~~~~~~ fs/xfs/xfs_fsops.c:100:33: note: ‘lastag_extended’ was declared here 100 | bool lastag_extended; | ^~~~~~~~~~~~~~~ By setting its value explicitly. From code analysis I don't think this is a real problem, but I have better things to do than analyse this closely. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-07-06Merge tag 'xfs-6.5-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds21-331/+586
Pull more xfs updates from Darrick Wong: - Fix some ordering problems with log items during log recovery - Don't deadlock the system by trying to flush busy freed extents while holding on to busy freed extents - Improve validation of log geometry parameters when reading the primary superblock - Validate the length field in the AGF header - Fix recordset filtering bugs when re-calling GETFSMAP to return more results when the resultset didn't previously fit in the caller's buffer - Fix integer overflows in GETFSMAP when working with rt volumes larger than 2^32 fsblocks - Fix GETFSMAP reporting the undefined space beyond the last rtextent - Fix filtering bugs in GETFSMAP's log device backend if the log ever becomes longer than 2^32 fsblocks - Improve validation of file offsets in the GETFSMAP range parameters - Fix an off by one bug in the pmem media failure notification computation - Validate the length field in the AGI header too * tag 'xfs-6.5-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: Remove unneeded semicolon xfs: AGI length should be bounds checked xfs: fix the calculation for "end" and "length" xfs: fix xfs_btree_query_range callers to initialize btree rec fully xfs: validate fsmap offsets specified in the query keys xfs: fix logdev fsmap query result filtering xfs: clean up the rtbitmap fsmap backend xfs: fix getfsmap reporting past the last rt extent xfs: fix integer overflows in the fsmap rtbitmap and logdev backends xfs: fix interval filtering in multi-step fsmap queries xfs: fix bounds check in xfs_defer_agfl_block() xfs: AGF length has never been bounds checked xfs: journal geometry is not properly bounds checked xfs: don't block in busy flushing when freeing extents xfs: allow extent free intents to be retried xfs: pass alloc flags through to xfs_extent_busy_flush() xfs: use deferred frees for btree block freeing xfs: don't reverse order of items in bulk AIL insertion xfs: remove redundant initializations of pointers drop_leaf and save_leaf
2023-07-03xfs: Remove unneeded semicolonYang Li1-1/+1
./fs/xfs/xfs_extfree_item.c:723:3-4: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5728 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-07-03xfs: AGI length should be bounds checkedDarrick J. Wong3-39/+60
Similar to the recent patch strengthening the AGF agf_length verification, the AGI verifier does not check that the AGI length field is within known good bounds. This isn't currently checked by runtime kernel code, yet we assume in many places that it is correct and verify other metadata against it. Add length verification to the AGI verifier. Just like the AGF length checking, the length of the AGI must be equal to the size of the AG specified in the superblock, unless it is the last AG in the filesystem. In that case, it must be less than or equal to sb->sb_agblocks and greater than XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs operation will allow to exist. There's only one place in the filesystem that actually uses agi_length, but let's not leave it vulnerable to the same weird nonsense that generates syzbot bugs, eh? Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-07-02xfs: fix the calculation for "end" and "length"Shiyang Ruan1-4/+5
The value of "end" should be "start + length - 1". Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-07-02xfs: fix xfs_btree_query_range callers to initialize btree rec fullyDarrick J. Wong3-20/+13
Use struct initializers to ensure that the xfs_btree_irecs passed into the query_range function are completely initialized. No functional changes, just closing some sloppy hygiene. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-07-02xfs: validate fsmap offsets specified in the query keysDarrick J. Wong1-11/+19
Improve the validation of the fsmap offset fields in the query keys and move the validation to the top of the function now that we have pushed the low key adjustment code downwards. Also fix some indenting issues that aren't worth a separate patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-07-02xfs: fix logdev fsmap query result filteringDarrick J. Wong1-22/+8
The external log device fsmap backend doesn't have an rmapbt to query, so it's wasteful to spend time initializing the rmap_irec objects. Worse yet, the log could (someday) be longer than 2^32 fsblocks, so using the rmap irec structure will result in integer overflows. Fix this mess by computing the start address that we want from keys[0] directly, and use the daddr-based record filtering algorithm that we also use for rtbitmap queries. Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-07-02xfs: clean up the rtbitmap fsmap backendDarrick J. Wong2-53/+34
The rtbitmap fsmap backend doesn't query the rmapbt, so it's wasteful to spend time initializing the rmap_irec objects. Worse yet, the logic to query the rtbitmap is spread across three separate functions, which is unnecessarily difficult to follow. Compute the start rtextent that we want from keys[0] directly and combine the functions to avoid passing parameters around everywhere, and consolidate all the logic into a single function. At one point many years ago I intended to use __xfs_getfsmap_rtdev as the launching point for realtime rmapbt queries, but this hasn't been the case for a long time. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-07-02xfs: fix getfsmap reporting past the last rt extentDarrick J. Wong1-1/+1
The realtime section ends at the last rt extent. If the user configures the rt geometry with an extent size that is not an integer factor of the number of rt blocks, it's possible for there to be rt blocks past the end of the last rt extent. These tail blocks cannot ever be allocated and will cause corruption reports if the last extent coincides with the end of an rt bitmap block, so do not report consider them for the GETFSMAP output. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-07-02xfs: fix integer overflows in the fsmap rtbitmap and logdev backendsDarrick J. Wong1-26/+64
It's not correct to use the rmap irec structure to hold query key information to query the rtbitmap because the realtime volume can be longer than 2^32 fsblocks in length. Because the rt volume doesn't have allocation groups, introduce a daddr-based record filtering algorithm and compute the rtextent values using 64-bit variables. The same problem exists in the external log device fsmap implementation, so use the same solution to fix it too. After this patch, all the code that touches info->low and info->high under xfs_getfsmap_logdev and __xfs_getfsmap_rtdev are unnecessary. Cleaning this up will be done in subsequent patches. Fixes: 4c934c7dd60c ("xfs: report realtime space information via the rtbitmap") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-07-02xfs: fix interval filtering in multi-step fsmap queriesDarrick J. Wong1-19/+48
I noticed a bug in ranged GETFSMAP queries: # xfs_io -c 'fsmap -vvvv' /opt EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL 0: 8:80 [0..7]: static fs metadata 0 (0..7) 8 <snip> 9: 8:80 [192..223]: 137 0..31 0 (192..223) 32 # xfs_io -c 'fsmap -vvvv -d 208 208' /opt # That's not right -- we asked what block maps block 208, and we should've received a mapping for inode 137 offset 16. Instead, we get nothing. The root cause of this problem is a mis-interaction between the fsmap code and how btree ranged queries work. xfs_btree_query_range returns any btree record that overlaps with the query interval, even if the record starts before or ends after the interval. Similarly, GETFSMAP is supposed to return a recordset containing all records that overlap the range queried. However, it's possible that the recordset is larger than the buffer that the caller provided to convey mappings to userspace. In /that/ case, userspace is supposed to copy the last record returned to fmh_keys[0] and call GETFSMAP again. In this case, we do not want to return mappings that we have already supplied to the caller. The call to xfs_btree_query_range is the same, but now we ignore any records that start before fmh_keys[0]. Unfortunately, we didn't implement the filtering predicate correctly. The predicate should only be called when we're calling back for more records. Accomplish this by setting info->low.rm_blockcount to a nonzero value and ensuring that it is cleared as necessary. As a result, we no longer want to adjust dkeys[0] in the main setup function because that's confusing. This patch doesn't touch the logdev/rtbitmap backends because they have bigger problems that will be addressed by subsequent patches. Found via xfs/556 with parent pointers enabled. Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-06-29Merge tag 'xfs-6.5-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds5-14/+19
Pull xfs updates from Darrick Wong: "There's not much going on this cycle -- the large extent counts feature graduated, so now users can create more extremely fragmented files! :P The rest are bug fixes; and I'll be sending more next week. - Fix a problem where shrink would blow out the space reserve by declining to shrink the filesystem - Drop the EXPERIMENTAL tag for the large extent counts feature - Set FMODE_CAN_ODIRECT and get rid of an address space op - Fix an AG count overflow bug in growfs if the new device size is redonkulously large" * tag 'xfs-6.5-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix ag count overflow during growfs xfs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method xfs: drop EXPERIMENTAL tag for large extent counts xfs: don't deplete the reserve pool when trying to shrink the fs
2023-06-29xfs: fix bounds check in xfs_defer_agfl_block()Dave Chinner1-5/+6
Need to happen before we allocate and then leak the xefi. Found by coverity via an xfsprogs libxfs scan. [djwong: This also fixes the type of the @agbno argument.] Fixes: 7dfee17b13e5 ("xfs: validate block number being freed before adding to xefi") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-06-29xfs: AGF length has never been bounds checkedDave Chinner1-34/+56
The AGF verifier does not check that the AGF length field is within known good bounds. This has never been checked by runtime kernel code (i.e. the lack of verification goes back to 1993) yet we assume in many places that it is correct and verify other metdata against it. Add length verification to the AGF verifier. The length of the AGF must be equal to the size of the AG specified in the superblock, unless it is the last AG in the filesystem. In that case, it must be less than or equal to sb->sb_agblocks and greater than XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs operation will allow to exist. This requires a bit of rework of the verifier function. We want to verify metadata before we use it to verify other metadata. Hence we need to verify the AGF sequence numbers before using them to verify the length of the AGF. Then we can verify the AGF length before we verify AGFL fields. Then we can verifier other fields that are bounds limited by the AGF length. And, finally, by calculating agf_length only once into a local variable, we can collapse repeated "if (xfs_has_foo() &&" conditionaly checks into single checks. This makes the code much easier to follow as all the checks for a given feature are obviously in the same place. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-06-29xfs: journal geometry is not properly bounds checkedDave Chinner2-33/+70
If the journal geometry results in a sector or log stripe unit validation problem, it indicates that we cannot set the log up to safely write to the the journal. In these cases, we must abort the mount because the corruption needs external intervention to resolve. Similarly, a journal that is too large cannot be written to safely, either, so we shouldn't allow those geometries to mount, either. If the log is too small, we risk having transaction reservations overruning the available log space and the system hanging waiting for space it can never provide. This is purely a runtime hang issue, not a corruption issue as per the first cases listed above. We abort mounts of the log is too small for V5 filesystems, but we must allow v4 filesystems to mount because, historically, there was no log size validity checking and so some systems may still be out there with undersized logs. The problem is that on V4 filesystems, when we discover a log geometry problem, we skip all the remaining checks and then allow the log to continue mounting. This mean that if one of the log size checks fails, we skip the log stripe unit check. i.e. we allow the mount because a "non-fatal" geometry is violated, and then fail to check the hard fail geometries that should fail the mount. Move all these fatal checks to the superblock verifier, and add a new check for the two log sector size geometry variables having the same values. This will prevent any attempt to mount a log that has invalid or inconsistent geometries long before we attempt to mount the log. However, for the minimum log size checks, we can only do that once we've setup up the log and calculated all the iclog sizes and roundoffs. Hence this needs to remain in the log mount code after the log has been initialised. It is also the only case where we should allow a v4 filesystem to continue running, so leave that handling in place, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-06-29xfs: don't block in busy flushing when freeing extentsDave Chinner4-30/+88
If the current transaction holds a busy extent and we are trying to allocate a new extent to fix up the free list, we can deadlock if the AG is entirely empty except for the busy extent held by the transaction. This can occur at runtime processing an XEFI with multiple extents in this path: __schedule+0x22f at ffffffff81f75e8f schedule+0x46 at ffffffff81f76366 xfs_extent_busy_flush+0x69 at ffffffff81477d99 xfs_alloc_ag_vextent_size+0x16a at ffffffff8141711a xfs_alloc_ag_vextent+0x19b at ffffffff81417edb xfs_alloc_fix_freelist+0x22f at ffffffff8141896f xfs_free_extent_fix_freelist+0x6a at ffffffff8141939a __xfs_free_extent+0x99 at ffffffff81419499 xfs_trans_free_extent+0x3e at ffffffff814a6fee xfs_extent_free_finish_item+0x24 at ffffffff814a70d4 xfs_defer_finish_noroll+0x1f7 at ffffffff81441407 xfs_defer_finish+0x11 at ffffffff814417e1 xfs_itruncate_extents_flags+0x13d at ffffffff8148b7dd xfs_inactive_truncate+0xb9 at ffffffff8148bb89 xfs_inactive+0x227 at ffffffff8148c4f7 xfs_fs_destroy_inode+0xb8 at ffffffff81496898 destroy_inode+0x3b at ffffffff8127d2ab do_unlinkat+0x1d1 at ffffffff81270df1 do_syscall_64+0x40 at ffffffff81f6b5f0 entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8200007c This can also happen in log recovery when processing an EFI with multiple extents through this path: context_switch() kernel/sched/core.c:3881 __schedule() kernel/sched/core.c:5111 schedule() kernel/sched/core.c:5186 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824 xfs_log_mount_finish() fs/xfs/xfs_log.c:764 xfs_mountfs() fs/xfs/xfs_mount.c:978 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908 mount_bdev() fs/super.c:1417 xfs_fs_mount() fs/xfs/xfs_super.c:1985 legacy_get_tree() fs/fs_context.c:647 vfs_get_tree() fs/super.c:1547 do_new_mount() fs/namespace.c:2843 do_mount() fs/namespace.c:3163 ksys_mount() fs/namespace.c:3372 __do_sys_mount() fs/namespace.c:3386 __se_sys_mount() fs/namespace.c:3383 __x64_sys_mount() fs/namespace.c:3383 do_syscall_64() arch/x86/entry/common.c:296 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180 To avoid this deadlock, we should not block in xfs_extent_busy_flush() if we hold a busy extent in the current transaction. Now that the EFI processing code can handle requeuing a partially completed EFI, we can detect this situation in xfs_extent_busy_flush() and return -EAGAIN rather than going to sleep forever. The -EAGAIN get propagated back out to the xfs_trans_free_extent() context, where the EFD is populated and the transaction is rolled, thereby moving the busy extents into the CIL. At this point, we can retry the extent free operation again with a clean transaction. If we hit the same "all free extents are busy" situation when trying to fix up the free list, we can safely call xfs_extent_busy_flush() and wait for the busy extents to resolve and wake us. At this point, the allocation search can make progress again and we can fix up the free list. This deadlock was first reported by Chandan in mid-2021, but I couldn't make myself understood during review, and didn't have time to fix it myself. It was reported again in March 2023, and again I have found myself unable to explain the complexities of the solution needed during review. As such, I don't have hours more time to waste trying to get the fix written the way it needs to be written, so I'm just doing it myself. This patchset is largely based on Wengang Wang's last patch, but with all the unnecessary stuff removed, split up into multiple patches and cleaned up somewhat. Reported-by: Chandan Babu R <chandanrlinux@gmail.com> Reported-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-06-29xfs: allow extent free intents to be retriedDave Chinner1-3/+69
Extent freeing neeeds to be able to avoid a busy extent deadlock when the transaction itself holds the only busy extents in the allocation group. This may occur if we have an EFI that contains multiple extents to be freed, and the freeing the second intent requires the space the first extent free released to expand the AGFL. If we block on the busy extent at this point, we deadlock. We hold a dirty transaction that contains a entire atomic extent free operations within it, so if we can abort the extent free operation and commit the progress that we've made, the busy extent can be resolved by a log force. Hence we can restart the aborted extent free with a new transaction and continue to make progress without risking deadlocks. To enable this, we need the EFI processing code to be able to handle an -EAGAIN error to tell it to commit the current transaction and retry again. This mechanism is already built into the defer ops processing (used bythe refcount btree modification intents), so there's relatively little handling we need to add to the EFI code to enable this. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2023-06-29xfs: pass alloc flags through to xfs_extent_busy_flush()Dave Chinner4-47/+56
To avoid blocking in xfs_extent_busy_flush() when freeing extents and the only busy extents are held by the current transaction, we need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way into xfs_extent_busy_flush(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2023-06-29xfs: use deferred frees for btree block freeingDave Chinner11-26/+33
Btrees that aren't freespace management trees use the normal extent allocation and freeing routines for their blocks. Hence when a btree block is freed, a direct call to xfs_free_extent() is made and the extent is immediately freed. This puts the entire free space management btrees under this path, so we are stacking btrees on btrees in the call stack. The inobt, finobt and refcount btrees all do this. However, the bmap btree does not do this - it calls xfs_free_extent_later() to defer the extent free operation via an XEFI and hence it gets processed in deferred operation processing during the commit of the primary transaction (i.e. via intent chaining). We need to change xfs_free_extent() to behave in a non-blocking manner so that we can avoid deadlocks with busy extents near ENOSPC in transactions that free multiple extents. Inserting or removing a record from a btree can cause a multi-level tree merge operation and that will free multiple blocks from the btree in a single transaction. i.e. we can call xfs_free_extent() multiple times, and hence the btree manipulation transaction is vulnerable to this busy extent deadlock vector. To fix this, convert all the remaining callers of xfs_free_extent() to use xfs_free_extent_later() to queue XEFIs and hence defer processing of the extent frees to a context that can be safely restarted if a deadlock condition is detected. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2023-06-29xfs: don't reverse order of items in bulk AIL insertionDave Chinner1-1/+1
XFS has strict metadata ordering requirements. One of the things it does is maintain the commit order of items from transaction commit through the CIL and into the AIL. That is, if a transaction logs item A before item B in a modification, then they will be inserted into the CIL in the order {A, B}. These items are then written into the iclog during checkpointing in the order {A, B}. When the checkpoint commits, they are supposed to be inserted into the AIL in the order {A, B}, and when they are pushed from the AIL, they are pushed in the order {A, B}. If we crash, log recovery then replays the two items from the checkpoint in the order {A, B}, resulting in the objects the items apply to being queued for writeback at the end of the checkpoint in the order {A, B}. This means recovery behaves the same way as the runtime code. In places, we have subtle dependencies on this ordering being maintained. One of this place is performing intent recovery from the log. It assumes that recovering an intent will result in a non-intent object being the first thing that is modified in the recovery transaction, and so when the transaction commits and the journal flushes, the first object inserted into the AIL beyond the intent recovery range will be a non-intent item. It uses the transistion from intent items to non-intent items to stop the recovery pass. A recent log recovery issue indicated that an intent was appearing as the first item in the AIL beyond the recovery range, hence breaking the end of recovery detection that exists. Tracing indicated insertion of the items into the AIL was apparently occurring in the right order (the intent was last in the commit item list), but the intent was appearing first in the AIL. IOWs, the order of items in the AIL was {D,C,B,A}, not {A,B,C,D}, and bulk insertion was reversing the order of the items in the batch of items being inserted. Lucky for us, all the items fed to bulk insertion have the same LSN, so the reversal of order does not affect the log head/tail tracking that is based on the contents of the AIL. It only impacts on code that has implicit, subtle dependencies on object order, and AFAICT only the intent recovery loop is impacted by it. Make sure bulk AIL insertion does not reorder items incorrectly. Fixes: 0e57f6a36f9b ("xfs: bulk AIL insertion during transaction commit") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2023-06-29xfs: remove redundant initializations of pointers drop_leaf and save_leafColin Ian King1-2/+0
Pointers drop_leaf and save_leaf are initialized with values that are never read, they are being re-assigned later on just before they are used. Remove the redundant early initializations and keep the later assignments at the point where they are used. Cleans up two clang scan build warnings: fs/xfs/libxfs/xfs_attr_leaf.c:2288:29: warning: Value stored to 'drop_leaf' during its initialization is never read [deadcode.DeadStores] fs/xfs/libxfs/xfs_attr_leaf.c:2289:29: warning: Value stored to 'save_leaf' during its initialization is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-06-28Merge tag 'mm-stable-2023-06-24-19-15' of ↵Linus Torvalds1-6/+0
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ...