kernel/linux.git - Linux kernel stable tree (mirror)

Age	Commit message (Collapse)	Author	Files	Lines
2023-06-05	xfs: collect errors from inodegc for unlinked inode recovery	Dave Chinner	8	-37/+60
	Unlinked list recovery requires errors removing the inode the from the unlinked list get fed back to the main recovery loop. Now that we offload the unlinking to the inodegc work, we don't get errors being fed back when we trip over a corruption that prevents the inode from being removed from the unlinked list. This means we never clear the corrupt unlinked list bucket, resulting in runtime operations eventually tripping over it and shutting down. Fix this by collecting inodegc worker errors and feed them back to the flush caller. This is largely best effort - the only context that really cares is log recovery, and it only flushes a single inode at a time so we don't need complex synchronised handling. Essentially the inodegc workers will capture the first error that occurs and the next flush will gather them and clear them. The flush itself will only report the first gathered error. In the cases where callers can return errors, propagate the collected inodegc flush error up the error handling chain. In the case of inode unlinked list recovery, there are several superfluous calls to flush queued unlinked inodes - xlog_recover_iunlink_bucket() guarantees that it has flushed the inodegc and collected errors before it returns. Hence nothing in the calling path needs to run a flush, even when an error is returned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05	xfs: validate block number being freed before adding to xefi	Dave Chinner	8	-23/+62
	Bad things happen in defered extent freeing operations if it is passed a bad block number in the xefi. This can come from a bogus agno/agbno pair from deferred agfl freeing, or just a bad fsbno being passed to __xfs_free_extent_later(). Either way, it's very difficult to diagnose where a null perag oops in EFI creation is coming from when the operation that queued the xefi has already been completed and there's no longer any trace of it around.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05	xfs: validity check agbnos on the AGFL	Dave Chinner	1	-0/+3
	If the agfl or the indexing in the AGF has been corrupted, getting a block form the AGFL could return an invalid block number. If this happens, bad things happen. Check the agbno we pull off the AGFL and return -EFSCORRUPTED if we find somethign bad. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05	xfs: fix agf/agfl verification on v4 filesystems	Dave Chinner	1	-17/+42
	When a v4 filesystem has fl_last - fl_first != fl_count, we do not not detect the corruption and allow the AGF to be used as it if was fully valid. On V5 filesystems, we reset the AGFL to empty in these cases and avoid the corruption at a small cost of leaked blocks. If we don't catch the corruption on V4 filesystems, bad things happen later when an allocation attempts to trim the free list and either double-frees stale entries in the AGFl or tries to free NULLAGBNO entries. Either way, this is bad. Prevent this from happening by using the AGFL_NEED_RESET logic for v4 filesysetms, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05	xfs: fix double xfs_perag_rele() in xfs_filestream_pick_ag()	Dave Chinner	1	-1/+0
	xfs_bmap_longest_free_extent() can return an error when accessing the AGF fails. In this case, the behaviour of xfs_filestream_pick_ag() is conditional on the error. We may continue the loop, or break out of it. The error handling after the loop cleans up the perag reference held when the break occurs. If we continue, the next loop iteration handles cleaning up the perag reference. EIther way, we don't need to release the active perag reference when xfs_bmap_longest_free_extent() fails. Doing so means we do a double decrement on the active reference count, and this causes tha active reference count to fall to zero. At this point, new active references will fail. This leads to unmount hanging because it tries to grab active references to that perag, only for it to fail. This happens inside a loop that retries until a inode tree radix tree tag is cleared, which cannot happen because we can't get an active reference to the perag. The unmount livelocks in this path: xfs_reclaim_inodes+0x80/0xc0 xfs_unmount_flush_inodes+0x5b/0x70 xfs_unmountfs+0x5b/0x1a0 xfs_fs_put_super+0x49/0x110 generic_shutdown_super+0x7c/0x1a0 kill_block_super+0x27/0x50 deactivate_locked_super+0x30/0x90 deactivate_super+0x3c/0x50 cleanup_mnt+0xc2/0x160 __cleanup_mnt+0x12/0x20 task_work_run+0x5e/0xa0 exit_to_user_mode_prepare+0x1bc/0x1c0 syscall_exit_to_user_mode+0x16/0x40 do_syscall_64+0x40/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Reported-by: Pengfei Xu <pengfei.xu@intel.com> Fixes: eb70aa2d8ed9 ("xfs: use for_each_perag_wrap in xfs_filestream_pick_ag") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05	xfs: fix broken logic when detecting mergeable bmap records	Darrick J. Wong	1	-12/+13
	Commit 6bc6c99a944c was a well-intentioned effort to initiate consolidation of adjacent bmbt mapping records by setting the PREEN flag. Consolidation can only happen if the length of the combined record doesn't overflow the 21-bit blockcount field of the bmbt recordset. Unfortunately, the length test is inverted, leading to it triggering on data forks like these: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..16777207]: 76110848..92888055 0 (76110848..92888055) 16777208 1: [16777208..20639743]: 92888056..96750591 0 (92888056..96750591) 3862536 Note that record 0 has a length of 16777208 512b blocks. This corresponds to 2097151 4k fsblocks, which is the maximum. Hence the two records cannot be merged. However, the logic is still wrong even if we change the in-loop comparison, because the scope of our examination isn't broad enough inside the loop to detect mappings like this: 0: [0..9]: 76110838..76110847 0 (76110838..76110847) 10 1: [10..16777217]: 76110848..92888055 0 (76110848..92888055) 16777208 2: [16777218..20639753]: 92888056..96750591 0 (92888056..96750591) 3862536 These three records could be merged into two, but one cannot determine this purely from looking at records 0-1 or 1-2 in isolation. Hoist the mergability detection outside the loop, and base its decision making on whether or not a merged mapping could be expressed in fewer bmbt records. While we're at it, fix the incorrect return type of the iter function. Fixes: 336642f79283 ("xfs: alert the user about data/attr fork mappings that could be merged") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-04	xfs: Fix undefined behavior of shift into sign bit	Geert Uytterhoeven	1	-4/+4
	With gcc-5: In file included from ./include/trace/define_trace.h:102:0, from ./fs/xfs/scrub/trace.h:988, from fs/xfs/scrub/trace.c:40: ./fs/xfs/./scrub/trace.h: In function ‘trace_raw_output_xchk_fsgate_class’: ./fs/xfs/scrub/scrub.h:111:28: error: initializer element is not constant #define XREP_ALREADY_FIXED (1 << 31) /* checking our repair work */ ^ Shifting the (signed) value 1 into the sign bit is undefined behavior. Fix this for all definitions in the file by shifting "1U" instead of "1". This was exposed by the first user added in commit 466c525d6d35e691 ("xfs: minimize overhead of drain wakeups by using jump labels"). Fixes: 160b5a784525e8a4 ("xfs: hoist the already_fixed variable to the scrub context") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-04	xfs: fix AGF vs inode cluster buffer deadlock	Dave Chinner	4	-106/+166
	Lock order in XFS is AGI -> AGF, hence for operations involving inode unlinked list operations we always lock the AGI first. Inode unlinked list operations operate on the inode cluster buffer, so the lock order there is AGI -> inode cluster buffer. For O_TMPFILE operations, this now means the lock order set down in xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the unlinked ops are done before the directory modifications that may allocate space and lock the AGF. Unfortunately, we also now lock the inode cluster buffer when logging an inode so that we can attach the inode to the cluster buffer and pin it in memory. This creates a lock order of AGF -> inode cluster buffer in directory operations as we have to log the inode after we've allocated new space for it. This creates a lock inversion between the AGF and the inode cluster buffer. Because the inode cluster buffer is shared across multiple inodes, the inversion is not specific to individual inodes but can occur when inodes in the same cluster buffer are accessed in different orders. To fix this we need move all the inode log item cluster buffer interactions to the end of the current transaction. Unfortunately, xfs_trans_log_inode() calls are littered throughout the transactions with no thought to ordering against other items or locking. This makes it difficult to do anything that involves changing the call sites of xfs_trans_log_inode() to change locking orders. However, we do now have a mechanism that allows is to postpone dirty item processing to just before we commit the transaction: the ->iop_precommit method. This will be called after all the modifications are done and high level objects like AGI and AGF buffers have been locked and modified, thereby providing a mechanism that guarantees we don't lock the inode cluster buffer before those high level objects are locked. This change is largely moving the guts of xfs_trans_log_inode() to xfs_inode_item_precommit() and providing an extra flag context in the inode log item to track the dirty state of the inode in the current transaction. This also means we do a lot less repeated work in xfs_trans_log_inode() by only doing it once per transaction when all the work is done. Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-04	xfs: defered work could create precommits	Dave Chinner	1	-0/+5
	To fix a AGI-AGF-inode cluster buffer deadlock, we need to move inode cluster buffer operations to the ->iop_precommit() method. However, this means that deferred operations can require precommits to be run on the final transaction that the deferred ops pass back to xfs_trans_commit() context. This will be exposed by attribute handling, in that the last changes to the inode in the attr set state machine "disappear" because the precommit operation is not run. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-04	xfs: restore allocation trylock iteration	Dave Chinner	1	-6/+7
	It was accidentally dropped when refactoring the allocation code, resulting in the AG iteration always doing blocking AG iteration. This results in a small performance regression for a specific fsmark test that runs more user data writer threads than there are AGs. Reported-by: kernel test robot <oliver.sang@intel.com> Fixes: 2edf06a50f5b ("xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags()") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-04	xfs: buffer pins need to hold a buffer reference	Dave Chinner	1	-23/+65
	When a buffer is unpinned by xfs_buf_item_unpin(), we need to access the buffer after we've dropped the buffer log item reference count. This opens a window where we can have two racing unpins for the buffer item (e.g. shutdown checkpoint context callback processing racing with journal IO iclog completion processing) and both attempt to access the buffer after dropping the BLI reference count. If we are unlucky, the "BLI freed" context wins the race and frees the buffer before the "BLI still active" case checks the buffer pin count. This results in a use after free that can only be triggered in active filesystem shutdown situations. To fix this, we need to ensure that buffer existence extends beyond the BLI reference count checks and until the unpin processing is complete. This implies that a buffer pin operation must also take a buffer reference to ensure that the buffer cannot be freed until the buffer unpin processing is complete. Reported-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-12	Merge tag 'xfs-6.4-rc1-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux	Linus Torvalds	14	-63/+65
	Pull xfs bug fixes from Dave Chinner: "Largely minor bug fixes and cleanups, th emost important of which are probably the fixes for regressions in the extent allocation code: - fixes for inode garbage collection shutdown racing with work queue updates - ensure inodegc workers run on the CPU they are supposed to - disable counter scrubbing until we can exclusively freeze the filesystem from the kernel - regression fixes for new allocation related bugs - a couple of minor cleanups" * tag 'xfs-6.4-rc1-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix xfs_inodegc_stop racing with mod_delayed_work xfs: disable reaping in fscounters scrub xfs: check that per-cpu inodegc workers actually run on that cpu xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately xfs: fix negative array access in xfs_getbmap xfs: don't allocate into the data fork for an unshare request xfs: flush dirty data and drain directios before scrubbing cow fork xfs: set bnobt/cntbt numrecs correctly when formatting new AGs xfs: don't unconditionally null args->pag in xfs_bmap_btalloc_at_eof
2023-05-02	xfs: fix xfs_inodegc_stop racing with mod_delayed_work	Darrick J. Wong	1	-5/+27
	syzbot reported this warning from the faux inodegc shrinker that tries to kick off inodegc work: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 102 at kernel/workqueue.c:1445 __queue_work+0xd44/0x1120 kernel/workqueue.c:1444 RIP: 0010:__queue_work+0xd44/0x1120 kernel/workqueue.c:1444 Call Trace: __queue_delayed_work+0x1c8/0x270 kernel/workqueue.c:1672 mod_delayed_work_on+0xe1/0x220 kernel/workqueue.c:1746 xfs_inodegc_shrinker_scan fs/xfs/xfs_icache.c:2212 [inline] xfs_inodegc_shrinker_scan+0x250/0x4f0 fs/xfs/xfs_icache.c:2191 do_shrink_slab+0x428/0xaa0 mm/vmscan.c:853 shrink_slab+0x175/0x660 mm/vmscan.c:1013 shrink_one+0x502/0x810 mm/vmscan.c:5343 shrink_many mm/vmscan.c:5394 [inline] lru_gen_shrink_node mm/vmscan.c:5511 [inline] shrink_node+0x2064/0x35f0 mm/vmscan.c:6459 kswapd_shrink_node mm/vmscan.c:7262 [inline] balance_pgdat+0xa02/0x1ac0 mm/vmscan.c:7452 kswapd+0x677/0xd60 mm/vmscan.c:7712 kthread+0x2e8/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 This warning corresponds to this code in __queue_work: /* * For a draining wq, only works from the same workqueue are * allowed. The __WQ_DESTROYING helps to spot the issue that * queues a new work item to a wq after destroy_workqueue(wq). */ if (unlikely(wq->flags & (__WQ_DESTROYING \| __WQ_DRAINING) && WARN_ON_ONCE(!is_chained_work(wq)))) return; For this to trip, we must have a thread draining the inodedgc workqueue and a second thread trying to queue inodegc work to that workqueue. This can happen if freezing or a ro remount race with reclaim poking our faux inodegc shrinker and another thread dropping an unlinked O_RDONLY file: Thread 0 Thread 1 Thread 2 xfs_inodegc_stop xfs_inodegc_shrinker_scan xfs_is_inodegc_enabled <yes, will continue> xfs_clear_inodegc_enabled xfs_inodegc_queue_all <list empty, do not queue inodegc worker> xfs_inodegc_queue <add to list> xfs_is_inodegc_enabled <no, returns> drain_workqueue <set WQ_DRAINING> llist_empty <no, will queue list> mod_delayed_work_on(..., 0) __queue_work <sees WQ_DRAINING, kaboom> In other words, everything between the access to inodegc_enabled state and the decision to poke the inodegc workqueue requires some kind of coordination to avoid the WQ_DRAINING state. We could perhaps introduce a lock here, but we could also try to eliminate WQ_DRAINING from the picture. We could replace the drain_workqueue call with a loop that flushes the workqueue and queues workers as long as there is at least one inode present in the per-cpu inodegc llists. We've disabled inodegc at this point, so we know that the number of queued inodes will eventually hit zero as long as xfs_inodegc_start cannot reactivate the workers. There are four callers of xfs_inodegc_start. Three of them come from the VFS with s_umount held: filesystem thawing, failed filesystem freezing, and the rw remount transition. The fourth caller is mounting rw (no remount or freezing possible). There are three callers ofs xfs_inodegc_stop. One is unmounting (no remount or thaw possible). Two of them come from the VFS with s_umount held: fs freezing and ro remount transition. Hence, it is correct to replace the drain_workqueue call with a loop that drains the inodegc llists. Fixes: 6191cf3ad59f ("xfs: flush inodegc workqueue tasks before cancel") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02	xfs: disable reaping in fscounters scrub	Darrick J. Wong	6	-39/+6
	The fscounters scrub code doesn't work properly because it cannot quiesce updates to the percpu counters in the filesystem, hence it returns false corruption reports. This has been fixed properly in one of the online repair patchsets that are under review by replacing the xchk_disable_reaping calls with an exclusive filesystem freeze. Disabling background gc isn't sufficient to fix the problem. In other words, scrub doesn't need to call xfs_inodegc_stop, which is just as well since it wasn't correct to allow scrub to call xfs_inodegc_start when something else could be calling xfs_inodegc_stop (e.g. trying to freeze the filesystem). Neuter the scrubber for now, and remove the xchk_*_reaping functions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02	xfs: check that per-cpu inodegc workers actually run on that cpu	Darrick J. Wong	3	-0/+8
	Now that we've allegedly worked out the problem of the per-cpu inodegc workers being scheduled on the wrong cpu, let's put in a debugging knob to let us know if a worker ever gets mis-scheduled again. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02	xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately	Darrick J. Wong	1	-2/+4
	I've been noticing odd racing behavior in the inodegc code that could only be explained by one cpu adding an inode to its inactivation llist at the same time that another cpu is processing that cpu's llist. Preemption is disabled between get/put_cpu_ptr, so the only explanation is scheduler mayhem. I inserted the following debug code into xfs_inodegc_worker (see the next patch): ASSERT(gc->cpu == smp_processor_id()); This assertion tripped during overnight tests on the arm64 machines, but curiously not on x86_64. I think we haven't observed any resource leaks here because the lockfree list code can handle simultaneous llist_add and llist_del_all functions operating on the same list. However, the whole point of having percpu inodegc lists is to take advantage of warm memory caches by inactivating inodes on the last processor to touch the inode. The incorrect scheduling seems to occur after an inodegc worker is subjected to mod_delayed_work(). This wraps mod_delayed_work_on with WORK_CPU_UNBOUND specified as the cpu number. Unbound allows for scheduling on any cpu, not necessarily the same one that scheduled the work. Because preemption is disabled for as long as we have the gc pointer, I think it's safe to use current_cpu() (aka smp_processor_id) to queue the delayed work item on the correct cpu. Fixes: 7cf2b0f9611b ("xfs: bound maximum wait time for inodegc work") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02	xfs: fix negative array access in xfs_getbmap	Darrick J. Wong	1	-1/+3
	In commit 8ee81ed581ff, Ye Bin complained about an ASSERT in the bmapx code that trips if we encounter a delalloc extent after flushing the pagecache to disk. The ioctl code does not hold MMAPLOCK so it's entirely possible that a racing write page fault can create a delalloc extent after the file has been flushed. The proposed solution was to replace the assertion with an early return that avoids filling out the bmap recordset with a delalloc entry if the caller didn't ask for it. At the time, I recall thinking that the forward logic sounded ok, but felt hesitant because I suspected that changing this code would cause something /else/ to burst loose due to some other subtlety. syzbot of course found that subtlety. If all the extent mappings found after the flush are delalloc mappings, we'll reach the end of the data fork without ever incrementing bmv->bmv_entries. This is new, since before we'd have emitted the delalloc mappings even though the caller didn't ask for them. Once we reach the end, we'll try to set BMV_OF_LAST on the -1st entry (because bmv_entries is zero) and go corrupt something else in memory. Yay. I really dislike all these stupid patches that fiddle around with debug code and break things that otherwise worked well enough. Nobody was complaining that calling XFS_IOC_BMAPX without BMV_IF_DELALLOC would return BMV_OF_DELALLOC records, and now we've gone from "weird behavior that nobody cared about" to "bad behavior that must be addressed immediately". Maybe I'll just ignore anything from Huawei from now on for my own sake. Reported-by: syzbot+c103d3808a0de5faaf80@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-xfs/20230412024907.GP360889@frogsfrogsfrogs/ Fixes: 8ee81ed581ff ("xfs: fix BUG_ON in xfs_getbmap()") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02	xfs: don't allocate into the data fork for an unshare request	Darrick J. Wong	1	-2/+3
	For an unshare request, we only have to take action if the data fork has a shared mapping. We don't care if someone else set up a cow operation. If we find nothing in the data fork, return a hole to avoid allocating space. Note that fallocate will replace the delalloc reservation with an unwritten extent anyway, so this has no user-visible effects outside of avoiding unnecessary updates. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02	xfs: flush dirty data and drain directios before scrubbing cow fork	Darrick J. Wong	1	-2/+2
	When we're scrubbing the COW fork, we need to take MMAPLOCK_EXCL to prevent page_mkwrite from modifying any inode state. The ILOCK should suffice to avoid confusing online fsck, but let's take the same locks that we do everywhere else. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02	xfs: set bnobt/cntbt numrecs correctly when formatting new AGs	Darrick J. Wong	1	-10/+9
	Through generic/300, I discovered that mkfs.xfs creates corrupt filesystems when given these parameters: # mkfs.xfs -d size=512M /dev/sda -f -d su=128k,sw=4 --unsupported Filesystems formatted with --unsupported are not supported!! meta-data=/dev/sda isize=512 agcount=8, agsize=16352 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=1 = reflink=1 bigtime=1 inobtcount=1 nrext64=1 data = bsize=4096 blocks=130816, imaxpct=25 = sunit=32 swidth=128 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=8192, version=2 = sectsz=512 sunit=32 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 = rgcount=0 rgsize=0 blks Discarding blocks...Done. # xfs_repair -n /dev/sda Phase 1 - find and verify superblock... - reporting progress in intervals of 15 minutes Phase 2 - using internal log - zero log... - 16:30:50: zeroing log - 16320 of 16320 blocks done - scan filesystem freespace and inode maps... agf_freeblks 25, counted 0 in ag 4 sb_fdblocks 8823, counted 8798 The root cause of this problem is the numrecs handling in xfs_freesp_init_recs, which is used to initialize a new AG. Prior to calling the function, we set up the new bnobt block with numrecs == 1 and rely on _freesp_init_recs to format that new record. If the last record created has a blockcount of zero, then it sets numrecs = 0. That last bit isn't correct if the AG contains the log, the start of the log is not immediately after the initial blocks due to stripe alignment, and the end of the log is perfectly aligned with the end of the AG. For this case, we actually formatted a single bnobt record to handle the free space before the start of the (stripe aligned) log, and incremented arec to try to format a second record. That second record turned out to be unnecessary, so what we really want is to leave numrecs at 1. The numrecs handling itself is overly complicated because a different function sets numrecs == 1. Change the bnobt creation code to start with numrecs set to zero and only increment it after successfully formatting a free space extent into the btree block. Fixes: f327a00745ff ("xfs: account for log space when formatting new AGs") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02	xfs: don't unconditionally null args->pag in xfs_bmap_btalloc_at_eof	Darrick J. Wong	1	-2/+3
	xfs/170 on a filesystem with su=128k,sw=4 produces this splat: BUG: kernel NULL pointer dereference, address: 0000000000000010 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT SMP CPU: 1 PID: 4022907 Comm: dd Tainted: G W 6.3.0-xfsx #2 6ebeeffbe9577d32 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-bu RIP: 0010:xfs_perag_rele+0x10/0x70 [xfs] RSP: 0018:ffffc90001e43858 EFLAGS: 00010217 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000100 RDX: ffffffffa054e717 RSI: 0000000000000005 RDI: 0000000000000000 RBP: ffff888194eea000 R08: 0000000000000000 R09: 0000000000000037 R10: ffff888100ac1cb0 R11: 0000000000000018 R12: 0000000000000000 R13: ffffc90001e43a38 R14: ffff888194eea000 R15: ffff888194eea000 FS: 00007f93d1a0e740(0000) GS:ffff88843fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000010 CR3: 000000018a34f000 CR4: 00000000003506e0 Call Trace: <TASK> xfs_bmap_btalloc+0x1a7/0x5d0 [xfs f85291d6841cbb3dc740083f1f331c0327394518] xfs_bmapi_allocate+0xee/0x470 [xfs f85291d6841cbb3dc740083f1f331c0327394518] xfs_bmapi_write+0x539/0x9e0 [xfs f85291d6841cbb3dc740083f1f331c0327394518] xfs_iomap_write_direct+0x1bb/0x2b0 [xfs f85291d6841cbb3dc740083f1f331c0327394518] xfs_direct_write_iomap_begin+0x51c/0x710 [xfs f85291d6841cbb3dc740083f1f331c0327394518] iomap_iter+0x132/0x2f0 __iomap_dio_rw+0x2f8/0x840 iomap_dio_rw+0xe/0x30 xfs_file_dio_write_aligned+0xad/0x180 [xfs f85291d6841cbb3dc740083f1f331c0327394518] xfs_file_write_iter+0xfb/0x190 [xfs f85291d6841cbb3dc740083f1f331c0327394518] vfs_write+0x2eb/0x410 ksys_write+0x65/0xe0 do_syscall_64+0x2b/0x80 This crash occurs under the "out_low_space" label. We grabbed a perag reference, passed it via args->pag into xfs_bmap_btalloc_at_eof, and afterwards args->pag is NULL. Fix the second function not to clobber args->pag if the caller had passed one in. Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-29	Merge tag 'xfs-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux	Linus Torvalds	81	-1887/+5199
	Pull xfs updates from Dave Chinner: "This consists mainly of online scrub functionality and the design documentation for the upcoming online repair functionality built on top of the scrub code: - Added detailed design documentation for the upcoming online repair feature - major update to online scrub to complete the reverse mapping cross-referencing infrastructure enabling us to fully validate allocated metadata against owner records. This is the last piece of scrub infrastructure needed before we can start merging online repair functionality. - Fixes for the ascii-ci hashing issues - deprecation of the ascii-ci functionality - on-disk format verification bug fixes - various random bug fixes for syzbot and other bug reports" * tag 'xfs-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (107 commits) xfs: fix livelock in delayed allocation at ENOSPC xfs: Extend table marker on deprecated mount options table xfs: fix duplicate includes xfs: fix BUG_ON in xfs_getbmap() xfs: verify buffer contents when we skip log replay xfs: _{attr,data}_map_shared should take ILOCK_EXCL until iread_extents is completely done xfs: remove WARN when dquot cache insertion fails xfs: don't consider future format versions valid xfs: deprecate the ascii-ci feature xfs: test the ascii case-insensitive hash xfs: stabilize the dirent name transformation function used for ascii-ci dir hash computation xfs: cross-reference rmap records with refcount btrees xfs: cross-reference rmap records with inode btrees xfs: cross-reference rmap records with free space btrees xfs: cross-reference rmap records with ag btrees xfs: introduce bitmap type for AG blocks xfs: convert xbitmap to interval tree xfs: drop the _safe behavior from the xbitmap foreach macro xfs: don't load local xattr values during scrub xfs: remove the for_each_xbitmap_ helpers ...
2023-04-28	Merge tag 'mm-stable-2023-04-27-15-30' of ↵	Linus Torvalds	2	-18/+2
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
2023-04-28	Merge tag 'sysctl-6.4-rc1' of ↵	Linus Torvalds	1	-19/+1
	git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "This only does a few sysctl moves from the kernel/sysctl.c file, the rest of the work has been put towards deprecating two API calls which incur recursion and prevent us from simplifying the registration process / saving memory per move. Most of the changes have been soaking on linux-next since v6.3-rc3. I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's feedback that we should see if we could save memory with these moves instead of incurring more memory. We currently incur more memory since when we move a syctl from kernel/sysclt.c out to its own file we end up having to add a new empty sysctl used to register it. To achieve saving memory we want to allow syctls to be passed without requiring the end element being empty, and just have our registration process rely on ARRAY_SIZE(). Without this, supporting both styles of sysctls would make the sysctl registration pretty brittle, hard to read and maintain as can be seen from Meng Tang's efforts to do just this [0]. Fortunately, in order to use ARRAY_SIZE() for all sysctl registrations also implies doing the work to deprecate two API calls which use recursion in order to support sysctl declarations with subdirectories. And so during this development cycle quite a bit of effort went into this deprecation effort. I've annotated the following two APIs are deprecated and in few kernel releases we should be good to remove them: - register_sysctl_table() - register_sysctl_paths() During this merge window we should be able to deprecate and unexport register_sysctl_paths(), we can probably do that towards the end of this merge window. Deprecating register_sysctl_table() will take a bit more time but this pull request goes with a few example of how to do this. As it turns out each of the conversions to move away from either of these two API calls also saves memory. And so long term, all these changes will prove to have saved a bit of memory on boot. The way I see it then is if remove a user of one deprecated call, it gives us enough savings to move one kernel/sysctl.c out from the generic arrays as we end up with about the same amount of bytes. Since deprecating register_sysctl_table() and register_sysctl_paths() does not require maintainer coordination except the final unexport you'll see quite a bit of these changes from other pull requests, I've just kept the stragglers after rc3" Link: https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org [0] * tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (29 commits) fs: fix sysctls.c built mm: compaction: remove incorrect #ifdef checks mm: compaction: move compaction sysctl to its own file mm: memory-failure: Move memory failure sysctls to its own file arm: simplify two-level sysctl registration for ctl_isa_vars ia64: simplify one-level sysctl registration for kdump_ctl_table utsname: simplify one-level sysctl registration for uts_kern_table ntfs: simplfy one-level sysctl registration for ntfs_sysctls coda: simplify one-level sysctl registration for coda_table fs/cachefiles: simplify one-level sysctl registration for cachefiles_sysctls xfs: simplify two-level sysctl registration for xfs_table nfs: simplify two-level sysctl registration for nfs_cb_sysctls nfs: simplify two-level sysctl registration for nfs4_cb_sysctls lockd: simplify two-level sysctl registration for nlm_sysctls proc_sysctl: enhance documentation xen: simplify sysctl registration for balloon md: simplify sysctl registration hv: simplify sysctl registration scsi: simplify sysctl registration with register_sysctl() csky: simplify alignment sysctl registration ...
2023-04-27	xfs: fix livelock in delayed allocation at ENOSPC	Dave Chinner	1	-1/+0
	On a filesystem with a non-zero stripe unit and a large sequential write, delayed allocation will set a minimum allocation length of the stripe unit. If allocation fails because there are no extents long enough for an aligned minlen allocation, it is supposed to fall back to unaligned allocation which allows single block extents to be allocated. When the allocator code was rewritting in the 6.3 cycle, this fallback was broken - the old code used args->fsbno as the both the allocation target and the allocation result, the new code passes the target as a separate parameter. The conversion didn't handle the aligned->unaligned fallback path correctly - it reset args->fsbno to the target fsbno on failure which broke allocation failure detection in the high level code and so it never fell back to unaligned allocations. This resulted in a loop in writeback trying to allocate an aligned block, getting a false positive success, trying to insert the result in the BMBT. This did nothing because the extent already was in the BMBT (merge results in an unchanged extent) and so it returned the prior extent to the conversion code as the current iomap. Because the iomap returned didn't cover the offset we tried to map, xfs_convert_blocks() then retries the allocation, which fails in the same way and now we have a livelock. Reported-and-tested-by: Brian Foster <bfoster@redhat.com> Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-04-26	Merge tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux	Linus Torvalds	1	-1/+2
	Pull io_uring updates from Jens Axboe: - Cleanup of the io-wq per-node mapping, notably getting rid of it so we just have a single io_wq entry per ring (Breno) - Followup to the above, move accounting to io_wq as well and completely drop struct io_wqe (Gabriel) - Enable KASAN for the internal io_uring caches (Breno) - Add support for multishot timeouts. Some applications use timeouts to wake someone waiting on completion entries, and this makes it a bit easier to just have a recurring timer rather than needing to rearm it every time (David) - Support archs that have shared cache coloring between userspace and the kernel, and hence have strict address requirements for mmap'ing the ring into userspace. This should only be parisc/hppa. (Helge, me) - XFS has supported O_DIRECT writes without needing to lock the inode exclusively for a long time, and ext4 now supports it as well. This is true for the common cases of not extending the file size. Flag the fs as having that feature, and utilize that to avoid serializing those writes in io_uring (me) - Enable completion batching for uring commands (me) - Revert patch adding io_uring restriction to what can be GUP mapped or not. This does not belong in io_uring, as io_uring isn't really special in this regard. Since this is also getting in the way of cleanups and improvements to the GUP code, get rid of if (me) - A few series greatly reducing the complexity of registered resources, like buffers or files. Not only does this clean up the code a lot, the simplified code is also a LOT more efficient (Pavel) - Series optimizing how we wait for events and run task_work related to it (Pavel) - Fixes for file/buffer unregistration with DEFER_TASKRUN (Pavel) - Misc cleanups and improvements (Pavel, me) * tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux: (71 commits) Revert "io_uring/rsrc: disallow multi-source reg buffers" io_uring: add support for multishot timeouts io_uring/rsrc: disassociate nodes and rsrc_data io_uring/rsrc: devirtualise rsrc put callbacks io_uring/rsrc: pass node to io_rsrc_put_work() io_uring/rsrc: inline io_rsrc_put_work() io_uring/rsrc: add empty flag in rsrc_node io_uring/rsrc: merge nodes and io_rsrc_put io_uring/rsrc: infer node from ctx on io_queue_rsrc_removal io_uring/rsrc: remove unused io_rsrc_node::llist io_uring/rsrc: refactor io_queue_rsrc_removal io_uring/rsrc: simplify single file node switching io_uring/rsrc: clean up __io_sqe_buffers_update() io_uring/rsrc: inline switch_start fast path io_uring/rsrc: remove rsrc_data refs io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce io_uring/rsrc: use wq for quiescing io_uring/rsrc: refactor io_rsrc_ref_quiesce io_uring/rsrc: remove io_rsrc_node::done io_uring/rsrc: use nospec'ed indexes ...
2023-04-24	Merge tag 'v6.4/vfs.acl' of ↵	Linus Torvalds	1	-4/+0
	git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull acl updates from Christian Brauner: "After finishing the introduction of the new posix acl api last cycle the generic POSIX ACL xattr handlers are still around in the filesystems xattr handlers for two reasons: (1) Because a few filesystems rely on the ->list() method of the generic POSIX ACL xattr handlers in their ->listxattr() inode operation. (2) POSIX ACLs are only available if IOP_XATTR is raised. The IOP_XATTR flag is raised in inode_init_always() based on whether the sb->s_xattr pointer is non-NULL. IOW, the registered xattr handlers of the filesystem are used to raise IOP_XATTR. Removing the generic POSIX ACL xattr handlers from all filesystems would risk regressing filesystems that only implement POSIX ACL support and no other xattrs (nfs3 comes to mind). This contains the work to decouple POSIX ACLs from the IOP_XATTR flag as they don't depend on xattr handlers anymore. So it's now possible to remove the generic POSIX ACL xattr handlers from the sb->s_xattr list of all filesystems. This is a crucial step as the generic POSIX ACL xattr handlers aren't used for POSIX ACLs anymore and POSIX ACLs don't depend on the xattr infrastructure anymore. Adressing problem (1) will require more long-term work. It would be best to get rid of the ->list() method of xattr handlers completely at some point. For erofs, ext{2,4}, f2fs, jffs2, ocfs2, and reiserfs the nop POSIX ACL xattr handler is kept around so they can continue to use array-based xattr handler indexing. This update does simplify the ->listxattr() implementation of all these filesystems however" * tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: acl: don't depend on IOP_XATTR ovl: check for ->listxattr() support reiserfs: rework priv inode handling fs: rename generic posix acl handlers reiserfs: rework ->listxattr() implementation fs: simplify ->listxattr() implementation fs: drop unused posix acl handlers xattr: remove unused argument xattr: add listxattr helper xattr: simplify listxattr helpers
2023-04-20	xfs: fix duplicate includes	Dave Chinner	1	-3/+1
	Header files were already included, just not in the normal order. Remove the duplicates, preserving normal order. Also move xfs_ag.h include to before the scrub internal includes which are normally last in the include list. Fixes: d5c88131dbf0 ("xfs: allow queued AG intents to drain before scrubbing") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-19	mm: vmscan: refactor updating current->reclaim_state	Yosry Ahmed	1	-2/+1
	During reclaim, we keep track of pages reclaimed from other means than LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab, which we stash a pointer to in current task_struct. However, we keep track of more than just reclaimed slab pages through this. We also use it for clean file pages dropped through pruned inodes, and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add a helper function that wraps updating it through current, so that future changes to this logic are contained within include/linux/swap.h. Link: https://lkml.kernel.org/r/20230413104034.1086717-4-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-14	Merge tag 'fix-asciici-bugs-6.4_2023-04-11' of ↵	Dave Chinner	5	-102/+185
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: fix ascii-ci problems, then kill it [v2] Last week, I was fiddling around with the metadump name obfuscation code while writing a debugger command to generate directories full of names that all have the same hash name. I had a few questions about how well all that worked with ascii-ci mode, and discovered a nasty discrepancy between the kernel and glibc's implementations of the tolower() function. I discovered that I could create a directory that is large enough to require separate leaf index blocks. The hashes stored in the dabtree use the ascii-ci specific hash function, which uses a library function to convert the name to lowercase before hashing. If the kernel and C library's versions of tolower do not behave exactly identically, xfs_ascii_ci_hashname will not produce the same results for the same inputs. xfs_repair will deem the leaf information corrupt and rebuild the directory. After that, lookups in the kernel will fail because the hash index doesn't work. The kernel's tolower function will convert extended ascii uppercase letters (e.g. A-with-umlaut) to extended ascii lowercase letters (e.g. a-with-umlaut), whereas glibc's will only do that if you force LANG to ascii. Tiny embedded libc implementations just plain won't do it at all, and the result is a mess. Stabilize the behavior of the hash function by encoding the name transformation function in libxfs, add it to the selftest, and fix all the userspace tools, none of which handle this transformation correctly. The v1 series generated a /lot/ of discussion, in which several things became very clear: (1) Linus is not enamored of case folding of any kind; (2) Dave and Christoph don't seem to agree on whether the feature is supposed to work for 7-bit ascii or latin1; (3) it trashes UTF8 encoded names if those happen to show up; and (4) I don't want to maintain this mess any longer than I have to. Kill it in 2030. v2: rename the functions to make it clear we're moving away from the letters t, o, l, o, w, e, and r; and deprecate the whole feature once we've fixed the bugs and added tests. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'scrub-strengthen-rmap-checking-6.4_2023-04-11' of ↵	Dave Chinner	5	-2/+411
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: strengthen rmapbt scrubbing [v24.5] This series strengthens space allocation record cross referencing by using AG block bitmaps to compute the difference between space used according to the rmap records and the primary metadata, and reports cross-referencing errors for any discrepancies. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'repair-bitmap-rework-6.4_2023-04-11' of ↵	Dave Chinner	4	-245/+358
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: rework online fsck incore bitmap [v24.5] In this series, we make some changes to the incore bitmap code: First, we shorten the prefix to 'xbitmap'. Then, we rework some utility functions for later use by online repair and clarify how the walk functions are supposed to be used. Finally, we use all these new pieces to convert the incore bitmap to use an interval tree instead of linked lists. This lifts the limitation that callers had to be careful not to set a range that was already set; and gets us ready for the btree rebuilder functions needing to be able to set bits in a bitmap and generate maximal contiguous extents for the set ranges. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'scrub-fix-xattr-memory-mgmt-6.4_2023-04-11' of ↵	Dave Chinner	4	-140/+239
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: clean up memory management in xattr scrub [v24.5] Currently, the extended attribute scrubber uses a single VLA to store all the context information needed in various parts of the scrubber code. This includes xattr leaf block space usage bitmaps, and the value buffer used to check the correctness of remote xattr value block headers. We try to minimize the insanity through the use of helper functions, but this is a memory management nightmare. Clean this up by making the bitmap and value pointers explicit members of struct xchk_xattr_buf. Second, strengthen the xattr checking by teaching it to look for overlapping data structures in the shortform attr data. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'scrub-detect-mergeable-records-6.4_2023-04-11' of ↵	Dave Chinner	3	-3/+196
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: detect mergeable and overlapping btree records [v24.5] While I was doing differential fuzz analysis between xfs_scrub and xfs_repair, I noticed that xfs_repair was only partially effective at detecting btree records that can be merged, and xfs_scrub totally didn't notice at all. For every interval btree type except for the bmbt, there should never exist two adjacent records with adjacent keyspaces because the blockcount field is always large enough to span the entire keyspace of the domain. This is because the free space, rmap, and refcount btrees have a blockcount field large enough to store the maximum AG length, and there can never be an allocation larger than an AG. The bmbt is a different story due to its ondisk encoding where the blockcount is only 21 bits wide. Because AGs can span up to 2^31 blocks and the RT volume can span up to 2^52 blocks, a preallocation of 2^22 blocks will be expressed as two records of 2^21 length. We don't opportunistically combine records when doing bmbt operations, which is why the fsck tools have never complained about this scenario. Offline repair is partially effective at detecting mergeable records because I taught it to do that for the rmap and refcount btrees. This series enhances the free space, rmap, and refcount scrubbers to detect mergeable records. For the bmbt, it will flag the file as being eligible for an optimization to shrink the size of the data structure. The last patch in this set also enhances the rmap scrubber to detect records that overlap incorrectly. This check is done automatically for non-overlapping btree types, but we have to do it separately for the rmapbt because there are constraints on which allocation types are allowed to overlap. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'scrub-merge-bmap-records-6.4_2023-04-12' of ↵	Dave Chinner	2	-136/+239
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: merge bmap records for faster scrubs [v24.5] I started looking into performance problems with the data fork scrubber in generic/333, and noticed a few things that needed improving. First, due to design reasons, it's possible for file forks btrees to contain multiple contiguous mappings to the same physical space. Instead of checking each ondisk mapping individually, it's much faster to combine them when possible and check the combined mapping because that's fewer trips through the rmap btree, and we can drop this check-around behavior that it does when an rmapbt lookup produces a record that starts before or ends after a particular bmbt mapping. Second, I noticed that the bmbt scrubber decides to walk every reverse mapping in the filesystem if the file fork is in btree format. This is very costly, and only necessary if the inode repair code had to zap a fork to convince iget to work. Constraining the full-rmap scan to this one case means we can skip it for normal files, which drives the runtime of this test from 8 hours down to 45 minutes (observed with realtime reflink and rebuild-all mode.) Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'scrub-iget-fixes-6.4_2023-04-12' of ↵	Dave Chinner	9	-101/+438
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: fix iget/irele usage in online fsck [v24.5] This patchset fixes a handful of problems relating to how we get and release incore inodes in the online scrub code. The first patch fixes how we handle DONTCACHE -- our reasons for setting (or clearing it) depend entirely on the runtime environment at irele time. Hence we can refactor iget and irele to use our own wrappers that set that context appropriately. The second patch fixes a race between the iget call in the inode core scrubber and other writer threads that are allocating or freeing inodes in the same AG by changing the behavior of xchk_iget (and the inode core scrub setup function) to return either an incore inode or the AGI buffer so that we can be sure that the inode cannot disappear on us. The final patch elides MMAPLOCK from scrub paths when possible. It did not fit anywhere else. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'scrub-parent-fixes-6.4_2023-04-12' of ↵	Dave Chinner	3	-168/+71
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: fix bugs in parent pointer checking [v24.5] Jan Kara pointed out that the VFS doesn't take i_rwsem of a child subdirectory that is being moved from one parent to another. Upon deeper analysis, I realized that this was the source of a very hard to trigger false corruption report in the parent pointer checking code. Now that we've refactored how directory walks work in scrub, we can also get rid of all the unnecessary and broken locking to make parent pointer scrubbing work properly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'scrub-dir-iget-fixes-6.4_2023-04-12' of ↵	Dave Chinner	5	-217/+497
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: fix iget usage in directory scrub [v24.5] In this series, we fix some problems with how the directory scrubber grabs child inodes. First, we want to reduce EDEADLOCK returns by replacing fixed-iteration loops with interruptible trylock loops. Second, we add UNTRUSTED to the child iget call so that we can detect a dirent that points to an unallocated inode. Third, we fix a bug where we weren't checking the inode pointed to by dotdot entries at all. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'scrub-detect-rmapbt-gaps-6.4_2023-04-11' of ↵	Dave Chinner	9	-94/+195
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: detect incorrect gaps in rmap btree [v24.5] Following in the theme of the last two patchsets, this one strengthens the rmap btree record checking so that scrub can count the number of space records that map to a given owner and that do not map to a given owner. This enables us to determine exclusive ownership of space that can't be shared. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'scrub-detect-inobt-gaps-6.4_2023-04-11' of ↵	Dave Chinner	3	-88/+269
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: detect incorrect gaps in inode btree [v24.5] This series continues the corrections for a couple of problems I found in the inode btree scrubber. The first problem is that we don't directly check the inobt records have a direct correspondence with the finobt records, and vice versa. The second problem occurs on filesystems with sparse inode chunks -- the cross-referencing we do detects sparseness, but it doesn't actually check the consistency between the inobt hole records and the rmap data. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'scrub-detect-refcount-gaps-6.4_2023-04-11' of ↵	Dave Chinner	23	-129/+610
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: detect incorrect gaps in refcount btree [v24.5] The next few patchsets address a deficiency in scrub that I found while QAing the refcount btree scrubber. If there's a gap between refcount records, we need to cross-reference that gap with the reverse mappings to ensure that there are no overlapping records in the rmap btree. If we find any, then the refcount btree is not consistent. This is not a property that is specific to the refcount btree; they all need to have this sort of keyspace scanning logic to detect inconsistencies. To do this accurately, we need to be able to scan the keyspace of a btree (which we already do) to be able to tell the caller if the keyspace is empty, sparse, or fully covered by records. The first few patches add the keyspace scanner to the generic btree code, along with the ability to mask off parts of btree keys because when we scan the rmapbt, we only care about space usage, not the owners. The final patch closes the scanning gap in the refcountbt scanner. v23.1: create helpers for the key extraction and comparison functions, improve documentation, and eliminate the ->mask_key indirect calls Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'scrub-btree-key-enhancements-6.4_2023-04-11' of ↵	Dave Chinner	2	-8/+63
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: enhance btree key scrubbing [v24.5] This series fixes the scrub btree block checker to ensure that the keys in the parent block accurately represent the block, and check the ordering of all interior key records. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'rmap-btree-fix-key-handling-6.4_2023-04-11' of ↵	Dave Chinner	4	-10/+95
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: fix rmap btree key flag handling [v24.5] This series fixes numerous flag handling bugs in the rmapbt key code. The most serious transgression is that key comparisons completely strip out all flag bits from rm_offset, including the ones that participate in record lookups. The second problem is that for years we've been letting the unwritten flag (which is an attribute of a specific record and not part of the record key) escape from leaf records into key records. The solution to the second problem is to filter attribute flags when creating keys from records, and the solution to the first problem is to preserve only the flags used for key lookups. The ATTR and BMBT flags are a part of the lookup key, and the UNWRITTEN flag is a record attribute. This has worked for years without generating user complaints because ATTR and BMBT extents cannot be shared, so key comparisons succeed solely on rm_startblock. Only file data fork extents can be shared, and those records never set any of the three flag bits, so comparisons that dig into rm_owner and rm_offset work just fine. A filesystem written with an unpatched kernel and mounted on a patched kernel will work correctly because the ATTR/BMBT flags have been conveyed into keys correctly all along, and we still ignore the UNWRITTEN flag in any key record. This was what doomed my previous attempt to correct this problem in 2019. A filesystem written with a patched kernel and mounted on an unpatched kernel will also work correctly because unpatched kernels ignore all flags. With this patchset applied, the scrub code gains the ability to detect rmap btrees with incorrectly set attr and bmbt flags in the key records. After three years of testing, I haven't encountered any problems. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'btree-hoist-scrub-checks-6.4_2023-04-11' of ↵	Dave Chinner	4	-28/+31
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: hoist scrub record checks into libxfs [v24.5] There are a few things about btree records that scrub checked but the libxfs _get_rec functions didn't. Move these bits into libxfs so that everyone can benefit. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'btree-complain-bad-records-6.4_2023-04-11' of ↵	Dave Chinner	18	-198/+303
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: standardize btree record checking code [v24.5] While I was cleaning things up for 6.1, I noticed that the btree _query_range and _query_all functions don't perform the same checking that the _get_rec functions perform. In fact, they don't perform /any/ sanity checking, which means that callers aren't warned about impossible records. Therefore, hoist the record validation and complaint logging code into separate functions, and call them from any place where we convert an ondisk record into an incore record. For online scrub, we can replace checking code with a call to the record checking functions in libxfs, thereby reducing the size of the codebase. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'scrub-drain-intents-6.4_2023-04-11' of ↵	Dave Chinner	31	-39/+680
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: drain deferred work items when scrubbing [v24.5] The design doc for XFS online fsck contains a long discussion of the eventual consistency models in use for XFS metadata. In that chapter, we note that it is possible for scrub to collide with a chain of deferred space metadata updates, and proposes a lightweight solution: The use of a pending-intents counter so that scrub can wait for the system to drain all chains. This patchset implements that scrub drain. The first patch implements the basic mechanism, and the subsequent patches reduce the runtime overhead by converting the implementation to use sloppy counters and introducing jump labels to avoid walking into scrub hooks when it isn't running. This last paradigm repeats elsewhere in this megaseries. v23.1: make intent items take an active ref to the perag structure and document why we bump and drop the intent counts when we do Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'scrub-fix-legalese-6.4_2023-04-11' of ↵	Dave Chinner	33	-97/+97
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs_scrub: fix licensing and copyright notices [v24.5] Fix various attribution problems in the xfs_scrub source code, such as the author's contact information, out of date SPDX tags, and a rough estimate of when the feature was under heavy development. The most egregious parts are the files that are missing license information completely. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'pass-perag-refs-6.4_2023-04-11' of ↵	Dave Chinner	9	-20/+22
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: pass perag references around when possible [v24.5] Avoid the cost of perag radix tree lookups by passing around active perag references when possible. v24.2: rework some of the naming and whatnot so there's less opencoding Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14	Merge tag 'intents-perag-refs-6.4_2023-04-11' of ↵	Dave Chinner	16	-85/+196
	git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: make intent items take a perag reference [v24.5] Now that we've cleaned up some code warts in the deferred work item processing code, let's make intent items take an active perag reference from their creation until they are finally freed by the defer ops machinery. This change facilitates the scrub drain in the next patchset and will make it easier for the future AG removal code to detect a busy AG in need of quiescing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-13	xfs: simplify two-level sysctl registration for xfs_table	Luis Chamberlain	1	-19/+1
	There is no need to declare two tables to just create directories, this can be easily be done with a prefix path with register_sysctl(). Simplify this registration. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>