summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2023-02-13xfs: introduce xfs_alloc_vextent_prepare()Dave Chinner1-44/+76
Now that we have wrapper functions for each type of allocation we can ask for, we can start unravelling xfs_alloc_ag_vextent(). That is essentially just a prepare stage, the allocation multiplexer and a post-allocation accounting step is the allocation proceeded. The current xfs_alloc_vextent*() wrappers all have a prepare stage, the allocation operation and a post-allocation accounting step. We can consolidate this by moving the AG alloc prep code into the wrapper functions, the accounting code in the wrapper accounting functions, and cut out the multiplexer layer entirely. This patch consolidates the AG preparation stage. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: introduce xfs_alloc_vextent_exact_bno()Dave Chinner6-26/+72
Two of the callers to xfs_alloc_vextent_this_ag() actually want exact block number allocation, not anywhere-in-ag allocation. Split this out from _this_ag() as a first class citizen so no external extent allocation code needs to care about args->type anymore. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: introduce xfs_alloc_vextent_near_bno()Dave Chinner6-54/+55
The remaining callers of xfs_alloc_vextent() are all doing NEAR_BNO allocations. We can replace that function with a new xfs_alloc_vextent_near_bno() function that does this explicitly. We also multiplex NEAR_BNO allocations through xfs_alloc_vextent_this_ag via args->type. Replace all of these with direct calls to xfs_alloc_vextent_near_bno(), too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use xfs_alloc_vextent_start_bno() where appropriateDave Chinner4-38/+51
Change obvious callers of single AG allocation to use xfs_alloc_vextent_start_bno(). Callers no long need to specify XFS_ALLOCTYPE_START_BNO, and so the type can be driven inward and removed. While doing this, also pass the allocation target fsb as a parameter rather than encoding it in args->fsbno. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use xfs_alloc_vextent_first_ag() where appropriateDave Chinner3-32/+42
Change obvious callers of single AG allocation to use xfs_alloc_vextent_first_ag(). This gets rid of XFS_ALLOCTYPE_FIRST_AG as the type used within xfs_alloc_vextent_first_ag() during iteration is _THIS_AG. Hence we can remove the setting of args->type from all the callers of _first_ag() and remove the alloctype. While doing this, pass the allocation target fsb as a parameter rather than encoding it in args->fsbno. This starts the process of making args->fsbno an output only variable rather than input/output. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: factor xfs_bmap_btalloc()Dave Chinner1-137/+196
There are several different contexts xfs_bmap_btalloc() handles, and large chunks of the code execute independent allocation contexts. Try to untangle this mess a bit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use xfs_alloc_vextent_this_ag() where appropriateDave Chinner9-53/+74
Change obvious callers of single AG allocation to use xfs_alloc_vextent_this_ag(). Drive the per-ag grabbing out to the callers, too, so that callers with active references don't need to do new lookups just for an allocation in a context that already has a perag reference. The only remaining caller that does single AG allocation through xfs_alloc_vextent() is xfs_bmap_btalloc() with XFS_ALLOCTYPE_NEAR_BNO. That is going to need more untangling before it can be converted cleanly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: combine __xfs_alloc_vextent_this_ag and xfs_alloc_ag_vextentDave Chinner1-30/+35
There's a bit of a recursive conundrum around xfs_alloc_ag_vextent(). We can't first call xfs_alloc_ag_vextent() without preparing the AGFL for the allocation, and preparing the AGFL calls xfs_alloc_ag_vextent() to prepare the AGFL for the allocation. This "double allocation" requirement is not really clear from the current xfs_alloc_fix_freelist() calls that are sprinkled through the allocation code. It's not helped that xfs_alloc_ag_vextent() can actually allocate from the AGFL itself, but there's special code to prevent AGFL prep allocations from allocating from the free list it's trying to prep. The naming is also not consistent: args->wasfromfl is true when we allocated _from_ the free list, but the indication that we are allocating _for_ the free list is via checking that (args->resv == XFS_AG_RESV_AGFL). So, lets make this "allocation required for allocation" situation clear by moving it all inside xfs_alloc_ag_vextent(). The freelist allocation is a specific XFS_ALLOCTYPE_THIS_AG allocation, which translated directly to xfs_alloc_ag_vextent_size() allocation. This enables us to replace __xfs_alloc_vextent_this_ag() with a call to xfs_alloc_ag_vextent(), and we drive the freelist fixing further into the per-ag allocation algorithm. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags()Dave Chinner1-24/+26
The core of the per-ag iteration is effectively doing a "this ag" allocation on one AG at a time. Use the same code to implement the core "this ag" allocation in both xfs_alloc_vextent_this_ag() and xfs_alloc_vextent_iterate_ags(). This means we only call xfs_alloc_ag_vextent() from one place so we can easily collapse the call stack in future patches. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: rework xfs_alloc_vextent()Dave Chinner1-179/+285
It's a multiplexing mess that can be greatly simplified, and really needs to be simplified to allow active per-ag references to propagate from initial AG selection code the the bmapi code. This splits the code out into separate a parameter checking function, an iterator function, and allocation completion functions and then implements the individual policies using these functions. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: introduce xfs_for_each_perag_wrap()Dave Chinner3-50/+105
In several places we iterate every AG from a specific start agno and wrap back to the first AG when we reach the end of the filesystem to continue searching. We don't have a primitive for this iteration yet, so add one for conversion of these algorithms to per-ag based iteration. The filestream AG select code is a mess, and this initially makes it worse. The per-ag selection needs to be driven completely into the filestream code to clean this up and it will be done in a future patch that makes the filestream allocator use active per-ag references correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: perags need atomic operational stateDave Chinner13-65/+101
We currently don't have any flags or operational state in the xfs_perag except for the pagf_init and pagi_init flags. And the agflreset flag. Oh, there's also the pagf_metadata and pagi_inodeok flags, too. For controlling per-ag operations, we are going to need some atomic state flags. Hence add an opstate field similar to what we already have in the mount and log, and convert all these state flags across to atomic bit operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: convert xfs_ialloc_next_ag() to an atomicDave Chinner4-20/+4
This is currently a spinlock lock protected rotor which can be implemented with a single atomic operation. Change it to be more efficient and get rid of the m_agirotor_lock. Noticed while converting the inode allocation AG selection loop to active perag references. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: inobt can use perags in many more places than it doesDave Chinner7-56/+47
Lots of code in the inobt infrastructure is passed both xfs_mount and perags. We only need perags for the per-ag inode allocation code, so reduce the duplication by passing only the perags as the primary object. This ends up reducing the code size by a bit: text data bss dec hex filename orig 1138878 323979 548 1463405 16546d (TOTALS) patched 1138709 323979 548 1463236 1653c4 (TOTALS) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use active perag references for inode allocationDave Chinner3-35/+33
Convert the inode allocation routines to use active perag references or references held by callers rather than grab their own. Also drive the perag further inwards to replace xfs_mounts when doing operations on a specific AG. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: convert xfs_imap() to take a peragDave Chinner4-35/+28
Callers have referenced perags but they don't pass it into xfs_imap() so it takes it's own reference. Fix that so we can change inode allocation over to using active references. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: rework the perag trace points to be perag centricDave Chinner3-28/+22
So that they all output the same information in the traces to make debugging refcount issues easier. This means that all the lookup/drop functions no longer need to use the full memory barrier atomic operations (atomic*_return()) so will have less overhead when tracing is off. The set/clear tag tracepoints no longer abuse the reference count to pass the tag - the tag being cleared is obvious from the _RET_IP_ that is recorded in the trace point. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: active perag reference countingDave Chinner9-19/+105
We need to be able to dynamically remove instantiated AGs from memory safely, either for shrinking the filesystem or paging AG state in and out of memory (e.g. supporting millions of AGs). This means we need to be able to safely exclude operations from accessing perags while dynamic removal is in progress. To do this, introduce the concept of active and passive references. Active references are required for high level operations that make use of an AG for a given operation (e.g. allocation) and pin the perag in memory for the duration of the operation that is operating on the perag (e.g. transaction scope). This means we can fail to get an active reference to an AG, hence callers of the new active reference API must be able to handle lookup failure gracefully. Passive references are used in low level code, where we might need to access the perag structure for the purposes of completing high level operations. For example, buffers need to use passive references because: - we need to be able to do metadata IO during operations like grow and shrink transactions where high level active references to the AG have already been blocked - buffers need to pin the perag until they are reclaimed from memory, something that high level code has no direct control over. - unused cached buffers should not prevent a shrink from being started. Hence we have active references that will form exclusion barriers for operations to be performed on an AG, and passive references that will prevent reclaim of the perag until all objects with passive references have been reclaimed themselves. This patch introduce xfs_perag_grab()/xfs_perag_rele() as the API for active AG reference functionality. We also need to convert the for_each_perag*() iterators to use active references, which will start the process of converting high level code over to using active references. Conversion of non-iterator based code to active references will be done in followup patches. Note that the implementation using reference counting is really just a development vehicle for the API to ensure we don't have any leaks in the callers. Once we need to remove perag structures from memory dyanmically, we will need a much more robust per-ag state transition mechanism for preventing new references from being taken while we wait for existing references to drain before removal from memory can occur.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10xfs: don't assert fail on transaction cancel with deferred opsDave Chinner1-2/+2
We can error out of an allocation transaction when updating BMBT blocks when things go wrong. This can be a btree corruption, and unexpected ENOSPC, etc. In these cases, we already have deferred ops queued for the first allocation that has been done, and we just want to cancel out the transaction and shut down the filesystem on error. In fact, we do just that for production systems - the assert that we can't have a transaction with defer ops attached unless we are already shut down is bogus and gets in the way of debugging whatever issue is actually causing the transaction to be cancelled. Remove the assert because it is causing spurious test failures to hang test machines. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10xfs: t_firstblock is tracking AGs not blocksDave Chinner10-23/+21
The tp->t_firstblock field is now raelly tracking the highest AG we have locked, not the block number of the highest allocation we've made. It's purpose is to prevent AGF locking deadlocks, so rename it to "highest AG" and simplify the implementation to just track the agno rather than a fsbno. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10xfs: drop firstblock constraints from allocation setupDave Chinner2-132/+66
Now that xfs_alloc_vextent() does all the AGF deadlock prevention filtering for multiple allocations in a single transaction, we no longer need the allocation setup code to care about what AGs we might already have locked. Hence we can remove all the "nullfb" conditional logic in places like xfs_bmap_btalloc() and instead have them focus simply on setting up locality constraints. If the allocation fails due to AGF lock filtering in xfs_alloc_vextent, then we just fall back as we normally do to more relaxed allocation constraints. As a result, any allocation that allows AG scanning (i.e. not confined to a single AG) and does not force a worst case full filesystem scan will now be able to attempt allocation from AGs lower than that defined by tp->t_firstblock. This is because xfs_alloc_vextent() allows try-locking of the AGFs and hence enables low space algorithms to at least -try- to get space from AGs lower than the one that we have currently locked and allocated from. This is a significant improvement in the low space allocation algorithm. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10xfs: block reservation too large for minleft allocationDave Chinner3-11/+12
When we enter xfs_bmbt_alloc_block() without having first allocated a data extent (i.e. tp->t_firstblock == NULLFSBLOCK) because we are doing something like unwritten extent conversion, the transaction block reservation is used as the minleft value. This works for operations like unwritten extent conversion, but it assumes that the block reservation is only for a BMBT split. THis is not always true, and sometimes results in larger than necessary minleft values being set. We only actually need enough space for a btree split, something we already handle correctly in xfs_bmapi_write() via the xfs_bmapi_minleft() calculation. We should use xfs_bmapi_minleft() in xfs_bmbt_alloc_block() to calculate the number of blocks a BMBT split on this inode is going to require, not use the transaction block reservation that contains the maximum number of blocks this transaction may consume in it... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10xfs: prefer free inodes at ENOSPC over chunk allocationDave Chinner1-0/+17
When an XFS filesystem has free inodes in chunks already allocated on disk, it will still allocate new inode chunks if the target AG has no free inodes in it. Normally, this is a good idea as it preserves locality of all the inodes in a given directory. However, at ENOSPC this can lead to using the last few remaining free filesystem blocks to allocate a new chunk when there are many, many free inodes that could be allocated without consuming free space. This results in speeding up the consumption of the last few blocks and inode create operations then returning ENOSPC when there free inodes available because we don't have enough block left in the filesystem for directory creation reservations to proceed. Hence when we are near ENOSPC, we should be attempting to preserve the remaining blocks for directory block allocation rather than using them for unnecessary inode chunk creation. This particular behaviour is exposed by xfs/294, when it drives to ENOSPC on empty file creation whilst there are still thousands of free inodes available for allocation in other AGs in the filesystem. Hence, when we are within 1% of ENOSPC, change the inode allocation behaviour to prefer to use existing free inodes over allocating new inode chunks, even though it results is poorer locality of the data set. It is more important for the allocations to be space efficient near ENOSPC than to have optimal locality for performance, so lets modify the inode AG selection code to reflect that fact. This allows generic/294 to not only pass with this allocator rework patchset, but to increase the number of post-ENOSPC empty inode allocations to from ~600 to ~9080 before we hit ENOSPC on the directory create transaction reservation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10xfs: fix low space alloc deadlockDave Chinner3-26/+58
I've recently encountered an ABBA deadlock with g/476. The upcoming changes seem to make this much easier to hit, but the underlying problem is a pre-existing one. Essentially, if we select an AG for allocation, then lock the AGF and then fail to allocate for some reason (e.g. minimum length requirements cannot be satisfied), then we drop out of the allocation with the AGF still locked. The caller then modifies the allocation constraints - usually loosening them up - and tries again. This can result in trying to access AGFs that are lower than the AGF we already have locked from the failed attempt. e.g. the failed attempt skipped several AGs before failing, so we have locks an AG higher than the start AG. Retrying the allocation from the start AG then causes us to violate AGF lock ordering and this can lead to deadlocks. The deadlock exists even if allocation succeeds - we can do a followup allocations in the same transaction for BMBT blocks that aren't guaranteed to be in the same AG as the original, and can move into higher AGs. Hence we really need to move the tp->t_firstblock tracking down into xfs_alloc_vextent() where it can be set when we exit with a locked AG. xfs_alloc_vextent() can also check there if the requested allocation falls within the allow range of AGs set by tp->t_firstblock. If we can't allocate within the range set, we have to fail the allocation. If we are allowed to to non-blocking AGF locking, we can ignore the AG locking order limitations as we can use try-locks for the first iteration over requested AG range. This invalidates a set of post allocation asserts that check that the allocation is always above tp->t_firstblock if it is set. Because we can use try-locks to avoid the deadlock in some circumstances, having a pre-existing locked AGF doesn't always prevent allocation from lower order AGFs. Hence those ASSERTs need to be removed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10xfs: revert commit 8954c44ff477Darrick J. Wong1-1/+3
The name passed into __xfs_xattr_put_listent is exactly namelen bytes long and not null-terminated. Passing namelen+1 to the strscpy function strscpy(offset, (char *)name, namelen + 1); is therefore wrong. Go back to the old code, which works fine because strncpy won't find a null in @name and stops after namelen bytes. It really could be a memcpy call, but it worked for years. Reported-by: syzbot+898115bc6d7140437215@syzkaller.appspotmail.com Fixes: 8954c44ff477 ("xfs: use strscpy() to instead of strncpy()") Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10xfs: make kobj_type structures constantThomas Weißschuh3-12/+12
Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definitions to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10xfs: allow setting full range of panic tagsDonald Douwsma2-2/+13
xfs will not allow combining other panic masks with XFS_PTAG_VERIFIER_ERROR. # sysctl fs.xfs.panic_mask=511 sysctl: setting key "fs.xfs.panic_mask": Invalid argument fs.xfs.panic_mask = 511 Update to the maximum value that can be set to allow the full range of masks. Do this using a mask of possible values to prevent this happening again as suggested by Darrick. Fixes: d519da41e2b7 ("xfs: Introduce XFS_PTAG_VERIFIER_ERROR panic mask") Signed-off-by: Donald Douwsma <ddouwsma@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-02-05xfs: don't use BMBT btree split workers for IO completionDave Chinner1-2/+16
When we split a BMBT due to record insertion, we offload it to a worker thread because we can be deep in the stack when we try to allocate a new block for the BMBT. Allocation can use several kilobytes of stack (full memory reclaim, swap and/or IO path can end up on the stack during allocation) and we can already be several kilobytes deep in the stack when we need to split the BMBT. A recent workload demonstrated a deadlock in this BMBT split offload. It requires several things to happen at once: 1. two inodes need a BMBT split at the same time, one must be unwritten extent conversion from IO completion, the other must be from extent allocation. 2. there must be a no available xfs_alloc_wq worker threads available in the worker pool. 3. There must be sustained severe memory shortages such that new kworker threads cannot be allocated to the xfs_alloc_wq pool for both threads that need split work to be run 4. The split work from the unwritten extent conversion must run first. 5. when the BMBT block allocation runs from the split work, it must loop over all AGs and not be able to either trylock an AGF successfully, or each AGF is is able to lock has no space available for a single block allocation. 6. The BMBT allocation must then attempt to lock the AGF that the second task queued to the rescuer thread already has locked before it finds an AGF it can allocate from. At this point, we have an ABBA deadlock between tasks queued on the xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task holding the AGF lock can't be run by the rescuer thread until the task the rescuer thread is runing gets the AGF lock.... This is a highly improbably series of events, but there it is. There's a couple of ways to fix this, but the easiest way to ensure that we only punt tasks with a locked AGF that holds enough space for the BMBT block allocations to the worker thread. This works for unwritten extent conversion in IO completion (which doesn't have a locked AGF and space reservations) because we have tight control over the IO completion stack. It is typically only 6 functions deep when xfs_btree_split() is called because we've already offloaded the IO completion work to a worker thread and hence we don't need to worry about stack overruns here. The other place we can be called for a BMBT split without a preceeding allocation is __xfs_bunmapi() when punching out the center of an existing extent. We don't remove extents in the IO path, so these operations don't tend to be called with a lot of stack consumed. Hence we don't really need to ship the split off to a worker thread in these cases, either. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05xfs: fix confusing variable names in xfs_refcount_item.cDarrick J. Wong1-27/+27
Variable names in this code module are inconsistent and confusing. xfs_phys_extent describe physical mappings, so rename them "pmap". xfs_refcount_intents describe refcount intents, so rename them "ri". Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05xfs: pass refcount intent directly through the log intent codeDarrick J. Wong4-103/+74
Pass the incore refcount intent through the CUI logging code instead of repeatedly boxing and unboxing parameters. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05xfs: fix confusing variable names in xfs_rmap_item.cDarrick J. Wong1-39/+40
Variable names in this code module are inconsistent and confusing. xfs_map_extent describe file mappings, so rename them "map". xfs_rmap_intents describe block mapping intents, so rename them "ri". Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05xfs: pass rmap space mapping directly through the log intent codeDarrick J. Wong3-66/+55
Pass the incore rmap space mapping through the RUI logging code instead of repeatedly boxing and unboxing parameters. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05xfs: fix confusing xfs_extent_item variable namesDarrick J. Wong2-51/+51
Change the name of all pointers to xfs_extent_item structures to "xefi" to make the name consistent and because the current selections ("new" and "free") mean other things in C. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05xfs: pass xfs_extent_free_item directly through the log intent codeDarrick J. Wong1-25/+30
Pass the incore xfs_extent_free_item through the EFI logging code instead of repeatedly boxing and unboxing parameters. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05xfs: fix confusing variable names in xfs_bmap_item.cDarrick J. Wong1-28/+28
Variable names in this code module are inconsistent and confusing. xfs_map_extent describe file mappings, so rename them "map". xfs_bmap_intents describe block mapping intents, so rename them "bi". Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05xfs: pass the xfs_bmbt_irec directly through the log intent codeDarrick J. Wong3-72/+46
Instead of repeatedly boxing and unboxing the incore extent mapping structure as it passes through the BUI code, pass the pointer directly through. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05xfs: use strscpy() to instead of strncpy()Xu Panda1-3/+1
The implementation of strscpy() is more robust and safer. That's now the recommended way to copy NUL-terminated strings. Signed-off-by: Xu Panda <xu.panda@zte.com.cn> Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-01-28Merge tag '6.2-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds8-10/+46
Pull ksmbd server fixes from Steve French: "Four smb3 server fixes, all also for stable: - fix for signing bug - fix to more strictly check packet length - add a max connections parm to limit simultaneous connections - fix error message flood that can occur with newer Samba xattr format" * tag '6.2-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: downgrade ndr version error message to debug ksmbd: limit pdu length size according to connection status ksmbd: do not sign response to session request for guest login ksmbd: add max connections parameter
2023-01-28Merge tag '6.2-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds1-0/+1
Pull cifs fix from Steve French: "Fix for reconnect oops in smbdirect (RDMA), also is marked for stable" * tag '6.2-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: Fix oops due to uncleared server->smbd_conn in reconnect
2023-01-28Merge tag 'ovl-fixes-6.2-rc6' of ↵Linus Torvalds1-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "Fix two bugs, a recent one introduced in the last cycle, and an older one from v5.11" * tag 'ovl-fixes-6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: fail on invalid uid/gid mapping at copy up ovl: fix tmpfile leak
2023-01-27ovl: fail on invalid uid/gid mapping at copy upMiklos Szeredi1-0/+4
If st_uid/st_gid doesn't have a mapping in the mounter's user_ns, then copy-up should fail, just like it would fail if the mounter task was doing the copy using "cp -a". There's a corner case where the "cp -a" would succeed but copy up fail: if there's a mapping of the invalid uid/gid (65534 by default) in the user namespace. This is because stat(2) will return this value if the mapping doesn't exist in the current user_ns and "cp -a" will in turn be able to create a file with this uid/gid. This behavior would be inconsistent with POSIX ACL's, which return -1 for invalid uid/gid which result in a failed copy. For consistency and simplicity fail the copy of the st_uid/st_gid are invalid. Fixes: 459c7c565ac3 ("ovl: unprivieged mounts") Cc: <stable@vger.kernel.org> # v5.11 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Seth Forshee <sforshee@kernel.org>
2023-01-27ovl: fix tmpfile leakMiklos Szeredi1-1/+1
Missed an error cleanup. Reported-by: syzbot+fd749a7ea127a84e0ffd@syzkaller.appspotmail.com Fixes: 2b1a77461f16 ("ovl: use vfs_tmpfile_open() helper") Cc: <stable@vger.kernel.org> # v6.1 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-01-26ksmbd: downgrade ndr version error message to debugNamjae Jeon1-4/+4
When user switch samba to ksmbd, The following message flood is coming when accessing files. Samba seems to changs dos attribute version to v5. This patch downgrade ndr version error message to debug. $ dmesg ... [68971.766914] ksmbd: v5 version is not supported [68971.779808] ksmbd: v5 version is not supported [68971.871544] ksmbd: v5 version is not supported [68971.910135] ksmbd: v5 version is not supported ... Cc: stable@vger.kernel.org Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-01-26ksmbd: limit pdu length size according to connection statusNamjae Jeon2-4/+18
Stream protocol length will never be larger than 16KB until session setup. After session setup, the size of requests will not be larger than 16KB + SMB2 MAX WRITE size. This patch limits these invalidly oversized requests and closes the connection immediately. Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers") Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-18259 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-01-25Merge tag 'fs.fuse.acl.v6.2-rc6' of ↵Linus Torvalds5-74/+78
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull fuse ACL fix from Christian Brauner: "The new posix acl API doesn't depend on the xattr handler infrastructure anymore and instead only relies on the posix acl inode operations. As a result daemons without FUSE_POSIX_ACL are unable to use posix acls like they used to. Fix this by copying what we did for overlayfs during the posix acl api conversion. Make fuse implement a dedicated ->get_inode_acl() method as does overlayfs. Fuse can then also uses this to express different needs for vfs permission checking during lookup and acl based retrieval via the regular system call path. This allows fuse to continue to refuse retrieving posix acls for daemons that don't set FUSE_POSXI_ACL for permission checking while also allowing a fuse server to retrieve it via the usual system calls" * tag 'fs.fuse.acl.v6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: fuse: fixes after adapting to new posix acl api
2023-01-25cifs: Fix oops due to uncleared server->smbd_conn in reconnectDavid Howells1-0/+1
In smbd_destroy(), clear the server->smbd_conn pointer after freeing the smbd_connection struct that it points to so that reconnection doesn't get confused. Fixes: 8ef130f9ec27 ("CIFS: SMBD: Implement function to destroy a SMB Direct connection") Cc: stable@vger.kernel.org Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Long Li <longli@microsoft.com> Cc: Pavel Shilovsky <piastryyy@gmail.com> Cc: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-01-24Merge tag 'nfsd-6.2-5' of ↵Linus Torvalds1-25/+36
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fix from Chuck Lever: - Nail another UAF in NFSD's filecache * tag 'nfsd-6.2-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: don't free files unconditionally in __nfsd_file_cache_purge
2023-01-24ext4: make xattr char unsignedness in hash explicitLinus Torvalds1-5/+6
Commit f3bbac32475b ("ext4: deal with legacy signed xattr name hash values") added a hashing function for the legacy case of having the xattr hash calculated using a signed 'char' type. It left the unsigned case alone, since it's all implicitly handled by the '-funsigned-char' compiler option. However, there's been some noise about back-porting it all into stable kernels that lack the '-funsigned-char', so let's just make that at least possible by making the whole 'this uses unsigned char' very explicit in the code itself. Whether such a back-port is really warranted or not, I'll leave to others, but at least together with this change it is technically sensible. Also, add a 'pr_warn_once()' for reporting the "hey, signedness for this hash calculation has changed" issue. Hopefully it never triggers except for that xfstests generic/454 test-case, but even if it does it's just good information to have. If for no other reason than "we can remove the legacy signed hash code entirely if nobody ever sees the message any more". Cc: Sasha Levin <sashal@kernel.org> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Andreas Dilger <adilger@dilger.ca> Cc: Theodore Ts'o <tytso@mit.edu>, Cc: Jason Donenfeld <Jason@zx2c4.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-01-24fuse: fixes after adapting to new posix acl apiChristian Brauner5-74/+78
This cycle we ported all filesystems to the new posix acl api. While looking at further simplifications in this area to remove the last remnants of the generic dummy posix acl handlers we realized that we regressed fuse daemons that don't set FUSE_POSIX_ACL but still make use of posix acls. With the change to a dedicated posix acl api interacting with posix acls doesn't go through the old xattr codepaths anymore and instead only relies the get acl and set acl inode operations. Before this change fuse daemons that don't set FUSE_POSIX_ACL were able to get and set posix acl albeit with two caveats. First, that posix acls aren't cached. And second, that they aren't used for permission checking in the vfs. We regressed that use-case as we currently refuse to retrieve any posix acls if they aren't enabled via FUSE_POSIX_ACL. So older fuse daemons would see a change in behavior. We can restore the old behavior in multiple ways. We could change the new posix acl api and look for a dedicated xattr handler and if we find one prefer that over the dedicated posix acl api. That would break the consistency of the new posix acl api so we would very much prefer not to do that. We could introduce a new ACL_*_CACHE sentinel that would instruct the vfs permission checking codepath to not call into the filesystem and ignore acls. But a more straightforward fix for v6.2 is to do the same thing that Overlayfs does and give fuse a separate get acl method for permission checking. Overlayfs uses this to express different needs for vfs permission lookup and acl based retrieval via the regular system call path as well. Let fuse do the same for now. This way fuse can continue to refuse to retrieve posix acls for daemons that don't set FUSE_POSXI_ACL for permission checking while allowing a fuse server to retrieve it via the usual system calls. In the future, we could extend the get acl inode operation to not just pass a simple boolean to indicate rcu lookup but instead make it a flag argument. Then in addition to passing the information that this is an rcu lookup to the filesystem we could also introduce a flag that tells the filesystem that this is a request from the vfs to use these acls for permission checking. Then fuse could refuse the get acl request for permission checking when the daemon doesn't have FUSE_POSIX_ACL set in the same get acl method. This would also help Overlayfs and allow us to remove the second method for it as well. But since that change is more invasive as we need to update the get acl inode operation for multiple filesystems we should not do this as a fix for v6.2. Instead we will do this for the v6.3 merge window. Fwiw, since posix acls are now always correctly translated in the new posix acl api we could also allow them to be used for daemons without FUSE_POSIX_ACL that are not mounted on the host. But this is behavioral change and again if dones should be done for v6.3. For now, let's just restore the original behavior. A nice side-effect of this change is that for fuse daemons with and without FUSE_POSIX_ACL the same code is used for posix acls in a backwards compatible way. This also means we can remove the legacy xattr handlers completely. We've also added comments to explain the expected behavior for daemons without FUSE_POSIX_ACL into the code. Fixes: 318e66856dde ("xattr: use posix acl api") Signed-off-by: Seth Forshee (Digital Ocean) <sforshee@kernel.org> Reviewed-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-23nfsd: don't free files unconditionally in __nfsd_file_cache_purgeJeff Layton1-25/+36
nfsd_file_cache_purge is called when the server is shutting down, in which case, tearing things down is generally fine, but it also gets called when the exports cache is flushed. Instead of walking the cache and freeing everything unconditionally, handle it the same as when we have a notification of conflicting access. Fixes: ac3a2585f018 ("nfsd: rework refcounting in filecache") Reported-by: Ruben Vestergaard <rubenv@drcmr.dk> Reported-by: Torkil Svensgaard <torkil@drcmr.dk> Reported-by: Shachar Kagan <skagan@nvidia.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Shachar Kagan <skagan@nvidia.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>