summaryrefslogtreecommitdiff
path: root/fs/xfs/xfs_buf.h
AgeCommit message (Collapse)AuthorFilesLines
2024-02-22xfs: support in-memory buffer cache targetsDarrick J. Wong1-0/+9
Allow the buffer cache to target in-memory files by making it possible to have a buftarg that maps pages from private shmem files. As the prevous patch alludes, the in-memory buftarg contains its own cache, points to a shmem file, and does not point to a block_device. The next few patches will make it possible to construct an xfs_btree in pageable memory by using this buftarg. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: teach buftargs to maintain their own buffer hashtableDarrick J. Wong1-0/+8
Currently, cached buffers are indexed by per-AG hashtables. This works great for the data device, but won't work for in-memory btrees. To handle that use case, buftargs will need to be able to index buffers independently of other data structures. We accomplish this by hoisting the rhashtable and its lock into a separate xfs_buf_cache structure, make the buftarg point to the _buf_cache structure, and rework various functions to use it. This will enable the in-memory buftarg to come up with its own _buf_cache. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: remove the xfs_buftarg_t typedefChristoph Hellwig1-2/+2
Switch the few remaining holdouts to the struct version. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-12-15xfs: force all buffers to be written during btree bulk loadDarrick J. Wong1-0/+1
While stress-testing online repair of btrees, I noticed periodic assertion failures from the buffer cache about buffers with incorrect DELWRI_Q state. Looking further, I observed this race between the AIL trying to write out a btree block and repair zapping a btree block after the fact: AIL: Repair0: pin buffer X delwri_queue: set DELWRI_Q add to delwri list stale buf X: clear DELWRI_Q does not clear b_list free space X commit delwri_submit # oops Worse yet, I discovered that running the same repair over and over in a tight loop can result in a second race that cause data integrity problems with the repair: AIL: Repair0: Repair1: pin buffer X delwri_queue: set DELWRI_Q add to delwri list stale buf X: clear DELWRI_Q does not clear b_list free space X commit find free space X get buffer rewrite buffer delwri_queue: set DELWRI_Q already on a list, do not add commit BAD: committed tree root before all blocks written delwri_submit # too late now I traced this to my own misunderstanding of how the delwri lists work, particularly with regards to the AIL's buffer list. If a buffer is logged and committed, the buffer can end up on that AIL buffer list. If btree repairs are run twice in rapid succession, it's possible that the first repair will invalidate the buffer and free it before the next time the AIL wakes up. Marking the buffer stale clears DELWRI_Q from the buffer state without removing the buffer from its delwri list. The buffer doesn't know which list it's on, so it cannot know which lock to take to protect the list for a removal. If the second repair allocates the same block, it will then recycle the buffer to start writing the new btree block. Meanwhile, if the AIL wakes up and walks the buffer list, it will ignore the buffer because it can't lock it, and go back to sleep. When the second repair calls delwri_queue to put the buffer on the list of buffers to write before committing the new btree, it will set DELWRI_Q again, but since the buffer hasn't been removed from the AIL's buffer list, it won't add it to the bulkload buffer's list. This is incorrect, because the bulkload caller relies on delwri_submit to ensure that all the buffers have been sent to disk /before/ committing the new btree root pointer. This ordering requirement is required for data consistency. Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally drop it, so the next thread to walk through the btree will trip over a debug assertion on that flag. To fix this, create a new function that waits for the buffer to be removed from any other delwri lists before adding the buffer to the caller's delwri list. By waiting for the buffer to clear both the delwri list and any potential delwri wait list, we can be sure that repair will initiate writes of all buffers and report all write errors back to userspace instead of committing the new structure. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-11-03Merge tag 'mm-stable-2023-11-01-14-33' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2023-10-28xfs: Convert to bdev_open_by_path()Jan Kara1-1/+2
Convert xfs to use bdev_open_by_path() and pass the handle around. CC: "Darrick J. Wong" <djwong@kernel.org> CC: linux-xfs@vger.kernel.org Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-28-jack@suse.cz Acked-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-04xfs: dynamically allocate the xfs-buf shrinkerQi Zheng1-1/+1
In preparation for implementing lockless slab shrink, use new APIs to dynamically allocate the xfs-buf shrinker, so that it can be freed asynchronously via RCU. Then it doesn't need to wait for RCU read-side critical section when releasing the struct xfs_buftarg. Link: https://lkml.kernel.org/r/20230911094444.68966-35-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-10xfs: allow scanning ranges of the buffer cache for live buffersDarrick J. Wong1-0/+13
After an online repair, we need to invalidate buffers representing the blocks from the old metadata that we're replacing. It's possible that parts of a tree that were previously cached in memory are no longer accessible due to media failure or other corruption on interior nodes, so repair figures out the old blocks from the reverse mapping data and scans the buffer cache directly. In other words, online fsck needs to find all the live (i.e. non-stale) buffers for a range of fsblocks so that it can invalidate them. Unfortunately, the current buffer cache code triggers asserts if the rhashtable lookup finds a non-stale buffer of a different length than the key we searched for. For regular operation this is desirable, but for this repair procedure, we don't care since we're going to forcibly stale the buffer anyway. Add an internal lookup flag to avoid the assert. Skip buffers that are already XBF_STALE. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-21xfs: xfs_buf cache destroy isn't RCU safeDave Chinner1-4/+2
Darrick and Sachin Sant reported that xfs/435 and xfs/436 would report an non-empty xfs_buf slab on module remove. This isn't easily to reproduce, but is clearly a side effect of converting the buffer caceh to RUC freeing and lockless lookups. Sachin bisected and Darrick hit it when testing the patchset directly. Turns out that the xfs_buf slab is not destroyed when all the other XFS slab caches are destroyed. Instead, it's got it's own little wrapper function that gets called separately, and so it doesn't have an rcu_barrier() call in it that is needed to drain all the rcu callbacks before the slab is destroyed. Fix it by removing the xfs_buf_init/terminate wrappers that just allocate and destroy the xfs_buf slab, and move them to the same place that all the other slab caches are set up and destroyed. Reported-and-tested-by: Sachin Sant <sachinp@linux.ibm.com> Fixes: 298f34224506 ("xfs: lockless buffer lookup") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-07-14xfs: lockless buffer lookupDave Chinner1-0/+1
Now that we have a standalone fast path for buffer lookup, we can easily convert it to use rcu lookups. When we continually hammer the buffer cache with trylock lookups, we end up with a huge amount of lock contention on the per-ag buffer hash locks: - 92.71% 0.05% [kernel] [k] xfs_inodegc_worker - 92.67% xfs_inodegc_worker - 92.13% xfs_inode_unlink - 91.52% xfs_inactive_ifree - 85.63% xfs_read_agi - 85.61% xfs_trans_read_buf_map - 85.59% xfs_buf_read_map - xfs_buf_get_map - 85.55% xfs_buf_find - 72.87% _raw_spin_lock - do_raw_spin_lock 71.86% __pv_queued_spin_lock_slowpath - 8.74% xfs_buf_rele - 7.88% _raw_spin_lock - 7.88% do_raw_spin_lock 7.63% __pv_queued_spin_lock_slowpath - 1.70% xfs_buf_trylock - 1.68% down_trylock - 1.41% _raw_spin_lock_irqsave - 1.39% do_raw_spin_lock __pv_queued_spin_lock_slowpath - 0.76% _raw_spin_unlock 0.75% do_raw_spin_unlock This is basically hammering the pag->pag_buf_lock from lots of CPUs doing trylocks at the same time. Most of the buffer trylock operations ultimately fail after we've done the lookup, so we're really hammering the buf hash lock whilst making no progress. We can also see significant spinlock traffic on the same lock just under normal operation when lots of tasks are accessing metadata from the same AG, so let's avoid all this by converting the lookup fast path to leverages the rhashtable's ability to do rcu protected lookups. We avoid races with the buffer release path by using atomic_inc_not_zero() on the buffer hold count. Any buffer that is in the LRU will have a non-zero count, thereby allowing the lockless fast path to be taken in most cache hit situations. If the buffer hold count is zero, then it is likely going through the release path so in that case we fall back to the existing lookup miss slow path. The slow path will then do an atomic lookup and insert under the buffer hash lock and hence serialise correctly against buffer release freeing the buffer. The use of rcu protected lookups means that buffer handles now need to be freed by RCU callbacks (same as inodes). We still free the buffer pages before the RCU callback - we won't be trying to access them at all on a buffer that has zero references - but we need the buffer handle itself to be present for the entire rcu protected read side to detect a zero hold count correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07xfs: rework xfs_buf_incore() APIDave Chinner1-4/+16
Make it consistent with the other buffer APIs to return a error and the buffer is placed in a parameter. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-04-21xfs: convert buffer flags to unsigned.Dave Chinner1-21/+21
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned fields to be unsigned. This manifests as a compiler error such as: /kisskb/src/fs/xfs/./xfs_trace.h:432:2: note: in expansion of macro 'TP_printk' TP_printk("dev %d:%d daddr 0x%llx bbcount 0x%x hold %d pincount %d " ^ /kisskb/src/fs/xfs/./xfs_trace.h:440:5: note: in expansion of macro '__print_flags' __print_flags(__entry->flags, "|", XFS_BUF_FLAGS), ^ /kisskb/src/fs/xfs/xfs_buf.h:67:4: note: in expansion of macro 'XBF_UNMAPPED' { XBF_UNMAPPED, "UNMAPPED" } ^ /kisskb/src/fs/xfs/./xfs_trace.h:440:40: note: in expansion of macro 'XFS_BUF_FLAGS' __print_flags(__entry->flags, "|", XFS_BUF_FLAGS), ^ /kisskb/src/fs/xfs/./xfs_trace.h: In function 'trace_raw_output_xfs_buf_flags_class': /kisskb/src/fs/xfs/xfs_buf.h:46:23: error: initializer element is not constant #define XBF_UNMAPPED (1 << 31)/* do not map the buffer */ as __print_flags assigns XFS_BUF_FLAGS to a structure that uses an unsigned long for the flag. Since this results in the value of XBF_UNMAPPED causing a signed integer overflow, the result is technically undefined behavior, which gcc-5 does not accept as an integer constant. This is based on a patch from Arnd Bergman <arnd@arndb.de>. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2021-12-04dax: return the partition offset from fs_dax_get_by_bdevChristoph Hellwig1-0/+1
Prepare for the removal of the block_device from the DAX I/O path by returning the partition offset from fs_dax_get_by_bdev so that the file systems have it at hand for use during I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-26-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04xfs: move dax device handling into xfs_{alloc,free}_buftargChristoph Hellwig1-2/+2
Hide the DAX device lookup from the xfs_super.c code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/20211129102203.2243509-22-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-08-19xfs: rename buffer cache index variable b_bnDave Chinner1-17/+1
To stop external users from using b_bn as the disk address of the buffer, rename it to b_rhash_key to indicate that it is the buffer cache index, not the block number of the buffer. Code that needs the disk address should use xfs_buf_daddr() to obtain it. Do the rename and clean up any of the remaining internal b_bn users. Also clean up any remaining b_bn cruft that is now unused. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: introduce xfs_buf_daddr()Dave Chinner1-1/+5
Introduce a helper function xfs_buf_daddr() to extract the disk address of the buffer from the struct xfs_buf. This will replace direct accesses to bp->b_bn and bp->b_maps[0].bm_bn, as well as the XFS_BUF_ADDR() macro. This patch introduces the helper function and replaces all uses of XFS_BUF_ADDR() as this is just a simple sed replacement. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: sb verifier doesn't handle uncached sb bufferDave Chinner1-1/+6
The verifier checks explicitly for bp->b_bn == XFS_SB_DADDR to match the primary superblock buffer, but the primary superblock is an uncached buffer and so bp->b_bn is always -1ULL. Hence this never matches and the CRC error reporting is wholly dependent on the mount superblock already being populated so CRC feature checks pass and allow CRC errors to be reported. Fix this so that the primary superblock CRC error reporting is not dependent on already having read the superblock into memory. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-10xfs: remove kmem_alloc_io()Dave Chinner1-6/+0
Since commit 59bb47985c1d ("mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)"), the core slab code now guarantees slab alignment in all situations sufficient for IO purposes (i.e. minimum of 512 byte alignment of >= 512 byte sized heap allocations) we no longer need the workaround in the XFS code to provide this guarantee. Replace the use of kmem_alloc_io() with kmem_alloc() or kmem_alloc_large() appropriately, and remove the kmem_alloc_io() interface altogether. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-07xfs: remove ->b_offset handling for page backed buffersChristoph Hellwig1-1/+2
->b_offset can only be non-zero for _XBF_KMEM backed buffers, so remove all code dealing with it for page backed buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> [dgc: modified to fit this patchset] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-01-23xfs: don't drain buffer lru on freeze and read-only remountBrian Foster1-0/+1
xfs_buftarg_drain() is called from xfs_log_quiesce() to ensure the buffer cache is reclaimed during unmount. xfs_log_quiesce() is also called from xfs_quiesce_attr(), however, which means that cache state is completely drained for filesystem freeze and read-only remount. While technically harmless, this is unnecessarily heavyweight. Both freeze and read-only mounts allow reads and thus allow population of the buffer cache. Therefore, the transitional sequence in either case really only needs to quiesce outstanding writes to return the filesystem in a generally read-only state. Additionally, some users have reported that attempts to freeze a filesystem concurrent with a read-heavy workload causes the freeze process to stall for a significant amount of time. This occurs because, as mentioned above, the read workload repopulates the buffer LRU while the freeze task attempts to drain it. To improve this situation, replace the drain in xfs_log_quiesce() with a buffer I/O quiesce and lift the drain into the unmount path. This removes buffer LRU reclaim from freeze and read-only [re]mount, but ensures the LRU is still drained before the filesystem unmounts. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-23xfs: rename xfs_wait_buftarg() to xfs_buftarg_drain()Brian Foster1-5/+5
xfs_wait_buftarg() is vaguely named and somewhat overloaded. Its primary purpose is to reclaim all buffers from the provided buffer target LRU. In preparation to refactor xfs_wait_buftarg() into serialization and LRU draining components, rename the function and associated helpers to something more descriptive. This patch has no functional changes with the minor exception of renaming a tracepoint. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2020-12-17xfs: remove xfs_buf_t typedefDave Chinner1-7/+7
Prepare for kernel xfs_buf alignment by getting rid of the xfs_buf_t typedef from userspace. [darrick: This patch is a port of a userspace patch removing the xfs_buf_t typedef in preparation to make the userspace xfs_buf code behave more like its kernel counterpart.] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-09-16xfs: reuse _xfs_buf_read for re-reading the superblockChristoph Hellwig1-8/+1
Instead of poking deeply into buffer cache internals when re-reading the superblock during log recovery just generalize _xfs_buf_read and use it there. Note that we don't have to explicitly set up the ops as they must be set from the initial read. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-16xfs: fold xfs_buf_ioend_finish into xfs_ioendChristoph Hellwig1-7/+0
No need to keep a separate helper for this logic. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-16xfs: mark xfs_buf_ioend staticChristoph Hellwig1-1/+0
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06xfs: call xfs_buf_iodone directlyDave Chinner1-5/+1
All unmarked dirty buffers should be in the AIL and have log items attached to them. Hence when they are written, we will run a callback to remove the item from the AIL if appropriate. Now that we've handled inode and dquot buffers, all remaining calls are to xfs_buf_iodone() and so we can hard code this rather than use an indirect call. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06xfs: mark log recovery buffers for completionDave Chinner1-0/+2
Log recovery has it's own buffer write completion handler for buffers that it directly recovers. Convert these to direct calls by flagging these buffers as being log recovery buffers. The flag will get cleared by the log recovery IO completion routine, so it will never leak out of log recovery. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06xfs: mark dquot buffers in cacheDave Chinner1-0/+2
dquot buffers always have write IO callbacks, so by marking them directly we can avoid needing to attach ->b_iodone functions to them. This avoids an indirect call, and makes future modifications much simpler. This is largely a rearrangement of the code at this point - no IO completion functionality changes at this point, just how the code is run is modified. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06xfs: mark inode buffers in cacheDave Chinner1-13/+25
Inode buffers always have write IO callbacks, so by marking them directly we can avoid needing to attach ->b_iodone functions to them. This avoids an indirect call, and makes future modifications much simpler. While this is largely a refactor of existing functionality, we broaden the scope of the flag to beyond where inodes are explicitly attached because future changes need to know what type of log items are attached to the buffer. Adding this buffer flag may invoke the inode iodone callback in cases where it wouldn't have been previously, but this is not a functional change because the callback is identical to the normal buffer write iodone callback when inodes are not attached. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-07xfs: refactor ratelimited buffer error messages into helperBrian Foster1-0/+1
XFS has some inconsistent log message rate limiting with respect to buffer alerts. The metadata I/O error notification uses the generic ratelimited alert, the buffer push code uses a custom rate limit and the similar quiesce time failure checks are not rate limited at all (when they should be). The custom rate limit defined in the buf item code is specifically crafted for buffer alerts. It is more aggressive than generic rate limiting code because it must accommodate a high frequency of I/O error events in a relative short timeframe. Factor out the custom rate limit state from the buf item code into a per-buftarg rate limit so various alerts are limited based on the target. Define a buffer alert helper function and use it for the buffer alerts that are already ratelimited. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-07xfs: factor out buffer I/O failure codeBrian Foster1-0/+1
We use the same buffer I/O failure code in a few different places. It's not much code, but it's not necessarily self-explanatory. Factor it into a helper and document it in one place. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-12xfs: add a function to deal with corrupt buffers post-verifiersDarrick J. Wong1-0/+2
Add a helper function to get rid of buffers that we have decided are corrupt after the verifiers have run. This function is intended to handle metadata checks that can't happen in the verifiers, such as inter-block relationship checking. Note that we now mark the buffer stale so that it will not end up on any LRU and will be purged on release. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-01-27xfs: fix xfs_buf_ioerror_alert location reportingDarrick J. Wong1-3/+4
Instead of passing __func__ to the error reporting function, let's use the return address builtins so that the messages actually tell you which higher level function called the buffer functions. This was previously true for the xfs_buf_read callers, but not for the xfs_trans_read_buf callers. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-01-27xfs: make xfs_buf_read return an error codeDarrick J. Wong1-7/+3
Convert xfs_buf_read() to return numeric error codes like most everywhere else in xfs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-01-27xfs: make xfs_buf_get_uncached return an error codeDarrick J. Wong1-2/+2
Convert xfs_buf_get_uncached() to return numeric error codes like most everywhere else in xfs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-01-27xfs: make xfs_buf_get return an error codeDarrick J. Wong1-8/+4
Convert xfs_buf_get() to return numeric error codes like most everywhere else in xfs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-01-27xfs: make xfs_buf_read_map return an error codeDarrick J. Wong1-5/+10
Convert xfs_buf_read_map() to return numeric error codes like most everywhere else in xfs. This involves moving the open-coded logic that reports metadata IO read / corruption errors and stales the buffer into xfs_buf_read_map so that the logic is all in one place. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-01-27xfs: make xfs_buf_get_map return an error codeDarrick J. Wong1-4/+9
Convert xfs_buf_get_map() to return numeric error codes like most everywhere else in xfs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-10-28xfs: mark xfs_buf_free staticChristoph Hellwig1-1/+0
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-27xfs: get allocation alignment from the buftargDave Chinner1-0/+6
Needed to feed into the allocation routine to guarantee the memory buffers we add to bios are correctly aligned to the underlying device. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-29xfs: add struct xfs_mount pointer to struct xfs_bufChristoph Hellwig1-0/+1
We need to derive the mount pointer from a buffer in a lot of place. Add a direct pointer to short cut the pointer chasing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-29xfs: remove the b_io_length field in struct xfs_bufChristoph Hellwig1-1/+0
This field is now always idential to b_length. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-29xfs: properly type the b_log_item field in struct xfs_bufChristoph Hellwig1-1/+1
Now that the log code doesn't abuse this field any more we can declare it as a struct xfs_buf_log_item pointer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-29xfs: remove unused buffer cache APIsChristoph Hellwig1-27/+0
Now that the log code uses bios directly we can drop various special cases in the buffer cache code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-29xfs: renumber XBF_WRITE_FAILChristoph Hellwig1-1/+1
Assining a numerical value that is not close to the flags defined near by is just asking for conflicts later on. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-29xfs: remove the never used _XBF_COMPOUND flagChristoph Hellwig1-3/+1
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-12xfs: remove unused flag argumentsEric Sandeen1-3/+2
There are several functions which take a flag argument that is only ever passed as "0," so remove these arguments. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-12xfs: merge xfs_buf_zero and xfs_buf_iomoveChristoph Hellwig1-10/+1
xfs_buf_zero is the only caller of xfs_buf_iomove. Remove support for copying from or to the buffer in xfs_buf_iomove and merge the two functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-18xfs: fix xfs_buf magic number endian checksDarrick J. Wong1-2/+6
Create a separate magic16 check function so that we don't run afoul of static checkers. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-12xfs: distinguish between inobt and finobt magic valuesBrian Foster1-0/+2
The inode btree verifier code is shared between the inode btree and free inode btree because the underlying metadata formats are essentially equivalent. A side effect of this is that the verifier cannot determine whether a particular btree block should have an inobt or finobt magic value. This logic allows an unfortunate xfs_repair bug to escape detection where certain level > 0 nodes of the finobt are stamped with inobt magic by xfs_repair finobt reconstruction. This is fortunately not a severe problem since the inode btree magic values do not contribute to any changes in kernel behavior, but we do need a means to detect and prevent this problem in the future. Add a field to xfs_buf_ops to store the v4 and v5 superblock magic values expected by a particular verifier. Add a helper to check an on-disk magic value against the value expected by the verifier. Call the helper from the shared [f]inobt verifier code for magic value verification. This ensures that the inode btree blocks each have the appropriate magic value based on specific tree type and superblock version. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>