summaryrefslogtreecommitdiff
path: root/fs/dax.c
AgeCommit message (Collapse)AuthorFilesLines
2022-05-28Merge tag 'libnvdimm-for-5.19' of ↵Linus Torvalds1-5/+17
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm and DAX updates from Dan Williams: "New support for clearing memory errors when a file is in DAX mode, alongside with some other fixes and cleanups. Previously it was only possible to clear these errors using a truncate or hole-punch operation to trigger the filesystem to reallocate the block, now, any page aligned write can opportunistically clear errors as well. This change spans x86/mm, nvdimm, and fs/dax, and has received the appropriate sign-offs. Thanks to Jane for her work on this. Summary: - Add support for clearing memory error via pwrite(2) on DAX - Fix 'security overwrite' support in the presence of media errors - Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)" * tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: pmem: implement pmem_recovery_write() pmem: refactor pmem_clear_poison() dax: add .recovery_write dax_operation dax: introduce DAX_RECOVERY_WRITE dax access mode mce: fix set_mce_nospec to always unmap the whole page x86/mce: relocate set{clear}_mce_nospec() functions acpi/nfit: rely on mce->misc to determine poison granularity testing: nvdimm: asm/mce.h is not needed in nfit.c testing: nvdimm: iomap: make __nfit_test_ioremap a macro nvdimm: Allow overwrite in the presence of disabled dimms tools/testing/nvdimm: remove unneeded flush_workqueue
2022-05-16dax: add .recovery_write dax_operationJane Chu1-1/+12
Introduce dax_recovery_write() operation. The function is used to recover a dax range that contains poison. Typical use case is when a user process receives a SIGBUS with si_code BUS_MCEERR_AR indicating poison(s) in a dax range, in response, the user process issues a pwrite() to the page-aligned dax range, thus clears the poison and puts valid data in the range. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jane Chu <jane.chu@oracle.com> Link: https://lore.kernel.org/r/20220422224508.440670-6-jane.chu@oracle.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-05-16dax: introduce DAX_RECOVERY_WRITE dax access modeJane Chu1-4/+5
Up till now, dax_direct_access() is used implicitly for normal access, but for the purpose of recovery write, dax range with poison is requested. To make the interface clear, introduce enum dax_access_mode { DAX_ACCESS, DAX_RECOVERY_WRITE, } where DAX_ACCESS is used for normal dax access, and DAX_RECOVERY_WRITE is used for dax recovery write. Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/r/165247982851.52965.11024212198889762949.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-04-29dax: fix missing writeprotect the pte entryMuchun Song1-87/+12
Currently dax_mapping_entry_mkclean() fails to clean and write protect the pte entry within a DAX PMD entry during an *sync operation. This can result in data loss in the following sequence: 1) process A mmap write to DAX PMD, dirtying PMD radix tree entry and making the pmd entry dirty and writeable. 2) process B mmap with the @offset (e.g. 4K) and @length (e.g. 4K) write to the same file, dirtying PMD radix tree entry (already done in 1)) and making the pte entry dirty and writeable. 3) fsync, flushing out PMD data and cleaning the radix tree entry. We currently fail to mark the pte entry as clean and write protected since the vma of process B is not covered in dax_entry_mkclean(). 4) process B writes to the pte. These don't cause any page faults since the pte entry is dirty and writeable. The radix tree entry remains clean. 5) fsync, which fails to flush the dirty PMD data because the radix tree entry was clean. 6) crash - dirty data that should have been fsync'd as part of 5) could still have been in the processor cache, and is lost. Just to use pfn_mkclean_range() to clean the pfns to fix this issue. Link: https://lkml.kernel.org/r/20220403053957.10770-6-songmuchun@bytedance.com Fixes: 4b4bb46d00b3 ("dax: clear dirty entry tags on cache flush") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29dax: fix cache flush on PMD-mapped pagesMuchun Song1-1/+2
The flush_cache_page() only remove a PAGE_SIZE sized range from the cache. However, it does not cover the full pages in a THP except a head page. Replace it with flush_cache_range() to fix this issue. This is just a documentation issue with the respect to properly documenting the expected usage of cache flushing before modifying the pmd. However, in practice this is not a problem due to the fact that DAX is not available on architectures with virtually indexed caches per: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Link: https://lkml.kernel.org/r/20220403053957.10770-3-songmuchun@bytedance.com Fixes: f729c8c9b24f ("dax: wrprotect pmd_t in dax_mapping_entry_mkclean") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-03-25Merge tag 'dax-for-5.18' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull DAX updates from Dan Williams: "Andrew has been shepherding major dax features that touch the core -mm through his tree, but I still collect the dax updates that are core-mm independent. - Fix a crash due to a missing rcu_barrier() in dax_fs_exit() - Fix two miscellaneous doc issues" * tag 'dax-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: Fix missing kdoc for dax_device dax: make sure inodes are flushed before destroy cache fsdax: fix function description
2022-02-18fsdax: fix function descriptionShiyang Ruan1-1/+1
The function name has been changed, so the description should be updated too. Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220127124058.1172422-5-ruansy.fnst@fujitsu.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-02-02block: remove genhd.hChristoph Hellwig1-1/+0
There is no good reason to keep genhd.h separate from the main blkdev.h header that includes it. So fold the contents of genhd.h into blkdev.h and remove genhd.h entirely. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220124093913.742411-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-18dax: remove the copy_from_iter and copy_to_iter methodsChristoph Hellwig1-5/+0
These methods indirect the actual DAX read/write path. In the end pmem uses magic flush and mc safe variants and fuse and dcssblk use plain ones while device mapper picks redirects to the underlying device. Add set_dax_nocache() and set_dax_nomc() APIs to control which copy routines are used to remove indirect call from the read/write fast path as well as a lot of boilerplate code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> [virtiofs] Link: https://lore.kernel.org/r/20211215084508.435401-5-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04fsdax: shift partition offset handling into the file systemsChristoph Hellwig1-5/+1
Remove the last user of ->bdev in dax.c by requiring the file system to pass in an address that already includes the DAX offset. As part of the only set ->bdev or ->daxdev when actually required in the ->iomap_begin methods. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> [erofs] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-27-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04iomap: add a IOMAP_DAX flagChristoph Hellwig1-3/+4
Add a flag so that the file system can easily detect DAX operations based just on the iomap operation requested instead of looking at inode state using IS_DAX. This will be needed to apply the to be added partition offset only for operations that actually use DAX, but not things like fiemap that are based on the block device. In the long run it should also allow turning the bdev, dax_dev and inline_data into a union. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-25-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04fsdax: decouple zeroing from the iomap buffered I/O codeChristoph Hellwig1-14/+63
Unshare the DAX and iomap buffered I/O page zeroing code. This code previously did a IS_DAX check deep inside the iomap code, which in fact was the only DAX check in the code. Instead move these checks into the callers. Most callers already have DAX special casing anyway and XFS will need it for reflink support as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-19-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04fsdax: factor out a dax_memzero helperChristoph Hellwig1-17/+19
Factor out a helper for the "manual" zeroing of a DAX range to clean up dax_iomap_zero a lot. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-18-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04fsdax: simplify the offset check in dax_iomap_zeroChristoph Hellwig1-3/+1
The file relative offset must have the same alignment as the storage offset, so use that and get rid of the call to iomap_sector. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-17-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04fsdax: simplify the pgoff calculationChristoph Hellwig1-25/+10
Replace the two steps of dax_iomap_sector and bdev_dax_pgoff with a single dax_iomap_pgoff helper that avoids lots of cumbersome sector conversions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-15-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04fsdax: use a saner calling convention for copy_cow_page_daxChristoph Hellwig1-16/+13
Just pass the vm_fault and iomap_iter structures, and figure out the rest locally. Note that this requires moving dax_iomap_sector up in the file. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-14-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04fsdax: remove a pointless __force cast in copy_cow_page_daxChristoph Hellwig1-1/+1
Despite its name copy_user_page expected kernel addresses, which is what we already have. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-13-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-08-31Merge tag 'iomap-5.15-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-303/+275
Pull iomap updates from Darrick Wong: "The most notable externally visible change for this cycle is the addition of support for reads to inline tail fragments of files, which was requested by the erofs developers; and a correction for a kernel memory corruption bug if the sysadmin tries to activate a swapfile with more pages than the swapfile header suggests. We also now report writeback completion errors to the file mapping correctly, instead of munging all errors into EIO. Internally, the bulk of the changes are Christoph's patchset to reduce the indirect function call count by a third to a half by converting iomap iteration from a loop pattern to a generator/consumer pattern. As an added bonus, fsdax no longer open-codes iomap apply loops. Summary: - Simplify the bio_end_page usage in the buffered IO code. - Support reading inline data at nonzero offsets for erofs. - Fix some typos and bad grammar. - Convert kmap_atomic usage in the inline data read path. - Add some extra inline data input checking. - Fix a memory corruption bug stemming from iomap_swapfile_activate trying to activate more pages than mm was expecting. - Pass errnos through the page writeback code so that writeback errors are reported correctly instead of being munged to EIO. - Replace iomap_apply with a open-coded iterator loops to reduce the number of indirect calls by a third to a half. - Refactor the fsdax code to use iomap iterators instead of the open-coded iomap_apply code that it had before. - Format file range iomap tracepoint data in hexadecimal and standardize the names used in the pretty-print string" * tag 'iomap-5.15-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (41 commits) iomap: standardize tracepoint formatting and storage mm/swap: consider max pages in iomap_swapfile_add_extent iomap: move loop control code to iter.c iomap: constify iomap_iter_srcmap fsdax: switch the fault handlers to use iomap_iter fsdax: factor out a dax_fault_actor() helper fsdax: factor out helpers to simplify the dax fault code iomap: rework unshare flag iomap: pass an iomap_iter to various buffered I/O helpers iomap: remove iomap_apply fsdax: switch dax_iomap_rw to use iomap_iter iomap: switch iomap_swapfile_activate to use iomap_iter iomap: switch iomap_seek_data to use iomap_iter iomap: switch iomap_seek_hole to use iomap_iter iomap: switch iomap_bmap to use iomap_iter iomap: switch iomap_fiemap to use iomap_iter iomap: switch __iomap_dio_rw to use iomap_iter iomap: switch iomap_page_mkwrite to use iomap_iter iomap: switch iomap_zero_range to use iomap_iter iomap: switch iomap_file_unshare to use iomap_iter ...
2021-08-17fsdax: switch the fault handlers to use iomap_iterChristoph Hellwig1-118/+75
Avoid the open coded calls to ->iomap_begin and ->iomap_end and call iomap_iter instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-17fsdax: factor out a dax_fault_actor() helperShiyang Ruan1-148/+149
The core logic in the two dax page fault functions is similar. So, move the logic into a common helper function. Also, to facilitate the addition of new features, such as CoW, switch-case is no longer used to handle different iomap types. Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-17fsdax: factor out helpers to simplify the dax fault codeShiyang Ruan1-69/+84
The dax page fault code is too long and a bit difficult to read. And it is hard to understand when we trying to add new features. Some of the PTE/PMD codes have similar logic. So, factor out helper functions to simplify the code. Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [hch: minor cleanups] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-17fsdax: switch dax_iomap_rw to use iomap_iterChristoph Hellwig1-25/+24
Switch the dax_iomap_rw implementation to use iomap_iter. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-17fsdax: mark the iomap argument to dax_iomap_sector as constChristoph Hellwig1-1/+1
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-11Merge branch 'for-5.14/dax' into libnvdimm-fixesDan Williams1-1/+1
Pick up some small dax cleanups that make some of Ira's follow on work easier.
2021-07-08fs/dax: Clarify nr_pages to dax_direct_access()Ira Weiny1-1/+1
dax_direct_access() takes a number of pages. PHYS_PFN(PAGE_SIZE) is a very round about way to specify '1'. Change the nr_pages parameter to the explicit value of '1'. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20210525172428.3634316-3-ira.weiny@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-06-29dax: fix ENOMEM handling in grab_mapping_entry()Jan Kara1-1/+2
grab_mapping_entry() has a bug in handling of ENOMEM condition. Suppose we have a PMD entry at index i which we are downgrading to a PTE entry. grab_mapping_entry() will set pmd_downgrade to true, lock the entry, clear the entry in xarray, and decrement mapping->nrpages. The it will call: entry = dax_make_entry(pfn_to_pfn_t(0), flags); dax_lock_entry(xas, entry); which inserts new PTE entry into xarray. However this may fail allocating the new node. We handle this by: if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM)) goto retry; however pmd_downgrade stays set to true even though 'entry' returned from get_unlocked_entry() will be NULL now. And we will go again through the downgrade branch. This is mostly harmless except that mapping->nrpages is decremented again and we temporarily have an invalid entry stored in xarray. Fix the problem by setting pmd_downgrade to false each time we lookup the entry we work with so that it matches the entry we found. Link: https://lkml.kernel.org/r/20210622160015.18004-1-jack@suse.cz Fixes: b15cd800682f ("dax: Convert page fault handlers to XArray") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-15Merge tag 'dax-fixes-5.13-rc2' of ↵Linus Torvalds1-12/+23
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull dax fixes from Dan Williams: "A fix for a hang condition due to missed wakeups in the filesystem-dax core when exercised by virtiofs. This bug has been there from the beginning, but the condition has not triggered on other filesystems since they hold a lock over invalidation events" * tag 'dax-fixes-5.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: Wake up all waiters after invalidating dax entry dax: Add a wakeup mode parameter to put_unlocked_entry() dax: Add an enum for specifying dax wakup mode
2021-05-08dax: Wake up all waiters after invalidating dax entryVivek Goyal1-1/+1
I am seeing missed wakeups which ultimately lead to a deadlock when I am using virtiofs with DAX enabled and running "make -j". I had to mount virtiofs as rootfs and also reduce to dax window size to 256M to reproduce the problem consistently. So here is the problem. put_unlocked_entry() wakes up waiters only if entry is not null as well as !dax_is_conflict(entry). But if I call multiple instances of invalidate_inode_pages2() in parallel, then I can run into a situation where there are waiters on this index but nobody will wake these waiters. invalidate_inode_pages2() invalidate_inode_pages2_range() invalidate_exceptional_entry2() dax_invalidate_mapping_entry_sync() __dax_invalidate_entry() { xas_lock_irq(&xas); entry = get_unlocked_entry(&xas, 0); ... ... dax_disassociate_entry(entry, mapping, trunc); xas_store(&xas, NULL); ... ... put_unlocked_entry(&xas, entry); xas_unlock_irq(&xas); } Say a fault in in progress and it has locked entry at offset say "0x1c". Now say three instances of invalidate_inode_pages2() are in progress (A, B, C) and they all try to invalidate entry at offset "0x1c". Given dax entry is locked, all tree instances A, B, C will wait in wait queue. When dax fault finishes, say A is woken up. It will store NULL entry at index "0x1c" and wake up B. When B comes along it will find "entry=0" at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And this means put_unlocked_entry() will not wake up next waiter, given the current code. And that means C continues to wait and is not woken up. This patch fixes the issue by waking up all waiters when a dax entry has been invalidated. This seems to fix the deadlock I am facing and I can make forward progress. Reported-by: Sergio Lopez <slp@redhat.com> Fixes: ac401cc78242 ("dax: New fault locking") Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/r/20210428190314.1865312-4-vgoyal@redhat.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-05-08dax: Add a wakeup mode parameter to put_unlocked_entry()Vivek Goyal1-7/+7
As of now put_unlocked_entry() always wakes up next waiter. In next patches we want to wake up all waiters at one callsite. Hence, add a parameter to the function. This patch does not introduce any change of behavior. Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/r/20210428190314.1865312-3-vgoyal@redhat.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-05-08dax: Add an enum for specifying dax wakup modeVivek Goyal1-6/+17
Dan mentioned that he is not very fond of passing around a boolean true/false to specify if only next waiter should be woken up or all waiters should be woken up. He instead prefers that we introduce an enum and make it very explicity at the callsite itself. Easier to read code. This patch should not introduce any change of behavior. Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/r/20210428190314.1865312-2-vgoyal@redhat.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-05-05dax: account DAX entries as nrpagesMatthew Wilcox (Oracle)1-3/+3
Simplify mapping_needs_writeback() by accounting DAX entries as pages instead of exceptional entries. Link: https://lkml.kernel.org/r/20201026151849.24232-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Vishal Verma <vishal.l.verma@intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05mm: introduce and use mapping_empty()Matthew Wilcox (Oracle)1-1/+1
Patch series "Remove nrexceptional tracking", v2. We actually use nrexceptional for very little these days. It's a minor pain to keep in sync with nrpages, but the pain becomes much bigger with the THP patches because we don't know how many indices a shadow entry occupies. It's easier to just remove it than keep it accurate. Also, we save 8 bytes per inode which is nothing to sneeze at; on my laptop, it would improve shmem_inode_cache from 22 to 23 objects per 16kB, and inode_cache from 26 to 27 objects. Combined, that saves a megabyte of memory from a combined usage of 25MB for both caches. Unfortunately, ext4 doesn't cross a magic boundary, so it doesn't save any memory for ext4. This patch (of 4): Instead of checking the two counters (nrpages and nrexceptional), we can just check whether i_pages is empty. Link: https://lkml.kernel.org/r/20201026151849.24232-1-willy@infradead.org Link: https://lkml.kernel.org/r/20201026151849.24232-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Vishal Verma <vishal.l.verma@intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-09mm: provide a saner PTE walking API for modulesPaolo Bonzini1-2/+3
Currently, the follow_pfn function is exported for modules but follow_pte is not. However, follow_pfn is very easy to misuse, because it does not provide protections (so most of its callers assume the page is writable!) and because it returns after having already unlocked the page table lock. Provide instead a simplified version of follow_pte that does not have the pmdpp and range arguments. The older version survives as follow_invalidate_pte() for use by fs/dax.c. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-16mm: simplify follow_pte{,pmd}Christoph Hellwig1-5/+4
Merge __follow_pte_pmd, follow_pte_pmd and follow_pte into a single follow_pte function and just pass two additional NULL arguments for the two previous follow_pte callers. [sfr@canb.auug.org.au: merge fix for "s390/pci: remove races against pte updates"] Link: https://lkml.kernel.org/r/20201111221254.7f6a3658@canb.auug.org.au Link: https://lkml.kernel.org/r/20201029101432.47011-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-20Merge tag 'fuse-update-5.10' of ↵Linus Torvalds1-6/+23
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Support directly accessing host page cache from virtiofs. This can improve I/O performance for various workloads, as well as reducing the memory requirement by eliminating double caching. Thanks to Vivek Goyal for doing most of the work on this. - Allow automatic submounting inside virtiofs. This allows unique st_dev/ st_ino values to be assigned inside the guest to files residing on different filesystems on the host. Thanks to Max Reitz for the patches. - Fix an old use after free bug found by Pradeep P V K. * tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits) virtiofs: calculate number of scatter-gather elements accurately fuse: connection remove fix fuse: implement crossmounts fuse: Allow fuse_fill_super_common() for submounts fuse: split fuse_mount off of fuse_conn fuse: drop fuse_conn parameter where possible fuse: store fuse_conn in fuse_req fuse: add submount support to <uapi/linux/fuse.h> fuse: fix page dereference after free virtiofs: add logic to free up a memory range virtiofs: maintain a list of busy elements virtiofs: serialize truncate/punch_hole and dax fault path virtiofs: define dax address space operations virtiofs: add DAX mmap support virtiofs: implement dax read/write operations virtiofs: introduce setupmapping/removemapping commands virtiofs: implement FUSE_INIT map_alignment field virtiofs: keep a list of free dax memory ranges virtiofs: add a mount option to enable dax virtiofs: set up virtio_fs dax_device ...
2020-09-21iomap: Change calling convention for zeroingMatthew Wilcox (Oracle)1-7/+6
Pass the full length to iomap_zero() and dax_iomap_zero(), and have them return how many bytes they actually handled. This is preparatory work for handling THP, although it looks like DAX could actually take advantage of it if there's a larger contiguous area. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-09-10dax: Create a range version of dax_layout_busy_page()Vivek Goyal1-6/+23
virtiofs device has a range of memory which is mapped into file inodes using dax. This memory is mapped in qemu on host and maps different sections of real file on host. Size of this memory is limited (determined by administrator) and depending on filesystem size, we will soon reach a situation where all the memory is in use and we need to reclaim some. As part of reclaim process, we will need to make sure that there are no active references to pages (taken by get_user_pages()) on the memory range we are trying to reclaim. I am planning to use dax_layout_busy_page() for this. But in current form this is per inode and scans through all the pages of the inode. We want to reclaim only a portion of memory (say 2MB page). So we want to make sure that only that 2MB range of pages do not have any references (and don't want to unmap all the pages of inode). Hence, create a range version of this function named dax_layout_busy_page_range() which can be used to pass a range which needs to be unmapped. Cc: Dan Williams <dan.j.williams@intel.com> Cc: linux-nvdimm@lists.01.org Cc: Jan Kara <jack@suse.cz> Cc: Vishal L Verma <vishal.l.verma@intel.com> Cc: "Weiny, Ira" <ira.weiny@intel.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-08-24treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva1-1/+1
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-07-31dax: Fix incorrect argument passed to xas_set_err()Hao Li1-1/+1
The argument passed to xas_set_err() to indicate an error should be negative. Otherwise, xas_error() will return 0, and grab_mapping_entry() will return the found entry instead of 'SIGBUS' when the entry is not in fact valid. This would result in problems in subsequent code paths. Link: https://lore.kernel.org/r/20200729034436.24267-1-lihao2018.fnst@cn.fujitsu.com Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Signed-off-by: Hao Li <lihao2018.fnst@cn.fujitsu.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2020-07-28fs/dax: Remove unused size parameterIra Weiny1-7/+6
Passing size to copy_user_dax implies it can copy variable sizes of data when in fact it calls copy_user_page() which is exactly a page. We are safe because the only caller uses PAGE_SIZE anyway so just remove the variable for clarity. While we are at it change copy_user_dax() to copy_cow_page_dax() to make it clear it is a singleton helper for this one case not implementing what dax_iomap_actor() does. Link: https://lore.kernel.org/r/20200717072056.73134-11-ira.weiny@intel.com Reviewed-by: Ben Widawsky <ben.widawsky@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2020-04-03dax,iomap: Add helper dax_iomap_zero() to zero a rangeVivek Goyal1-8/+8
Add a helper dax_ioamp_zero() to zero a range. This patch basically merges __dax_zero_page_range() and iomap_dax_zero(). Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20200228163456.1587-7-vgoyal@redhat.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-04-03dax: Use new dax zero page method for zeroing a pageVivek Goyal1-30/+23
Use new dax native zero page method for zeroing page if I/O is page aligned. Otherwise fall back to direct_access() + memcpy(). This gets rid of one of the depenendency on block device in dax path. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/r/20200228163456.1587-6-vgoyal@redhat.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-02-06dax: pass NOWAIT flag to iomap_applyJeff Moyer1-0/+3
fstests generic/471 reports a failure when run with MOUNT_OPTIONS="-o dax". The reason is that the initial pwrite to an empty file with the RWF_NOWAIT flag set does not return -EAGAIN. It turns out that dax_iomap_rw doesn't pass that flag through to iomap_apply. With this patch applied, generic/471 passes for me. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/x49r1z86e1d.fsf@segfault.boston.devel.redhat.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-01-03dax: Pass dax_dev instead of bdev to dax_writeback_mapping_range()Vivek Goyal1-7/+1
As of now dax_writeback_mapping_range() takes "struct block_device" as a parameter and dax_dev is searched from bdev name. This also involves taking a fresh reference on dax_dev and putting that reference at the end of function. We are developing a new filesystem virtio-fs and using dax to access host page cache directly. But there is no block device. IOW, we want to make use of dax but want to get rid of this assumption that there is always a block device associated with dax_dev. So pass in "struct dax_device" as parameter instead of bdev. ext2/ext4/xfs are current users and they already have a reference on dax_device. So there is no need to take reference and drop reference to dax_device on each call of this function. Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/r/20200103183307.GB13350@redhat.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-11-30Merge tag 'iomap-5.5-merge-11' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-5/+8
Pull iomap updates from Darrick Wong: "In this release, we hoisted as much of XFS' writeback code into iomap as was practicable, refactored the unshare file data function, added the ability to perform buffered io copy on write, and tweaked various parts of the directio implementation as needed to port ext4's directio code (that will be a separate pull). Summary: - Make iomap_dio_rw callers explicitly tell us if they want us to wait - Port the xfs writeback code to iomap to complete the buffered io library functions - Refactor the unshare code to share common pieces - Add support for performing copy on write with buffered writes - Other minor fixes - Fix unchecked return in iomap_bmap - Fix a type casting bug in a ternary statement in iomap_dio_bio_actor - Improve tracepoints for easier diagnostic ability - Fix pipe page leakage in directio reads" * tag 'iomap-5.5-merge-11' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (31 commits) iomap: Fix pipe page leakage during splicing iomap: trace iomap_appply results iomap: fix return value of iomap_dio_bio_actor on 32bit systems iomap: iomap_bmap should check iomap_apply return value iomap: Fix overflow in iomap_page_mkwrite fs/iomap: remove redundant check in iomap_dio_rw() iomap: use a srcmap for a read-modify-write I/O iomap: renumber IOMAP_HOLE to 0 iomap: use write_begin to read pages to unshare iomap: move the zeroing case out of iomap_read_page_sync iomap: ignore non-shared or non-data blocks in xfs_file_dirty iomap: always use AOP_FLAG_NOFS in iomap_write_begin iomap: remove the unused iomap argument to __iomap_write_end iomap: better document the IOMAP_F_* flags iomap: enhance writeback error message iomap: pass a struct page to iomap_finish_page_writeback iomap: cleanup iomap_ioend_compare iomap: move struct iomap_page out of iomap.h iomap: warn on inline maps in iomap_writepage_map iomap: lift the xfs writeback code to iomap ...
2019-10-23fs/dax: Fix pmd vs pte conflict detectionDan Williams1-2/+3
Users reported a v5.3 performance regression and inability to establish huge page mappings. A revised version of the ndctl "dax.sh" huge page unit test identifies commit 23c84eb78375 "dax: Fix missed wakeup with PMD faults" as the source. Update get_unlocked_entry() to check for NULL entries before checking the entry order, otherwise NULL is misinterpreted as a present pte conflict. The 'order' check needs to happen before the locked check as an unlocked entry at the wrong order must fallback to lookup the correct order. Reported-by: Jeff Smits <jeff.smits@intel.com> Reported-by: Doug Nelson <doug.nelson@intel.com> Cc: <stable@vger.kernel.org> Fixes: 23c84eb78375 ("dax: Fix missed wakeup with PMD faults") Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Link: https://lore.kernel.org/r/157167532455.3945484.11971474077040503994.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-10-21iomap: use a srcmap for a read-modify-write I/OGoldwyn Rodrigues1-5/+8
The srcmap is used to identify where the read is to be performed from. It is passed to ->iomap_begin, which can fill it in if we need to read data for partially written blocks from a different location than the write target. The srcmap is only supported for buffered writes so far. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> [hch: merged two patches, removed the IOMAP_F_COW flag, use iomap as srcmap if not set, adjust length down to srcmap end as well] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2019-08-06dax: dax_layout_busy_page() should not unmap cow pagesVivek Goyal1-1/+1
Vivek: "As of now dax_layout_busy_page() calls unmap_mapping_range() with last argument as 1, which says even unmap cow pages. I am wondering who needs to get rid of cow pages as well. I noticed one interesting side affect of this. I mount xfs with -o dax and mmaped a file with MAP_PRIVATE and wrote some data to a page which created cow page. Then I called fallocate() on that file to zero a page of file. fallocate() called dax_layout_busy_page() which unmapped cow pages as well and then I tried to read back the data I wrote and what I get is old data from persistent memory. I lost the data I had written. This read basically resulted in new fault and read back the data from persistent memory. This sounds wrong. Are there any users which need to unmap cow pages as well? If not, I am proposing changing it to not unmap cow pages. I noticed this while while writing virtio_fs code where when I tried to reclaim a memory range and that corrupted the executable and I was running from virtio-fs and program got segment violation." Dan: "In fact the unmap_mapping_range() in this path is only to synchronize against get_user_pages_fast() and force it to call back into the filesystem to re-establish the mapping. COW pages should be left untouched by dax_layout_busy_page()." Cc: <stable@vger.kernel.org> Fixes: 5fac7408d828 ("mm, fs, dax: handle layout changes to pinned dax mappings") Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/r/20190802192956.GA3032@redhat.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-29dax: Fix missed wakeup in put_unlocked_entry()Jan Kara1-1/+1
The condition checking whether put_unlocked_entry() needs to wake up following waiter got broken by commit 23c84eb78375 ("dax: Fix missed wakeup with PMD faults"). We need to wake the waiter whenever the passed entry is valid (i.e., non-NULL and not special conflict entry). This could lead to processes never being woken up when waiting for entry lock. Fix the condition. Cc: <stable@vger.kernel.org> Link: http://lore.kernel.org/r/20190729120228.GC17833@quack2.suse.cz Fixes: 23c84eb78375 ("dax: Fix missed wakeup with PMD faults") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-19Merge tag 'iomap-5.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-1/+0
Pull iomap split/cleanup from Darrick Wong: "As promised, here's the second part of the iomap merge for 5.3, in which we break up iomap.c into smaller files grouped by functional area so that it'll be easier in the long run to maintain cohesiveness of code units and to review incoming patches. There are no functional changes and fs/iomap.c split cleanly. Summary: - Regroup the fs/iomap.c code by major functional area so that we can start development for 5.4 from a more stable base" * tag 'iomap-5.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: move internal declarations into fs/iomap/ iomap: move the main iteration code into a separate file iomap: move the buffered IO code into a separate file iomap: move the direct IO code into a separate file iomap: move the SEEK_HOLE code into a separate file iomap: move the file mapping reporting code into a separate file iomap: move the swapfile code into a separate file iomap: start moving code to fs/iomap/