summaryrefslogtreecommitdiff
path: root/fs/zonefs/file.c
AgeCommit message (Collapse)AuthorFilesLines
2023-08-28Merge tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-1/+1
Pull iomap updates from Darrick Wong: "We've got some big changes for this release -- I'm very happy to be landing willy's work to enable large folios for the page cache for general read and write IOs when the fs can make contiguous space allocations, and Ritesh's work to track sub-folio dirty state to eliminate the write amplification problems inherent in using large folios. As a bonus, io_uring can now process write completions in the caller's context instead of bouncing through a workqueue, which should reduce io latency dramatically. IOWs, XFS should see a nice performance bump for both IO paths. Summary: - Make large writes to the page cache fill sparse parts of the cache with large folios, then use large memcpy calls for the large folio. - Track the per-block dirty state of each large folio so that a buffered write to a single byte on a large folio does not result in a (potentially) multi-megabyte writeback IO. - Allow some directio completions to be performed in the initiating task's context instead of punting through a workqueue. This will reduce latency for some io_uring requests" * tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits) iomap: support IOCB_DIO_CALLER_COMP io_uring/rw: add write support for IOCB_DIO_CALLER_COMP fs: add IOCB flags related to passing back dio completions iomap: add IOMAP_DIO_INLINE_COMP iomap: only set iocb->private for polled bio iomap: treat a write through cache the same as FUA iomap: use an unsigned type for IOMAP_DIO_* defines iomap: cleanup up iomap_dio_bio_end_io() iomap: Add per-block dirty state tracking to improve performance iomap: Allocate ifs in ->write_begin() early iomap: Refactor iomap_write_delalloc_punch() function out iomap: Use iomap_punch_t typedef iomap: Fix possible overflow condition in iomap_write_delalloc_scan iomap: Add some uptodate state handling helpers for ifs state bitmap iomap: Drop ifs argument from iomap_set_range_uptodate() iomap: Rename iomap_page to iomap_folio_state and others iomap: Copy larger chunks from userspace iomap: Create large folios in the buffered write path filemap: Allow __filemap_get_folio to allocate large folios filemap: Add fgf_t typedef ...
2023-08-10zonefs: fix synchronous direct writes to sequential filesDamien Le Moal1-108/+3
Commit 16d7fd3cfa72 ("zonefs: use iomap for synchronous direct writes") changes zonefs code from a self-built zone append BIO to using iomap for synchronous direct writes. This change relies on iomap submit BIO callback to change the write BIO built by iomap to a zone append BIO. However, this change overlooked the fact that a write BIO may be very large as it is split when issued. The change from a regular write to a zone append operation for the built BIO can result in a block layer warning as zone append BIO are not allowed to be split. WARNING: CPU: 18 PID: 202210 at block/bio.c:1644 bio_split+0x288/0x350 Call Trace: ? __warn+0xc9/0x2b0 ? bio_split+0x288/0x350 ? report_bug+0x2e6/0x390 ? handle_bug+0x41/0x80 ? exc_invalid_op+0x13/0x40 ? asm_exc_invalid_op+0x16/0x20 ? bio_split+0x288/0x350 bio_split_rw+0x4bc/0x810 ? __pfx_bio_split_rw+0x10/0x10 ? lockdep_unlock+0xf2/0x250 __bio_split_to_limits+0x1d8/0x900 blk_mq_submit_bio+0x1cf/0x18a0 ? __pfx_iov_iter_extract_pages+0x10/0x10 ? __pfx_blk_mq_submit_bio+0x10/0x10 ? find_held_lock+0x2d/0x110 ? lock_release+0x362/0x620 ? mark_held_locks+0x9e/0xe0 __submit_bio+0x1ea/0x290 ? __pfx___submit_bio+0x10/0x10 ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90 submit_bio_noacct_nocheck+0x675/0xa20 ? __pfx_bio_iov_iter_get_pages+0x10/0x10 ? __pfx_submit_bio_noacct_nocheck+0x10/0x10 iomap_dio_bio_iter+0x624/0x1280 __iomap_dio_rw+0xa22/0x18a0 ? lock_is_held_type+0xe3/0x140 ? __pfx___iomap_dio_rw+0x10/0x10 ? lock_release+0x362/0x620 ? zonefs_file_write_iter+0x74c/0xc80 [zonefs] ? down_write+0x13d/0x1e0 iomap_dio_rw+0xe/0x40 zonefs_file_write_iter+0x5ea/0xc80 [zonefs] do_iter_readv_writev+0x18b/0x2c0 ? __pfx_do_iter_readv_writev+0x10/0x10 ? inode_security+0x54/0xf0 do_iter_write+0x13b/0x7c0 ? lock_is_held_type+0xe3/0x140 vfs_writev+0x185/0x550 ? __pfx_vfs_writev+0x10/0x10 ? __handle_mm_fault+0x9bd/0x1c90 ? find_held_lock+0x2d/0x110 ? lock_release+0x362/0x620 ? find_held_lock+0x2d/0x110 ? lock_release+0x362/0x620 ? __up_read+0x1ea/0x720 ? do_pwritev+0x136/0x1f0 do_pwritev+0x136/0x1f0 ? __pfx_do_pwritev+0x10/0x10 ? syscall_enter_from_user_mode+0x22/0x90 ? lockdep_hardirqs_on+0x7d/0x100 do_syscall_64+0x58/0x80 This error depends on the hardware used, specifically on the max zone append bytes and max_[hw_]sectors limits. Tests using AMD Epyc machines that have low limits did not reveal this issue while runs on Intel Xeon machines with larger limits trigger it. Manually splitting the zone append BIO using bio_split_rw() can solve this issue but also requires issuing the fragment BIOs synchronously with submit_bio_wait(), to avoid potential reordering of the zone append BIO fragments, which would lead to data corruption. That is, this solution is not better than using regular write BIOs which are subject to serialization using zone write locking at the IO scheduler level. Given this, fix the issue by removing zone append support and using regular write BIOs for synchronous direct writes. This allows preseving the use of iomap and having identical synchronous and asynchronous sequential file write path. Zone append support will be reintroduced later through io_uring commands to ensure that the needed special handling is done correctly. Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: 16d7fd3cfa72 ("zonefs: use iomap for synchronous direct writes") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-07-25iomap: Add per-block dirty state tracking to improve performanceRitesh Harjani (IBM)1-1/+1
When filesystem blocksize is less than folio size (either with mapping_large_folio_support() or with blocksize < pagesize) and when the folio is uptodate in pagecache, then even a byte write can cause an entire folio to be written to disk during writeback. This happens because we currently don't have a mechanism to track per-block dirty state within struct iomap_folio_state. We currently only track uptodate state. This patch implements support for tracking per-block dirty state in iomap_folio_state->state bitmap. This should help improve the filesystem write performance and help reduce write amplification. Performance testing of below fio workload reveals ~16x performance improvement using nvme with XFS (4k blocksize) on Power (64K pagesize) FIO reported write bw scores improved from around ~28 MBps to ~452 MBps. 1. <test_randwrite.fio> [global] ioengine=psync rw=randwrite overwrite=1 pre_read=1 direct=0 bs=4k size=1G dir=./ numjobs=8 fdatasync=1 runtime=60 iodepth=64 group_reporting=1 [fio-run] 2. Also our internal performance team reported that this patch improves their database workload performance by around ~83% (with XFS on Power) Reported-by: Aravinda Herle <araherle@in.ibm.com> Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-06-28Merge tag 'mm-stable-2023-06-24-19-15' of ↵Linus Torvalds1-3/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ...
2023-06-26Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linuxLinus Torvalds1-1/+39
Pull splice updates from Jens Axboe: "This kills off ITER_PIPE to avoid a race between truncate, iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio with unpinned/unref'ed pages from an O_DIRECT splice read. This causes memory corruption. Instead, we either use (a) filemap_splice_read(), which invokes the buffered file reading code and splices from the pagecache into the pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads into it and then pushes the filled pages into the pipe; or (c) handle it in filesystem-specific code. Summary: - Rename direct_splice_read() to copy_splice_read() - Simplify the calculations for the number of pages to be reclaimed in copy_splice_read() - Turn do_splice_to() into a helper, vfs_splice_read(), so that it can be used by overlayfs and coda to perform the checks on the lower fs - Make vfs_splice_read() jump to copy_splice_read() to handle direct-I/O and DAX - Provide shmem with its own splice_read to handle non-existent pages in the pagecache. We don't want a ->read_folio() as we don't want to populate holes, but filemap_get_pages() requires it - Provide overlayfs with its own splice_read to call down to a lower layer as overlayfs doesn't provide ->read_folio() - Provide coda with its own splice_read to call down to a lower layer as coda doesn't provide ->read_folio() - Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs and random files as they just copy to the output buffer and don't splice pages - Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3, ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation - Make cifs use filemap_splice_read() - Replace pointers to generic_file_splice_read() with pointers to filemap_splice_read() as DIO and DAX are handled in the caller; filesystems can still provide their own alternate ->splice_read() op - Remove generic_file_splice_read() - Remove ITER_PIPE and its paraphernalia as generic_file_splice_read was the only user" * tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits) splice: kdoc for filemap_splice_read() and copy_splice_read() iov_iter: Kill ITER_PIPE splice: Remove generic_file_splice_read() splice: Use filemap_splice_read() instead of generic_file_splice_read() cifs: Use filemap_splice_read() trace: Convert trace/seq to use copy_splice_read() zonefs: Provide a splice-read wrapper xfs: Provide a splice-read wrapper orangefs: Provide a splice-read wrapper ocfs2: Provide a splice-read wrapper ntfs3: Provide a splice-read wrapper nfs: Provide a splice-read wrapper f2fs: Provide a splice-read wrapper ext4: Provide a splice-read wrapper ecryptfs: Provide a splice-read wrapper ceph: Provide a splice-read wrapper afs: Provide a splice-read wrapper 9p: Add splice_read wrapper net: Make sock_splice_read() use copy_splice_read() by default tty, proc, kernfs, random: Use copy_splice_read() ...
2023-06-14zonefs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO methodChristoph Hellwig1-1/+1
Since commit a2ad63daa88b ("VFS: add FMODE_CAN_ODIRECT file flag") file systems can just set the FMODE_CAN_ODIRECT flag at open time instead of wiring up a dummy direct_IO method to indicate support for direct I/O. Do that for zonefs so that noop_direct_IO can eventually be removed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2023-06-14zonefs: use iomap for synchronous direct writesDamien Le Moal1-96/+110
Remove the function zonefs_file_dio_append() that is used to manually issue REQ_OP_ZONE_APPEND BIOs for processing synchronous direct writes and use iomap instead. To preserve the use of zone append operations for synchronous writes, different struct iomap_dio_ops are defined. For synchronous direct writes using zone append, zonefs_zone_append_dio_ops is introduced. The submit_bio operation of this structure is defined as the function zonefs_file_zone_append_dio_submit_io() which is used to change the BIO opreation for synchronous direct IO writes to REQ_OP_ZONE_APPEND. In order to preserve the write location check on completion of zone append BIOs, the end_io operation is also defined using the function zonefs_file_zone_append_dio_bio_end_io(). This check now relies on the zonefs_zone_append_bio structure, allocated together with zone append BIOs with a dedicated BIO set. This structure include the target inode of a zone append BIO as well as the target append offset location for the zone append operation. This is used to perform a check against bio->bi_iter.bi_sector when the BIO completes, without needing to use the zone information z_wpoffset field, thus removing the need for taking the inode truncate mutex. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
2023-06-10iomap: update ki_pos in iomap_file_buffered_writeChristoph Hellwig1-3/+1
All callers of iomap_file_buffered_write need to updated ki_pos, move it into common code. Link: https://lkml.kernel.org/r/20230601145904.1385409-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Damien Le Moal <dlemoal@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-24zonefs: Provide a splice-read wrapperDavid Howells1-1/+39
Provide a splice_read wrapper for zonefs. This does some checks before proceeding and locks the inode across the call to filemap_splice_read() and a size check in case of truncation. Splicing from direct I/O is handled by the caller. Signed-off-by: David Howells <dhowells@redhat.com> cc: Christoph Hellwig <hch@lst.de> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Darrick J. Wong <djwong@kernel.org> cc: linux-xfs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-block@vger.kernel.org cc: linux-mm@kvack.org Acked-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230522135018.2742245-26-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-30zonefs: Do not propagate iomap_dio_rw() ENOTBLK error to user spaceDamien Le Moal1-2/+12
The call to invalidate_inode_pages2_range() in __iomap_dio_rw() may fail, in which case -ENOTBLK is returned and this error code is propagated back to user space trhough iomap_dio_rw() -> zonefs_file_dio_write() return chain. This error code is fairly obscure and may confuse the user. Avoid this and be consistent with the behavior of zonefs_file_dio_append() for similar invalidate_inode_pages2_range() errors by returning -EBUSY to user space when iomap_dio_rw() returns -ENOTBLK. Suggested-by: Christoph Hellwig <hch@infradead.org> Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
2023-03-30zonefs: Always invalidate last cached page on append writeDamien Le Moal1-0/+14
When a direct append write is executed, the append offset may correspond to the last page of a sequential file inode which might have been cached already by buffered reads, page faults with mmap-read or non-direct readahead. To ensure that the on-disk and cached data is consistant for such last cached page, make sure to always invalidate it in zonefs_file_dio_append(). If the invalidation fails, return -EBUSY to userspace to differentiate from IO errors. This invalidation will always be a no-op when the FS block size (device zone write granularity) is equal to the page size (e.g. 4K). Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com> Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
2023-03-21zonefs: Fix error message in zonefs_file_dio_append()Damien Le Moal1-1/+1
Since the expected write location in a sequential file is always at the end of the file (append write), when an invalid write append location is detected in zonefs_file_dio_append(), print the invalid written location instead of the expected write location. Fixes: a608da3bd730 ("zonefs: Detect append writes at invalid locations") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
2023-03-21zonefs: Prevent uninitialized symbol 'size' warningDamien Le Moal1-1/+1
In zonefs_file_dio_append(), initialize the variable size to 0 to prevent compilation and static code analizers warning such as: New smatch warnings: fs/zonefs/file.c:441 zonefs_file_dio_append() error: uninitialized symbol 'size'. The warning is a false positive as size is never actually used uninitialized. No functional change. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Link: https://lore.kernel.org/r/202303191227.GL8Dprbi-lkp@intel.com/ Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
2023-01-23zonefs: Separate zone information from inode informationDamien Le Moal1-44/+55
In preparation for adding dynamic inode allocation, separate an inode zone information from the zonefs inode structure. The new data structure zonefs_zone is introduced to store in memory information about a zone that must be kept throughout the lifetime of the device mount. Linking between a zone file inode and its zone information is done by setting the inode i_private field to point to a struct zonefs_zone. Using the i_private pointer avoids the need for adding a pointer in struct zonefs_inode_info. Beside the vfs inode, this structure is reduced to a mutex and a write open counter. One struct zonefs_zone is created per file inode on mount. These structures are organized in an array using the new struct zonefs_zone_group data structure to represent zone groups. The zonefs_zone arrays are indexed per file number (the index of a struct zonefs_zone in its array directly gives the file number/name for that zone file inode). Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
2023-01-23zonefs: Reduce struct zonefs_inode_info sizeDamien Le Moal1-21/+14
Instead of using the i_ztype field in struct zonefs_inode_info to indicate the zone type of an inode, introduce the new inode flag ZONEFS_ZONE_CNV to be set in the i_flags field of struct zonefs_inode_info to identify conventional zones. If this flag is not set, the zone of an inode is considered to be a sequential zone. The helpers zonefs_zone_is_cnv(), zonefs_zone_is_seq(), zonefs_inode_is_cnv() and zonefs_inode_is_seq() are introduced to simplify testing the zone type of a struct zonefs_inode_info and of a struct inode. Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
2023-01-23zonefs: Reorganize codeDamien Le Moal1-0/+874
Move all code related to zone file operations from super.c to the new file.c file. Inode and zone management code remains in super.c. Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>