summaryrefslogtreecommitdiff
path: root/fs/zonefs
AgeCommit message (Collapse)AuthorFilesLines
2023-01-24zonefs: Detect append writes at invalid locationsDamien Le Moal1-0/+22
commit a608da3bd730d718f2d3ebec1c26f9865f8f17ce upstream. Using REQ_OP_ZONE_APPEND operations for synchronous writes to sequential files succeeds regardless of the zone write pointer position, as long as the target zone is not full. This means that if an external (buggy) application writes to the zone of a sequential file underneath the file system, subsequent file write() operation will succeed but the file size will not be correct and the file will contain invalid data written by another application. Modify zonefs_file_dio_append() to check the written sector of an append write (returned in bio->bi_iter.bi_sector) and return -EIO if there is a mismatch with the file zone wp offset field. This change triggers a call to zonefs_io_error() and a zone check. Modify zonefs_io_error_cb() to not expose the unexpected data after the current inode size when the errors=remount-ro mode is used. Other error modes are correctly handled already. Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-02zonefs: fix zone report size in __zonefs_io_error()Damien Le Moal1-10/+27
[ Upstream commit 7dd12d65ac646046a3fe0bbf9a4e86f4514207b3 ] When an IO error occurs, the function __zonefs_io_error() is used to issue a zone report to obtain the latest zone information from the device. This function gets a zone report for all zones used as storage for a file, which is always 1 zone except for files representing aggregated conventional zones. The number of zones of a zone report for a file is calculated in __zonefs_io_error() by doing a bit-shift of the inode i_zone_size field, which is equal to or larger than the device zone size. However, this calculation does not take into account that the last zone of a zoned device may be smaller than the zone size reported by bdev_zone_sectors() (which is used to set the bit shift size). As a result, if an error occurs for an IO targetting such last smaller zone, the zone report will ask for 0 zones, leading to an invalid zone report. Fix this by using the fact that all files require a 1 zone report, except if the inode i_zone_size field indicates a zone size larger than the device zone size. This exception case corresponds to a mount with aggregated conventional zones. A check for this exception is added to the file inode initialization during mount. If an invalid setup is detected, emit an error and fail the mount (check contributed by Johannes Thumshirn). Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-25zonefs: fix zonefs_iomap_begin() for readsDamien Le Moal1-30/+62
commit c1c1204c0d0c1dccc1310b9277fb2bd8b663d8fe upstream. If a readahead is issued to a sequential zone file with an offset exactly equal to the current file size, the iomap type is set to IOMAP_UNWRITTEN, which will prevent an IO, but the iomap length is calculated as 0. This causes a WARN_ON() in iomap_iter(): [17309.548939] WARNING: CPU: 3 PID: 2137 at fs/iomap/iter.c:34 iomap_iter+0x9cf/0xe80 [...] [17309.650907] RIP: 0010:iomap_iter+0x9cf/0xe80 [...] [17309.754560] Call Trace: [17309.757078] <TASK> [17309.759240] ? lock_is_held_type+0xd8/0x130 [17309.763531] iomap_readahead+0x1a8/0x870 [17309.767550] ? iomap_read_folio+0x4c0/0x4c0 [17309.771817] ? lockdep_hardirqs_on_prepare+0x400/0x400 [17309.778848] ? lock_release+0x370/0x750 [17309.784462] ? folio_add_lru+0x217/0x3f0 [17309.790220] ? reacquire_held_locks+0x4e0/0x4e0 [17309.796543] read_pages+0x17d/0xb60 [17309.801854] ? folio_add_lru+0x238/0x3f0 [17309.807573] ? readahead_expand+0x5f0/0x5f0 [17309.813554] ? policy_node+0xb5/0x140 [17309.819018] page_cache_ra_unbounded+0x27d/0x450 [17309.825439] filemap_get_pages+0x500/0x1450 [17309.831444] ? filemap_add_folio+0x140/0x140 [17309.837519] ? lock_is_held_type+0xd8/0x130 [17309.843509] filemap_read+0x28c/0x9f0 [17309.848953] ? zonefs_file_read_iter+0x1ea/0x4d0 [zonefs] [17309.856162] ? trace_contention_end+0xd6/0x130 [17309.862416] ? __mutex_lock+0x221/0x1480 [17309.868151] ? zonefs_file_read_iter+0x166/0x4d0 [zonefs] [17309.875364] ? filemap_get_pages+0x1450/0x1450 [17309.881647] ? __mutex_unlock_slowpath+0x15e/0x620 [17309.888248] ? wait_for_completion_io_timeout+0x20/0x20 [17309.895231] ? lock_is_held_type+0xd8/0x130 [17309.901115] ? lock_is_held_type+0xd8/0x130 [17309.906934] zonefs_file_read_iter+0x356/0x4d0 [zonefs] [17309.913750] new_sync_read+0x2d8/0x520 [17309.919035] ? __x64_sys_lseek+0x1d0/0x1d0 Furthermore, this causes iomap_readahead() to loop forever as iomap_readahead_iter() always returns 0, making no progress. Fix this by treating reads after the file size as access to holes, setting the iomap type to IOMAP_HOLE, the iomap addr to IOMAP_NULL_ADDR and using the length argument as is for the iomap length. To simplify the code with this change, zonefs_iomap_begin() is split into the read variant, zonefs_read_iomap_begin() and zonefs_read_iomap_ops, and the write variant, zonefs_write_iomap_begin() and zonefs_write_iomap_ops. Reported-by: Jorgen Hansen <Jorgen.Hansen@wdc.com> Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Jorgen Hansen <Jorgen.Hansen@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-14zonefs: fix handling of explicit_open option on mountDamien Le Moal1-5/+6
commit a2a513be7139b279f1b5b2cee59c6c4950c34346 upstream. Ignoring the explicit_open mount option on mount for devices that do not have a limit on the number of open zones must be done after the mount options are parsed and set in s_mount_opts. Move the check to ignore the explicit_open option after the call to zonefs_parse_options() in zonefs_fill_super(). Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-09zonefs: Clear inode information flags on inode creationDamien Le Moal1-0/+1
commit 694852ead287a3433126e7ebda397b242dc99624 upstream. Ensure that the i_flags field of struct zonefs_inode_info is cleared to 0 when initializing a zone file inode, avoiding seeing the flag ZONEFS_ZONE_OPEN being incorrectly set. Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-09zonefs: Fix management of open zonesDamien Le Moal1-5/+40
commit 1da18a296f5ba4f99429e62a7cf4fdbefa598902 upstream. The mount option "explicit_open" manages the device open zone resources to ensure that if an application opens a sequential file for writing, the file zone can always be written by explicitly opening the zone and accounting for that state with the s_open_zones counter. However, if some zones are already open when mounting, the device open zone resource usage status will be larger than the initial s_open_zones value of 0. Ensure that this inconsistency does not happen by closing any sequential zone that is open when mounting. Furthermore, with ZNS drives, closing an explicitly open zone that has not been written will change the zone state to "closed", that is, the zone will remain in an active state. Since this can then cause failures of explicit open operations on other zones if the drive active zone resources are exceeded, we need to make sure that the zone is not active anymore by resetting it instead of closing it. To address this, zonefs_zone_mgmt() is modified to change a REQ_OP_ZONE_CLOSE request into a REQ_OP_ZONE_RESET for sequential zones that have not been written. Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-22zonefs: add MODULE_ALIAS_FSNaohiro Aota1-0/+1
commit 8ffea2599f63fdbee968b894eab78170abf3ec2c upstream. Add MODULE_ALIAS_FS() to load the module automatically when you do "mount -t zonefs". Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Cc: stable <stable@vger.kernel.org> # 5.6+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-25zonefs: fix to update .i_wr_refcnt correctly in zonefs_open_zone()Chao Yu1-4/+3
commit 6980d29ce4da223ad7f0751c7f1d61d3c6b54ab3 upstream. In zonefs_open_zone(), if opened zone count is larger than .s_max_open_zones threshold, we missed to recover .i_wr_refcnt, fix this. Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close") Cc: <stable@vger.kernel.org> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-25zonefs: prevent use of seq files as swap fileDamien Le Moal1-0/+16
commit 1601ea068b886da1f8f8d4e18b9403e9e24adef6 upstream. The sequential write constraint of sequential zone file prevent their use as swap files. Only allow conventional zone files to be used as swap files. Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Cc: <stable@vger.kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-25zonefs: Fix O_APPEND async write handlingDamien Le Moal1-10/+68
commit ebfd68cd0c1e81267c757332385cb96df30dacce upstream. zonefs updates the size of a sequential zone file inode only on completion of direct writes. When executing asynchronous append writes (with a file open with O_APPEND or using RWF_APPEND), the use of the current inode size in generic_write_checks() to set an iocb offset thus leads to unaligned write if an application issues an append write operation with another write already being executed. Fix this problem by introducing zonefs_write_checks() as a modified version of generic_write_checks() using the file inode wp_offset for an append write iocb offset. Also introduce zonefs_write_check_limits() to replace generic_write_check_limits() call. This zonefs special helper makes sure that the maximum file limit used is the maximum size of the file being accessed. Since zonefs_write_checks() already truncates the iov_iter, the calls to iov_iter_truncate() in zonefs_file_dio_write() and zonefs_file_buffered_write() are removed. Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Cc: <stable@vger.kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04zonefs: Fix file size of zones in full conditionShin'ichiro Kawasaki1-0/+3
commit 059c01039c0185dbee7ed080f1f2bd22cb1e4dab upstream. Per ZBC/ZAC/ZNS specifications, write pointers may not have valid values when zones are in full condition. However, when zonefs mounts a zoned block device, zonefs refers write pointers to set file size even when the zones are in full condition. This results in wrong file size. To fix this, refer maximum file size in place of write pointers for zones in full condition. Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Cc: <stable@vger.kernel.org> # 5.6+ Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-17zonefs: select CONFIG_CRC32Arnd Bergmann1-0/+1
commit 4f8b848788f77c7f5c3bd98febce66b7aa14785f upstream. When CRC32 is disabled, zonefs cannot be linked: ld: fs/zonefs/super.o: in function `zonefs_fill_super': Add a Kconfig 'select' statement for it. Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-10zonefs: fix page reference and BIO leakDamien Le Moal1-6/+8
In zonefs_file_dio_append(), the pages obtained using bio_iov_iter_get_pages() are not released on completion of the REQ_OP_APPEND BIO, nor when bio_iov_iter_get_pages() fails. Furthermore, a call to bio_put() is missing when bio_iov_iter_get_pages() fails. Fix these resource leaks by adding BIO resource release code (bio_put()i and bio_release_pages()) at the end of the function after the BIO execution and add a jump to this resource cleanup code in case of bio_iov_iter_get_pages() failure. While at it, also fix the call to task_io_account_write() to be passed the correct BIO size instead of bio_iov_iter_get_pages() return value. Reported-by: Christoph Hellwig <hch@lst.de> Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-24Merge branch 'work.misc' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted stuff all over the place (the largest group here is Christoph's stat cleanups)" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: remove KSTAT_QUERY_FLAGS fs: remove vfs_stat_set_lookup_flags fs: move vfs_fstatat out of line fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat fs: remove vfs_statx_fd fs: omfs: use kmemdup() rather than kmalloc+memcpy [PATCH] reduce boilerplate in fsid handling fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS selftests: mount: add nosymfollow tests Add a "nosymfollow" mount option.
2020-09-18[PATCH] reduce boilerplate in fsid handlingAl Viro1-2/+1
Get rid of boilerplate in most of ->statfs() instances... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-15zonefs: open/close zone on file open/closeJohannes Thumshirn2-4/+189
NVMe Zoned Namespace introduced the concept of active zones, which are zones in the implicit open, explicit open or closed condition. Drives may have a limit on the number of zones that can be simultaneously active. This potential limitation translate into a risk for applications to see write IO errors due to this limit if the zone of a file being written to is not already active when a write request is issued. To avoid these potential errors, the zone of a file can explicitly be made active using an open zone command when the file is open for the first time. If the zone open command succeeds, the application is then guaranteed that write requests can be processed. This indirect management of active zones relies on the maximum number of open zones of a drive, which is always lower or equal to the maximum number of active zones. On the first open of a sequential zone file, send a REQ_OP_ZONE_OPEN command to the block device. Conversely, on the last release of a zone file and send a REQ_OP_ZONE_CLOSE to the device if the zone is not full or empty. As truncating a zone file to 0 or max can deactivate a zone as well, we need to serialize against truncates and also be careful not to close a zone as the file may still be open for writing, e.g. the user called ftruncate(). If the zone file is not open and a process does a truncate(), then no close operation is needed. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-09-15zonefs: provide no-lock zonefs_io_error variantJohannes Thumshirn1-3/+8
Subsequent patches need to call zonefs_io_error() with the i_truncate_mutex already held, so factor out the body of zonefs_io_error() into __zonefs_io_error() which can be called from with the i_truncate_mutex held. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-09-15zonefs: introduce helper for zone managementJohannes Thumshirn1-7/+22
Introduce a helper function for sending zone management commands to the block device. As zone management commands can change a zone write pointer position reflected in the size of the zone file, this function expects the truncate mutex to be held. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-08-11zonefs: add zone-capacity supportJohannes Thumshirn2-4/+15
In the zoned storage model, the sectors within a zone are typically all writeable. With the introduction of the Zoned Namespace (ZNS) Command Set in the NVM Express organization, the model was extended to have a specific writeable capacity. This zone capacity can be less than the overall zone size for a NVMe ZNS device or null_blk in zoned-mode. For other ZBC/ZAC devices the zone capacity is always equal to the zone size. Use the zone capacity field instead from blk_zone for determining the maximum inode size and inode blocks in zonefs. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-08-07Merge tag 'iomap-5.9-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-2/+5
Pull iomap updates from Darrick Wong: "The most notable changes are: - iomap no longer invalidates the page cache when performing a direct read, since doing so is unnecessary and the old directio code doesn't do that either. - iomap embraced the use of returning ENOTBLK from a direct write to trigger falling back to a buffered write since ext4 already did this and btrfs wants it for their port. - iomap falls back to buffered writes if we're doing a direct write and the page cache invalidation after the flush fails; this was necessary to handle a corner case in the btrfs port. - Remove email virus scanner detritus that was accidentally included in yesterday's pull request. Clearly I need(ed) to update my git branch checker scripts. :(" * tag 'iomap-5.9-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: fall back to buffered writes for invalidation failures xfs: use ENOTBLK for direct I/O to buffered I/O fallback iomap: Only invalidate page cache pages on direct IO writes iomap: Make sure iomap_end is called after iomap_begin
2020-08-05iomap: fall back to buffered writes for invalidation failuresChristoph Hellwig1-2/+5
Failing to invalid the page cache means data in incoherent, which is a very bad state for the system. Always fall back to buffered I/O through the page cache if we can't invalidate mappings. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> # for ext4 Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> # for gfs2 Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
2020-07-20zonefs: count pages after truncating the iteratorJohannes Thumshirn1-4/+4
Count pages after possibly truncating the iterator to the maximum zone append size, not before. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-07-20zonefs: Fix compilation warningDamien Le Moal1-3/+7
Avoid the compilation warning "Variable 'ret' is reassigned a value before the old one has been used." in zonefs_create_zgroup() by setting ret for the error path only if an error happens. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-06-04Merge tag 'zonefs-5.8-rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs Pull zonefs update from Damien Le Moal: "Only one patch in this pull request to cleanup handling of uuid using the import_uuid() helper, from Andy" * tag 'zonefs-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: zonefs: Replace uuid_copy() with import_uuid()
2020-06-03Merge tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-blockLinus Torvalds1-9/+73
Pull block updates from Jens Axboe: "Core block changes that have been queued up for this release: - Remove dead blk-throttle and blk-wbt code (Guoqing) - Include pid in blktrace note traces (Jan) - Don't spew I/O errors on wouldblock termination (me) - Zone append addition (Johannes, Keith, Damien) - IO accounting improvements (Konstantin, Christoph) - blk-mq hardware map update improvements (Ming) - Scheduler dispatch improvement (Salman) - Inline block encryption support (Satya) - Request map fixes and improvements (Weiping) - blk-iocost tweaks (Tejun) - Fix for timeout failing with error injection (Keith) - Queue re-run fixes (Douglas) - CPU hotplug improvements (Christoph) - Queue entry/exit improvements (Christoph) - Move DMA drain handling to the few drivers that use it (Christoph) - Partition handling cleanups (Christoph)" * tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits) block: mark bio_wouldblock_error() bio with BIO_QUIET blk-wbt: rename __wbt_update_limits to wbt_update_limits blk-wbt: remove wbt_update_limits blk-throttle: remove tg_drain_bios blk-throttle: remove blk_throtl_drain null_blk: force complete for timeout request blk-mq: drain I/O when all CPUs in a hctx are offline blk-mq: add blk_mq_all_tag_iter blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx blk-mq: use BLK_MQ_NO_TAG in more places blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG blk-mq: move more request initialization to blk_mq_rq_ctx_init blk-mq: simplify the blk_mq_get_request calling convention blk-mq: remove the bio argument to ->prepare_request nvme: force complete cancelled requests blk-mq: blk-mq: provide forced completion method block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds block: blk-crypto-fallback: remove redundant initialization of variable err block: reduce part_stat_lock() scope block: use __this_cpu_add() instead of access by smp_processor_id() ...
2020-06-02iomap: convert from readpages to readaheadMatthew Wilcox (Oracle)1-4/+3
Use the new readahead operation in iomap. Convert XFS and ZoneFS to use it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-26-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-22block: remove the error_sector argument to blkdev_issue_flushChristoph Hellwig1-1/+1
The argument isn't used by any caller, and drivers don't fill out bi_sector for flush requests either. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13zonefs: use REQ_OP_ZONE_APPEND for sync DIOJohannes Thumshirn1-8/+72
Synchronous direct I/O to a sequential write only zone can be issued using the new REQ_OP_ZONE_APPEND request operation. As dispatching multiple BIOs can potentially result in reordering, we cannot support asynchronous IO via this interface. We also can only dispatch up to queue_max_zone_append_sectors() via the new zone-append method and have to return a short write back to user-space in case an IO larger than queue_max_zone_append_sectors() has been issued. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-27zonefs: Replace uuid_copy() with import_uuid()Andy Shevchenko1-1/+1
There is a specific API to treat raw data as UUID, i.e. import_uuid(). Use it instead of uuid_copy() with explicit casting. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-03-25zonfs: Fix handling of read-only zonesDamien Le Moal1-7/+21
The write pointer of zones in the read-only consition is defined as invalid by the SCSI ZBC and ATA ZAC specifications. It is thus not possible to determine the correct size of a read-only zone file on mount. Fix this by handling read-only zones in the same manner as offline zones by disabling all accesses to the zone (read and write) and initializing the inode size of the read-only zone to 0). For zones found to be in the read-only condition at runtime, only disable write access to the zone and keep the size of the zone file to its last updated value to allow the user to recover previously written data. Also fix zonefs documentation file to reflect this change. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
2020-02-26zonefs: select FS_IOMAPJohannes Thumshirn1-0/+1
Zonefs makes use of iomap internally, so it should also select iomap in Kconfig. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-02-26zonefs: fix IOCB_NOWAIT handlingChristoph Hellwig1-4/+4
IOCB_NOWAIT can't just be ignored as it breaks applications expecting it not to block. Just refuse the operation as applications must handle that (e.g. by falling back to a thread pool). Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-02-07fs: New zonefs file systemDamien Le Moal4-0/+1641
zonefs is a very simple file system exposing each zone of a zoned block device as a file. Unlike a regular file system with zoned block device support (e.g. f2fs), zonefs does not hide the sequential write constraint of zoned block devices to the user. Files representing sequential write zones of the device must be written sequentially starting from the end of the file (append only writes). As such, zonefs is in essence closer to a raw block device access interface than to a full featured POSIX file system. The goal of zonefs is to simplify the implementation of zoned block device support in applications by replacing raw block device file accesses with a richer file API, avoiding relying on direct block device file ioctls which may be more obscure to developers. One example of this approach is the implementation of LSM (log-structured merge) tree structures (such as used in RocksDB and LevelDB) on zoned block devices by allowing SSTables to be stored in a zone file similarly to a regular file system rather than as a range of sectors of a zoned device. The introduction of the higher level construct "one file is one zone" can help reducing the amount of changes needed in the application as well as introducing support for different application programming languages. Zonefs on-disk metadata is reduced to an immutable super block to persistently store a magic number and optional feature flags and values. On mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration and populates the mount point with a static file tree solely based on this information. E.g. file sizes come from the device zone type and write pointer offset managed by the device itself. The zone files created on mount have the following characteristics. 1) Files representing zones of the same type are grouped together under a common sub-directory: * For conventional zones, the sub-directory "cnv" is used. * For sequential write zones, the sub-directory "seq" is used. These two directories are the only directories that exist in zonefs. Users cannot create other directories and cannot rename nor delete the "cnv" and "seq" sub-directories. 2) The name of zone files is the number of the file within the zone type sub-directory, in order of increasing zone start sector. 3) The size of conventional zone files is fixed to the device zone size. Conventional zone files cannot be truncated. 4) The size of sequential zone files represent the file's zone write pointer position relative to the zone start sector. Truncating these files is allowed only down to 0, in which case, the zone is reset to rewind the zone write pointer position to the start of the zone, or up to the zone size, in which case the file's zone is transitioned to the FULL state (finish zone operation). 5) All read and write operations to files are not allowed beyond the file zone size. Any access exceeding the zone size is failed with the -EFBIG error. 6) Creating, deleting, renaming or modifying any attribute of files and sub-directories is not allowed. 7) There are no restrictions on the type of read and write operations that can be issued to conventional zone files. Buffered, direct and mmap read & write operations are accepted. For sequential zone files, there are no restrictions on read operations, but all write operations must be direct IO append writes. mmap write of sequential files is not allowed. Several optional features of zonefs can be enabled at format time. * Conventional zone aggregation: ranges of contiguous conventional zones can be aggregated into a single larger file instead of the default one file per zone. * File ownership: The owner UID and GID of zone files is by default 0 (root) but can be changed to any valid UID/GID. * File access permissions: the default 640 access permissions can be changed. The mkzonefs tool is used to format zoned block devices for use with zonefs. This tool is available on Github at: git@github.com:damien-lemoal/zonefs-tools.git. zonefs-tools also includes a test suite which can be run against any zoned block device, including null_blk block device created with zoned mode. Example: the following formats a 15TB host-managed SMR HDD with 256 MB zones with the conventional zones aggregation feature enabled. $ sudo mkzonefs -o aggr_cnv /dev/sdX $ sudo mount -t zonefs /dev/sdX /mnt $ ls -l /mnt/ total 0 dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq The size of the zone files sub-directories indicate the number of files existing for each type of zones. In this example, there is only one conventional zone file (all conventional zones are aggregated under a single file). $ ls -l /mnt/cnv total 137101312 -rw-r----- 1 root root 140391743488 Nov 25 13:23 0 This aggregated conventional zone file can be used as a regular file. $ sudo mkfs.ext4 /mnt/cnv/0 $ sudo mount -o loop /mnt/cnv/0 /data The "seq" sub-directory grouping files for sequential write zones has in this example 55356 zones. $ ls -lv /mnt/seq total 14511243264 -rw-r----- 1 root root 0 Nov 25 13:23 0 -rw-r----- 1 root root 0 Nov 25 13:23 1 -rw-r----- 1 root root 0 Nov 25 13:23 2 ... -rw-r----- 1 root root 0 Nov 25 13:23 55354 -rw-r----- 1 root root 0 Nov 25 13:23 55355 For sequential write zone files, the file size changes as data is appended at the end of the file, similarly to any regular file system. $ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct 1+0 records in 1+0 records out 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s $ ls -l /mnt/seq/0 -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0 The written file can be truncated to the zone size, preventing any further write operation. $ truncate -s 268435456 /mnt/seq/0 $ ls -l /mnt/seq/0 -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0 Truncation to 0 size allows freeing the file zone storage space and restart append-writes to the file. $ truncate -s 0 /mnt/seq/0 $ ls -l /mnt/seq/0 -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0 Since files are statically mapped to zones on the disk, the number of blocks of a file as reported by stat() and fstat() indicates the size of the file zone. $ stat /mnt/seq/0 File: /mnt/seq/0 Size: 0 Blocks: 524288 IO Block: 4096 regular empty file Device: 870h/2160d Inode: 50431 Links: 1 Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2019-11-25 13:23:57.048971997 +0900 Modify: 2019-11-25 13:52:25.553805765 +0900 Change: 2019-11-25 13:52:25.553805765 +0900 Birth: - The number of blocks of the file ("Blocks") in units of 512B blocks gives the maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone size in this example. Of note is that the "IO block" field always indicates the minimum IO size for writes and corresponds to the device physical sector size. This code contains contributions from: * Johannes Thumshirn <jthumshirn@suse.de>, * Darrick J. Wong <darrick.wong@oracle.com>, * Christoph Hellwig <hch@lst.de>, * Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> and * Ting Yao <tingyao@hust.edu.cn>. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>