summaryrefslogtreecommitdiff
path: root/fs/btrfs/extent_io.c
AgeCommit message (Collapse)AuthorFilesLines
2022-05-16btrfs: remove unused bio_flags argument to btrfs_submit_metadata_bioChristoph Hellwig1-2/+1
This argument is unused since commit 953651eb308f ("btrfs: factor out helper adding a page to bio") and commit 1b36294a6cd5 ("btrfs: call submit_bio_hook directly for metadata pages") reworked the way metadata bio submission is handled. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: move btrfs_readpage to extent_io.cChristoph Hellwig1-2/+33
Keep btrfs_readpage next to btrfs_do_readpage and the other address space operations. This allows to keep submit_one_bio and struct btrfs_bio_ctrl file local in extent_io.c. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: return correct error number for __extent_writepage_io()Qu Wenruo1-1/+12
[BUG] If we hit an error from submit_extent_page() inside __extent_writepage_io(), we could still return 0 to the caller, and even trigger the warning in btrfs_page_assert_not_dirty(). [CAUSE] In __extent_writepage_io(), if we hit an error from submit_extent_page(), we will just clean up the range and continue. This is completely fine for regular PAGE_SIZE == sectorsize, as we can only hit one sector in one page, thus after the error we're ensured to exit and @ret will be saved. But for subpage case, we may have other dirty subpage range in the page, and in the next loop, we may succeeded submitting the next range. In that case, @ret will be overwritten, and we return 0 to the caller, while we have hit some error. [FIX] Introduce @has_error and @saved_ret to record the first error we hit, so we will never forget what error we hit. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: fix the error handling for submit_extent_page() for btrfs_do_readpage()Qu Wenruo1-2/+6
[BUG] Test case generic/475 have a very high chance (almost 100%) to hit a fs hang, where a data page will never be unlocked and hang all later operations. [CAUSE] In btrfs_do_readpage(), if we hit an error from submit_extent_page() we will try to do the cleanup for our current io range, and exit. This works fine for PAGE_SIZE == sectorsize cases, but not for subpage. For subpage btrfs_do_readpage() will lock the full page first, which can contain several different sectors and extents: btrfs_do_readpage() |- begin_page_read() | |- btrfs_subpage_start_reader(); | Now the page will have PAGE_SIZE / sectorsize reader pending, | and the page is locked. | |- end_page_read() for different branches | This function will reduce subpage readers, and when readers | reach 0, it will unlock the page. But when submit_extent_page() failed, we only cleanup the current io range, while the remaining io range will never be cleaned up, and the page remains locked forever. [FIX] Update the error handling of submit_extent_page() to cleanup all the remaining subpage range before exiting the loop. Please note that, now submit_extent_page() can only fail due to sanity check in alloc_new_bio(). Thus regular IO errors are impossible to trigger the error path. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: avoid double clean up when submit_one_bio() failedQu Wenruo1-78/+37
[BUG] When running generic/475 with 64K page size and 4K sector size, it has a very high chance (almost 100%) to hang, with mostly data page locked but no one is going to unlock it. [CAUSE] With commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on reads"), if we failed to lookup checksum due to metadata IO error, we will return error for btrfs_submit_data_bio(). This will cause the page to be unlocked twice in btrfs_do_readpage(): btrfs_do_readpage() |- submit_extent_page() | |- submit_one_bio() | |- btrfs_submit_data_bio() | |- if (ret) { | |- bio->bi_status = ret; | |- bio_endio(bio); } | In the endio function, we will call end_page_read() | and unlock_extent() to cleanup the subpage range. | |- if (ret) { |- unlock_extent(); end_page_read() } Here we unlock the extent and cleanup the subpage range again. For unlock_extent(), it's mostly double unlock safe. But for end_page_read(), it's not, especially for subpage case, as for subpage case we will call btrfs_subpage_end_reader() to reduce the reader number, and use that to number to determine if we need to unlock the full page. If double accounted, it can underflow the number and leave the page locked without anyone to unlock it. [FIX] The commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on reads") itself is completely fine, it's our existing code not properly handling the error from bio submission hook properly. This patch will make submit_one_bio() to return void so that the callers will never be able to do cleanup when bio submission hook fails. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: simplify parameters of submit_read_repair() and renameQu Wenruo1-12/+16
Cleanup the function submit_read_repair() by: - Remove the fixed argument submit_bio_hook() The function is only called on buffered data read path, so the @submit_bio_hook argument is always btrfs_submit_data_bio(). Since it's fixed, then there is no need to pass that argument at all. - Rename the function to submit_data_read_repair() Just to be more explicit on all the 3 things, data, read and repair. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: pass a block_device to btrfs_bio_cloneChristoph Hellwig1-2/+2
Pass the block_device to bio_alloc_clone instead of setting it later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: use on-stack bio in repair_io_failureChristoph Hellwig1-28/+24
The I/O in repair_io_failue is synchronous and doesn't need a btrfs_bio, so just use an on-stack bio. Also cleanup the error handling to use goto labels and not discard the actual return values. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: check-integrity: split submit_bio from btrfsic checkingChristoph Hellwig1-1/+2
Require a separate call to the integrity checking helpers from the actual bio submission. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: wait between incomplete batch memory allocationsSweet Tea Dorminy1-0/+6
When allocating memory in a loop, each iteration should call memalloc_retry_wait() in order to prevent starving memory-freeing processes (and to mark where allocation loops are). Other filesystems do that as well. The bulk page allocation is the only place in btrfs with an allocation retry loop, so add an appropriate call to it. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: allocate page arrays using bulk page allocatorSweet Tea Dorminy1-8/+11
While calling alloc_page() in a loop is an effective way to populate an array of pages, the MM subsystem provides a method to allocate pages in bulk. alloc_pages_bulk_array() populates the NULL slots in a page array, trying to grab more than one page at a time. Unfortunately, it doesn't guarantee allocating all slots in the array, but it's easy to call it in a loop and return an error if no progress occurs. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: factor out allocating an array of pagesSweet Tea Dorminy1-16/+50
Several functions currently populate an array of page pointers one allocated page at a time. Factor out the common code so as to allow improvements to all of the sites at once. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routineQu Wenruo1-35/+67
The reason why we only support 64K page size for subpage is, for 64K page size we can ensure no matter what the nodesize is, we can fit it into one page. When other page size come, especially like 16K, the limitation is a bit limiting. To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the non-subpage routine. By this, we can allow 4K sectorsize on 16K page size. Although this introduces another smaller limitation, the metadata can not cross page boundary, which is already met by most recent mkfs. Another small improvement is, we can avoid the overhead for metadata if nodesize >= PAGE_SIZE. For 4K sector size and 64K page size/node size, or 4K sector size and 16K page size/node size, we don't need to allocate extra memory for the metadata pages. Please note that, this patch will not yet enable other page size support yet. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: warn when extent buffer leak test failsQu Wenruo1-0/+1
Although we have btrfs_extent_buffer_leak_debug_check() (enabled by CONFIG_BTRFS_DEBUG option) to detect and warn QA testers that we have some extent buffer leakage, it's just pr_err(), not noisy enough for fstests to cache. So here we trigger a WARN_ON() if the allocated_ebs list is not empty. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-26Merge tag 'for-5.18-rc4-tag' of ↵Linus Torvalds1-15/+29
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - direct IO fixes: - restore passing file offset to correctly calculate checksums when repairing on read and bio split happens - use correct bio when sumitting IO on zoned filesystem - zoned mode fixes: - fix selection of device to correctly calculate device capabilities when allocating a new bio - use a dedicated lock for exclusion during relocation - fix leaked plug after failure syncing log - fix assertion during scrub and relocation * tag 'for-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: use dedicated lock for data relocation btrfs: fix assertion failure during scrub due to block group reallocation btrfs: fix direct I/O writes for split bios on zoned devices btrfs: fix direct I/O read repair for split bios btrfs: fix and document the zoned device choice in alloc_new_bio btrfs: fix leaked plug after failure syncing log on zoned filesystems
2022-04-19btrfs: fix direct I/O read repair for split biosChristoph Hellwig1-0/+1
When a bio is split in btrfs_submit_direct, dip->file_offset contains the file offset for the first bio. But this means the start value used in btrfs_check_read_dio_bio is incorrect for subsequent bios. Add a file_offset field to struct btrfs_bio to pass along the correct offset. Given that check_data_csum only uses start of an error message this means problems with this miscalculation will only show up when I/O fails or checksums mismatch. The logic was removed in f4f39fc5dc30 ("btrfs: remove btrfs_bio::logical member") but we need it due to the bio splitting. CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-19btrfs: fix and document the zoned device choice in alloc_new_bioChristoph Hellwig1-15/+28
Zone Append bios only need a valid block device in struct bio, but not the device in the btrfs_bio. Use the information from btrfs_zoned_get_device to set up bi_bdev and fix zoned writes on multi-device file system with non-homogeneous capabilities and remove the pointless btrfs_bio.device assignment. Add big fat comments explaining what is going on here. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-26Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+0
Pull NVMe write streams removal from Jens Axboe: "This removes the write streams support in NVMe. No vendor ever really shipped working support for this, and they are not interested in supporting it. With the NVMe support gone, we have nothing in the tree that supports this. Remove passing around of the hints. The only discussion point in this patchset imho is the fact that the file specific write hint setting/getting fcntl helpers will now return -1/EINVAL like they did before we supported write hints. No known applications use these functions, I only know of one prototype that I help do for RocksDB, and that's not used. That said, with a change like this, it's always a bit controversial. Alternatively, we could just make them return 0 and pretend it worked. It's placement based hints after all" * tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block: fs: remove fs.f_write_hint fs: remove kiocb.ki_hint block: remove the per-bio/request write hint nvme: remove support or stream based temperature hint
2022-03-23Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds1-17/+18
Pull filesystem folio updates from Matthew Wilcox: "Primarily this series converts some of the address_space operations to take a folio instead of a page. Notably: - a_ops->is_partially_uptodate() takes a folio instead of a page and changes the type of the 'from' and 'count' arguments to make it obvious they're bytes. - a_ops->invalidatepage() becomes ->invalidate_folio() and has a similar type change. - a_ops->launder_page() becomes ->launder_folio() - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the address_space as an argument. There are a couple of other misc changes up front that weren't worth separating into their own pull request" * tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits) fs: Remove aops ->set_page_dirty fb_defio: Use noop_dirty_folio() fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio fs: Convert __set_page_dirty_buffers to block_dirty_folio nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio() mm: Convert swap_set_page_dirty() to swap_dirty_folio() ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio() btrfs: Convert extent_range_redirty_for_io() to use folios fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio btrfs: Convert from set_page_dirty to dirty_folio fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio() fs: Add aops->dirty_folio fs: Remove aops->launder_page orangefs: Convert launder_page to launder_folio nfs: Convert from launder_page to launder_folio fuse: Convert from launder_page to launder_folio ...
2022-03-22Merge tag 'for-5.18-tag' of ↵Linus Torvalds1-19/+26
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This contains feature updates, performance improvements, preparatory and core work and some related VFS updates: Features: - encoded read/write ioctls, allows user space to read or write raw data directly to extents (now compressed, encrypted in the future), will be used by send/receive v2 where it saves processing time - zoned mode now works with metadata DUP (the mkfs.btrfs default) - error message header updates: - print error state: transaction abort, other error, log tree errors - print transient filesystem state: remount, device replace, ignored checksum verifications - tree-checker: verify the transaction id of the to-be-written dirty extent buffer Performance improvements for fsync: - directory logging speedups (up to -90% run time) - avoid logging all directory changes during renames (up to -60% run time) - avoid inode logging during rename and link when possible (up to -60% run time) - prepare extents to be logged before locking a log tree path (throughput +7%) - stop copying old file extents when doing a full fsync() - improved logging of old extents after truncate Core, fixes: - improved stale device identification by dev_t and not just path (for devices that are behind other layers like device mapper) - continued extent tree v2 preparatory work - disable features that won't work yet - add wrappers and abstractions for new tree roots - improved error handling - add super block write annotations around background block group reclaim - fix device scanning messages potentially accessing stale pointer - cleanups and refactoring VFS: - allow reflinks/deduplication from two different mounts of the same filesystem - export and add helpers for read/write range verification, for the encoded ioctls" * tag 'for-5.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (98 commits) btrfs: zoned: put block group after final usage btrfs: don't access possibly stale fs_info data in device_list_add btrfs: add lockdep_assert_held to need_preemptive_reclaim btrfs: verify the tranisd of the to-be-written dirty extent buffer btrfs: unify the error handling of btrfs_read_buffer() btrfs: unify the error handling pattern for read_tree_block() btrfs: factor out do_free_extent_accounting helper btrfs: remove last_ref from the extent freeing code btrfs: add a alloc_reserved_extent helper btrfs: remove BUG_ON(ret) in alloc_reserved_tree_block btrfs: add and use helper for unlinking inode during log replay btrfs: extend locking to all space_info members accesses btrfs: zoned: mark relocation as writing fs: allow cross-vfsmount reflink/dedupe btrfs: remove the cross file system checks from remap btrfs: pass btrfs_fs_info to btrfs_recover_relocation btrfs: pass btrfs_fs_info for deleting snapshots and cleaner btrfs: add filesystems state details to error messages btrfs: deal with unexpected extent type during reflinking btrfs: fix unexpected error path when reflinking an inline extent ...
2022-03-22Merge tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds1-3/+3
Pull block updates from Jens Axboe: - BFQ cleanups and fixes (Yu, Zhang, Yahu, Paolo) - blk-rq-qos completion fix (Tejun) - blk-cgroup merge fix (Tejun) - Add offline error return value to distinguish it from an IO error on the device (Song) - IO stats fixes (Zhang, Christoph) - blkcg refcount fixes (Ming, Yu) - Fix for indefinite dispatch loop softlockup (Shin'ichiro) - blk-mq hardware queue management improvements (Ming) - sbitmap dead code removal (Ming, John) - Plugging merge improvements (me) - Show blk-crypto capabilities in sysfs (Eric) - Multiple delayed queue run improvement (David) - Block throttling fixes (Ming) - Start deprecating auto module loading based on dev_t (Christoph) - bio allocation improvements (Christoph, Chaitanya) - Get rid of bio_devname (Christoph) - bio clone improvements (Christoph) - Block plugging improvements (Christoph) - Get rid of genhd.h header (Christoph) - Ensure drivers use appropriate flush helpers (Christoph) - Refcounting improvements (Christoph) - Queue initialization and teardown improvements (Ming, Christoph) - Misc fixes/improvements (Barry, Chaitanya, Colin, Dan, Jiapeng, Lukas, Nian, Yang, Eric, Chengming) * tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block: (127 commits) block: cancel all throttled bios in del_gendisk() block: let blkcg_gq grab request queue's refcnt block: avoid use-after-free on throttle data block: limit request dispatch loop duration block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative" sr: simplify the local variable initialization in sr_block_open() block: don't merge across cgroup boundaries if blkcg is enabled block: fix rq-qos breakage from skipping rq_qos_done_bio() block: flush plug based on hardware and software queue order block: ensure plug merging checks the correct queue at least once block: move rq_qos_exit() into disk_release() block: do more work in elevator_exit block: move blk_exit_queue into disk_release block: move q_usage_counter release into blk_queue_release block: don't remove hctx debugfs dir from blk_mq_exit_queue block: move blkcg initialization/destroy into disk allocation/release handler sr: implement ->free_disk to simplify refcounting sd: implement ->free_disk to simplify refcounting sd: delay calling free_opal_dev sd: call sd_zbc_release_disk before releasing the scsi_device reference ...
2022-03-15btrfs: Convert extent_range_redirty_for_io() to use foliosMatthew Wilcox (Oracle)1-7/+7
This removes a call to __set_page_dirty_nobuffers(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15btrfs: Convert from invalidatepage to invalidate_folioMatthew Wilcox (Oracle)1-8/+8
A lot of the underlying infrastructure in btrfs needs to be switched over to folios, but this at least documents that invalidatepage can't be passed a tail page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15btrfs: Use folio_invalidate()Matthew Wilcox (Oracle)1-2/+3
Instead of calling ->invalidatepage directly, use folio_invalidate(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-14btrfs: zoned: put block group after final usageNikolay Borisov1-3/+4
It's counter-intuitive (and wrong) to put the block group _before_ the final usage in submit_eb_page. Fix it by re-ordering the call to btrfs_put_block_group after its final reference. Also fix a minor typo in 'implies' Fixes: be1a1d7a5d24 ("btrfs: zoned: finish fully written block group") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: do not clean up repair bio if submit failsJosef Bacik1-8/+7
The submit helper will always run bio_endio() on the bio if it fails to submit, so cleaning up the bio just leads to a variety of use-after-free and NULL pointer dereference bugs because we race with the endio function that is cleaning up the bio. Instead just return BLK_STS_OK as the repair function has to continue to process the rest of the pages, and the endio for the repair bio will do the appropriate cleanup for the page that it was given. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: do not try to repair bio that has no mirror setJosef Bacik1-0/+9
If we fail to submit a bio for whatever reason, we may not have setup a mirror_num for that bio. This means we shouldn't try to do the repair workflow, if we do we'll hit an BUG_ON(!failrec->this_mirror) in clean_io_failure. Instead simply skip the repair workflow if we have no mirror set, and add an assert to btrfs_check_repairable() to make it easier to catch what is happening in the future. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: remove no longer used counter when reading data pageFilipe Manana1-4/+1
After commit 92082d40976ed0 ("btrfs: integrate page status update for data read path into begin/end_page_read"), the 'nr' counter at btrfs_do_readpage() is no longer used, we increment it but we never read from it. So just remove it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: fix lost error return value when reading a data pageFilipe Manana1-0/+1
At btrfs_do_readpage(), if we get an error when trying to lookup for an extent map, we end up marking the page with the error bit, clearing the uptodate bit on it, and doing everything else that should be done. However we return success (0) to the caller, when we should return the error encoded in the extent map pointer. So fix that by returning the error encoded in the pointer. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: stop checking for NULL return from btrfs_get_extent()Filipe Manana1-3/+3
At extent_io.c, in the read page and write page code paths, we are testing if the return value from btrfs_get_extent() can be NULL. However that is not possible, as btrfs_get_extent() always returns either an error pointer or a (non-NULL) pointer to an extent map structure. Everywhere else outside extent_io.c we never check for NULL, we always treat any returned value as a non-NULL pointer if it does not encode an error. So check only for the IS_ERR() case at extent_io.c. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: stop checking for NULL return from btrfs_get_extent_fiemap()Johannes Thumshirn1-1/+1
In get_extent_skip_holes() we're checking the return of btrfs_get_extent_fiemap() for an error pointer or NULL, but btrfs_get_extent_fiemap() will never return NULL, only error pointers or a valid extent_map. The other caller of btrfs_get_extent_fiemap(), find_desired_extent(), correctly only checks for error-pointers. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-07block: remove the per-bio/request write hintChristoph Hellwig1-1/+0
With the NVMe support for this gone, there are no consumers of these hints left, so remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-07Merge branch 'for-5.18/block' into for-5.18/write-streamsJens Axboe1-3/+3
* for-5.18/block: (96 commits) block: remove bio_devname ext4: stop using bio_devname raid5-ppl: stop using bio_devname raid1: stop using bio_devname md-multipath: stop using bio_devname dm-integrity: stop using bio_devname dm-crypt: stop using bio_devname pktcdvd: remove a pointless debug check in pkt_submit_bio block: remove handle_bad_sector block: fix and cleanup bio_check_ro bfq: fix use-after-free in bfq_dispatch_request blk-crypto: show crypto capabilities in sysfs block: don't delete queue kobject before its children block: simplify calling convention of elv_unregister_queue() block: remove redundant semicolon block: default BLOCK_LEGACY_AUTOLOAD to y block: update io_ticks when io hang block, bfq: don't move oom_bfqq block, bfq: avoid moving bfqq to it's parent bfqg block, bfq: cleanup bfq_bfqq_to_bfqg() ...
2022-03-06Merge tag 'for-5.17-rc6-tag' of ↵Linus Torvalds1-3/+13
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes for various problems that have user visible effects or seem to be urgent: - fix corruption when combining DIO and non-blocking io_uring over multiple extents (seen on MariaDB) - fix relocation crash due to premature return from commit - fix quota deadlock between rescan and qgroup removal - fix item data bounds checks in tree-checker (found on a fuzzed image) - fix fsync of prealloc extents after EOF - add missing run of delayed items after unlink during log replay - don't start relocation until snapshot drop is finished - fix reversed condition for subpage writers locking - fix warning on page error" * tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fallback to blocking mode when doing async dio over multiple extents btrfs: add missing run of delayed items after unlink during log replay btrfs: qgroup: fix deadlock between rescan worker and remove qgroup btrfs: fix relocation crash due to premature return from btrfs_commit_transaction() btrfs: do not start relocation until in progress drops are done btrfs: tree-checker: use u64 for item data end to avoid overflow btrfs: do not WARN_ON() if we have PageError set btrfs: fix lost prealloc extents beyond eof after full fsync btrfs: subpage: fix a wrong check on subpage->writers
2022-03-02btrfs: do not WARN_ON() if we have PageError setJosef Bacik1-3/+13
Whenever we do any extent buffer operations we call assert_eb_page_uptodate() to complain loudly if we're operating on an non-uptodate page. Our overnight tests caught this warning earlier this week WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50 CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G W 5.17.0-rc3+ #564 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Workqueue: btrfs-cache btrfs_work_helper RIP: 0010:assert_eb_page_uptodate+0x3f/0x50 RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246 RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000 RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0 RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000 R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1 R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000 FS: 0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0 Call Trace: extent_buffer_test_bit+0x3f/0x70 free_space_test_bit+0xa6/0xc0 load_free_space_tree+0x1f6/0x470 caching_thread+0x454/0x630 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? lock_release+0x1f0/0x2d0 btrfs_work_helper+0xf2/0x3e0 ? lock_release+0x1f0/0x2d0 ? finish_task_switch.isra.0+0xf9/0x3a0 process_one_work+0x26d/0x580 ? process_one_work+0x580/0x580 worker_thread+0x55/0x3b0 ? process_one_work+0x580/0x580 kthread+0xf0/0x120 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 This was partially fixed by c2e39305299f01 ("btrfs: clear extent buffer uptodate when we fail to write it"), however all that fix did was keep us from finding extent buffers after a failed writeout. It didn't keep us from continuing to use a buffer that we already had found. In this case we're searching the commit root to cache the block group, so we can start committing the transaction and switch the commit root and then start writing. After the switch we can look up an extent buffer that hasn't been written yet and start processing that block group. Then we fail to write that block out and clear Uptodate on the page, and then we start spewing these errors. Normally we're protected by the tree lock to a certain degree here. If we read a block we have that block read locked, and we block the writer from locking the block before we submit it for the write. However this isn't necessarily fool proof because the read could happen before we do the submit_bio and after we locked and unlocked the extent buffer. Also in this particular case we have path->skip_locking set, so that won't save us here. We'll simply get a block that was valid when we read it, but became invalid while we were using it. What we really want is to catch the case where we've "read" a block but it's not marked Uptodate. On read we ClearPageError(), so if we're !Uptodate and !Error we know we didn't do the right thing for reading the page. Fix this by checking !Uptodate && !Error, this way we will not complain if our buffer gets invalidated while we're using it, and we'll maintain the spirit of the check which is to make sure we have a fully in-cache block while we're messing with it. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-04block: pass a block_device to bio_clone_fastChristoph Hellwig1-2/+2
Pass a block_device to bio_clone_fast and __bio_clone_fast and give the functions more suitable names. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220202160109.108149-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02block: pass a block_device and opf to bio_alloc_biosetChristoph Hellwig1-1/+1
Pass the block_device and operation that we plan to use this bio for to bio_alloc_bioset to optimize the assigment. NULL/0 can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Also move the gfp_mask argument after the nr_vecs argument for a much more logical calling convention matching what most of the kernel does. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220124091107.642561-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-22mm: remove cleancacheChristoph Hellwig1-10/+0
Patch series "remove Xen tmem leftovers". Since the removal of the Xen tmem driver in 2019, the cleancache hooks are entirely unused, as are large parts of frontswap. This series against linux-next (with the folio changes included) removes cleancaches, and cuts down frontswap to the bits actually used by zswap. This patch (of 13): The cleancache subsystem is unused since the removal of Xen tmem driver in commit 814bbf49dcd0 ("xen: remove tmem driver"). [akpm@linux-foundation.org: remove now-unreachable code] Link: https://lkml.kernel.org/r/20211224062246.1258487-1-hch@lst.de Link: https://lkml.kernel.org/r/20211224062246.1258487-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-07btrfs: fix argument list that the kdoc format and script verifiedYang Li1-3/+2
The warnings were found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/btrfs/extent_io.c:3210: warning: Function parameter or member 'bio_ctrl' not described in 'btrfs_bio_add_page' fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'bio' description in 'btrfs_bio_add_page' fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'prev_bio_flags' description in 'btrfs_bio_add_page' fs/btrfs/space-info.c:1602: warning: Excess function parameter 'root' description in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1602: warning: Function parameter or member 'fs_info' not described in 'btrfs_reserve_metadata_bytes' Note: this is fixing only the warnings regarding parameter list, the first line is not strictly conforming to the kdoc format as the btrfs codebase does not stick to that and keeps the first line more free form (because it's only for internal use). Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: remove reada infrastructureQu Wenruo1-3/+0
Currently there is only one user for btrfs metadata readahead, and that's scrub. But even for the single user, it's not providing the correct functionality it needs, as scrub needs reada for commit root, which current readahead can't provide. (Although it's pretty easy to add such feature). Despite this, there are some extra problems related to metadata readahead: - Duplicated feature with btrfs_path::reada - Partly duplicated feature of btrfs_fs_info::buffer_radix Btrfs already caches its metadata in buffer_radix, while readahead tries to read the tree block no matter if it's already cached. - Poor layer separation Metadata readahead works kinda at device level. This is definitely not the correct layer it should be, since metadata is at btrfs logical address space, it should not bother device at all. This brings extra chance for bugs to sneak in, while brings unnecessary complexity. - Dead code In the very beginning of scrub.c we have #undef DEBUG, rendering all the debug related code useless and unable to test. Thus here I purpose to remove the metadata readahead mechanism completely. [BENCHMARK] There is a full benchmark for the scrub performance difference using the old btrfs_reada_add() and btrfs_path::reada. For the worst case (no dirty metadata, slow HDD), there could be a 5% performance drop for scrub. For other cases (even SATA SSD), there is no distinguishable performance difference. The number is reported scrub speed, in MiB/s. The resolution is limited by the reported duration, which only has a resolution of 1 second. Old New Diff SSD 455.3 466.332 +2.42% HDD 103.927 98.012 -5.69% Comprehensive test methodology is in the cover letter of the patch. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: zoned: drop redundant check for REQ_OP_ZONE_APPEND and btrfs_is_zonedJohannes Thumshirn1-3/+2
REQ_OP_ZONE_APPEND can only work on zoned devices, so it is redundant to check if the filesystem is zoned when REQ_OP_ZONE_APPEND is set as the bio's bio_op. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: zoned: sink zone check into btrfs_repair_one_zoneJohannes Thumshirn1-2/+2
Sink zone check into btrfs_repair_one_zone() so we don't need to do it in all callers. Also as btrfs_repair_one_zone() doesn't return a sensible error, make it a boolean function and return false in case it got called on a non-zoned filesystem and true on a zoned filesystem. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: zoned: encapsulate inode locking for zoned relocationJohannes Thumshirn1-6/+2
Encapsulate the inode lock needed for serializing the data relocation writes on a zoned filesystem into a helper. This streamlines the code reading flow and hides special casing for zoned filesystems. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: remove unnecessary @nr_written parametersQu Wenruo1-16/+11
We use @nr_written to record how many pages have been started by btrfs_run_delalloc_range(). Currently there are only two cases that would populate @nr_written: - Inline extent creation - Compressed write But both cases will also set @page_started to one. In fact, in writepage_delalloc() we have the following code, showing that @nr_written is really only utilized for above two cases: /* did the fill delalloc function already unlock and start * the IO? */ if (page_started) { /* * we've unlocked the page, so we can't update * the mapping's writeback index, just update * nr_to_write. */ wbc->nr_to_write -= nr_written; return 1; } But for such cases, writepage_delalloc() will return 1, and exit __extent_writepage() without going through __extent_writepage_io(). Thus this means, inside __extent_writepage_io(), we always get @nr_written as 0. So this patch is going to remove the unnecessary parameter from the following functions: - writepage_delalloc() As @nr_written passed in is always the initial value 0. Although inside that function, we still need a local @nr_written to update wbc->nr_to_write. - __extent_writepage_io() As explained above, @nr_written passed in can only be 0. This also means we can remove one update_nr_written() call. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-15btrfs: check WRITE_ERR when trying to read an extent bufferJosef Bacik1-0/+8
Filipe reported a hang when we have errors on btrfs. This turned out to be a side-effect of my fix c2e39305299f01 ("btrfs: clear extent buffer uptodate when we fail to write it") which made it so we clear EXTENT_BUFFER_UPTODATE on an eb when we fail to write it out. Below is a paste of Filipe's analysis he got from using drgn to debug the hang """ btree readahead code calls read_extent_buffer_pages(), sets ->io_pages to a value while writeback of all pages has not yet completed: --> writeback for the first 3 pages finishes, we clear EXTENT_BUFFER_UPTODATE from eb on the first page when we get an error. --> at this point eb->io_pages is 1 and we cleared Uptodate bit from the first 3 pages --> read_extent_buffer_pages() does not see EXTENT_BUFFER_UPTODATE() so it continues, it's able to lock the pages since we obviously don't hold the pages locked during writeback --> read_extent_buffer_pages() then computes 'num_reads' as 3, and sets eb->io_pages to 3, since only the first page does not have Uptodate bit set at this point --> writeback for the remaining page completes, we ended decrementing eb->io_pages by 1, resulting in eb->io_pages == 2, and therefore never calling end_extent_buffer_writeback(), so EXTENT_BUFFER_WRITEBACK remains in the eb's flags --> of course, when the read bio completes, it doesn't and shouldn't call end_extent_buffer_writeback() --> we should clear EXTENT_BUFFER_UPTODATE only after all pages of the eb finished writeback? or maybe make the read pages code wait for writeback of all pages of the eb to complete before checking which pages need to be read, touch ->io_pages, submit read bio, etc writeback bit never cleared means we can hang when aborting a transaction, at: btrfs_cleanup_one_transaction() btrfs_destroy_marked_extents() wait_on_extent_buffer_writeback() """ This is a problem because our writes are not synchronized with reads in any way. We clear the UPTODATE flag and then we can easily come in and try to read the EB while we're still waiting on other bio's to complete. We have two options here, we could lock all the pages, and then check to see if eb->io_pages != 0 to know if we've already got an outstanding write on the eb. Or we can simply check to see if we have WRITE_ERR set on this extent buffer. We set this bit _before_ we clear UPTODATE, so if the read gets triggered because we aren't UPTODATE because of a write error we're guaranteed to have WRITE_ERR set, and in this case we can simply return -EIO. This will fix the reported hang. Reported-by: Filipe Manana <fdmanana@suse.com> Fixes: c2e39305299f01 ("btrfs: clear extent buffer uptodate when we fail to write it") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-08btrfs: call mapping_set_error() on btree inode with a write errorJosef Bacik1-0/+8
generic/484 fails sometimes with compression on because the write ends up small enough that it goes into the btree. This means that we never call mapping_set_error() on the inode itself, because the page gets marked as fine when we inline it into the metadata. When the metadata writeback happens we see it and abort the transaction properly and mark the fs as readonly, however we don't do the mapping_set_error() on anything. In syncfs() we will simply return 0 if the sb is marked read-only, so we can't check for this in our syncfs callback. The only way the error gets returned if we called mapping_set_error() on something. Fix this by calling mapping_set_error() on the btree inode mapping. This allows us to properly return an error on syncfs and pass generic/484 with compression on. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-08btrfs: clear extent buffer uptodate when we fail to write itJosef Bacik1-0/+6
I got dmesg errors on generic/281 on our overnight fstests. Looking at the history this happens occasionally, with errors like this WARNING: CPU: 0 PID: 673217 at fs/btrfs/extent_io.c:6848 assert_eb_page_uptodate+0x3f/0x50 CPU: 0 PID: 673217 Comm: kworker/u4:13 Tainted: G W 5.16.0-rc2+ #469 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Workqueue: btrfs-cache btrfs_work_helper RIP: 0010:assert_eb_page_uptodate+0x3f/0x50 RSP: 0018:ffffae598230bc60 EFLAGS: 00010246 RAX: 0017ffffc0002112 RBX: ffffebaec4100900 RCX: 0000000000001000 RDX: ffffebaec45733c7 RSI: ffffebaec4100900 RDI: ffff9fd98919f340 RBP: 0000000000000d56 R08: ffff9fd98e300000 R09: 0000000000000000 R10: 0001207370a91c50 R11: 0000000000000000 R12: 00000000000007b0 R13: ffff9fd98919f340 R14: 0000000001500000 R15: 0000000001cb0000 FS: 0000000000000000(0000) GS:ffff9fd9fbc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f549fcf8940 CR3: 0000000114908004 CR4: 0000000000370ef0 Call Trace: extent_buffer_test_bit+0x3f/0x70 free_space_test_bit+0xa6/0xc0 load_free_space_tree+0x1d6/0x430 caching_thread+0x454/0x630 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? lock_release+0x1f0/0x2d0 btrfs_work_helper+0xf2/0x3e0 ? lock_release+0x1f0/0x2d0 ? finish_task_switch.isra.0+0xf9/0x3a0 process_one_work+0x270/0x5a0 worker_thread+0x55/0x3c0 ? process_one_work+0x5a0/0x5a0 kthread+0x174/0x1a0 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 This happens because we're trying to read from a extent buffer page that is !PageUptodate. This happens because we will clear the page uptodate when we have an IO error, but we don't clear the extent buffer uptodate. If we do a read later and find this extent buffer we'll think its valid and not return an error, and then trip over this warning. Fix this by also clearing uptodate on the extent buffer when this happens, so that we get an error when we do a btrfs_search_slot() and find this block later. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: remove btrfs_bio::logical memberQu Wenruo1-1/+0
The member btrfs_bio::logical is only initialized by two call sites: - btrfs_repair_one_sector() No corresponding site to utilize it. - btrfs_submit_direct() The corresponding site to utilize it is btrfs_check_read_dio_bio(). However for btrfs_check_read_dio_bio(), we can grab the file_offset from btrfs_dio_private::file_offset directly. Thus it turns out we don't really need that btrfs_bio::logical member at all. For btrfs_bio, the logical bytenr can be fetched from its bio->bi_iter.bi_sector directly. So let's just remove the member to save 8 bytes for structure btrfs_bio. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: add a BTRFS_FS_ERROR helperJosef Bacik1-1/+1
We have a few flags that are inconsistently used to describe the fs in different states of failure. As of 5963ffcaf383 ("btrfs: always abort the transaction if we abort a trans handle") we will always set BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED and ERROR to see if things have gone wrong. Add a helper to check BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to use the helper. The TRANS_ABORTED bit check was added in af7227338135 ("Btrfs: clean up resources during umount after trans is aborted") but is not actually specific. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: subpage: avoid potential deadlock with compression and delallocQu Wenruo1-9/+27
[BUG] With experimental subpage compression enabled, a simple fsstress can lead to self deadlock on page 720896: mkfs.btrfs -f -s 4k $dev > /dev/null mount $dev -o compress $mnt $fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156 [CAUSE] If we have a file layout looks like below: 0 32K 64K 96K 128K |//| |///////////////| 4K Then we run delalloc range for the inode, it will: - Call find_lock_delalloc_range() with @delalloc_start = 0 Then we got a delalloc range [0, 4K). This range will be COWed. - Call find_lock_delalloc_range() again with @delalloc_start = 4K Since find_lock_delalloc_range() never cares whether the range is still inside page range [0, 64K), it will return range [64K, 128K). This range meets the condition for subpage compression, will go through async COW path. And async COW path will return @page_started. But that @page_started is now for range [64K, 128K), not for range [0, 64K). - writepage_dellloc() returned 1 for page [0, 64K) Thus page [0, 64K) will not be unlocked, nor its page dirty status will be cleared. Next time when we try to lock page [0, 64K) we will deadlock, as there is no one to release page [0, 64K). This problem will never happen for regular page size as one page only contains one sector. After the first find_lock_delalloc_range() call, the @delalloc_end will go beyond @page_end no matter if we found a delalloc range or not Thus this bug only happens for subpage, as now we need multiple runs to exhaust the delalloc range of a page. [FIX] Fix the problem by ensuring the delalloc range we ran at least started inside @locked_page. So that we will never get incorrect @page_started. And to prevent such problem from happening again: - Make find_lock_delalloc_range() return false if the found range is beyond @end value passed in. Since @end will be utilized now, add an ASSERT() to ensure we pass correct @end into find_lock_delalloc_range(). This also means, for selftests we needs to populate @end before calling find_lock_delalloc_range(). - New ASSERT() in find_lock_delalloc_range() Now we will make sure the @start/@end passed in at least covers part of the page. - New ASSERT() in run_delalloc_range() To make sure the range at least starts inside @locked page. - Use @delalloc_start as proper cursor, while @delalloc_end is always reset to @page_end. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>