summaryrefslogtreecommitdiff
path: root/fs/btrfs/subpage.c
AgeCommit message (Collapse)AuthorFilesLines
2024-01-19btrfs: don't unconditionally call folio_start_writeback in subpageJosef Bacik1-1/+2
In the normal case we check if a page is under writeback and skip it before we attempt to begin writeback. The exception is subpage metadata writes, where we know we don't have an eb under writeback and we're doing it one eb at a time. Since b5612c368648 ("mm: return void from folio_start_writeback() and related functions") we now will BUG_ON() if we call folio_start_writeback() on a folio that's already under writeback. Previously folio_start_writeback() would bail if writeback was already started. Fix this in the subpage code by checking if we have writeback set and skipping it if we do. This fixes the panic we were seeing on subpage. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: migrate subpage code to folio interfacesQu Wenruo1-161/+145
Although subpage itself is conflicting with higher folio, since subpage (sectorsize < PAGE_SIZE and nodesize < PAGE_SIZE) means we will never need higher order folio, there is a hidden pitfall: - btrfs_page_*() helpers Those helpers are an abstraction to handle both subpage and non-subpage cases, which means we're going to pass pages pointers to those helpers. And since those helpers are shared between data and metadata paths, it's unavoidable to let them to handle folios, including higher order folios). Meanwhile for true subpage case, we should only have a single page backed folios anyway, thus add a new ASSERT() for btrfs_subpage_assert() to ensure that. Also since those helpers are shared between both data and metadata, add some extra ASSERT()s for data path to make sure we only get single page backed folio for now. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: cleanup metadata page pointer usageQu Wenruo1-27/+28
Although we have migrated extent_buffer::pages[] to folios[], we're still mostly using the folio_page() help to grab the page. This patch would do the following cleanups for metadata: - Introduce num_extent_folios() helper This is to replace most num_extent_pages() callers. - Use num_extent_folios() to iterate future large folios This allows us to use things like bio_add_folio()/bio_add_folio_nofail(), and only set the needed flags for the folio (aka the leading/tailing page), which reduces the loop iteration to 1 for large folios. - Change metadata related functions to use folio pointers Including their function name, involving: * attach_extent_buffer_page() * detach_extent_buffer_page() * page_range_has_eb() * btrfs_release_extent_buffer_pages() * btree_clear_page_dirty() * btrfs_page_inc_eb_refs() * btrfs_page_dec_eb_refs() - Change btrfs_is_subpage() to accept an address_space pointer This is to allow both page->mapping and folio->mapping to be utilized. As data is still using the old per-page code, and may keep so for a while. - Special corner case place holder for future order mismatches between extent buffer and inode filemap For now it's just a block of comments and a dead ASSERT(), no real handling yet. The subpage code would still go page, just because subpage and large folio are conflicting conditions, thus we don't need to bother subpage with higher order folios at all. Just folio_page(folio, 0) would be enough. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor styling tweaks ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: migrate to use folio private instead of page privateQu Wenruo1-34/+60
As a cleanup and preparation for future folio migration, this patch would replace all page->private to folio version. This includes: - PagePrivate() -> folio_test_private() - page->private -> folio_get_private() - attach_page_private() -> folio_attach_private() - detach_page_private() -> folio_detach_private() Since we're here, also remove the forced cast on page->private, since it's (void *) already, we don't really need to do the cast. For now even if we missed some call sites, it won't cause any problem yet, as we're only using order 0 folio (single page), thus all those folio/page flags should be synced. But for the future conversion to utilize higher order folio, the page <-> folio flag sync is no longer guaranteed, thus we have to migrate to utilize folio flags. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: stop setting PageError in the data I/O pathChristoph Hellwig1-35/+0
PageError is not used by the VFS/MM and deprecated because it uses up a page bit and has no coherent rules. Instead read errors are usually propagated by not setting or clearing the uptodate bit, and write errors are propagated through the address_space. Btrfs now only sets the flag and never clears it for data pages, so just remove all places setting it, and the subpage error bit. Note that the error propagation for superblock writes that work on the block device mapping still uses PageError for now, but that will be addressed in a separate series. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: subpage: dump extra subpage bitmaps for debugQu Wenruo1-0/+42
There is a bug report that assert_eb_page_uptodate() gets triggered for free space tree metadata. Without proper dump for the subpage bitmaps it's much harder to debug. Thus this patch would dump all the subpage bitmaps (split them into their own bitmaps) for a easier debugging. The output would look like this: (Dumped after a tree block got read from disk) page:000000006e34bf49 refcount:4 mapcount:0 mapping:0000000067661ac4 index:0x1d1 pfn:0x110e9 memcg:ffff0000d7d62000 aops:btree_aops [btrfs] ino:1 flags: 0x8000000000002002(referenced|private|zone=2) page_type: 0xffffffff() raw: 8000000000002002 0000000000000000 dead000000000122 ffff00000188bed0 raw: 00000000000001d1 ffff0000c7992700 00000004ffffffff ffff0000d7d62000 page dumped because: btrfs subpage dump BTRFS warning (device dm-1): start=30490624 len=16384 page=30474240 bitmaps: uptodate=4-7 error= dirty= writeback= ordered= checked= Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: export bitmap_test_range_all_{set,zero}Naohiro Aota1-22/+0
bitmap_test_range_all_{set,zero} defined in subpage.c are useful for other components. Move them to misc.h and use them in zoned.c. Also, as find_next{,_zero}_bit take/return "unsigned long" instead of "unsigned int", convert the type to "unsigned long". While at it, also rewrite the "if (...) return true; else return false;" pattern and add const to the input bitmap. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move the printk helpers out of ctree.hJosef Bacik1-0/+1
We have a bunch of printk helpers that are in ctree.h. These have nothing to do with ctree.c, so move them into their own header. Subsequent patches will cleanup the printk helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-12btrfs: convert process_page_range() to use filemap_get_folios_contig()Vishal Moola (Oracle)1-1/+1
Converted function to use folios throughout. This is in preparation for the removal of find_get_pages_contig(). Now also supports large folios. Since we may receive more than nr_pages pages, nr_pages may underflow. Since nr_pages > 0 is equivalent to index <= end_index, we replaced it with this check instead. Also minor comment renaming for consistency in subpage. Link: https://lkml.kernel.org/r/20220824004023.77310-5-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: David Sterba <dsterb@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-25btrfs: remove extent writepage address space operationChristoph Hellwig1-1/+1
Same as in commit 21b4ee7029c9 ("xfs: drop ->writepage completely"): we can remove the callback as it's only used in one place - single page writeback from memory reclaim and is not called for cgroup writeback at all. We only allow such writeback from kswapd, not from direct memory reclaim, and so it is rarely used. When it comes from kswapd, it is effectively random dirty page shoot-down, which is horrible for IO patterns. We can rely on background writeback to clean all dirty pages in an efficient way and not let it be interrupted by kswapd. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: fix typos in commentsDavid Sterba1-1/+1
Codespell has found a few typos. Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: remove unnecessary type castsYu Zhe1-1/+1
Explicit type casts are not necessary when it's void* to another pointer type. Signed-off-by: Yu Zhe <yuzhe@nfschina.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routineQu Wenruo1-15/+38
The reason why we only support 64K page size for subpage is, for 64K page size we can ensure no matter what the nodesize is, we can fit it into one page. When other page size come, especially like 16K, the limitation is a bit limiting. To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the non-subpage routine. By this, we can allow 4K sectorsize on 16K page size. Although this introduces another smaller limitation, the metadata can not cross page boundary, which is already met by most recent mkfs. Another small improvement is, we can avoid the overhead for metadata if nodesize >= PAGE_SIZE. For 4K sector size and 64K page size/node size, or 4K sector size and 16K page size/node size, we don't need to allocate extra memory for the metadata pages. Please note that, this patch will not yet enable other page size support yet. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-02btrfs: subpage: fix a wrong check on subpage->writersQu Wenruo1-1/+1
[BUG] When looping btrfs/074 with 64K page size and 4K sectorsize, there is a low chance (1/50~1/100) to crash with the following ASSERT() triggered in btrfs_subpage_start_writer(): ret = atomic_add_return(nbits, &subpage->writers); ASSERT(ret == nbits); <<< This one <<< [CAUSE] With more debugging output on the parameters of btrfs_subpage_start_writer(), it shows a very concerning error: ret=29 nbits=13 start=393216 len=53248 For @nbits it's correct, but @ret which is the returned value from atomic_add_return(), it's not only larger than nbits, but also larger than max sectors per page value (for 64K page size and 4K sector size, it's 16). This indicates that some call sites are not properly decreasing the value. And that's exactly the case, in btrfs_page_unlock_writer(), due to the fact that we can have page locked either by lock_page() or process_one_page(), we have to check if the subpage has any writer. If no writers, it's locked by lock_page() and we only need to unlock it. But unfortunately the check for the writers are completely opposite: if (atomic_read(&subpage->writers)) /* No writers, locked by plain lock_page() */ return unlock_page(page); We directly unlock the page if it has writers, which is the completely opposite what we want. Thankfully the affected call site is only limited to extent_write_locked_range(), so it's mostly affecting compressed write. [FIX] Just fix the wrong check condition to fix the bug. Fixes: e55a0de18572 ("btrfs: rework page locking in __extent_writepage()") CC: stable@vger.kernel.org # 5.16 Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: handle page locking in btrfs_page_end_writer_lock with no writersQu Wenruo1-0/+10
There are several call sites of extent_clear_unlock_delalloc() which get @locked_page = NULL. So that extent_clear_unlock_delalloc() will try to call process_one_page() to unlock every page even the first page is not locked by btrfs_page_start_writer_lock(). This will trigger an ASSERT() in btrfs_subpage_end_and_test_writer() as previously we require every page passed to btrfs_subpage_end_and_test_writer() to be locked by btrfs_page_start_writer_lock(). But compression path doesn't go that way. Thankfully it's not hard to distinguish page locked by lock_page() and btrfs_page_start_writer_lock(). So do the check in btrfs_subpage_end_and_test_writer() so now it can handle both cases well. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: rework page locking in __extent_writepage()Qu Wenruo1-0/+43
Pages passed to __extent_writepage() are always locked, but they may be locked by different functions. There are two types of locked page for __extent_writepage(): - Page locked by plain lock_page() It should not have any subpage::writers count. Can be unlocked by unlock_page(). This is the most common locked page for __extent_writepage() called inside extent_write_cache_pages() or extent_write_full_page(). Rarer cases include the @locked_page from extent_write_locked_range(). - Page locked by lock_delalloc_pages() There is only one caller, all pages except @locked_page for extent_write_locked_range(). In this case, we have to call subpage helper to handle the case. So here we introduce a helper, btrfs_page_unlock_writer(), to allow __extent_writepage() to unlock different locked pages. And since for all other callers of __extent_writepage() their pages are ensured to be locked by lock_page(), also add an extra check for epd::extent_locked to unlock such pages directly. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: subpage: add bitmap for PageChecked flagQu Wenruo1-2/+45
Although in btrfs we have very limited usage of PageChecked flag, it's still some page flag not yet subpage compatible. Fix it by introducing btrfs_subpage::checked_offset to do the convert. For most call sites, especially for free-space cache, COW fixup and btrfs_invalidatepage(), they all work in full page mode anyway. For other call sites, they work as subpage compatible mode. Some call sites need extra modification: - btrfs_drop_pages() Needs extra parameter to get the real range we need to clear checked flag. Also since btrfs_drop_pages() will accept pages beyond the dirtied range, update btrfs_subpage_clamp_range() to handle such case by setting @len to 0 if the page is beyond target range. - btrfs_invalidatepage() We need to call subpage helper before calling __btrfs_releasepage(), or it will trigger ASSERT() as page->private will be cleared. - btrfs_verify_data_csum() In theory we don't need the io_bio->csum check anymore, but it's won't hurt. Just change the comment. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: subpage: pack all subpage bitmaps into a larger bitmapQu Wenruo1-45/+81
Currently we use u16 bitmap to make 4k sectorsize work for 64K page size. But this u16 bitmap is not large enough to contain larger page size like 128K, nor is space efficient for 16K page size. To handle both cases, here we pack all subpage bitmaps into a larger bitmap, now btrfs_subpage::bitmaps[] will be the ultimate bitmap for subpage usage. Each sub-bitmap will has its start bit number recorded in btrfs_subpage_info::*_start, and its bitmap length will be recorded in btrfs_subpage_info::bitmap_nr_bits. All subpage bitmap operations will be converted from using direct u16 operations to bitmap operations, with above *_start calculated. For 64K page size with 4K sectorsize, this should not cause much difference. While for 16K page size, we will only need 1 unsigned long (u32) to store all the bitmaps, which saves quite some space. Furthermore, this allows us to support larger page size like 128K and 258K. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25btrfs: subpage: introduce btrfs_subpage_bitmap_infoQu Wenruo1-0/+28
Currently we use fixed size u16 bitmap for subpage bitmap. This is fine for 4K sectorsize with 64K page size. But for 4K sectorsize and larger page size, the bitmap is too small, while for smaller page size like 16K, u16 bitmaps waste too much space. Here we introduce a new helper structure, btrfs_subpage_bitmap_info, to record the proper bitmap size, and where each bitmap should start at. By this, we can later compact all subpage bitmaps into one u32 bitmap. This patch is the first step. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25btrfs: subpage: make btrfs_alloc_subpage() return btrfs_subpage directlyQu Wenruo1-16/+19
The existing calling convention of btrfs_alloc_subpage() is pretty awful. Change it to a more common pattern by returning struct btrfs_subpage directly and let the caller to determine if the call succeeded. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25btrfs: subpage: only call btrfs_alloc_subpage() when sectorsize is smaller ↵Qu Wenruo1-2/+1
than PAGE_SIZE There are two call sites of btrfs_alloc_subpage(): - btrfs_attach_subpage() We have ensured sectorsize is smaller than PAGE_SIZE - alloc_extent_buffer() We call btrfs_alloc_subpage() unconditionally. The alloc_extent_buffer() forces us to check the sectorsize size against page size inside btrfs_alloc_subpage(). Since the function name, btrfs_alloc_subpage(), already indicates it should only get called for subpage cases, do the check in alloc_extent_buffer() and add an ASSERT() in btrfs_alloc_subpage(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: subpage: fix a potential use-after-free in writeback helperQu Wenruo1-1/+3
[BUG] There is a possible use-after-free bug when running generic/095. BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b Faulting instruction address: 0xc000000000283654 c000000000283078 do_raw_spin_unlock+0x88/0x230 c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90 c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0 c0000000009e0458 end_bio_extent_writepage+0x158/0x270 c000000000b6fd14 bio_endio+0x254/0x270 c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200 c000000000b6fd14 bio_endio+0x254/0x270 c000000000b781fc blk_update_request+0x46c/0x670 c000000000b8b394 blk_mq_end_request+0x34/0x1d0 c000000000d82d1c lo_complete_rq+0x11c/0x140 c000000000b880a4 blk_complete_reqs+0x84/0xb0 c0000000012b2ca4 __do_softirq+0x334/0x680 c0000000001dd878 irq_exit+0x148/0x1d0 c000000000016f4c do_IRQ+0x20c/0x240 c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0 [CAUSE] There is very small race window like the following in generic/095. Thread 1 | Thread 2 --------------------------------+------------------------------------ end_bio_extent_writepage() | btrfs_releasepage() |- spin_lock_irqsave() | | |- end_page_writeback() | | | | |- if (PageWriteback() ||...) | | |- clear_page_extent_mapped() | | |- kfree(subpage); |- spin_unlock_irqrestore(). The race can also happen between writeback and btrfs_invalidatepage(), although that would be much harder as btrfs_invalidatepage() has much more work to do before the clear_page_extent_mapped() call. [FIX] Here we "wait" for the subapge spinlock to be released before we detach subpage structure. So this patch will introduce a new function, wait_subpage_spinlock(), to do the "wait" by acquiring the spinlock and release it. Since the caller has ensured the page is not dirty nor writeback, and page is already locked, the only way to hold the subpage spinlock is from endio function. Thus we only need to acquire the spinlock to wait for any existing holder. Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: subpage: fix writeback which does not have ordered extentQu Wenruo1-0/+20
[BUG] When running fsstress with subpage RW support, there are random BUG_ON()s triggered with the following trace: kernel BUG at fs/btrfs/file-item.c:667! Internal error: Oops - BUG: 0 [#1] SMP CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43 Hardware name: Radxa ROCK Pi 4B (DT) Workqueue: btrfs-worker-high btrfs_work_helper [btrfs] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--) pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs] lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs] Call trace: btrfs_csum_one_bio+0x420/0x4e0 [btrfs] btrfs_submit_bio_start+0x20/0x30 [btrfs] run_one_async_start+0x28/0x44 [btrfs] btrfs_work_helper+0x128/0x1b4 [btrfs] process_one_work+0x22c/0x430 worker_thread+0x70/0x3a0 kthread+0x13c/0x140 ret_from_fork+0x10/0x30 [CAUSE] Above BUG_ON() means there is some bio range which doesn't have ordered extent, which indeed is worth a BUG_ON(). Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra subpage dirty bitmap to record which range is dirty and should be written back. This means, if we submit bio for a subpage range, we do not only need to clear page dirty, but also need to clear subpage dirty bits. In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for any range we submit a bio. But there is loophole, if we hit a range which is beyond i_size, we just call btrfs_writepage_endio_finish_ordered() to finish the ordered io, then break out, without clearing the subpage dirty. This means, if we hit above branch, the subpage dirty bits are still there, if other range of the page get dirtied and we need to writeback that page again, we will submit bio for the old range, leaving a wild bio range which doesn't have ordered extent. [FIX] Fix it by always calling btrfs_page_clear_dirty() in __extent_writepage_io(). Also to avoid such problem from happening again, add a new assert, btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage dirty bits are cleared before exiting __extent_writepage_io(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: subpage: fix a rare race between metadata endio and eb freeingQu Wenruo1-4/+15
[BUG] There is a very rare ASSERT() triggering during full fstests run for subpage rw support. No other reproducer so far. The ASSERT() gets triggered for metadata read in btrfs_page_set_uptodate() inside end_page_read(). [CAUSE] There is still a small race window for metadata only, the race could happen like this: T1 | T2 ------------------------------------+----------------------------- end_bio_extent_readpage() | |- btrfs_validate_metadata_buffer() | | |- free_extent_buffer() | | Still have 2 refs | |- end_page_read() | |- if (unlikely(PagePrivate()) | | The page still has Private | | | free_extent_buffer() | | | Only one ref 1, will be | | | released | | |- detach_extent_buffer_page() | | |- btrfs_detach_subpage() |- btrfs_set_page_uptodate() | The page no longer has Private| >>> ASSERT() triggered <<< | This race window is super small, thus pretty hard to hit, even with so many runs of fstests. But the race window is still there, we have to go another way to solve it other than relying on random PagePrivate() check. Data path is not affected, as it will lock the page before reading, while unlocking the page after the last read has finished, thus no race window. [FIX] This patch will fix the bug by repurposing btrfs_subpage::readers. Now btrfs_subpage::readers will be a member shared by both metadata and data. For metadata path, we don't do the page unlock as metadata only relies on extent locking. At the same time, teach page_range_has_eb() to take btrfs_subpage::readers into consideration. So that even if the last eb of a page gets freed, page::private won't be detached as long as there still are pending end_page_read() calls. By this we eliminate the race window, this will slight increase the metadata memory usage, as the page may not be released as frequently as usual. But it should not be a big deal. The code got introduced in ("btrfs: submit read time repair only for each corrupted sector"), but the fix is in a separate patch to keep the problem description and the crash is rare so it should not hurt bisectability. Signed-off-by: Qu Wegruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: introduce helpers for subpage ordered statusQu Wenruo1-0/+29
This patch introduces the following functions to handle btrfs subpage ordered (Private2) status: - btrfs_subpage_set_ordered() - btrfs_subpage_clear_ordered() - btrfs_subpage_test_ordered() These helpers can only be called when the range is ensured to be inside the page. - btrfs_page_set_ordered() - btrfs_page_clear_ordered() - btrfs_page_test_ordered() These helpers can handle both regular sector size and subpage without problem. These functions are here to coordinate btrfs_invalidatepage() with btrfs_writepage_endio_finish_ordered(), to make sure only one of those functions can finish the ordered extent. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: make process_one_page() to handle subpage lockingQu Wenruo1-12/+77
Introduce a new data inodes specific subpage member, writers, to record how many sectors are under page lock for delalloc writing. This member acts pretty much the same as readers, except it's only for delalloc writes. This is important for delalloc code to trace which page can really be freed, as we have cases like run_delalloc_nocow() where we may exit processing nocow range inside a page, but need to exit to do cow half way. In that case, we need a way to determine if we can really unlock a full page. With the new btrfs_subpage::writers, there is a new requirement: - Page locked by process_one_page() must be unlocked by process_one_page() There are still tons of call sites manually lock and unlock a page, without updating btrfs_subpage::writers. So if we lock a page through process_one_page() then it must be unlocked by process_one_page() to keep btrfs_subpage::writers consistent. This will be handled in next patch. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: provide btrfs_page_clamp_*() helpersQu Wenruo1-0/+38
In the coming subpage RW supports, there are a lot of page status update calls which need to be converted to subpage compatible version, which needs @start and @len. Some call sites already have such @start/@len and are already in page range, like various endio functions. But there are also call sites which need to clamp the range for subpage case, like btrfs_dirty_pagse() and __process_contig_pages(). Here we introduce new helpers, btrfs_page_clamp_*(), to do and only do the clamp for subpage version. Although in theory all existing btrfs_page_*() calls can be converted to use btrfs_page_clamp_*() directly, but that would make us to do unnecessary clamp operations. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: subpage: add overview commentsQu Wenruo1-0/+58
This patch adds an overview how btrfs subpage support works: - limitations - behavior - basic implementation points Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: subpage: introduce helpers for writeback statusQu Wenruo1-0/+30
Introduces the following functions to handle subpage writeback status: - btrfs_subpage_set_writeback() - btrfs_subpage_clear_writeback() - btrfs_subpage_test_writeback() These helpers can only be called when the range is ensured to be inside the page. - btrfs_page_set_writeback() - btrfs_page_clear_writeback() - btrfs_page_test_writeback() These helpers can handle both regular sector size and subpage without problem. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: subpage: introduce helpers for dirty statusQu Wenruo1-0/+52
Introduce the following functions to handle subpage dirty status: - btrfs_subpage_set_dirty() - btrfs_subpage_clear_dirty() - btrfs_subpage_test_dirty() These helpers can only be called when the range is ensured to be inside the page. - btrfs_page_set_dirty() - btrfs_page_clear_dirty() - btrfs_page_test_dirty() These helpers can handle both regular sector size and subpage without problem. Thus they would be used to replace PageDirty() related calls in later patches. There is one special point to note here, just like set_page_dirty() and clear_page_dirty_for_io(), btrfs_*page_set_dirty() and btrfs_*page_clear_dirty() must be called with page locked. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: integrate page status update for data read path into begin/end_page_readQu Wenruo1-10/+43
In btrfs data page read path, the page status update are handled in two different locations: btrfs_do_read_page() { while (cur <= end) { /* No need to read from disk */ if (HOLE/PREALLOC/INLINE){ memset(); set_extent_uptodate(); continue; } /* Read from disk */ ret = submit_extent_page(end_bio_extent_readpage); } end_bio_extent_readpage() { endio_readpage_uptodate_page_status(); } This is fine for sectorsize == PAGE_SIZE case, as for above loop we should only hit one branch and then exit. But for subpage, there is more work to be done in page status update: - Page Unlock condition Unlike regular page size == sectorsize case, we can no longer just unlock a page. Only the last reader of the page can unlock the page. This means, we can unlock the page either in the while() loop, or in the endio function. - Page uptodate condition Since we have multiple sectors to read for a page, we can only mark the full page uptodate if all sectors are uptodate. To handle both subpage and regular cases, introduce a pair of functions to help handling page status update: - begin_page_read() For regular case, it does nothing. For subpage case, it updates the reader counters so that later end_page_read() can know who is the last one to unlock the page. - end_page_read() This is just endio_readpage_uptodate_page_status() renamed. The original name is a little too long and too specific for endio. The new thing added is the condition for page unlock. Now for subpage data, we unlock the page if we're the last reader. This does not only provide the basis for subpage data read, but also hide the special handling of page read from the main read loop. Also, since we're changing how the page lock is handled, there are two existing error paths where we need to manually unlock the page before calling begin_page_read(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: introduce helpers for subpage error statusQu Wenruo1-0/+29
Introduce the following functions to handle subpage error status: - btrfs_subpage_set_error() - btrfs_subpage_clear_error() - btrfs_subpage_test_error() These helpers can only be called when the page has subpage attached and the range is ensured to be inside the page. - btrfs_page_set_error() - btrfs_page_clear_error() - btrfs_page_test_error() These helpers can handle both regular sector size and subpage without problem. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: introduce helpers for subpage uptodate statusQu Wenruo1-0/+113
Introduce the following functions to handle subpage uptodate status: - btrfs_subpage_set_uptodate() - btrfs_subpage_clear_uptodate() - btrfs_subpage_test_uptodate() These helpers can only be called when the page has subpage attached and the range is ensured to be inside the page. - btrfs_page_set_uptodate() - btrfs_page_clear_uptodate() - btrfs_page_test_uptodate() These helpers can handle both regular sector size and subpage. Although caller should still ensure that the range is inside the page. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: support subpage for extent buffer page releaseQu Wenruo1-0/+42
In btrfs_release_extent_buffer_pages(), we need to add extra handling for subpage. Introduce a helper, detach_extent_buffer_page(), to do different handling for regular and subpage cases. For subpage case, handle detaching page private. For unmapped (dummy or cloned) ebs, we can detach the page private immediately as the page can only be attached to one unmapped eb. For mapped ebs, we have to ensure there are no eb in the page range before we delete it, as page->private is shared between all ebs in the same page. But there is a subpage specific race, where we can race with extent buffer allocation, and clear the page private while new eb is still being utilized, like this: Extent buffer A is the new extent buffer which will be allocated, while extent buffer B is the last existing extent buffer of the page. T1 (eb A) | T2 (eb B) -------------------------------+------------------------------ alloc_extent_buffer() | btrfs_release_extent_buffer_pages() |- p = find_or_create_page() | | |- attach_extent_buffer_page() | | | | |- detach_extent_buffer_page() | | |- if (!page_range_has_eb()) | | | No new eb in the page range yet | | | As new eb A hasn't yet been | | | inserted into radix tree. | | |- btrfs_detach_subpage() | | |- detach_page_private(); |- radix_tree_insert() | Then we have a metadata eb whose page has no private bit. To avoid such race, we introduce a subpage metadata-specific member, btrfs_subpage::eb_refs. In alloc_extent_buffer() we increase eb_refs in the critical section of private_lock. Then page_range_has_eb() will return true for detach_extent_buffer_page(), and will not detach page private. The section is marked by: - btrfs_page_inc_eb_refs() - btrfs_page_dec_eb_refs() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: make attach_extent_buffer_page() handle subpage caseQu Wenruo1-6/+24
For subpage case, we need to allocate additional memory for each metadata page. So we need to: - Allow attach_extent_buffer_page() to return int to indicate allocation failure - Allow manually pre-allocate subpage memory for alloc_extent_buffer() As we don't want to use GFP_ATOMIC under spinlock, we introduce btrfs_alloc_subpage() and btrfs_free_subpage() functions for this purpose. (The simple wrap for btrfs_free_subpage() is for later convert to kmem_cache. Already internally tested without problem) - Preallocate btrfs_subpage structure for alloc_extent_buffer() We don't want to call memory allocation with spinlock held, so do preallocation before we acquire mapping->private_lock. - Handle subpage and regular case differently in attach_extent_buffer_page() For regular case, no change, just do the usual thing. For subpage case, allocate new memory or use the preallocated memory. For future subpage metadata, we will make use of radix tree to grab extent buffer. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: introduce the skeleton of btrfs_subpage structureQu Wenruo1-0/+43
For sectorsize < page size support, we need a structure to record extra status info for each sector of a page. Introduce the skeleton structure, all subpage related code would go to subpage.[ch]. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>