summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-12-16btrfs: factor out helper for single device IO checkJohannes Thumshirn1-4/+23
The check in btrfs_map_block() deciding if a particular I/O is targeting a single device is getting more and more convoluted. Factor out the check conditions into a helper function, with no functional change otherwise. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: migrate btrfs_repair_io_failure() to folio interfacesQu Wenruo3-12/+20
[BUG] Test case btrfs/124 failed if larger metadata folio is enabled, the dying message looks like this: BTRFS error (device dm-2): bad tree block start, mirror 2 want 31686656 have 0 BTRFS info (device dm-2): read error corrected: ino 0 off 31686656 (dev /dev/mapper/test-scratch2 sector 20928) BUG: kernel NULL pointer dereference, address: 0000000000000020 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page CPU: 6 PID: 350881 Comm: btrfs Tainted: G OE 6.7.0-rc3-custom+ #128 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022 RIP: 0010:btrfs_read_extent_buffer+0x106/0x180 [btrfs] PKRU: 55555554 Call Trace: <TASK> read_tree_block+0x33/0xb0 [btrfs] read_block_for_search+0x23e/0x340 [btrfs] btrfs_search_slot+0x2f9/0xe60 [btrfs] btrfs_lookup_csum+0x75/0x160 [btrfs] btrfs_lookup_bio_sums+0x21a/0x560 [btrfs] btrfs_submit_chunk+0x152/0x680 [btrfs] btrfs_submit_bio+0x1c/0x50 [btrfs] submit_one_bio+0x40/0x80 [btrfs] submit_extent_page+0x158/0x390 [btrfs] btrfs_do_readpage+0x330/0x740 [btrfs] extent_readahead+0x38d/0x6c0 [btrfs] read_pages+0x94/0x2c0 page_cache_ra_unbounded+0x12d/0x190 relocate_file_extent_cluster+0x7c1/0x9d0 [btrfs] relocate_block_group+0x2d3/0x560 [btrfs] btrfs_relocate_block_group+0x2c7/0x4b0 [btrfs] btrfs_relocate_chunk+0x4c/0x1a0 [btrfs] btrfs_balance+0x925/0x13c0 [btrfs] btrfs_ioctl+0x19f1/0x25d0 [btrfs] __x64_sys_ioctl+0x90/0xd0 do_syscall_64+0x3f/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 [CAUSE] The dying line is at btrfs_repair_io_failure() call inside btrfs_repair_eb_io_failure(). The function is still relying on the extent buffer using page sized folios. When the extent buffer is using larger folio, we go into the 2nd slot of folios[], and triggered the NULL pointer dereference. [FIX] Migrate btrfs_repair_io_failure() to folio interfaces. So that when we hit a larger folio, we just submit the whole folio in one go. This also affects data repair path through btrfs_end_repair_bio(), thankfully data is still fully page based, we can just add an ASSERT(), and use page_folio() to convert the page to folio. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: migrate eb_bitmap_offset() to folio interfacesQu Wenruo1-12/+10
[BUG] Test case btrfs/002 would fail if larger folios are enabled for metadata: assertion failed: folio, in fs/btrfs/extent_io.c:4358 ------------[ cut here ]------------ kernel BUG at fs/btrfs/extent_io.c:4358! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 30916 Comm: fsstress Tainted: G OE 6.7.0-rc3-custom+ #128 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022 RIP: 0010:assert_eb_folio_uptodate+0x98/0xe0 [btrfs] Call Trace: <TASK> extent_buffer_test_bit+0x3c/0x70 [btrfs] free_space_test_bit+0xcd/0x140 [btrfs] modify_free_space_bitmap+0x27a/0x430 [btrfs] add_to_free_space_tree+0x8d/0x160 [btrfs] __btrfs_free_extent.isra.0+0xef1/0x13c0 [btrfs] __btrfs_run_delayed_refs+0x786/0x13c0 [btrfs] btrfs_run_delayed_refs+0x33/0x120 [btrfs] btrfs_commit_transaction+0xa2/0x1350 [btrfs] iterate_supers+0x77/0xe0 ksys_sync+0x60/0xa0 __do_sys_sync+0xa/0x20 do_syscall_64+0x3f/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 </TASK> [CAUSE] The function extent_buffer_test_bit() is not folio compatible. It still assumes the old fixed page size, when an extent buffer with large folio passed in, only eb->folios[0] is populated. Then if the target bit range falls in the 2nd page of the folio, then we would check eb->folios[1], and trigger the ASSERT(). [FIX] Just migrate eb_bitmap_offset() to folio interfaces, using the folio_size() to replace PAGE_SIZE. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: migrate various end io functions to foliosQu Wenruo2-77/+81
If we still go the old page based iterator functions, like bio_for_each_segment_all(), we can hit middle pages of a folio (compound page). In that case if we set any page flag on those middle pages, we can easily trigger VM_BUG_ON(), as for compound page flags, they should follow their flag policies (normally only set on leading or tail pages). To avoid such problem in the future full folio migration, here we do: - Change from bio_for_each_segment_all() to bio_for_each_folio_all() This completely removes the ability to access the middle page. - Add extra ASSERT()s for data read/write paths To ensure we only get single paged folio for data now. - Rename those end io functions to follow a certain schema * end_bbio_compressed_read() * end_bbio_compressed_write() These two endio functions don't set any page flags, as they use pages not mapped to any address space. They can be very good candidates for higher order folio testing. And they are shared between compression and encoded IO. * end_bbio_data_read() * end_bbio_data_write() * end_bbio_meta_read() * end_bbio_meta_write() The old function names are not unified: - end_bio_extent_writepage() - end_bio_extent_readpage() - extent_buffer_write_end_io() - extent_buffer_read_end_io() They share no schema on where the "end_*io" string should be, nor can be confusing just using "extent_buffer" and "extent" to distinguish data and metadata paths. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: migrate subpage code to folio interfacesQu Wenruo12-284/+281
Although subpage itself is conflicting with higher folio, since subpage (sectorsize < PAGE_SIZE and nodesize < PAGE_SIZE) means we will never need higher order folio, there is a hidden pitfall: - btrfs_page_*() helpers Those helpers are an abstraction to handle both subpage and non-subpage cases, which means we're going to pass pages pointers to those helpers. And since those helpers are shared between data and metadata paths, it's unavoidable to let them to handle folios, including higher order folios). Meanwhile for true subpage case, we should only have a single page backed folios anyway, thus add a new ASSERT() for btrfs_subpage_assert() to ensure that. Also since those helpers are shared between both data and metadata, add some extra ASSERT()s for data path to make sure we only get single page backed folio for now. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to foliosQu Wenruo5-117/+141
These two functions are still using the old page based code, which is not going to handle larger folios at all. The migration itself is going to involve the following changes: - PAGE_SIZE -> folio_size() - PAGE_SHIFT -> folio_shift() - get_eb_page_index() -> get_eb_folio_index() - get_eb_offset_in_page() -> get_eb_offset_in_folio() And since we're going to support larger folios, although above straight conversion is good enough, this patch would add extra comments in the involved functions to explain why the same single line code can now cover 3 cases: - folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE The common, non-subpage case with per-page folio. - folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE The incoming larger folio, non-subpage case. - folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE The existing subpage case, we won't larger folio anyway. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: don't double put our subpage reference in alloc_extent_bufferJosef Bacik1-3/+19
This fixes as case in "btrfs: refactor alloc_extent_buffer() to allocate-then-attach method". We have been seeing panics in the CI for the subpage stuff recently, it happens on btrfs/187 but could potentially happen anywhere. In the subpage case, if we race with somebody else inserting the same extent buffer, the error case will end up calling detach_extent_buffer_page() on the page twice. This is done first in the bit for (int i = 0; i < attached; i++) detach_extent_buffer_page(eb, eb->pages[i]; and then again in btrfs_release_extent_buffer(). This works fine for !subpage because we're the only person who ever has ourselves on the private, and so when we do the initial detach_extent_buffer_page() we know we've completely removed it. However for subpage we could be using this page private elsewhere, so this results in a double put on the subpage, which can result in an early freeing. The fix here is to clear eb->pages[i] for everything we detach. Then anything still attached to the eb is freed in btrfs_release_extent_buffer(). Because of this change we must update btrfs_release_extent_buffer_pages() to not use num_extent_folios, because it assumes eb->folio[0] is set properly. Since this is only interested in freeing any pages we have on the extent buffer we can simply use INLINE_EXTENT_BUFFER_PAGES. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: cleanup metadata page pointer usageQu Wenruo6-180/+224
Although we have migrated extent_buffer::pages[] to folios[], we're still mostly using the folio_page() help to grab the page. This patch would do the following cleanups for metadata: - Introduce num_extent_folios() helper This is to replace most num_extent_pages() callers. - Use num_extent_folios() to iterate future large folios This allows us to use things like bio_add_folio()/bio_add_folio_nofail(), and only set the needed flags for the folio (aka the leading/tailing page), which reduces the loop iteration to 1 for large folios. - Change metadata related functions to use folio pointers Including their function name, involving: * attach_extent_buffer_page() * detach_extent_buffer_page() * page_range_has_eb() * btrfs_release_extent_buffer_pages() * btree_clear_page_dirty() * btrfs_page_inc_eb_refs() * btrfs_page_dec_eb_refs() - Change btrfs_is_subpage() to accept an address_space pointer This is to allow both page->mapping and folio->mapping to be utilized. As data is still using the old per-page code, and may keep so for a while. - Special corner case place holder for future order mismatches between extent buffer and inode filemap For now it's just a block of comments and a dead ASSERT(), no real handling yet. The subpage code would still go page, just because subpage and large folio are conflicting conditions, thus we don't need to bother subpage with higher order folios at all. Just folio_page(folio, 0) would be enough. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor styling tweaks ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: migrate extent_buffer::pages[] to folioQu Wenruo7-77/+104
For now extent_buffer::pages[] are still only accepting single page pointer, thus we can migrate to folios pretty easily. As for single page, page and folio are 1:1 mapped, including their page flags. This patch would just do the conversion from struct page to struct folio, providing the first step to higher order folio in the future. This conversion is pretty simple: - extent_buffer::pages[] -> extent_buffer::folios[] - page_address(eb->pages[i]) -> folio_address(eb->pages[i]) - eb->pages[i] -> folio_page(eb->folios[i], 0) There would be more specific cleanups preparing for the incoming higher order folio support. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: refactor alloc_extent_buffer() to allocate-then-attach methodQu Wenruo6-46/+123
Currently alloc_extent_buffer() utilizes find_or_create_page() to allocate one page a time for an extent buffer. This method has the following disadvantages: - find_or_create_page() is the legacy way of allocating new pages With the new folio infrastructure, find_or_create_page() is just redirected to filemap_get_folio(). - Lacks the way to support higher order (order >= 1) folios As we can not yet let filemap give us a higher order folio. This patch would change the workflow by the following way: Old | new -----------------------------------+------------------------------------- | ret = btrfs_alloc_page_array(); for (i = 0; i < num_pages; i++) { | for (i = 0; i < num_pages; i++) { p = find_or_create_page(); | ret = filemap_add_folio(); /* Attach page private */ | /* Reuse page cache if needed */ /* Reused eb if needed */ | | /* Attach page private and | reuse eb if needed */ | } By this we split the page allocation and private attaching into two parts, allowing future updates to each part more easily, and migrate to folio interfaces (especially for possible higher order folios). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: sysfs: validate scrub_speed_max valueDavid Disseldorp1-0/+4
The value set as scrub_speed_max accepts size with suffixes (k/m/g/t/p/e) but we should still validate it for trailing characters, similar to what we do with chunk_size_store. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: David Disseldorp <ddiss@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: switch btrfs_root::delayed_nodes_tree to xarray from radix-treeDavid Sterba4-34/+41
The radix-tree has been superseded by the xarray (https://lwn.net/Articles/745073), this patch converts the btrfs_root::delayed_nodes, the APIs are used in a simple way. First idea is to do xa_insert() but this would require GFP_ATOMIC allocation which we want to avoid if possible. The preload mechanism of radix-tree can be emulated within the xarray API. - xa_reserve() with GFP_NOFS outside of the lock, the reserved entry is inserted atomically at most once - xa_store() under a lock, in case something races in we can detect that and xa_load() returns a valid pointer All uses of xa_load() must check for a valid pointer in case they manage to get between the xa_reserve() and xa_store(), this is handled in btrfs_get_delayed_node(). Otherwise the functionality is equivalent, xarray implements the radix-tree and there should be no performance difference. The patch continues the efforts started in 253bf57555e451 ("btrfs: turn delayed_nodes_tree into an XArray") and fixes the problems with locking and GFP flags 088aea3b97e0ae ("Revert "btrfs: turn delayed_nodes_tree into an XArray""). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: fix typos found by codespellDavid Sterba9-12/+12
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: fix mismatching parameter names for btrfs_get_extent()Qu Wenruo1-1/+1
The definition for btrfs_get_extent() is using "u64 end" as the last parameter, but in implementation we go "u64 len", and all call sites follows the implementation. This can be very confusing during development, as most developers including me, would just use the snippet returned by LSP (clangd in my case), which would only check the definition. Unfortunately this mismatch is introduced from the very beginning of btrfs. Fix it to prevent further confusion. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: use the flags of an extent map to identify the compression typeFilipe Manana13-131/+158
Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: refactor mergable_maps() for more readabilityFilipe Manana1-14/+14
At mergable_maps() instead of having a single if statement with many ORed and ANDed conditions, refactor it with multiple if statements that check a single condition and return immediately once a requirement fails. This makes it easier to read. Also change the return type from int to bool, make the arguments const and rename the function from mergable_maps() to mergeable_maps(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: make extent_map_end() argument constFilipe Manana1-1/+1
The extent map pointer argument for extent_map_end() can be const as we are not modifyng anything in the extent map. So make it const, as it will allow further changes to callers that have a const extent map pointer. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: avoid useless rbtree iterations when attempting to merge extent mapFilipe Manana1-17/+21
When trying to merge an extent map that was just inserted or unpinned, we will try to merge it with any adjacent extent map that is suitable. However we will only check if our extent map is mergeable after searching for the previous and next extent maps in the rbtree, meaning that we are doing unnecessary calls to rb_prev() and rb_next() in case our extent map is not mergeable (it's compressed, in the list of modifed extents, being logged or pinned), wasting CPU time chasing rbtree pointers and pulling in unnecessary cache lines. So change the logic to check first if an extent map is mergeable before searching for the next and previous extent maps in the rbtree. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: log messages at unpin_extent_range() during unexpected casesFilipe Manana3-8/+18
At unpin_extent_range() we trigger a WARN_ON() when we don't find an extent map or we find one with a start offset not matching the start offset of the target range. This however isn't very useful for debugging because: 1) We don't know which condition was triggered, as they are both in the same WARN_ON() call; 2) We don't know which inode was affected, from which root, for which range, what's the start offset of the extent map, and so on. So trigger a separate warning for each case and log a message for each case providing information about the inode, its root, the target range, the generation and the start offset of the extent map we found. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: remove redundant value assignment at btrfs_add_extent_mapping()Filipe Manana1-2/+0
At btrfs_add_extent_mapping(), in case add_extent_mapping() returned -EEXIST, it's pointless to assign 0 to 'ret' since we will assign a value to it shortly after, without 'ret' being used before that. So remove that pointless assignment. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: unexport add_extent_mapping()Filipe Manana3-26/+25
There's no need to export add_extent_mapping(), as it's only used inside extent_map.c and in the self tests. For the tests we can use instead btrfs_add_extent_mapping(), which will accomplish exactly the same as we don't expect collisions in any of them. So unexport it and make the tests use btrfs_add_extent_mapping() instead of add_extent_mapping(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: tests: print all values as decimal in messages for extent map testsFilipe Manana1-7/+7
Some error messages of the extent map tests print decimal values of start offsets and lengths, while other are oddly printing in hexadecimal, which is far less human friendly, specially taking into consideration that all the values are small and multiples of 4K, so it's a lot easier to read them as decimal values. Change the format specifiers to print as decimal instead. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: tests: do not ignore NULL extent maps for extent maps testsFilipe Manana1-10/+30
Several of the extent map tests call btrfs_add_extent_mapping() which is supposed to succeed and return an extent map through the pointer to pointer argument. However the tests are deliberately ignoring a NULL extent map, which is not expected to happen. So change the tests to error out if a NULL extent map is found. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: tests: fix error messages for test case 4 of extent map testsFilipe Manana1-2/+2
In test case 4 for extent maps, if we error out we are supposed to print in interval but instead of printing a non-inclusive end offset, we are printing the length of the interval, which makes it confusing. So fix that to print the exclusive end offset instead. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: assert extent map is not in a list when setting it upFilipe Manana1-1/+3
When setting up a new extent map, at setup_extent_mapping(), we're doing a list move operation to add the extent map the tree's list of modified extents. This is confusing because at this point the extent map can not be in any list, because it's a new extent map. So replace the list move with a list add and add an assertion that checks that the extent map is not currently in any list. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: allocate btrfs_inode::file_extent_tree only without NO_HOLESDavid Sterba4-10/+29
The file_extent_tree was added in 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree") so we have an explicit mapping of the file extents to know where it is safe to update i_size. When the feature NO_HOLES is enabled, and it's been a mkfs default since 5.15, the tree is not necessary. To save some space in the inode, allocate the tree only when necessary. This reduces size by 16 bytes from 1096 to 1080 on a x86_64 release config. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: cache that we don't have security.capability setJosef Bacik3-2/+62
When profiling a workload I noticed we were constantly calling getxattr. These were mostly coming from __remove_privs, which will lookup if security.capability exists to remove it. However instrumenting getxattr showed we get called nearly constantly on an idle machine on a lot of accesses. These are wasteful and not free. Other security LSMs have a way to cache their results, but capability doesn't have this, so it's asking us all the time for the xattr. Fix this by setting a flag in our inode that it doesn't have a security.capability xattr. We set this on new inodes and after a failed lookup of security.capability. If we set this xattr at all we'll clear the flag. I haven't found a test in fsperf that this makes a visible difference on, but I assume fs_mark related tests would show it clearly. This is a perf report output of the smallfiles100k run where it shows 20% of our time spent in __remove_privs because we're looking up the non-existent xattr. --21.86%--btrfs_write_check.constprop.0 --21.62%--__file_remove_privs --21.55%--security_inode_need_killpriv --21.54%--cap_inode_need_killpriv --21.53%--__vfs_getxattr --20.89%--btrfs_getxattr Obviously this is just CPU time in a mostly IO bound test, so the actual effect of removing this callchain is minimal. However in just normal testing of an idle system tracing showed around 100 getxattr calls per minute, and with this patch there are 0. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: remove code for inode_cache and recovery mount optionsJosef Bacik1-35/+0
We've deprecated these a while ago in 5.11, go ahead and remove the code for them. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: set clear_cache if we use usebackuprootJosef Bacik2-3/+9
We're currently setting this when we try to load the roots and we see that usebackuproot is set. Instead set this at mount option parsing time. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: move one shot mount option clearing to super.cJosef Bacik3-16/+15
There's no reason this has to happen in open_ctree, and in fact in the old mount API we had to call this from remount. Move this to super.c, unexport it, and call it from both mount and reconfigure. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: remove old mount API codeJosef Bacik3-1081/+13
Now that we've switched to the new mount API, remove the old stuff. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: move the device specific mount options to super.cJosef Bacik2-23/+25
We add these mount options based on the fs_devices settings, which can be set once we've opened the fs_devices. Move these into their own helper and call it from get_tree_super. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: switch to the new mount APIJosef Bacik3-41/+60
Now that we have all of the parts in place to use the new mount API, switch our fs_type to use the new callbacks. There are a few things that have to be done at the same time because of the order of operations changes that come along with the new mount API. These must be done in the same patch otherwise things will go wrong. 1. Export and use btrfs_check_options in open_ctree(). This is because the options are done ahead of time, and we need to check them once we have the feature flags loaded. 2. Update the free space cache settings. Since we're coming in with the options already set we need to make sure we don't undo what the user has asked for. 3. Set our sb_flags at init_fs_context time, the fs_context stuff is trying to manage the sb_flagss itself, so move that into init_fs_context and out of the fill super part. Additionally I've marked the unused functions with __maybe_unused and will remove them in a future patch. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: handle the ro->rw transition for mounting different subvolumesJosef Bacik1-1/+128
This is a special case that we've carried around since 0723a0473fb4 ("btrfs: allow mounting btrfs subvolumes with different ro/rw options") where we'll under the covers flip the file system to RW if you're mixing and matching ro/rw options with different subvol mounts. The first mount is what the super gets setup as, so we'd handle this by remount the super as rw under the covers to facilitate this behavior. With the new mount API we can't really allow this, because user space has the ability to specify the super block settings, and the mount settings. So if the user explicitly sets the super block as read only, and then tried to mount a rw mount with the super block we'll reject this. However the old API was less descriptive and thus we allowed this kind of behavior. This patch preserves this behavior for the old API calls. This is inspired by Christians work [1], and includes his comment in btrfs_get_tree_super() explaining the history and how it all works in the old and new APIs. Link: https://lore.kernel.org/all/20230626-fs-btrfs-mount-api-v1-2-045e9735a00b@kernel.org/ Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: add get_tree callback for new mount APIJosef Bacik1-4/+204
This is the actual mounting callback for the new mount API. Implement this using our current fill super as a guideline, making the appropriate adjustments for the new mount API. Our old mount operation had two fs_types, one to handle the actual opening, and the one that we called to handle the actual opening and then did the subvol lookup for returning the actual root dentry. This is mirrored here, but simply with different behaviors for ->get_tree. We use the existence of ->s_fs_info to tell which part we're in. The initial call allocates the fs_info, then call mount_fc() with a duplicated fc to do the actual open_ctree part. Then we take that vfsmount and use it to look up our subvolume that we're mounting and return that as our s_root. This idea was taken from Christians attempt to convert us to the new mount API [1]. In btrfs_get_tree_super() the mount device is scanned and opened in one go under uuid_mutex we expect that all related devices have been already scanned, either by mount or from the outside. A device forget can be called on some of the devices as the whole context is not protected but it's an unlikely event, though it's a minor behaviour change. References: https://lore.kernel.org/all/20230626-fs-btrfs-mount-api-v1-2-045e9735a00b@kernel.org/ Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note about device scanning ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: add reconfigure callback for fs_contextJosef Bacik3-29/+197
This is what is used to remount the file system with the new mount API. Because the mount options are parsed separately and one at a time I've added a helper to emit the mount options after the fact once the mount is configured, this matches the dmesg output for what happens with the old mount API. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: add fs context handling functionsJosef Bacik1-1/+35
We are going to use the fs context to hold the mount options, so allocate the btrfs_fs_context when we're asked to init the fs context, and free it in the free callback. Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: add parse_param callback for the new mount APIJosef Bacik1-0/+380
The parse_param callback handles one parameter at a time, take our existing mount option parsing loop and adjust it to handle one parameter at a time, and tie it into the fs_context_operations. Create a btrfs_fs_context object that will store the various mount properties, we'll house this in fc->fs_private. This is necessary to separate because remounting will use ->reconfigure, and we'll get a new copy of the parsed parameters, so we can no longer directly mess with the fs_info in this stage. In the future we'll add this to the btrfs_fs_info and update the users to use the new context object instead. There's a change how the option device= is processed. Previously all mount options were parsed in one go under uuid_mutex and the devices opened. This prevented a concurrent scan to happen during mount. Now we could see a device scan happen (e.g. by udev) but this should not affect the end result, mount will either see the populated fs_devices or will scan the device by itself. Alternatively we could save all the device paths first and then process them in one go as before but this does not seem to be necessary. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note about device scanning ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: add fs_parameter definitionsJosef Bacik1-1/+125
In order to convert to the new mount API we have to change how we do the mount option parsing. For now we're going to duplicate these helpers to make it easier to follow, and then remove the old code once everything is in place. This patch contains the re-definition of all of our mount options into the new fs_parameter_spec format. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: add a NOSPACECACHE mount option flagJosef Bacik2-0/+2
With the old mount API we'd pre-populate the mount options with the space cache settings of the file system, and then the user toggled them on or off with the mount options. When we switch to the new mount API the mount options will be set before we get into opening the file system, so we need a flag to indicate that the user explicitly asked for -o nospace_cache so we can make the appropriate changes after the fact. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: split out ro->rw and rw->ro helpers into their own functionsJosef Bacik1-113/+116
When we remount ro->rw or rw->ro we have some cleanup tasks that have to be managed. Split these out into their own function to make btrfs_remount smaller. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: do not allow free space tree rebuild on extent tree v2Josef Bacik1-1/+5
We currently don't allow these options to be set if we're extent tree v2 via the mount option parsing. However when we switch to the new mount API we'll no longer have the super block loaded, so won't be able to make this distinction at mount option parsing time. Address this by checking for extent tree v2 at the point where we make the decision to rebuild the free space tree. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: move space cache settings into open_ctreeJosef Bacik3-24/+50
Currently we pre-load the space cache settings in btrfs_parse_options, however when we switch to the new mount API the mount option parsing will happen before we have the super block loaded. Add a helper to set the appropriate options based on the fs settings, this will allow us to have consistent free space cache settings. This also folds in the space cache related decisions we make for subpage sectorsize support, so all of this is done in one place. Since this was being called by parse options it looks like we're changing the behavior of remount, but in fact we aren't. The pre-loading of the free space cache settings is done because we want to handle the case of users not using any space_cache options, we'll derive the appropriate mount option based on the on disk state. On remount this wouldn't reset anything as we'll have cleared the v1 cache generation if we mounted -o nospace_cache. Similarly it's impossible to turn off the free space tree without specifically saying -o nospace_cache,clear_cache, which will delete the free space tree and clear the compat_ro option. Again in this case calling this code in remount wouldn't result in any change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: set default compress type at btrfs_init_fs_info timeJosef Bacik1-7/+3
With the new mount API we'll be setting our compression well before we call open_ctree. We don't want to overwrite our settings, so set the default in btrfs_init_fs_info instead of open_ctree. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: split out the mount option validation code into its own helperJosef Bacik1-29/+37
We're going to need to validate mount options after they're all parsed with the new mount API, split this code out into its own helper so we can use it when we swap over to the new mount API. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor adjustments in the messages ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15fs: indicate request originates from old mount APIChristian Brauner1-0/+11
We already communicate to filesystems when a remount request comes from the old mount API as some filesystems choose to implement different behavior in the new mount API than the old mount API to e.g., take the chance to fix significant API bugs. Allow the same for regular mount requests. Fixes: b330966f79fb ("fuse: reject options on reconfigure via fsconfig(2)") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: remove no longer used EXTENT_MAP_DELALLOC block start valueFilipe Manana4-9/+2
After commit ac3c0d36a2a2 ("btrfs: make fiemap more efficient and accurate reporting extent sharedness") we no longer need to create special extent maps during fiemap that have a block start with the EXTENT_MAP_DELALLOC value. So this block start value for extent maps is no longer used since then, therefore remove it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: allow extent buffer helpers to skip cross-page handlingQu Wenruo3-3/+75
Currently btrfs extent buffer helpers are doing all the cross-page handling, as there is no guarantee that all those eb pages are contiguous. However on systems with enough memory, there is a very high chance the page cache for btree_inode are allocated with physically contiguous pages. In that case, we can skip all the complex cross-page handling, thus speeding up the code. This patch adds a new member, extent_buffer::addr, which is only set to non-NULL if all the extent buffer pages are physically contiguous. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: reflow btrfs_free_tree_blockJohannes Thumshirn1-49/+50
Reflow btrfs_free_tree_block() so that there is one level of indentation needed. This patch has no functional changes. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: use memset_page instead of opencoding itJohannes Thumshirn1-1/+1
Use memset_page() in memset_extent_buffer() instead of opencoding it. This does not not change any functionality. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>