path: root/fs/btrfs/relocation.c
Age | Commit message | Author | Files | Lines
2020-05-25 | btrfs: simplify iget helpers | David Sterba | 1 | -11/+2
The inode lookup starting at btrfs_iget takes the full location key, while only the objectid is used to match the inode, because the lookup happens inside the given root and thus the inode number is unique. The entire location key is properly set up in btrfs_init_locked_inode. Simplify the helpers and pass only the inode number, renaming it to 'ino' instead of 'objectid'. This allows removing the temporary key variables, saving some stack space. Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: open code read_fs_root | David Sterba | 1 | -12/+9
After the update to btrfs_get_fs_root, read_fs_root has become a trivial wrapper that can be open coded. Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: simplify root lookup by id | David Sterba | 1 | -7/+1
The main function to look up a root by its id, btrfs_get_fs_root, takes the whole key while only using the objectid. The value of offset is preset to (u64)-1 but not actually used until btrfs_find_root, which does the actual search. Switch btrfs_get_fs_root to use only the objectid and remove all local variables that existed just for the lookup. The actual key for the search is set up in btrfs_get_fs_root, reusing another key variable. Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: reloc: clear DEAD_RELOC_TREE bit for orphan roots to prevent runaway balance | Qu Wenruo | 1 | -0/+2
[BUG]
There are several reports of runaway balance, where balance floods the log with "found X extents" and X never changes.

[CAUSE]
Commit d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots") introduced the BTRFS_ROOT_DEAD_RELOC_TREE bit to indicate that a subvolume has finished its tree block swap with its reloc tree. However, if balance is canceled or hits ENOSPC halfway, we didn't clear the BTRFS_ROOT_DEAD_RELOC_TREE bit, leaving that bit set until unmount. Any subvolume root with that bit set causes the backref cache to skip its tree blocks, as the root is assumed to have finished its tree block swap. This causes all tree blocks of that root to be ignored by balance, leading to runaway balance.

[FIX]
Fix the problem by also clearing the BTRFS_ROOT_DEAD_RELOC_TREE bit for the original subvolume of an orphan reloc root. Add an umount check for the stale bit still being set.

Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
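The effect of the stale bit described above can be illustrated with a small userspace sketch. It is not the kernel code: `toy_root`, `TOY_DEAD_RELOC_TREE` and every function name here are hypothetical stand-ins, and the "skip" check only models the behaviour the commit message describes.

```c
#include <assert.h>

/* Hypothetical stand-in for the BTRFS_ROOT_DEAD_RELOC_TREE state bit. */
#define TOY_DEAD_RELOC_TREE (1u << 0)

struct toy_root {
	unsigned int state;
};

/* Models the backref cache skipping roots marked as already swapped. */
static int toy_should_skip(const struct toy_root *root)
{
	return (root->state & TOY_DEAD_RELOC_TREE) != 0;
}

/* Cleanup of an orphan reloc root; clear_bit != 0 models the fix. */
static void toy_cleanup_orphan(struct toy_root *fs_root, int clear_bit)
{
	if (clear_bit)
		fs_root->state &= ~TOY_DEAD_RELOC_TREE;
}

/* Returns 1 if a later balance would ignore this root's tree blocks. */
static int toy_skipped_after_cancel(int clear_bit)
{
	struct toy_root fs_root = { .state = TOY_DEAD_RELOC_TREE };

	toy_cleanup_orphan(&fs_root, clear_bit);
	return toy_should_skip(&fs_root);
}
```

Calling `toy_skipped_after_cancel(0)` models the pre-fix behaviour, where the root stays skipped after a canceled balance; passing 1 models the fixed cleanup.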
2020-05-25 | btrfs: reloc: fix reloc root leak and NULL pointer dereference | Qu Wenruo | 1 | -3/+9
[BUG]
When balance is canceled, there is a pretty high chance that unmounting the fs leads to a NULL pointer dereference:

    BTRFS warning (device dm-3): page private not zero on page 223158272
    ...
    BTRFS warning (device dm-3): page private not zero on page 223162368
    BTRFS error (device dm-3): leaked root 18446744073709551608-304 refcount 1
    BUG: kernel NULL pointer dereference, address: 0000000000000168
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP NOPTI
    CPU: 2 PID: 5793 Comm: umount Tainted: G O 5.7.0-rc5-custom+ #53
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
    RIP: 0010:__lock_acquire+0x5dc/0x24c0
    Call Trace:
     lock_acquire+0xab/0x390
     _raw_spin_lock+0x39/0x80
     btrfs_release_extent_buffer_pages+0xd7/0x200 [btrfs]
     release_extent_buffer+0xb2/0x170 [btrfs]
     free_extent_buffer+0x66/0xb0 [btrfs]
     btrfs_put_root+0x8e/0x130 [btrfs]
     btrfs_check_leaked_roots.cold+0x5/0x5d [btrfs]
     btrfs_free_fs_info+0xe5/0x120 [btrfs]
     btrfs_kill_super+0x1f/0x30 [btrfs]
     deactivate_locked_super+0x3b/0x80
     deactivate_super+0x3e/0x50
     cleanup_mnt+0x109/0x160
     __cleanup_mnt+0x12/0x20
     task_work_run+0x67/0xa0
     exit_to_usermode_loop+0xc5/0xd0
     syscall_return_slowpath+0x205/0x360
     do_syscall_64+0x6e/0xb0
     entry_SYSCALL_64_after_hwframe+0x49/0xb3
    RIP: 0033:0x7fd028ef740b

[CAUSE]
When balance is canceled, all reloc roots are marked as orphan, and orphan reloc roots are going to be cleaned up.

However, the lifespans of orphan reloc roots and merged reloc roots are quite different:

    Merged reloc roots                 | Orphan reloc roots by cancel
    --------------------------------------------------------------------
    create_reloc_root()                | create_reloc_root()
    |- refs == 1                       | |- refs == 1
                                       |
    btrfs_grab_root(reloc_root);       | btrfs_grab_root(reloc_root);
    |- refs == 2                       | |- refs == 2
                                       |
    root->reloc_root = reloc_root;     | root->reloc_root = reloc_root;
                  >>> No difference so far <<<
                                       |
    prepare_to_merge()                 | prepare_to_merge()
    |- btrfs_set_root_refs(item, 1);   | |- if (!err) (err == -EINTR)
                                       |
    merge_reloc_roots()                | merge_reloc_roots()
    |- merge_reloc_root()              | |- Doing nothing to put reloc root
       |- insert_dirty_subvol()        | |- refs == 2
          |- __del_reloc_root()        |
             |- btrfs_put_root()       |
                |- refs == 1           |
          >>> Now orphan reloc roots still have refs 2 <<<
                                       |
    clean_dirty_subvols()              | clean_dirty_subvols()
    |- btrfs_drop_snapshot()           | |- btrfs_drop_snapshot()
       |- reloc_root get freed         |    |- reloc_root still has refs 2
                                       |       related ebs get freed, but
                                       |       reloc_root still recorded in
                                       |       allocated_roots
    btrfs_check_leaked_roots()         | btrfs_check_leaked_roots()
    |- No leaked roots                 | |- Leaked reloc_roots detected
                                       |    |- btrfs_put_root()
                                       |       |- free_extent_buffer(root->node);
                                       |          |- eb already freed, caused NULL
                                       |             pointer dereference

[FIX]
The fix is to clear fs_root->reloc_root and put it at merge_reloc_roots() time, so that we won't leak reloc roots.

Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots") CC: stable@vger.kernel.org # 5.1+ Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
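The reference counting difference between the two paths can be reproduced in a minimal userspace model. This is only a sketch under stated assumptions: `toy_root`, `toy_grab()`, `toy_put()` and `toy_refs_left_after_cancel()` are hypothetical names, and the model tracks nothing but the two references mentioned in the commit message (one held by the list, one held by `root->reloc_root`).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct btrfs_root: a refcount and one link. */
struct toy_root {
	int refs;
	struct toy_root *reloc_root;
};

static struct toy_root *toy_alloc(void)
{
	struct toy_root *r = calloc(1, sizeof(*r));

	if (r)
		r->refs = 1;
	return r;
}

static struct toy_root *toy_grab(struct toy_root *r)
{
	r->refs++;
	return r;
}

/* Drops one reference; returns 1 if the object was freed. */
static int toy_put(struct toy_root *r)
{
	if (--r->refs > 0)
		return 0;
	free(r);
	return 1;
}

/*
 * Models the cancel path. Returns the number of references still held on
 * the reloc root after cleanup (0 means it was freed, -1 means allocation
 * failed). apply_fix != 0 models clearing fs_root->reloc_root and putting
 * it at merge time.
 */
static int toy_refs_left_after_cancel(int apply_fix)
{
	struct toy_root *fs_root = toy_alloc();
	struct toy_root *reloc = toy_alloc();	/* refs == 1, held by the list */
	int left;

	if (!fs_root || !reloc) {
		free(fs_root);
		free(reloc);
		return -1;
	}
	fs_root->reloc_root = toy_grab(reloc);	/* refs == 2 */

	if (apply_fix) {
		fs_root->reloc_root = NULL;
		toy_put(reloc);			/* refs == 1 */
	}

	/* Final cleanup drops the reference held by the list. */
	if (toy_put(reloc)) {
		left = 0;
	} else {
		left = reloc->refs;		/* leaked reference(s) */
		free(reloc);			/* release memory for the demo only */
	}
	free(fs_root);
	return left;
}
```

Without the fix the model ends with one reference outstanding, which corresponds to the "leaked root ... refcount 1" message in the log above.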
2020-05-25 | btrfs: don't set SHAREABLE flag for data reloc tree | Qu Wenruo | 1 | -11/+5
The SHAREABLE flag is set for subvolumes because users can create snapshots of subvolumes, thus sharing their tree blocks. But the data reloc tree is not exposed to user space, as it's only an internal tree for data relocation, so it doesn't need the full path replacement handling at all. This patch makes the data reloc tree a non-shareable tree, and adds btrfs_fs_info::data_reloc_root for the data reloc tree, so relocation code can grab it from fs_info directly. This slightly improves tree relocation, as the data reloc tree can now go through the regular COW routine to get relocated, without going through the complex tree reloc tree routine. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE | Qu Wenruo | 1 | -11/+14
The name BTRFS_ROOT_REF_COWS is not very clear about its meaning. In fact, that bit can only be set for these trees:

- Subvolume roots
- Data reloc root
- Reloc roots for above roots

All other trees won't get this bit set. So judging by the result, roots with this bit set can have tree blocks shared with other trees, either shared by snapshots or by reloc roots (a special snapshot created by relocation). This patch renames BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to make it easier to understand, and updates all comments mentioning "reference counted" to follow the rename. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: remove the redundant parameter level in btrfs_bin_search() | Qu Wenruo | 1 | -5/+3
All callers pass the eb::level, so we can read it directly inside btrfs_bin_search and key_search. This is inspired by the work of Marek in U-boot. CC: Marek Behun <marek.behun@nic.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: use list_for_each_entry_safe in free_reloc_roots | Nikolay Borisov | 1 | -11/+5
The function always works on a local copy of the reloc root list, which cannot be modified outside of it, so using list_for_each_entry_safe is fine. Additionally, the macro handles empty lists, so drop the list_empty checks in callers. No semantic changes. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
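The idea behind a "safe" iteration, and why no separate empty-list check is needed, can be shown with a plain singly linked list. This is a userspace sketch, not the kernel's list API: `toy_item`, `toy_free_all()` and `toy_demo_free_count()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_item {
	struct toy_item *next;
	int id;
};

/*
 * Frees every item. 'tmp' remembers the next item before the current one
 * is freed, which is the same idea as a "_safe" list iterator. An empty
 * list (head == NULL) simply runs zero iterations.
 */
static int toy_free_all(struct toy_item **head)
{
	struct toy_item *cur, *tmp;
	int freed = 0;

	for (cur = *head; cur; cur = tmp) {
		tmp = cur->next;
		free(cur);
		freed++;
	}
	*head = NULL;
	return freed;
}

/* Builds a list of n items and returns how many were freed. */
static int toy_demo_free_count(int n)
{
	struct toy_item *head = NULL;

	for (int i = 0; i < n; i++) {
		struct toy_item *item = malloc(sizeof(*item));

		if (!item)
			break;
		item->id = i;
		item->next = head;
		head = item;
	}
	return toy_free_all(&head);
}
```

Reading `cur->next` after `free(cur)` would be a use-after-free, which is why the next pointer is saved first.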
2020-05-25 | btrfs: reloc: move error handling of build_backref_tree() to backref.c | Qu Wenruo | 1 | -47/+1
The error cleanup will be extracted as a new function, btrfs_backref_error_cleanup(), and moved to backref.c and exported for later usage. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: backref: rename and move finish_upper_links() | Qu Wenruo | 1 | -115/+1
This is the 2nd major part of the generic backref cache. Move it to backref.c so we can reuse it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: backref: rename and move handle_one_tree_block() | Qu Wenruo | 1 | -355/+2
This function is the major part of the backref cache build process. Move it to backref.c so we can reuse it later. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: reloc: open code read_fs_root() for handle_indirect_tree_backref() | Qu Wenruo | 1 | -1/+5
The backref code is going to be moved to backref.c, and read_fs_root() is just a simple wrapper. Open-code it to prepare for the upcoming code move. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: backref: rename and move should_ignore_root() | Qu Wenruo | 1 | -4/+6
This function mostly serves a single purpose, the relocation backref cache, but since we're moving the main part of the backref cache to backref.c, we need to export it. To avoid confusion, rename the function to btrfs_should_ignore_reloc_root() to make the name a little clearer. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: backref: rename and move backref_tree_panic() | Qu Wenruo | 1 | -20/+9
Also change the parameter, since all callers can easily grab an fs_info, there is no need for all the pointer chasing. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: backref: rename and move backref_cache_cleanup() | Qu Wenruo | 1 | -31/+1
Since we're releasing all existing nodes/edges, rather than cleaning up the mess after an error, "release" is a more fitting name here. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: backref: rename and move remove_backref_node() | Qu Wenruo | 1 | -48/+5
Also add a comment explaining the cleanup process, to distinguish it from btrfs_backref_drop_node(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: backref: rename and move drop_backref_node() | Qu Wenruo | 1 | -38/+7
Also add an extra comment for drop_backref_node(): it has some similarity with remove_backref_node(), so the comment explains the difference. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: backref: rename and move free_backref_(node|edge) | Qu Wenruo | 1 | -31/+11
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: backref: rename and move link_backref_edge() | Qu Wenruo | 1 | -19/+4
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: backref: rename and move alloc_backref_edge() | Qu Wenruo | 1 | -14/+3
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: backref: rename and move alloc_backref_node() | Qu Wenruo | 1 | -26/+6
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: backref: rename and move backref_cache_init() | Qu Wenruo | 1 | -17/+1
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: rename tree_entry to rb_simple_node and export it | Qu Wenruo | 1 | -77/+32
Structure tree_entry provides a very simple rb_tree which only uses bytenr as the search index. That tree_entry is used in 3 structures: backref_node, mapping_node and tree_block. Since we're going to make backref_node independent from relocation, it's a good time to extract the tree_entry into rb_simple_node, and export it into misc.h. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
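The pattern of a small tree node keyed only by bytenr and embedded in larger structures can be sketched in userspace as follows. This is an illustration, not the kernel helper: it uses a plain unbalanced binary search tree instead of a red-black tree, and `toy_simple_node`, `toy_simple_insert()`, `toy_simple_search()` and `toy_tree_block` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy node keyed only by bytenr (unbalanced BST, not a real rb_tree). */
struct toy_simple_node {
	struct toy_simple_node *left, *right;
	uint64_t bytenr;
};

static struct toy_simple_node *toy_simple_search(struct toy_simple_node *root,
						 uint64_t bytenr)
{
	while (root) {
		if (bytenr < root->bytenr)
			root = root->left;
		else if (bytenr > root->bytenr)
			root = root->right;
		else
			return root;
	}
	return NULL;
}

/* Returns NULL on success, or the existing node if bytenr is already present. */
static struct toy_simple_node *toy_simple_insert(struct toy_simple_node **root,
						 struct toy_simple_node *node)
{
	struct toy_simple_node **p = root;

	while (*p) {
		if (node->bytenr < (*p)->bytenr)
			p = &(*p)->left;
		else if (node->bytenr > (*p)->bytenr)
			p = &(*p)->right;
		else
			return *p;
	}
	node->left = node->right = NULL;
	*p = node;
	return NULL;
}

/* A user structure embeds the simple node as its first member. */
struct toy_tree_block {
	struct toy_simple_node node;
	int level;
};

/* Returns the level stored for bytenr, or -1 if it is not in the tree. */
static int toy_demo_level_of(uint64_t bytenr)
{
	struct toy_tree_block blocks[3] = {
		{ .node.bytenr = 8192, .level = 1 },
		{ .node.bytenr = 4096, .level = 0 },
		{ .node.bytenr = 16384, .level = 2 },
	};
	struct toy_simple_node *root = NULL, *found;

	for (int i = 0; i < 3; i++)
		toy_simple_insert(&root, &blocks[i].node);

	found = toy_simple_search(root, bytenr);
	if (!found)
		return -1;
	/* Valid because 'node' is the first member of toy_tree_block. */
	return ((struct toy_tree_block *)found)->level;
}
```

Because the key lives in the shared node type, several unrelated structures can reuse the same search and insert helpers, which is the motivation given in the commit message.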
2020-05-25 | btrfs: backref: move btrfs_backref_(node|edge|cache) structures to backref.h | Qu Wenruo | 1 | -113/+0
These 3 structures are the main part of btrfs backref cache, move them to backref.h to build the basis for later reuse. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: reloc: add btrfs_ prefix for backref_node/edge/cache | Qu Wenruo | 1 | -137/+141
Those three structures are the main elements of backref cache. Add the "btrfs_" prefix for later export. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: reloc: refactor useless nodes handling into its own function | Qu Wenruo | 1 | -37/+76
This patch also adds some comments for the cleanup. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: reloc: refactor finishing part of upper linkage into finish_upper_links() | Qu Wenruo | 1 | -69/+117
After handle_one_tree_backref(), all newly added (not cached) edges and nodes have the following features:

- Only backref_edge::list[LOWER] is linked.
  This means we can only iterate from bottom to top, not in the other direction.
- Newly added nodes are not added to the cache rb_tree yet

So to finish the backref cache, we still need to finish the links and add all nodes into the backref cache rb_tree. This patch refactors the existing code into finish_upper_links(), and adds more comments on each branch and on why we need to do all the work. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: reloc: remove the open-coded goto loop for breadth-first search | Qu Wenruo | 1 | -81/+88
build_backref_tree() uses "goto again;" to implement a breadth-first search to build backref cache. This patch will extract most of its work into a wrapper, handle_one_tree_block(), and use a do {} while() loop to implement the same thing. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
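The shape of a breadth-first walk written as a do {} while() loop with a per-block helper can be sketched as below. This is a userspace toy, not the kernel function: the fixed `toy_parent` table, `toy_handle_one_block()` and `toy_count_reachable()` are hypothetical, and the pending queue is a simple array where the kernel code uses lists.

```c
#include <assert.h>

#define TOY_NODES 4

/* toy_parent[i][k] is the k-th parent of block i, or -1 if there is none. */
static const int toy_parent[TOY_NODES][2] = {
	{ 1, 2 },	/* block 0 is referenced by blocks 1 and 2 */
	{ 3, -1 },	/* block 1 is referenced by block 3 */
	{ 3, -1 },	/* block 2 is referenced by block 3 */
	{ -1, -1 },	/* block 3 is a root */
};

/* Handles one block: queues every parent that was not seen before. */
static void toy_handle_one_block(int cur, int *pending, int *tail, int *seen)
{
	for (int k = 0; k < 2; k++) {
		int up = toy_parent[cur][k];

		if (up >= 0 && !seen[up]) {
			seen[up] = 1;
			pending[(*tail)++] = up;
		}
	}
}

/* Breadth-first walk from 'start' towards the roots; returns blocks visited. */
static int toy_count_reachable(int start)
{
	int pending[TOY_NODES];
	int seen[TOY_NODES] = { 0 };
	int head = 0, tail = 0, visited = 0;
	int cur = start;

	seen[start] = 1;
	do {
		toy_handle_one_block(cur, pending, &tail, seen);
		visited++;
		/* Take the next pending block, or stop when the queue is empty. */
		cur = (head < tail) ? pending[head++] : -1;
	} while (cur >= 0);

	return visited;
}
```

Compared with a backward goto, the loop condition states in one place when the walk ends: when no pending block is left.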
2020-05-25 | btrfs: reloc: pass essential members for alloc_backref_node() | Qu Wenruo | 1 | -20/+20
Bytenr and level are essential parameters for backref_node, thus it makes sense to initialize them at allocation time. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: reloc: use wrapper to replace open-coded edge linking | Qu Wenruo | 1 | -16/+37
Since backref_edge is used to connect upper and lower backref nodes, and needs to access both nodes, some code can look pretty nasty:

    list_add_tail(&edge->list[LOWER], &cur->upper);

The above code links @cur to the LOWER side of the edge, while both "LOWER" and "upper" show up. This can be very confusing for a reader. This patch introduces a new wrapper, link_backref_edge(), to handle the linking. It also has an extra ASSERT() to ensure the caller won't pass the wrong nodes. Also, this updates the comments on the related lists of backref_node and backref_edge, to make it clearer what each list points to. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
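What such a linking wrapper might look like can be sketched in userspace. Everything here is an assumption made for illustration: the `toy_*` types, the `TOY_LINK_*` flags, the minimal list helpers and the level check in the assertion are modeled loosely on the description above and are not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

enum { LOWER = 0, UPPER = 1 };
#define TOY_LINK_LOWER (1U << 0)
#define TOY_LINK_UPPER (1U << 1)

/* Minimal circular doubly linked list, enough for the demo. */
struct toy_list {
	struct toy_list *prev, *next;
};

static void toy_list_init(struct toy_list *head)
{
	head->prev = head->next = head;
}

static void toy_list_add_tail(struct toy_list *item, struct toy_list *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

static int toy_list_len(const struct toy_list *head)
{
	int n = 0;

	for (const struct toy_list *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

struct toy_node {
	int level;
	struct toy_list upper;	/* edges leading to parents of this node */
	struct toy_list lower;	/* edges leading to children of this node */
};

struct toy_edge {
	struct toy_list list[2];
	struct toy_node *node[2];
};

/* One place that knows which edge side goes on which node list. */
static void toy_link_edge(struct toy_edge *edge, struct toy_node *lower,
			  struct toy_node *upper, unsigned int link_which)
{
	assert(lower && upper && upper->level == lower->level + 1);
	edge->node[LOWER] = lower;
	edge->node[UPPER] = upper;
	if (link_which & TOY_LINK_LOWER)
		toy_list_add_tail(&edge->list[LOWER], &lower->upper);
	if (link_which & TOY_LINK_UPPER)
		toy_list_add_tail(&edge->list[UPPER], &upper->lower);
}

/* Returns 10 * (edges on child.upper) + (edges on parent.lower). */
static int toy_demo_link(unsigned int link_which)
{
	struct toy_node child = { .level = 0 }, parent = { .level = 1 };
	struct toy_edge edge;

	toy_list_init(&child.upper);
	toy_list_init(&child.lower);
	toy_list_init(&parent.upper);
	toy_list_init(&parent.lower);
	toy_list_init(&edge.list[LOWER]);
	toy_list_init(&edge.list[UPPER]);

	toy_link_edge(&edge, &child, &parent, link_which);
	return toy_list_len(&child.upper) * 10 + toy_list_len(&parent.lower);
}
```

Callers then name the two nodes by role and never spell out the "LOWER side goes on the upper list" pairing themselves.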
2020-05-25 | btrfs: reloc: refactor indirect tree backref processing into its own function | Qu Wenruo | 1 | -135/+159
The processing of indirect tree backref (TREE_BLOCK_REF) is the most complex work. We need to grab the fs root, do a tree search to locate all its parent nodes, link all needed edges, and put all uncached edges to pending edge list. This is definitely worth a helper function. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: reloc: refactor direct tree backref processing into its own function | Qu Wenruo | 1 | -52/+78
For BTRFS_SHARED_BLOCK_REF_KEY, the processing is straightforward, as we know the parent node bytenr directly. If the parent is already cached, or is a root, call it a day. If the parent is not cached, add it to the pending list. This patch just refactors this part into its own function, handle_direct_tree_backref(), and adds some comments explaining the @ref_key parameter. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: reloc: make reloc root search-specific for relocation backref cache | Qu Wenruo | 1 | -11/+39
find_reloc_root() searches reloc_control::reloc_root_tree to find the reloc root. This behavior is only useful for relocation backref cache. For the incoming more generic purpose backref cache, we don't care about who owns the reloc root, but only care if it's a reloc root. So this patch makes the following modifications to make the reloc root search more specific to relocation backref:

- Add backref_node::is_reloc_root
  This will be an extra indicator for generic purposed backref cache. User doesn't need to read root key from backref_node::root to determine if it's a reloc root. Also for reloc tree root, it's useless and will be queued to useless list.
- Add backref_cache::is_reloc
  This will allow backref cache code to do different behavior for generic purpose backref cache and relocation backref cache.
- Pass fs_info to find_reloc_root()
- Export find_reloc_root()
  So backref.c can utilize this function.

Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: reloc: add backref_cache::fs_info member | Qu Wenruo | 1 | -2/+6
Add this member so that we can grab fs_info without the help from reloc_control. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: reloc: add backref_cache::pending_edge and backref_cache::useless_node | Qu Wenruo | 1 | -28/+46
These two new members will act the same as the existing local lists, @useless and @list in build_backref_tree(). Currently build_backref_tree() is only executed serially, thus moving such local list into backref_cache is still safe. Also since we're here, use list_first_entry() to replace a lot of list_entry() calls after !list_empty(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: reloc: rename mark_block_processed and __mark_block_processed | Qu Wenruo | 1 | -34/+22
These two functions are weirdly named: mark_block_processed() in fact just marks a range dirty unconditionally, while __mark_block_processed() does an extra check before doing the marking. This patch open codes the old mark_block_processed(), and renames __mark_block_processed() to remove the "__" prefix. Since we're here, also kill the forward declaration, and replace in_block_group() with the in_range() macro. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 | btrfs: reloc: use btrfs_backref_iter infrastructure | Qu Wenruo | 1 | -129/+62
The core function of relocation, build_backref_tree, needs to iterate all backref items of one tree block. Use the btrfs_backref_iter infrastructure to do the loop and make the code more readable. The backref item loop becomes much easier to read:

    ret = btrfs_backref_iter_start(iter, cur->bytenr);
    for (; ret == 0; ret = btrfs_backref_iter_next(iter)) {
        /* The really important work */
    }

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
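The start/next loop shape quoted in the commit message can be tried out against a toy iterator. This is a sketch only: `toy_iter` walks a plain array, and its return convention (0 means positioned on an item, 1 means no more items) is an assumption chosen to fit the `ret == 0` loop condition, not a description of the btrfs_backref_iter API.

```c
#include <assert.h>
#include <stddef.h>

struct toy_iter {
	const int *items;
	int count;
	int pos;
};

/* Returns 0 when positioned on the first item, 1 when there is nothing to walk. */
static int toy_iter_start(struct toy_iter *it, const int *items, int count)
{
	it->items = items;
	it->count = count;
	it->pos = 0;
	return count > 0 ? 0 : 1;
}

/* Returns 0 after moving to the next item, 1 when no items are left. */
static int toy_iter_next(struct toy_iter *it)
{
	if (it->pos + 1 >= it->count)
		return 1;
	it->pos++;
	return 0;
}

/* Same loop shape as in the commit message: start once, then next until non-zero. */
static int toy_sum_items(const int *items, int count)
{
	struct toy_iter it;
	int sum = 0;
	int ret;

	ret = toy_iter_start(&it, items, count);
	for (; ret == 0; ret = toy_iter_next(&it))
		sum += it.items[it.pos];	/* "the really important work" */
	return sum;
}

static int toy_demo_sum(void)
{
	static const int refs[] = { 3, 4, 5 };

	return toy_sum_items(refs, 3);
}
```

A single `ret` variable carries both "keep going" and "stop" states, so the loop body contains only the per-item work.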
2020-04-23 | btrfs: fix transaction leak in btrfs_recover_relocation | Xiyu Yang | 1 | -0/+1
btrfs_recover_relocation() invokes btrfs_join_transaction(), which joins a btrfs_trans_handle object into transactions and returns a reference of it with increased refcount to "trans". When btrfs_recover_relocation() returns, "trans" becomes invalid, so the refcount should be decreased to keep refcount balanced. The reference counting issue happens in one exception handling path of btrfs_recover_relocation(). When read_fs_root() failed, the refcnt increased by btrfs_join_transaction() is not decreased, causing a refcnt leak. Fix this issue by calling btrfs_end_transaction() on this error path when read_fs_root() failed. Fixes: 79787eaab461 ("btrfs: replace many BUG_ONs with proper error handling") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
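The kind of leak described here, a handle acquired and not released on one error path, can be modeled with a counter. This is a userspace sketch with hypothetical names (`toy_join()`, `toy_end()`, `toy_read_root()`, `toy_recover()`); it only mirrors the balance between join and end calls.

```c
#include <assert.h>

/* Number of transaction handles currently held (toy bookkeeping). */
static int toy_open_handles;

struct toy_trans {
	int unused;
};

static struct toy_trans toy_the_trans;

static struct toy_trans *toy_join(void)
{
	toy_open_handles++;
	return &toy_the_trans;
}

static void toy_end(struct toy_trans *trans)
{
	(void)trans;
	toy_open_handles--;
}

/* Stand-in for a root lookup that may fail. */
static int toy_read_root(int fail)
{
	return fail ? -5 : 0;
}

static int toy_recover(int fail_lookup)
{
	struct toy_trans *trans = toy_join();
	int ret = toy_read_root(fail_lookup);

	if (ret < 0) {
		/* The error path must release the handle too. */
		toy_end(trans);
		return ret;
	}

	toy_end(trans);
	return 0;
}

/* Runs one recovery attempt and reports how many handles are still held. */
static int toy_handles_after(int fail_lookup)
{
	toy_open_handles = 0;
	(void)toy_recover(fail_lookup);
	return toy_open_handles;
}
```

Removing the `toy_end()` call inside the error branch makes `toy_handles_after(1)` return 1, which is the imbalance the commit fixes.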
2020-04-17 | btrfs: fix setting last_trans for reloc roots | Josef Bacik | 1 | -2/+17
I made a mistake with my previous fix: I assumed that we didn't need to mess with the reloc roots once we were out of the part of relocation where we are actually moving the extents. The subtle thing that I missed is that btrfs_init_reloc_root() also updates the last_trans for the reloc root when we do btrfs_record_root_in_trans() for the corresponding fs_root. I've added a comment to make sure future me doesn't make this mistake again.

This showed up as a WARN_ON() in btrfs_copy_root() because our last_trans didn't match the current transid. This could happen if we snapshotted a fs root with a reloc root after we set rc->create_reloc_tree = 0, but before we actually merge the reloc root.

Worth mentioning that the regression produced the following warning when running snapshot creation and balance in parallel:

    BTRFS info (device sdc): relocating block group 30408704 flags metadata|dup
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 12823 at fs/btrfs/ctree.c:191 btrfs_copy_root+0x26f/0x430 [btrfs]
    CPU: 0 PID: 12823 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_copy_root+0x26f/0x430 [btrfs]
    RSP: 0018:ffffb96e044279b8 EFLAGS: 00010202
    RAX: 0000000000000009 RBX: ffff9da70bf61000 RCX: ffffb96e04427a48
    RDX: ffff9da733a770c8 RSI: ffff9da70bf61000 RDI: ffff9da694163818
    RBP: ffff9da733a770c8 R08: fffffffffffffff8 R09: 0000000000000002
    R10: ffffb96e044279a0 R11: 0000000000000000 R12: ffff9da694163818
    R13: fffffffffffffff8 R14: ffff9da6d2512000 R15: ffff9da714cdac00
    FS: 00007fdeacf328c0(0000) GS:ffff9da735e00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055a2a5b8a118 CR3: 00000001eed78002 CR4: 00000000003606f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     ? create_reloc_root+0x49/0x2b0 [btrfs]
     ? kmem_cache_alloc_trace+0xe5/0x200
     create_reloc_root+0x8b/0x2b0 [btrfs]
     btrfs_reloc_post_snapshot+0x96/0x5b0 [btrfs]
     create_pending_snapshot+0x610/0x1010 [btrfs]
     create_pending_snapshots+0xa8/0xd0 [btrfs]
     btrfs_commit_transaction+0x4c7/0xc50 [btrfs]
     ? btrfs_mksubvol+0x3cd/0x560 [btrfs]
     btrfs_mksubvol+0x455/0x560 [btrfs]
     __btrfs_ioctl_snap_create+0x15f/0x190 [btrfs]
     btrfs_ioctl_snap_create_v2+0xa4/0xf0 [btrfs]
     ? mem_cgroup_commit_charge+0x6e/0x540
     btrfs_ioctl+0x12d8/0x3760 [btrfs]
     ? do_raw_spin_unlock+0x49/0xc0
     ? _raw_spin_unlock+0x29/0x40
     ? __handle_mm_fault+0x11b3/0x14b0
     ? ksys_ioctl+0x92/0xb0
     ksys_ioctl+0x92/0xb0
     ? trace_hardirqs_off_thunk+0x1a/0x1c
     __x64_sys_ioctl+0x16/0x20
     do_syscall_64+0x5c/0x280
     entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7fdeabd3bdd7

Fixes: 2abc726ab4b8 ("btrfs: do not init a reloc root if we aren't relocating") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-08 | btrfs: check commit root generation in should_ignore_root | Josef Bacik | 1 | -2/+2
Previously we would set the reloc root's last snapshot to transid - 1. However there was a problem with doing this, and we changed it to setting the last snapshot to the generation of the commit node of the fs root. This however broke should_ignore_root(). The assumption is that if we are in a generation newer than when the reloc root was created, then we would find the reloc root through normal backref lookups, and thus can ignore any fs roots we find with an old enough reloc root. Now that the last snapshot could be considerably further in the past than before, we'd end up incorrectly ignoring an fs root. Thus we'd find no nodes for the bytenr we were searching for, and we'd fail to relocate anything. We'd loop through the relocate code again and see that there was still used space in that block group, attempt to relocate those bytenrs again, fail in the same way, and just loop like this forever. This is tricky in that we have to not modify the fs root at all during this time, so we need to have a block group that has data in this fs root that is not shared by any other root, which is why this has been difficult to reproduce. Fixes: 054570a1dc94 ("Btrfs: fix relocation incorrectly dropping data references") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 | btrfs: track reloc roots based on their commit root bytenr | Josef Bacik | 1 | -10/+7
We always search the commit root of the extent tree for looking up back references, however we track the reloc roots based on their current bytenr. This is wrong: if we commit the transaction between relocating tree blocks we could end up in this code in build_backref_tree

    if (key.objectid == key.offset) {
        /*
         * Only root blocks of reloc trees use backref
         * pointing to itself.
         */
        root = find_reloc_root(rc, cur->bytenr);
        ASSERT(root);
        cur->root = root;
        break;
    }

find_reloc_root() is looking based on the bytenr we had in the commit root, but if we've COWed this reloc root we will not find that bytenr, and we will trip over the ASSERT(root). Fix this by using the commit_root->start bytenr for indexing the commit root. Then we change the __update_reloc_root() caller to be used when we switch the commit root for the reloc root during commit. This fixes the panic I was seeing when we started throttling relocation for delayed refs. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 | btrfs: restart relocate_tree_blocks properly | Josef Bacik | 1 | -9/+2
There are two bugs here, but fixing them independently would just result in pain if you happened to bisect between the two patches. First is how we handle the -EAGAIN from relocate_tree_block(). We don't set err unless we happen to be the first node, which makes no sense; I have no idea what the code was trying to accomplish here. We in fact _do_ want err set here so that we know we need to restart in relocate_block_group(). Also we need finish_pending_nodes() to not actually call link_to_upper(), because we didn't actually relocate the block. And then if we do get -EAGAIN we do not want to set our backref cache last_trans to the one before ours. This would force us to update our backref cache if we didn't cross transaction ids, which would mean we'd have some nodes updated to their new_bytenr, but still able to find their old bytenr because we're searching the same commit root as the last time we went through relocate_tree_blocks. Fixing these two things keeps us from panicking when we start breaking out of relocate_tree_blocks() either for delayed ref flushing or enospc. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 | btrfs: reloc: reorder reservation before root selection | Josef Bacik | 1 | -6/+8
Since we're not only checking for metadata reservations but also if we need to throttle our delayed ref generation, reorder reserve_metadata_space() above the select_one_root() call in relocate_tree_block(). The reason we want this is because select_reloc_root() will mess with the backref cache, and if we're going to bail we want to be able to cleanly remove this node from the backref cache and come back along to regenerate it. Move it up so this is the first thing we do to make restarting cleaner. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 | btrfs: do not readahead in build_backref_tree | Josef Bacik | 1 | -2/+0
Here we are just searching down to the bytenr we're building the backref tree for, and all of its paths to the roots. These bytenrs are not guaranteed to be anywhere near each other, so readahead just generates extra latency. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 | btrfs: move the root freeing stuff into btrfs_put_root | Josef Bacik | 1 | -4/+0
There are a few different ways to free roots, either you allocated them yourself and you just do

    free_extent_buffer(root->node);
    free_extent_buffer(root->commit_node);
    btrfs_put_root(root);

Which is the pattern for log roots. Or for snapshots/subvolumes that are being dropped you simply call btrfs_free_fs_root() which does all the cleanup for you. Unify this all into btrfs_put_root(), so that we don't free up things associated with the root until the last reference is dropped. This makes the root freeing code much more significant. The only caveat is at close_ctree() time we have to free the extent buffers for all of our main roots (extent_root, chunk_root, etc) because we have to drop the btree_inode and we'll run into issues if we hold onto those nodes until ->kill_sb() time. This will be addressed in the future when we kill the btree_inode. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: remove a BUG_ON() from merge_reloc_roots()Josef Bacik1-1/+15
This was pretty subtle: we default to reloc roots having 0 root refs, so if we crash in the middle of the relocation they can just be deleted. If we successfully complete the relocation operations we'll set our root refs to 1 in prepare_to_merge() and then go on to merge_reloc_roots(). At prepare_to_merge() time if any of the reloc roots have a 0 reference still, we will remove that reloc root from our reloc root rb tree, and then clean it up later. However this only happens if we successfully start a transaction. If we've aborted previously we will skip this step completely, and only have reloc roots with a reference count of 0 that were never properly removed from the reloc control's rb tree. This isn't a problem per se, our references are held by the list the reloc roots are on, and by the original root the reloc root belongs to. If we end up in this situation all the reloc roots will be added to the dirty_reloc_list, and then properly dropped at that point. The reloc control will be freed and the rb tree is no longer used. There were two options when fixing this, one was to remove the BUG_ON(), the other was to make prepare_to_merge() handle the case where we couldn't start a trans handle. IMO removing the BUG_ON() is the cleaner solution. I started with handling the error in prepare_to_merge(), but it turned out super ugly. And in the end this BUG_ON() simply doesn't matter, the cleanup was happening properly, we were just panicking because this BUG_ON() only matters in the success case. So I've opted to just remove it and add a comment where it was. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a ref on the root->reloc_rootJosef Bacik1-10/+48
We previously were relying on root->reloc_root to be cleaned up by the drop snapshot, or the error handling. However if btrfs_drop_snapshot() failed it wouldn't drop the ref for the root. Also we sort of depend on the right thing to happen with moving reloc roots between lists and the fs root they belong to, which makes it hard to figure out who owns the reference. Fix this by explicitly holding a reference on the reloc root for root->reloc_root. This means that we hold two references on reloc roots, one for whichever reloc_roots list it's attached to, and the root->reloc_root we're on. This makes it easier to reason out who owns a reference on the root, and when it needs to be dropped. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
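The two-owner scheme described above (one reference for list membership, one for the root->reloc_root pointer) can be sketched in userspace as follows. This is only an illustration under assumptions: the sk_* types and helpers are hypothetical, the "list" is reduced to an on_list flag, and nothing here mirrors the actual kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reloc root: only the reference count and list membership. */
struct sk_reloc_root {
	int refs;
	int on_list; /* 1 while some reloc_roots list holds a reference */
};

/* Hypothetical fs root that may point at its reloc root. */
struct sk_fs_root {
	struct sk_reloc_root *reloc_root;
};

/* The fs root's pointer owns one reference for as long as it is set. */
void sk_attach(struct sk_fs_root *fs, struct sk_reloc_root *reloc)
{
	assert(fs->reloc_root == NULL);
	reloc->refs++;
	fs->reloc_root = reloc;
}

/* Clearing the pointer drops exactly the reference the pointer owned. */
void sk_detach(struct sk_fs_root *fs)
{
	struct sk_reloc_root *reloc = fs->reloc_root;

	if (!reloc)
		return;
	fs->reloc_root = NULL;
	assert(reloc->refs > 0);
	reloc->refs--;
}

/* List membership owns a second, independent reference. */
void sk_list_add(struct sk_reloc_root *reloc)
{
	assert(!reloc->on_list);
	reloc->refs++;
	reloc->on_list = 1;
}

/* Leaving the list drops only the list's reference. */
void sk_list_del(struct sk_reloc_root *reloc)
{
	if (!reloc->on_list)
		return;
	reloc->on_list = 0;
	assert(reloc->refs > 0);
	reloc->refs--;
}
```

Because each owner takes and drops its own reference, removing the reloc root from a list never invalidates the pointer held by the fs root, and vice versa.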
2020-03-23btrfs: clear DEAD_RELOC_TREE before dropping the reloc rootJosef Bacik1-6/+6
The DEAD_RELOC_TREE flag is in place in order to avoid a use after free in init_reloc_root, tracking the presence of reloc_root. However adding the explicit tree references in previous patches makes the use after free impossible because at this point we no longer have a reloc_control set on the fs_info and thus cannot enter the function. So move this to be coupled with clearing the root->reloc_root so we're consistent with all other operations of the reloc root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: free the reloc_control in a consistent wayJosef Bacik1-2/+14
If we have an error while processing the reloc roots we could leak roots that were added to rc->reloc_roots before we hit the error. We could have also not removed the reloc tree mapping from our rb_tree, so clean up any remaining nodes in the reloc root rb_tree. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ use rbtree_postorder_for_each_entry_safe ] Signed-off-by: David Sterba <dsterba@suse.com>
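As a rough illustration of why a post-order walk suits this kind of teardown, here is a userspace sketch that frees every remaining node of a tree, visiting children before their parent so no freed node is ever dereferenced. It assumes a plain unbalanced binary search tree rather than the kernel rbtree, and struct demo_mapping, demo_insert() and demo_free_all() are hypothetical names; the kernel change itself uses rbtree_postorder_for_each_entry_safe, as noted in the changelog.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical mapping node keyed by bytenr; a plain BST, not an rbtree. */
struct demo_mapping {
	unsigned long long bytenr;
	struct demo_mapping *left;
	struct demo_mapping *right;
};

/* Insert a key. Returns 0 on success, -1 on duplicate, -2 on allocation failure. */
int demo_insert(struct demo_mapping **rootp, unsigned long long bytenr)
{
	assert(rootp != NULL);
	while (*rootp) {
		if (bytenr == (*rootp)->bytenr)
			return -1;
		rootp = bytenr < (*rootp)->bytenr ? &(*rootp)->left
						  : &(*rootp)->right;
	}
	*rootp = calloc(1, sizeof(**rootp));
	if (!*rootp)
		return -2;
	(*rootp)->bytenr = bytenr;
	return 0;
}

/*
 * Free every node in post-order (both subtrees first, then the node itself)
 * and reset the root pointer. Returns the number of nodes freed.
 */
int demo_free_all(struct demo_mapping **rootp)
{
	struct demo_mapping *node;
	int freed;

	assert(rootp != NULL);
	node = *rootp;
	if (!node)
		return 0;
	freed = demo_free_all(&node->left) + demo_free_all(&node->right);
	free(node);
	*rootp = NULL;
	return freed + 1;
}
```

Calling demo_free_all() on an already empty tree is a no-op, which keeps an error path simple: it can run the cleanup unconditionally.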