summaryrefslogtreecommitdiff
path: root/fs/btrfs/inode.c
AgeCommit message (Collapse)AuthorFilesLines
2018-11-21Btrfs: fix null pointer dereference on compressed write path errorFilipe Manana1-0/+1
commit 3527a018c00e5dbada2f9d7ed5576437b6dd5cfb upstream. At inode.c:compress_file_range(), under the "free_pages_out" label, we can end up dereferencing the "pages" pointer when it has a NULL value. This case happens when "start" has a value of 0 and we fail to allocate memory for the "pages" pointer. When that happens we jump to the "cont" label and then enter the "if (start == 0)" branch where we immediately call the cow_file_range_inline() function. If that function returns 0 (success creating an inline extent) or an error (like -ENOMEM for example) we jump to the "free_pages_out" label and then access "pages[i]" leading to a NULL pointer dereference, since "nr_pages" has a value greater than zero at that point. Fix this by setting "nr_pages" to 0 when we fail to allocate memory for the "pages" pointer. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119 Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03Btrfs: fix unexpected cow in run_delalloc_nocowLiu Bo1-3/+30
commit 5811375325420052fcadd944792a416a43072b7f upstream. Fstests generic/475 provides a way to fail metadata reads while checking if checksum exists for the inode inside run_delalloc_nocow(), and csum_exist_in_range() interprets error (-EIO) as inode having checksum and makes its caller enter the cow path. In case of free space inode, this ends up with a warning in cow_file_range(). The same problem applies to btrfs_cross_ref_exist() since it may also read metadata in between. With this, run_delalloc_nocow() bails out when errors occur at the two places. cc: <stable@vger.kernel.org> v2.6.28+ Fixes: 17d217fe970d ("Btrfs: fix nodatasum handling in balancing code") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30do d_instantiate/unlock_new_inode combinations safelyAl Viro1-12/+4
commit 1e2e547a93a00ebc21582c06ca3c6cfea2a309ee upstream. For anything NFS-exported we do _not_ want to unlock new inode before it has grown an alias; original set of fixes got the ordering right, but missed the nasty complication in case of lockdep being enabled - unlock_new_inode() does lockdep_annotate_inode_mutex_key(inode) which can only be done before anyone gets a chance to touch ->i_mutex. Unfortunately, flipping the order and doing unlock_new_inode() before d_instantiate() opens a window when mkdir can race with open-by-fhandle on a guessed fhandle, leading to multiple aliases for a directory inode and all the breakage that follows from that. Correct solution: a new primitive (d_instantiate_new()) combining these two in the right order - lockdep annotate, then d_instantiate(), then the rest of unlock_new_inode(). All combinations of d_instantiate() with unlock_new_inode() should be converted to that. Cc: stable@kernel.org # 2.6.29 and later Tested-by: Mike Marshall <hubcap@omnibond.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22Btrfs: fix deadlock in run_delalloc_nocowLiu Bo1-1/+4
commit e89166990f11c3f21e1649d760dd35f9e410321c upstream. @cur_offset is not set back to what it should be (@cow_start) if btrfs_next_leaf() returns something wrong, and the range [cow_start, cur_offset) remains locked forever. cc: <stable@vger.kernel.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-16btrfs: Handle btrfs_set_extent_delalloc failure in fixup workerNikolay Borisov1-1/+9
commit f3038ee3a3f1017a1cbe9907e31fa12d366c5dcb upstream. This function was introduced by 247e743cbe6e ("Btrfs: Use async helpers to deal with pages that have been improperly dirtied") and it didn't do any error handling then. This function might very well fail in ENOMEM situation, yet it's not handled, this could lead to inconsistent state. So let's handle the failure by setting the mapping error bit. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20btrfs: add missing memset while reading compressed inline extentsZygo Blaxell1-0/+14
[ Upstream commit e1699d2d7bf6e6cce3e1baff19f9dd4595a58664 ] This is a story about 4 distinct (and very old) btrfs bugs. Commit c8b978188c ("Btrfs: Add zlib compression support") added three data corruption bugs for inline extents (bugs #1-3). Commit 93c82d5750 ("Btrfs: zero page past end of inline file items") fixed bug #1: uncompressed inline extents followed by a hole and more extents could get non-zero data in the hole as they were read. The fix was to add a memset in btrfs_get_extent to zero out the hole. Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption") fixed bug #2: compressed inline extents which contained non-zero bytes might be replaced with zero bytes in some cases. This patch removed an unhelpful memset from uncompress_inline, but the case where memset is required was missed. There is also a memset in the decompression code, but this only covers decompressed data that is shorter than the ram_bytes from the extent ref record. This memset doesn't cover the region between the end of the decompressed data and the end of the page. It has also moved around a few times over the years, so there's no single patch to refer to. This patch fixes bug #3: compressed inline extents followed by a hole and more extents could get non-zero data in the hole as they were read (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/). The fix is the same: zero out the hole in the compressed case too, by putting a memset back in uncompress_inline, but this time with correct parameters. The last and oldest bug, bug #0, is the cause of the offending inline extent/hole/extent pattern. Bug #0 is a subtle and mostly-harmless quirk of behavior somewhere in the btrfs write code. In a few special cases, an inline extent and hole are allowed to persist where they normally would be combined with later extents in the file. A fast reproducer for bug #0 is presented below. A few offending extents are also created in the wild during large rsync transfers with the -S flag. A Linux kernel build (git checkout; make allyesconfig; make -j8) will produce a handful of offending files as well. Once an offending file is created, it can present different content to userspace each time it is read. Bug #0 is at least 4 and possibly 8 years old. I verified every vX.Y kernel back to v3.5 has this behavior. There are fossil records of this bug's effects in commits all the way back to v2.6.32. I have no reason to believe bug #0 wasn't present at the beginning of btrfs compression support in v2.6.29, but I can't easily test kernels that old to be sure. It is not clear whether bug #0 is worth fixing. A fix would likely require injecting extra reads into currently write-only paths, and most of the exceptional cases caused by bug #0 are already handled now. Whether we like them or not, bug #0's inline extents followed by holes are part of the btrfs de-facto disk format now, and we need to be able to read them without data corruption or an infoleak. So enough about bug #0, let's get back to bug #3 (this patch). An example of on-disk structure leading to data corruption found in the wild: item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160 inode generation 50 transid 50 size 47424 nbytes 49141 block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 flags 0x0(none) item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20 inode ref index 3 namelen 10 name: DB_File.so item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362 inline extent data size 1341 ram 4085 compress(zlib) item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53 extent data disk byte 5367308288 nr 20480 extent data offset 0 nr 45056 ram 45056 extent compression(zlib) Different data appears in userspace during each read of the 11 bytes between 4085 and 4096. The extent in item 63 is not long enough to fill the first page of the file, so a memset is required to fill the space between item 63 (ending at 4085) and item 64 (beginning at 4096) with zero. Here is a reproducer from Liu Bo, which demonstrates another method of creating the same inline extent and hole pattern: Using 'page_poison=on' kernel command line (or enable CONFIG_PAGE_POISONING) run the following: # touch foo # chattr +c foo # xfs_io -f -c "pwrite -W 0 1000" foo # xfs_io -f -c "falloc 4 8188" foo # od -x foo # echo 3 >/proc/sys/vm/drop_caches # od -x foo This produce the following on my box: Correct output: file contains 1000 data bytes followed by zeros: 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000 0001760 0000 0000 0000 0000 0000 0000 0000 0000 * 0020000 Actual output: the data after the first 1000 bytes will be different each run: 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d 0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400 0002000 435f 0056 5f74 6164 7400 645f 0062 5f74 (...) Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-07Btrfs: adjust outstanding_extents counter properly when dio write is splitLiu Bo1-2/+9
[ Upstream commit c2931667c83ded6504b3857e99cc45b21fa496fb ] Currently how btrfs dio deals with split dio write is not good enough if dio write is split into several segments due to the lack of contiguous space, a large dio write like 'dd bs=1G count=1' can end up with incorrect outstanding_extents counter and endio would complain loudly with an assertion. This fixes the problem by compensating the outstanding_extents counter in inode if a large dio write gets split. Reported-by: Anand Jain <anand.jain@oracle.com> Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05Btrfs: fix truncate down when no_holes feature is enabledLiu Bo1-1/+12
[ Upstream commit 91298eec05cd8d4e828cf7ee5d4a6334f70cf69a ] For such a file mapping, [0-4k][hole][8k-12k] In NO_HOLES mode, we don't have the [hole] extent any more. Commit c1aa45759e90 ("Btrfs: fix shrinking truncate when the no_holes feature is enabled") fixed disk isize not being updated in NO_HOLES mode when data is not flushed. However, even if data has been flushed, we can still have trouble in updating disk isize since we updated disk isize to 'start' of the last evicted extent. Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14btrfs: use correct types for page indices in btrfs_page_exists_in_rangeDavid Sterba1-2/+2
commit cc2b702c52094b637a351d7491ac5200331d0445 upstream. Variables start_idx and end_idx are supposed to hold a page index derived from the file offsets. The int type is not the right one though, offsets larger than 1 << 44 will get silently trimmed off the high bits. (1 << 44 is 16TiB) What can go wrong, if start is below the boundary and end gets trimmed: - if there's a page after start, we'll find it (radix_tree_gang_lookup_slot) - the final check "if (page->index <= end_idx)" will unexpectedly fail The function will return false, ie. "there's no page in the range", although there is at least one. btrfs_page_exists_in_range is used to prevent races in: * in hole punching, where we make sure there are not pages in the truncated range, otherwise we'll wait for them to finish and redo truncation, but we're going to replace the pages with holes anyway so the only problem is the intermediate state * lock_extent_direct: we want to make sure there are no pages before we lock and start DIO, to prevent stale data reads For practical occurence of the bug, there are several constaints. The file must be quite large, the affected range must cross the 16TiB boundary and the internal state of the file pages and pending operations must match. Also, we must not have started any ordered data in the range, otherwise we don't even reach the buggy function check. DIO locking tries hard in several places to avoid deadlocks with buffered IO and avoids waiting for ranges. The worst consequence seems to be stale data read. CC: Liu Bo <bo.li.liu@oracle.com> Fixes: fc4adbff823f7 ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking") Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-18btrfs: qgroup: Prevent qgroup->reserved from going subzeroGoldwyn Rodrigues1-2/+7
commit 0b34c261e235a5c74dcf78bd305845bd15fe2b42 upstream. While free'ing qgroup->reserved resources, we much check if the page has not been invalidated by a truncate operation by checking if the page is still dirty before reducing the qgroup resources. Resources in such a case are free'd when the entire extent is released by delayed_ref. This fixes a double accounting while releasing resources in case of truncating a file, reproduced by the following testcase. SCRATCH_DEV=/dev/vdb SCRATCH_MNT=/mnt mkfs.btrfs -f $SCRATCH_DEV mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT cd $SCRATCH_MNT btrfs quota enable $SCRATCH_MNT btrfs subvolume create a btrfs qgroup limit 500m a $SCRATCH_MNT sync for c in {1..15}; do dd if=/dev/zero bs=1M count=40 of=$SCRATCH_MNT/a/file; done sleep 10 sync sleep 5 touch $SCRATCH_MNT/a/newfile echo "Removing file" rm $SCRATCH_MNT/a/file Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-10Btrfs: fix deadlock running delayed iputs at transaction commit timeFilipe Manana1-4/+0
commit c2d6cb1636d235257086f939a8194ef0bf93af6e upstream. While running a stress test I ran into a deadlock when running the delayed iputs at transaction time, which produced the following report and trace: [ 886.399989] ============================================= [ 886.400871] [ INFO: possible recursive locking detected ] [ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted [ 886.402384] --------------------------------------------- [ 886.403182] fio/8277 is trying to acquire lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] but task is already holding lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] other info that might help us debug this: [ 886.403568] Possible unsafe locking scenario: [ 886.403568] [ 886.403568] CPU0 [ 886.403568] ---- [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] [ 886.403568] *** DEADLOCK *** [ 886.403568] [ 886.403568] May be due to missing lock nesting notation [ 886.403568] [ 886.403568] 3 locks held by fio/8277: [ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0 [ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs] [ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] stack backtrace: [ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 886.403568] 0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0 [ 886.403568] ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290 [ 886.403568] 0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804 [ 886.403568] Call Trace: [ 886.403568] [<ffffffff8125d4fd>] dump_stack+0x4e/0x79 [ 886.403568] [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b [ 886.403568] [<ffffffff810c22db>] ? __module_address+0xdf/0x108 [ 886.403568] [<ffffffff8108eb77>] lock_acquire+0x10d/0x194 [ 886.403568] [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194 [ 886.403568] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffff8148556b>] down_read+0x3e/0x4d [ 886.489542] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs] [ 886.489542] [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs] [ 886.489542] [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs] [ 886.489542] [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs] [ 886.489542] [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs] [ 886.489542] [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs] [ 886.489542] [<ffffffff81188e31>] evict+0xa7/0x15c [ 886.489542] [<ffffffff81189878>] iput+0x1d3/0x266 [ 886.489542] [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs] [ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [<ffffffff81085096>] ? signal_pending_state+0x31/0x31 [ 886.489542] [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs] [ 886.489542] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 886.489542] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 886.489542] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 886.489542] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128 [ 886.489542] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 886.489542] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50 [ 886.489542] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5 [ 886.489542] [<ffffffff81172cda>] vfs_write+0xa0/0xe4 [ 886.489542] [<ffffffff811734cc>] SyS_write+0x50/0x7e [ 886.489542] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds. [ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.863227] fio D ffff880213f9bb28 0 8244 8240 0x00000000 [ 1081.868719] ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240 [ 1081.872499] ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400 [ 1081.876834] ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4 [ 1081.880782] Call Trace: [ 1081.881793] [<ffffffff81482ba4>] schedule+0x7f/0x97 [ 1081.883340] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325 [ 1081.895525] [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab [ 1081.897419] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20 [ 1081.899251] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20 [ 1081.901063] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21 [ 1081.902365] [<ffffffff814855bd>] down_write+0x43/0x57 [ 1081.903846] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.906078] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.908846] [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c [ 1081.910409] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 1081.912482] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 1081.914597] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 1081.919037] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128 [ 1081.920754] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 1081.922496] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50 [ 1081.923922] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5 [ 1081.925275] [<ffffffff81172cda>] vfs_write+0xa0/0xe4 [ 1081.926584] [<ffffffff811734cc>] SyS_write+0x50/0x7e [ 1081.927968] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.985293] INFO: lockdep is turned off. [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds. [ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.990147] fio D ffff880218febbb8 0 8249 8240 0x00000000 [ 1081.991626] ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240 [ 1081.993258] ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00 [ 1081.994850] ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4 [ 1081.996485] Call Trace: [ 1081.997037] [<ffffffff81482ba4>] schedule+0x7f/0x97 [ 1081.998017] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325 [ 1081.999241] [<ffffffff810852a5>] ? finish_wait+0x6d/0x76 [ 1082.000306] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20 [ 1082.001533] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20 [ 1082.002776] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21 [ 1082.003995] [<ffffffff814855bd>] down_write+0x43/0x57 [ 1082.005000] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.007403] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.008988] [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs] [ 1082.010193] [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77 [ 1082.011280] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0 [ 1082.012265] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0 [ 1082.013021] [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff [ 1082.013738] [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b [ 1082.014778] [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea [ 1082.015778] [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e [ 1082.016806] [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71 [ 1082.017789] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [ 1082.018706] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f This happens because we can recursively acquire the semaphore fs_info->delayed_iput_sem when attempting to allocate space to satisfy a file write request as shown in the first trace above - when committing a transaction we acquire (down_read) the semaphore before running the delayed iputs, and when running a delayed iput() we can end up calling an inode's eviction handler, which in turn commits another transaction and attempts to acquire (down_read) again the semaphore to run more delayed iput operations. This results in a deadlock because if a task acquires multiple times a semaphore it should invoke down_read_nested() with a different lockdep class for each level of recursion. Fix this by simplifying the implementation and use a mutex instead that is acquired by the cleaner kthread before it runs the delayed iputs instead of always acquiring a semaphore before delayed references are run from anywhere. Fixes: d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-04Btrfs: fix transaction handle leak on failure to create hard linkFilipe Manana1-2/+4
commit 271dba4521aed0c37c063548f876b49f5cd64b2e upstream. If we failed to create a hard link we were not always releasing the the transaction handle we got before, resulting in a memory leak and preventing any other tasks from being able to commit the current transaction. Fix this by always releasing our transaction handle. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-04Btrfs: fix number of transaction units required to create symlinkFilipe Manana1-1/+3
commit 9269d12b2d57d9e3d13036bb750762d1110d425c upstream. We weren't accounting for the insertion of an inline extent item for the symlink inode nor that we need to update the parent inode item (through the call to btrfs_add_nondir()). So fix this by including two more transaction units. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-04Btrfs: igrab inode in writepageJosef Bacik1-2/+15
commit be7bd730841e69fe8f70120098596f648cd1f3ff upstream. We hit this panic on a few of our boxes this week where we have an ordered_extent with an NULL inode. We do an igrab() of the inode in writepages, but weren't doing it in writepage which can be called directly from the VM on dirty pages. If the inode has been unlinked then we could have I_FREEING set which means igrab() would return NULL and we get this panic. Fix this by trying to igrab in btrfs_writepage, and if it returns NULL then just redirty the page and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25Btrfs: fix direct IO requests not reporting IO error to user spaceFilipe Manana1-0/+2
commit 1636d1d77ef4e01e57f706a4cae3371463896136 upstream. If a bio for a direct IO request fails, we were not setting the error in the parent bio (the main DIO bio), making us not return the error to user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO() return the number of bytes issued for IO and not the error a bio created and submitted by btrfs_submit_direct() got from the block layer. This essentially happens because when we call: dio_end_io(dio_bio, bio->bi_error); It does not set dio_bio->bi_error to the value of the second argument. So just add this missing assignment in endio callbacks, just as we do in the error path at btrfs_submit_direct() when we fail to clone the dio bio or allocate its private object. This follows the convention of what is done with other similar APIs such as bio_endio() where the caller is responsible for setting the bi_error field in the bio it passes as an argument to bio_endio(). This was detected by the new generic test cases in xfstests: 271, 272, 276 and 278. Which essentially setup a dm error target, then load the error table, do a direct IO write and unload the error table. They expect the write to fail with -EIO, which was not getting reported when testing against btrfs. Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25btrfs: properly set the termination value of ctx->pos in readdirDavid Sterba1-1/+13
commit bc4ef7592f657ae81b017207a1098817126ad4cb upstream. The value of ctx->pos in the last readdir call is supposed to be set to INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a larger value, then it's LLONG_MAX. There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++" overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a 64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before the increment. We can get to that situation like that: * emit all regular readdir entries * still in the same call to readdir, bump the last pos to INT_MAX * next call to readdir will not emit any entries, but will reach the bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX Normally this is not a problem, but if we call readdir again, we'll find 'pos' set to LLONG_MAX and the unconditional increment will overflow. The report from Victor at (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging print shows that pattern: Overflow: e Overflow: 7fffffff Overflow: 7fffffffffffffff PAX: size overflow detected in function btrfs_real_readdir fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0; context: dir_context; CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015 ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48 ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78 ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8 Call Trace: [<ffffffff81742f0f>] dump_stack+0x4c/0x7f [<ffffffff811cb706>] report_size_overflow+0x36/0x40 [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0 [<ffffffff811dafc8>] iterate_dir+0xa8/0x150 [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70 [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0 Overflow: 1a [<ffffffff811db070>] ? iterate_dir+0x150/0x150 [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83 The jump from 7fffffff to 7fffffffffffffff happens when new dir entries are not yet synced and are processed from the delayed list. Then the code could go to the bump section again even though it might not emit any new dir entries from the delayed list. The fix avoids entering the "bump" section again once we've finished emitting the entries, both for synced and delayed entries. References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284 Reported-by: Victor <services@swwu.com> Signed-off-by: David Sterba <dsterba@suse.com> Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-11-28Merge branch 'for-linus-4.4' of ↵Linus Torvalds1-23/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This has Mark Fasheh's patches to fix quota accounting during subvol deletion, which we've been working on for a while now. The patch is pretty small but it's a key fix. Otherwise it's a random assortment" * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix balance range usage filters in 4.4-rc btrfs: qgroup: account shared subtree during snapshot delete Btrfs: use btrfs_get_fs_root in resolve_indirect_ref btrfs: qgroup: fix quota disable during rescan Btrfs: fix race between cleaner kthread and space cache writeout Btrfs: fix scrub preventing unused block groups from being deleted Btrfs: fix race between scrub and block group deletion btrfs: fix rcu warning during device replace btrfs: Continue replace when set_block_ro failed btrfs: fix clashing number of the enhanced balance usage filter Btrfs: fix the number of transaction units needed to remove a block group Btrfs: use global reserve when deleting unused block group after ENOSPC Btrfs: tests: checking for NULL instead of IS_ERR() btrfs: fix signed overflows in btrfs_sync_file
2015-11-25Btrfs: use global reserve when deleting unused block group after ENOSPCFilipe Manana1-23/+1
It's possible to reach a state where the cleaner kthread isn't able to start a transaction to delete an unused block group due to lack of enough free metadata space and due to lack of unallocated device space to allocate a new metadata block group as well. If this happens try to use space from the global block group reserve just like we do for unlink operations, so that we don't reach a permanent state where starting a transaction for filesystem operations (file creation, renames, etc) keeps failing with -ENOSPC. Such an unfortunate state was observed on a machine where over a dozen unused data block groups existed and the cleaner kthread was failing to delete them due to ENOSPC error when attempting to start a transaction, and even running balance with a -dusage=0 filter failed with ENOSPC as well. Also unmounting and mounting again the filesystem didn't help. Allowing the cleaner kthread to use the global block reserve to delete the unused data block groups fixed the problem. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-14Merge branch 'for-linus-4.4' of ↵Linus Torvalds1-17/+45
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes and cleanups from Chris Mason: "Some of this got cherry-picked from a github repo this week, but I verified the patches. We have three small scrub cleanups and a collection of fixes" * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: Use fs_info directly in btrfs_delete_unused_bgs btrfs: Fix lost-data-profile caused by balance bg btrfs: Fix lost-data-profile caused by auto removing bg btrfs: Remove len argument from scrub_find_csum btrfs: Reduce unnecessary arguments in scrub_recheck_block btrfs: Use scrub_checksum_data and scrub_checksum_tree_block for scrub_recheck_block_checksum btrfs: Reset sblock->xxx_error stats before calling scrub_recheck_block_checksum btrfs: scrub: setup all fields for sblock_to_check btrfs: scrub: set error stats when tree block spanning stripes Btrfs: fix race when listing an inode's xattrs Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow Btrfs: fix race leading to incorrect item deletion when dropping extents Btrfs: fix sleeping inside atomic context in qgroup rescan worker Btrfs: fix race waiting for qgroup rescan worker btrfs: qgroup: exit the rescan worker during umount Btrfs: fix extent accounting for partial direct IO writes
2015-11-10fs/btrfs/inode.c: remove unnecessary new_valid_dev() checkYaowei Bai1-3/+0
new_valid_dev() always returns 1, so the !new_valid_dev() check is not needed. Remove it. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-09Btrfs: fix race leading to BUG_ON when running delalloc for nodatacowFilipe Manana1-2/+8
If we are using the NO_HOLES feature, we have a tiny time window when running delalloc for a nodatacow inode where we can race with a concurrent link or xattr add operation leading to a BUG_ON. This happens because at run_delalloc_nocow() we end up casting a leaf item of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a file extent item (struct btrfs_file_extent_item) and then analyse its extent type field, which won't match any of the expected extent types (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an explicit BUG_ON(1). The following sequence diagram shows how the race happens when running a no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following neighbour leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ] slot N - 2 slot N - 1 slot 0 (Note the implicit hole for inode 257 regarding the [0, 8K[ range) CPU 1 CPU 2 run_dealloc_nocow() btrfs_lookup_file_extent() --> searches for a key with value (257 EXTENT_DATA 4096) in the fs/subvol tree --> returns us a path with path->nodes[0] == leaf X and path->slots[0] == N because path->slots[0] is >= btrfs_header_nritems(leaf X), it calls btrfs_next_leaf() btrfs_next_leaf() --> releases the path hard link added to our inode, with key (257 INODE_REF 500) added to the end of leaf X, so leaf X now has N + 1 keys --> searches for the key (257 INODE_REF 256), because it was the last key in leaf X before it released the path, with path->keep_locks set to 1 --> ends up at leaf X again and it verifies that the key (257 INODE_REF 256) is no longer the last key in the leaf, so it returns with path->nodes[0] == leaf X and path->slots[0] == N, pointing to the new item with key (257 INODE_REF 500) the loop iteration of run_dealloc_nocow() does not break out the loop and continues because the key referenced in the path at path->nodes[0] and path->slots[0] is for inode 257, its type is < BTRFS_EXTENT_DATA_KEY and its offset (500) is less then our delalloc range's end (8192) the item pointed by the path, an inode reference item, is (incorrectly) interpreted as a file extent item and we get an invalid extent type, leading to the BUG_ON(1): if (extent_type == BTRFS_FILE_EXTENT_REG || extent_type == BTRFS_FILE_EXTENT_PREALLOC) { (...) } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { (...) } else { BUG_ON(1) } The same can happen if a xattr is added concurrently and ends up having a key with an offset smaller then the delalloc's range end. So fix this by skipping keys with a type smaller than BTRFS_EXTENT_DATA_KEY. Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-11-05Btrfs: fix extent accounting for partial direct IO writesFilipe Manana1-15/+37
When doing a write using direct IO we can end up not doing the whole write operation using the direct IO path, in that case we fallback to a buffered write to do the remaining IO. This happens for example if the range we are writing to contains a compressed extent. When we do a partial write and fallback to buffered IO, due to the existence of a compressed extent for example, we end up not adjusting the outstanding extents counter of our inode which ends up getting decremented twice, once by the DIO ordered extent for the partial write and once again by btrfs_direct_IO(), resulting in an arithmetic underflow at extent-tree.c:drop_outstanding_extent(). For example if we have: extents [ prealloc extent ] [ compressed extent ] offsets A B C D E and at the moment our inode's outstanding extents counter is 0, if we do a direct IO write against the range [B, D[ (which has a length smaller than 128Mb), we end up bumping our inode's outstanding extents counter to 1, we create a DIO ordered extent for the range [B, C[ and then fallback to a buffered write for the range [C, D[. The direct IO handler (inode.c:btrfs_direct_IO()) decrements the outstanding extents counter by 1, leaving it with a value of 0, through a call to btrfs_delalloc_release_space() and then shortly after the DIO ordered extent finishes and calls btrfs_delalloc_release_metadata() which ends up to attempt to decrement the inode's outstanding extents counter by 1, resulting in an assertion failure at drop_outstanding_extent() because the operation would result in an arithmetic underflow (0 - 1). This produces the following trace: [125471.336838] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5526 [125471.338844] ------------[ cut here ]------------ [125471.340745] kernel BUG at fs/btrfs/ctree.h:4173! [125471.340745] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [125471.340745] Modules linked in: btrfs f2fs xfs libcrc32c dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpufreq psmouse i2c_piix4 parport pcspkr serio_raw microcode processor evdev i2c_core button ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci virtio_ring floppy libata virtio e1000 scsi_mod [last unloaded: btrfs] [125471.340745] CPU: 10 PID: 23649 Comm: kworker/u32:1 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [125471.340745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [125471.340745] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] [125471.340745] task: ffff8804244fcf80 ti: ffff88040a118000 task.ti: ffff88040a118000 [125471.340745] RIP: 0010:[<ffffffffa0550da1>] [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs] [125471.340745] RSP: 0018:ffff88040a11bc78 EFLAGS: 00010296 [125471.340745] RAX: 0000000000000075 RBX: 0000000000005000 RCX: 0000000000000000 [125471.340745] RDX: ffffffff81098f93 RSI: ffffffff8147c619 RDI: 00000000ffffffff [125471.340745] RBP: ffff88040a11bc78 R08: 0000000000000001 R09: 0000000000000000 [125471.340745] R10: ffff88040a11bc08 R11: ffffffff81651000 R12: ffff8803efb4a000 [125471.340745] R13: ffff8803efb4a000 R14: 0000000000000000 R15: ffff8802f8e33c88 [125471.340745] FS: 0000000000000000(0000) GS:ffff88043dd40000(0000) knlGS:0000000000000000 [125471.340745] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [125471.340745] CR2: 00007fae7ca86095 CR3: 0000000001a0b000 CR4: 00000000000006e0 [125471.340745] Stack: [125471.340745] ffff88040a11bc88 ffffffffa04ca0cd ffff88040a11bcc8 ffffffffa04ceeb1 [125471.340745] ffff8802f8e33940 ffff8802c93eadb0 ffff8802f8e0bf50 ffff8803efb4a000 [125471.340745] 0000000000000000 ffff8802f8e33c88 ffff88040a11bd38 ffffffffa04eccfa [125471.340745] Call Trace: [125471.340745] [<ffffffffa04ca0cd>] drop_outstanding_extent+0x3d/0x6d [btrfs] [125471.340745] [<ffffffffa04ceeb1>] btrfs_delalloc_release_metadata+0x51/0xdd [btrfs] [125471.340745] [<ffffffffa04eccfa>] btrfs_finish_ordered_io+0x420/0x4eb [btrfs] [125471.340745] [<ffffffffa04ecdda>] finish_ordered_fn+0x15/0x17 [btrfs] [125471.340745] [<ffffffffa050e6e8>] normal_work_helper+0x14c/0x32a [btrfs] [125471.340745] [<ffffffffa050e9c8>] btrfs_endio_write_helper+0x12/0x14 [btrfs] [125471.340745] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac [125471.340745] [<ffffffff81064285>] worker_thread+0x206/0x2c2 [125471.340745] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [125471.340745] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [125471.340745] [<ffffffff8106904d>] kthread+0xef/0xf7 [125471.340745] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [125471.340745] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70 [125471.340745] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [125471.340745] Code: a5 55 a0 48 89 e5 e8 42 50 bc e0 0f 0b 55 89 f1 48 c7 c2 f0 a8 55 a0 48 89 fe 31 c0 48 c7 c7 14 aa 55 a0 48 89 e5 e8 22 50 bc e0 <0f> 0b 0f 1f 44 00 00 55 31 c9 ba 18 00 00 00 48 89 e5 41 56 41 [125471.340745] RIP [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs] [125471.340745] RSP <ffff88040a11bc78> [125471.539620] ---[ end trace 144259f7838b4aa4 ]--- So fix this by ensuring we adjust the outstanding extents counter when we do the fallback just like we do for the case where the whole write can be done through the direct IO path. We were also adjusting the outstanding extents counter by a constant value of 1, which is incorrect because we were ignorning that we account extents in BTRFS_MAX_EXTENT_SIZE units, o fix that as well. The following test case for fstests reproduces this issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_xfs_io_command "falloc" rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount "-o compress" # Create a compressed extent covering the range [700K, 800K[. $XFS_IO_PROG -f -s -c "pwrite -S 0xaa -b 100K 700K 100K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Create prealloc extent covering the range [600K, 700K[. $XFS_IO_PROG -c "falloc 600K 100K" $SCRATCH_MNT/foo # Write 80K of data to the range [640K, 720K[ using direct IO. This # range covers both the prealloc extent and the compressed extent. # Because there's a compressed extent in the range we are writing to, # the DIO write code path ends up only writing the first 60k of data, # which goes to the prealloc extent, and then falls back to buffered IO # for writing the remaining 20K of data - because that remaining data # maps to a file range containing a compressed extent. # When falling back to buffered IO, we used to trigger an assertion when # releasing reserved space due to bad accounting of the inode's # outstanding extents counter, which was set to 1 but we ended up # decrementing it by 1 twice, once through the ordered extent for the # 60K of data we wrote using direct IO, and once through the main direct # IO handler (inode.cbtrfs_direct_IO()) because the direct IO write # wrote less than 80K of data (60K). $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 80K 640K 80K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Now similar test as above but for very large write operations. This # triggers special cases for an inode's outstanding extents accounting, # as internally btrfs logically splits extents into 128Mb units. $XFS_IO_PROG -f -s \ -c "pwrite -S 0xaa -b 128M 258M 128M" \ -c "falloc 0 258M" \ $SCRATCH_MNT/bar | _filter_xfs_io $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 256M 3M 256M" $SCRATCH_MNT/bar \ | _filter_xfs_io # Now verify the file contents are correct and that they are the same # even after unmounting and mounting the fs again (or evicting the page # cache). # # For file foo, all bytes in the range [0, 640K[ must have a value of # 0x00, all bytes in the range [640K, 720K[ must have a value of 0xbb # and all bytes in the range [720K, 800K[ must have a value of 0xaa. # # For file bar, all bytes in the range [0, 3M[ must havea value of 0x00, # all bytes in the range [3M, 259M[ must have a value of 0xbb and all # bytes in the range [259M, 386M[ must have a value of 0xaa. # echo "File digests before remounting the file system:" md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch _scratch_remount echo "File digests after remounting the file system:" md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch status=0 exit Fixes: e1cbbfa5f5aa ("Btrfs: fix outstanding_extents accounting in DIO") Fixes: 3e05bde8c3c2 ("Btrfs: only adjust outstanding_extents when we do a short write") Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-27btrfs: qgroup: Fix a race in delayed_ref which leads to abort transQu Wenruo1-8/+4
Between btrfs_allocerved_file_extent() and btrfs_add_delayed_qgroup_reserve(), there is a window that delayed_refs are run and delayed ref head maybe freed before btrfs_add_delayed_qgroup_reserve(). This will cause btrfs_dad_delayed_qgroup_reserve() to return -ENOENT, and cause transaction to be aborted. This patch will record qgroup reserve space info into delayed_ref_head at btrfs_add_delayed_ref(), to eliminate the race window. Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-25Btrfs: fix regression running delayed references when using qgroupsFilipe Manana1-2/+2
In the kernel 4.2 merge window we had a big changes to the implementation of delayed references and qgroups which made the no_quota field of delayed references not used anymore. More specifically the no_quota field is not used anymore as of: commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.") Leaving the no_quota field actually prevents delayed references from getting merged, which in turn cause the following BUG_ON(), at fs/btrfs/extent-tree.c, to be hit when qgroups are enabled: static int run_delayed_tree_ref(...) { (...) BUG_ON(node->ref_mod != 1); (...) } This happens on a scenario like the following: 1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. 2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota. 3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref2 is incompatible due to Ref2->no_quota != Ref3->no_quota. 4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref3 is incompatible due to Ref3->no_quota != Ref4->no_quota. 5) We run delayed references, trigger merging of delayed references, through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs(). 6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and all other conditions are satisfied too. So Ref1 gets a ref_mod value of 2. 7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and all other conditions are satisfied too. So Ref2 gets a ref_mod value of 2. 8) Ref1 and Ref2 aren't merged, because they have different values for their no_quota field. 9) Delayed reference Ref1 is picked for running (select_delayed_ref() always prefers references with an action == BTRFS_ADD_DELAYED_REF). So run_delayed_tree_ref() is called for Ref1 which triggers the BUG_ON because Ref1->red_mod != 1 (equals 2). So fix this by removing the no_quota field, as it's not used anymore as of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism."). The use of no_quota was also buggy in at least two places: 1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting no_quota to 0 instead of 1 when the following condition was true: is_fstree(ref_root) || !fs_info->quota_enabled 2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to reset a node's no_quota when the condition "!is_fstree(root_objectid) || !root->fs_info->quota_enabled" was true but we did it only in an unused local stack variable, that is, we never reset the no_quota value in the node itself. This fixes the remainder of problems several people have been having when running delayed references, mostly while a balance is running in parallel, on a 4.2+ kernel. Very special thanks to Stéphane Lesimple for helping debugging this issue and testing this fix on his multi terabyte filesystem (which took more than one day to balance alone, plus fsck, etc). Also, this fixes deadlock issue when using the clone ioctl with qgroups enabled, as reported by Elias Probst in the mailing list. The deadlock happens because after calling btrfs_insert_empty_item we have our path holding a write lock on a leaf of the fs/subvol tree and then before releasing the path we called check_ref() which did backref walking, when qgroups are enabled, and tried to read lock the same leaf. The trace for this case is the following: INFO: task systemd-nspawn:6095 blocked for more than 120 seconds. (...) Call Trace: [<ffffffff86999201>] schedule+0x74/0x83 [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea [<ffffffff86137ed7>] ? wait_woken+0x74/0x74 [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810 [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127 [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667 [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6 [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0 [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65 [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88 [<ffffffff863e852e>] check_ref+0x64/0xc4 [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77 [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb (...) The problem goes away by eleminating check_ref(), which no longer is needed as its purpose was to get a value for the no_quota field of a delayed reference (this patch removes the no_quota field as mentioned earlier). Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Reported-by: Elias Probst <mail@eliasprobst.eu> Reported-by: Peter Becker <floyd.net@gmail.com> Reported-by: Malte Schröder <malte@tnxip.de> Reported-by: Derek Dongray <derek@valedon.co.uk> Reported-by: Erkki Seppala <flux-btrfs@inside.org> Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
2015-10-22Merge branch 'allocator-fixes' into for-linus-4.4Chris Mason1-0/+9
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-22Btrfs: fix prealloc under heavy fragmentation conditionsJosef Bacik1-0/+9
If we are heavily fragmented we will continually try to prealloc the largest extent size we can every time we call btrfs_reserve_extent. This can be very expensive when we are heavily fragmented, burning lots of CPU cycles and loops through the allocator. So instead notice when we get a smaller chunk from the allocator than what we specified and use this as the new maximum size we try to allocate. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-22btrfs: qgroup: Check if qgroup reserved space leakedQu Wenruo1-0/+1
Add check at btrfs_destroy_inode() time to detect qgroup reserved space leak. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-22btrfs: qgroup: Avoid calling btrfs_free_reserved_data_space in clear_bit_hookQu Wenruo1-2/+2
In clear_bit_hook, qgroup reserved data is already handled quite well, either released by finish_ordered_io or invalidatepage. So calling btrfs_qgroup_free_data() here is completely meaningless, and since btrfs_qgroup_free_data() will lock io_tree, so it can't be called with io_tree lock hold. This patch will add a new function btrfs_free_reserved_data_space_noquota() for clear_bit_hook() to cease the lockdep warning. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-22btrfs: Add handler for invalidate pageQu Wenruo1-0/+24
For btrfs_invalidatepage() and its variant evict_inode_truncate_page(), there will be pages don't reach disk. In that case, their reserved space won't be release nor freed by finish_ordered_io() nor delayed_ref handler. So we must free their qgroup reserved space, or we will leaking reserved space again. So this will patch will call btrfs_qgroup_free_data() for invalidatepage() and its variant evict_inode_truncate_page(). And due to the nature of new btrfs_qgroup_reserve/free_data() reserved space will only be reserved or freed once, so for pages which are already flushed to disk, their reserved space will be released and freed by delayed_ref handler. Double free won't be a problem. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-22btrfs: qgroup: Add handler for NOCOW and inlineQu Wenruo1-0/+15
For NOCOW and inline case, there will be no delayed_ref created for them, so we should free their reserved data space at proper time(finish_ordered_io for NOCOW and cow_file_inline for inline). Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-22btrfs: qgroup: Cleanup old inaccurate facilitiesQu Wenruo1-17/+17
Cleanup the old facilities which use old btrfs_qgroup_reserve() function call, replace them with the newer version, and remove the "__" prefix in them. Also, make btrfs_qgroup_reserve/free() functions private, as they are now only used inside qgroup codes. Now, the whole btrfs qgroup is swithed to use the new reserve facilities. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-22btrfs: extent-tree: Switch to new delalloc space reserve and releaseQu Wenruo1-15/+23
Use new __btrfs_delalloc_reserve_space() and __btrfs_delalloc_release_space() to reserve and release space for delalloc. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-22btrfs: delayed_ref: release and free qgroup reserved at proper timingQu Wenruo1-0/+10
Qgroup reserved space needs to be released from inode dirty map and get freed at different timing: 1) Release when the metadata is written into tree After corresponding metadata is written into tree, any newer write will be COWed(don't include NOCOW case yet). So we must release its range from inode dirty range map, or we will forget to reserve needed range, causing accounting exceeding the limit. 2) Free reserved bytes when delayed ref is run When delayed refs are run, qgroup accounting will follow soon and turn the reserved bytes into rfer/excl numbers. As run_delayed_refs and qgroup accounting are all done at commit_transaction() time, we are safe to free reserved space in run_delayed_ref time(). With these timing to release/free reserved space, we should be able to resolve the long existing qgroup reserve space leak problem. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-22Merge branch 'integration-4.4' of ↵Chris Mason1-14/+68
git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.4
2015-10-22Merge branch 'cleanups/for-4.4' of ↵Chris Mason1-3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4
2015-10-21btrfs: remove unnecessary list_delByongho Lee1-1/+0
We can safely iterate whole list items, without using list_del macro. So remove the list_del call. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21Btrfs: btrfs_submit_bio_hook: Use btrfs_wq_endio_type values instead of ↵Chandan Rajendra1-2/+2
integer constants btrfs_submit_bio_hook() uses integer constants instead of values from "enum btrfs_wq_endio_type". Fix this. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-16Btrfs: fix truncation of compressed and inlined extentsFilipe Manana1-14/+68
When truncating a file to a smaller size which consists of an inline extent that is compressed, we did not discard (or made unusable) the data between the new file size and the old file size, wasting metadata space and allowing for the truncated data to be leaked and the data corruption/loss mentioned below. We were also not correctly decrementing the number of bytes used by the inode, we were setting it to zero, giving a wrong report for callers of the stat(2) syscall. The fsck tool also reported an error about a mismatch between the nbytes of the file versus the real space used by the file. Now because we weren't discarding the truncated region of the file, it was possible for a caller of the clone ioctl to actually read the data that was truncated, allowing for a security breach without requiring root access to the system, using only standard filesystem operations. The scenario is the following: 1) User A creates a file which consists of an inline and compressed extent with a size of 2000 bytes - the file is not accessible to any other users (no read, write or execution permission for anyone else); 2) The user truncates the file to a size of 1000 bytes; 3) User A makes the file world readable; 4) User B creates a file consisting of an inline extent of 2000 bytes; 5) User B issues a clone operation from user A's file into its own file (using a length argument of 0, clone the whole range); 6) User B now gets to see the 1000 bytes that user A truncated from its file before it made its file world readbale. User B also lost the bytes in the range [1000, 2000[ bytes from its own file, but that might be ok if his/her intention was reading stale data from user A that was never supposed to be public. Note that this contrasts with the case where we truncate a file from 2000 bytes to 1000 bytes and then truncate it back from 1000 to 2000 bytes. In this case reading any byte from the range [1000, 2000[ will return a value of 0x00, instead of the original data. This problem exists since the clone ioctl was added and happens both with and without my recent data loss and file corruption fixes for the clone ioctl (patch "Btrfs: fix file corruption and data loss after cloning inline extents"). So fix this by truncating the compressed inline extents as we do for the non-compressed case, which involves decompressing, if the data isn't already in the page cache, compressing the truncated version of the extent, writing the compressed content into the inline extent and then truncate it. The following test case for fstests reproduces the problem. In order for the test to pass both this fix and my previous fix for the clone ioctl that forbids cloning a smaller inline extent into a larger one, which is titled "Btrfs: fix file corruption and data loss after cloning inline extents", are needed. Without that other fix the test fails in a different way that does not leak the truncated data, instead part of destination file gets replaced with zeroes (because the destination file has a larger inline extent than the source). seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount "-o compress" # Create our test files. File foo is going to be the source of a clone operation # and consists of a single inline extent with an uncompressed size of 512 bytes, # while file bar consists of a single inline extent with an uncompressed size of # 256 bytes. For our test's purpose, it's important that file bar has an inline # extent with a size smaller than foo's inline extent. $XFS_IO_PROG -f -c "pwrite -S 0xa1 0 128" \ -c "pwrite -S 0x2a 128 384" \ $SCRATCH_MNT/foo | _filter_xfs_io $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 256" $SCRATCH_MNT/bar | _filter_xfs_io # Now durably persist all metadata and data. We do this to make sure that we get # on disk an inline extent with a size of 512 bytes for file foo. sync # Now truncate our file foo to a smaller size. Because it consists of a # compressed and inline extent, btrfs did not shrink the inline extent to the # new size (if the extent was not compressed, btrfs would shrink it to 128 # bytes), it only updates the inode's i_size to 128 bytes. $XFS_IO_PROG -c "truncate 128" $SCRATCH_MNT/foo # Now clone foo's inline extent into bar. # This clone operation should fail with errno EOPNOTSUPP because the source # file consists only of an inline extent and the file's size is smaller than # the inline extent of the destination (128 bytes < 256 bytes). However the # clone ioctl was not prepared to deal with a file that has a size smaller # than the size of its inline extent (something that happens only for compressed # inline extents), resulting in copying the full inline extent from the source # file into the destination file. # # Note that btrfs' clone operation for inline extents consists of removing the # inline extent from the destination inode and copy the inline extent from the # source inode into the destination inode, meaning that if the destination # inode's inline extent is larger (N bytes) than the source inode's inline # extent (M bytes), some bytes (N - M bytes) will be lost from the destination # file. Btrfs could copy the source inline extent's data into the destination's # inline extent so that we would not lose any data, but that's currently not # done due to the complexity that would be needed to deal with such cases # (specially when one or both extents are compressed), returning EOPNOTSUPP, as # it's normally not a very common case to clone very small files (only case # where we get inline extents) and copying inline extents does not save any # space (unlike for normal, non-inlined extents). $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar # Now because the above clone operation used to succeed, and due to foo's inline # extent not being shinked by the truncate operation, our file bar got the whole # inline extent copied from foo, making us lose the last 128 bytes from bar # which got replaced by the bytes in range [128, 256[ from foo before foo was # truncated - in other words, data loss from bar and being able to read old and # stale data from foo that should not be possible to read anymore through normal # filesystem operations. Contrast with the case where we truncate a file from a # size N to a smaller size M, truncate it back to size N and then read the range # [M, N[, we should always get the value 0x00 for all the bytes in that range. # We expected the clone operation to fail with errno EOPNOTSUPP and therefore # not modify our file's bar data/metadata. So its content should be 256 bytes # long with all bytes having the value 0xbb. # # Without the btrfs bug fix, the clone operation succeeded and resulted in # leaking truncated data from foo, the bytes that belonged to its range # [128, 256[, and losing data from bar in that same range. So reading the # file gave us the following content: # # 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 # * # 0000200 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a # * # 0000400 echo "File bar's content after the clone operation:" od -t x1 $SCRATCH_MNT/bar # Also because the foo's inline extent was not shrunk by the truncate # operation, btrfs' fsck, which is run by the fstests framework everytime a # test completes, failed reporting the following error: # # root 5 inode 257 errors 400, nbytes wrong status=0 exit Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-13Merge branch 'fix/waitqueue-barriers' of ↵Chris Mason1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4
2015-10-10btrfs: comment the rest of implicit barriers before waitqueue_activeDavid Sterba1-0/+3
There are atomic operations that imply the barrier for waitqueue_active mixed in an if-condition. Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08btrfs: switch message printers to ratelimited variantsDavid Sterba1-5/+2
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-25Merge branch 'for-linus-4.3' of ↵Linus Torvalds1-22/+23
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is an assorted set I've been queuing up: Jeff Mahoney tracked down a tricky one where we ended up starting IO on the wrong mapping for special files in btrfs_evict_inode. A few people reported this one on the list. Filipe found (and provided a test for) a difficult bug in reading compressed extents, and Josef fixed up some quota record keeping with snapshot deletion. Chandan killed off an accounting bug during DIO that lead to WARN_ONs as we freed inodes" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: keep dropped roots in cache until transaction commit Btrfs: Direct I/O: Fix space accounting btrfs: skip waiting on ordered range for special files Btrfs: fix read corruption of compressed and shared extents Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock Btrfs: don't initialize a space info as full to prevent ENOSPC
2015-09-21Btrfs: Direct I/O: Fix space accountingchandan1-21/+21
The following call trace is seen when generic/095 test is executed, WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0() Modules linked in: CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014 ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0 0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38 ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0 Call Trace: [<ffffffff81984058>] dump_stack+0x45/0x57 [<ffffffff81050385>] warn_slowpath_common+0x85/0xc0 [<ffffffff81050465>] warn_slowpath_null+0x15/0x20 [<ffffffff81340294>] btrfs_destroy_inode+0x284/0x2a0 [<ffffffff8117ce07>] destroy_inode+0x37/0x60 [<ffffffff8117cf39>] evict+0x109/0x170 [<ffffffff8117cfd5>] dispose_list+0x35/0x50 [<ffffffff8117dd3a>] evict_inodes+0xaa/0x100 [<ffffffff81165667>] generic_shutdown_super+0x47/0xf0 [<ffffffff81165951>] kill_anon_super+0x11/0x20 [<ffffffff81302093>] btrfs_kill_super+0x13/0x110 [<ffffffff81165c99>] deactivate_locked_super+0x39/0x70 [<ffffffff811660cf>] deactivate_super+0x5f/0x70 [<ffffffff81180e1e>] cleanup_mnt+0x3e/0x90 [<ffffffff81180ebd>] __cleanup_mnt+0xd/0x10 [<ffffffff81069c06>] task_work_run+0x96/0xb0 [<ffffffff81003a3d>] do_notify_resume+0x3d/0x50 [<ffffffff8198cbc2>] int_signal+0x12/0x17 This means that the inode had non-zero "outstanding extents" during eviction. This occurs because, during direct I/O a task which successfully used up its reserved data space would set BTRFS_INODE_DIO_READY bit and does not clear the bit after finishing the DIO write. A future DIO write could actually fail and the unused reserve space won't be freed because of the previously set BTRFS_INODE_DIO_READY bit. Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the following issue, |-----------------------------------+-------------------------------------| | Task A | Task B | |-----------------------------------+-------------------------------------| | Start direct i/o write on inode X.| | | reserve space | | | Allocate ordered extent | | | release reserved space | | | Set BTRFS_INODE_DIO_READY bit. | | | | splice() | | | Transfer data from pipe buffer to | | | destination file. | | | - kmap(pipe buffer page) | | | - Start direct i/o write on | | | inode X. | | | - reserve space | | | - dio_refill_pages() | | | - sdio->blocks_available == 0 | | | - Since a kernel address is | | | being passed instead of a | | | user space address, | | | iov_iter_get_pages() returns | | | -EFAULT. | | | - Since BTRFS_INODE_DIO_READY is | | | set, we don't release reserved | | | space. | | | - Clear BTRFS_INODE_DIO_READY bit.| | -EIOCBQUEUED is returned. | | |-----------------------------------+-------------------------------------| Hence this commit introduces "struct btrfs_dio_data" to track the usage of reserved data space. The remaining unused "reserve space" can now be freed reliably. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-09-15btrfs: skip waiting on ordered range for special filesJeff Mahoney1-1/+2
In btrfs_evict_inode, we properly truncate the page cache for evicted inodes but then we call btrfs_wait_ordered_range for every inode as well. It's the right thing to do for regular files but results in incorrect behavior for device inodes for block devices. filemap_fdatawrite_range gets called with inode->i_mapping which gets resolved to the block device inode before getting passed to wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi. What happens next depends on whether there's an open file handle associated with the inode. If there is, we write to the block device, which is unexpected behavior. If there isn't, we through normally and inode->i_data is used. We can also end up racing against open/close which can result in crashes when i_mapping points to a block device inode that has been closed. Since there can't be any page cache associated with special file inodes, it's safe to skip the btrfs_wait_ordered_range call entirely and avoid the problem. Cc: <stable@vger.kernel.org> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911 Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com>
2015-09-11Merge branch 'for-linus-4.3' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs cleanups and fixes from Chris Mason: "These are small cleanups, and also some fixes for our async worker thread initialization. I was having some trouble testing these, but it ended up being a combination of changing around my test servers and a shiny new schedule while atomic from the new start/finish_plug in writeback_sb_inodes(). That one only hits on btrfs raid5/6 or MD raid10, and if I wasn't changing a bunch of things in my test setup at once it would have been really clear. Fix for writeback_sb_inodes() on the way as well" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called btrfs: async_thread: Fix workqueue 'max_active' value when initializing btrfs: Add raid56 support for updating num_tolerated_disk_barrier_failures in btrfs_balance btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk btrfs: Update out-of-date "skip parity stripe" comment
2015-09-06Merge branch 'for-linus-4.3' of ↵Linus Torvalds1-1/+34
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This has Jeff Mahoney's long standing trim patch that fixes corners where trims were missing. Omar has some raid5/6 fixes, especially for using scrub and device replace when devices are missing. Zhao Lie continues cleaning and fixing things, this series fixes some really hard to hit corners in xfstests. I had to pull it last merge window due to some deadlocks, but those are now resolved. I added support for Tejun's new blkio controllers. It seems to work well for single devices, we'll expand to multi-device as well" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (47 commits) btrfs: fix compile when block cgroups are not enabled Btrfs: fix file read corruption after extent cloning and fsync Btrfs: check if previous transaction aborted to avoid fs corruption btrfs: use __GFP_NOFAIL in alloc_btrfs_bio btrfs: Prevent from early transaction abort btrfs: Remove unused arguments in tree-log.c btrfs: Remove useless condition in start_log_trans() Btrfs: add support for blkio controllers Btrfs: remove unused mutex from struct 'btrfs_fs_info' Btrfs: fix parity scrub of RAID 5/6 with missing device Btrfs: fix device replace of a missing RAID 5/6 device Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation Btrfs: count devices correctly in readahead during RAID 5/6 replace Btrfs: remove misleading handling of missing device scrub btrfs: fix clone / extent-same deadlocks Btrfs: fix defrag to merge tail file extent Btrfs: fix warning in backref walking btrfs: Add WARN_ON() for double lock in btrfs_tree_lock() btrfs: Remove root argument in extent_data_ref_count() btrfs: Fix wrong comment of btrfs_alloc_tree_block() ...
2015-09-02Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-blockLinus Torvalds1-26/+27
Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...
2015-08-31Btrfs: cleanup: remove unnecessary check before btrfs_free_path is calledTsutomu Itoh1-2/+1
We need not check path before btrfs_free_path() is called because path is checked in btrfs_free_path(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-13block: remove bio_get_nr_vecs()Kent Overstreet1-2/+1
We can always fill up the bio now, no need to estimate the possible size based on queue parameters. Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [hch: rebased and wrote a changelog] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-09Btrfs: add support for blkio controllersChris Mason1-1/+5
This attaches accounting information to bios as we submit them so the new blkio controllers can throttle on btrfs filesystems. Not much is required, we're just associating bios with blkcgs during clone, calling wbc_init_bio()/wbc_account_io() during writepages submission, and attaching the bios to the current context during direct IO. Finally if we are splitting bios during btrfs_map_bio, this attaches accounting information to the split. The end result is able to throttle nicely on single disk filesystems. A little more work is required for multi-device filesystems. Signed-off-by: Chris Mason <clm@fb.com>