summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2019-06-25cifs: fix GlobalMid_Lock bug in cifs_reconnectRonnie Sahlberg1-0/+2
commit 61cabc7b0a5cf0d3c532cfa96594c801743fe7f6 upstream. We can not hold the GlobalMid_Lock spinlock during the dfs processing in cifs_reconnect since it invokes things that may sleep and thus trigger : BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:23 Thus we need to drop the spinlock during this code block. RHBZ: 1716743 Cc: stable@vger.kernel.org Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-25cifs: add spinlock for the openFileList to cifsInodeInfoRonnie Sahlberg3-2/+12
commit 487317c99477d00f22370625d53be3239febabbe upstream. We can not depend on the tcon->open_file_lock here since in multiuser mode we may have the same file/inode open via multiple different tcons. The current code is race prone and will crash if one user deletes a file at the same time a different user opens/create the file. To avoid this we need to have a spinlock attached to the inode and not the tcon. RHBZ: 1580165 CC: Stable <stable@vger.kernel.org> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-22ocfs2: fix error path kobject memory leakTobin C. Harding1-0/+1
[ Upstream commit b9fba67b3806e21b98bd5a98dc3921a8e9b42d61 ] If a call to kobject_init_and_add() fails we should call kobject_put() otherwise we leak memory. Add call to kobject_put() in the error path of call to kobject_init_and_add(). Please note, this has the side effect that the release method is called if kobject_init_and_add() fails. Link: http://lkml.kernel.org/r/20190513033458.2824-1-tobin@kernel.org Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-22dfs_cache: fix a wrong use of kfree in flush_cache_ent()Gen Zhang1-2/+2
[ Upstream commit 50fbc13dc12666f3604dc2555a47fc8c4e29162b ] In flush_cache_ent(), 'ce->ce_path' is allocated by kstrdup_const(). It should be freed by kfree_const(), rather than kfree(). Signed-off-by: Gen Zhang <blackgod016574@gmail.com> Reviewed-by: Paulo Alcantara <palcantara@suse.de> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-22configfs: Fix use-after-free when accessing sd->s_dentrySahitya Tummala1-8/+6
[ Upstream commit f6122ed2a4f9c9c1c073ddf6308d1b2ac10e0781 ] In the vfs_statx() context, during path lookup, the dentry gets added to sd->s_dentry via configfs_attach_attr(). In the end, vfs_statx() kills the dentry by calling path_put(), which invokes configfs_d_iput(). Ideally, this dentry must be removed from sd->s_dentry but it doesn't if the sd->s_count >= 3. As a result, sd->s_dentry is holding reference to a stale dentry pointer whose memory is already freed up. This results in use-after-free issue, when this stale sd->s_dentry is accessed later in configfs_readdir() path. This issue can be easily reproduced, by running the LTP test case - sh fs_racer_file_list.sh /config (https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/fs/racer/fs_racer_file_list.sh) Fixes: 76ae281f6307 ('configfs: fix race between dentry put and lookup') Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-22io_uring: Fix __io_uring_register() false successPavel Begunkov1-1/+1
[ Upstream commit a278682dad37fd2f8d2f30d8e84e376a856ab472 ] If io_copy_iov() fails, it will break the loop and report success, albeit partially completed operation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-19f2fs: fix to avoid accessing xattr across the boundaryRandall Huang2-9/+29
[ Upstream commit 2777e654371dd4207a3a7f4fb5fa39550053a080 ] When we traverse xattr entries via __find_xattr(), if the raw filesystem content is faked or any hardware failure occurs, out-of-bound error can be detected by KASAN. Fix the issue by introducing boundary check. [ 38.402878] c7 1827 BUG: KASAN: slab-out-of-bounds in f2fs_getxattr+0x518/0x68c [ 38.402891] c7 1827 Read of size 4 at addr ffffffc0b6fb35dc by task [ 38.402935] c7 1827 Call trace: [ 38.402952] c7 1827 [<ffffff900809003c>] dump_backtrace+0x0/0x6bc [ 38.402966] c7 1827 [<ffffff9008090030>] show_stack+0x20/0x2c [ 38.402981] c7 1827 [<ffffff900871ab10>] dump_stack+0xfc/0x140 [ 38.402995] c7 1827 [<ffffff9008325c40>] print_address_description+0x80/0x2d8 [ 38.403009] c7 1827 [<ffffff900832629c>] kasan_report_error+0x198/0x1fc [ 38.403022] c7 1827 [<ffffff9008326104>] kasan_report_error+0x0/0x1fc [ 38.403037] c7 1827 [<ffffff9008325000>] __asan_load4+0x1b0/0x1b8 [ 38.403051] c7 1827 [<ffffff90085fcc44>] f2fs_getxattr+0x518/0x68c [ 38.403066] c7 1827 [<ffffff90085fc508>] f2fs_xattr_generic_get+0xb0/0xd0 [ 38.403080] c7 1827 [<ffffff9008395708>] __vfs_getxattr+0x1f4/0x1fc [ 38.403096] c7 1827 [<ffffff9008621bd0>] inode_doinit_with_dentry+0x360/0x938 [ 38.403109] c7 1827 [<ffffff900862d6cc>] selinux_d_instantiate+0x2c/0x38 [ 38.403123] c7 1827 [<ffffff900861b018>] security_d_instantiate+0x68/0x98 [ 38.403136] c7 1827 [<ffffff9008377db8>] d_splice_alias+0x58/0x348 [ 38.403149] c7 1827 [<ffffff900858d16c>] f2fs_lookup+0x608/0x774 [ 38.403163] c7 1827 [<ffffff900835eacc>] lookup_slow+0x1e0/0x2cc [ 38.403177] c7 1827 [<ffffff9008367fe0>] walk_component+0x160/0x520 [ 38.403190] c7 1827 [<ffffff9008369ef4>] path_lookupat+0x110/0x2b4 [ 38.403203] c7 1827 [<ffffff900835dd38>] filename_lookup+0x1d8/0x3a8 [ 38.403216] c7 1827 [<ffffff900835eeb0>] user_path_at_empty+0x54/0x68 [ 38.403229] c7 1827 [<ffffff9008395f44>] SyS_getxattr+0xb4/0x18c [ 38.403241] c7 1827 [<ffffff9008084200>] el0_svc_naked+0x34/0x38 Signed-off-by: Randall Huang <huangrandall@google.com> [Jaegeuk Kim: Fix wrong ending boundary] Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-19fs/ocfs2: fix race in ocfs2_dentry_attach_lock()Wengang Wang1-0/+12
commit be99ca2716972a712cde46092c54dee5e6192bf8 upstream. ocfs2_dentry_attach_lock() can be executed in parallel threads against the same dentry. Make that race safe. The race is like this: thread A thread B (A1) enter ocfs2_dentry_attach_lock, seeing dentry->d_fsdata is NULL, and no alias found by ocfs2_find_local_alias, so kmalloc a new ocfs2_dentry_lock structure to local variable "dl", dl1 ..... (B1) enter ocfs2_dentry_attach_lock, seeing dentry->d_fsdata is NULL, and no alias found by ocfs2_find_local_alias so kmalloc a new ocfs2_dentry_lock structure to local variable "dl", dl2. ...... (A2) set dentry->d_fsdata with dl1, call ocfs2_dentry_lock() and increase dl1->dl_lockres.l_ro_holders to 1 on success. ...... (B2) set dentry->d_fsdata with dl2 call ocfs2_dentry_lock() and increase dl2->dl_lockres.l_ro_holders to 1 on success. ...... (A3) call ocfs2_dentry_unlock() and decrease dl2->dl_lockres.l_ro_holders to 0 on success. .... (B3) call ocfs2_dentry_unlock(), decreasing dl2->dl_lockres.l_ro_holders, but see it's zero now, panic Link: http://lkml.kernel.org/r/20190529174636.22364-1-wen.gang.wang@oracle.com Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reported-by: Daniel Sobe <daniel.sobe@nxp.com> Tested-by: Daniel Sobe <daniel.sobe@nxp.com> Reviewed-by: Changwei Ge <gechangwei@live.cn> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19io_uring: fix memory leak of UNIX domain socket inodeEric Biggers1-1/+3
commit 355e8d26f719c207aa2e00e6f3cfab3acf21769b upstream. Opening and closing an io_uring instance leaks a UNIX domain socket inode. This is because the ->file of the io_uring instance's internal UNIX domain socket is set to point to the io_uring file, but then sock_release() sees the non-NULL ->file and assumes the inode reference is held by the file so doesn't call iput(). That's not the case here, since the reference is still meant to be held by the socket; the actual inode of the io_uring file is different. Fix this leak by NULL-ing out ->file before releasing the socket. Reported-by: syzbot+111cb28d9f583693aefa@syzkaller.appspotmail.com Fixes: 2b188cc1bb85 ("Add io_uring IO interface") Cc: <stable@vger.kernel.org> # v5.1+ Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-15io_uring: fix failure to verify SQ_AFF cpuJens Axboe1-2/+3
commit 44a9bd18a0f06bba19d155aeaa11e2edce898293 upstream. The test case we have is rightfully failing with the current kernel: io_uring_setup(1, 0x7ffe2cafebe0), flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, sq_thread_cpu: 4 expected -1, got 3 This is in a vm, and CPU3 is the last valid one, hence asking for 4 should fail the setup with -EINVAL, not succeed. The problem is that we're using array_index_nospec() with nr_cpu_ids as the index, hence we wrap and end up using CPU0 instead of CPU4. This makes the setup succeed where it should be failing. We don't need to use array_index_nospec() as we're not indexing any array with this. Instead just compare with nr_cpu_ids directly. This is fine as we're checking with cpu_online() afterwards. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-15ovl: support stacked SEEK_HOLE/SEEK_DATAAmir Goldstein1-4/+40
commit 9e46b840c7053b5f7a245e98cd239b60d189a96c upstream. Overlay file f_pos is the master copy that is preserved through copy up and modified on read/write, but only real fs knows how to SEEK_HOLE/SEEK_DATA and real fs may impose limitations that are more strict than ->s_maxbytes for specific files, so we use the real file to perform seeks. We do not call real fs for SEEK_CUR:0 query and for SEEK_SET:0 requests. Fixes: d1d04ef8572b ("ovl: stack file ops") Reported-by: Eddie Horng <eddiehorng.tw@gmail.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-15ovl: check the capability before cred overriddenJiufei Xue1-18/+61
commit 98487de318a6f33312471ae1e2afa16fbf8361fe upstream. We found that it return success when we set IMMUTABLE_FL flag to a file in docker even though the docker didn't have the capability CAP_LINUX_IMMUTABLE. The commit d1d04ef8572b ("ovl: stack file ops") and dab5ca8fd9dd ("ovl: add lsattr/chattr support") implemented chattr operations on a regular overlay file. ovl_real_ioctl() overridden the current process's subjective credentials with ofs->creator_cred which have the capability CAP_LINUX_IMMUTABLE so that it will return success in vfs_ioctl()->cap_capable(). Fix this by checking the capability before cred overridden. And here we only care about APPEND_FL and IMMUTABLE_FL, so get these information from inode. [SzM: move check and call to underlying fs inside inode locked region to prevent two such calls from racing with each other] Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-15f2fs: fix potential recursive call when enabling data_flushChao Yu2-5/+4
[ Upstream commit 186857c5a14aee85cace2ae7a36c6e43b9d3c7a5 ] As Hagbard Celine reported: Hi, this is a long standing bug that I've hit before on older kernels, but I was not able to get the syslog saved because of the nature of the bug. This time I had booted form a pen-drive, and was able to save the log to it's efi-partition. What i did to trigger it was to create a partition and format it f2fs, then mount it with options: "rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,data_flush,extent_cache,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict". Then I unpacked a big .tar.xz to the partition (I used a gentoo-stage3-tarball as I was in process of installing Gentoo). Same options just without data_flush gives no problems. Mar 20 20:54:01 usbgentoo kernel: FAT-fs (nvme0n1p4): Volume was not properly unmounted. Some data may be corrupt. Please run fsck. Mar 20 21:05:23 usbgentoo kernel: kworker/dying (1588) used greatest stack depth: 12064 bytes left Mar 20 21:06:40 usbgentoo kernel: BUG: stack guard page was hit at 00000000a4b0733c (stack is 0000000056016422..0000000096e7463f) Mar 20 21:06:40 usbgentoo kernel: kernel stack overflow ...... Mar 20 21:06:40 usbgentoo kernel: Call Trace: Mar 20 21:06:40 usbgentoo kernel: read_node_page+0x71/0xf0 Mar 20 21:06:40 usbgentoo kernel: ? xas_load+0x8/0x50 Mar 20 21:06:40 usbgentoo kernel: __get_node_page+0x73/0x2a0 Mar 20 21:06:40 usbgentoo kernel: f2fs_get_dnode_of_data+0x34e/0x580 Mar 20 21:06:40 usbgentoo kernel: f2fs_write_inline_data+0x5e/0x2a0 Mar 20 21:06:40 usbgentoo kernel: __write_data_page+0x421/0x690 Mar 20 21:06:40 usbgentoo kernel: f2fs_write_cache_pages+0x1cf/0x460 Mar 20 21:06:40 usbgentoo kernel: f2fs_write_data_pages+0x2b3/0x2e0 Mar 20 21:06:40 usbgentoo kernel: ? f2fs_inode_chksum_verify+0x1d/0xc0 Mar 20 21:06:40 usbgentoo kernel: ? read_node_page+0x71/0xf0 Mar 20 21:06:40 usbgentoo kernel: do_writepages+0x3c/0xd0 Mar 20 21:06:40 usbgentoo kernel: __filemap_fdatawrite_range+0x7c/0xb0 Mar 20 21:06:40 usbgentoo kernel: f2fs_sync_dirty_inodes+0xf2/0x200 Mar 20 21:06:40 usbgentoo kernel: f2fs_balance_fs_bg+0x2a3/0x2c0 Mar 20 21:06:40 usbgentoo kernel: ? f2fs_inode_dirtied+0x21/0xc0 Mar 20 21:06:40 usbgentoo kernel: f2fs_balance_fs+0xd6/0x2b0 Mar 20 21:06:40 usbgentoo kernel: __write_data_page+0x4fb/0x690 ...... Mar 20 21:06:40 usbgentoo kernel: __writeback_single_inode+0x2a1/0x340 Mar 20 21:06:40 usbgentoo kernel: ? soft_cursor+0x1b4/0x220 Mar 20 21:06:40 usbgentoo kernel: writeback_sb_inodes+0x1d5/0x3e0 Mar 20 21:06:40 usbgentoo kernel: __writeback_inodes_wb+0x58/0xa0 Mar 20 21:06:40 usbgentoo kernel: wb_writeback+0x250/0x2e0 Mar 20 21:06:40 usbgentoo kernel: ? 0xffffffff8c000000 Mar 20 21:06:40 usbgentoo kernel: ? cpumask_next+0x16/0x20 Mar 20 21:06:40 usbgentoo kernel: wb_workfn+0x2f6/0x3b0 Mar 20 21:06:40 usbgentoo kernel: ? __switch_to_asm+0x40/0x70 Mar 20 21:06:40 usbgentoo kernel: process_one_work+0x1f5/0x3f0 Mar 20 21:06:40 usbgentoo kernel: worker_thread+0x28/0x3c0 Mar 20 21:06:40 usbgentoo kernel: ? rescuer_thread+0x330/0x330 Mar 20 21:06:40 usbgentoo kernel: kthread+0x10e/0x130 Mar 20 21:06:40 usbgentoo kernel: ? kthread_create_on_node+0x60/0x60 Mar 20 21:06:40 usbgentoo kernel: ret_from_fork+0x35/0x40 The root cause is that we run into an infinite recursive calling in between f2fs_balance_fs_bg and writepage() as described below: - f2fs_write_data_pages --- A - __write_data_page - f2fs_balance_fs - f2fs_balance_fs_bg --- B - f2fs_sync_dirty_inodes - filemap_fdatawrite - f2fs_write_data_pages --- A ... - f2fs_balance_fs_bg --- B ... In order to fix this issue, let's detect such condition in __write_data_page() and just skip calling f2fs_balance_fs() recursively. Reported-by: Hagbard Celine <hagbardcelin@gmail.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15nfsd: avoid uninitialized variable warningArnd Bergmann1-0/+4
[ Upstream commit 0ab88ca4bcf18ba21058d8f19220f60afe0d34d8 ] clang warns that 'contextlen' may be accessed without an initialization: fs/nfsd/nfs4xdr.c:2911:9: error: variable 'contextlen' is uninitialized when used here [-Werror,-Wuninitialized] contextlen); ^~~~~~~~~~ fs/nfsd/nfs4xdr.c:2424:16: note: initialize the variable 'contextlen' to silence this warning int contextlen; ^ = 0 Presumably this cannot happen, as FATTR4_WORD2_SECURITY_LABEL is set if CONFIG_NFSD_V4_SECURITY_LABEL is enabled. Adding another #ifdef like the other two in this function avoids the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15nfsd: allow fh_want_write to be called twiceJ. Bruce Fields1-1/+4
[ Upstream commit 0b8f62625dc309651d0efcb6a6247c933acd8b45 ] A fuzzer recently triggered lockdep warnings about potential sb_writers deadlocks caused by fh_want_write(). Looks like we aren't careful to pair each fh_want_write() with an fh_drop_write(). It's not normally a problem since fh_put() will call fh_drop_write() for us. And was OK for NFSv3 where we'd do one operation that might call fh_want_write(), and then put the filehandle. But an NFSv4 protocol fuzzer can do weird things like call unlink twice in a compound, and then we get into trouble. I'm a little worried about this approach of just leaving everything to fh_put(). But I think there are probably a lot of fh_want_write()/fh_drop_write() imbalances so for now I think we need it to be more forgiving. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15fuse: retrieve: cap requested size to negotiated max_writeKirill Smelkov1-1/+1
[ Upstream commit 7640682e67b33cab8628729afec8ca92b851394f ] FUSE filesystem server and kernel client negotiate during initialization phase, what should be the maximum write size the client will ever issue. Correspondingly the filesystem server then queues sys_read calls to read requests with buffer capacity large enough to carry request header + that max_write bytes. A filesystem server is free to set its max_write in anywhere in the range between [1*page, fc->max_pages*page]. In particular go-fuse[2] sets max_write by default as 64K, wheres default fc->max_pages corresponds to 128K. Libfuse also allows users to configure max_write, but by default presets it to possible maximum. If max_write is < fc->max_pages*page, and in NOTIFY_RETRIEVE handler we allow to retrieve more than max_write bytes, corresponding prepared NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the filesystem server, in full correspondence with server/client contract, will be only queuing sys_read with ~max_write buffer capacity, and fuse_dev_do_read throws away requests that cannot fit into server request buffer. In turn the filesystem server could get stuck waiting indefinitely for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK which is understood by clients as that NOTIFY_REPLY was queued and will be sent back. Cap requested size to negotiate max_write to avoid the problem. This aligns with the way NOTIFY_RETRIEVE handler works, which already unconditionally caps requested retrieve size to fuse_conn->max_pages. This way it should not hurt NOTIFY_RETRIEVE semantic if we return less data than was originally requested. Please see [1] for context where the problem of stuck filesystem was hit for real, how the situation was traced and for more involving patch that did not make it into the tree. [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2 [2] https://github.com/hanwen/go-fuse Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Cc: Han-Wen Nienhuys <hanwen@google.com> Cc: Jakob Unterwurzacher <jakobunt@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15ovl: do not generate duplicate fsnotify events for "fake" pathAmir Goldstein1-3/+4
[ Upstream commit d989903058a83e8536cc7aadf9256a47d5c173fe ] Overlayfs "fake" path is used for stacked file operations on underlying files. Operations on files with "fake" path must not generate fsnotify events with path data, because those events have already been generated at overlayfs layer and because the reported event->fd for fanotify marks on underlying inode/filesystem will have the wrong path (the overlayfs path). Link: https://lore.kernel.org/linux-fsdevel/20190423065024.12695-1-jencce.kernel@gmail.com/ Reported-by: Murphy Zhou <jencce.kernel@gmail.com> Fixes: d1d04ef8572b ("ovl: stack file ops") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15configfs: fix possible use-after-free in configfs_register_groupYueHaibing1-5/+12
[ Upstream commit 35399f87e271f7cf3048eab00a421a6519ac8441 ] In configfs_register_group(), if create_default_group() failed, we forget to unlink the group. It will left a invalid item in the parent list, which may trigger the use-after-free issue seen below: BUG: KASAN: use-after-free in __list_add_valid+0xd4/0xe0 lib/list_debug.c:26 Read of size 8 at addr ffff8881ef61ae20 by task syz-executor.0/5996 CPU: 1 PID: 5996 Comm: syz-executor.0 Tainted: G C 5.0.0+ #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0xa9/0x10e lib/dump_stack.c:113 print_address_description+0x65/0x270 mm/kasan/report.c:187 kasan_report+0x149/0x18d mm/kasan/report.c:317 __list_add_valid+0xd4/0xe0 lib/list_debug.c:26 __list_add include/linux/list.h:60 [inline] list_add_tail include/linux/list.h:93 [inline] link_obj+0xb0/0x190 fs/configfs/dir.c:759 link_group+0x1c/0x130 fs/configfs/dir.c:784 configfs_register_group+0x56/0x1e0 fs/configfs/dir.c:1751 configfs_register_default_group+0x72/0xc0 fs/configfs/dir.c:1834 ? 0xffffffffc1be0000 iio_sw_trigger_init+0x23/0x1000 [industrialio_sw_trigger] do_one_initcall+0xbc/0x47d init/main.c:887 do_init_module+0x1b5/0x547 kernel/module.c:3456 load_module+0x6405/0x8c10 kernel/module.c:3804 __do_sys_finit_module+0x162/0x190 kernel/module.c:3898 do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x462e99 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f494ecbcc58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99 RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003 RBP: 00007f494ecbcc70 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f494ecbd6bc R13: 00000000004bcefa R14: 00000000006f6fb0 R15: 0000000000000004 Allocated by task 5987: set_track mm/kasan/common.c:87 [inline] __kasan_kmalloc.constprop.3+0xa0/0xd0 mm/kasan/common.c:497 kmalloc include/linux/slab.h:545 [inline] kzalloc include/linux/slab.h:740 [inline] configfs_register_default_group+0x4c/0xc0 fs/configfs/dir.c:1829 0xffffffffc1bd0023 do_one_initcall+0xbc/0x47d init/main.c:887 do_init_module+0x1b5/0x547 kernel/module.c:3456 load_module+0x6405/0x8c10 kernel/module.c:3804 __do_sys_finit_module+0x162/0x190 kernel/module.c:3898 do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 5987: set_track mm/kasan/common.c:87 [inline] __kasan_slab_free+0x130/0x180 mm/kasan/common.c:459 slab_free_hook mm/slub.c:1429 [inline] slab_free_freelist_hook mm/slub.c:1456 [inline] slab_free mm/slub.c:3003 [inline] kfree+0xe1/0x270 mm/slub.c:3955 configfs_register_default_group+0x9a/0xc0 fs/configfs/dir.c:1836 0xffffffffc1bd0023 do_one_initcall+0xbc/0x47d init/main.c:887 do_init_module+0x1b5/0x547 kernel/module.c:3456 load_module+0x6405/0x8c10 kernel/module.c:3804 __do_sys_finit_module+0x162/0x190 kernel/module.c:3898 do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff8881ef61ae00 which belongs to the cache kmalloc-192 of size 192 The buggy address is located 32 bytes inside of 192-byte region [ffff8881ef61ae00, ffff8881ef61aec0) The buggy address belongs to the page: page:ffffea0007bd8680 count:1 mapcount:0 mapping:ffff8881f6c03000 index:0xffff8881ef61a700 flags: 0x2fffc0000000200(slab) raw: 02fffc0000000200 ffffea0007ca4740 0000000500000005 ffff8881f6c03000 raw: ffff8881ef61a700 000000008010000c 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8881ef61ad00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff8881ef61ad80: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc >ffff8881ef61ae00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8881ef61ae80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff8881ef61af00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: 5cf6a51e6062 ("configfs: allow dynamic group creation") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15f2fs: fix to do checksum even if inode page is uptodateChao Yu2-5/+6
[ Upstream commit b42b179bda9ff11075a6fc2bac4d9e400513679a ] As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203221 - Overview When mounting the attached crafted image and running program, this error is reported. The image is intentionally fuzzed from a normal f2fs image for testing and I enabled option CONFIG_F2FS_CHECK_FS on. - Reproduces cc poc_07.c mkdir test mount -t f2fs tmp.img test cp a.out test cd test sudo ./a.out - Messages kernel BUG at fs/f2fs/node.c:1279! RIP: 0010:read_node_page+0xcf/0xf0 Call Trace: __get_node_page+0x6b/0x2f0 f2fs_iget+0x8f/0xdf0 f2fs_lookup+0x136/0x320 __lookup_slow+0x92/0x140 lookup_slow+0x30/0x50 walk_component+0x1c1/0x350 path_lookupat+0x62/0x200 filename_lookup+0xb3/0x1a0 do_fchmodat+0x3e/0xa0 __x64_sys_chmod+0x12/0x20 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 On below paths, we can have opportunity to readahead inode page - gc_node_segment -> f2fs_ra_node_page - gc_data_segment -> f2fs_ra_node_page - f2fs_fill_dentries -> f2fs_ra_node_page Unlike synchronized read, on readahead path, we can set page uptodate before verifying page's checksum, then read_node_page() will trigger kernel panic once it encounters a uptodated page w/ incorrect checksum. So considering readahead scenario, we have to do checksum each time when loading inode page even if it is uptodated. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15f2fs: fix to retrieve inline xattr spaceChao Yu1-0/+17
[ Upstream commit 45a746881576977f85504c21a75547f10c5c0a8e ] With below mkfs and mount option, generic/339 of fstest will report that scratch image becomes corrupted. MKFS_OPTIONS -- -O extra_attr -O project_quota -O inode_checksum -O flexible_inline_xattr -O inode_crtime -f /dev/zram1 MOUNT_OPTIONS -- -o acl,user_xattr -o discard,noinline_xattr /dev/zram1 /mnt/scratch_f2fs [ASSERT] (f2fs_check_dirent_position:1315) --> Wrong position of dirent pino:1970, name: (...) level:8, dir_level:0, pgofs:951, correct range:[900, 901] In old kernel, inline data and directory always reserved 200 bytes in inode layout, even if inline_xattr is disabled, then new kernel tries to retrieve that space for non-inline xattr inode, but for inline dentry, its layout size should be fixed, so we just keep that reserved space. But the problem here is that, after inline dentry conversion, inline dentry layout no longer exists, if we still reserve inline xattr space, after dents updates, there will be a hole in inline xattr space, which can break hierarchy hash directory structure. This patch fixes this issue by retrieving inline xattr space after inline dentry conversion. Fixes: 6afc662e68b5 ("f2fs: support flexible inline xattr size") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15f2fs: fix to avoid deadloop in foreground GCChao Yu1-0/+1
[ Upstream commit 793ab1c8a792f8bccd7ae4c5be02bd275410b3af ] As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203211 - Overview When mounting the attached crafted image and making a new file, I got this error and the error messages keep repeating. The image is intentionally fuzzed from a normal f2fs image for testing and I run with option CONFIG_F2FS_CHECK_FS on. - Reproduces mkdir test mount -t f2fs tmp.img test cd test touch t - Messages [ 58.820451] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.821485] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.822530] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.823571] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.824616] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.825640] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.826663] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.827698] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.828719] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.829759] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.830783] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.831828] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.832869] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.833888] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.834945] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.835996] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.837028] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.838051] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.839072] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.840100] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.841147] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.842186] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.843214] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.844267] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.845282] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.846305] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT [ 58.847341] F2FS-fs (sdb): Inconsistent segment (1) type [1, 0] in SSA and SIT ... (repeating) During GC, if segment type stored in SSA and SIT is inconsistent, we just skip migrating current segment directly, since we need to know the exact type to decide the migration function we use. So in foreground GC, we will easily run into a infinite loop as we may select the same victim segment which has inconsistent type due to greedy policy. In order to end up this, we choose to shutdown filesystem. For backgrond GC, we need to do that as well, so that we can avoid latter potential infinite looped foreground GC. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15f2fs: fix to do sanity check on valid block count of segmentChao Yu1-2/+1
[ Upstream commit e95bcdb2fefa129f37bd9035af1d234ca92ee4ef ] As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203233 - Overview When mounting the attached crafted image and running program, following errors are reported. Additionally, it hangs on sync after running program. The image is intentionally fuzzed from a normal f2fs image for testing. Compile options for F2FS are as follows. CONFIG_F2FS_FS=y CONFIG_F2FS_STAT_FS=y CONFIG_F2FS_FS_XATTR=y CONFIG_F2FS_FS_POSIX_ACL=y CONFIG_F2FS_CHECK_FS=y - Reproduces cc poc_13.c mkdir test mount -t f2fs tmp.img test cp a.out test cd test sudo ./a.out sync - Kernel messages F2FS-fs (sdb): Bitmap was wrongly set, blk:4608 kernel BUG at fs/f2fs/segment.c:2102! RIP: 0010:update_sit_entry+0x394/0x410 Call Trace: f2fs_allocate_data_block+0x16f/0x660 do_write_page+0x62/0x170 f2fs_do_write_node_page+0x33/0xa0 __write_node_page+0x270/0x4e0 f2fs_sync_node_pages+0x5df/0x670 f2fs_write_checkpoint+0x372/0x1400 f2fs_sync_fs+0xa3/0x130 f2fs_do_sync_file+0x1a6/0x810 do_fsync+0x33/0x60 __x64_sys_fsync+0xb/0x10 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 sit.vblocks and sum valid block count in sit.valid_map may be inconsistent, segment w/ zero vblocks will be treated as free segment, while allocating in free segment, we may allocate a free block, if its bitmap is valid previously, it can cause kernel crash due to bitmap verification failure. Anyway, to avoid further serious metadata inconsistence and corruption, it is necessary and worth to detect SIT inconsistence. So let's enable check_block_count() to verify vblocks and valid_map all the time rather than do it only CONFIG_F2FS_CHECK_FS is enabled. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15f2fs: fix to avoid panic in dec_valid_node_count()Chao Yu1-3/+11
[ Upstream commit ea6d7e72fea49402aa445345aade7a26b81732e3 ] As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203213 - Overview When mounting the attached crafted image and running program, I got this error. Additionally, it hangs on sync after running the this script. The image is intentionally fuzzed from a normal f2fs image for testing and I enabled option CONFIG_F2FS_CHECK_FS on. - Reproduces mkdir test mount -t f2fs tmp.img test cp a.out test cd test sudo ./a.out sync kernel BUG at fs/f2fs/f2fs.h:2012! RIP: 0010:truncate_node+0x2c9/0x2e0 Call Trace: f2fs_truncate_xattr_node+0xa1/0x130 f2fs_remove_inode_page+0x82/0x2d0 f2fs_evict_inode+0x2a3/0x3a0 evict+0xba/0x180 __dentry_kill+0xbe/0x160 dentry_kill+0x46/0x180 dput+0xbb/0x100 do_renameat2+0x3c9/0x550 __x64_sys_rename+0x17/0x20 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The reason is dec_valid_node_count() will trigger kernel panic due to inconsistent count in between inode.i_blocks and actual block. To avoid panic, let's just print debug message and set SBI_NEED_FSCK to give a hint to fsck for latter repairing. Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: fix build warning and add unlikely] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15f2fs: fix to use inline space only if inline_xattr is enableChao Yu1-1/+3
[ Upstream commit 622927f3b8809206f6da54a6a7ed4df1a7770fce ] With below mkfs and mount option: MKFS_OPTIONS -- -O extra_attr -O project_quota -O inode_checksum -O flexible_inline_xattr -O inode_crtime -f MOUNT_OPTIONS -- -o noinline_xattr We may miss xattr data with below testcase: - mkdir dir - setfattr -n "user.name" -v 0 dir - for ((i = 0; i < 190; i++)) do touch dir/$i; done - umount - mount - getfattr -n "user.name" dir user.name: No such attribute The root cause is that we persist xattr data into reserved inline xattr space, even if inline_xattr is not enable in inline directory inode, after inline dentry conversion, reserved space no longer exists, so that xattr data missed. Let's use inline xattr space only if inline_xattr flag is set on inode to fix this iusse. Fixes: 6afc662e68b5 ("f2fs: support flexible inline xattr size") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15f2fs: fix to avoid panic in dec_valid_block_count()Chao Yu1-2/+10
[ Upstream commit 5e159cd349bf3a31fb7e35c23a93308eb30f4f71 ] As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203209 - Overview When mounting the attached crafted image and running program, I got this error. Additionally, it hangs on sync after the this script. The image is intentionally fuzzed from a normal f2fs image for testing and I enabled option CONFIG_F2FS_CHECK_FS on. - Reproduces cc poc_01.c ./run.sh f2fs sync kernel BUG at fs/f2fs/f2fs.h:1788! RIP: 0010:f2fs_truncate_data_blocks_range+0x342/0x350 Call Trace: f2fs_truncate_blocks+0x36d/0x3c0 f2fs_truncate+0x88/0x110 f2fs_setattr+0x3e1/0x460 notify_change+0x2da/0x400 do_truncate+0x6d/0xb0 do_sys_ftruncate+0xf1/0x160 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The reason is dec_valid_block_count() will trigger kernel panic due to inconsistent count in between inode.i_blocks and actual block. To avoid panic, let's just print debug message and set SBI_NEED_FSCK to give a hint to fsck for latter repairing. Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: fix build warning and add unlikely] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15f2fs: fix to clear dirty inode in error path of f2fs_iget()Chao Yu1-0/+1
[ Upstream commit 546d22f070d64a7b96f57c93333772085d3a5e6d ] As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203217 - Overview When mounting the attached crafted image and running program, I got this error. Additionally, it hangs on sync after running the program. The image is intentionally fuzzed from a normal f2fs image for testing and I enabled option CONFIG_F2FS_CHECK_FS on. - Reproduces cc poc_test_05.c mkdir test mount -t f2fs tmp.img test sudo ./a.out sync - Messages kernel BUG at fs/f2fs/inode.c:707! RIP: 0010:f2fs_evict_inode+0x33f/0x3a0 Call Trace: evict+0xba/0x180 f2fs_iget+0x598/0xdf0 f2fs_lookup+0x136/0x320 __lookup_slow+0x92/0x140 lookup_slow+0x30/0x50 walk_component+0x1c1/0x350 path_lookupat+0x62/0x200 filename_lookup+0xb3/0x1a0 do_readlinkat+0x56/0x110 __x64_sys_readlink+0x16/0x20 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 During inode loading, __recover_inline_status() can recovery inode status and set inode dirty, once we failed in following process, it will fail the check in f2fs_evict_inode, result in trigger BUG_ON(). Let's clear dirty inode in error path of f2fs_iget() to avoid panic. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15f2fs: fix to do sanity check on free nidChao Yu1-0/+3
[ Upstream commit 626bcf2b7ce87211dba565f2bfa7842ba5be5c1b ] As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203225 - Overview When mounting the attached crafted image and unmounting it, following errors are reported. Additionally, it hangs on sync after unmounting. The image is intentionally fuzzed from a normal f2fs image for testing. Compile options for F2FS are as follows. CONFIG_F2FS_FS=y CONFIG_F2FS_STAT_FS=y CONFIG_F2FS_FS_XATTR=y CONFIG_F2FS_FS_POSIX_ACL=y CONFIG_F2FS_CHECK_FS=y - Reproduces mkdir test mount -t f2fs tmp.img test touch test/t umount test sync - Messages kernel BUG at fs/f2fs/node.c:3073! RIP: 0010:f2fs_destroy_node_manager+0x2f0/0x300 Call Trace: f2fs_put_super+0xf4/0x270 generic_shutdown_super+0x62/0x110 kill_block_super+0x1c/0x50 kill_f2fs_super+0xad/0xd0 deactivate_locked_super+0x35/0x60 cleanup_mnt+0x36/0x70 task_work_run+0x75/0x90 exit_to_usermode_loop+0x93/0xa0 do_syscall_64+0xba/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0010:f2fs_destroy_node_manager+0x2f0/0x300 NAT table is corrupted, so reserved meta/node inode ids were added into free list incorrectly, during file creation, since reserved id has cached in inode hash, so it fails the creation and preallocated nid can not be released later, result in kernel panic. To fix this issue, let's do nid boundary check during free nid loading. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15f2fs: fix to avoid panic in f2fs_remove_inode_page()Chao Yu1-2/+8
[ Upstream commit 8b6810f8acfe429fde7c7dad4714692cc5f75651 ] As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203219 - Overview When mounting the attached crafted image and running program, I got this error. Additionally, it hangs on sync after running the program. The image is intentionally fuzzed from a normal f2fs image for testing and I enabled option CONFIG_F2FS_CHECK_FS on. - Reproduces cc poc_06.c mkdir test mount -t f2fs tmp.img test cp a.out test cd test sudo ./a.out sync - Messages kernel BUG at fs/f2fs/node.c:1183! RIP: 0010:f2fs_remove_inode_page+0x294/0x2d0 Call Trace: f2fs_evict_inode+0x2a3/0x3a0 evict+0xba/0x180 __dentry_kill+0xbe/0x160 dentry_kill+0x46/0x180 dput+0xbb/0x100 do_renameat2+0x3c9/0x550 __x64_sys_rename+0x17/0x20 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The reason is f2fs_remove_inode_page() will trigger kernel panic due to inconsistent i_blocks value of inode. To avoid panic, let's just print debug message and set SBI_NEED_FSCK to give a hint to fsck for latter repairing of potential image corruption. Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: fix build warning and add unlikely] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15f2fs: fix error path of recoveryChao Yu1-4/+11
[ Upstream commit 988385795c7f46b231982d54750587f204bd558b ] There are some places in where we missed to unlock page or unlock page incorrectly, fix them. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15f2fs: fix to avoid panic in f2fs_inplace_write_data()Chao Yu1-2/+7
[ Upstream commit 05573d6ccf702df549a7bdeabef31e4753df1a90 ] As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203239 - Overview When mounting the attached crafted image and running program, following errors are reported. Additionally, it hangs on sync after running program. The image is intentionally fuzzed from a normal f2fs image for testing. Compile options for F2FS are as follows. CONFIG_F2FS_FS=y CONFIG_F2FS_STAT_FS=y CONFIG_F2FS_FS_XATTR=y CONFIG_F2FS_FS_POSIX_ACL=y CONFIG_F2FS_CHECK_FS=y - Reproduces cc poc_15.c ./run.sh f2fs sync - Kernel messages ------------[ cut here ]------------ kernel BUG at fs/f2fs/segment.c:3162! RIP: 0010:f2fs_inplace_write_data+0x12d/0x160 Call Trace: f2fs_do_write_data_page+0x3c1/0x820 __write_data_page+0x156/0x720 f2fs_write_cache_pages+0x20d/0x460 f2fs_write_data_pages+0x1b4/0x300 do_writepages+0x15/0x60 __filemap_fdatawrite_range+0x7c/0xb0 file_write_and_wait_range+0x2c/0x80 f2fs_do_sync_file+0x102/0x810 do_fsync+0x33/0x60 __x64_sys_fsync+0xb/0x10 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The reason is f2fs_inplace_write_data() will trigger kernel panic due to data block locates in node type segment. To avoid panic, let's just return error code and set SBI_NEED_FSCK to give a hint to fsck for latter repairing. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15f2fs: fix to avoid panic in do_recover_data()Chao Yu1-1/+9
[ Upstream commit 22d61e286e2d9097dae36f75ed48801056b77cac ] As Jungyeon reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203227 - Overview When mounting the attached crafted image, following errors are reported. Additionally, it hangs on sync after trying to mount it. The image is intentionally fuzzed from a normal f2fs image for testing. Compile options for F2FS are as follows. CONFIG_F2FS_FS=y CONFIG_F2FS_STAT_FS=y CONFIG_F2FS_FS_XATTR=y CONFIG_F2FS_FS_POSIX_ACL=y CONFIG_F2FS_CHECK_FS=y - Reproduces mkdir test mount -t f2fs tmp.img test sync - Messages kernel BUG at fs/f2fs/recovery.c:549! RIP: 0010:recover_data+0x167a/0x1780 Call Trace: f2fs_recover_fsync_data+0x613/0x710 f2fs_fill_super+0x1043/0x1aa0 mount_bdev+0x16d/0x1a0 mount_fs+0x4a/0x170 vfs_kern_mount+0x5d/0x100 do_mount+0x200/0xcf0 ksys_mount+0x79/0xc0 __x64_sys_mount+0x1c/0x20 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 During recovery, if ofs_of_node is inconsistent in between recovered node page and original checkpointed node page, let's just fail recovery instead of making kernel panic. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15mm: page_mkclean vs MADV_DONTNEED raceAneesh Kumar K.V1-1/+1
[ Upstream commit 024eee0e83f0df52317be607ca521e0fc572aa07 ] MADV_DONTNEED is handled with mmap_sem taken in read mode. We call page_mkclean without holding mmap_sem. MADV_DONTNEED implies that pages in the region are unmapped and subsequent access to the pages in that range is handled as a new page fault. This implies that if we don't have parallel access to the region when MADV_DONTNEED is run we expect those range to be unallocated. w.r.t page_mkclean() we need to make sure that we don't break the MADV_DONTNEED semantics. MADV_DONTNEED check for pmd_none without holding pmd_lock. This implies we skip the pmd if we temporarily mark pmd none. Avoid doing that while marking the page clean. Keep the sequence same for dax too even though we don't support MADV_DONTNEED for dax mapping The bug was noticed by code review and I didn't observe any failures w.r.t test run. This is similar to commit 58ceeb6bec86d9140f9d91d71a710e963523d063 Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Date: Thu Apr 13 14:56:26 2017 -0700 thp: fix MADV_DONTNEED vs. MADV_FREE race commit ced108037c2aa542b3ed8b7afd1576064ad1362a Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Date: Thu Apr 13 14:56:20 2017 -0700 thp: fix MADV_DONTNEED vs. numa balancing race Link: http://lkml.kernel.org/r/20190321040610.14226-1-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc:"Kirill A . Shutemov" <kirill@shutemov.name> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15fs/fat/file.c: issue flush after the writeback of FATHou Tao1-3/+8
[ Upstream commit bd8309de0d60838eef6fb575b0c4c7e95841cf73 ] fsync() needs to make sure the data & meta-data of file are persistent after the return of fsync(), even when a power-failure occurs later. In the case of fat-fs, the FAT belongs to the meta-data of file, so we need to issue a flush after the writeback of FAT instead before. Also bail out early when any stage of fsync fails. Link: http://lkml.kernel.org/r/20190409030158.136316-1-houtao1@huawei.com Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-11pstore/ram: Run without kernel crash dump regionKees Cook2-14/+25
commit 8880fa32c557600f5f624084152668ed3c2ea51e upstream. The ram pstore backend has always had the crash dumper frontend enabled unconditionally. However, it was possible to effectively disable it by setting a record_size=0. All the machinery would run (storing dumps to the temporary crash buffer), but 0 bytes would ultimately get stored due to there being no przs allocated for dumps. Commit 89d328f637b9 ("pstore/ram: Correctly calculate usable PRZ bytes"), however, assumed that there would always be at least one allocated dprz for calculating the size of the temporary crash buffer. This was, of course, not the case when record_size=0, and would lead to a NULL deref trying to find the dprz buffer size: BUG: unable to handle kernel NULL pointer dereference at (null) ... IP: ramoops_probe+0x285/0x37e (fs/pstore/ram.c:808) cxt->pstore.bufsize = cxt->dprzs[0]->buffer_size; Instead, we need to only enable the frontends based on the success of the prz initialization and only take the needed actions when those zones are available. (This also fixes a possible error in detecting if the ftrace frontend should be enabled.) Reported-and-tested-by: Yaro Slav <yaro330@gmail.com> Fixes: 89d328f637b9 ("pstore/ram: Correctly calculate usable PRZ bytes") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11pstore: Set tfm to NULL on free_buf_for_compressionPi-Hsun Shih1-1/+3
commit a9fb94a99bb515d8720ba8440ce3aba84aec80f8 upstream. Set tfm to NULL on free_buf_for_compression() after crypto_free_comp(). This avoid a use-after-free when allocate_buf_for_compression() and free_buf_for_compression() are called twice. Although free_buf_for_compression() freed the tfm, allocate_buf_for_compression() won't reinitialize the tfm since the tfm pointer is not NULL. Fixes: 95047b0519c1 ("pstore: Refactor compression initialization") Signed-off-by: Pi-Hsun Shih <pihsun@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11fuse: fix copy_file_range() in the writeback caseMiklos Szeredi1-0/+12
commit a2bc92362941006830afa3dfad6caec1f99acbf5 upstream. Prior to sending COPY_FILE_RANGE to userspace filesystem, we must flush all dirty pages in both the source and destination files. This patch adds the missing flush of the source file. Tested on libfuse-3.5.0 with: libfuse/example/passthrough_ll /mnt/fuse/ -o writeback libfuse/test/test_syscalls /mnt/fuse/tmp/test Fixes: 88bc7d5097a1 ("fuse: add support for copy_file_range()") Cc: <stable@vger.kernel.org> # v4.20 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11fuse: fallocate: fix return with locked inodeMiklos Szeredi1-1/+1
commit 35d6fcbb7c3e296a52136347346a698a35af3fda upstream. Do the proper cleanup in case the size check fails. Tested with xfstests:generic/228 Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 0cbade024ba5 ("fuse: honor RLIMIT_FSIZE in fuse_file_fallocate") Cc: Liu Bo <bo.liu@linux.alibaba.com> Cc: <stable@vger.kernel.org> # v3.5 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11NFSv4.1: Fix bug only first CB_NOTIFY_LOCK is handledYihao Wu1-3/+5
commit ba851a39c9703f09684a541885ed176f8fb7c868 upstream. When a waiter is waked by CB_NOTIFY_LOCK, it will retry nfs4_proc_setlk(). The waiter may fail to nfs4_proc_setlk() and sleep again. However, the waiter is already removed from clp->cl_lock_waitq when handling CB_NOTIFY_LOCK in nfs4_wake_lock_waiter(). So any subsequent CB_NOTIFY_LOCK won't wake this waiter anymore. We should put the waiter back to clp->cl_lock_waitq before retrying. Cc: stable@vger.kernel.org #4.9+ Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiterYihao Wu1-17/+7
commit 52b042ab9948cc367b61f9ca9c18603aa7813c3a upstream. Commit b7dbcc0e433f "NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiter" found this bug. However it didn't fix it. This commit replaces schedule_timeout() with wait_woken() and default_wake_function() with woken_wake_function() in function nfs4_retry_setlk() and nfs4_wake_lock_waiter(). wait_woken() uses memory barriers in its implementation to avoid potential race condition when putting a process into sleeping state and then waking it up. Fixes: a1d617d8f134 ("nfs: allow blocking locks to be awoken by lock callbacks") Cc: stable@vger.kernel.org #4.9+ Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-09Revert "lockd: Show pid of lockd for remote locks"Benjamin Coddington2-4/+4
commit 141731d15d6eb2fd9aaefbf9b935ce86ae243074 upstream. This reverts most of commit b8eee0e90f97 ("lockd: Show pid of lockd for remote locks"), which caused remote locks to not be differentiated between remote processes for NLM. We retain the fixup for setting the client's fl_pid to a negative value. Fixes: b8eee0e90f97 ("lockd: Show pid of lockd for remote locks") Cc: stable@vger.kernel.org Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: XueWei Zhang <xueweiz@google.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-09CIFS: cifs_read_allocate_pages: don't iterate through whole page array on ENOMEMRoberto Bergantinos Corpas1-1/+3
commit 31fad7d41e73731f05b8053d17078638cf850fa6 upstream. In cifs_read_allocate_pages, in case of ENOMEM, we go through whole rdata->pages array but we have failed the allocation before nr_pages, therefore we may end up calling put_page with NULL pointer, causing oops Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-09cifs: fix memory leak of pneg_inbuf on -EOPNOTSUPP ioctl caseColin Ian King1-1/+2
commit 210782038b54ec8e9059a3c12d6f6ae173efa3a9 upstream. Currently in the case where SMB2_ioctl returns the -EOPNOTSUPP error there is a memory leak of pneg_inbuf. Fix this by returning via the out_free_inbuf exit path that will perform the relevant kfree. Addresses-Coverity: ("Resource leak") Fixes: 969ae8e8d4ee ("cifs: Accept validate negotiate if server return NT_STATUS_NOT_SUPPORTED") CC: Stable <stable@vger.kernel.org> # v5.1+ Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-09btrfs: reloc: Also queue orphan reloc tree for cleanup to avoid BUG_ON()Qu Wenruo1-8/+19
commit 30d40577e322b670551ad7e2faa9570b6e23eb2b upstream. [BUG] When a fs has orphan reloc tree along with unfinished balance: ... item 16 key (TREE_RELOC ROOT_ITEM FS_TREE) itemoff 12090 itemsize 439 generation 12 root_dirid 256 bytenr 300400640 level 1 refs 0 <<< lastsnap 8 byte_limit 0 bytes_used 1359872 flags 0x0(none) uuid 7c48d938-33a3-4aae-ab19-6e5c9d406e46 item 17 key (BALANCE TEMPORARY_ITEM 0) itemoff 11642 itemsize 448 temporary item objectid BALANCE offset 0 balance status flags 14 Then at mount time, we can hit the following kernel BUG_ON(): BTRFS info (device dm-3): relocating block group 298844160 flags metadata|dup ------------[ cut here ]------------ kernel BUG at fs/btrfs/relocation.c:1413! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 897 Comm: btrfs-balance Tainted: G O 5.2.0-rc1-custom #15 RIP: 0010:create_reloc_root+0x1eb/0x200 [btrfs] Call Trace: btrfs_init_reloc_root+0x96/0xb0 [btrfs] record_root_in_trans+0xb2/0xe0 [btrfs] btrfs_record_root_in_trans+0x55/0x70 [btrfs] select_reloc_root+0x7e/0x230 [btrfs] do_relocation+0xc4/0x620 [btrfs] relocate_tree_blocks+0x592/0x6a0 [btrfs] relocate_block_group+0x47b/0x5d0 [btrfs] btrfs_relocate_block_group+0x183/0x2f0 [btrfs] btrfs_relocate_chunk+0x4e/0xe0 [btrfs] btrfs_balance+0x864/0xfa0 [btrfs] balance_kthread+0x3b/0x50 [btrfs] kthread+0x123/0x140 ret_from_fork+0x27/0x50 [CAUSE] In btrfs, reloc trees are used to record swapped tree blocks during balance. Reloc tree either get merged (replace old tree blocks of its parent subvolume) in next transaction if its ref is 1 (fresh). Or is already merged and will be cleaned up if its ref is 0 (orphan). After commit d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots"), reloc tree cleanup is delayed until one block group is balanced. Since fresh reloc roots are recorded during merge, as long as there is no power loss, those orphan reloc roots converted from fresh ones are handled without problem. However when power loss happens, orphan reloc roots can be recorded on-disk, thus at next mount time, we will have orphan reloc roots from on-disk data directly, and ignored by clean_dirty_subvols() routine. Then when background balance starts to balance another block group, and needs to create new reloc root for the same root, btrfs_insert_item() returns -EEXIST, and trigger that BUG_ON(). [FIX] For orphan reloc roots, also queue them to rc->dirty_subvol_roots, so all reloc roots no matter orphan or not, can be cleaned up properly and avoid above BUG_ON(). And to cooperate with above change, clean_dirty_subvols() will check if the queued root is a reloc root or a subvol root. For a subvol root, do the old work, and for a orphan reloc root, clean it up. Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots") CC: stable@vger.kernel.org # 5.1 Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-09Btrfs: incremental send, fix file corruption when no-holes feature is enabledFilipe Manana1-0/+6
commit 6b1f72e5b82a5c2a4da4d1ebb8cc01913ddbea21 upstream. When using the no-holes feature, if we have a file with prealloc extents with a start offset beyond the file's eof, doing an incremental send can cause corruption of the file due to incorrect hole detection. Such case requires that the prealloc extent(s) exist in both the parent and send snapshots, and that a hole is punched into the file that covers all its extents that do not cross the eof boundary. Example reproducer: $ mkfs.btrfs -f -O no-holes /dev/sdb $ mount /dev/sdb /mnt/sdb $ xfs_io -f -c "pwrite -S 0xab 0 500K" /mnt/sdb/foobar $ xfs_io -c "falloc -k 1200K 800K" /mnt/sdb/foobar $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base $ btrfs send -f /tmp/base.snap /mnt/sdb/base $ xfs_io -c "fpunch 0 500K" /mnt/sdb/foobar $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr $ btrfs send -p /mnt/sdb/base -f /tmp/incr.snap /mnt/sdb/incr $ md5sum /mnt/sdb/incr/foobar 816df6f64deba63b029ca19d880ee10a /mnt/sdb/incr/foobar $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ btrfs receive -f /tmp/base.snap /mnt/sdc $ btrfs receive -f /tmp/incr.snap /mnt/sdc $ md5sum /mnt/sdc/incr/foobar cf2ef71f4a9e90c2f6013ba3b2257ed2 /mnt/sdc/incr/foobar --> Different checksum, because the prealloc extent beyond the file's eof confused the hole detection code and it assumed a hole starting at offset 0 and ending at the offset of the prealloc extent (1200Kb) instead of ending at the offset 500Kb (the file's size). Fix this by ensuring we never cross the file's size when issuing the write operations for a hole. Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents") CC: stable@vger.kernel.org # 3.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-09btrfs: qgroup: Check bg while resuming relocation to avoid NULL pointer ↵Qu Wenruo1-1/+7
dereference commit 57949d033a09c57d77be218b5bec07af6878ab32 upstream. [BUG] When mounting a fs with reloc tree and has qgroup enabled, it can cause NULL pointer dereference at mount time: BUG: kernel NULL pointer dereference, address: 00000000000000a8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:btrfs_qgroup_add_swapped_blocks+0x186/0x300 [btrfs] Call Trace: replace_path.isra.23+0x685/0x900 [btrfs] merge_reloc_root+0x26e/0x5f0 [btrfs] merge_reloc_roots+0x10a/0x1a0 [btrfs] btrfs_recover_relocation+0x3cd/0x420 [btrfs] open_ctree+0x1bc8/0x1ed0 [btrfs] btrfs_mount_root+0x544/0x680 [btrfs] legacy_get_tree+0x34/0x60 vfs_get_tree+0x2d/0xf0 fc_mount+0x12/0x40 vfs_kern_mount.part.12+0x61/0xa0 vfs_kern_mount+0x13/0x20 btrfs_mount+0x16f/0x860 [btrfs] legacy_get_tree+0x34/0x60 vfs_get_tree+0x2d/0xf0 do_mount+0x81f/0xac0 ksys_mount+0xbf/0xe0 __x64_sys_mount+0x25/0x30 do_syscall_64+0x65/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe [CAUSE] In btrfs_recover_relocation(), we don't have enough info to determine which block group we're relocating, but only to merge existing reloc trees. Thus in btrfs_recover_relocation(), rc->block_group is NULL. btrfs_qgroup_add_swapped_blocks() hasn't taken this into consideration, and causes a NULL pointer dereference. The bug is introduced by commit 3d0174f78e72 ("btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group"), and later qgroup refactoring still keeps this optimization. [FIX] Thankfully in the context of btrfs_recover_relocation(), there is no other progress can modify tree blocks, thus those swapped tree blocks pair will never affect qgroup numbers, no matter whatever we set for block->trace_leaf. So we only need to check if @bg is NULL before accessing @bg->flags. Reported-by: Juan Erbes <jerbes@gmail.com> Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1134806 Fixes: 3d0174f78e72 ("btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group") CC: stable@vger.kernel.org # 4.20+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-09btrfs: correct zstd workspace manager lock to use spin_lock_bh()Dennis Zhou1-10/+10
commit fee13fe96529523a709d1fff487f14a5e0d56d34 upstream. The btrfs zstd workspace manager uses a background timer to reclaim not recently used workspaces. I used spin_lock() from this context which should have been caught with lockdep, but was not. This deadlock was reported in bugzilla. The fix is to switch the zstd wsm lock to use spin_lock_bh() from the softirq context. This happened quite relibably on ppc64, unlike on other architectures. [ 313.402874] ================================ [ 313.402875] WARNING: inconsistent lock state [ 313.402879] 5.1.0-rc7 #1 Not tainted [ 313.402880] -------------------------------- [ 313.402882] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 313.402885] swapper/5/0 [HC0[0]:SC1[1]:HE1:SE0] takes: [ 313.402888] 0000000080d1120c (&(&wsm.lock)->rlock){+.?.}, at: .zstd_reclaim_timer_fn+0x40/0x230 [ 313.402895] {SOFTIRQ-ON-W} state was registered at: [ 313.402899] .lock_acquire+0xd0/0x240 [ 313.402903] ._raw_spin_lock+0x34/0x60 [ 313.402906] .zstd_get_workspace+0xd0/0x360 [ 313.402908] .end_compressed_bio_read+0x3b8/0x540 [ 313.402911] .bio_endio+0x174/0x2c0 [ 313.402914] .end_workqueue_fn+0x4c/0x70 [ 313.402917] .normal_work_helper+0x138/0x7e0 [ 313.402920] .process_one_work+0x324/0x790 [ 313.402922] .worker_thread+0x68/0x570 [ 313.402925] .kthread+0x19c/0x1b0 [ 313.402928] .ret_from_kernel_thread+0x58/0x78 [ 313.402930] irq event stamp: 2629216 [ 313.402933] hardirqs last enabled at (2629216): [<c0000000009da738>] ._raw_spin_unlock_irq+0x38/0x60 [ 313.402936] hardirqs last disabled at (2629215): [<c0000000009da4c4>] ._raw_spin_lock_irq+0x24/0x70 [ 313.402939] softirqs last enabled at (2629212): [<c0000000000af9fc>] .irq_enter+0x8c/0xd0 [ 313.402942] softirqs last disabled at (2629213): [<c0000000000afb58>] .irq_exit+0x118/0x170 [ 313.402944] other info that might help us debug this: [ 313.402945] Possible unsafe locking scenario: [ 313.402947] CPU0 [ 313.402948] ---- [ 313.402949] lock(&(&wsm.lock)->rlock); [ 313.402951] <Interrupt> [ 313.402952] lock(&(&wsm.lock)->rlock); [ 313.402954] *** DEADLOCK *** [ 313.402957] 1 lock held by swapper/5/0: [ 313.402958] #0: 000000004b612042 ((&wsm.timer)){+.-.}, at: .call_timer_fn+0x0/0x3c0 [ 313.402963] stack backtrace: [ 313.402967] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.1.0-rc7 #1 [ 313.402968] Call Trace: [ 313.402972] [c0000007fa262e70] [c0000000009b3294] .dump_stack+0xe0/0x15c (unreliable) [ 313.402975] [c0000007fa262f10] [c000000000125548] .print_usage_bug+0x348/0x390 [ 313.402978] [c0000007fa262fd0] [c000000000125cb4] .mark_lock+0x724/0x930 [ 313.402981] [c0000007fa263080] [c000000000126c20] .__lock_acquire+0xc90/0x16a0 [ 313.402984] [c0000007fa2631b0] [c000000000128040] .lock_acquire+0xd0/0x240 [ 313.402987] [c0000007fa263280] [c0000000009da2b4] ._raw_spin_lock+0x34/0x60 [ 313.402990] [c0000007fa263300] [c00000000054b0b0] .zstd_reclaim_timer_fn+0x40/0x230 [ 313.402993] [c0000007fa2633d0] [c000000000158b38] .call_timer_fn+0xc8/0x3c0 [ 313.402996] [c0000007fa2634a0] [c000000000158f74] .expire_timers+0x144/0x260 [ 313.402999] [c0000007fa263550] [c000000000159178] .run_timer_softirq+0xe8/0x230 [ 313.403002] [c0000007fa263680] [c0000000009db288] .__do_softirq+0x188/0x5d4 [ 313.403004] [c0000007fa263790] [c0000000000afb58] .irq_exit+0x118/0x170 [ 313.403008] [c0000007fa263800] [c000000000028d88] .timer_interrupt+0x158/0x430 [ 313.403012] [c0000007fa2638b0] [c0000000000091d4] decrementer_common+0x134/0x140 [ 313.403017] --- interrupt: 901 at replay_interrupt_return+0x0/0x4 LR = .arch_local_irq_restore.part.0+0x68/0x80 [ 313.403020] [c0000007fa263bb0] [c00000000001a3ac] .arch_local_irq_restore.part.0+0x2c/0x80 (unreliable) [ 313.403024] [c0000007fa263c30] [c0000000007bbbcc] .cpuidle_enter_state+0xec/0x670 [ 313.403027] [c0000007fa263d00] [c0000000000f5130] .call_cpuidle+0x40/0x90 [ 313.403031] [c0000007fa263d70] [c0000000000f554c] .do_idle+0x2dc/0x3a0 [ 313.403034] [c0000007fa263e30] [c0000000000f59ac] .cpu_startup_entry+0x2c/0x30 [ 313.403037] [c0000007fa263ea0] [c000000000045674] .start_secondary+0x644/0x650 [ 313.403041] [c0000007fa263f90] [c00000000000ad5c] start_secondary_prolog+0x10/0x14 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203517 Fixes: 3f93aef535c8 ("btrfs: add zstd compression level support") CC: stable@vger.kernel.org # 5.1+ Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-09Btrfs: fix fsync not persisting changed attributes of a directoryFilipe Manana1-12/+0
commit 60d9f50308e5df19bc18c2fefab0eba4a843900a upstream. While logging an inode we follow its ancestors and for each one we mark it as logged in the current transaction, even if we have not logged it. As a consequence if we change an attribute of an ancestor, such as the UID or GID for example, and then explicitly fsync it, we end up not logging the inode at all despite returning success to user space, which results in the attribute being lost if a power failure happens after the fsync. Sample reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir $ chown 6007:6007 /mnt/dir $ sync $ chown 9003:9003 /mnt/dir $ touch /mnt/dir/file $ xfs_io -c fsync /mnt/dir/file # fsync our directory after fsync'ing the new file, should persist the # new values for the uid and gid. $ xfs_io -c fsync /mnt/dir <power failure> $ mount /dev/sdb /mnt $ stat -c %u:%g /mnt/dir 6007:6007 --> should be 9003:9003, the uid and gid were not persisted, despite the explicit fsync on the directory prior to the power failure Fix this by not updating the logged_trans field of ancestor inodes when logging an inode, since we have not logged them. Let only future calls to btrfs_log_inode() to mark inodes as logged. This could be triggered by my recent fsync fuzz tester for fstests, for which an fstests patch exists titled "fstests: generic, fsync fuzz tester with fsstress". Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-09Btrfs: fix race updating log root item during fsyncFilipe Manana1-2/+6
commit 06989c799f04810f6876900d4760c0edda369cf7 upstream. When syncing the log, the final phase of a fsync operation, we need to either create a log root's item or update the existing item in the log tree of log roots, and that depends on the current value of the log root's log_transid - if it's 1 we need to create the log root item, otherwise it must exist already and we update it. Since there is no synchronization between updating the log_transid and checking it for deciding whether the log root's item needs to be created or updated, we end up with a tiny race window that results in attempts to update the item to fail because the item was not yet created: CPU 1 CPU 2 btrfs_sync_log() lock root->log_mutex set log root's log_transid to 1 unlock root->log_mutex btrfs_sync_log() lock root->log_mutex sets log root's log_transid to 2 unlock root->log_mutex update_log_root() sees log root's log_transid with a value of 2 calls btrfs_update_root(), which fails with -EUCLEAN and causes transaction abort Until recently the race lead to a BUG_ON at btrfs_update_root(), but after the recent commit 7ac1e464c4d47 ("btrfs: Don't panic when we can't find a root key") we just abort the current transaction. A sample trace of the BUG_ON() on a SLE12 kernel: ------------[ cut here ]------------ kernel BUG at ../fs/btrfs/root-tree.c:157! Oops: Exception in kernel mode, sig: 5 [#1] SMP NR_CPUS=2048 NUMA pSeries (...) Supported: Yes, External CPU: 78 PID: 76303 Comm: rtas_errd Tainted: G X 4.4.156-94.57-default #1 task: c00000ffa906d010 ti: c00000ff42b08000 task.ti: c00000ff42b08000 NIP: d000000036ae5cdc LR: d000000036ae5cd8 CTR: 0000000000000000 REGS: c00000ff42b0b860 TRAP: 0700 Tainted: G X (4.4.156-94.57-default) MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 22444484 XER: 20000000 CFAR: d000000036aba66c SOFTE: 1 GPR00: d000000036ae5cd8 c00000ff42b0bae0 d000000036bda220 0000000000000054 GPR04: 0000000000000001 0000000000000000 c00007ffff8d37c8 0000000000000000 GPR08: c000000000e19c00 0000000000000000 0000000000000000 3736343438312079 GPR12: 3930373337303434 c000000007a3a800 00000000007fffff 0000000000000023 GPR16: c00000ffa9d26028 c00000ffa9d261f8 0000000000000010 c00000ffa9d2ab28 GPR20: c00000ff42b0bc48 0000000000000001 c00000ff9f0d9888 0000000000000001 GPR24: c00000ffa9d26000 c00000ffa9d261e8 c00000ffa9d2a800 c00000ff9f0d9888 GPR28: c00000ffa9d26028 c00000ffa9d2aa98 0000000000000001 c00000ffa98f5b20 NIP [d000000036ae5cdc] btrfs_update_root+0x25c/0x4e0 [btrfs] LR [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs] Call Trace: [c00000ff42b0bae0] [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs] (unreliable) [c00000ff42b0bba0] [d000000036b53610] btrfs_sync_log+0x2d0/0xc60 [btrfs] [c00000ff42b0bce0] [d000000036b1785c] btrfs_sync_file+0x44c/0x4e0 [btrfs] [c00000ff42b0bd80] [c00000000032e300] vfs_fsync_range+0x70/0x120 [c00000ff42b0bdd0] [c00000000032e44c] do_fsync+0x5c/0xb0 [c00000ff42b0be10] [c00000000032e8dc] SyS_fdatasync+0x2c/0x40 [c00000ff42b0be30] [c000000000009488] system_call+0x3c/0x100 Instruction dump: 7f43d378 4bffebb9 60000000 88d90008 3d220000 e8b90000 3b390009 e87a01f0 e8898e08 e8f90000 4bfd48e5 60000000 <0fe00000> e95b0060 39200004 394a0ea0 ---[ end trace 8f2dc8f919cabab8 ]--- So fix this by doing the check of log_transid and updating or creating the log root's item while holding the root's log_mutex. Fixes: 7237f1833601d ("Btrfs: fix tree logs parallel sync") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-09Btrfs: fix wrong ctime and mtime of a directory after log replayFilipe Manana1-2/+12
commit 5338e43abbab13791144d37fd8846847062351c6 upstream. When replaying a log that contains a new file or directory name that needs to be added to its parent directory, we end up updating the mtime and the ctime of the parent directory to the current time after we have set their values to the correct ones (set at fsync time), efectivelly losing them. Sample reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir $ touch /mnt/dir/file # fsync of the directory is optional, not needed $ xfs_io -c fsync /mnt/dir $ xfs_io -c fsync /mnt/dir/file $ stat -c %Y /mnt/dir 1557856079 <power failure> $ sleep 3 $ mount /dev/sdb /mnt $ stat -c %Y /mnt/dir 1557856082 --> should have been 1557856079, the mtime is updated to the current time when replaying the log Fix this by not updating the mtime and ctime to the current time at btrfs_add_link() when we are replaying a log tree. This could be triggered by my recent fsync fuzz tester for fstests, for which an fstests patch exists titled "fstests: generic, fsync fuzz tester with fsstress". Fixes: e02119d5a7b43 ("Btrfs: Add a write ahead tree log to optimize synchronous operations") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31NFS: Fix a double unlock from nfs_match,get_clientBenjamin Coddington1-1/+1
[ Upstream commit c260121a97a3e4df6536edbc2f26e166eff370ce ] Now that nfs_match_client drops the nfs_client_lock, we should be careful to always return it in the same condition: locked. Fixes: 950a578c6128 ("NFS: make nfs_match_client killable") Reported-by: syzbot+228a82b263b5da91883d@syzkaller.appspotmail.com Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>