summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2021-11-09coda: avoid hidden code duplication in renameJan Harkes1-5/+2
We were actually fixing up the directory mtime in both branches after the negative dentry test, it was just that one branch was only flagging the directory inodes to refresh their attributes while the other branch used the optional optimization to set mtime to the current time and not go back to the Coda client. Link: https://lkml.kernel.org/r/20210908140308.18491-6-jaharkes@cs.cmu.edu Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Jing Yangyang <jing.yangyang@zte.com.cn> Cc: Xin Tan <tanxin.ctf@gmail.com> Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn> Cc: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09coda: avoid flagging NULL inodesJan Harkes1-0/+3
Somehow we hit a negative dentry in coda_rename even after checking with d_really_is_positive. Maybe something raced and turned the new_dentry negative while we were fixing up directory link counts. Link: https://lkml.kernel.org/r/20210908140308.18491-5-jaharkes@cs.cmu.edu Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Jing Yangyang <jing.yangyang@zte.com.cn> Cc: Xin Tan <tanxin.ctf@gmail.com> Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn> Cc: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09coda: remove err which no one careAlex Shi1-2/+1
No one care 'err' in func coda_release, so better remove it. Link: https://lkml.kernel.org/r/20210908140308.18491-4-jaharkes@cs.cmu.edu Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu> Cc: Jing Yangyang <jing.yangyang@zte.com.cn> Cc: Xin Tan <tanxin.ctf@gmail.com> Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn> Cc: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09coda: check for async upcall request using local stateJan Harkes1-1/+2
Originally flagged by Smatch because the code implicitly assumed outSize is not NULL for non-async upcalls because of a flag that was (not) set in req->uc_flags. However req->uc_flags field is in shared state and although the current code will not allow it to be changed before the async request check the code is more robust when it tests against the local outSize variable. Link: https://lkml.kernel.org/r/20210908140308.18491-3-jaharkes@cs.cmu.edu Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Jing Yangyang <jing.yangyang@zte.com.cn> Cc: Xin Tan <tanxin.ctf@gmail.com> Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn> Cc: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09coda: avoid NULL pointer dereference from a bad inodeJan Harkes1-4/+9
Patch series "Coda updates for -next". The following patch series contains some fixes for the Coda kernel module I've had sitting around and were tested extensively in a development version of the Coda kernel module that lives outside of the main kernel. This patch (of 9): Avoid accessing coda_inode_info from a dentry with a bad inode. Link: https://lkml.kernel.org/r/20210908140308.18491-1-jaharkes@cs.cmu.edu Link: https://lkml.kernel.org/r/20210908140308.18491-2-jaharkes@cs.cmu.edu Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Jing Yangyang <jing.yangyang@zte.com.cn> Cc: Xin Tan <tanxin.ctf@gmail.com> Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn> Cc: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09ramfs: fix mount source show for ramfsyangerkun1-4/+7
ramfs_parse_param does not parse key "source", and will convert -ENOPARAM to 0. This will skip vfs_parse_fs_param_source in vfs_parse_fs_param, which lead always "none" mount source for ramfs. Fix it by parsing "source" in ramfs_parse_param like cgroup1_parse_param does. Link: https://lkml.kernel.org/r/20210924091756.1906118-1-yangerkun@huawei.com Signed-off-by: yangerkun <yangerkun@huawei.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09ELF: simplify STACK_ALLOC macroAlexey Dobriyan1-1/+1
"A -= B; A" is equivalent to "A -= B". Link: https://lkml.kernel.org/r/YVmcP256fRMqCwgK@localhost.localdomain Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09binfmt_elf: reintroduce using MAP_FIXED_NOREPLACEKees Cook1-9/+22
Commit b212921b13bd ("elf: don't use MAP_FIXED_NOREPLACE for elf executable mappings") reverted back to using MAP_FIXED to map ELF LOAD segments because it was found that the segments in some binaries overlap and can cause MAP_FIXED_NOREPLACE to fail. The original intent of MAP_FIXED_NOREPLACE in the ELF loader was to prevent the silent clobbering of an existing mapping (e.g. stack) by the ELF image, which could lead to exploitable conditions. Quoting commit 4ed28639519c ("fs, elf: drop MAP_FIXED usage from elf_map"), which originally introduced the use of MAP_FIXED_NOREPLACE in the loader: Both load_elf_interp and load_elf_binary rely on elf_map to map segments [to a specific] address and they use MAP_FIXED to enforce that. This is however [a] dangerous thing prone to silent data corruption which can be even exploitable. ... Let's take CVE-2017-1000253 as an example ... we could end up mapping [the executable] over the existing stack ... The [stack layout] issue has been fixed since then ... So we should be safe and any [similar] attack should be impractical. On the other hand this is just too subtle [an] assumption ... it can break quite easily and [be] hard to spot. ... Address this [weakness] by changing MAP_FIXED to the newly added MAP_FIXED_NOREPLACE. This will mean that mmap will fail if there is an existing mapping clashing with the requested one [instead of silently] clobbering it. Then processing ET_DYN binaries the loader already calculates a total size for the image when the first segment is mapped, maps the entire image, and then unmaps the remainder before the remaining segments are then individually mapped. To avoid the earlier problems (legitimate overlapping LOAD segments specified in the ELF), apply the same logic to ET_EXEC binaries as well. For both ET_EXEC and ET_DYN+INTERP use MAP_FIXED_NOREPLACE for the initial total size mapping and then use MAP_FIXED to build the final (possibly legitimately overlapping) mappings. For ET_DYN w/out INTERP, continue to map at a system-selected address in the mmap region. Link: https://lkml.kernel.org/r/20210916215947.3993776-1-keescook@chromium.org Link: https://lore.kernel.org/lkml/1595869887-23307-2-git-send-email-anthony.yznaga@oracle.com Co-developed-by: Anthony Yznaga <anthony.yznaga@oracle.com> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Chen Jingwen <chenjingwen6@huawei.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrei Vagin <avagin@openvz.org> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09proc: allow pid_revalidate() during LOOKUP_RCUStephen Brennan1-8/+10
Problem Description: When running running ~128 parallel instances of TZ=/etc/localtime ps -fe >/dev/null on a 128CPU machine, the %sys utilization reaches 97%, and perf shows the following code path as being responsible for heavy contention on the d_lockref spinlock: walk_component() lookup_fast() d_revalidate() pid_revalidate() // returns -ECHILD unlazy_child() lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention The reason is that pid_revalidate() is triggering a drop from RCU to ref path walk mode. All concurrent path lookups thus try to grab a reference to the dentry for /proc/, before re-executing pid_revalidate() and then stepping into the /proc/$pid directory. Thus there is huge spinlock contention. This patch allows pid_revalidate() to execute in RCU mode, meaning that the path lookup can successfully enter the /proc/$pid directory while still in RCU mode. Later on, the path lookup may still drop into ref mode, but the contention will be much reduced at this point. By applying this patch, %sys utilization falls to around 85% under the same workload, and the number of ps processes executed per unit time increases by 3x-4x. Although this particular workload is a bit contrived, we have seen some large collections of eager monitoring scripts which produced similarly high %sys time due to contention in the /proc directory. As a result this patch, Al noted that several procfs methods which were only called in ref-walk mode could now be called from RCU mode. To ensure that this patch is safe, I audited all the inode get_link and permission() implementations, as well as dentry d_revalidate() implementations, in fs/proc. The purpose here is to ensure that they either are safe to call in RCU (i.e. don't sleep) or correctly bail out of RCU mode if they don't support it. My analysis shows that all at-risk procfs methods are safe to call under RCU, and thus this patch is safe. Procfs RCU-walk Analysis: This analysis is up-to-date with 5.15-rc3. When called under RCU mode, these functions have arguments as follows: * get_link() receives a NULL dentry pointer when called in RCU mode. * permission() receives MAY_NOT_BLOCK in the mode parameter when called from RCU. * d_revalidate() receives LOOKUP_RCU in flags. For the following functions, either they are trivially RCU safe, or they explicitly bail at the beginning of the function when they run: proc_ns_get_link (bails out) proc_get_link (RCU safe) proc_pid_get_link (bails out) map_files_d_revalidate (bails out) map_misc_d_revalidate (bails out) proc_net_d_revalidate (RCU safe) proc_sys_revalidate (bails out, also not under /proc/$pid) tid_fd_revalidate (bails out) proc_sys_permission (not under /proc/$pid) The remainder of the functions require a bit more detail: * proc_fd_permission: RCU safe. All of the body of this function is under rcu_read_lock(), except generic_permission() which declares itself RCU safe in its documentation string. * proc_self_get_link uses GFP_ATOMIC in the RCU case, so it is RCU aware and otherwise looks safe. The same is true of proc_thread_self_get_link. * proc_map_files_get_link: calls ns_capable, which calls capable(), and thus calls into the audit code (see note #1 below). The remainder is just a call to the trivially safe proc_pid_get_link(). * proc_pid_permission: calls ptrace_may_access(), which appears RCU safe, although it does call into the "security_ptrace_access_check()" hook, which looks safe under smack and selinux. Just the audit code is of concern. Also uses get_task_struct() and put_task_struct(), see note #2 below. * proc_tid_comm_permission: Appears safe, though calls put_task_struct (see note #2 below). Note #1: Most of the concern of RCU safety has centered around the audit code. However, since b17ec22fb339 ("selinux: slow_avc_audit has become non-blocking"), it's safe to call this code under RCU. So all of the above are safe by my estimation. Note #2: get_task_struct() and put_task_struct(): The majority of get_task_struct() is under RCU read lock, and in any case it is a simple increment. But put_task_struct() is complex, given that it could at some point free the task struct, and this process has many steps which I couldn't manually verify. However, several other places call put_task_struct() under RCU, so it appears safe to use here too (see kernel/hung_task.c:165 or rcu/tree-stall.h:296) Patch description: pid_revalidate() drops from RCU into REF lookup mode. When many threads are resolving paths within /proc in parallel, this can result in heavy spinlock contention on d_lockref as each thread tries to grab a reference to the /proc dentry (and drop it shortly thereafter). Investigation indicates that it is not necessary to drop RCU in pid_revalidate(), as no RCU data is modified and the function never sleeps. So, remove the LOOKUP_RCU check. Link: https://lkml.kernel.org/r/20211004175629.292270-2-stephen.s.brennan@oracle.com Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Konrad Wilk <konrad.wilk@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacksDavid Hildenbrand1-30/+69
Let's support multiple registered callbacks, making sure that registering vmcore callbacks cannot fail. Make the callback return a bool instead of an int, handling how to deal with errors internally. Drop unused HAVE_OLDMEM_PFN_IS_RAM. We soon want to make use of this infrastructure from other drivers: virtio-mem, registering one callback for each virtio-mem device, to prevent reading unplugged virtio-mem memory. Handle it via a generic vmcore_cb structure, prepared for future extensions: for example, once we support virtio-mem on s390x where the vmcore is completely constructed in the second kernel, we want to detect and add plugged virtio-mem memory ranges to the vmcore in order for them to get dumped properly. Handle corner cases that are unexpected and shouldn't happen in sane setups: registering a callback after the vmcore has already been opened (warn only) and unregistering a callback after the vmcore has already been opened (warn and essentially read only zeroes from that point on). Link: https://lkml.kernel.org/r/20211005121430.30136-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Dave Young <dyoung@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09proc/vmcore: let pfn_is_ram() return a boolDavid Hildenbrand1-4/+4
The callback should deal with errors internally, it doesn't make sense to expose these via pfn_is_ram(). We'll rework the callbacks next. Right now we consider errors as if "it's RAM"; no functional change. Link: https://lkml.kernel.org/r/20211005121430.30136-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Dave Young <dyoung@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09procfs: do not list TID 0 in /proc/<pid>/taskFlorian Weimer1-0/+3
If a task exits concurrently, task_pid_nr_ns may return 0. [akpm@linux-foundation.org: coding style tweaks] [adobriyan@gmail.com: test that /proc/*/task doesn't contain "0"] Link: https://lkml.kernel.org/r/YV88AnVzHxPafQ9o@localhost.localdomain Link: https://lkml.kernel.org/r/8735pn5dx7.fsf@oldenburg.str.redhat.com Signed-off-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09mm,hugetlb: remove mlock ulimit for SHM_HUGETLBzhangyiru1-15/+8
Commit 21a3c273f88c ("mm, hugetlb: add thread name and pid to SHM_HUGETLB mlock rlimit warning") marked this as deprecated in 2012, but it is not deleted yet. Mike says he still sees that message in log files on occasion, so maybe we should preserve this warning. Also remove hugetlbfs related user_shm_unlock in ipc/shm.c and remove the user_shm_unlock after out. Link: https://lkml.kernel.org/r/20211103105857.25041-1-zhangyiru3@huawei.com Signed-off-by: zhangyiru <zhangyiru3@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Liu Zixian <liuzixian4@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: wuxu.wu <wuxu.wu@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09vfs: keep inodes with page cache off the inode shrinker LRUJohannes Weiner2-20/+27
Historically (pre-2.5), the inode shrinker used to reclaim only empty inodes and skip over those that still contained page cache. This caused problems on highmem hosts: struct inode could put fill lowmem zones before the cache was getting reclaimed in the highmem zones. To address this, the inode shrinker started to strip page cache to facilitate reclaiming lowmem. However, this comes with its own set of problems: the shrinkers may drop actively used page cache just because the inodes are not currently open or dirty - think working with a large git tree. It further doesn't respect cgroup memory protection settings and can cause priority inversions between containers. Nowadays, the page cache also holds non-resident info for evicted cache pages in order to detect refaults. We've come to rely heavily on this data inside reclaim for protecting the cache workingset and driving swap behavior. We also use it to quantify and report workload health through psi. The latter in turn is used for fleet health monitoring, as well as driving automated memory sizing of workloads and containers, proactive reclaim and memory offloading schemes. The consequences of dropping page cache prematurely is that we're seeing subtle and not-so-subtle failures in all of the above-mentioned scenarios, with the workload generally entering unexpected thrashing states while losing the ability to reliably detect it. To fix this on non-highmem systems at least, going back to rotating inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7 ("mm: don't reclaim inodes with many attached pages")) and failed (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many attached pages"")). The issue is mostly that shrinker pools attract pressure based on their size, and when objects get skipped the shrinkers remember this as deferred reclaim work. This accumulates excessive pressure on the remaining inodes, and we can quickly eat into heavily used ones, or dirty ones that require IO to reclaim, when there potentially is plenty of cold, clean cache around still. Instead, this patch keeps populated inodes off the inode LRU in the first place - just like an open file or dirty state would. An otherwise clean and unused inode then gets queued when the last cache entry disappears. This solves the problem without reintroducing the reclaim issues, and generally is a bit more scalable than having to wade through potentially hundreds of thousands of busy inodes. Locking is a bit tricky because the locks protecting the inode state (i_lock) and the inode LRU (lru_list.lock) don't nest inside the irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are serialized through i_lock, taken before the i_pages lock, to make sure depopulated inodes are queued reliably. Additions may race with deletions, but we'll check again in the shrinker. If additions race with the shrinker itself, we're protected by the i_lock: if find_inode() or iput() win, the shrinker will bail on the elevated i_count or I_REFERENCED; if the shrinker wins and goes ahead with the inode, it will set I_FREEING and inhibit further igets(), which will cause the other side to create a new instance of the inode instead. Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm, thp: fix incorrect unmap behavior for private pagesRongwei Wang1-1/+10
When truncating pagecache on file THP, the private pages of a process should not be unmapped mapping. This incorrect behavior on a dynamic shared libraries which will cause related processes to happen core dump. A simple test for a DSO (Prerequisite is the DSO mapped in file THP): int main(int argc, char *argv[]) { int fd; fd = open(argv[1], O_WRONLY); if (fd < 0) { perror("open"); } close(fd); return 0; } The test only to open a target DSO, and do nothing. But this operation will lead one or more process to happen core dump. This patch mainly to fix this bug. Link: https://lkml.kernel.org/r/20211025092134.18562-3-rongwei.wang@linux.alibaba.com Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs") Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> Tested-by: Xu Yu <xuyu@linux.alibaba.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Song Liu <song@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Collin Fijalkovich <cfijalkovich@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm, thp: lock filemap when truncating page cacheRongwei Wang1-1/+4
Patch series "fix two bugs for file THP". This patch (of 2): Transparent huge page has supported read-only non-shmem files. The file- backed THP is collapsed by khugepaged and truncated when written (for shared libraries). However, there is a race when multiple writers truncate the same page cache concurrently. In that case, subpage(s) of file THP can be revealed by find_get_entry in truncate_inode_pages_range, which will trigger PageTail BUG_ON in truncate_inode_page, as follows: page:000000009e420ff2 refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ff pfn:0x50c3ff head:0000000075ff816d order:9 compound_mapcount:0 compound_pincount:0 flags: 0x37fffe0000010815(locked|uptodate|lru|arch_1|head) raw: 37fffe0000000000 fffffe0013108001 dead000000000122 dead000000000400 raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 head: 37fffe0000010815 fffffe001066bd48 ffff000404183c20 0000000000000000 head: 0000000000000600 0000000000000000 00000001ffffffff ffff000c0345a000 page dumped because: VM_BUG_ON_PAGE(PageTail(page)) ------------[ cut here ]------------ kernel BUG at mm/truncate.c:213! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: xfs(E) libcrc32c(E) rfkill(E) ... CPU: 14 PID: 11394 Comm: check_madvise_d Kdump: ... Hardware name: ECS, BIOS 0.0.0 02/06/2015 pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--) Call trace: truncate_inode_page+0x64/0x70 truncate_inode_pages_range+0x550/0x7e4 truncate_pagecache+0x58/0x80 do_dentry_open+0x1e4/0x3c0 vfs_open+0x38/0x44 do_open+0x1f0/0x310 path_openat+0x114/0x1dc do_filp_open+0x84/0x134 do_sys_openat2+0xbc/0x164 __arm64_sys_openat+0x74/0xc0 el0_svc_common.constprop.0+0x88/0x220 do_el0_svc+0x30/0xa0 el0_svc+0x20/0x30 el0_sync_handler+0x1a4/0x1b0 el0_sync+0x180/0x1c0 Code: aa0103e0 900061e1 910ec021 9400d300 (d4210000) This patch mainly to lock filemap when one enter truncate_pagecache(), avoiding truncating the same page cache concurrently. Link: https://lkml.kernel.org/r/20211025092134.18562-1-rongwei.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/20211025092134.18562-2-rongwei.wang@linux.alibaba.com Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs") Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Song Liu <song@kernel.org> Cc: Collin Fijalkovich <cfijalkovich@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06fs: explicitly unregister per-superblock BDIsChristoph Hellwig1-0/+3
Add a new SB_I_ flag to mark superblocks that have an ephemeral bdi associated with them, and unregister it when the superblock is shut down. Link: https://lkml.kernel.org/r/20211021124441.668816-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Miquel Raynal <miquel.raynal@bootlin.com> Cc: Richard Weinberger <richard@nod.at> Cc: Vignesh Raghavendra <vigneshr@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm/smaps: simplify shmem handling of pte holesPeter Xu1-10/+12
Firstly, check_shmem_swap variable is actually not necessary, because it's always set with pte_hole hook; checking each would work. Meanwhile, the check within smaps_pte_entry is not easy to follow. E.g., pte_none() check is not needed as "!pte_present && !is_swap_pte" is the same. Since at it, use the pte_hole() helper rather than dup the page cache lookup. Still keep the CONFIG_SHMEM part so the code can be optimized to nop for !SHMEM. There will be a very slight functional change in smaps_pte_entry(), that for !SHMEM we'll return early for pte_none (before checking page==NULL), but that's even nicer. Link: https://lkml.kernel.org/r/20210917164756.8586-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm/smaps: fix shmem pte hole swap calculationPeter Xu1-2/+4
Patch series "mm/smaps: Fixes and optimizations on shmem swap handling". This patch (of 3): The shmem swap calculation on the privately writable mappings are using wrong parameters as spotted by Vlastimil. Fix them. This was introduced in commit 48131e03ca4e ("mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings"), when shmem_swap_usage was reworked to shmem_partial_swap_usage. Test program: void main(void) { char *buffer, *p; int i, fd; fd = memfd_create("test", 0); assert(fd > 0); /* isize==2M*3, fill in pages, swap them out */ ftruncate(fd, SIZE_2M * 3); buffer = mmap(NULL, SIZE_2M * 3, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); assert(buffer); for (i = 0, p = buffer; i < SIZE_2M * 3 / 4096; i++) { *p = 1; p += 4096; } madvise(buffer, SIZE_2M * 3, MADV_PAGEOUT); munmap(buffer, SIZE_2M * 3); /* * Remap with private+writtable mappings on partial of the inode (<= 2M*3), * while the size must also be >= 2M*2 to make sure there's a none pmd so * smaps_pte_hole will be triggered. */ buffer = mmap(NULL, SIZE_2M * 2, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0); printf("pid=%d, buffer=%p\n", getpid(), buffer); /* Check /proc/$PID/smap_rollup, should see 4MB swap */ sleep(1000000); } Before the patch, smaps_rollup shows <4MB swap and the number will be random depending on the alignment of the buffer of mmap() allocated. After this patch, it'll show 4MB. Link: https://lkml.kernel.org/r/20210917164756.8586-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20210917164756.8586-2-peterx@redhat.com Fixes: 48131e03ca4e ("mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06d_path: fix Kernel doc validator complainingJia He1-5/+3
Kernel doc validator complains: Function parameter or member 'p' not described in 'prepend_name' Excess function parameter 'buffer' description in 'prepend_name' Link: https://lkml.kernel.org/r/20211011005614.26189-1-justin.he@arm.com Fixes: ad08ae586586 ("d_path: introduce struct prepend_buffer") Signed-off-by: Jia He <justin.he@arm.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06fs/posix_acl.c: avoid -Wempty-body warningArnd Bergmann1-2/+1
The fallthrough comment for an ignored cmpxchg() return value produces a harmless warning with 'make W=1': fs/posix_acl.c: In function 'get_acl': fs/posix_acl.c:127:36: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 127 | /* fall through */ ; | ^ Simplify it as a step towards a clean W=1 build. As all architectures define cmpxchg() as a statement expression these days, it is no longer necessary to evaluate its return code, and the if() can just be droped. Link: https://lkml.kernel.org/r/20210927102410.1863853-1-arnd@kernel.org Link: https://lore.kernel.org/all/20210322132103.qiun2rjilnlgztxe@wittgenstein/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: James Morris <jamorris@linux.microsoft.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06ocfs2: do not zero pages beyond i_sizeJan Kara1-7/+12
ocfs2_zero_range_for_truncate() can try to zero pages beyond current inode size despite the fact that underlying blocks should be already zeroed out and writeback will skip writing such pages anyway. Avoid the pointless work. Link: https://lkml.kernel.org/r/20211025151332.11301-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06ocfs2: fix data corruption on truncateJan Kara1-2/+6
Patch series "ocfs2: Truncate data corruption fix". As further testing has shown, commit 5314454ea3f ("ocfs2: fix data corruption after conversion from inline format") didn't fix all the data corruption issues the customer started observing after 6dbf7bb55598 ("fs: Don't invalidate page buffers in block_write_full_page()") This time I have tracked them down to two bugs in ocfs2 truncation code. One bug (truncating page cache before clearing tail cluster and setting i_size) could cause data corruption even before 6dbf7bb55598, but before that commit it needed a race with page fault, after 6dbf7bb55598 it started to be pretty deterministic. Another bug (zeroing pages beyond old i_size) used to be harmless inefficiency before commit 6dbf7bb55598. But after commit 6dbf7bb55598 in combination with the first bug it resulted in deterministic data corruption. Although fixing only the first problem is needed to stop data corruption, I've fixed both issues to make the code more robust. This patch (of 2): ocfs2_truncate_file() did unmap invalidate page cache pages before zeroing partial tail cluster and setting i_size. Thus some pages could be left (and likely have left if the cluster zeroing happened) in the page cache beyond i_size after truncate finished letting user possibly see stale data once the file was extended again. Also the tail cluster zeroing was not guaranteed to finish before truncate finished causing possible stale data exposure. The problem started to be particularly easy to hit after commit 6dbf7bb55598 "fs: Don't invalidate page buffers in block_write_full_page()" stopped invalidation of pages beyond i_size from page writeback path. Fix these problems by unmapping and invalidating pages in the page cache after the i_size is reduced and tail cluster is zeroed out. Link: https://lkml.kernel.org/r/20211025150008.29002-1-jack@suse.cz Link: https://lkml.kernel.org/r/20211025151332.11301-1-jack@suse.cz Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06ocfs2/dlm: remove redundant assignment of variable retColin Ian King1-1/+0
The variable ret is being assigned a value that is never read, it is updated later on with a different value. The assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Link: https://lkml.kernel.org/r/20211007233452.30815-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06ocfs2: cleanup journal init and shutdownValentin Vidic4-46/+27
Allocate and free struct ocfs2_journal in ocfs2_journal_init and ocfs2_journal_shutdown. Init and release of system inodes references the journal so reorder calls to make sure they work correctly. Link: https://lkml.kernel.org/r/20211009145006.3478-1-vvidic@valentin-vidic.from.hr Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06ocfs2: fix handle refcount leak in two exception handling pathsChenyuan Mi1-0/+2
The reference counting issue happens in two exception handling paths of ocfs2_replay_truncate_records(). When executing these two exception handling paths, the function forgets to decrease the refcount of handle increased by ocfs2_start_trans(), causing a refcount leak. Fix this issue by using ocfs2_commit_trans() to decrease the refcount of handle in two handling paths. Link: https://lkml.kernel.org/r/20210908102055.10168-1-cymi20@fudan.edu.cn Signed-off-by: Chenyuan Mi <cymi20@fudan.edu.cn> Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Wengang Wang <wen.gang.wang@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-29Merge tag 'for-5.15-rc7-tag' of ↵Linus Torvalds5-33/+72
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Last minute fixes for crash on 32bit architectures when compression is in use. It's a regression introduced in 5.15-rc and I'd really like not let this into the final release, fixes via stable trees would add unnecessary delay. The problem is on 32bit architectures with highmem enabled, the pages for compression may need to be kmapped, while the patches removed that as we don't use GFP_HIGHMEM allocations anymore. The pages that don't come from local allocation still may be from highmem. Despite being on 32bit there's enough such ARM machines in use so it's not a marginal issue. I did full reverts of the patches one by one instead of a huge one. There's one exception for the "lzo" revert as there was an intermediate patch touching the same code to make it compatible with subpage. I can't revert that one too, so the revert in lzo.c is manual. Qu Wenruo has worked on that with me and verified the changes" * tag 'for-5.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Revert "btrfs: compression: drop kmap/kunmap from lzo" Revert "btrfs: compression: drop kmap/kunmap from zlib" Revert "btrfs: compression: drop kmap/kunmap from zstd" Revert "btrfs: compression: drop kmap/kunmap from generic helpers"
2021-10-29Revert "btrfs: compression: drop kmap/kunmap from lzo"David Sterba1-11/+25
This reverts commit 8c945d32e60427cbc0859cf7045bbe6196bb03d8. The kmaps in compression code are still needed and cause crashes on 32bit machines (ARM, x86). Reproducible eg. by running fstest btrfs/004 with enabled LZO or ZSTD compression. The revert does not apply cleanly due to changes in a6e66e6f8c1b ("btrfs: rework lzo_decompress_bio() to make it subpage compatible") that reworked the page iteration so the revert is done to be equivalent to the original code. Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839 Tested-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29Revert "btrfs: compression: drop kmap/kunmap from zlib"David Sterba1-11/+25
This reverts commit 696ab562e6df9fbafd6052d8ce4aafcb2ed16069. The kmaps in compression code are still needed and cause crashes on 32bit machines (ARM, x86). Reproducible eg. by running fstest btrfs/004 with enabled LZO or ZSTD compression. Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839 Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29Revert "btrfs: compression: drop kmap/kunmap from zstd"David Sterba1-9/+18
This reverts commit bbaf9715f3f5b5ff0de71da91fcc34ee9c198ed8. The kmaps in compression code are still needed and cause crashes on 32bit machines (ARM, x86). Reproducible eg. by running fstest btrfs/004 with enabled LZO or ZSTD compression. Example stacktrace with ZSTD on a 32bit ARM machine: Unable to handle kernel NULL pointer dereference at virtual address 00000000 pgd = c4159ed3 [00000000] *pgd=00000000 Internal error: Oops: 5 [#1] PREEMPT SMP ARM Modules linked in: CPU: 0 PID: 210 Comm: kworker/u2:3 Not tainted 5.14.0-rc79+ #12 Hardware name: Allwinner sun4i/sun5i Families Workqueue: btrfs-delalloc btrfs_work_helper PC is at mmiocpy+0x48/0x330 LR is at ZSTD_compressStream_generic+0x15c/0x28c (mmiocpy) from [<c0629648>] (ZSTD_compressStream_generic+0x15c/0x28c) (ZSTD_compressStream_generic) from [<c06297dc>] (ZSTD_compressStream+0x64/0xa0) (ZSTD_compressStream) from [<c049444c>] (zstd_compress_pages+0x170/0x488) (zstd_compress_pages) from [<c0496798>] (btrfs_compress_pages+0x124/0x12c) (btrfs_compress_pages) from [<c043c068>] (compress_file_range+0x3c0/0x834) (compress_file_range) from [<c043c4ec>] (async_cow_start+0x10/0x28) (async_cow_start) from [<c0475c3c>] (btrfs_work_helper+0x100/0x230) (btrfs_work_helper) from [<c014ef68>] (process_one_work+0x1b4/0x418) (process_one_work) from [<c014f210>] (worker_thread+0x44/0x524) (worker_thread) from [<c0156aa4>] (kthread+0x180/0x1b0) (kthread) from [<c0100150>] Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839 Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29ocfs2: fix race between searching chunks and release journal_head from ↵Gautham Ananthakrishna1-9/+13
buffer_head Encountered a race between ocfs2_test_bg_bit_allocatable() and jbd2_journal_put_journal_head() resulting in the below vmcore. PID: 106879 TASK: ffff880244ba9c00 CPU: 2 COMMAND: "loop3" Call trace: panic oops_end no_context __bad_area_nosemaphore bad_area_nosemaphore __do_page_fault do_page_fault page_fault [exception RIP: ocfs2_block_group_find_clear_bits+316] ocfs2_block_group_find_clear_bits [ocfs2] ocfs2_cluster_group_search [ocfs2] ocfs2_search_chain [ocfs2] ocfs2_claim_suballoc_bits [ocfs2] __ocfs2_claim_clusters [ocfs2] ocfs2_claim_clusters [ocfs2] ocfs2_local_alloc_slide_window [ocfs2] ocfs2_reserve_local_alloc_bits [ocfs2] ocfs2_reserve_clusters_with_limit [ocfs2] ocfs2_reserve_clusters [ocfs2] ocfs2_lock_refcount_allocators [ocfs2] ocfs2_make_clusters_writable [ocfs2] ocfs2_replace_cow [ocfs2] ocfs2_refcount_cow [ocfs2] ocfs2_file_write_iter [ocfs2] lo_rw_aio loop_queue_work kthread_worker_fn kthread ret_from_fork When ocfs2_test_bg_bit_allocatable() called bh2jh(bg_bh), the bg_bh->b_private NULL as jbd2_journal_put_journal_head() raced and released the jounal head from the buffer head. Needed to take bit lock for the bit 'BH_JournalHead' to fix this race. Link: https://lkml.kernel.org/r/1634820718-6043-1-git-send-email-gautham.ananthakrishna@oracle.com Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: <rajesh.sivaramasubramaniom@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-27Revert "btrfs: compression: drop kmap/kunmap from generic helpers"David Sterba2-2/+4
This reverts commit 4c2bf276b56d8d27ddbafcdf056ef3fc60ae50b0. The kmaps in compression code are still needed and cause crashes on 32bit machines (ARM, x86). Reproducible eg. by running fstest btrfs/004 with enabled LZO or ZSTD compression. Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839 Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-24Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-1/+1
Pull autofs fix from Al Viro: "Fix for a braino of mine (in getting rid of open-coded dentry_path_raw() in autofs a couple of cycles ago). Mea culpa... Obvious -stable fodder" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: autofs: fix wait name hash calculation in autofs_wait()
2021-10-24Merge tag '5.15-rc6-ksmbd-fixes' of git://git.samba.org/ksmbdLinus Torvalds14-153/+306
Pull ksmbd fixes from Steve French: "Ten fixes for the ksmbd kernel server, for improved security and additional buffer overflow checks: - a security improvement to session establishment to reduce the possibility of dictionary attacks - fix to ensure that maximum i/o size negotiated in the protocol is not less than 64K and not more than 8MB to better match expected behavior - fix for crediting (flow control) important to properly verify that sufficient credits are available for the requested operation - seven additional buffer overflow, buffer validation checks" * tag '5.15-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd: ksmbd: add buffer validation in session setup ksmbd: throttle session setup failures to avoid dictionary attacks ksmbd: validate OutputBufferLength of QUERY_DIR, QUERY_INFO, IOCTL requests ksmbd: validate credit charge after validating SMB2 PDU body size ksmbd: add buffer validation for smb direct ksmbd: limit read/write/trans buffer size not to exceed 8MB ksmbd: validate compound response buffer ksmbd: fix potencial 32bit overflow from data area check in smb2_write ksmbd: improve credits management ksmbd: add validation in smb2_ioctl
2021-10-23Merge tag 'io_uring-5.15-2021-10-22' of git://git.kernel.dk/linux-blockLinus Torvalds2-14/+47
Pull io_uring fixes from Jens Axboe: "Two fixes for the max workers limit API that was introduced this series: one fix for an issue with that code, and one fixing a linked timeout regression in this series" * tag 'io_uring-5.15-2021-10-22' of git://git.kernel.dk/linux-block: io_uring: apply worker limits to previous users io_uring: fix ltimeout unprep io_uring: apply max_workers limit to all future users io-wq: max_worker fixes
2021-10-22Merge tag 'fuse-fixes-5.15-rc7' of ↵Linus Torvalds3-56/+46
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "Syzbot discovered a race in case of reusing the fuse sb (introduced in this cycle). Fix it by doing the s_fs_info initialization at the proper place" * tag 'fuse-fixes-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: clean up error exits in fuse_fill_super() fuse: always initialize sb->s_fs_info fuse: clean up fuse_mount destruction fuse: get rid of fuse_put_super() fuse: check s_root when destroying sb
2021-10-21io_uring: apply worker limits to previous usersPavel Begunkov1-0/+18
Another change to the API io-wq worker limitation API added in 5.15, apply the limit to all prior users that already registered a tctx. It may be confusing as it's now, in particular the change covers the following 2 cases: TASK1 | TASK2 _________________________________________________ ring = create() | | limit_iowq_workers() *not limited* | TASK1 | TASK2 _________________________________________________ ring = create() | | issue_requests() limit_iowq_workers() | | *not limited* A note on locking, it's safe to traverse ->tctx_list as we hold ->uring_lock, but do that after dropping sqd->lock to avoid possible problems. It's also safe to access tctx->io_wq there because tasks kill it only after removing themselves from tctx_list, see io_uring_cancel_generic() -> io_uring_clean_tctx() Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d6e09ecc3545e4dc56e43c906ee3d71b7ae21bed.1634818641.git.asml.silence@gmail.com Reviewed-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21fuse: clean up error exits in fuse_fill_super()Miklos Szeredi1-6/+2
Instead of "goto err", return error directly, since there's no error cleanup to do now. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21fuse: always initialize sb->s_fs_infoMiklos Szeredi1-25/+25
Syzkaller reports a null pointer dereference in fuse_test_super() that is caused by sb->s_fs_info being NULL. This is due to the fact that fuse_fill_super() is initializing s_fs_info, which is too late, it's already on the fs_supers list. The initialization needs to be done in sget_fc() with the sb_lock held. Move allocation of fuse_mount and fuse_conn from fuse_fill_super() into fuse_get_tree(). After this ->kill_sb() will always be called with non-NULL ->s_fs_info, hence fuse_mount_destroy() can drop the test for non-NULL "fm". Reported-by: syzbot+74a15f02ccb51f398601@syzkaller.appspotmail.com Fixes: 5d5b74aa9c76 ("fuse: allow sharing existing sb") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21fuse: clean up fuse_mount destructionMiklos Szeredi2-19/+7
1. call fuse_mount_destroy() for open coded variants 2. before deactivate_locked_super() don't need fuse_mount destruction since that will now be done (if ->s_fs_info is not cleared) 3. rearrange fuse_mount setup in fuse_get_tree_submount() so that the regular pattern can be used Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21fuse: get rid of fuse_put_super()Miklos Szeredi3-9/+15
The ->put_super callback is called from generic_shutdown_super() in case of a fully initialized sb. This is called from kill_***_super(), which is called from ->kill_sb instances. Fuse uses ->put_super to destroy the fs specific fuse_mount and drop the reference to the fuse_conn, while it does the same on each error case during sb setup. This patch moves the destruction from fuse_put_super() to fuse_mount_destroy(), called at the end of all ->kill_sb instances. A follup patch will clean up the error paths. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21fuse: check s_root when destroying sbMiklos Szeredi2-2/+2
Checking "fm" works because currently sb->s_fs_info is cleared on error paths; however, sb->s_root is what generic_shutdown_super() checks to determine whether the sb was fully initialized or not. This change will allow cleanup of sb setup error paths. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21autofs: fix wait name hash calculation in autofs_wait()Ian Kent1-1/+1
There's a mistake in commit 2be7828c9fefc ("get rid of autofs_getpath()") that affects kernels from v5.13.0, basically missed because of me not fully testing the change for Al. The problem is that the hash calculation for the wait name qstr hasn't been updated to account for the change to use dentry_path_raw(). This prevents the correct matching an existing wait resulting in multiple notifications being sent to the daemon for the same mount which must not occur. The problem wasn't discovered earlier because it only occurs when multiple processes trigger a request for the same mount concurrently so it only shows up in more aggressive testing. Fixes: 2be7828c9fefc ("get rid of autofs_getpath()") Cc: stable@vger.kernel.org Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-10-20Merge tag 'ceph-for-5.15-rc7' of git://github.com/ceph/ceph-clientLinus Torvalds6-30/+22
Pull ceph fixes from Ilya Dryomov: "Two important filesystem fixes, marked for stable. The blocklisted superblocks issue was particularly annoying because for unexperienced users it essentially exacted a reboot to establish a new functional mount in that scenario" * tag 'ceph-for-5.15-rc7' of git://github.com/ceph/ceph-client: ceph: fix handling of "meta" errors ceph: skip existing superblocks that are blocklisted or shut down when mounting
2021-10-20io_uring: fix ltimeout unprepPavel Begunkov1-6/+1
io_unprep_linked_timeout() is broken, first it needs to return back REQ_F_ARM_LTIMEOUT, so the linked timeout is enqueued and disarmed. But now we refcounted it, and linked timeouts may get not executed at all, leaking a request. Just kill the unprep optimisation. Fixes: 906c6caaf586 ("io_uring: optimise io_prep_linked_timeout()") Reported-by: Beld Zhang <beldzhang@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/51b8e2bfc4bea8ee625cf2ba62b2a350cc9be031.1634719585.git.asml.silence@gmail.com Link: https://github.com/axboe/liburing/issues/460 Reported-by: Beld Zhang <beldzhang@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20io_uring: apply max_workers limit to all future usersPavel Begunkov1-6/+23
Currently, IORING_REGISTER_IOWQ_MAX_WORKERS applies only to the task that issued it, it's unexpected for users. If one task creates a ring, limits workers and then passes it to another task the limit won't be applied to the other task. Another pitfall is that a task should either create a ring or submit at least one request for IORING_REGISTER_IOWQ_MAX_WORKERS to work at all, furher complicating the picture. Change the API, save the limits and apply to all future users. Note, it should be done first before giving away the ring or submitting new requests otherwise the result is not guaranteed. Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers") Link: https://github.com/axboe/liburing/issues/460 Reported-by: Beld Zhang <beldzhang@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/51d0bae97180e08ab722c0d5c93e7439cfb6f697.1634683237.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20ksmbd: add buffer validation in session setupMarios Makassikis2-27/+40
Make sure the security buffer's length/offset are valid with regards to the packet length. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-10-20ksmbd: throttle session setup failures to avoid dictionary attacksNamjae Jeon6-6/+31
To avoid dictionary attacks (repeated session setups rapidly sent) to connect to server, ksmbd make a delay of a 5 seconds on session setup failure to make it harder to send enough random connection requests to break into a server if a user insert the wrong password 10 times in a row. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-10-20ksmbd: validate OutputBufferLength of QUERY_DIR, QUERY_INFO, IOCTL requestsHyunchul Lee1-16/+52
Validate OutputBufferLength of QUERY_DIR, QUERY_INFO, IOCTL requests and check the free size of response buffer for these requests. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-10-20io-wq: max_worker fixesPavel Begunkov1-2/+5
First, fix nr_workers checks against max_workers, with max_worker registration, it may pretty easily happen that nr_workers > max_workers. Also, synchronise writing to acct->max_worker with wqe->lock. It's not an actual problem, but as we don't care about io_wqe_create_worker(), it's better than WRITE_ONCE()/READ_ONCE(). Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers") Reported-by: Beld Zhang <beldzhang@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/11f90e6b49410b7d1a88f5d04fb8d95bb86b8cf3.1634671835.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>