summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2023-06-26Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linuxLinus Torvalds3-19/+6
Pull splice updates from Jens Axboe: "This kills off ITER_PIPE to avoid a race between truncate, iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio with unpinned/unref'ed pages from an O_DIRECT splice read. This causes memory corruption. Instead, we either use (a) filemap_splice_read(), which invokes the buffered file reading code and splices from the pagecache into the pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads into it and then pushes the filled pages into the pipe; or (c) handle it in filesystem-specific code. Summary: - Rename direct_splice_read() to copy_splice_read() - Simplify the calculations for the number of pages to be reclaimed in copy_splice_read() - Turn do_splice_to() into a helper, vfs_splice_read(), so that it can be used by overlayfs and coda to perform the checks on the lower fs - Make vfs_splice_read() jump to copy_splice_read() to handle direct-I/O and DAX - Provide shmem with its own splice_read to handle non-existent pages in the pagecache. We don't want a ->read_folio() as we don't want to populate holes, but filemap_get_pages() requires it - Provide overlayfs with its own splice_read to call down to a lower layer as overlayfs doesn't provide ->read_folio() - Provide coda with its own splice_read to call down to a lower layer as coda doesn't provide ->read_folio() - Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs and random files as they just copy to the output buffer and don't splice pages - Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3, ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation - Make cifs use filemap_splice_read() - Replace pointers to generic_file_splice_read() with pointers to filemap_splice_read() as DIO and DAX are handled in the caller; filesystems can still provide their own alternate ->splice_read() op - Remove generic_file_splice_read() - Remove ITER_PIPE and its paraphernalia as generic_file_splice_read was the only user" * tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits) splice: kdoc for filemap_splice_read() and copy_splice_read() iov_iter: Kill ITER_PIPE splice: Remove generic_file_splice_read() splice: Use filemap_splice_read() instead of generic_file_splice_read() cifs: Use filemap_splice_read() trace: Convert trace/seq to use copy_splice_read() zonefs: Provide a splice-read wrapper xfs: Provide a splice-read wrapper orangefs: Provide a splice-read wrapper ocfs2: Provide a splice-read wrapper ntfs3: Provide a splice-read wrapper nfs: Provide a splice-read wrapper f2fs: Provide a splice-read wrapper ext4: Provide a splice-read wrapper ecryptfs: Provide a splice-read wrapper ceph: Provide a splice-read wrapper afs: Provide a splice-read wrapper 9p: Add splice_read wrapper net: Make sock_splice_read() use copy_splice_read() by default tty, proc, kernfs, random: Use copy_splice_read() ...
2023-06-26Merge tag 'for-6.5-tag' of ↵Linus Torvalds1-4/+35
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Mainly core changes, refactoring and optimizations. Performance is improved in some areas, overall there may be a cumulative improvement due to refactoring that removed lookups in the IO path or simplified IO submission tracking. Core: - submit IO synchronously for fast checksums (crc32c and xxhash), remove high priority worker kthread - read extent buffer in one go, simplify IO tracking, bio submission and locking - remove additional tracking of redirtied extent buffers, originally added for zoned mode but actually not needed - track ordered extent pointer in bio to avoid rbtree lookups during IO - scrub, use recovered data stripes as cache to avoid unnecessary read - in zoned mode, optimize logical to physical mappings of extents - remove PageError handling, not set by VFS nor writeback - cleanups, refactoring, better structure packing - lots of error handling improvements - more assertions, lockdep annotations - print assertion failure with the exact line where it happens - tracepoint updates - more debugging prints Performance: - speedup in fsync(), better tracking of inode logged status can avoid transaction commit - IO path structures track logical offsets in data structures and does not need to look it up User visible changes: - don't commit transaction for every created subvolume, this can reduce time when many subvolumes are created in a batch - print affected files when relocation fails - trigger orphan file cleanup during START_SYNC ioctl Notable fixes: - fix crash when disabling quota and relocation - fix crashes when removing roots from drity list - fix transacion abort during relocation when converting from newer profiles not covered by fallback - in zoned mode, stop reclaiming block groups if filesystem becomes read-only - fix rare race condition in tree mod log rewind that can miss some btree node slots - with enabled fsverity, drop up-to-date page bit in case the verification fails" * tag 'for-6.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (194 commits) btrfs: fix race between quota disable and relocation btrfs: add comment to struct btrfs_fs_info::dirty_cowonly_roots btrfs: fix race when deleting free space root from the dirty cow roots list btrfs: fix race when deleting quota root from the dirty cow roots list btrfs: tracepoints: also show actual number of the outstanding extents btrfs: update i_version in update_dev_time btrfs: make btrfs_compressed_bioset static btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profile btrfs: scrub: remove btrfs_fs_info::scrub_wr_completion_workers btrfs: scrub: remove scrub_ctx::csum_list member btrfs: do not BUG_ON after failure to migrate space during truncation btrfs: do not BUG_ON on failure to get dir index for new snapshot btrfs: send: do not BUG_ON() on unexpected symlink data extent btrfs: do not BUG_ON() when dropping inode items from log root btrfs: replace BUG_ON() at split_item() with proper error handling btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr() btrfs: do not BUG_ON() on tree mod log failures at insert_ptr() btrfs: do not BUG_ON() on tree mod log failure at insert_new_root() btrfs: do not BUG_ON() on tree mod log failures at push_nodes_for_insert() btrfs: abort transaction at update_ref_for_cow() when ref count is zero ...
2023-06-26Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linuxLinus Torvalds1-5/+9
Pull fsverity updates from Eric Biggers: "Several updates for fs/verity/: - Do all hashing with the shash API instead of with the ahash API. This simplifies the code and reduces API overhead. It should also make things slightly easier for XFS's upcoming support for fsverity. It does drop fsverity's support for off-CPU hash accelerators, but that support was incomplete and not known to be used - Update and export fsverity_get_digest() so that it's ready for overlayfs's upcoming support for fsverity checking of lowerdata - Improve the documentation for builtin signature support - Fix a bug in the large folio support" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux: fsverity: improve documentation for builtin signature support fsverity: rework fsverity_get_digest() again fsverity: simplify error handling in verify_data_block() fsverity: don't use bio_first_page_all() in fsverity_verify_bio() fsverity: constify fsverity_hash_alg fsverity: use shash API instead of ahash API
2023-06-26Merge tag 'nfsd-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds5-17/+59
Pull nfsd updates from Chuck Lever: - Clean-ups in the READ path in anticipation of MSG_SPLICE_PAGES - Better NUMA awareness when allocating pages and other objects - A number of minor clean-ups to XDR encoding - Elimination of a race when accepting a TCP socket - Numerous observability enhancements * tag 'nfsd-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (46 commits) nfsd: remove redundant assignments to variable len svcrdma: Fix stale comment NFSD: Distinguish per-net namespace initialization nfsd: move init of percpu reply_cache_stats counters back to nfsd_init_net SUNRPC: Address RCU warning in net/sunrpc/svc.c SUNRPC: Use sysfs_emit in place of strlcpy/sprintf SUNRPC: Remove transport class dprintk call sites SUNRPC: Fix comments for transport class registration svcrdma: Remove an unused argument from __svc_rdma_put_rw_ctxt() svcrdma: trace cc_release calls svcrdma: Convert "might sleep" comment into a code annotation NFSD: Add an nfsd4_encode_nfstime4() helper SUNRPC: Move initialization of rq_stime SUNRPC: Optimize page release in svc_rdma_sendto() svcrdma: Prevent page release when nothing was received svcrdma: Revert 2a1e4f21d841 ("svcrdma: Normalize Send page handling") SUNRPC: Revert 579900670ac7 ("svcrdma: Remove unused sc_pages field") SUNRPC: Revert cc93ce9529a6 ("svcrdma: Retain the page backing rq_res.head[0].iov_base") NFSD: add encoding of op_recall flag for write delegation NFSD: Add "official" reviewers for this subsystem ...
2023-06-26Merge tag 'v6.5/vfs.mount' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: "This contains the work to extend move_mount() to allow adding a mount beneath the topmost mount of a mount stack. There are two LWN articles about this. One covers the original patch series in [1]. The other in [2] summarizes the session and roughly the discussion between Al and me at LSFMM. The second article also goes into some good questions from attendees. Since all details are found in the relevant commit with a technical dive into semantics and locking at the end I'm only adding the motivation and core functionality for this from commit message and leave out the invasive details. The code is also heavily commented and annotated as well which was explicitly requested. TL;DR: > mount -t ext4 /dev/sda /mnt | └─/mnt /dev/sda ext4 > mount --beneath -t xfs /dev/sdb /mnt | └─/mnt /dev/sdb xfs └─/mnt /dev/sda ext4 > umount /mnt | └─/mnt /dev/sdb xfs The longer motivation is that various distributions are adding or are in the process of adding support for system extensions and in the future configuration extensions through various tools. A more detailed explanation on system and configuration extensions can be found on the manpage which is listed below at [3]. System extension images may – dynamically at runtime — extend the /usr/ and /opt/ directory hierarchies with additional files. This is particularly useful on immutable system images where a /usr/ and/or /opt/ hierarchy residing on a read-only file system shall be extended temporarily at runtime without making any persistent modifications. When one or more system extension images are activated, their /usr/ and /opt/ hierarchies are combined via overlayfs with the same hierarchies of the host OS, and the host /usr/ and /opt/ overmounted with it ("merging"). When they are deactivated, the mount point is disassembled — again revealing the unmodified original host version of the hierarchy ("unmerging"). Merging thus makes the extension's resources suddenly appear below the /usr/ and /opt/ hierarchies as if they were included in the base OS image itself. Unmerging makes them disappear again, leaving in place only the files that were shipped with the base OS image itself. System configuration images are similar but operate on directories containing system or service configuration. On nearly all modern distributions mount propagation plays a crucial role and the rootfs of the OS is a shared mount in a peer group (usually with peer group id 1): TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID / / ext4 shared:1 29 1 On such systems all services and containers run in a separate mount namespace and are pivot_root()ed into their rootfs. A separate mount namespace is almost always used as it is the minimal isolation mechanism services have. But usually they are even much more isolated up to the point where they almost become indistinguishable from containers. Mount propagation again plays a crucial role here. The rootfs of all these services is a slave mount to the peer group of the host rootfs. This is done so the service will receive mount propagation events from the host when certain files or directories are updated. In addition, the rootfs of each service, container, and sandbox is also a shared mount in its separate peer group: TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID / / ext4 shared:24 master:1 71 47 For people not too familiar with mount propagation, the master:1 means that this is a slave mount to peer group 1. Which as one can see is the host rootfs as indicated by shared:1 above. The shared:24 indicates that the service rootfs is a shared mount in a separate peer group with peer group id 24. A service may run other services. Such nested services will also have a rootfs mount that is a slave to the peer group of the outer service rootfs mount. For containers things are just slighly different. A container's rootfs isn't a slave to the service's or host rootfs' peer group. The rootfs mount of a container is simply a shared mount in its own peer group: TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID /home/ubuntu/debian-tree / ext4 shared:99 61 60 So whereas services are isolated OS components a container is treated like a separate world and mount propagation into it is restricted to a single well known mount that is a slave to the peer group of the shared mount /run on the host: TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID /propagate/debian-tree /run/host/incoming tmpfs master:5 71 68 Here, the master:5 indicates that this mount is a slave to the peer group with peer group id 5. This allows to propagate mounts into the container and served as a workaround for not being able to insert mounts into mount namespaces directly. But the new mount api does support inserting mounts directly. For the interested reader the blogpost in [4] might be worth reading where I explain the old and the new approach to inserting mounts into mount namespaces. Containers of course, can themselves be run as services. They often run full systems themselves which means they again run services and containers with the exact same propagation settings explained above. The whole system is designed so that it can be easily updated, including all services in various fine-grained ways without having to enter every single service's mount namespace which would be prohibitively expensive. The mount propagation layout has been carefully chosen so it is possible to propagate updates for system extensions and configurations from the host into all services. The simplest model to update the whole system is to mount on top of /usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc will then propagate into every service. This works cleanly the first time. However, when the system is updated multiple times it becomes necessary to unmount the first update on /opt, /usr, /etc and then propagate the new update. But this means, there's an interval where the old base system is accessible. This has to be avoided to protect against downgrade attacks. The vfs already exposes a mechanism to userspace whereby mounts can be mounted beneath an existing mount. Such mounts are internally referred to as "tucked". The patch series exposes the ability to mount beneath a top mount through the new MOVE_MOUNT_BENEATH flag for the move_mount() system call. This allows userspace to seamlessly upgrade mounts. After this series the only thing that will have changed is that mounting beneath an existing mount can be done explicitly instead of just implicitly. The crux is that the proposed mechanism already exists and that it is so powerful as to cover cases where mounts are supposed to be updated with new versions. Crucially, it offers an important flexibility. Namely that updates to a system may either be forced or can be delayed and the umount of the top mount be left to a service if it is a cooperative one" Link: https://lwn.net/Articles/927491 [1] Link: https://lwn.net/Articles/934094 [2] Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [3] Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [4] Link: https://github.com/flatcar/sysext-bakery Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1 Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2 Link: https://github.com/systemd/systemd/pull/26013 * tag 'v6.5/vfs.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: allow to mount beneath top mount fs: use a for loop when locking a mount fs: properly document __lookup_mnt() fs: add path_mounted()
2023-06-26Merge tag 'v6.5/vfs.file' of ↵Linus Torvalds2-9/+37
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs file handling updates from Christian Brauner: "This contains Amir's work to fix a long-standing problem where an unprivileged overlayfs mount can be used to avoid fanotify permission events that were requested for an inode or superblock on the underlying filesystem. Some background about files opened in overlayfs. If a file is opened in overlayfs @file->f_path will refer to a "fake" path. What this means is that while @file->f_inode will refer to inode of the underlying layer, @file->f_path refers to an overlayfs {dentry,vfsmount} pair. The reasons for doing this are out of scope here but it is the reason why the vfs has been providing the open_with_fake_path() helper for overlayfs for very long time now. So nothing new here. This is for sure not very elegant and everyone including the overlayfs maintainers agree. Improving this significantly would involve more fragile and potentially rather invasive changes. In various codepaths access to the path of the underlying filesystem is needed for such hybrid file. The best example is fsnotify where this becomes security relevant. Passing the overlayfs @file->f_path->dentry will cause fsnotify to skip generating fsnotify events registered on the underlying inode or superblock. To fix this we extend the vfs provided open_with_fake_path() concept for overlayfs to create a backing file container that holds the real path and to expose a helper that can be used by relevant callers to get access to the path of the underlying filesystem through the new file_real_path() helper. This pattern is similar to what we do in d_real() and d_real_inode(). The first beneficiary is fsnotify and fixes the security sensitive problem mentioned above. There's a couple of nice cleanups included as well. Over time, the old open_with_fake_path() helper added specifically for overlayfs a long time ago started to get used in other places such as cachefiles. Even though cachefiles have nothing to do with hybrid files. The only reason cachefiles used that concept was that files opened with open_with_fake_path() aren't charged against the caller's open file limit by raising FMODE_NOACCOUNT. It's just mere coincidence that both overlayfs and cachefiles need to ensure to not overcharge the caller for their internal open calls. So this work disentangles FMODE_NOACCOUNT use cases and backing file use-cases by adding the FMODE_BACKING flag which indicates that the file can be used to retrieve the backing file of another filesystem. (Fyi, Jens will be sending you a really nice cleanup from Christoph that gets rid of 3 FMODE_* flags otherwise this would be the last fmode_t bit we'd be using.) So now overlayfs becomes the sole user of the renamed open_with_fake_path() helper which is now named backing_file_open(). For internal kernel users such as cachefiles that are only interested in FMODE_NOACCOUNT but not in FMODE_BACKING we add a new kernel_file_open() helper which opens a file without being charged against the caller's open file limit. All new helpers are properly documented and clearly annotated to mention their special uses. We also rename vfs_tmpfile_open() to kernel_tmpfile_open() to clearly distinguish it from vfs_tmpfile() and align it the other kernel_*() internal helpers" * tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: ovl: enable fsnotify events on underlying real files fs: use backing_file container for internal files with "fake" f_path fs: move kmem_cache_zalloc() into alloc_empty_file*() helpers fs: use a helper for opening kernel internal files fs: rename {vfs,kernel}_tmpfile_open()
2023-06-26Merge tag 'v6.5/vfs.misc' of ↵Linus Torvalds7-20/+30
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Miscellaneous features, cleanups, and fixes for vfs and individual fs Features: - Use mode 0600 for file created by cachefilesd so it can be run by unprivileged users. This aligns them with directories which are already created with mode 0700 by cachefilesd - Reorder a few members in struct file to prevent some false sharing scenarios - Indicate that an eventfd is used a semaphore in the eventfd's fdinfo procfs file - Add a missing uapi header for eventfd exposing relevant uapi defines - Let the VFS protect transitions of a superblock from read-only to read-write in addition to the protection it already provides for transitions from read-write to read-only. Protecting read-only to read-write transitions allows filesystems such as ext4 to perform internal writes, keeping writers away until the transition is completed Cleanups: - Arnd removed the architecture specific arch_report_meminfo() prototypes and added a generic one into procfs.h. Note, we got a report about a warning in amdpgpu codepaths that suggested this was bisectable to this change but we concluded it was a false positive - Remove unused parameters from split_fs_names() - Rename put_and_unmap_page() to unmap_and_put_page() to let the name reflect the order of the cleanup operation that has to unmap before the actual put - Unexport buffer_check_dirty_writeback() as it is not used outside of block device aops - Stop allocating aio rings from highmem - Protecting read-{only,write} transitions in the VFS used open-coded barriers in various places. Replace them with proper little helpers and document both the helpers and all barrier interactions involved when transitioning between read-{only,write} states - Use flexible array members in old readdir codepaths Fixes: - Use the correct type __poll_t for epoll and eventfd - Replace all deprecated strlcpy() invocations, whose return value isn't checked with an equivalent strscpy() call - Fix some kernel-doc warnings in fs/open.c - Reduce the stack usage in jffs2's xattr codepaths finally getting rid of this: fs/jffs2/xattr.c:887:1: error: the frame size of 1088 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] royally annoying compilation warning - Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY where an int and not fmode_t is required to avoid fmode_t to integer degradation warnings - Create coredumps with O_WRONLY instead of O_RDWR. There's a long explanation in that commit how O_RDWR is actually a bug which we found out with the help of Linus and git archeology - Fix "no previous prototype" warnings in the pipe codepaths - Add overflow calculations for remap_verify_area() as a signed addition overflow could be triggered in xfstests - Fix a null pointer dereference in sysv - Use an unsigned variable for length calculations in jfs avoiding compilation warnings with gcc 13 - Fix a dangling pipe pointer in the watch queue codepath - The legacy mount option parser provided as a fallback by the VFS for filesystems not yet converted to the new mount api did prefix the generated mount option string with a leading ',' causing issues for some filesystems - Fix a repeated word in a comment in fs.h - autofs: Update the ctime when mtime is updated as mandated by POSIX" * tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits) readdir: Replace one-element arrays with flexible-array members fs: Provide helpers for manipulating sb->s_readonly_remount fs: Protect reconfiguration of sb read-write from racing writes eventfd: add a uapi header for eventfd userspace APIs autofs: set ctime as well when mtime changes on a dir eventfd: show the EFD_SEMAPHORE flag in fdinfo fs/aio: Stop allocating aio rings from HIGHMEM fs: Fix comment typo fs: unexport buffer_check_dirty_writeback fs: avoid empty option when generating legacy mount string watch_queue: prevent dangling pipe pointer fs.h: Optimize file struct to prevent false sharing highmem: Rename put_and_unmap_page() to unmap_and_put_page() cachefiles: Allow the cache to be non-root init: remove unused names parameter in split_fs_names() jfs: Use unsigned variable for length calculations fs/sysv: Null check to prevent null-ptr-deref bug fs: use UB-safe check for signed addition overflow in remap_verify_area procfs: consolidate arch_report_meminfo declaration fs: pipe: reveal missing function protoypes ...
2023-06-25Merge tag 'perf_urgent_for_v6.4' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Drop the __weak attribute from a function prototype as it otherwise leads to the function getting replaced by a dummy stub - Fix the umask value setup of the frontend event as former is different on two Intel cores * tag 'perf_urgent_for_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Fix the FRONTEND encoding on GNR and MTL perf/core: Drop __weak attribute from arch_perf_update_userpage() prototype
2023-06-25Merge tag 'objtool_urgent_for_v6.4' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool fix from Borislav Petkov: - Add a ORC format hash to vmlinux and modules in order for other tools which use it, to detect changes to it and adapt accordingly * tag 'objtool_urgent_for_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/unwind/orc: Add ELF section with ORC version identifier
2023-06-24Merge tag 'gpio-fixes-for-v6.4' of ↵Linus Torvalds1-0/+8
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux Pull gpio fixes from Bartosz Golaszewski: - fix IRQ initialization in gpiochip_irqchip_add_domain() - add a missing return value check for platform_get_irq() in gpio-sifive - don't free irq_domains which GPIOLIB does not manage * tag 'gpio-fixes-for-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: gpiolib: Fix irq_domain resource tracking for gpiochip_irqchip_add_domain() gpio: sifive: add missing check for platform_get_irq gpiolib: Fix GPIO chip IRQ initialization restriction
2023-06-23workqueue: clean up WORK_* constant types, clarify maskingLinus Torvalds1-7/+8
Dave Airlie reports that gcc-13.1.1 has started complaining about some of the workqueue code in 32-bit arm builds: kernel/workqueue.c: In function ‘get_work_pwq’: kernel/workqueue.c:713:24: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast] 713 | return (void *)(data & WORK_STRUCT_WQ_DATA_MASK); | ^ [ ... a couple of other cases ... ] and while it's not immediately clear exactly why gcc started complaining about it now, I suspect it's some C23-induced enum type handlign fixup in gcc-13 is the cause. Whatever the reason for starting to complain, the code and data types are indeed disgusting enough that the complaint is warranted. The wq code ends up creating various "helper constants" (like that WORK_STRUCT_WQ_DATA_MASK) using an enum type, which is all kinds of confused. The mask needs to be 'unsigned long', not some unspecified enum type. To make matters worse, the actual "mask and cast to a pointer" is repeated a couple of times, and the cast isn't even always done to the right pointer, but - as the error case above - to a 'void *' with then the compiler finishing the job. That's now how we roll in the kernel. So create the masks using the proper types rather than some ambiguous enumeration, and use a nice helper that actually does the type conversion in one well-defined place. Incidentally, this magically makes clang generate better code. That, admittedly, is really just a sign of clang having been seriously confused before, and cleaning up the typing unconfuses the compiler too. Reported-by: Dave Airlie <airlied@gmail.com> Link: https://lore.kernel.org/lkml/CAPM=9twNnV4zMCvrPkw3H-ajZOH-01JVh_kDrxdPYQErz8ZTdA@mail.gmail.com/ Cc: Arnd Bergmann <arnd@arndb.de> Cc: Tejun Heo <tj@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-23Merge tag 'net-6.4-rc8' of ↵Linus Torvalds3-2/+38
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from ipsec, bpf, mptcp and netfilter. Current release - regressions: - netfilter: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain - eth: mlx5e: - fix scheduling of IPsec ASO query while in atomic - free IRQ rmap and notifier on kernel shutdown Current release - new code bugs: - phy: manual remove LEDs to ensure correct ordering Previous releases - regressions: - mptcp: fix possible divide by zero in recvmsg() - dsa: revert "net: phy: dp83867: perform soft reset and retain established link" Previous releases - always broken: - sched: netem: acquire qdisc lock in netem_change() - bpf: - fix verifier id tracking of scalars on spill - fix NULL dereference on exceptions - accept function names that contain dots - netfilter: disallow element updates of bound anonymous sets - mptcp: ensure listener is unhashed before updating the sk status - xfrm: - add missed call to delete offloaded policies - fix inbound ipv4/udp/esp packets to UDPv6 dualstack sockets - selftests: fixes for FIPS mode - dsa: mt7530: fix multiple CPU ports, BPDU and LLDP handling - eth: sfc: use budget for TX completions Misc: - wifi: iwlwifi: add support for SO-F device with PCI id 0x7AF0" * tag 'net-6.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (74 commits) revert "net: align SO_RCVMARK required privileges with SO_MARK" net: wwan: iosm: Convert single instance struct member to flexible array sch_netem: acquire qdisc lock in netem_change() selftests: forwarding: Fix race condition in mirror installation wifi: mac80211: report all unusable beacon frames mptcp: ensure listener is unhashed before updating the sk status mptcp: drop legacy code around RX EOF mptcp: consolidate fallback and non fallback state machine mptcp: fix possible list corruption on passive MPJ mptcp: fix possible divide by zero in recvmsg() mptcp: handle correctly disconnect() failures bpf: Force kprobe multi expected_attach_type for kprobe_multi link bpf/btf: Accept function names that contain dots Revert "net: phy: dp83867: perform soft reset and retain established link" net: mdio: fix the wrong parameters netfilter: nf_tables: Fix for deleting base chains with payload netfilter: nfnetlink_osf: fix module autoload netfilter: nf_tables: drop module reference after updating chain netfilter: nf_tables: disallow timeout for anonymous sets netfilter: nf_tables: disallow updates of anonymous sets ...
2023-06-22Merge tag 'nf-23-06-21' of ↵Paolo Abeni1-2/+29
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net This is v3, including a crash fix for patch 01/14. The following patchset contains Netfilter/IPVS fixes for net: 1) Fix UDP segmentation with IPVS tunneled traffic, from Terin Stock. 2) Fix chain binding transaction logic, add a bound flag to rule transactions. Remove incorrect logic in nft_data_hold() and nft_data_release(). 3) Add a NFT_TRANS_PREPARE_ERROR deactivate state to deal with releasing the set/chain as a follow up to 1240eb93f061 ("netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE") 4) Drop map element references from preparation phase instead of set destroy path, otherwise bogus EBUSY with transactions such as: flush chain ip x y delete chain ip x w where chain ip x y contains jump/goto from set elements. 5) Pipapo set type does not regard generation mask from the walk iteration. 6) Fix reference count underflow in set element reference to stateful object. 7) Several patches to tighten the nf_tables API: - disallow set element updates of bound anonymous set - disallow unbound anonymous set/chain at the end of transaction. - disallow updates of anonymous set. - disallow timeout configuration for anonymous sets. 8) Fix module reference leak in chain updates. 9) Fix nfnetlink_osf module autoload. 10) Fix deletion of basechain when NFTA_CHAIN_HOOK is specified as in iptables-nft. This Netfilter batch is larger than usual at this stage, I am aware we are fairly late in the -rc cycle, if you prefer to route them through net-next, please let me know. netfilter pull request 23-06-21 * tag 'nf-23-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: Fix for deleting base chains with payload netfilter: nfnetlink_osf: fix module autoload netfilter: nf_tables: drop module reference after updating chain netfilter: nf_tables: disallow timeout for anonymous sets netfilter: nf_tables: disallow updates of anonymous sets netfilter: nf_tables: reject unbound chain set before commit phase netfilter: nf_tables: reject unbound anonymous set before commit phase netfilter: nf_tables: disallow element updates of bound anonymous sets netfilter: nf_tables: fix underflow in object reference counter netfilter: nft_set_pipapo: .walk does not deal with generations netfilter: nf_tables: drop map element references from preparation phase netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain netfilter: nf_tables: fix chain binding transaction logic ipvs: align inner_mac_header for encapsulation ==================== Link: https://lore.kernel.org/r/20230621100731.68068-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-21Merge tag 'regulator-fix-v6.4-rc7' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fix from Mark Brown: "One simple fix for v6.4, some incorrectly specified bitfield masks in the PCA9450 driver" * tag 'regulator-fix-v6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: regulator: pca9450: Fix LDO3OUT and LDO4OUT MASK
2023-06-21Merge tag 'mm-hotfixes-stable-2023-06-20-12-31' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "19 hotfixes. 8 of these are cc:stable. This includes a wholesale reversion of the post-6.4 series 'make slab shrink lockless'. After input from Dave Chinner it has been decided that we should go a different way [1]" Link: https://lkml.kernel.org/r/ZH6K0McWBeCjaf16@dread.disaster.area [1] * tag 'mm-hotfixes-stable-2023-06-20-12-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: selftests/mm: fix cross compilation with LLVM mailmap: add entries for Ben Dooks nilfs2: prevent general protection fault in nilfs_clear_dirty_page() Revert "mm: vmscan: make global slab shrink lockless" Revert "mm: vmscan: make memcg slab shrink lockless" Revert "mm: vmscan: add shrinker_srcu_generation" Revert "mm: shrinkers: make count and scan in shrinker debugfs lockless" Revert "mm: vmscan: hold write lock to reparent shrinker nr_deferred" Revert "mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()" Revert "mm: shrinkers: convert shrinker_rwsem to mutex" nilfs2: fix buffer corruption due to concurrent device reads scripts/gdb: fix SB_* constants parsing scripts: fix the gfp flags header path in gfp-translate udmabuf: revert 'Add support for mapping hugepages (v4)' mm/khugepaged: fix iteration in collapse_file memfd: check for non-NULL file_seals in memfd_create() syscall mm/vmalloc: do not output a spurious warning when huge vmalloc() fails mm/mprotect: fix do_mprotect_pkey() limit check writeback: fix dereferencing NULL mapping->host on writeback_page_template
2023-06-21Merge tag 'acpi-6.4-rc8' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fix from Rafael Wysocki: "Fix a kernel crash during early resume from ACPI S3 that has been present since the 5.15 cycle when might_sleep() was added to down_timeout(), which in some configurations of the kernel caused an implicit preemption point to trigger at a wrong time" * tag 'acpi-6.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: sleep: Avoid breaking S3 wakeup due to might_sleep()
2023-06-20netfilter: nf_tables: reject unbound anonymous set before commit phasePablo Neira Ayuso1-0/+3
Add a new list to track set transaction and to check for unbound anonymous sets before entering the commit phase. Bail out at the end of the transaction handling if an anonymous set remains unbound. Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-06-20netfilter: nf_tables: drop map element references from preparation phasePablo Neira Ayuso1-1/+4
set .destroy callback releases the references to other objects in maps. This is very late and it results in spurious EBUSY errors. Drop refcount from the preparation phase instead, update set backend not to drop reference counter from set .destroy path. Exceptions: NFT_TRANS_PREPARE_ERROR does not require to drop the reference counter because the transaction abort path releases the map references for each element since the set is unbound. The abort path also deals with releasing reference counter for new elements added to unbound sets. Fixes: 591054469b3e ("netfilter: nf_tables: revisit chain/object refcounting from elements") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-06-20netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chainPablo Neira Ayuso1-0/+2
Add a new state to deal with rule expressions deactivation from the newrule error path, otherwise the anonymous set remains in the list in inactive state for the next generation. Mark the set/chain transaction as unbound so the abort path releases this object, set it as inactive in the next generation so it is not reachable anymore from this transaction and reference counter is dropped. Fixes: 1240eb93f061 ("netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-06-20netfilter: nf_tables: fix chain binding transaction logicPablo Neira Ayuso1-1/+20
Add bound flag to rule and chain transactions as in 6a0a8d10a366 ("netfilter: nf_tables: use-after-free in failing rule with bound set") to skip them in case that the chain is already bound from the abort path. This patch fixes an imbalance in the chain use refcnt that triggers a WARN_ON on the table and chain destroy path. This patch also disallows nested chain bindings, which is not supported from userspace. The logic to deal with chain binding in nft_data_hold() and nft_data_release() is not correct. The NFT_TRANS_PREPARE state needs a special handling in case a chain is bound but next expressions in the same rule fail to initialize as described by 1240eb93f061 ("netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE"). The chain is left bound if rule construction fails, so the objects stored in this chain (and the chain itself) are released by the transaction records from the abort path, follow up patch ("netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain") completes this error handling. When deleting an existing rule, chain bound flag is set off so the rule expression .destroy path releases the objects. Fixes: d0e2c7de92c7 ("netfilter: nf_tables: add NFT_CHAIN_BINDING") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-06-20Merge tag 'ipsec-2023-06-20' of ↵David S. Miller1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec ipsec-2023-06-20
2023-06-20fs: Provide helpers for manipulating sb->s_readonly_remountJan Kara1-1/+1
Provide helpers to set and clear sb->s_readonly_remount including appropriate memory barriers. Also use this opportunity to document what the barriers pair with and why they are needed. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Message-Id: <20230620112832.5158-1-jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-20net: dsa: introduce preferred_default_local_cpu_port and use on MT7530Vladimir Oltean1-0/+8
Since the introduction of the OF bindings, DSA has always had a policy that in case multiple CPU ports are present in the device tree, the numerically smallest one is always chosen. The MT7530 switch family, except the switch on the MT7988 SoC, has 2 CPU ports, 5 and 6, where port 6 is preferable on the MT7531BE switch because it has higher bandwidth. The MT7530 driver developers had 3 options: - to modify DSA when the MT7531 switch support was introduced, such as to prefer the better port - to declare both CPU ports in device trees as CPU ports, and live with the sub-optimal performance resulting from not preferring the better port - to declare just port 6 in the device tree as a CPU port Of course they chose the path of least resistance (3rd option), kicking the can down the road. The hardware description in the device tree is supposed to be stable - developers are not supposed to adopt the strategy of piecemeal hardware description, where the device tree is updated in lockstep with the features that the kernel currently supports. Now, as a result of the fact that they did that, any attempts to modify the device tree and describe both CPU ports as CPU ports would make DSA change its default selection from port 6 to 5, effectively resulting in a performance degradation visible to users with the MT7531BE switch as can be seen below. Without preferring port 6: [ ID][Role] Interval Transfer Bitrate Retr [ 5][TX-C] 0.00-20.00 sec 374 MBytes 157 Mbits/sec 734 sender [ 5][TX-C] 0.00-20.00 sec 373 MBytes 156 Mbits/sec receiver [ 7][RX-C] 0.00-20.00 sec 1.81 GBytes 778 Mbits/sec 0 sender [ 7][RX-C] 0.00-20.00 sec 1.81 GBytes 777 Mbits/sec receiver With preferring port 6: [ ID][Role] Interval Transfer Bitrate Retr [ 5][TX-C] 0.00-20.00 sec 1.99 GBytes 856 Mbits/sec 273 sender [ 5][TX-C] 0.00-20.00 sec 1.99 GBytes 855 Mbits/sec receiver [ 7][RX-C] 0.00-20.00 sec 1.72 GBytes 737 Mbits/sec 15 sender [ 7][RX-C] 0.00-20.00 sec 1.71 GBytes 736 Mbits/sec receiver Using one port for WAN and the other ports for LAN is a very popular use case which is what this test emulates. As such, this change proposes that we retroactively modify stable kernels (which don't support the modification of the CPU port assignments, so as to let user space fix the problem and restore the throughput) to keep the mt7530 driver preferring port 6 even with device trees where the hardware is more fully described. Fixes: c288575f7810 ("net: dsa: mt7530: Add the support of MT7531 switch") Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-20Merge tag 'hyperv-fixes-signed-20230619' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Fix races in Hyper-V PCI controller (Dexuan Cui) - Fix handling of hyperv_pcpu_input_arg (Michael Kelley) - Fix vmbus_wait_for_unload to scan present CPUs (Michael Kelley) - Call hv_synic_free in the failure path of hv_synic_alloc (Dexuan Cui) - Add noop for real mode handlers for virtual trust level code (Saurabh Sengar) * tag 'hyperv-fixes-signed-20230619' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: PCI: hv: Add a per-bus mutex state_lock Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally" PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic PCI: hv: Fix a race condition bug in hv_pci_query_relations() arm64/hyperv: Use CPUHP_AP_HYPERV_ONLINE state to fix CPU online sequencing x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline Drivers: hv: vmbus: Fix vmbus_wait_for_unload() to scan present CPUs Drivers: hv: vmbus: Call hv_synic_free() if hv_synic_alloc() fails x86/hyperv/vtl: Add noop for realmode pointers
2023-06-19writeback: fix dereferencing NULL mapping->host on writeback_page_templateRafael Aquini1-1/+1
When commit 19343b5bdd16 ("mm/page-writeback: introduce tracepoint for wait_on_page_writeback()") repurposed the writeback_dirty_page trace event as a template to create its new wait_on_page_writeback trace event, it ended up opening a window to NULL pointer dereference crashes due to the (infrequent) occurrence of a race where an access to a page in the swap-cache happens concurrently with the moment this page is being written to disk and the tracepoint is enabled: BUG: kernel NULL pointer dereference, address: 0000000000000040 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 800000010ec0a067 P4D 800000010ec0a067 PUD 102353067 PMD 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 PID: 1320 Comm: shmem-worker Kdump: loaded Not tainted 6.4.0-rc5+ #13 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230301gitf80f052277c8-1.fc37 03/01/2023 RIP: 0010:trace_event_raw_event_writeback_folio_template+0x76/0xf0 Code: 4d 85 e4 74 5c 49 8b 3c 24 e8 06 98 ee ff 48 89 c7 e8 9e 8b ee ff ba 20 00 00 00 48 89 ef 48 89 c6 e8 fe d4 1a 00 49 8b 04 24 <48> 8b 40 40 48 89 43 28 49 8b 45 20 48 89 e7 48 89 43 30 e8 a2 4d RSP: 0000:ffffaad580b6fb60 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff90e38035c01c RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e38035c044 RBP: ffff90e38035c024 R08: 0000000000000002 R09: 0000000000000006 R10: ffff90e38035c02e R11: 0000000000000020 R12: ffff90e380bac000 R13: ffffe3a7456d9200 R14: 0000000000001b81 R15: ffffe3a7456d9200 FS: 00007f2e4e8a15c0(0000) GS:ffff90e3fbc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000040 CR3: 00000001150c6003 CR4: 0000000000170ee0 Call Trace: <TASK> ? __die+0x20/0x70 ? page_fault_oops+0x76/0x170 ? kernelmode_fixup_or_oops+0x84/0x110 ? exc_page_fault+0x65/0x150 ? asm_exc_page_fault+0x22/0x30 ? trace_event_raw_event_writeback_folio_template+0x76/0xf0 folio_wait_writeback+0x6b/0x80 shmem_swapin_folio+0x24a/0x500 ? filemap_get_entry+0xe3/0x140 shmem_get_folio_gfp+0x36e/0x7c0 ? find_busiest_group+0x43/0x1a0 shmem_fault+0x76/0x2a0 ? __update_load_avg_cfs_rq+0x281/0x2f0 __do_fault+0x33/0x130 do_read_fault+0x118/0x160 do_pte_missing+0x1ed/0x2a0 __handle_mm_fault+0x566/0x630 handle_mm_fault+0x91/0x210 do_user_addr_fault+0x22c/0x740 exc_page_fault+0x65/0x150 asm_exc_page_fault+0x22/0x30 This problem arises from the fact that the repurposed writeback_dirty_page trace event code was written assuming that every pointer to mapping (struct address_space) would come from a file-mapped page-cache object, thus mapping->host would always be populated, and that was a valid case before commit 19343b5bdd16. The swap-cache address space (swapper_spaces), however, doesn't populate its ->host (struct inode) pointer, thus leading to the crashes in the corner-case aforementioned. commit 19343b5bdd16 ended up breaking the assignment of __entry->name and __entry->ino for the wait_on_page_writeback tracepoint -- both dependent on mapping->host carrying a pointer to a valid inode. The assignment of __entry->name was fixed by commit 68f23b89067f ("memcg: fix a crash in wb_workfn when a device disappears"), and this commit fixes the remaining case, for __entry->ino. Link: https://lkml.kernel.org/r/20230606233613.1290819-1-aquini@redhat.com Fixes: 19343b5bdd16 ("mm/page-writeback: introduce tracepoint for wait_on_page_writeback()") Signed-off-by: Rafael Aquini <aquini@redhat.com> Reviewed-by: Yafang Shao <laoar.shao@gmail.com> Cc: Aristeu Rozanski <aris@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19btrfs: tracepoints: also show actual number of the outstanding extentsNaohiro Aota1-4/+6
The btrfs_inode_mod_outstanding_extents trace event only shows the modified number to the number of outstanding extents. It would be helpful if we can see the resulting extent number as well. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19ovl: enable fsnotify events on underlying real filesAmir Goldstein1-1/+3
Overlayfs creates the real underlying files with fake f_path, whose f_inode is on the underlying fs and f_path on overlayfs. Those real files were open with FMODE_NONOTIFY, because fsnotify code was not prapared to handle fsnotify hooks on files with fake path correctly and fanotify would report unexpected event->fd with fake overlayfs path, when the underlying fs was being watched. Teach fsnotify to handle events on the real files, and do not set real files to FMODE_NONOTIFY to allow operations on real file (e.g. open, access, modify, close) to generate async and permission events. Because fsnotify does not have notifications on address space operations, we do not need to worry about ->vm_file not reporting events to a watched overlayfs when users are accessing a mapped overlayfs file. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <20230615112229.2143178-6-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19fs: use backing_file container for internal files with "fake" f_pathAmir Goldstein1-5/+28
Overlayfs uses open_with_fake_path() to allocate internal kernel files, with a "fake" path - whose f_path is not on the same fs as f_inode. Allocate a container struct backing_file for those internal files, that is used to hold the "fake" ovl path along with the real path. backing_file_real_path() can be used to access the stored real path. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <20230615112229.2143178-5-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19fs: use a helper for opening kernel internal filesAmir Goldstein1-0/+2
cachefiles uses kernel_open_tmpfile() to open kernel internal tmpfile without accounting for nr_files. cachefiles uses open_with_fake_path() for the same reason without the need for a fake path. Fork open_with_fake_path() to kernel_file_open() which only does the noaccount part and use it in cachefiles. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230615112229.2143178-3-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19fs: rename {vfs,kernel}_tmpfile_open()Amir Goldstein1-3/+4
Overlayfs and cachefiles use vfs_open_tmpfile() to open a tmpfile without accounting for nr_files. Rename this helper to kernel_tmpfile_open() to better reflect this helper is used for kernel internal users. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230615112229.2143178-2-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19regulator: pca9450: Fix LDO3OUT and LDO4OUT MASKTeresa Remmet1-2/+2
L3_OUT and L4_OUT Bit fields range from Bit 0:4 and thus the mask should be 0x1F instead of 0x0F. Fixes: 0935ff5f1f0a ("regulator: pca9450: add pca9450 pmic driver") Signed-off-by: Teresa Remmet <t.remmet@phytec.de> Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de> Link: https://lore.kernel.org/r/20230614125240.3946519-1-t.remmet@phytec.de Signed-off-by: Mark Brown <broonie@kernel.org>
2023-06-19gpiolib: Fix irq_domain resource tracking for gpiochip_irqchip_add_domain()Michael Walle1-0/+8
Up until commit 6a45b0e2589f ("gpiolib: Introduce gpiochip_irqchip_add_domain()") all irq_domains were allocated by gpiolib itself and thus gpiolib also takes care of freeing it. With gpiochip_irqchip_add_domain() a user of gpiolib can associate an irq_domain with the gpio_chip. This irq_domain is not managed by gpiolib and therefore must not be freed by gpiolib. Fixes: 6a45b0e2589f ("gpiolib: Introduce gpiochip_irqchip_add_domain()") Reported-by: Jiawen Wu <jiawenwu@trustnetic.com> Signed-off-by: Michael Walle <mwalle@kernel.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Andy Shevchenko <andy@kernel.org> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2023-06-19btrfs: add a btrfs_finish_ordered_extent helperChristoph Hellwig1-0/+29
Add a helper to complete an ordered_extent without first doing a lookup. The tracepoint cannot use the ordered_extent class as we also want to print the range. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-18Merge tag 'ata-6.4-rc7' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata Pull ata fix from Damien Le Moal: - Avoid deadlocks on resume from sleep by delaying scsi rescan until the scsi device is also fully resumed. * tag 'ata-6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata: ata: libata-scsi: Avoid deadlock on rescan after device resume
2023-06-18ata: libata-scsi: Avoid deadlock on rescan after device resumeDamien Le Moal1-1/+1
When an ATA port is resumed from sleep, the port is reset and a power management request issued to libata EH to reset the port and rescanning the device(s) attached to the port. Device rescanning is done by scheduling an ata_scsi_dev_rescan() work, which will execute scsi_rescan_device(). However, scsi_rescan_device() takes the generic device lock, which is also taken by dpm_resume() when the SCSI device is resumed as well. If a device rescan execution starts before the completion of the SCSI device resume, the rcu locking used to refresh the cached VPD pages of the device, combined with the generic device locking from scsi_rescan_device() and from dpm_resume() can cause a deadlock. Avoid this situation by changing struct ata_port scsi_rescan_task to be a delayed work instead of a simple work_struct. ata_scsi_dev_rescan() is modified to check if the SCSI device associated with the ATA device that must be rescanned is not suspended. If the SCSI device is still suspended, ata_scsi_dev_rescan() returns early and reschedule itself for execution after an arbitrary delay of 5ms. Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Reported-by: Joe Breuer <linux-kernel@jmbreuer.net> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217530 Fixes: a19a93e4c6a9 ("scsi: core: pm: Rely on the device driver core for async power management") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Tested-by: Joe Breuer <linux-kernel@jmbreuer.net>
2023-06-18x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offlineMichael Kelley1-0/+1
These commits a494aef23dfc ("PCI: hv: Replace retarget_msi_interrupt_params with hyperv_pcpu_input_arg") 2c6ba4216844 ("PCI: hv: Enable PCI pass-thru devices in Confidential VMs") update the Hyper-V virtual PCI driver to use the hyperv_pcpu_input_arg because that memory will be correctly marked as decrypted or encrypted for all VM types (CoCo or normal). But problems ensue when CPUs in the VM go online or offline after virtual PCI devices have been configured. When a CPU is brought online, the hyperv_pcpu_input_arg for that CPU is initialized by hv_cpu_init() running under state CPUHP_AP_ONLINE_DYN. But this state occurs after state CPUHP_AP_IRQ_AFFINITY_ONLINE, which may call the virtual PCI driver and fault trying to use the as yet uninitialized hyperv_pcpu_input_arg. A similar problem occurs in a CoCo VM if the MMIO read and write hypercalls are used from state CPUHP_AP_IRQ_AFFINITY_ONLINE. When a CPU is taken offline, IRQs may be reassigned in state CPUHP_TEARDOWN_CPU. Again, the virtual PCI driver may fault trying to use the hyperv_pcpu_input_arg that has already been freed by a higher state. Fix the onlining problem by adding state CPUHP_AP_HYPERV_ONLINE immediately after CPUHP_AP_ONLINE_IDLE (similar to CPUHP_AP_KVM_ONLINE) and before CPUHP_AP_IRQ_AFFINITY_ONLINE. Use this new state for Hyper-V initialization so that hyperv_pcpu_input_arg is allocated early enough. Fix the offlining problem by not freeing hyperv_pcpu_input_arg when a CPU goes offline. Retain the allocated memory, and reuse it if the CPU comes back online later. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Dexuan Cui <decui@microsoft.com> Link: https://lore.kernel.org/r/1684862062-51576-1-git-send-email-mikelley@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2023-06-17svcrdma: trace cc_release callsChuck Lever1-0/+8
This event brackets the svcrdma_post_* trace points. If this trace event is enabled but does not appear as expected, that indicates a chunk_ctxt leak. Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-16Merge tag 'urgent-rcu.2023.06.11a' of ↵Linus Torvalds1-0/+10
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU fix from Paul McKenney: "This fixes a spinlock-initialization regression in SRCU that causes the SRCU notifier to fail. The fix simply adds the initialization, but introduces a #ifdef because there is no spinlock to initialize for the Tiny SRCU used in !SMP builds. Yes, it would be nice to abstract this somehow in order to hide it in SRCU, but I still don't see a good way of doing this" * tag 'urgent-rcu.2023.06.11a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: notifier: Initialize new struct srcu_usage field
2023-06-16x86/unwind/orc: Add ELF section with ORC version identifierOmar Sandoval1-0/+3
Commits ffb1b4a41016 ("x86/unwind/orc: Add 'signal' field to ORC metadata") and fb799447ae29 ("x86,objtool: Split UNWIND_HINT_EMPTY in two") changed the ORC format. Although ORC is internal to the kernel, it's the only way for external tools to get reliable kernel stack traces on x86-64. In particular, the drgn debugger [1] uses ORC for stack unwinding, and these format changes broke it [2]. As the drgn maintainer, I don't care how often or how much the kernel changes the ORC format as long as I have a way to detect the change. It suffices to store a version identifier in the vmlinux and kernel module ELF files (to use when parsing ORC sections from ELF), and in kernel memory (to use when parsing ORC from a core dump+symbol table). Rather than hard-coding a version number that needs to be manually bumped, Peterz suggested hashing the definitions from orc_types.h. If there is a format change that isn't caught by this, the hashing script can be updated. This patch adds an .orc_header allocated ELF section containing the 20-byte hash to vmlinux and kernel modules, along with the corresponding __start_orc_header and __stop_orc_header symbols in vmlinux. 1: https://github.com/osandov/drgn 2: https://github.com/osandov/drgn/issues/303 Fixes: ffb1b4a41016 ("x86/unwind/orc: Add 'signal' field to ORC metadata") Fixes: fb799447ae29 ("x86,objtool: Split UNWIND_HINT_EMPTY in two") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lkml.kernel.org/r/aef9c8dc43915b886a8c48509a12ec1b006ca1ca.1686690801.git.osandov@osandov.com
2023-06-16perf/core: Drop __weak attribute from arch_perf_update_userpage() prototypeMarc Zyngier1-3/+3
Reiji reports that the arm64 implementation of arch_perf_update_userpage() is now ignored and replaced by the dummy stub in core code. This seems to happen since the PMUv3 driver was moved to driver/perf. As it turns out, dropping the __weak attribute from the *prototype* of the function solves the problem. You're right, this doesn't seem to make much sense. And yet... It appears that both symbols get flagged as weak, and that the first one to appear in the link order wins: $ nm drivers/perf/arm_pmuv3.o|grep arch_perf_update_userpage 0000000000001db0 W arch_perf_update_userpage Dropping the attribute from the prototype restores the expected behaviour, and arm64 is able to enjoy arch_perf_update_userpage() again. Fixes: 7755cec63ade ("arm64: perf: Move PMUv3 driver to drivers/perf") Fixes: f1ec3a517b43 ("kernel/events: Add a missing prototype for arch_perf_update_userpage()") Reported-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Reiji Watanabe <reijiw@google.com> Link: https://lkml.kernel.org/r/20230616114831.3186980-1-maz@kernel.org
2023-06-16Merge tag 'net-6.4-rc7' of ↵Linus Torvalds4-3/+13
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from wireless, and netfilter. Selftests excluded - we have 58 patches and diff of +442/-199, which isn't really small but perhaps with the exception of the WiFi locking change it's old(ish) bugs. We have no known problems with v6.4. The selftest changes are rather large as MPTCP folks try to apply Greg's guidance that selftest from torvalds/linux should be able to run against stable kernels. Last thing I should call out is the DCCP/UDP-lite deprecation notices. We are fairly sure those are dead, but if we're wrong reverting them back in won't be fun. Current release - regressions: - wifi: - cfg80211: fix double lock bug in reg_wdev_chan_valid() - iwlwifi: mvm: spin_lock_bh() to fix lockdep regression Current release - new code bugs: - handshake: remove fput() that causes use-after-free Previous releases - regressions: - sched: cls_u32: fix reference counter leak leading to overflow - sched: cls_api: fix lockup on flushing explicitly created chain Previous releases - always broken: - nf_tables: integrate pipapo into commit protocol - nf_tables: incorrect error path handling with NFT_MSG_NEWRULE, fix dangling pointer on failure - ping6: fix send to link-local addresses with VRF - sched: act_pedit: parse L3 header for L4 offset, the skb may not have the offset saved - sched: act_ct: fix promotion of offloaded unreplied tuple - sched: refuse to destroy an ingress and clsact Qdiscs if there are lockless change operations in flight - wifi: mac80211: fix handful of bugs in multi-link operation - ipvlan: fix bound dev checking for IPv6 l3s mode - eth: enetc: correct the indexes of highest and 2nd highest TCs - eth: ice: fix XDP memory leak when NIC is brought up and down Misc: - add deprecation notices for UDP-lite and DCCP - selftests: mptcp: skip tests not supported by old kernels - sctp: handle invalid error codes without calling BUG()" * tag 'net-6.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits) dccp: Print deprecation notice. udplite: Print deprecation notice. octeon_ep: Add missing check for ioremap selftests/ptp: Fix timestamp printf format for PTP_SYS_OFFSET net: ethernet: stmicro: stmmac: fix possible memory leak in __stmmac_open net: tipc: resize nlattr array to correct size sfc: fix XDP queues mode with legacy IRQ net: macsec: fix double free of percpu stats net: lapbether: only support ethernet devices MAINTAINERS: add reviewers for SMC Sockets s390/ism: Fix trying to free already-freed IRQ by repeated ism_dev_exit() net: dsa: felix: fix taprio guard band overflow at 10Mbps with jumbo frames net/sched: cls_api: Fix lockup on flushing explicitly created chain ice: Fix ice module unload net/handshake: remove fput() that causes use-after-free selftests: forwarding: hw_stats_l3: Set addrgenmode in a separate step net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs net/sched: act_ct: Fix promotion of offloaded unreplied tuple wifi: iwlwifi: mvm: spin_lock_bh() to fix lockdep regression ...
2023-06-16Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds2-23/+12
Pull rdma fixes from Jason Gunthorpe: "This is an unusually large bunch of bug fixes for the later rc cycle, rxe and mlx5 both dumped a lot of things at once. rxe continues to fix itself, and mlx5 is fixing a bunch of "queue counters" related bugs. There is one highly notable bug fix regarding the qkey. This small security check was missed in the original 2005 implementation and it allows some significant issues. Summary: - Two rtrs bug fixes for error unwind bugs - Several rxe bug fixes: * Incorrect Rx packet validation * Using memory without a refcount * Syzkaller found use before initialization * Regression fix for missing locking with the tasklet conversion from this merge window - Have bnxt report the correct link properties to userspace, this was a regression in v6.3 - Several mlx5 bug fixes: * Kernel crash triggerable by userspace for the RAW ethernet profile * Defend against steering refcounting issues created by userspace * Incorrect change of QP port affinity parameters in some LAG configurations - Fix mlx5 Q counters: * Do not over allocate Q counters to allow userspace to use the full port capacity * Kernel crash triggered by eswitch due to mis-use of Q counters * Incorrect mlx5_device for Q counters in some LAG configurations - Properly implement the IBA spec restricting privileged qkeys to root - Always an error when reading from a disassociated device's event queue - isert bug fixes: * Avoid a deadlock with the CM handler and CM ID destruction * Correct list corruption due to incorrect locking * Fix a use after free around connection tear down" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: RDMA/rxe: Fix rxe_cq_post IB/isert: Fix incorrect release of isert connection IB/isert: Fix possible list corruption in CMA handler IB/isert: Fix dead lock in ib_isert RDMA/mlx5: Fix affinity assignment IB/uverbs: Fix to consider event queue closing also upon non-blocking mode RDMA/uverbs: Restrict usage of privileged QKEYs RDMA/cma: Always set static rate to 0 for RoCE RDMA/mlx5: Fix Q-counters query in LAG mode RDMA/mlx5: Remove vport Q-counters dependency on normal Q-counters RDMA/mlx5: Fix Q-counters per vport allocation RDMA/mlx5: Create an indirect flow table for steering anchor RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions RDMA/rxe: Fix the use-before-initialization error of resp_pkts RDMA/bnxt_re: Fix reporting active_{speed,width} attributes RDMA/rxe: Fix ref count error in check_rkey() RDMA/rxe: Fix packet length checks RDMA/rtrs: Fix rxe_dealloc_pd warning RDMA/rtrs: Fix the last iu->buf leak in err path
2023-06-16Merge tag 'media/v6.4-6' of ↵Linus Torvalds1-5/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media fixes from Mauro Carvalho Chehab: "A fix for dvb-core to avoid a race condition during DVB board registration" * tag 'media/v6.4-6' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: Revert "media: dvb-core: Fix use-after-free on race condition at dvb_frontend"
2023-06-15ACPI: sleep: Avoid breaking S3 wakeup due to might_sleep()Rafael J. Wysocki1-0/+1
The addition of might_sleep() to down_timeout() caused the latter to enable interrupts unconditionally in some cases, which in turn broke the ACPI S3 wakeup path in acpi_suspend_enter(), where down_timeout() is called by acpi_disable_all_gpes() via acpi_ut_acquire_mutex(). Namely, if CONFIG_DEBUG_ATOMIC_SLEEP is set, might_sleep() causes might_resched() to be used and if CONFIG_PREEMPT_VOLUNTARY is set, this triggers __cond_resched() which may call preempt_schedule_common(), so __schedule() gets invoked and it ends up with enabled interrupts (in the prev == next case). Now, enabling interrupts early in the S3 wakeup path causes the kernel to crash. Address this by modifying acpi_suspend_enter() to disable GPEs without attempting to acquire the sleeping lock which is not needed in that code path anyway. Fixes: 99409b935c9a ("locking/semaphore: Add might_sleep() to down_*() family") Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: 5.15+ <stable@vger.kernel.org> # 5.15+
2023-06-15eventfd: add a uapi header for eventfd userspace APIsWen Yang2-5/+12
Create a uapi header include/uapi/linux/eventfd.h, move the associated flags to the uapi header, and include it from linux/eventfd.h. Suggested-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Wen Yang <wenyang.linux@foxmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dylan Yudaken <dylany@fb.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Eric Biggers <ebiggers@google.com> Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <tencent_2B6A999A23E86E522D5D9859D54FFCF9AA05@qq.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-15Revert "media: dvb-core: Fix use-after-free on race condition at dvb_frontend"Mauro Carvalho Chehab1-5/+1
As reported by Thomas Voegtle <tv@lio96.de>, sometimes a DVB card does not initialize properly booting Linux 6.4-rc4. This is not always, maybe in 3 out of 4 attempts. After double-checking, the root cause seems to be related to the UAF fix, which is causing a race issue: [ 26.332149] tda10071 7-0005: found a 'NXP TDA10071' in cold state, will try to load a firmware [ 26.340779] tda10071 7-0005: downloading firmware from file 'dvb-fe-tda10071.fw' [ 989.277402] INFO: task vdr:743 blocked for more than 491 seconds. [ 989.283504] Not tainted 6.4.0-rc5-i5 #249 [ 989.288036] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 989.295860] task:vdr state:D stack:0 pid:743 ppid:711 flags:0x00004002 [ 989.295865] Call Trace: [ 989.295867] <TASK> [ 989.295869] __schedule+0x2ea/0x12d0 [ 989.295877] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 [ 989.295881] schedule+0x57/0xc0 [ 989.295884] schedule_preempt_disabled+0xc/0x20 [ 989.295887] __mutex_lock.isra.16+0x237/0x480 [ 989.295891] ? dvb_get_property.isra.10+0x1bc/0xa50 [ 989.295898] ? dvb_frontend_stop+0x36/0x180 [ 989.338777] dvb_frontend_stop+0x36/0x180 [ 989.338781] dvb_frontend_open+0x2f1/0x470 [ 989.338784] dvb_device_open+0x81/0xf0 [ 989.338804] ? exact_lock+0x20/0x20 [ 989.338808] chrdev_open+0x7f/0x1c0 [ 989.338811] ? generic_permission+0x1a2/0x230 [ 989.338813] ? link_path_walk.part.63+0x340/0x380 [ 989.338815] ? exact_lock+0x20/0x20 [ 989.338817] do_dentry_open+0x18e/0x450 [ 989.374030] path_openat+0xca5/0xe00 [ 989.374031] ? terminate_walk+0xec/0x100 [ 989.374034] ? path_lookupat+0x93/0x140 [ 989.374036] do_filp_open+0xc0/0x140 [ 989.374038] ? __call_rcu_common.constprop.91+0x92/0x240 [ 989.374041] ? __check_object_size+0x147/0x260 [ 989.374043] ? __check_object_size+0x147/0x260 [ 989.374045] ? alloc_fd+0xbb/0x180 [ 989.374048] ? do_sys_openat2+0x243/0x310 [ 989.374050] do_sys_openat2+0x243/0x310 [ 989.374052] do_sys_open+0x52/0x80 [ 989.374055] do_syscall_64+0x5b/0x80 [ 989.421335] ? __task_pid_nr_ns+0x92/0xa0 [ 989.421337] ? syscall_exit_to_user_mode+0x20/0x40 [ 989.421339] ? do_syscall_64+0x67/0x80 [ 989.421341] ? syscall_exit_to_user_mode+0x20/0x40 [ 989.421343] ? do_syscall_64+0x67/0x80 [ 989.421345] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 989.421348] RIP: 0033:0x7fe895d067e3 [ 989.421349] RSP: 002b:00007fff933c2ba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101 [ 989.421351] RAX: ffffffffffffffda RBX: 00007fff933c2c10 RCX: 00007fe895d067e3 [ 989.421352] RDX: 0000000000000802 RSI: 00005594acdce160 RDI: 00000000ffffff9c [ 989.421353] RBP: 0000000000000802 R08: 0000000000000000 R09: 0000000000000000 [ 989.421353] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000001 [ 989.421354] R13: 00007fff933c2ca0 R14: 00000000ffffffff R15: 00007fff933c2c90 [ 989.421355] </TASK> This reverts commit 6769a0b7ee0c3b31e1b22c3fadff2bfb642de23f. Fixes: 6769a0b7ee0c ("media: dvb-core: Fix use-after-free on race condition at dvb_frontend") Link: https://lore.kernel.org/all/da5382ad-09d6-20ac-0d53-611594b30861@lio96.de/ Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2023-06-14fsverity: rework fsverity_get_digest() againEric Biggers1-5/+9
Address several issues with the calling convention and documentation of fsverity_get_digest(): - Make it provide the hash algorithm as either a FS_VERITY_HASH_ALG_* value or HASH_ALGO_* value, at the caller's choice, rather than only a HASH_ALGO_* value as it did before. This allows callers to work with the fsverity native algorithm numbers if they want to. HASH_ALGO_* is what IMA uses, but other users (e.g. overlayfs) should use FS_VERITY_HASH_ALG_* to match fsverity-utils and the fsverity UAPI. - Make it return the digest size so that it doesn't need to be looked up separately. Use the return value for this, since 0 works nicely for the "file doesn't have fsverity enabled" case. This also makes it clear that no other errors are possible. - Rename the 'digest' parameter to 'raw_digest' and clearly document that it is only useful in combination with the algorithm ID. This hopefully clears up a point of confusion. - Export it to modules, since overlayfs will need it for checking the fsverity digests of lowerdata files (https://lore.kernel.org/r/dd294a44e8f401e6b5140029d8355f88748cd8fd.1686565330.git.alexl@redhat.com). Acked-by: Mimi Zohar <zohar@linux.ibm.com> # for the IMA piece Link: https://lore.kernel.org/r/20230612190047.59755-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-06-14net/sched: qdisc_destroy() old ingress and clsact Qdiscs before graftingPeilin Ye1-0/+8
mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for clsact Qdiscs and miniq_egress. Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER requests (thanks Hillf Danton for the hint), when replacing ingress or clsact Qdiscs, for example, the old Qdisc ("@old") could access the same miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"), causing race conditions [1] including a use-after-free bug in mini_qdisc_pair_swap() reported by syzbot: BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573 Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901 ... Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106 print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319 print_report mm/kasan/report.c:430 [inline] kasan_report+0x11c/0x130 mm/kasan/report.c:536 mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573 tcf_chain_head_change_item net/sched/cls_api.c:495 [inline] tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509 tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline] tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline] tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266 ... @old and @new should not affect each other. In other words, @old should never modify miniq_{in,e}gress after @new, and @new should not update @old's RCU state. Fixing without changing sch_api.c turned out to be difficult (please refer to Closes: for discussions). Instead, make sure @new's first call always happen after @old's last call (in {ingress,clsact}_destroy()) has finished: In qdisc_graft(), return -EBUSY if @old has any ongoing filter requests, and call qdisc_destroy() for @old before grafting @new. Introduce qdisc_refcount_dec_if_one() as the counterpart of qdisc_refcount_inc_nz() used for filter requests. Introduce a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check, just like qdisc_put() etc. Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs". [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under TC_H_ROOT (no longer possible after commit c7cfbd115001 ("net/sched: sch_ingress: Only create under TC_H_INGRESS")) on eth0 that has 8 transmission queues: Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then adds a flower filter X to A. Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and b2) to replace A, then adds a flower filter Y to B. Thread 1 A's refcnt Thread 2 RTM_NEWQDISC (A, RTNL-locked) qdisc_create(A) 1 qdisc_graft(A) 9 RTM_NEWTFILTER (X, RTNL-unlocked) __tcf_qdisc_find(A) 10 tcf_chain0_head_change(A) mini_qdisc_pair_swap(A) (1st) | | RTM_NEWQDISC (B, RTNL-locked) RCU sync 2 qdisc_graft(B) | 1 notify_and_destroy(A) | tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-unlocked) qdisc_destroy(A) tcf_chain0_head_change(B) tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd) mini_qdisc_pair_swap(A) (3rd) | ... ... Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets on eth0 will not find filter Y in sch_handle_ingress(). This is just one of the possible consequences of concurrently accessing miniq_{in,e}gress pointers. Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops") Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops") Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/ Cc: Hillf Danton <hdanton@sina.com> Cc: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-14net/sched: act_ct: Fix promotion of offloaded unreplied tuplePaul Blakey1-1/+1
Currently UNREPLIED and UNASSURED connections are added to the nf flow table. This causes the following connection packets to be processed by the flow table which then skips conntrack_in(), and thus such the connections will remain UNREPLIED and UNASSURED even if reply traffic is then seen. Even still, the unoffloaded reply packets are the ones triggering hardware update from new to established state, and if there aren't any to triger an update and/or previous update was missed, hardware can get out of sync with sw and still mark packets as new. Fix the above by: 1) Not skipping conntrack_in() for UNASSURED packets, but still refresh for hardware, as before the cited patch. 2) Try and force a refresh by reply-direction packets that update the hardware rules from new to established state. 3) Remove any bidirectional flows that didn't failed to update in hardware for re-insertion as bidrectional once any new packet arrives. Fixes: 6a9bad0069cf ("net/sched: act_ct: offload UDP NEW connections") Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/1686313379-117663-1-git-send-email-paulb@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-12svcrdma: Revert 2a1e4f21d841 ("svcrdma: Normalize Send page handling")Chuck Lever1-1/+0
Get rid of the completion wait in svc_rdma_sendto(), and release pages in the send completion handler again. A subsequent patch will handle releasing those pages more efficiently. Reverted by hand: patch -R would not apply cleanly. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>