summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-05-02seq_file: Optimize seq_puts()Christophe JAILLET2-3/+14
Most of seq_puts() usages are done with a string literal. In such cases, the length of the string car be computed at compile time in order to save a strlen() call at run-time. seq_putc() or seq_write() can then be used instead. This saves a few cycles. To have an estimation of how often this optimization triggers: $ git grep seq_puts.*\" | wc -l 3436 $ git grep seq_puts.*\".\" | wc -l 84 Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/a8589bffe4830dafcb9111e22acf06603fea7132.1713781332.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christian Brauner <brauner@kernel.org> The output for seq_putc() generation has also be checked and works.
2024-05-02proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operationTyler Hicks (Microsoft)1-22/+20
The following commits loosened the permissions of /proc/<PID>/fdinfo/ directory, as well as the files within it, from 0500 to 0555 while also introducing a PTRACE_MODE_READ check between the current task and <PID>'s task: - commit 7bc3fa0172a4 ("procfs: allow reading fdinfo with PTRACE_MODE_READ") - commit 1927e498aee1 ("procfs: prevent unprivileged processes accessing fdinfo dir") Before those changes, inode based system calls like inotify_add_watch(2) would fail when the current task didn't have sufficient read permissions: [...] lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR|0500, st_size=0, ...}) = 0 inotify_add_watch(64, "/proc/1/task/1/fdinfo", IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE| IN_ONLYDIR|IN_DONT_FOLLOW|IN_EXCL_UNLINK) = -1 EACCES (Permission denied) [...] This matches the documented behavior in the inotify_add_watch(2) man page: ERRORS EACCES Read access to the given file is not permitted. After those changes, inotify_add_watch(2) started succeeding despite the current task not having PTRACE_MODE_READ privileges on the target task: [...] lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0 inotify_add_watch(64, "/proc/1/task/1/fdinfo", IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE| IN_ONLYDIR|IN_DONT_FOLLOW|IN_EXCL_UNLINK) = 1757 openat(AT_FDCWD, "/proc/1/task/1/fdinfo", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 EACCES (Permission denied) [...] This change in behavior broke .NET prior to v7. See the github link below for the v7 commit that inadvertently/quietly (?) fixed .NET after the kernel changes mentioned above. Return to the old behavior by moving the PTRACE_MODE_READ check out of the file .open operation and into the inode .permission operation: [...] lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0 inotify_add_watch(64, "/proc/1/task/1/fdinfo", IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE| IN_ONLYDIR|IN_DONT_FOLLOW|IN_EXCL_UNLINK) = -1 EACCES (Permission denied) [...] Reported-by: Kevin Parsons (Microsoft) <parsonskev@gmail.com> Link: https://github.com/dotnet/runtime/commit/89e5469ac591b82d38510fe7de98346cce74ad4f Link: https://stackoverflow.com/questions/75379065/start-self-contained-net6-build-exe-as-service-on-raspbian-system-unauthorizeda Fixes: 7bc3fa0172a4 ("procfs: allow reading fdinfo with PTRACE_MODE_READ") Cc: stable@vger.kernel.org Cc: Christian Brauner <brauner@kernel.org> Cc: Christian König <christian.koenig@amd.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Hardik Garg <hargar@linux.microsoft.com> Cc: Allen Pais <apais@linux.microsoft.com> Signed-off-by: Tyler Hicks (Microsoft) <code@tyhicks.com> Link: https://lore.kernel.org/r/20240501005646.745089-1-code@tyhicks.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-26fs: Create anon_inode_getfile_fmode()Dawid Osuchowski2-0/+38
Creates an anon_inode_getfile_fmode() function that works similarly to anon_inode_getfile() with the addition of being able to set the fmode member. Signed-off-by: Dawid Osuchowski <linux@osuchow.ski> Link: https://lore.kernel.org/r/20240426075854.4723-1-linux@osuchow.ski Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-24xfs: don't call xfs_file_open from xfs_dir_openChristoph Hellwig1-1/+3
Directories do not support direct I/O and thus no non-blocking direct I/O either. Open code the shutdown check and call to generic_file_open instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240423124608.537794-4-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-24xfs: drop fop_flags for directoriesChristoph Hellwig1-2/+0
Directories have non of the capabilities, so drop the flags. Note that the current state is harmless as no one actually checks for the flags either. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240423124608.537794-3-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-24xfs: fix overly long line in the file_operationsChristoph Hellwig1-4/+4
Re-wrap the newly added fop_flags fields to not go over 80 characters. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240423124608.537794-2-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17Merge patch series 'Fix shmem_rename2 directory offset calculation' of ↵Christian Brauner3-8/+52
https://lore.kernel.org/r/20240415152057.4605-1-cel@kernel.org Pull shmem_rename2() offset fixes from Chuck Lever: The existing code in shmem_rename2() allocates a fresh directory offset value when renaming over an existing destination entry. User space does not expect this behavior. In particular, applications that rename while walking a directory can loop indefinitely because they never reach the end of the directory. * 'Fix shmem_rename2 directory offset calculation' of https://lore.kernel.org/r/20240415152057.4605-1-cel@kernel.org: (3 commits) shmem: Fix shmem_rename2() libfs: Add simple_offset_rename() API libfs: Fix simple_offset_rename_exchange() fs/libfs.c | 55 +++++++++++++++++++++++++++++++++++++++++----- include/linux/fs.h | 2 ++ mm/shmem.c | 3 +-- 3 files changed, 52 insertions(+), 8 deletions(-) Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17shmem: Fix shmem_rename2()Chuck Lever1-0/+9
When renaming onto an existing directory entry, user space expects the replacement entry to have the same directory offset as the original one. Link: https://gitlab.alpinelinux.org/alpine/aports/-/issues/15966 Fixes: a2e459555c5f ("shmem: stable directory offsets") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20240415152057.4605-4-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17libfs: Add simple_offset_rename() APIChuck Lever3-2/+24
I'm about to fix a tmpfs rename bug that requires the use of internal simple_offset helpers that are not available in mm/shmem.c Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20240415152057.4605-3-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17libfs: Fix simple_offset_rename_exchange()Chuck Lever1-6/+19
User space expects the replacement (old) directory entry to have the same directory offset after the rename. Suggested-by: Christian Brauner <brauner@kernel.org> Fixes: a2e459555c5f ("shmem: stable directory offsets") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20240415152057.4605-2-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17jffs2: prevent xattr node from overflowing the eraseblockIlya Denisyev1-0/+3
Add a check to make sure that the requested xattr node size is no larger than the eraseblock minus the cleanmarker. Unlike the usual inode nodes, the xattr nodes aren't split into parts and spread across multiple eraseblocks, which means that a xattr node must not occupy more than one eraseblock. If the requested xattr value is too large, the xattr node can spill onto the next eraseblock, overwriting the nodes and causing errors such as: jffs2: argh. node added in wrong place at 0x0000b050(2) jffs2: nextblock 0x0000a000, expected at 0000b00c jffs2: error: (823) do_verify_xattr_datum: node CRC failed at 0x01e050, read=0xfc892c93, calc=0x000000 jffs2: notice: (823) jffs2_get_inode_nodes: Node header CRC failed at 0x01e00c. {848f,2fc4,0fef511f,59a3d171} jffs2: Node at 0x0000000c with length 0x00001044 would run over the end of the erase block jffs2: Perhaps the file system was created with the wrong erase size? jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00000010: 0x1044 instead This breaks the filesystem and can lead to KASAN crashes such as: BUG: KASAN: slab-out-of-bounds in jffs2_sum_add_kvec+0x125e/0x15d0 Read of size 4 at addr ffff88802c31e914 by task repro/830 CPU: 0 PID: 830 Comm: repro Not tainted 6.9.0-rc3+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0xc6/0x120 print_report+0xc4/0x620 ? __virt_addr_valid+0x308/0x5b0 kasan_report+0xc1/0xf0 ? jffs2_sum_add_kvec+0x125e/0x15d0 ? jffs2_sum_add_kvec+0x125e/0x15d0 jffs2_sum_add_kvec+0x125e/0x15d0 jffs2_flash_direct_writev+0xa8/0xd0 jffs2_flash_writev+0x9c9/0xef0 ? __x64_sys_setxattr+0xc4/0x160 ? do_syscall_64+0x69/0x140 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e [...] Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: aa98d7cf59b5 ("[JFFS2][XATTR] XATTR support on JFFS2 (version. 5)") Signed-off-by: Ilya Denisyev <dev@elkcl.ru> Link: https://lore.kernel.org/r/20240412155357.237803-1-dev@elkcl.ru Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-15vfs, swap: compile out IS_SWAPFILE() on swapless configsAlexey Dobriyan1-0/+6
No swap support -- no swapfiles possible. Signed-off-by: Alexey Dobriyan (Yandex) <adobriyan@gmail.com> Link: https://lore.kernel.org/r/2391c7f5-0f83-4188-ae56-4ec7ccbf2576@p183 Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-13vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirementsLinus Torvalds2-6/+14
"The definition of insanity is doing the same thing over and over again and expecting different results” We've tried to do this before, most recently with commit bb2314b47996 ("fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink") about a decade ago. But the effort goes back even further than that, eg this thread back from 1998 that is so old that we don't even have it archived in lore: https://lkml.org/lkml/1998/3/10/108 which also points out some of the reasons why it's dangerous. Or, how about then in 2003: https://lkml.org/lkml/2003/4/6/112 where we went through some of the same arguments, just wirh different people involved. In particular, having access to a file descriptor does not necessarily mean that you have access to the path that was used for lookup, and there may be very good reasons why you absolutely must not have access to a path to said file. For example, if we were passed a file descriptor from the outside into some limited environment (think chroot, but also user namespaces etc) a 'flink()' system call could now make that file visible inside a context where it's not supposed to be visible. In the process the user may also be able to re-open it with permissions that the original file descriptor did not have (eg a read-only file descriptor may be associated with an underlying file that is writable). Another variation on this is if somebody else (typically root) opens a file in a directory that is not accessible to others, and passes the file descriptor on as a read-only file. Again, the access to the file descriptor does not imply that you should have access to a path to the file in the filesystem. So while we have tried this several times in the past, it never works. The last time we did this, that commit bb2314b47996 quickly got reverted again in commit f0cc6ffb8ce8 (Revert "fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink"), with a note saying "We may re-do this once the whole discussion about the interface is done". Well, the discussion is long done, and didn't come to any resolution. There's no question that 'flink()' would be a useful operation, but it's a dangerous one. However, it does turn out that since 2008 (commit d76b0d9b2d87: "CRED: Use creds in file structs") we have had a fairly straightforward way to check whether the file descriptor was opened by the same credentials as the credentials of the flink(). That allows the most common patterns that people want to use, which tend to be to either open the source carefully (ie using the openat2() RESOLVE_xyz flags, and/or checking ownership with fstat() before linking), or to use O_TMPFILE and fill in the file contents before it's exposed to the world with linkat(). But it also means that if the file descriptor was opened by somebody else, or we've gone through a credentials change since, the operation no longer works (unless we have CAP_DAC_READ_SEARCH capabilities in the opener's user namespace, as before). Note that the credential equality check is done by using pointer equality, which means that it's not enough that you have effectively the same user - they have to be literally identical, since our credentials are using copy-on-write semantics. So you can't change your credentials to something else and try to change it back to the same ones between the open() and the linkat(). This is not meant to be some kind of generic permission check, this is literally meant as a "the open and link calls are 'atomic' wrt user credentials" check. It also means that you can't just move things between namespaces, because the credentials aren't just a list of uid's and gid's: they includes the pointer to the user_ns that the capabilities are relative to. So let's try this one more time and see if maybe this approach ends up being workable after all. Cc: Andrew Lutomirski <luto@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Peter Anvin <hpa@zytor.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240411001012.12513-1-torvalds@linux-foundation.org [brauner: relax capability check to opener of the file] Link: https://lore.kernel.org/all/20231113-undenkbar-gediegen-efde5f1c34bc@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-11fs/direct-io: remove redundant assignment to variable retvalColin Ian King1-1/+0
The variable retval is being assigned a value that is not being read, it is being re-assigned later on in the function. The assignment is redundant and can be removed. Cleans up clang scan build warning: fs/direct-io.c:1220:2: warning: Value stored to 'retval' is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20240410162221.292485-1-colin.i.king@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-09fs/dcache: Re-use value stored to dentry->d_flags instead of re-readinglinke li1-1/+1
Currently, the __d_clear_type_and_inode() writes the value flags to dentry->d_flags, then immediately re-reads it in order to use it in a if statement. This re-read is useless because no other update to dentry->d_flags can occur at this point. This commit therefore re-use flags in the if statement instead of re-reading dentry->d_flags. Signed-off-by: linke li <lilinke99@qq.com> Link: https://lore.kernel.org/r/tencent_5E187BD0A61BA28605E85405F15228254D0A@qq.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-09fs: Add FOP_HUGE_PAGESMatthew Wilcox (Oracle)5-20/+10
Instead of checking for specific file_operations, add a bit to file_operations which denotes a file that only contain hugetlb pages. This lets us make hugetlbfs_file_operations static, and removes is_file_shm_hugepages() completely. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240407201122.3783877-1-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-09orangefs: cleanup uses of strncpyJustin Stitt3-32/+15
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. There is some care taken to ensure these destination buffers are NUL-terminated by bounding the strncpy()'s by ORANGEFS_NAME_MAX - 1 or ORANGEFS_MAX_SERVER_ADDR_LEN - 1. Instead, we can use the new 2-argument version of strscpy() to guarantee NUL-termination on the destination buffers while simplifying the code. Based on usage with printf-likes, we can see these buffers are expected to be NUL-terminated: | gossip_debug(GOSSIP_NAME_DEBUG, | "%s: doing lookup on %s under %pU,%d\n", | __func__, | new_op->upcall.req.lookup.d_name, | &new_op->upcall.req.lookup.parent_refn.khandle, | new_op->upcall.req.lookup.parent_refn.fs_id); ... | gossip_debug(GOSSIP_SUPER_DEBUG, | "Attempting ORANGEFS Remount via host %s\n", | new_op->upcall.req.fs_mount.orangefs_config_server); NUL-padding isn't required for any of these destination buffers as they've all been zero-allocated with op_alloc() or kzalloc(). Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Link: https://lore.kernel.org/r/20240322-strncpy-fs-orangefs-dcache-c-v1-1-15d12debbf38@google.com Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-09nilfs2: fix out-of-range warningArnd Bergmann1-1/+1
clang-14 points out that v_size is always smaller than a 64KB page size if that is configured by the CPU architecture: fs/nilfs2/ioctl.c:63:19: error: result of comparison of constant 65536 with expression of type '__u16' (aka 'unsigned short') is always false [-Werror,-Wtautological-constant-out-of-range-compare] if (argv->v_size > PAGE_SIZE) ~~~~~~~~~~~~ ^ ~~~~~~~~~ This is ok, so just shut up that warning with a cast. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20240328143051.1069575-7-arnd@kernel.org Fixes: 3358b4aaa84f ("nilfs2: fix problems of memory allocation in ioctl") Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-07fs: use bit shifts for FMODE_* flagsChristian Brauner1-26/+31
Make it easier to see what bits are still available. Link: https://lore.kernel.org/r/20240406061604.GA538574@ZenIV Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-07fs: claw back a few FMODE_* bitsChristian Brauner12-28/+37
There's a bunch of flags that are purely based on what the file operations support while also never being conditionally set or unset. IOW, they're not subject to change for individual files. Imho, such flags don't need to live in f_mode they might as well live in the fops structs itself. And the fops struct already has that lonely mmap_supported_flags member. We might as well turn that into a generic fop_flags member and move a few flags from FMODE_* space into FOP_* space. That gets us four FMODE_* bits back and the ability for new static flags that are about file ops to not have to live in FMODE_* space but in their own FOP_* space. It's not the most beautiful thing ever but it gets the job done. Yes, there'll be an additional pointer chase but hopefully that won't matter for these flags. I suspect there's a few more we can move into there and that we can also redirect a bunch of new flag suggestions that follow this pattern into the fop_flags field instead of f_mode. Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05fs: Annotate struct file_handle with __counted_by() and use struct_size()Gustavo A. R. Silva2-4/+4
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). While there, use struct_size() helper, instead of the open-coded version. [brauner@kernel.org: contains a fix by Edward for an OOB access] Reported-by: syzbot+4139435cb1b34cf759c2@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Link: https://lore.kernel.org/r/tencent_A7845DD769577306D813742365E976E3A205@qq.com Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/ZgImCXTdGDTeBvSS@neat Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05Merge patch series 'fs: aio: more folio conversion' of ↵Christian Brauner1-44/+47
https://lore.kernel.org/r/20240321131640.948634-1-wangkefeng.wang@huawei.com Pull aio folio conversion from Kefeng Wang: Convert to use folio throughout aio. * series 'fs: aio: more folio conversion' of https://lore.kernel.org/r/20240321131640.948634-1-wangkefeng.wang@huawei.com: (3 commits) fs: aio: convert to ring_folios and internal_folios fs: aio: use a folio in aio_free_ring() fs: aio: use a folio in aio_setup_ring() Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05fs: aio: convert to ring_folios and internal_foliosKefeng Wang1-31/+31
Since aio use folios in most functions, convert ring/internal_pages to ring/internal_folios, let's directly use folio instead of page throughout aio to remove hidden calls to compound_head(), eg, flush_dcache_page(). Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Link: https://lore.kernel.org/r/20240321131640.948634-4-wangkefeng.wang@huawei.com Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05ecryptfs: Fix buffer size for tag 66 packetBrian Kubisiak1-1/+3
The 'TAG 66 Packet Format' description is missing the cipher code and checksum fields that are packed into the message packet. As a result, the buffer allocated for the packet is 3 bytes too small and write_tag_66_packet() will write up to 3 bytes past the end of the buffer. Fix this by increasing the size of the allocation so the whole packet will always fit in the buffer. This fixes the below kasan slab-out-of-bounds bug: BUG: KASAN: slab-out-of-bounds in ecryptfs_generate_key_packet_set+0x7d6/0xde0 Write of size 1 at addr ffff88800afbb2a5 by task touch/181 CPU: 0 PID: 181 Comm: touch Not tainted 6.6.13-gnu #1 4c9534092be820851bb687b82d1f92a426598dc6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2/GNU Guix 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x4c/0x70 print_report+0xc5/0x610 ? ecryptfs_generate_key_packet_set+0x7d6/0xde0 ? kasan_complete_mode_report_info+0x44/0x210 ? ecryptfs_generate_key_packet_set+0x7d6/0xde0 kasan_report+0xc2/0x110 ? ecryptfs_generate_key_packet_set+0x7d6/0xde0 __asan_store1+0x62/0x80 ecryptfs_generate_key_packet_set+0x7d6/0xde0 ? __pfx_ecryptfs_generate_key_packet_set+0x10/0x10 ? __alloc_pages+0x2e2/0x540 ? __pfx_ovl_open+0x10/0x10 [overlay 30837f11141636a8e1793533a02e6e2e885dad1d] ? dentry_open+0x8f/0xd0 ecryptfs_write_metadata+0x30a/0x550 ? __pfx_ecryptfs_write_metadata+0x10/0x10 ? ecryptfs_get_lower_file+0x6b/0x190 ecryptfs_initialize_file+0x77/0x150 ecryptfs_create+0x1c2/0x2f0 path_openat+0x17cf/0x1ba0 ? __pfx_path_openat+0x10/0x10 do_filp_open+0x15e/0x290 ? __pfx_do_filp_open+0x10/0x10 ? __kasan_check_write+0x18/0x30 ? _raw_spin_lock+0x86/0xf0 ? __pfx__raw_spin_lock+0x10/0x10 ? __kasan_check_write+0x18/0x30 ? alloc_fd+0xf4/0x330 do_sys_openat2+0x122/0x160 ? __pfx_do_sys_openat2+0x10/0x10 __x64_sys_openat+0xef/0x170 ? __pfx___x64_sys_openat+0x10/0x10 do_syscall_64+0x60/0xd0 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7f00a703fd67 Code: 25 00 00 41 00 3d 00 00 41 00 74 37 64 8b 04 25 18 00 00 00 85 c0 75 5b 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 85 00 00 00 48 83 c4 68 5d 41 5c c3 0f 1f RSP: 002b:00007ffc088e30b0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 RAX: ffffffffffffffda RBX: 00007ffc088e3368 RCX: 00007f00a703fd67 RDX: 0000000000000941 RSI: 00007ffc088e48d7 RDI: 00000000ffffff9c RBP: 00007ffc088e48d7 R08: 0000000000000001 R09: 0000000000000000 R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941 R13: 0000000000000000 R14: 00007ffc088e48d7 R15: 00007f00a7180040 </TASK> Allocated by task 181: kasan_save_stack+0x2f/0x60 kasan_set_track+0x29/0x40 kasan_save_alloc_info+0x25/0x40 __kasan_kmalloc+0xc5/0xd0 __kmalloc+0x66/0x160 ecryptfs_generate_key_packet_set+0x6d2/0xde0 ecryptfs_write_metadata+0x30a/0x550 ecryptfs_initialize_file+0x77/0x150 ecryptfs_create+0x1c2/0x2f0 path_openat+0x17cf/0x1ba0 do_filp_open+0x15e/0x290 do_sys_openat2+0x122/0x160 __x64_sys_openat+0xef/0x170 do_syscall_64+0x60/0xd0 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Fixes: dddfa461fc89 ("[PATCH] eCryptfs: Public key; packet management") Signed-off-by: Brian Kubisiak <brian@kubisiak.com> Link: https://lore.kernel.org/r/5j2q56p6qkhezva6b2yuqfrsurmvrrqtxxzrnp3wqu7xrz22i7@hoecdztoplbl Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05fs: aio: use a folio in aio_free_ring()Kefeng Wang1-6/+7
Use a folio throughout aio_free_ring() to remove calls to compound_head(), also move pr_debug after folio check to remove unnecessary print. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Link: https://lore.kernel.org/r/20240321131640.948634-3-wangkefeng.wang@huawei.com Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05Merge series 'Fixes and cleanups to fs-writeback' of ↵Christian Brauner1-24/+33
https://lore.kernel.org/r/20240228091958.288260-1-shikemeng@huaweicloud.com Pull writeback fixes and cleanups from Kemeng Shi: This contains several fixes and cleanups for the writeback code. * series 'Fixes and cleanups to fs-writeback' of https://lore.kernel.org/r/20240228091958.288260-1-shikemeng@huaweicloud.com: (6 commits) fs/writeback: remove unnecessary return in writeback_inodes_sb fs/writeback: correct comment of __wakeup_flusher_threads_bdi fs/writeback: only calculate dirtied_before when b_io is empty fs/writeback: remove unused parameter wb of finish_writeback_work fs/writeback: bail out if there is no more inodes for IO and queued once fs/writeback: avoid to writeback non-expired inode in kupdate writeback Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05fs: aio: use a folio in aio_setup_ring()Kefeng Wang1-9/+11
Use a folio throughout aio_setup_ring() to remove calls to compound_head(), also use folio_end_read() to simultaneously mark the folio uptodate and unlock it. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Link: https://lore.kernel.org/r/20240321131640.948634-2-wangkefeng.wang@huawei.com Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05fs/writeback: remove unnecessary return in writeback_inodes_sbKemeng Shi1-1/+1
writeback_inodes_sb doesn't have return value, just remove unnecessary return in it. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240228091958.288260-7-shikemeng@huaweicloud.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05fs/writeback: correct comment of __wakeup_flusher_threads_bdiKemeng Shi1-2/+1
Commit e8e8a0c6c9bfc ("writeback: move nr_pages == 0 logic to one location") removed parameter nr_pages of __wakeup_flusher_threads_bdi and we try to writeback all dirty pages in __wakeup_flusher_threads_bdi now. Just correct stale comment. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240228091958.288260-6-shikemeng@huaweicloud.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05fs/writeback: only calculate dirtied_before when b_io is emptyKemeng Shi1-12/+13
The dirtied_before is only used when b_io is not empty, so only calculate when b_io is not empty. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240228091958.288260-5-shikemeng@huaweicloud.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05fs/writeback: remove unused parameter wb of finish_writeback_workKemeng Shi1-4/+3
Remove unused parameter wb of finish_writeback_work. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240228091958.288260-4-shikemeng@huaweicloud.com Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05fs/writeback: bail out if there is no more inodes for IO and queued onceKemeng Shi1-2/+5
For case there is no more inodes for IO in io list from last wb_writeback, We may bail out early even there is inode in dirty list should be written back. Only bail out when we queued once to avoid missing dirtied inode. This is from code reading... Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240228091958.288260-3-shikemeng@huaweicloud.com Reviewed-by: Jan Kara <jack@suse.cz> [brauner@kernel.org: fold in memory corruption fix from Jan in [1]] Link: https://lore.kernel.org/r/20240405132346.bid7gibby3lxxhez@quack3 [1] Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26fs: Add kernel-doc comments to proc_create_net_data_write()Yang Li1-0/+1
This commit adds kernel-doc style comments with complete parameter descriptions for the function proc_create_net_data_write. Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20240315073805.77463-1-yang.lee@linux.alibaba.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26fs/writeback: avoid to writeback non-expired inode in kupdate writebackKemeng Shi1-3/+10
In kupdate writeback, only expired inode (have been dirty for longer than dirty_expire_interval) is supposed to be written back. However, kupdate writeback will writeback non-expired inode left in b_io or b_more_io from last wb_writeback. As a result, writeback will keep being triggered unexpected when we keep dirtying pages even dirty memory is under threshold and inode is not expired. To be more specific: Assume dirty background threshold is > 1G and dirty_expire_centisecs is > 60s. When we running fio -size=1G -invalidate=0 -ioengine=libaio --time_based -runtime=60... (keep dirtying), the writeback will keep being triggered as following: wb_workfn wb_do_writeback wb_check_background_flush /* * Wb dirty background threshold starts at 0 if device was idle and * grows up when bandwidth of wb is updated. So a background * writeback is triggered. */ wb_over_bg_thresh /* * Dirtied inode will be written back and added to b_more_io list * after slice used up (because we keep dirtying the inode). */ wb_writeback Writeback is triggered per dirty_writeback_centisecs as following: wb_workfn wb_do_writeback wb_check_old_data_flush /* * Write back inode left in b_io and b_more_io from last wb_writeback * even the inode is non-expired and it will be added to b_more_io * again as slice will be used up (because we keep dirtying the * inode) */ wb_writeback Fix this by moving non-expired inode to dirty list instead of more io list for kupdate writeback in requeue_inode. Test as following: /* make it more easier to observe the issue */ echo 300000 > /proc/sys/vm/dirty_expire_centisecs echo 100 > /proc/sys/vm/dirty_writeback_centisecs /* create a idle device */ mkfs.ext4 -F /dev/vdb mount /dev/vdb /bdi1/ /* run buffer write with fio */ fio -name test -filename=/bdi1/file -size=800M -ioengine=libaio -bs=4K \ -iodepth=1 -rw=write -direct=0 --time_based -runtime=60 -invalidate=0 Fio result before fix (run three tests): 1360MB/s 1329MB/s 1455MB/s Fio result after fix (run three tests): 1737MB/s 1729MB/s 1789MB/s Writeback for non-expired inode is gone as expeted. Observe this with trace writeback_start and writeback_written as following: echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_start/enab echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_written/enable cat /sys/kernel/tracing/trace_pipe Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240228091958.288260-2-shikemeng@huaweicloud.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26fs_parser: move fsparam_string_empty() helper into headerLuis Henriques (SUSE)3-8/+4
Since both ext4 and overlayfs define the same macro to specify string parameters that may allow empty values, define it in an header file so that this helper can be shared. Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Link: https://lore.kernel.org/r/20240312104757.27333-1-luis.henriques@linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26statx: stx_subvolKent Overstreet5-1/+11
Add a new statx field for (sub)volume identifiers, as implemented by btrfs and bcachefs. This includes bcachefs support; we'll definitely want btrfs support as well. Link: https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20240308022914.196982-1-kent.overstreet@linux.dev Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-25Linux 6.9-rc1Linus Torvalds1-2/+2
2024-03-24Merge tag 'efi-fixes-for-v6.9-2' of ↵Linus Torvalds4-2/+14
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: - Fix logic that is supposed to prevent placement of the kernel image below LOAD_PHYSICAL_ADDR - Use the firmware stack in the EFI stub when running in mixed mode - Clear BSS only once when using mixed mode - Check efi.get_variable() function pointer for NULL before trying to call it * tag 'efi-fixes-for-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efi: fix panic in kdump kernel x86/efistub: Don't clear BSS twice in mixed mode x86/efistub: Call mixed mode boot services on the firmware's stack efi/libstub: fix efi_random_alloc() to allocate memory at alloc_min or higher address
2024-03-24Merge tag 'x86-urgent-2024-03-24' of ↵Linus Torvalds15-89/+80
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: - Ensure that the encryption mask at boot is properly propagated on 5-level page tables, otherwise the PGD entry is incorrectly set to non-encrypted, which causes system crashes during boot. - Undo the deferred 5-level page table setup as it cannot work with memory encryption enabled. - Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset to the default value but the cached variable is not, so subsequent comparisons might yield the wrong result and as a consequence the result prevents updating the MSR. - Register the local APIC address only once in the MPPARSE enumeration to prevent triggering the related WARN_ONs() in the APIC and topology code. - Handle the case where no APIC is found gracefully by registering a fake APIC in the topology code. That makes all related topology functions work correctly and does not affect the actual APIC driver code at all. - Don't evaluate logical IDs during early boot as the local APIC IDs are not yet enumerated and the invoked function returns an error code. Nothing requires the logical IDs before the final CPUID enumeration takes place, which happens after the enumeration. - Cure the fallout of the per CPU rework on UP which misplaced the copying of boot_cpu_data to per CPU data so that the final update to boot_cpu_data got lost which caused inconsistent state and boot crashes. - Use copy_from_kernel_nofault() in the kprobes setup as there is no guarantee that the address can be safely accessed. - Reorder struct members in struct saved_context to work around another kmemleak false positive - Remove the buggy code which tries to update the E820 kexec table for setup_data as that is never passed to the kexec kernel. - Update the resource control documentation to use the proper units. - Fix a Kconfig warning observed with tinyconfig * tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot/64: Move 5-level paging global variable assignments back x86/boot/64: Apply encryption mask to 5-level pagetable update x86/cpu: Add model number for another Intel Arrow Lake mobile processor x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD Documentation/x86: Document that resctrl bandwidth control units are MiB x86/mpparse: Register APIC address only once x86/topology: Handle the !APIC case gracefully x86/topology: Don't evaluate logical IDs during early boot x86/cpu: Ensure that CPU info updates are propagated on UP kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address x86/pm: Work around false positive kmemleak report in msr_build_context() x86/kexec: Do not update E820 kexec table for setup_data x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'
2024-03-24Merge tag 'sched-urgent-2024-03-24' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler doc clarification from Thomas Gleixner: "A single update for the documentation of the base_slice_ns tunable to clarify that any value which is less than the tick slice has no effect because the scheduler tick is not guaranteed to happen within the set time slice" * tag 'sched-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/doc: Update documentation for base_slice_ns and CONFIG_HZ relation
2024-03-24Merge tag 'dma-mapping-6.9-2024-03-24' of ↵Linus Torvalds2-12/+42
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: "This has a set of swiotlb alignment fixes for sometimes very long standing bugs from Will. We've been discussion them for a while and they should be solid now" * tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: Reinstate page-alignment for mappings >= PAGE_SIZE iommu/dma: Force swiotlb_max_mapping_size on an untrusted device swiotlb: Fix alignment checks when both allocation and DMA masks are present swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc() swiotlb: Enforce page alignment in swiotlb_alloc() swiotlb: Fix double-allocation of slots due to broken alignment handling
2024-03-24efi: fix panic in kdump kernelOleksandr Tymoshenko1-0/+2
Check if get_next_variable() is actually valid pointer before calling it. In kdump kernel this method is set to NULL that causes panic during the kexec-ed kernel boot. Tested with QEMU and OVMF firmware. Fixes: bad267f9e18f ("efi: verify that variable services are supported") Signed-off-by: Oleksandr Tymoshenko <ovt@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2024-03-24x86/efistub: Don't clear BSS twice in mixed modeArd Biesheuvel1-1/+2
Clearing BSS should only be done once, at the very beginning. efi_pe_entry() is the entrypoint from the firmware, which may not clear BSS and so it is done explicitly. However, efi_pe_entry() is also used as an entrypoint by the mixed mode startup code, in which case BSS will already have been cleared, and doing it again at this point will corrupt global variables holding the firmware's GDT/IDT and segment selectors. So make the memset() conditional on whether the EFI stub is running in native mode. Fixes: b3810c5a2cc4a666 ("x86/efistub: Clear decompressor BSS in native EFI entrypoint") Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2024-03-24x86/efistub: Call mixed mode boot services on the firmware's stackArd Biesheuvel1-0/+9
Normally, the EFI stub calls into the EFI boot services using the stack that was live when the stub was entered. According to the UEFI spec, this stack needs to be at least 128k in size - this might seem large but all asynchronous processing and event handling in EFI runs from the same stack and so quite a lot of space may be used in practice. In mixed mode, the situation is a bit different: the bootloader calls the 32-bit EFI stub entry point, which calls the decompressor's 32-bit entry point, where the boot stack is set up, using a fixed allocation of 16k. This stack is still in use when the EFI stub is started in 64-bit mode, and so all calls back into the EFI firmware will be using the decompressor's limited boot stack. Due to the placement of the boot stack right after the boot heap, any stack overruns have gone unnoticed. However, commit 5c4feadb0011983b ("x86/decompressor: Move global symbol references to C code") moved the definition of the boot heap into C code, and now the boot stack is placed right at the base of BSS, where any overruns will corrupt the end of the .data section. While it would be possible to work around this by increasing the size of the boot stack, doing so would affect all x86 systems, and mixed mode systems are a tiny (and shrinking) fraction of the x86 installed base. So instead, record the firmware stack pointer value when entering from the 32-bit firmware, and switch to this stack every time a EFI boot service call is made. Cc: <stable@kernel.org> # v6.1+ Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2024-03-24x86/boot/64: Move 5-level paging global variable assignments backTom Lendacky1-9/+7
Commit 63bed9660420 ("x86/startup_64: Defer assignment of 5-level paging global variables") moved assignment of 5-level global variables to later in the boot in order to avoid having to use RIP relative addressing in order to set them. However, when running with 5-level paging and SME active (mem_encrypt=on), the variables are needed as part of the page table setup needed to encrypt the kernel (using pgd_none(), p4d_offset(), etc.). Since the variables haven't been set, the page table manipulation is done as if 4-level paging is active, causing the system to crash on boot. While only a subset of the assignments that were moved need to be set early, move all of the assignments back into check_la57_support() so that these assignments aren't spread between two locations. Instead of just reverting the fix, this uses the new RIP_REL_REF() macro when assigning the variables. Fixes: 63bed9660420 ("x86/startup_64: Defer assignment of 5-level paging global variables") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/2ca419f4d0de719926fd82353f6751f717590a86.1711122067.git.thomas.lendacky@amd.com
2024-03-24x86/boot/64: Apply encryption mask to 5-level pagetable updateTom Lendacky1-1/+1
When running with 5-level page tables, the kernel mapping PGD entry is updated to point to the P4D table. The assignment uses _PAGE_TABLE_NOENC, which, when SME is active (mem_encrypt=on), results in a page table entry without the encryption mask set, causing the system to crash on boot. Change the assignment to use _PAGE_TABLE instead of _PAGE_TABLE_NOENC so that the encryption mask is set for the PGD entry. Fixes: 533568e06b15 ("x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/8f20345cda7dbba2cf748b286e1bc00816fe649a.1711122067.git.thomas.lendacky@amd.com
2024-03-24x86/cpu: Add model number for another Intel Arrow Lake mobile processorTony Luck1-0/+1
This one is the regular laptop CPU. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240322161725.195614-1-tony.luck@intel.com
2024-03-24x86/fpu: Keep xfd_state in sync with MSR_IA32_XFDAdamos Ttofari2-6/+13
Commit 672365477ae8 ("x86/fpu: Update XFD state where required") and commit 8bf26758ca96 ("x86/fpu: Add XFD state to fpstate") introduced a per CPU variable xfd_state to keep the MSR_IA32_XFD value cached, in order to avoid unnecessary writes to the MSR. On CPU hotplug MSR_IA32_XFD is reset to the init_fpstate.xfd, which wipes out any stale state. But the per CPU cached xfd value is not reset, which brings them out of sync. As a consequence a subsequent xfd_update_state() might fail to update the MSR which in turn can result in XRSTOR raising a #NM in kernel space, which crashes the kernel. To fix this, introduce xfd_set_state() to write xfd_state together with MSR_IA32_XFD, and use it in all places that set MSR_IA32_XFD. Fixes: 672365477ae8 ("x86/fpu: Update XFD state where required") Signed-off-by: Adamos Ttofari <attofari@amazon.de> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240322230439.456571-1-chang.seok.bae@intel.com Closes: https://lore.kernel.org/lkml/20230511152818.13839-1-attofari@amazon.de
2024-03-24Documentation/x86: Document that resctrl bandwidth control units are MiBTony Luck1-4/+4
The memory bandwidth software controller uses 2^20 units rather than 10^6. See mbm_bw_count() which computes bandwidth using the "SZ_1M" Linux define for 0x00100000. Update the documentation to use MiB when describing this feature. It's too late to fix the mount option "mba_MBps" as that is now an established user interface. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240322182016.196544-1-tony.luck@intel.com
2024-03-24Merge tag 'timers-urgent-2024-03-23' of ↵Linus Torvalds2-3/+20
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Two regression fixes for the timer and timer migration code: - Prevent endless timer requeuing which is caused by two CPUs racing out of idle. This happens when the last CPU goes idle and therefore has to ensure to expire the pending global timers and some other CPU come out of idle at the same time and the other CPU wins the race and expires the global queue. This causes the last CPU to chase ghost timers forever and reprogramming it's clockevent device endlessly. Cure this by re-evaluating the wakeup time unconditionally. - The split into local (pinned) and global timers in the timer wheel caused a regression for NOHZ full as it broke the idle tracking of global timers. On NOHZ full this prevents an self IPI being sent which in turn causes the timer to be not programmed and not being expired on time. Restore the idle tracking for the global timer base so that the self IPI condition for NOHZ full is working correctly again" * tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Fix removed self-IPI on global timer's enqueue in nohz_full timers/migration: Fix endless timer requeue after idle interrupts