summaryrefslogtreecommitdiff
path: root/include/uapi/asm-generic/unistd.h
AgeCommit message (Collapse)AuthorFilesLines
2024-06-25syscalls: fix compat_sys_io_pgetevents_time64 usageArnd Bergmann1-1/+1
Using sys_io_pgetevents() as the entry point for compat mode tasks works almost correctly, but misses the sign extension for the min_nr and nr arguments. This was addressed on parisc by switching to compat_sys_io_pgetevents_time64() in commit 6431e92fc827 ("parisc: io_pgetevents_time64() needs compat syscall in 32-bit compat mode"), as well as by using more sophisticated system call wrappers on x86 and s390. However, arm64, mips, powerpc, sparc and riscv still have the same bug. Change all of them over to use compat_sys_io_pgetevents_time64() like parisc already does. This was clearly the intention when the function was originally added, but it got hooked up incorrectly in the tables. Cc: stable@vger.kernel.org Fixes: 48166e6ea47d ("y2038: add 64-bit time_t syscalls to all 32-bit architectures") Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-05-24mseal: wire up mseal syscallJeff Xu1-1/+4
Patch series "Introduce mseal", v10. This patchset proposes a new mseal() syscall for the Linux kernel. In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits. Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type. Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case. Two system calls are involved in sealing the map: mmap() and mseal(). The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API. Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, the lifetime of those mappings are tied to the lifetime of the process. Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). The lifetime of those mappings are not tied to the lifetime of the process, therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED). However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity. Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections. In closing, I would like to formally acknowledge the valuable contributions received during the RFC process, which were instrumental in shaping this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Liam R. Howlett: perf optimization. Linus Torvalds: assisting in defining system call signature and scope. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. MM perf benchmarks ================== This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to check the VMAs’ sealing flag, so that no partial update can be made, when any segment within the given memory range is sealed. To measure the performance impact of this loop, two tests are developed. [8] The first is measuring the time taken for a particular system call, by using clock_gettime(CLOCK_MONOTONIC). The second is using PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have similar results. The tests have roughly below sequence: for (i = 0; i < 1000, i++) create 1000 mappings (1 page per VMA) start the sampling for (j = 0; j < 1000, j++) mprotect one mapping stop and save the sample delete 1000 mappings calculates all samples. Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz, 4G memory, Chromebook. Based on the latest upstream code: The first test (measuring time) syscall__ vmas t t_mseal delta_ns per_vma % munmap__ 1 909 944 35 35 104% munmap__ 2 1398 1502 104 52 107% munmap__ 4 2444 2594 149 37 106% munmap__ 8 4029 4323 293 37 107% munmap__ 16 6647 6935 288 18 104% munmap__ 32 11811 12398 587 18 105% mprotect 1 439 465 26 26 106% mprotect 2 1659 1745 86 43 105% mprotect 4 3747 3889 142 36 104% mprotect 8 6755 6969 215 27 103% mprotect 16 13748 14144 396 25 103% mprotect 32 27827 28969 1142 36 104% madvise_ 1 240 262 22 22 109% madvise_ 2 366 442 76 38 121% madvise_ 4 623 751 128 32 121% madvise_ 8 1110 1324 215 27 119% madvise_ 16 2127 2451 324 20 115% madvise_ 32 4109 4642 534 17 113% The second test (measuring cpu cycle) syscall__ vmas cpu cmseal delta_cpu per_vma % munmap__ 1 1790 1890 100 100 106% munmap__ 2 2819 3033 214 107 108% munmap__ 4 4959 5271 312 78 106% munmap__ 8 8262 8745 483 60 106% munmap__ 16 13099 14116 1017 64 108% munmap__ 32 23221 24785 1565 49 107% mprotect 1 906 967 62 62 107% mprotect 2 3019 3203 184 92 106% mprotect 4 6149 6569 420 105 107% mprotect 8 9978 10524 545 68 105% mprotect 16 20448 21427 979 61 105% mprotect 32 40972 42935 1963 61 105% madvise_ 1 434 497 63 63 115% madvise_ 2 752 899 147 74 120% madvise_ 4 1313 1513 200 50 115% madvise_ 8 2271 2627 356 44 116% madvise_ 16 4312 4883 571 36 113% madvise_ 32 8376 9319 943 29 111% Based on the result, for 6.8 kernel, sealing check adds 20-40 nano seconds, or around 50-100 CPU cycles, per VMA. In addition, I applied the sealing to 5.10 kernel: The first test (measuring time) syscall__ vmas t tmseal delta_ns per_vma % munmap__ 1 357 390 33 33 109% munmap__ 2 442 463 21 11 105% munmap__ 4 614 634 20 5 103% munmap__ 8 1017 1137 120 15 112% munmap__ 16 1889 2153 263 16 114% munmap__ 32 4109 4088 -21 -1 99% mprotect 1 235 227 -7 -7 97% mprotect 2 495 464 -30 -15 94% mprotect 4 741 764 24 6 103% mprotect 8 1434 1437 2 0 100% mprotect 16 2958 2991 33 2 101% mprotect 32 6431 6608 177 6 103% madvise_ 1 191 208 16 16 109% madvise_ 2 300 324 24 12 108% madvise_ 4 450 473 23 6 105% madvise_ 8 753 806 53 7 107% madvise_ 16 1467 1592 125 8 108% madvise_ 32 2795 3405 610 19 122% The second test (measuring cpu cycle) syscall__ nbr_vma cpu cmseal delta_cpu per_vma % munmap__ 1 684 715 31 31 105% munmap__ 2 861 898 38 19 104% munmap__ 4 1183 1235 51 13 104% munmap__ 8 1999 2045 46 6 102% munmap__ 16 3839 3816 -23 -1 99% munmap__ 32 7672 7887 216 7 103% mprotect 1 397 443 46 46 112% mprotect 2 738 788 50 25 107% mprotect 4 1221 1256 35 9 103% mprotect 8 2356 2429 72 9 103% mprotect 16 4961 4935 -26 -2 99% mprotect 32 9882 10172 291 9 103% madvise_ 1 351 380 29 29 108% madvise_ 2 565 615 49 25 109% madvise_ 4 872 933 61 15 107% madvise_ 8 1508 1640 132 16 109% madvise_ 16 3078 3323 245 15 108% madvise_ 32 5893 6704 811 25 114% For 5.10 kernel, sealing check adds 0-15 ns in time, or 10-30 CPU cycles, there is even decrease in some cases. It might be interesting to compare 5.10 and 6.8 kernel The first test (measuring time) syscall__ vmas t_5_10 t_6_8 delta_ns per_vma % munmap__ 1 357 909 552 552 254% munmap__ 2 442 1398 956 478 316% munmap__ 4 614 2444 1830 458 398% munmap__ 8 1017 4029 3012 377 396% munmap__ 16 1889 6647 4758 297 352% munmap__ 32 4109 11811 7702 241 287% mprotect 1 235 439 204 204 187% mprotect 2 495 1659 1164 582 335% mprotect 4 741 3747 3006 752 506% mprotect 8 1434 6755 5320 665 471% mprotect 16 2958 13748 10790 674 465% mprotect 32 6431 27827 21397 669 433% madvise_ 1 191 240 49 49 125% madvise_ 2 300 366 67 33 122% madvise_ 4 450 623 173 43 138% madvise_ 8 753 1110 357 45 147% madvise_ 16 1467 2127 660 41 145% madvise_ 32 2795 4109 1314 41 147% The second test (measuring cpu cycle) syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma % munmap__ 1 684 1790 1106 1106 262% munmap__ 2 861 2819 1958 979 327% munmap__ 4 1183 4959 3776 944 419% munmap__ 8 1999 8262 6263 783 413% munmap__ 16 3839 13099 9260 579 341% munmap__ 32 7672 23221 15549 486 303% mprotect 1 397 906 509 509 228% mprotect 2 738 3019 2281 1140 409% mprotect 4 1221 6149 4929 1232 504% mprotect 8 2356 9978 7622 953 423% mprotect 16 4961 20448 15487 968 412% mprotect 32 9882 40972 31091 972 415% madvise_ 1 351 434 82 82 123% madvise_ 2 565 752 186 93 133% madvise_ 4 872 1313 442 110 151% madvise_ 8 1508 2271 763 95 151% madvise_ 16 3078 4312 1234 77 140% madvise_ 32 5893 8376 2483 78 142% From 5.10 to 6.8 munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma. mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma. madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma. In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the increase from 5.10 to 6.8 is significantly larger, approximately ten times greater for munmap and mprotect. When I discuss the mm performance with Brian Makin, an engineer who worked on performance, it was brought to my attention that such performance benchmarks, which measuring millions of mm syscall in a tight loop, may not accurately reflect real-world scenarios, such as that of a database service. Also this is tested using a single HW and ChromeOS, the data from another HW or distribution might be different. It might be best to take this data with a grain of salt. This patch (of 5): Wire up mseal syscall for all architectures. Link: https://lkml.kernel.org/r/20240415163527.626541-1-jeffxu@chromium.org Link: https://lkml.kernel.org/r/20240415163527.626541-2-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> [Bug #2] Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-09Merge tag 'lsm-pr-20240105' of ↵Linus Torvalds1-1/+8
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull security module updates from Paul Moore: - Add three new syscalls: lsm_list_modules(), lsm_get_self_attr(), and lsm_set_self_attr(). The first syscall simply lists the LSMs enabled, while the second and third get and set the current process' LSM attributes. Yes, these syscalls may provide similar functionality to what can be found under /proc or /sys, but they were designed to support multiple, simultaneaous (stacked) LSMs from the start as opposed to the current /proc based solutions which were created at a time when only one LSM was allowed to be active at a given time. We have spent considerable time discussing ways to extend the existing /proc interfaces to support multiple, simultaneaous LSMs and even our best ideas have been far too ugly to support as a kernel API; after +20 years in the kernel, I felt the LSM layer had established itself enough to justify a handful of syscalls. Support amongst the individual LSM developers has been nearly unanimous, with a single objection coming from Tetsuo (TOMOYO) as he is worried that the LSM_ID_XXX token concept will make it more difficult for out-of-tree LSMs to survive. Several members of the LSM community have demonstrated the ability for out-of-tree LSMs to continue to exist by picking high/unused LSM_ID values as well as pointing out that many kernel APIs rely on integer identifiers, e.g. syscalls (!), but unfortunately Tetsuo's objections remain. My personal opinion is that while I have no interest in penalizing out-of-tree LSMs, I'm not going to penalize in-tree development to support out-of-tree development, and I view this as a necessary step forward to support the push for expanded LSM stacking and reduce our reliance on /proc and /sys which has occassionally been problematic for some container users. Finally, we have included the linux-api folks on (all?) recent revisions of the patchset and addressed all of their concerns. - Add a new security_file_ioctl_compat() LSM hook to handle the 32-bit ioctls on 64-bit systems problem. This patch includes support for all of the existing LSMs which provide ioctl hooks, although it turns out only SELinux actually cares about the individual ioctls. It is worth noting that while Casey (Smack) and Tetsuo (TOMOYO) did not give explicit ACKs to this patch, they did both indicate they are okay with the changes. - Fix a potential memory leak in the CALIPSO code when IPv6 is disabled at boot. While it's good that we are fixing this, I doubt this is something users are seeing in the wild as you need to both disable IPv6 and then attempt to configure IPv6 labeled networking via NetLabel/CALIPSO; that just doesn't make much sense. Normally this would go through netdev, but Jakub asked me to take this patch and of all the trees I maintain, the LSM tree seemed like the best fit. - Update the LSM MAINTAINERS entry with additional information about our process docs, patchwork, bug reporting, etc. I also noticed that the Lockdown LSM is missing a dedicated MAINTAINERS entry so I've added that to the pull request. I've been working with one of the major Lockdown authors/contributors to see if they are willing to step up and assume a Lockdown maintainer role; hopefully that will happen soon, but in the meantime I'll continue to look after it. - Add a handful of mailmap entries for Serge Hallyn and myself. * tag 'lsm-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (27 commits) lsm: new security_file_ioctl_compat() hook lsm: Add a __counted_by() annotation to lsm_ctx.ctx calipso: fix memory leak in netlbl_calipso_add_pass() selftests: remove the LSM_ID_IMA check in lsm/lsm_list_modules_test MAINTAINERS: add an entry for the lockdown LSM MAINTAINERS: update the LSM entry mailmap: add entries for Serge Hallyn's dead accounts mailmap: update/replace my old email addresses lsm: mark the lsm_id variables are marked as static lsm: convert security_setselfattr() to use memdup_user() lsm: align based on pointer length in lsm_fill_user_ctx() lsm: consolidate buffer size handling into lsm_fill_user_ctx() lsm: correct error codes in security_getselfattr() lsm: cleanup the size counters in security_getselfattr() lsm: don't yet account for IMA in LSM_CONFIG_COUNT calculation lsm: drop LSM_ID_IMA LSM: selftests for Linux Security Module syscalls SELinux: Add selfattr hooks AppArmor: Add selfattr hooks Smack: implement setselfattr and getselfattr hooks ...
2023-12-14wire up syscalls for statmount/listmountMiklos Szeredi1-1/+7
Wire up all archs. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20231025140205.3586473-7-mszeredi@redhat.com Reviewed-by: Ian Kent <raven@themaw.net> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-13LSM: wireup Linux Security Module syscallsCasey Schaufler1-1/+8
Wireup lsm_get_self_attr, lsm_set_self_attr and lsm_list_modules system calls. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: linux-api@vger.kernel.org Reviewed-by: Mickaël Salaün <mic@digikod.net> [PM: forward ported beyond v6.6 due merge window changes] Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-11-02Merge tag 'asm-generic-6.7' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull ia64 removal and asm-generic updates from Arnd Bergmann: - The ia64 architecture gets its well-earned retirement as planned, now that there is one last (mostly) working release that will be maintained as an LTS kernel. - The architecture specific system call tables are updated for the added map_shadow_stack() syscall and to remove references to the long-gone sys_lookup_dcookie() syscall. * tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: hexagon: Remove unusable symbols from the ptrace.h uapi asm-generic: Fix spelling of architecture arch: Reserve map_shadow_stack() syscall number for all architectures syscalls: Cleanup references to sys_lookup_dcookie() Documentation: Drop or replace remaining mentions of IA64 lib/raid6: Drop IA64 support Documentation: Drop IA64 from feature descriptions kernel: Drop IA64 support from sig_fault handlers arch: Remove Itanium (IA-64) architecture
2023-10-06arch: Reserve map_shadow_stack() syscall number for all architecturesSohil Mehta1-1/+4
commit c35559f94ebc ("x86/shstk: Introduce map_shadow_stack syscall") recently added support for map_shadow_stack() but it is limited to x86 only for now. There is a possibility that other architectures (namely, arm64 and RISC-V), that are implementing equivalent support for shadow stacks, might need to add support for it. Independent of that, reserving arch-specific syscall numbers in the syscall tables of all architectures is good practice and would help avoid future conflicts. map_shadow_stack() is marked as a conditional syscall in sys_ni.c. Adding it to the syscall tables of other architectures is harmless and would return ENOSYS when exercised. Note, map_shadow_stack() was assigned #453 during the merge process since #452 was taken by fchmodat2(). For Powerpc, map it to sys_ni_syscall() as is the norm for Powerpc syscall tables. For Alpha, map_shadow_stack() takes up #563 as Alpha still diverges from the common syscall numbering system in the other architectures. Link: https://lore.kernel.org/lkml/20230515212255.GA562920@debug.ba.rivosinc.com/ Link: https://lore.kernel.org/lkml/b402b80b-a7c6-4ef0-b977-c0f5f582b78a@sirena.org.uk/ Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-10-03syscalls: Cleanup references to sys_lookup_dcookie()Sohil Mehta1-1/+1
commit 'be65de6b03aa ("fs: Remove dcookies support")' removed the syscall definition for lookup_dcookie. However, syscall tables still point to the old sys_lookup_dcookie() definition. Update syscall tables of all architectures to directly point to sys_ni_syscall() instead. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> # for perf Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-09-21futex: Add sys_futex_requeue()peterz@infradead.org1-1/+3
Finish off the 'simple' futex2 syscall group by adding sys_futex_requeue(). Unlike sys_futex_{wait,wake}() its arguments are too numerous to fit into a regular syscall. As such, use struct futex_waitv to pass the 'source' and 'destination' futexes to the syscall. This syscall implements what was previously known as FUTEX_CMP_REQUEUE and uses {val, uaddr, flags} for source and {uaddr, flags} for destination. This design explicitly allows requeueing between different types of futex by having a different flags word per uaddr. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20230921105248.511860556@noisy.programming.kicks-ass.net
2023-09-21futex: Add sys_futex_wait()peterz@infradead.org1-1/+3
To complement sys_futex_waitv()/wake(), add sys_futex_wait(). This syscall implements what was previously known as FUTEX_WAIT_BITSET except it uses 'unsigned long' for the value and bitmask arguments, takes timespec and clockid_t arguments for the absolute timeout and uses FUTEX2 flags. The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20230921105248.164324363@noisy.programming.kicks-ass.net
2023-09-21futex: Add sys_futex_wake()peterz@infradead.org1-1/+3
To complement sys_futex_waitv() add sys_futex_wake(). This syscall implements what was previously known as FUTEX_WAKE_BITSET except it uses 'unsigned long' for the bitmask and takes FUTEX2 flags. The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20230921105247.936205525@noisy.programming.kicks-ass.net
2023-07-27arch: Register fchmodat2, usually as syscall 452Palmer Dabbelt1-1/+4
This registers the new fchmodat2 syscall in most places as nuber 452, with alpha being the exception where it's 562. I found all these sites by grepping for fspick, which I assume has found me everything. Signed-off-by: Palmer Dabbelt <palmer@sifive.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Message-Id: <a677d521f048e4ca439e7080a5328f21eb8e960e.1689092120.git.legion@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-06Merge tag 'asm-generic-6.5' of ↵Linus Torvalds1-99/+30
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull asm-generic updates from Arnd Bergmann: "These are cleanups for architecture specific header files: - the comments in include/linux/syscalls.h have gone out of sync and are really pointless, so these get removed - The asm/bitsperlong.h header no longer needs to be architecture specific on modern compilers, so use a generic version for newer architectures that use new enough userspace compilers - A cleanup for virt_to_pfn/virt_to_bus to have proper type checking, forcing the use of pointers" * tag 'asm-generic-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: syscalls: Remove file path comments from headers tools arch: Remove uapi bitsperlong.h of hexagon and microblaze asm-generic: Unify uapi bitsperlong.h for arm64, riscv and loongarch m68k/mm: Make pfn accessors static inlines arm64: memory: Make virt_to_pfn() a static inline ARM: mm: Make virt_to_pfn() a static inline asm-generic/page.h: Make pfn accessors static inlines xen/netback: Pass (void *) to virt_to_page() netfs: Pass a pointer to virt_to_page() cifs: Pass a pointer to virt_to_page() in cifsglob cifs: Pass a pointer to virt_to_page() riscv: mm: init: Pass a pointer to virt_to_page() ARC: init: Pass a pointer to virt_to_pfn() in init m68k: Pass a pointer to virt_to_pfn() virt_to_page() fs/proc/kcore.c: Pass a pointer to virt_addr_valid()
2023-06-22syscalls: Remove file path comments from headersSohil Mehta1-99/+30
Source file locations for syscall definitions can change over a period of time. File paths in comments get stale and are hard to maintain long term. Also, their usefulness is questionable since it would be easier to locate a syscall definition using the SYSCALL_DEFINEx() macro. Remove all source file path comments from the syscall headers. Also, equalize the uneven line spacing (some of which is introduced due to the deletions). Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-06-10cachestat: implement cachestat syscallNhat Pham1-1/+4
There is currently no good way to query the page cache state of large file sets and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really doesn not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in the following thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This patch implements a new syscall that queries cache state of a file and summarizes the number of cached pages, number of dirty pages, number of pages marked for writeback, number of (recently) evicted pages, etc. in a given range. Currently, the syscall is only wired in for x86 architecture. NAME cachestat - query the page cache statistics of a file. SYNOPSIS #include <sys/mman.h> struct cachestat_range { __u64 off; __u64 len; }; struct cachestat { __u64 nr_cache; __u64 nr_dirty; __u64 nr_writeback; __u64 nr_evicted; __u64 nr_recently_evicted; }; int cachestat(unsigned int fd, struct cachestat_range *cstat_range, struct cachestat *cstat, unsigned int flags); DESCRIPTION cachestat() queries the number of cached pages, number of dirty pages, number of pages marked for writeback, number of evicted pages, number of recently evicted pages, in the bytes range given by `off` and `len`. An evicted page is a page that is previously in the page cache but has been evicted since. A page is recently evicted if its last eviction was recent enough that its reentry to the cache would indicate that it is actively being used by the system, and that there is memory pressure on the system. These values are returned in a cachestat struct, whose address is given by the `cstat` argument. The `off` and `len` arguments must be non-negative integers. If `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` == 0, we will query in the range from `off` to the end of the file. The `flags` argument is unused for now, but is included for future extensibility. User should pass 0 (i.e no flag specified). Currently, hugetlbfs is not supported. Because the status of a page can change after cachestat() checks it but before it returns to the application, the returned values may contain stale information. RETURN VALUE On success, cachestat returns 0. On error, -1 is returned, and errno is set to indicate the error. ERRORS EFAULT cstat or cstat_args points to an invalid address. EINVAL invalid flags. EBADF invalid file descriptor. EOPNOTSUPP file descriptor is of a hugetlbfs file [nphamcs@gmail.com: replace rounddown logic with the existing helper] Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-26syscalls: compat: Fix the missing part for __SYSCALL_COMPATGuo Ren1-2/+2
Make "uapi asm unistd.h" could be used for architectures' COMPAT mode. The __SYSCALL_COMPAT is first used in riscv. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Heiko Stuebner <heiko@sntech.de> Link: https://lore.kernel.org/r/20220405071314.3225832-8-guoren@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-15mm/mempolicy: wire up syscall set_mempolicy_home_nodeAneesh Kumar K.V1-1/+4
Link: https://lkml.kernel.org/r/20211202123810.267175-4-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: <linux-api@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-07futex: Implement sys_futex_waitv()André Almeida1-1/+4
Add support to wait on multiple futexes. This is the interface implemented by this syscall: futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes, unsigned int flags, struct timespec *timeout, clockid_t clockid) struct futex_waitv { __u64 val; __u64 uaddr; __u32 flags; __u32 __reserved; }; Given an array of struct futex_waitv, wait on each uaddr. The thread wakes if a futex_wake() is performed at any uaddr. The syscall returns immediately if any waiter has *uaddr != val. *timeout is an optional absolute timeout value for the operation. This syscall supports only 64bit sized timeout structs. The flags argument of the syscall should be empty, but it can be used for future extensions. Flags for shared futexes, sizes, etc. should be used on the individual flags of each waiter. __reserved is used for explicit padding and should be 0, but it might be used for future extensions. If the userspace uses 32-bit pointers, it should make sure to explicitly cast it when assigning to waitv::uaddr. Returns the array index of one of the woken futexes. There’s no given information of how many were woken, or any particular attribute of it (if it’s the first woken, if it is of the smaller index...). Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210923171111.300673-17-andrealmeid@collabora.com
2021-09-09compat: remove some compat entry pointsArnd Bergmann1-5/+5
These are all handled correctly when calling the native system call entry point, so remove the special cases. Link: https://lkml.kernel.org/r/20210727144859.4150043-6-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: wire up syscall process_mreleaseSuren Baghdasaryan1-1/+3
Split off from prev patch in the series that implements the syscall. Link: https://lkml.kernel.org/r/20210809185259.405936-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08arch, mm: wire up memfd_secret system call where relevantMike Rapoport1-1/+6
Wire up memfd_secret system call on architectures that define ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86. Link: https://lkml.kernel.org/r/20210518072034.31572-7-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christopher Lameter <cl@linux.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <jejb@linux.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tycho Andersen <tycho@tycho.ws> Cc: Will Deacon <will@kernel.org> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-07quota: Change quotactl_path() systcall to an fd-based oneJan Kara1-2/+2
Some users have pointed out that path-based syscalls are problematic in some environments and at least directory fd argument and possibly also resolve flags are desirable for such syscalls. Rather than reimplementing all details of pathname lookup and following where it may eventually evolve, let's go for full file descriptor based syscall similar to how ioctl(2) works since the beginning. Managing of quotas isn't performance sensitive so the extra overhead of open does not matter and we are able to consume O_PATH descriptors as well which makes open cheap anyway. Also for frequent operations (such as retrieving usage information for all users) we can reuse single fd and in fact get even better performance as well as avoiding races with possible remounts etc. Tested-by: Sascha Hauer <s.hauer@pengutronix.de> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2021-05-02Merge tag 'landlock_v34' of ↵Linus Torvalds1-1/+8
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull Landlock LSM from James Morris: "Add Landlock, a new LSM from Mickaël Salaün. Briefly, Landlock provides for unprivileged application sandboxing. From Mickaël's cover letter: "The goal of Landlock is to enable to restrict ambient rights (e.g. global filesystem access) for a set of processes. Because Landlock is a stackable LSM [1], it makes possible to create safe security sandboxes as new security layers in addition to the existing system-wide access-controls. This kind of sandbox is expected to help mitigate the security impact of bugs or unexpected/malicious behaviors in user-space applications. Landlock empowers any process, including unprivileged ones, to securely restrict themselves. Landlock is inspired by seccomp-bpf but instead of filtering syscalls and their raw arguments, a Landlock rule can restrict the use of kernel objects like file hierarchies, according to the kernel semantic. Landlock also takes inspiration from other OS sandbox mechanisms: XNU Sandbox, FreeBSD Capsicum or OpenBSD Pledge/Unveil. In this current form, Landlock misses some access-control features. This enables to minimize this patch series and ease review. This series still addresses multiple use cases, especially with the combined use of seccomp-bpf: applications with built-in sandboxing, init systems, security sandbox tools and security-oriented APIs [2]" The cover letter and v34 posting is here: https://lore.kernel.org/linux-security-module/20210422154123.13086-1-mic@digikod.net/ See also: https://landlock.io/ This code has had extensive design discussion and review over several years" Link: https://lore.kernel.org/lkml/50db058a-7dde-441b-a7f9-f6837fe8b69f@schaufler-ca.com/ [1] Link: https://lore.kernel.org/lkml/f646e1c7-33cf-333f-070c-0a40ad0468cd@digikod.net/ [2] * tag 'landlock_v34' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: landlock: Enable user space to infer supported features landlock: Add user and kernel documentation samples/landlock: Add a sandbox manager example selftests/landlock: Add user space tests landlock: Add syscall implementations arch: Wire up Landlock syscalls fs,security: Add sb_delete hook landlock: Support filesystem access-control LSM: Infrastructure management of the superblock landlock: Add ptrace restrictions landlock: Set up the security framework and manage credentials landlock: Add ruleset and domain management landlock: Add object management
2021-04-22arch: Wire up Landlock syscallsMickaël Salaün1-1/+7
Wire up the following system calls for all architectures: * landlock_create_ruleset(2) * landlock_add_rule(2) * landlock_restrict_self(2) Cc: Arnd Bergmann <arnd@arndb.de> Cc: James Morris <jmorris@namei.org> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Serge E. Hallyn <serge@hallyn.com> Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com> Link: https://lore.kernel.org/r/20210422154123.13086-10-mic@digikod.net Signed-off-by: James Morris <jamorris@linux.microsoft.com>
2021-03-17quota: wire up quotactl_pathSascha Hauer1-1/+3
Wire up the quotactl_path syscall added in the previous patch. Link: https://lore.kernel.org/r/20210304123541.30749-3-s.hauer@pengutronix.de Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2021-01-24fs: add mount_setattr()Christian Brauner1-1/+3
This implements the missing mount_setattr() syscall. While the new mount api allows to change the properties of a superblock there is currently no way to change the properties of a mount or a mount tree using file descriptors which the new mount api is based on. In addition the old mount api has the restriction that mount options cannot be applied recursively. This hasn't changed since changing mount options on a per-mount basis was implemented in [1] and has been a frequent request not just for convenience but also for security reasons. The legacy mount syscall is unable to accommodate this behavior without introducing a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND | MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost mount. Changing MS_REC to apply to the whole mount tree would mean introducing a significant uapi change and would likely cause significant regressions. The new mount_setattr() syscall allows to recursively clear and set mount options in one shot. Multiple calls to change mount options requesting the same changes are idempotent: int mount_setattr(int dfd, const char *path, unsigned flags, struct mount_attr *uattr, size_t usize); Flags to modify path resolution behavior are specified in the @flags argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW, and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to restrict path resolution as introduced with openat2() might be supported in the future. The mount_setattr() syscall can be expected to grow over time and is designed with extensibility in mind. It follows the extensible syscall pattern we have used with other syscalls such as openat2(), clone3(), sched_{set,get}attr(), and others. The set of mount options is passed in the uapi struct mount_attr which currently has the following layout: struct mount_attr { __u64 attr_set; __u64 attr_clr; __u64 propagation; __u64 userns_fd; }; The @attr_set and @attr_clr members are used to clear and set mount options. This way a user can e.g. request that a set of flags is to be raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in @attr_set while at the same time requesting that another set of flags is to be lowered such as removing noexec from a mount tree by specifying MOUNT_ATTR_NOEXEC in @attr_clr. Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0, not a bitmap, users wanting to transition to a different atime setting cannot simply specify the atime setting in @attr_set, but must also specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in @attr_clr. The @propagation field lets callers specify the propagation type of a mount tree. Propagation is a single property that has four different settings and as such is not really a flag argument but an enum. Specifically, it would be unclear what setting and clearing propagation settings in combination would amount to. The legacy mount() syscall thus forbids the combination of multiple propagation settings too. The goal is to keep the semantics of mount propagation somewhat simple as they are overly complex as it is. The @userns_fd field lets user specify a user namespace whose idmapping becomes the idmapping of the mount. This is implemented and explained in detail in the next patch. [1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount") Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-api@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-12-19epoll: wire up syscall epoll_pwait2Willem de Bruijn1-1/+3
Split off from prev patch in the series that implements the syscall. Link: https://lkml.kernel.org/r/20201121144401.3727659-4-willemdebruijn.kernel@gmail.com Signed-off-by: Willem de Bruijn <willemb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-13syscalls: Fix file comments for syscalls implemented in kernel/sys.cTal Zussman1-1/+1
The relevant syscalls were previously moved from kernel/timer.c to kernel/sys.c, but the comments weren't updated to reflect this change. Fixing these comments messes up the alphabetical ordering of syscalls by filename. This could be fixed by merging the two groups of kernel/sys.c syscalls, but that would require reordering the syscalls and renumbering them to maintain the numerical order in unistd.h. Signed-off-by: Tal Zussman <tz2294@columbia.edu> Link: https://lore.kernel.org/r/20201112215657.GA4539@charmander' Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2020-10-18mm/madvise: introduce process_madvise() syscall: an external memory hinting APIMinchan Kim1-1/+3
There is usecase that System Management Software(SMS) want to give a memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the case of Android, it is the ActivityManagerService. The information required to make the reclaim decision is not known to the app. Instead, it is known to the centralized userspace daemon(ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the issue, this patch introduces a new syscall process_madvise(2). It uses pidfd of an external process to give the hint. It also supports vector address range because Android app has thousands of vmas due to zygote so it's totally waste of CPU and power if we should call the syscall one by one for each vma.(With testing 2000-vma syscall vs 1-vector syscall, it showed 15% performance improvement. I think it would be bigger in real practice because the testing ran very cache friendly environment). Another potential use case for the vector range is to amortize the cost ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could benefit users like TCP receive zerocopy and malloc implementations. In future, we could find more usecases for other advises so let's make it happens as API since we introduce a new syscall at this moment. With that, existing madvise(2) user could replace it with process_madvise(2) with their own pid if they want to have batch address ranges support feature. ince it could affect other process's address range, only privileged process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same UID) gives it the right to ptrace the process could use it successfully. The flag argument is reserved for future use if we need to extend the API. I think supporting all hints madvise has/will supported/support to process_madvise is rather risky. Because we are not sure all hints make sense from external process and implementation for the hint may rely on the caller being in the current context so it could be error-prone. Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. If someone want to add other hints, we could hear the usecase and review it for each hint. It's safer for maintenance rather than introducing a buggy syscall but hard to fix it later. So finally, the API is as follows, ssize_t process_madvise(int pidfd, const struct iovec *iovec, unsigned long vlen, int advice, unsigned int flags); DESCRIPTION The process_madvise() system call is used to give advice or directions to the kernel about the address ranges from external process as well as local process. It provides the advice to address ranges of process described by iovec and vlen. The goal of such advice is to improve system or application performance. The pidfd selects the process referred to by the PID file descriptor specified in pidfd. (See pidofd_open(2) for further information) The pointer iovec points to an array of iovec structures, defined in <sys/uio.h> as: struct iovec { void *iov_base; /* starting address */ size_t iov_len; /* number of bytes to be advised */ }; The iovec describes address ranges beginning at address(iov_base) and with size length of bytes(iov_len). The vlen represents the number of elements in iovec. The advice is indicated in the advice argument, which is one of the following at this moment if the target process specified by pidfd is external. MADV_COLD MADV_PAGEOUT Permission to provide a hint to external process is governed by a ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2). The process_madvise supports every advice madvise(2) has if target process is in same thread group with calling process so user could use process_madvise(2) to extend existing madvise(2) to support vector address ranges. RETURN VALUE On success, process_madvise() returns the number of bytes advised. This return value may be less than the total number of requested bytes, if an error occurred. The caller should check return value to determine whether a partial advice occurred. FAQ: Q.1 - Why does any external entity have better knowledge? Quote from Sandeep "For Android, every application (including the special SystemServer) are forked from Zygote. The reason of course is to share as many libraries and classes between the two as possible to benefit from the preloading during boot. After applications start, (almost) all of the APIs end up calling into this SystemServer process over IPC (binder) and back to the application. In a fully running system, the SystemServer monitors every single process periodically to calculate their PSS / RSS and also decides which process is "important" to the user for interactivity. So, because of how these processes start _and_ the fact that the SystemServer is looping to monitor each process, it does tend to *know* which address range of the application is not used / useful. Besides, we can never rely on applications to clean things up themselves. We've had the "hey app1, the system is low on memory, please trim your memory usage down" notifications for a long time[1]. They rely on applications honoring the broadcasts and very few do. So, if we want to avoid the inevitable killing of the application and restarting it, some way to be able to tell the OS about unimportant memory in these applications will be useful. - ssp Q.2 - How to guarantee the race(i.e., object validation) between when giving a hint from an external process and get the hint from the target process? process_madvise operates on the target process's address space as it exists at the instant that process_madvise is called. If the space target process can run between the time the process_madvise process inspects the target process address space and the time that process_madvise is actually called, process_madvise may operate on memory regions that the calling process does not expect. It's the responsibility of the process calling process_madvise to close this race condition. For example, the calling process can suspend the target process with ptrace, SIGSTOP, or the freezer cgroup so that it doesn't have an opportunity to change its own address space before process_madvise is called. Another option is to operate on memory regions that the caller knows a priori will be unchanged in the target process. Yet another option is to accept the race for certain process_madvise calls after reasoning that mistargeting will do no harm. The suggested API itself does not provide synchronization. It also apply other APIs like move_pages, process_vm_write. The race isn't really a problem though. Why is it so wrong to require that callers do their own synchronization in some manner? Nobody objects to write(2) merely because it's possible for two processes to open the same file and clobber each other's writes --- instead, we tell people to use flock or something. Think about mmap. It never guarantees newly allocated address space is still valid when the user tries to access it because other threads could unmap the memory right before. That's where we need synchronization by using other API or design from userside. It shouldn't be part of API itself. If someone needs more fine-grained synchronization rather than process level, there were two ideas suggested - cookie[2] and anon-fd[3]. Both are applicable via using last reserved argument of the API but I don't think it's necessary right now since we have already ways to prevent the race so don't want to add additional complexity with more fine-grained optimization model. To make the API extend, it reserved an unsigned long as last argument so we could support it in future if someone really needs it. Q.3 - Why doesn't ptrace work? Injecting an madvise in the target process using ptrace would not work for us because such injected madvise would have to be executed by the target process, which means that process would have to be runnable and that creates the risk of the abovementioned race and hinting a wrong VMA. Furthermore, we want to act the hint in caller's context, not the callee's, because the callee is usually limited in cpuset/cgroups or even freezed state so they can't act by themselves quick enough, which causes more thrashing/kill. It doesn't work if the target process are ptraced(e.g., strace, debugger, minidump) because a process can have at most one ptracer. [1] https://developer.android.com/topic/performance/memory" [2] process_getinfo for getting the cookie which is updated whenever vma of process address layout are changed - Daniel Colascione - https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224 [3] anonymous fd which is used for the object(i.e., address range) validation - Michal Hocko - https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/ [minchan@kernel.org: fix process_madvise build break for arm64] Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com [minchan@kernel.org: fix build error for mips of process_madvise] Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com [akpm@linux-foundation.org: fix patch ordering issue] [akpm@linux-foundation.org: fix arm64 whoops] [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian] [akpm@linux-foundation.org: fix i386 build] [sfr@canb.auug.org.au: fix syscall numbering] Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au [sfr@canb.auug.org.au: madvise.c needs compat.h] Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au [minchan@kernel.org: fix mips build] Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com [yuehaibing@huawei.com: remove duplicate header which is included twice] Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com [minchan@kernel.org: do not use helper functions for process_madvise] Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com [akpm@linux-foundation.org: pidfd_get_pid() gained an argument] [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"] Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <christian@brauner.io> Cc: Daniel Colascione <dancol@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Dias <joaodias@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Sandeep Patil <sspatil@google.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13Merge branch 'compat.mount' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull compat mount cleanups from Al Viro: "The last remnants of mount(2) compat buried by Christoph. Buried into NFS, that is. Generally I'm less enthusiastic about "let's use in_compat_syscall() deep in call chain" kind of approach than Christoph seems to be, but in this case it's warranted - that had been an NFS-specific wart, hopefully not to be repeated in any other filesystems (read: any new filesystem introducing non-text mount options will get NAKed even if it doesn't mess the layout up). IOW, not worth trying to grow an infrastructure that would avoid that use of in_compat_syscall()..." * 'compat.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: remove compat_sys_mount fs,nfs: lift compat nfs4 mount data handling into the nfs code nfs: simplify nfs4_parse_monolithic
2020-10-03mm: remove compat_process_vm_{readv,writev}Christoph Hellwig1-4/+2
Now that import_iovec handles compat iovecs, the native syscalls can be used for the compat case as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-03fs: remove compat_sys_vmspliceChristoph Hellwig1-1/+1
Now that import_iovec handles compat iovecs, the native vmsplice syscall can be used for the compat case as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-03fs: remove the compat readv/writev syscallsChristoph Hellwig1-2/+2
Now that import_iovec handles compat iovecs, the native readv and writev syscalls can be used for the compat case as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-23fs: remove compat_sys_mountChristoph Hellwig1-1/+1
compat_sys_mount is identical to the regular sys_mount now, so remove it and use the native version everywhere. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds1-2/+2
Pull networking updates from David Miller: 1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan. 2) Support UDP segmentation in code TSO code, from Eric Dumazet. 3) Allow flashing different flash images in cxgb4 driver, from Vishal Kulkarni. 4) Add drop frames counter and flow status to tc flower offloading, from Po Liu. 5) Support n-tuple filters in cxgb4, from Vishal Kulkarni. 6) Various new indirect call avoidance, from Eric Dumazet and Brian Vazquez. 7) Fix BPF verifier failures on 32-bit pointer arithmetic, from Yonghong Song. 8) Support querying and setting hardware address of a port function via devlink, use this in mlx5, from Parav Pandit. 9) Support hw ipsec offload on bonding slaves, from Jarod Wilson. 10) Switch qca8k driver over to phylink, from Jonathan McDowell. 11) In bpftool, show list of processes holding BPF FD references to maps, programs, links, and btf objects. From Andrii Nakryiko. 12) Several conversions over to generic power management, from Vaibhav Gupta. 13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry Yakunin. 14) Various https url conversions, from Alexander A. Klimov. 15) Timestamping and PHC support for mscc PHY driver, from Antoine Tenart. 16) Support bpf iterating over tcp and udp sockets, from Yonghong Song. 17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov. 18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan. 19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several drivers. From Luc Van Oostenryck. 20) XDP support for xen-netfront, from Denis Kirjanov. 21) Support receive buffer autotuning in MPTCP, from Florian Westphal. 22) Support EF100 chip in sfc driver, from Edward Cree. 23) Add XDP support to mvpp2 driver, from Matteo Croce. 24) Support MPTCP in sock_diag, from Paolo Abeni. 25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic infrastructure, from Jakub Kicinski. 26) Several pci_ --> dma_ API conversions, from Christophe JAILLET. 27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel. 28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki. 29) Refactor a lot of networking socket option handling code in order to avoid set_fs() calls, from Christoph Hellwig. 30) Add rfc4884 support to icmp code, from Willem de Bruijn. 31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei. 32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin. 33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin. 34) Support TCP syncookies in MPTCP, from Flowian Westphal. 35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano Brivio. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits) net: thunderx: initialize VF's mailbox mutex before first usage usb: hso: remove bogus check for EINPROGRESS usb: hso: no complaint about kmalloc failure hso: fix bailout in error case of probe ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM selftests/net: relax cpu affinity requirement in msg_zerocopy test mptcp: be careful on subflow creation selftests: rtnetlink: make kci_test_encap() return sub-test result selftests: rtnetlink: correct the final return value for the test net: dsa: sja1105: use detected device id instead of DT one on mismatch tipc: set ub->ifindex for local ipv6 address ipv6: add ipv6_dev_find() net: openvswitch: silence suspicious RCU usage warning Revert "vxlan: fix tos value before xmit" ptp: only allow phase values lower than 1 period farsync: switch from 'pci_' to 'dma_' API wan: wanxl: switch from 'pci_' to 'dma_' API hv_netvsc: do not use VF device if link is down dpaa2-eth: Fix passing zero to 'PTR_ERR' warning net: macb: Properly handle phylink on at91sam9x ...
2020-07-20net: remove compat_sys_{get,set}sockoptChristoph Hellwig1-2/+2
Now that the ->compat_{get,set}sockopt proto_ops methods are gone there is no good reason left to keep the compat syscalls separate. This fixes the odd use of unsigned int for the compat_setsockopt optlen and the missing sock_use_custom_sol_socket. It would also easily allow running the eBPF hooks for the compat syscalls, but such a large change in behavior does not belong into a consolidation patch like this one. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-17arch: wire-up close_range()Christian Brauner1-0/+2
This wires up the close_range() syscall into all arches at once. Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Jann Horn <jannh@google.com> Cc: David Howells <dhowells@redhat.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Cc: linux-api@vger.kernel.org Cc: linux-alpha@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-ia64@vger.kernel.org Cc: linux-m68k@lists.linux-m68k.org Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-xtensa@linux-xtensa.org Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org
2020-05-14vfs: add faccessat2 syscallMiklos Szeredi1-1/+3
POSIX defines faccessat() as having a fourth "flags" argument, while the linux syscall doesn't have it. Glibc tries to emulate AT_EACCESS and AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken. Add a new faccessat(2) syscall with the added flags argument and implement both flags. The value of AT_EACCESS is defined in glibc headers to be the same as AT_REMOVEDIR. Use this value for the kernel interface as well, together with the explanatory comment. Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can be useful and is trivial to implement. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-01-30Merge tag 'threads-v5.6' of ↵Linus Torvalds1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull thread management updates from Christian Brauner: "Sargun Dhillon over the last cycle has worked on the pidfd_getfd() syscall. This syscall allows for the retrieval of file descriptors of a process based on its pidfd. A task needs to have ptrace_may_access() permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and Andy) on the target. One of the main use-cases is in combination with seccomp's user notification feature. As a reminder, seccomp's user notification feature was made available in v5.0. It allows a task to retrieve a file descriptor for its seccomp filter. The file descriptor is usually handed of to a more privileged supervising process. The supervisor can then listen for syscall events caught by the seccomp filter of the supervisee and perform actions in lieu of the supervisee, usually emulating syscalls. pidfd_getfd() is needed to expand its uses. There are currently two major users that wait on pidfd_getfd() and one future user: - Netflix, Sargun said, is working on a service mesh where users should be able to connect to a dns-based VIP. When a user connects to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be redirected to an envoy process. This service mesh uses seccomp user notifications and pidfd to intercept all connect calls and instead of connecting them to 1.2.3.4:80 connects them to e.g. 127.0.0.1:8080. - LXD uses the seccomp notifier heavily to intercept and emulate mknod() and mount() syscalls for unprivileged containers/processes. With pidfd_getfd() more uses-cases e.g. bridging socket connections will be possible. - The patchset has also seen some interest from the browser corner. Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a broker process. In the future glibc will start blocking all signals during dlopen() rendering this type of sandbox impossible. Hence, in the future Firefox will switch to a seccomp-user-nofication based sandbox which also makes use of file descriptor retrieval. The thread for this can be found at https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html With pidfd_getfd() it is e.g. possible to bridge socket connections for the supervisee (binding to a privileged port) and taking actions on file descriptors on behalf of the supervisee in general. Sargun's first version was using an ioctl on pidfds but various people pushed for it to be a proper syscall which he duely implemented as well over various review cycles. Selftests are of course included. I've also added instructions how to deal with merge conflicts below. There's also a small fix coming from the kernel mentee project to correctly annotate struct sighand_struct with __rcu to fix various sparse warnings. We've received a few more such fixes and even though they are mostly trivial I've decided to postpone them until after -rc1 since they came in rather late and I don't want to risk introducing build warnings. Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is needed to avoid allocation recursions triggerable by storage drivers that have userspace parts that run in the IO path (e.g. dm-multipath, iscsi, etc). These allocation recursions deadlock the device. The new prctl() allows such privileged userspace components to avoid allocation recursions by setting the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE flags. The patch carries the necessary acks from the relevant maintainers and is routed here as part of prctl() thread-management." * tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim sched.h: Annotate sighand_struct with __rcu test: Add test for pidfd getfd arch: wire up pidfd_getfd syscall pid: Implement pidfd_getfd syscall vfs, fdtable: Add fget_task helper
2020-01-18open: introduce openat2(2) syscallAleksa Sarai1-1/+4
/* Background. */ For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1]. This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2). Userspace also has a hard time figuring out whether a particular flag is supported on a particular kernel. While it is now possible with contemporary kernels (thanks to [3]), older kernels will expose unknown flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during openat(2) time matches modern syscall designs and is far more fool-proof. In addition, the newly-added path resolution restriction LOOKUP flags (which we would like to expose to user-space) don't feel related to the pre-existing O_* flag set -- they affect all components of path lookup. We'd therefore like to add a new flag argument. Adding a new syscall allows us to finally fix the flag-ignoring problem, and we can make it extensible enough so that we will hopefully never need an openat3(2). /* Syscall Prototype. */ /* * open_how is an extensible structure (similar in interface to * clone3(2) or sched_setattr(2)). The size parameter must be set to * sizeof(struct open_how), to allow for future extensions. All future * extensions will be appended to open_how, with their zero value * acting as a no-op default. */ struct open_how { /* ... */ }; int openat2(int dfd, const char *pathname, struct open_how *how, size_t size); /* Description. */ The initial version of 'struct open_how' contains the following fields: flags Used to specify openat(2)-style flags. However, any unknown flag bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR) will result in -EINVAL. In addition, this field is 64-bits wide to allow for more O_ flags than currently permitted with openat(2). mode The file mode for O_CREAT or O_TMPFILE. Must be set to zero if flags does not contain O_CREAT or O_TMPFILE. resolve Restrict path resolution (in contrast to O_* flags they affect all path components). The current set of flags are as follows (at the moment, all of the RESOLVE_ flags are implemented as just passing the corresponding LOOKUP_ flag). RESOLVE_NO_XDEV => LOOKUP_NO_XDEV RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS RESOLVE_BENEATH => LOOKUP_BENEATH RESOLVE_IN_ROOT => LOOKUP_IN_ROOT open_how does not contain an embedded size field, because it is of little benefit (userspace can figure out the kernel open_how size at runtime fairly easily without it). It also only contains u64s (even though ->mode arguably should be a u16) to avoid having padding fields which are never used in the future. Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE is no longer permitted for openat(2). As far as I can tell, this has always been a bug and appears to not be used by userspace (and I've not seen any problems on my machines by disallowing it). If it turns out this breaks something, we can special-case it and only permit it for openat(2) but not openat2(2). After input from Florian Weimer, the new open_how and flag definitions are inside a separate header from uapi/linux/fcntl.h, to avoid problems that glibc has with importing that header. /* Testing. */ In a follow-up patch there are over 200 selftests which ensure that this syscall has the correct semantics and will correctly handle several attack scenarios. In addition, I've written a userspace library[4] which provides convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care must be taken when using RESOLVE_IN_ROOT'd file descriptors with other syscalls). During the development of this patch, I've run numerous verification tests using libpathrs (showing that the API is reasonably usable by userspace). /* Future Work. */ Additional RESOLVE_ flags have been suggested during the review period. These can be easily implemented separately (such as blocking auto-mount during resolution). Furthermore, there are some other proposed changes to the openat(2) interface (the most obvious example is magic-link hardening[5]) which would be a good opportunity to add a way for userspace to restrict how O_PATH file descriptors can be re-opened. Another possible avenue of future work would be some kind of CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace which openat2(2) flags and fields are supported by the current kernel (to avoid userspace having to go through several guesses to figure it out). [1]: https://lwn.net/Articles/588444/ [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com [3]: commit 629e014bb834 ("fs: completely ignore unknown open flags") [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523 [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/ [6]: https://youtu.be/ggD-eb3yPVs Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-01-13arch: wire up pidfd_getfd syscallSargun Dhillon1-1/+3
This wires up the pidfd_getfd syscall for all architectures. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20200107175927.4558-4-sargun@sargun.me Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-09-06ipc: fix semtimedop for generic 32-bit architecturesArnd Bergmann1-1/+1
As Vincent noticed, the y2038 conversion of semtimedop in linux-5.1 broke when commit 00bf25d693e7 ("y2038: use time32 syscall names on 32-bit") changed all system calls on all architectures that take a 32-bit time_t to point to the _time32 implementation, but left out semtimedop in the asm-generic header. This affects all 32-bit architectures using asm-generic/unistd.h: h8300, unicore32, openrisc, nios2, hexagon, c6x, arc, nds32 and csky. The notable exception is riscv32, which has dropped support for the time32 system calls entirely. Reported-by: Vincent Chen <deanbo422@gmail.com> Cc: stable@vger.kernel.org Cc: Vincent Chen <deanbo422@gmail.com> Cc: Greentime Hu <green.hu@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Stafford Horne <shorne@gmail.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Ley Foon Tan <lftan@altera.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Mark Salter <msalter@redhat.com> Cc: Aurelien Jacquiot <jacquiot.aurelien@gmail.com> Cc: Guo Ren <guoren@kernel.org> Fixes: 00bf25d693e7 ("y2038: use time32 syscall names on 32-bit") Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-07-15unistd: protect clone3 via __ARCH_WANT_SYS_CLONE3Christian Brauner1-0/+2
This lets us catch new architectures that implicitly make use of clone3 without setting __ARCH_WANT_SYS_CLONE3. Failing on missing __ARCH_WANT_SYS_CLONE3 is a good indicator that they either did not really want this syscall or haven't really thought about whether it needs special treatment and just accidently included it in their entrypoints by e.g. generating their syscall table automatically via asm-generic/unistd.h This patch has been compile-tested for the h8300 architecture which is one of the architectures that does not yet implement clone3 and generates its syscall table via asm-generic/unistd.h. Signed-off-by: Christian Brauner <christian@brauner.io> Suggested-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20190714192205.27190-3-christian@brauner.io Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christian Brauner <christian@brauner.io>
2019-07-11Merge tag 'clone3-v5.3' of ↵Linus Torvalds1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull clone3 system call from Christian Brauner: "This adds the clone3 syscall which is an extensible successor to clone after we snagged the last flag with CLONE_PIDFD during the 5.2 merge window for clone(). It cleanly supports all of the flags from clone() and thus all legacy workloads. There are few user visible differences between clone3 and clone. First, CLONE_DETACHED will cause EINVAL with clone3 so we can reuse this flag. Second, the CSIGNAL flag is deprecated and will cause EINVAL to be reported. It is superseeded by a dedicated "exit_signal" argument in struct clone_args thus freeing up even more flags. And third, clone3 gives CLONE_PIDFD a dedicated return argument in struct clone_args instead of abusing CLONE_PARENT_SETTID's parent_tidptr argument. The clone3 uapi is designed to be easy to handle on 32- and 64 bit: /* uapi */ struct clone_args { __aligned_u64 flags; __aligned_u64 pidfd; __aligned_u64 child_tid; __aligned_u64 parent_tid; __aligned_u64 exit_signal; __aligned_u64 stack; __aligned_u64 stack_size; __aligned_u64 tls; }; and a separate kernel struct is used that uses proper kernel typing: /* kernel internal */ struct kernel_clone_args { u64 flags; int __user *pidfd; int __user *child_tid; int __user *parent_tid; int exit_signal; unsigned long stack; unsigned long stack_size; unsigned long tls; }; The system call comes with a size argument which enables the kernel to detect what version of clone_args userspace is passing in. clone3 validates that any additional bytes a given kernel does not know about are set to zero and that the size never exceeds a page. A nice feature is that this patchset allowed us to cleanup and simplify various core kernel codepaths in kernel/fork.c by making the internal _do_fork() function take struct kernel_clone_args even for legacy clone(). This patch also unblocks the time namespace patchset which wants to introduce a new CLONE_TIMENS flag. Note, that clone3 has only been wired up for x86{_32,64}, arm{64}, and xtensa. These were the architectures that did not require special massaging. Other architectures treat fork-like system calls individually and after some back and forth neither Arnd nor I felt confident that we dared to add clone3 unconditionally to all architectures. We agreed to leave this up to individual architecture maintainers. This is why there's an additional patch that introduces __ARCH_WANT_SYS_CLONE3 which any architecture can set once it has implemented support for clone3. The patch also adds a cond_syscall(clone3) for architectures such as nios2 or h8300 that generate their syscall table by simply including asm-generic/unistd.h. The hope is to get rid of __ARCH_WANT_SYS_CLONE3 and cond_syscall() rather soon" * tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: arch: handle arches who do not yet define clone3 arch: wire-up clone3() syscall fork: add clone3
2019-06-28arch: wire-up pidfd_open()Christian Brauner1-1/+3
This wires up the pidfd_open() syscall into all arches at once. Signed-off-by: Christian Brauner <christian@brauner.io> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jann Horn <jannh@google.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-api@vger.kernel.org Cc: linux-alpha@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-ia64@vger.kernel.org Cc: linux-m68k@lists.linux-m68k.org Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-xtensa@linux-xtensa.org Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org
2019-06-09arch: wire-up clone3() syscallChristian Brauner1-1/+3
Wire up the clone3() call on all arches that don't require hand-rolled assembly. Some of the arches look like they need special assembly massaging and it is probably smarter if the appropriate arch maintainers would do the actual wiring. Arches that are wired-up are: - x86{_32,64} - arm{64} - xtensa Signed-off-by: Christian Brauner <christian@brauner.io> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Kees Cook <keescook@chromium.org> Cc: David Howells <dhowells@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Adrian Reber <adrian@lisas.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Cc: linux-api@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org
2019-05-16uapi: Wire up the mount API syscalls on non-x86 arches [ver #2]David Howells1-1/+13
Wire up the mount API syscalls on non-x86 arches. Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-03-16Merge tag 'pidfd-v5.1-rc1' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull pidfd system call from Christian Brauner: "This introduces the ability to use file descriptors from /proc/<pid>/ as stable handles on struct pid. Even if a pid is recycled the handle will not change. For a start these fds can be used to send signals to the processes they refer to. With the ability to use /proc/<pid> fds as stable handles on struct pid we can fix a long-standing issue where after a process has exited its pid can be reused by another process. If a caller sends a signal to a reused pid it will end up signaling the wrong process. With this patchset we enable a variety of use cases. One obvious example is that we can now safely delegate an important part of process management - sending signals - to processes other than the parent of a given process by sending file descriptors around via scm rights and not fearing that the given process will have been recycled in the meantime. It also allows for easy testing whether a given process is still alive or not by sending signal 0 to a pidfd which is quite handy. There has been some interest in this feature e.g. from systems management (systemd, glibc) and container managers. I have requested and gotten comments from glibc to make sure that this syscall is suitable for their needs as well. In the future I expect it to take on most other pid-based signal syscalls. But such features are left for the future once they are needed. This has been sitting in linux-next for quite a while and has not caused any issues. It comes with selftests which verify basic functionality and also test that a recycled pid cannot be signaled via a pidfd. Jon has written about a prior version of this patchset. It should cover the basic functionality since not a lot has changed since then: https://lwn.net/Articles/773459/ The commit message for the syscall itself is extensively documenting the syscall, including it's functionality and extensibility" * tag 'pidfd-v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: selftests: add tests for pidfd_send_signal() signal: add pidfd_send_signal() syscall
2019-03-09Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+8
Pull io_uring IO interface from Jens Axboe: "Second attempt at adding the io_uring interface. Since the first one, we've added basic unit testing of the three system calls, that resides in liburing like the other unit tests that we have so far. It'll take a while to get full coverage of it, but we're working towards it. I've also added two basic test programs to tools/io_uring. One uses the raw interface and has support for all the various features that io_uring supports outside of standard IO, like fixed files, fixed IO buffers, and polled IO. The other uses the liburing API, and is a simplified version of cp(1). This adds support for a new IO interface, io_uring. io_uring allows an application to communicate with the kernel through two rings, the submission queue (SQ) and completion queue (CQ) ring. This allows for very efficient handling of IOs, see the v5 posting for some basic numbers: https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/ Outside of just efficiency, the interface is also flexible and extendable, and allows for future use cases like the upcoming NVMe key-value store API, networked IO, and so on. It also supports async buffered IO, something that we've always failed to support in the kernel. Outside of basic IO features, it supports async polled IO as well. This particular feature has already been tested at Facebook months ago for flash storage boxes, with 25-33% improvements. It makes polled IO actually useful for real world use cases, where even basic flash sees a nice win in terms of efficiency, latency, and performance. These boxes were IOPS bound before, now they are not. This series adds three new system calls. One for setting up an io_uring instance (io_uring_setup(2)), one for submitting/completing IO (io_uring_enter(2)), and one for aux functions like registrating file sets, buffers, etc (io_uring_register(2)). Through the help of Arnd, I've coordinated the syscall numbers so merge on that front should be painless. Jon did a writeup of the interface a while back, which (except for minor details that have been tweaked) is still accurate. Find that here: https://lwn.net/Articles/776703/ Huge thanks to Al Viro for helping getting the reference cycle code correct, and to Jann Horn for his extensive reviews focused on both security and bugs in general. There's a userspace library that provides basic functionality for applications that don't need or want to care about how to fiddle with the rings directly. It has helpers to allow applications to easily set up an io_uring instance, and submit/complete IO through it without knowing about the intricacies of the rings. It also includes man pages (thanks to Jeff Moyer), and will continue to grow support helper functions and features as time progresses. Find it here: git://git.kernel.dk/liburing Fio has full support for the raw interface, both in the form of an IO engine (io_uring), but also with a small test application (t/io_uring) that can exercise and benchmark the interface" * tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block: io_uring: add a few test tools io_uring: allow workqueue item to handle multiple buffered requests io_uring: add support for IORING_OP_POLL io_uring: add io_kiocb ref count io_uring: add submission polling io_uring: add file set registration net: split out functions related to registering inflight socket files io_uring: add support for pre-mapped user IO buffers block: implement bio helper to add iter bvec pages to bio io_uring: batch io_kiocb allocation io_uring: use fget/fput_many() for file references fs: add fget_many() and fput_many() io_uring: support for IO polling io_uring: add fsync support Add io_uring IO interface
2019-03-05signal: add pidfd_send_signal() syscallChristian Brauner1-1/+3
The kill() syscall operates on process identifiers (pid). After a process has exited its pid can be reused by another process. If a caller sends a signal to a reused pid it will end up signaling the wrong process. This issue has often surfaced and there has been a push to address this problem [1]. This patch uses file descriptors (fd) from proc/<pid> as stable handles on struct pid. Even if a pid is recycled the handle will not change. The fd can be used to send signals to the process it refers to. Thus, the new syscall pidfd_send_signal() is introduced to solve this problem. Instead of pids it operates on process fds (pidfd). /* prototype and argument /* long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags); /* syscall number 424 */ The syscall number was chosen to be 424 to align with Arnd's rework in his y2038 to minimize merge conflicts (cf. [25]). In addition to the pidfd and signal argument it takes an additional siginfo_t and flags argument. If the siginfo_t argument is NULL then pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo(). The flags argument is added to allow for future extensions of this syscall. It currently needs to be passed as 0. Failing to do so will cause EINVAL. /* pidfd_send_signal() replaces multiple pid-based syscalls */ The pidfd_send_signal() syscall currently takes on the job of rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a positive pid is passed to kill(2). It will however be possible to also replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended. /* sending signals to threads (tid) and process groups (pgid) */ Specifically, the pidfd_send_signal() syscall does currently not operate on process groups or threads. This is left for future extensions. In order to extend the syscall to allow sending signal to threads and process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and PIDFD_TYPE_TID) should be added. This implies that the flags argument will determine what is signaled and not the file descriptor itself. Put in other words, grouping in this api is a property of the flags argument not a property of the file descriptor (cf. [13]). Clarification for this has been requested by Eric (cf. [19]). When appropriate extensions through the flags argument are added then pidfd_send_signal() can additionally replace the part of kill(2) which operates on process groups as well as the tgkill(2) and rt_tgsigqueueinfo(2) syscalls. How such an extension could be implemented has been very roughly sketched in [14], [15], and [16]. However, this should not be taken as a commitment to a particular implementation. There might be better ways to do it. Right now this is intentionally left out to keep this patchset as simple as possible (cf. [4]). /* naming */ The syscall had various names throughout iterations of this patchset: - procfd_signal() - procfd_send_signal() - taskfd_send_signal() In the last round of reviews it was pointed out that given that if the flags argument decides the scope of the signal instead of different types of fds it might make sense to either settle for "procfd_" or "pidfd_" as prefix. The community was willing to accept either (cf. [17] and [18]). Given that one developer expressed strong preference for the "pidfd_" prefix (cf. [13]) and with other developers less opinionated about the name we should settle for "pidfd_" to avoid further bikeshedding. The "_send_signal" suffix was chosen to reflect the fact that the syscall takes on the job of multiple syscalls. It is therefore intentional that the name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the fomer because it might imply that pidfd_send_signal() is a replacement for kill(2), and not the latter because it is a hassle to remember the correct spelling - especially for non-native speakers - and because it is not descriptive enough of what the syscall actually does. The name "pidfd_send_signal" makes it very clear that its job is to send signals. /* zombies */ Zombies can be signaled just as any other process. No special error will be reported since a zombie state is an unreliable state (cf. [3]). However, this can be added as an extension through the @flags argument if the need ever arises. /* cross-namespace signals */ The patch currently enforces that the signaler and signalee either are in the same pid namespace or that the signaler's pid namespace is an ancestor of the signalee's pid namespace. This is done for the sake of simplicity and because it is unclear to what values certain members of struct siginfo_t would need to be set to (cf. [5], [6]). /* compat syscalls */ It became clear that we would like to avoid adding compat syscalls (cf. [7]). The compat syscall handling is now done in kernel/signal.c itself by adding __copy_siginfo_from_user_generic() which lets us avoid compat syscalls (cf. [8]). It should be noted that the addition of __copy_siginfo_from_user_any() is caused by a bug in the original implementation of rt_sigqueueinfo(2) (cf. 12). With upcoming rework for syscall handling things might improve significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain any additional callers. /* testing */ This patch was tested on x64 and x86. /* userspace usage */ An asciinema recording for the basic functionality can be found under [9]. With this patch a process can be killed via: #define _GNU_SOURCE #include <errno.h> #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/stat.h> #include <sys/syscall.h> #include <sys/types.h> #include <unistd.h> static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags) { #ifdef __NR_pidfd_send_signal return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags); #else return -ENOSYS; #endif } int main(int argc, char *argv[]) { int fd, ret, saved_errno, sig; if (argc < 3) exit(EXIT_FAILURE); fd = open(argv[1], O_DIRECTORY | O_CLOEXEC); if (fd < 0) { printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]); exit(EXIT_FAILURE); } sig = atoi(argv[2]); printf("Sending signal %d to process %s\n", sig, argv[1]); ret = do_pidfd_send_signal(fd, sig, NULL, 0); saved_errno = errno; close(fd); errno = saved_errno; if (ret < 0) { printf("%s - Failed to send signal %d to process %s\n", strerror(errno), sig, argv[1]); exit(EXIT_FAILURE); } exit(EXIT_SUCCESS); } /* Q&A * Given that it seems the same questions get asked again by people who are * late to the party it makes sense to add a Q&A section to the commit * message so it's hopefully easier to avoid duplicate threads. * * For the sake of progress please consider these arguments settled unless * there is a new point that desperately needs to be addressed. Please make * sure to check the links to the threads in this commit message whether * this has not already been covered. */ Q-01: (Florian Weimer [20], Andrew Morton [21]) What happens when the target process has exited? A-01: Sending the signal will fail with ESRCH (cf. [22]). Q-02: (Andrew Morton [21]) Is the task_struct pinned by the fd? A-02: No. A reference to struct pid is kept. struct pid - as far as I understand - was created exactly for the reason to not require to pin struct task_struct (cf. [22]). Q-03: (Andrew Morton [21]) Does the entire procfs directory remain visible? Just one entry within it? A-03: The same thing that happens right now when you hold a file descriptor to /proc/<pid> open (cf. [22]). Q-04: (Andrew Morton [21]) Does the pid remain reserved? A-04: No. This patchset guarantees a stable handle not that pids are not recycled (cf. [22]). Q-05: (Andrew Morton [21]) Do attempts to signal that fd return errors? A-05: See {Q,A}-01. Q-06: (Andrew Morton [22]) Is there a cleaner way of obtaining the fd? Another syscall perhaps. A-06: Userspace can already trivially retrieve file descriptors from procfs so this is something that we will need to support anyway. Hence, there's no immediate need to add another syscalls just to make pidfd_send_signal() not dependent on the presence of procfs. However, adding a syscalls to get such file descriptors is planned for a future patchset (cf. [22]). Q-07: (Andrew Morton [21] and others) This fd-for-a-process sounds like a handy thing and people may well think up other uses for it in the future, probably unrelated to signals. Are the code and the interface designed to permit such future applications? A-07: Yes (cf. [22]). Q-08: (Andrew Morton [21] and others) Now I think about it, why a new syscall? This thing is looking rather like an ioctl? A-08: This has been extensively discussed. It was agreed that a syscall is preferred for a variety or reasons. Here are just a few taken from prior threads. Syscalls are safer than ioctl()s especially when signaling to fds. Processes are a core kernel concept so a syscall seems more appropriate. The layout of the syscall with its four arguments would require the addition of a custom struct for the ioctl() thereby causing at least the same amount or even more complexity for userspace than a simple syscall. The new syscall will replace multiple other pid-based syscalls (see description above). The file-descriptors-for-processes concept introduced with this syscall will be extended with other syscalls in the future. See also [22], [23] and various other threads already linked in here. Q-09: (Florian Weimer [24]) What happens if you use the new interface with an O_PATH descriptor? A-09: pidfds opened as O_PATH fds cannot be used to send signals to a process (cf. [2]). Signaling processes through pidfds is the equivalent of writing to a file. Thus, this is not an operation that operates "purely at the file descriptor level" as required by the open(2) manpage. See also [4]. /* References */ [1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/ [2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/ [3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/ [4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/ [5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/ [6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/ [7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/ [8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/ [9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/ [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/ [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/ [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/ [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/ [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/ [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/ [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/ [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/ [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/ [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/ [23]: https://lwn.net/Articles/773459/ [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/ Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jann Horn <jannh@google.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Signed-off-by: Christian Brauner <christian@brauner.io> Reviewed-by: Tycho Andersen <tycho@tycho.ws> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Serge Hallyn <serge@hallyn.com> Acked-by: Aleksa Sarai <cyphar@cyphar.com>