From 1507f51255c9ff07d75909a84e7c0d7f3c4b2f49 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 7 Jul 2021 18:08:03 -0700 Subject: mm: introduce memfd_secret system call to create "secret" memory areas Introduce "memfd_secret" system call with the ability to create memory areas visible only in the context of the owning process and not mapped not only to other processes but in the kernel page tables as well. The secretmem feature is off by default and the user must explicitly enable it at the boot time. Once secretmem is enabled, the user will be able to create a file descriptor using the memfd_secret() system call. The memory areas created by mmap() calls from this file descriptor will be unmapped from the kernel direct map and they will be only mapped in the page table of the processes that have access to the file descriptor. Secretmem is designed to provide the following protections: * Enhanced protection (in conjunction with all the other in-kernel attack prevention systems) against ROP attacks. Seceretmem makes "simple" ROP insufficient to perform exfiltration, which increases the required complexity of the attack. Along with other protections like the kernel stack size limit and address space layout randomization which make finding gadgets is really hard, absence of any in-kernel primitive for accessing secret memory means the one gadget ROP attack can't work. Since the only way to access secret memory is to reconstruct the missing mapping entry, the attacker has to recover the physical page and insert a PTE pointing to it in the kernel and then retrieve the contents. That takes at least three gadgets which is a level of difficulty beyond most standard attacks. * Prevent cross-process secret userspace memory exposures. Once the secret memory is allocated, the user can't accidentally pass it into the kernel to be transmitted somewhere. The secreremem pages cannot be accessed via the direct map and they are disallowed in GUP. * Harden against exploited kernel flaws. In order to access secretmem, a kernel-side attack would need to either walk the page tables and create new ones, or spawn a new privileged uiserspace process to perform secrets exfiltration using ptrace. The file descriptor based memory has several advantages over the "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). File descriptor approach allows explicit and controlled sharing of the memory areas, it allows to seal the operations. Besides, file descriptor based memory paves the way for VMMs to remove the secret memory range from the userspace hipervisor process, for instance QEMU. Andy Lutomirski says: "Getting fd-backed memory into a guest will take some possibly major work in the kernel, but getting vma-backed memory into a guest without mapping it in the host user address space seems much, much worse." memfd_secret() is made a dedicated system call rather than an extension to memfd_create() because it's purpose is to allow the user to create more secure memory mappings rather than to simply allow file based access to the memory. Nowadays a new system call cost is negligible while it is way simpler for userspace to deal with a clear-cut system calls than with a multiplexer or an overloaded syscall. Moreover, the initial implementation of memfd_secret() is completely distinct from memfd_create() so there is no much sense in overloading memfd_create() to begin with. If there will be a need for code sharing between these implementation it can be easily achieved without a need to adjust user visible APIs. The secret memory remains accessible in the process context using uaccess primitives, but it is not exposed to the kernel otherwise; secret memory areas are removed from the direct map and functions in the follow_page()/get_user_page() family will refuse to return a page that belongs to the secret memory area. Once there will be a use case that will require exposing secretmem to the kernel it will be an opt-in request in the system call flags so that user would have to decide what data can be exposed to the kernel. Removing of the pages from the direct map may cause its fragmentation on architectures that use large pages to map the physical memory which affects the system performance. However, the original Kconfig text for CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736 ("x86: add gbpages switches")) and the recent report [1] showed that "... although 1G mappings are a good default choice, there is no compelling evidence that it must be the only choice". Hence, it is sufficient to have secretmem disabled by default with the ability of a system administrator to enable it at boot time. Pages in the secretmem regions are unevictable and unmovable to avoid accidental exposure of the sensitive data via swap or during page migration. Since the secretmem mappings are locked in memory they cannot exceed RLIMIT_MEMLOCK. Since these mappings are already locked independently from mlock(), an attempt to mlock()/munlock() secretmem range would fail and mlockall()/munlockall() will ignore secretmem mappings. However, unlike mlock()ed memory, secretmem currently behaves more like long-term GUP: secretmem mappings are unmovable mappings directly consumed by user space. With default limits, there is no excessive use of secretmem and it poses no real problem in combination with ZONE_MOVABLE/CMA, but in the future this should be addressed to allow balanced use of large amounts of secretmem along with ZONE_MOVABLE/CMA. A page that was a part of the secret memory area is cleared when it is freed to ensure the data is not exposed to the next user of that page. The following example demonstrates creation of a secret mapping (error handling is omitted): fd = memfd_secret(0); ftruncate(fd, MAP_SIZE); ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/ [akpm@linux-foundation.org: suppress Kconfig whine] Link: https://lkml.kernel.org/r/20210518072034.31572-5-rppt@kernel.org Signed-off-by: Mike Rapoport Acked-by: Hagen Paul Pfeifer Acked-by: James Bottomley Cc: Alexander Viro Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christopher Lameter Cc: Dan Williams Cc: Dave Hansen Cc: Elena Reshetova Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Bottomley Cc: "Kirill A. Shutemov" Cc: Matthew Wilcox Cc: Mark Rutland Cc: Michael Kerrisk Cc: Palmer Dabbelt Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Rick Edgecombe Cc: Roman Gushchin Cc: Shakeel Butt Cc: Shuah Khan Cc: Thomas Gleixner Cc: Tycho Andersen Cc: Will Deacon Cc: David Hildenbrand Cc: kernel test robot Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/uapi/linux/magic.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/uapi') diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h index f3956fc11de6..35687dcb1a42 100644 --- a/include/uapi/linux/magic.h +++ b/include/uapi/linux/magic.h @@ -97,5 +97,6 @@ #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */ #define Z3FOLD_MAGIC 0x33 #define PPC_CMM_MAGIC 0xc7571590 +#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */ #endif /* __LINUX_MAGIC_H__ */ -- cgit v1.2.3 From 7bb7f2ac24a028b20fca466b9633847b289b156a Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 7 Jul 2021 18:08:11 -0700 Subject: arch, mm: wire up memfd_secret system call where relevant Wire up memfd_secret system call on architectures that define ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86. Link: https://lkml.kernel.org/r/20210518072034.31572-7-rppt@kernel.org Signed-off-by: Mike Rapoport Acked-by: Palmer Dabbelt Acked-by: Arnd Bergmann Acked-by: Catalin Marinas Acked-by: David Hildenbrand Acked-by: James Bottomley Cc: Alexander Viro Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Christopher Lameter Cc: Dan Williams Cc: Dave Hansen Cc: David Hildenbrand Cc: Elena Reshetova Cc: Hagen Paul Pfeifer Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Bottomley Cc: "Kirill A. Shutemov" Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael Kerrisk Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Rick Edgecombe Cc: Roman Gushchin Cc: Shakeel Butt Cc: Shuah Khan Cc: Thomas Gleixner Cc: Tycho Andersen Cc: Will Deacon Cc: kernel test robot Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm64/include/uapi/asm/unistd.h | 1 + arch/riscv/include/asm/unistd.h | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + include/linux/syscalls.h | 1 + include/uapi/asm-generic/unistd.h | 7 ++++++- scripts/checksyscalls.sh | 4 ++++ 7 files changed, 15 insertions(+), 1 deletion(-) (limited to 'include/uapi') diff --git a/arch/arm64/include/uapi/asm/unistd.h b/arch/arm64/include/uapi/asm/unistd.h index f83a70e07df8..ce2ee8f1e361 100644 --- a/arch/arm64/include/uapi/asm/unistd.h +++ b/arch/arm64/include/uapi/asm/unistd.h @@ -20,5 +20,6 @@ #define __ARCH_WANT_SET_GET_RLIMIT #define __ARCH_WANT_TIME32_SYSCALLS #define __ARCH_WANT_SYS_CLONE3 +#define __ARCH_WANT_MEMFD_SECRET #include diff --git a/arch/riscv/include/asm/unistd.h b/arch/riscv/include/asm/unistd.h index 977ee6181dab..6c316093a1e5 100644 --- a/arch/riscv/include/asm/unistd.h +++ b/arch/riscv/include/asm/unistd.h @@ -9,6 +9,7 @@ */ #define __ARCH_WANT_SYS_CLONE +#define __ARCH_WANT_MEMFD_SECRET #include diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index fba2f615119a..ce763a12311c 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -451,3 +451,4 @@ 444 i386 landlock_create_ruleset sys_landlock_create_ruleset 445 i386 landlock_add_rule sys_landlock_add_rule 446 i386 landlock_restrict_self sys_landlock_restrict_self +447 i386 memfd_secret sys_memfd_secret diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index af973e400053..f6b57799c1ea 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -368,6 +368,7 @@ 444 common landlock_create_ruleset sys_landlock_create_ruleset 445 common landlock_add_rule sys_landlock_add_rule 446 common landlock_restrict_self sys_landlock_restrict_self +447 common memfd_secret sys_memfd_secret # # Due to a historical design error, certain syscalls are numbered differently diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 586128d5c3b8..69c9a7010081 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -1050,6 +1050,7 @@ asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr _ asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type, const void __user *rule_attr, __u32 flags); asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags); +asmlinkage long sys_memfd_secret(unsigned int flags); /* * Architecture-specific system calls diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index f211961ce1da..a9d6fcd95f42 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -873,8 +873,13 @@ __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule) #define __NR_landlock_restrict_self 446 __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self) +#ifdef __ARCH_WANT_MEMFD_SECRET +#define __NR_memfd_secret 447 +__SYSCALL(__NR_memfd_secret, sys_memfd_secret) +#endif + #undef __NR_syscalls -#define __NR_syscalls 447 +#define __NR_syscalls 448 /* * 32 bit systems traditionally used different diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh index a18b47695f55..b7609958ee36 100755 --- a/scripts/checksyscalls.sh +++ b/scripts/checksyscalls.sh @@ -40,6 +40,10 @@ cat << EOF #define __IGNORE_setrlimit /* setrlimit */ #endif +#ifndef __ARCH_WANT_MEMFD_SECRET +#define __IGNORE_memfd_secret +#endif + /* Missing flags argument */ #define __IGNORE_renameat /* renameat2 */ -- cgit v1.2.3