From afc18069a2cb7ead5f86623a5f3d4ad6e21f940d Mon Sep 17 00:00:00 2001
From: Kairui Song <kasong@redhat.com>
Date: Wed, 14 Oct 2020 17:24:28 +0800
Subject: x86/kexec: Use up-to-dated screen_info copy to fill boot params

kexec_file_load() currently reuses the old boot_params.screen_info,
but if drivers have change the hardware state, boot_param.screen_info
could contain invalid info.

For example, the video type might be no longer VGA, or the frame buffer
address might be changed. If the kexec kernel keeps using the old screen_info,
kexec'ed kernel may attempt to write to an invalid framebuffer
memory region.

There are two screen_info instances globally available, boot_params.screen_info
and screen_info. Later one is a copy, and is updated by drivers.

So let kexec_file_load use the updated copy.

[ mingo: Tidied up the changelog. ]

Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20201014092429.1415040-2-kasong@redhat.com
---
 arch/x86/kernel/kexec-bzimage64.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

(limited to 'arch')
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 57c2ecf43134..ce831f9448e7 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -200,8 +200,7 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
 	params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
 
 	/* Copying screen_info will do? */
-	memcpy(&params->screen_info, &boot_params.screen_info,
-				sizeof(struct screen_info));
+	memcpy(&params->screen_info, &screen_info, sizeof(struct screen_info));
 
 	/* Fill in memsize later */
 	params->screen_info.ext_mem_k = 0;
-- 
cgit v1.2.3


From f2ac57a4c49d40409c21c82d23b5706df9b438af Mon Sep 17 00:00:00 2001
From: Jiri Slaby <jslaby@suse.cz>
Date: Wed, 14 Oct 2020 07:30:51 +0200
Subject: x86/unwind/orc: Fix inactive tasks with stack pointer in %sp on GCC
 10 compiled kernels

GCC 10 optimizes the scheduler code differently than its predecessors.

When CONFIG_DEBUG_SECTION_MISMATCH=y, the Makefile forces GCC not
to inline some functions (-fno-inline-functions-called-once). Before GCC
10, "no-inlined" __schedule() starts with the usual prologue:

  push %bp
  mov %sp, %bp

So the ORC unwinder simply picks stack pointer from %bp and
unwinds from __schedule() just perfectly:

  $ cat /proc/1/stack
  [<0>] ep_poll+0x3e9/0x450
  [<0>] do_epoll_wait+0xaa/0xc0
  [<0>] __x64_sys_epoll_wait+0x1a/0x20
  [<0>] do_syscall_64+0x33/0x40
  [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

But now, with GCC 10, there is no %bp prologue in __schedule():

  $ cat /proc/1/stack
  <nothing>

The ORC entry of the point in __schedule() is:

  sp:sp+88 bp:last_sp-48 type:call end:0

In this case, nobody subtracts sizeof "struct inactive_task_frame" in
__unwind_start(). The struct is put on the stack by __switch_to_asm() and
only then __switch_to_asm() stores %sp to task->thread.sp. But we start
unwinding from a point in __schedule() (stored in frame->ret_addr by
'call') and not in __switch_to_asm().

So for these example values in __unwind_start():

  sp=ffff94b50001fdc8 bp=ffff8e1f41d29340 ip=__schedule+0x1f0

The stack is:

  ffff94b50001fdc8: ffff8e1f41578000 # struct inactive_task_frame
  ffff94b50001fdd0: 0000000000000000
  ffff94b50001fdd8: ffff8e1f41d29340
  ffff94b50001fde0: ffff8e1f41611d40 # ...
  ffff94b50001fde8: ffffffff93c41920 # bx
  ffff94b50001fdf0: ffff8e1f41d29340 # bp
  ffff94b50001fdf8: ffffffff9376cad0 # ret_addr (and end of the struct)

0xffffffff9376cad0 is __schedule+0x1f0 (after the call to
__switch_to_asm).  Now follow those 88 bytes from the ORC entry (sp+88).
The entry is correct, __schedule() really pushes 48 bytes (8*7) + 32 bytes
via subq to store some local values (like 4U below). So to unwind, look
at the offset 88-sizeof(long) = 0x50 from here:

  ffff94b50001fe00: ffff8e1f41578618
  ffff94b50001fe08: 00000cc000000255
  ffff94b50001fe10: 0000000500000004
  ffff94b50001fe18: 7793fab6956b2d00 # NOTE (see below)
  ffff94b50001fe20: ffff8e1f41578000
  ffff94b50001fe28: ffff8e1f41578000
  ffff94b50001fe30: ffff8e1f41578000
  ffff94b50001fe38: ffff8e1f41578000
  ffff94b50001fe40: ffff94b50001fed8
  ffff94b50001fe48: ffff8e1f41577ff0
  ffff94b50001fe50: ffffffff9376cf12

Here                ^^^^^^^^^^^^^^^^ is the correct ret addr from
__schedule(). It translates to schedule+0x42 (insn after a call to
__schedule()).

BUT, unwind_next_frame() tries to take the address starting from
0xffff94b50001fdc8. That is exactly from thread.sp+88-sizeof(long) =
0xffff94b50001fdc8+88-8 = 0xffff94b50001fe18, which is garbage marked as
NOTE above. So this quits the unwinding as 7793fab6956b2d00 is obviously
not a kernel address.

There was a fix to skip 'struct inactive_task_frame' in
unwind_get_return_address_ptr in the following commit:

  187b96db5ca7 ("x86/unwind/orc: Fix unwind_get_return_address_ptr() for inactive tasks")

But we need to skip the struct already in the unwinder proper. So
subtract the size (increase the stack pointer) of the structure in
__unwind_start() directly. This allows for removal of the code added by
commit 187b96db5ca7 completely, as the address is now at
'(unsigned long *)state->sp - 1', the same as in the generic case.

[ mingo: Cleaned up the changelog a bit, for better readability. ]

Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder")
Bug: https://bugzilla.suse.com/show_bug.cgi?id=1176907
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20201014053051.24199-1-jslaby@suse.cz
---
 arch/x86/kernel/unwind_orc.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

(limited to 'arch')

diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index ec88bbe08a32..4a96aa3de7d8 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -320,19 +320,12 @@ EXPORT_SYMBOL_GPL(unwind_get_return_address);
 
 unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
 {
-	struct task_struct *task = state->task;
-
 	if (unwind_done(state))
 		return NULL;
 
 	if (state->regs)
 		return &state->regs->ip;
 
-	if (task != current && state->sp == task->thread.sp) {
-		struct inactive_task_frame *frame = (void *)task->thread.sp;
-		return &frame->ret_addr;
-	}
-
 	if (state->sp)
 		return (unsigned long *)state->sp - 1;
 
@@ -662,7 +655,7 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task,
 	} else {
 		struct inactive_task_frame *frame = (void *)task->thread.sp;
 
-		state->sp = task->thread.sp;
+		state->sp = task->thread.sp + sizeof(*frame);
 		state->bp = READ_ONCE_NOCHECK(frame->bp);
 		state->ip = READ_ONCE_NOCHECK(frame->ret_addr);
 		state->signal = (void *)state->ip == ret_from_fork;
-- 
cgit v1.2.3


From c3b484c439b0bab7a698495f33ef16286a1000c4 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto@kernel.org>
Date: Sun, 11 Oct 2020 19:51:21 -0700
Subject: x86/syscalls: Document the fact that syscalls 512-547 are a legacy
 mistake

Since this commit:

  6365b842aae4 ("x86/syscalls: Split the x32 syscalls into their own table")

there is no need for special x32-specific syscall numbers.  I forgot to
update the comments in syscall_64.tbl.  Add comments to make it clear to
future contributors that this range is a legacy wart.

Reported-by: Jessica Clarke <jrtc27@jrtc27.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/6c56fb4ddd18fc60a238eb4d867e4b3d97c6351e.1602471055.git.luto@kernel.org
---
 arch/x86/entry/syscalls/syscall_64.tbl | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

(limited to 'arch')

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f30d6ae9a688..4adb5d2a3319 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -363,10 +363,10 @@
 439	common	faccessat2		sys_faccessat2
 
 #
-# x32-specific system call numbers start at 512 to avoid cache impact
-# for native 64-bit operation. The __x32_compat_sys stubs are created
-# on-the-fly for compat_sys_*() compatibility system calls if X86_X32
-# is defined.
+# Due to a historical design error, certain syscalls are numbered differently
+# in x32 as compared to native x86_64.  These syscalls have numbers 512-547.
+# Do not add new syscalls to this range.  Numbers 548 and above are available
+# for non-x32 use.
 #
 512	x32	rt_sigaction		compat_sys_rt_sigaction
 513	x32	rt_sigreturn		compat_sys_x32_rt_sigreturn
@@ -404,3 +404,5 @@
 545	x32	execveat		compat_sys_execveat
 546	x32	preadv2			compat_sys_preadv64v2
 547	x32	pwritev2		compat_sys_pwritev64v2
+# This is the end of the legacy x32 range.  Numbers 548 and above are
+# not special and are not to be used for x32-specific syscalls.
-- 
cgit v1.2.3


From abee7c494d8c41bb388839bccc47e06247f0d7de Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Fri, 9 Oct 2020 16:42:25 +0200
Subject: x86/alternative: Don't call text_poke() in lazy TLB mode

When running in lazy TLB mode the currently active page tables might
be the ones of a previous process, e.g. when running a kernel thread.

This can be problematic in case kernel code is being modified via
text_poke() in a kernel thread, and on another processor exit_mmap()
is active for the process which was running on the first cpu before
the kernel thread.

As text_poke() is using a temporary address space and the former
address space (obtained via cpu_tlbstate.loaded_mm) is restored
afterwards, there is a race possible in case the cpu on which
exit_mmap() is running wants to make sure there are no stale
references to that address space on any cpu active (this e.g. is
required when running as a Xen PV guest, where this problem has been
observed and analyzed).

In order to avoid that, drop off TLB lazy mode before switching to the
temporary address space.

Fixes: cefa929c034eb5d ("x86/mm: Introduce temporary mm structs")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201009144225.12019-1-jgross@suse.com
---
 arch/x86/kernel/alternative.c | 9 +++++++++
 1 file changed, 9 insertions(+)

(limited to 'arch')

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index cdaab30880b9..cd6be6f143e8 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -807,6 +807,15 @@ static inline temp_mm_state_t use_temporary_mm(struct mm_struct *mm)
 	temp_mm_state_t temp_state;
 
 	lockdep_assert_irqs_disabled();
+
+	/*
+	 * Make sure not to be in TLB lazy mode, as otherwise we'll end up
+	 * with a stale address space WITHOUT being in lazy mode after
+	 * restoring the previous mm.
+	 */
+	if (this_cpu_read(cpu_tlbstate.is_lazy))
+		leave_mm(smp_processor_id());
+
 	temp_state.mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	switch_mm_irqs_off(NULL, mm, current);
 
-- 
cgit v1.2.3