From 507ad93ad22b58d1690c65851f4b2b7dd9965ef8 Mon Sep 17 00:00:00 2001 From: Nathan Chancellor Date: Tue, 11 Jun 2019 10:19:32 -0700 Subject: arm64: Don't unconditionally add -Wno-psabi to KBUILD_CFLAGS MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit commit fa63da2ab046b885a7f70291aafc4e8ce015429b upstream. This is a GCC only option, which warns about ABI changes within GCC, so unconditionally adding it breaks Clang with tons of: warning: unknown warning option '-Wno-psabi' [-Wunknown-warning-option] and link time failures: ld.lld: error: undefined symbol: __efistub___stack_chk_guard >>> referenced by arm-stub.c:73 (/home/nathan/cbl/linux/drivers/firmware/efi/libstub/arm-stub.c:73) >>> arm-stub.stub.o:(__efistub_install_memreserve_table) in archive ./drivers/firmware/efi/libstub/lib.a These failures come from the lack of -fno-stack-protector, which is added via cc-option in drivers/firmware/efi/libstub/Makefile. When an unknown flag is added to KBUILD_CFLAGS, clang will noisily warn that it is ignoring the option like above, unlike gcc, who will just error. $ echo "int main() { return 0; }" > tmp.c $ clang -Wno-psabi tmp.c; echo $? warning: unknown warning option '-Wno-psabi' [-Wunknown-warning-option] 1 warning generated. 0 $ gcc -Wsometimes-uninitialized tmp.c; echo $? gcc: error: unrecognized command line option ‘-Wsometimes-uninitialized’; did you mean ‘-Wmaybe-uninitialized’? 1 For cc-option to work properly with clang and behave like gcc, -Werror is needed, which was done in commit c3f0d0bc5b01 ("kbuild, LLVMLinux: Add -Werror to cc-option to support clang"). $ clang -Werror -Wno-psabi tmp.c; echo $? error: unknown warning option '-Wno-psabi' [-Werror,-Wunknown-warning-option] 1 As a consequence of this, when an unknown flag is unconditionally added to KBUILD_CFLAGS, it will cause cc-option to always fail and those flags will never get added: $ clang -Werror -Wno-psabi -fno-stack-protector tmp.c; echo $? error: unknown warning option '-Wno-psabi' [-Werror,-Wunknown-warning-option] 1 This can be seen when compiling the whole kernel as some warnings that are normally disabled (see below) show up. The full list of flags missing from drivers/firmware/efi/libstub are the following (gathered from diffing .arm64-stub.o.cmd): -fno-delete-null-pointer-checks -Wno-address-of-packed-member -Wframe-larger-than=2048 -Wno-unused-const-variable -fno-strict-overflow -fno-merge-all-constants -fno-stack-check -Werror=date-time -Werror=incompatible-pointer-types -ffreestanding -fno-stack-protector Use cc-disable-warning so that it gets disabled for GCC and does nothing for Clang. Fixes: ebcc5928c5d9 ("arm64: Silence gcc warnings about arch ABI drift") Link: https://github.com/ClangBuiltLinux/linux/issues/511 Reported-by: Qian Cai Acked-by: Dave Martin Reviewed-by: Nick Desaulniers Signed-off-by: Nathan Chancellor Signed-off-by: Will Deacon Signed-off-by: Greg Kroah-Hartman --- arch/arm64/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile index 8fbd583b18e1..e9d2e578cbe6 100644 --- a/arch/arm64/Makefile +++ b/arch/arm64/Makefile @@ -51,7 +51,7 @@ endif KBUILD_CFLAGS += -mgeneral-regs-only $(lseinstr) $(brokengasinst) KBUILD_CFLAGS += -fno-asynchronous-unwind-tables -KBUILD_CFLAGS += -Wno-psabi +KBUILD_CFLAGS += $(call cc-disable-warning, psabi) KBUILD_AFLAGS += $(lseinstr) $(brokengasinst) KBUILD_CFLAGS += $(call cc-option,-mabi=lp64) -- cgit v1.2.3 From 0d1d924485133dca882e8fcebc9003acd7188724 Mon Sep 17 00:00:00 2001 From: Sasha Levin Date: Tue, 25 Jun 2019 07:36:40 -0400 Subject: Revert "x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP" This reverts commit b65b70ba068b7cdbfeb65eee87cce84a74618603, which was upstream commit 4a6c91fbdef846ec7250b82f2eeeb87ac5f18cf9. On Tue, Jun 25, 2019 at 09:39:45AM +0200, Sebastian Andrzej Siewior wrote: >Please backport commit e74deb11931ff682b59d5b9d387f7115f689698e to >stable _or_ revert the backport of commit 4a6c91fbdef84 ("x86/uaccess, >ftrace: Fix ftrace_likely_update() vs. SMAP"). It uses >user_access_{save|restore}() which has been introduced in the following >commit. Signed-off-by: Sasha Levin --- kernel/trace/trace_branch.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/kernel/trace/trace_branch.c b/kernel/trace/trace_branch.c index 3ea65cdff30d..4ad967453b6f 100644 --- a/kernel/trace/trace_branch.c +++ b/kernel/trace/trace_branch.c @@ -205,8 +205,6 @@ void trace_likely_condition(struct ftrace_likely_data *f, int val, int expect) void ftrace_likely_update(struct ftrace_likely_data *f, int val, int expect, int is_constant) { - unsigned long flags = user_access_save(); - /* A constant is always correct */ if (is_constant) { f->constant++; @@ -225,8 +223,6 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val, f->data.correct++; else f->data.incorrect++; - - user_access_restore(flags); } EXPORT_SYMBOL(ftrace_likely_update); -- cgit v1.2.3 From 4d750447128f20c74fb89b3bdff8fc7fb7ccdec9 Mon Sep 17 00:00:00 2001 From: Bjørn Mork Date: Mon, 24 Jun 2019 18:45:11 +0200 Subject: qmi_wwan: Fix out-of-bounds read MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit [ Upstream commit 904d88d743b0c94092c5117955eab695df8109e8 ] The syzbot reported Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0xca/0x13e lib/dump_stack.c:113 print_address_description+0x67/0x231 mm/kasan/report.c:188 __kasan_report.cold+0x1a/0x32 mm/kasan/report.c:317 kasan_report+0xe/0x20 mm/kasan/common.c:614 qmi_wwan_probe+0x342/0x360 drivers/net/usb/qmi_wwan.c:1417 usb_probe_interface+0x305/0x7a0 drivers/usb/core/driver.c:361 really_probe+0x281/0x660 drivers/base/dd.c:509 driver_probe_device+0x104/0x210 drivers/base/dd.c:670 __device_attach_driver+0x1c2/0x220 drivers/base/dd.c:777 bus_for_each_drv+0x15c/0x1e0 drivers/base/bus.c:454 Caused by too many confusing indirections and casts. id->driver_info is a pointer stored in a long. We want the pointer here, not the address of it. Thanks-to: Hillf Danton Reported-by: syzbot+b68605d7fadd21510de1@syzkaller.appspotmail.com Cc: Kristian Evensen Fixes: e4bf63482c30 ("qmi_wwan: Add quirk for Quectel dynamic config") Signed-off-by: Bjørn Mork Signed-off-by: David S. Miller Signed-off-by: Sasha Levin --- drivers/net/usb/qmi_wwan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c index d9a6699abe59..e657d8947125 100644 --- a/drivers/net/usb/qmi_wwan.c +++ b/drivers/net/usb/qmi_wwan.c @@ -1412,7 +1412,7 @@ static int qmi_wwan_probe(struct usb_interface *intf, * different. Ignore the current interface if the number of endpoints * equals the number for the diag interface (two). */ - info = (void *)&id->driver_info; + info = (void *)id->driver_info; if (info->data & QMI_WWAN_QUIRK_QUECTEL_DYNCFG) { if (desc->bNumEndpoints == 2) -- cgit v1.2.3 From f1fb34c22786829fb58564d36609a05483ea9002 Mon Sep 17 00:00:00 2001 From: John Ogness Date: Fri, 28 Jun 2019 12:06:40 -0700 Subject: fs/proc/array.c: allow reporting eip/esp for all coredumping threads commit cb8f381f1613cafe3aec30809991cd56e7135d92 upstream. 0a1eb2d474ed ("fs/proc: Stop reporting eip and esp in /proc/PID/stat") stopped reporting eip/esp and fd7d56270b52 ("fs/proc: Report eip/esp in /prod/PID/stat for coredumping") reintroduced the feature to fix a regression with userspace core dump handlers (such as minicoredumper). Because PF_DUMPCORE is only set for the primary thread, this didn't fix the original problem for secondary threads. Allow reporting the eip/esp for all threads by checking for PF_EXITING as well. This is set for all the other threads when they are killed. coredump_wait() waits for all the tasks to become inactive before proceeding to invoke a core dumper. Link: http://lkml.kernel.org/r/87y32p7i7a.fsf@linutronix.de Link: http://lkml.kernel.org/r/20190522161614.628-1-jlu@pengutronix.de Fixes: fd7d56270b526ca3 ("fs/proc: Report eip/esp in /prod/PID/stat for coredumping") Signed-off-by: John Ogness Reported-by: Jan Luebbe Tested-by: Jan Luebbe Cc: Alexey Dobriyan Cc: Andy Lutomirski Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman --- fs/proc/array.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/proc/array.c b/fs/proc/array.c index 2edbb657f859..55180501b915 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -462,7 +462,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns, * a program is not able to use ptrace(2) in that case. It is * safe because the task has stopped executing permanently. */ - if (permitted && (task->flags & PF_DUMPCORE)) { + if (permitted && (task->flags & (PF_EXITING|PF_DUMPCORE))) { if (try_get_task_stack(task)) { eip = KSTK_EIP(task); esp = KSTK_ESP(task); -- cgit v1.2.3 From 41ceb21b7dc18e9369e9b903b46b8334744f5cfb Mon Sep 17 00:00:00 2001 From: zhong jiang Date: Fri, 28 Jun 2019 12:06:43 -0700 Subject: mm/mempolicy.c: fix an incorrect rebind node in mpol_rebind_nodemask commit 29b190fa774dd1b72a1a6f19687d55dc72ea83be upstream. mpol_rebind_nodemask() is called for MPOL_BIND and MPOL_INTERLEAVE mempoclicies when the tasks's cpuset's mems_allowed changes. For policies created without MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES, it works by remapping the policy's allowed nodes (stored in v.nodes) using the previous value of mems_allowed (stored in w.cpuset_mems_allowed) as the domain of map and the new mems_allowed (passed as nodes) as the range of the map (see the comment of bitmap_remap() for details). The result of remapping is stored back as policy's nodemask in v.nodes, and the new value of mems_allowed should be stored in w.cpuset_mems_allowed to facilitate the next rebind, if it happens. However, 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets") introduced a bug where the result of remapping is stored in w.cpuset_mems_allowed instead. Thus, a mempolicy's allowed nodes can evolve in an unexpected way after a series of rebinding due to cpuset mems_allowed changes, possibly binding to a wrong node or a smaller number of nodes which may e.g. overload them. This patch fixes the bug so rebinding again works as intended. [vbabka@suse.cz: new changlog] Link: http://lkml.kernel.org/r/ef6a69c6-c052-b067-8f2c-9d615c619bb9@suse.cz Link: http://lkml.kernel.org/r/1558768043-23184-1-git-send-email-zhongjiang@huawei.com Fixes: 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets") Signed-off-by: zhong jiang Reviewed-by: Vlastimil Babka Cc: Oscar Salvador Cc: Anshuman Khandual Cc: Michal Hocko Cc: Mel Gorman Cc: Andrea Arcangeli Cc: Ralph Campbell Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman --- mm/mempolicy.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 2219e747df49..5b3bf1747c19 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -306,7 +306,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes) else { nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed, *nodes); - pol->w.cpuset_mems_allowed = tmp; + pol->w.cpuset_mems_allowed = *nodes; } if (nodes_empty(tmp)) -- cgit v1.2.3 From ecace84283e77c2ac21ba0ef398fc433636fe686 Mon Sep 17 00:00:00 2001 From: Jann Horn Date: Fri, 28 Jun 2019 12:06:46 -0700 Subject: fs/binfmt_flat.c: make load_flat_shared_library() work commit 867bfa4a5fcee66f2b25639acae718e8b28b25a5 upstream. load_flat_shared_library() is broken: It only calls load_flat_file() if prepare_binprm() returns zero, but prepare_binprm() returns the number of bytes read - so this only happens if the file is empty. Instead, call into load_flat_file() if the number of bytes read is non-negative. (Even if the number of bytes is zero - in that case, load_flat_file() will see nullbytes and return a nice -ENOEXEC.) In addition, remove the code related to bprm creds and stop using prepare_binprm() - this code is loading a library, not a main executable, and it only actually uses the members "buf", "file" and "filename" of the linux_binprm struct. Instead, call kernel_read() directly. Link: http://lkml.kernel.org/r/20190524201817.16509-1-jannh@google.com Fixes: 287980e49ffc ("remove lots of IS_ERR_VALUE abuses") Signed-off-by: Jann Horn Cc: Alexander Viro Cc: Kees Cook Cc: Nicolas Pitre Cc: Arnd Bergmann Cc: Geert Uytterhoeven Cc: Russell King Cc: Greg Ungerer Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman --- fs/binfmt_flat.c | 23 +++++++---------------- 1 file changed, 7 insertions(+), 16 deletions(-) diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c index 82a48e830018..e4b59e76afb0 100644 --- a/fs/binfmt_flat.c +++ b/fs/binfmt_flat.c @@ -856,9 +856,14 @@ err: static int load_flat_shared_library(int id, struct lib_info *libs) { + /* + * This is a fake bprm struct; only the members "buf", "file" and + * "filename" are actually used. + */ struct linux_binprm bprm; int res; char buf[16]; + loff_t pos = 0; memset(&bprm, 0, sizeof(bprm)); @@ -872,25 +877,11 @@ static int load_flat_shared_library(int id, struct lib_info *libs) if (IS_ERR(bprm.file)) return res; - bprm.cred = prepare_exec_creds(); - res = -ENOMEM; - if (!bprm.cred) - goto out; - - /* We don't really care about recalculating credentials at this point - * as we're past the point of no return and are dealing with shared - * libraries. - */ - bprm.called_set_creds = 1; + res = kernel_read(bprm.file, bprm.buf, BINPRM_BUF_SIZE, &pos); - res = prepare_binprm(&bprm); - - if (!res) + if (res >= 0) res = load_flat_file(&bprm, libs, id, NULL); - abort_creds(bprm.cred); - -out: allow_write_access(bprm.file); fput(bprm.file); -- cgit v1.2.3 From 87cdb0596c561b150683cdad0f9bdf0faa4d7424 Mon Sep 17 00:00:00 2001 From: Jon Hunter Date: Wed, 5 Jun 2019 15:01:39 +0100 Subject: clk: tegra210: Fix default rates for HDA clocks commit 9caec6620f25b6d15646bbdb93062c872ba3b56f upstream. Currently the default clock rates for the HDA and HDA2CODEC_2X clocks are both 19.2MHz. However, the default rates for these clocks should actually be 51MHz and 48MHz, respectively. The current clock settings results in a distorted output during audio playback. Correct the default clock rates for these clocks by specifying them in the clock init table for Tegra210. Cc: stable@vger.kernel.org Signed-off-by: Jon Hunter Acked-by: Thierry Reding Signed-off-by: Stephen Boyd Signed-off-by: Greg Kroah-Hartman --- drivers/clk/tegra/clk-tegra210.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/clk/tegra/clk-tegra210.c b/drivers/clk/tegra/clk-tegra210.c index 7545af763d7a..af4ace8f7369 100644 --- a/drivers/clk/tegra/clk-tegra210.c +++ b/drivers/clk/tegra/clk-tegra210.c @@ -3377,6 +3377,8 @@ static struct tegra_clk_init_table init_table[] __initdata = { { TEGRA210_CLK_I2S3_SYNC, TEGRA210_CLK_CLK_MAX, 24576000, 0 }, { TEGRA210_CLK_I2S4_SYNC, TEGRA210_CLK_CLK_MAX, 24576000, 0 }, { TEGRA210_CLK_VIMCLK_SYNC, TEGRA210_CLK_CLK_MAX, 24576000, 0 }, + { TEGRA210_CLK_HDA, TEGRA210_CLK_PLL_P, 51000000, 0 }, + { TEGRA210_CLK_HDA2CODEC_2X, TEGRA210_CLK_PLL_P, 48000000, 0 }, /* This MUST be the last entry. */ { TEGRA210_CLK_CLK_MAX, TEGRA210_CLK_CLK_MAX, 0, 0 }, }; -- cgit v1.2.3 From 261b9429c577f913d3a5c60aa914901ed699e97a Mon Sep 17 00:00:00 2001 From: Dinh Nguyen Date: Fri, 7 Jun 2019 10:12:46 -0500 Subject: clk: socfpga: stratix10: fix divider entry for the emac clocks commit 74684cce5ebd567b01e9bc0e9a1945c70a32f32f upstream. The fixed dividers for the emac clocks should be 2 not 4. Cc: stable@vger.kernel.org Signed-off-by: Dinh Nguyen Signed-off-by: Stephen Boyd Signed-off-by: Greg Kroah-Hartman --- drivers/clk/socfpga/clk-s10.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/clk/socfpga/clk-s10.c b/drivers/clk/socfpga/clk-s10.c index 8281dfbf38c2..5bed36e12951 100644 --- a/drivers/clk/socfpga/clk-s10.c +++ b/drivers/clk/socfpga/clk-s10.c @@ -103,9 +103,9 @@ static const struct stratix10_perip_cnt_clock s10_main_perip_cnt_clks[] = { { STRATIX10_NOC_CLK, "noc_clk", NULL, noc_mux, ARRAY_SIZE(noc_mux), 0, 0, 0, 0x3C, 1}, { STRATIX10_EMAC_A_FREE_CLK, "emaca_free_clk", NULL, emaca_free_mux, ARRAY_SIZE(emaca_free_mux), - 0, 0, 4, 0xB0, 0}, + 0, 0, 2, 0xB0, 0}, { STRATIX10_EMAC_B_FREE_CLK, "emacb_free_clk", NULL, emacb_free_mux, ARRAY_SIZE(emacb_free_mux), - 0, 0, 4, 0xB0, 1}, + 0, 0, 2, 0xB0, 1}, { STRATIX10_EMAC_PTP_FREE_CLK, "emac_ptp_free_clk", NULL, emac_ptp_free_mux, ARRAY_SIZE(emac_ptp_free_mux), 0, 0, 4, 0xB0, 2}, { STRATIX10_GPIO_DB_FREE_CLK, "gpio_db_free_clk", NULL, gpio_db_free_mux, -- cgit v1.2.3 From 64e3d1c9f0e8001d0bd5ecedc0641b88755d255c Mon Sep 17 00:00:00 2001 From: Ville Syrjälä Date: Wed, 20 Mar 2019 15:54:36 +0200 Subject: drm/i915: Force 2*96 MHz cdclk on glk/cnl when audio power is enabled MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit commit 905801fe72377b4dc53c6e13eea1a91c6a4aa0c4 upstream. CDCLK has to be at least twice the BLCK regardless of audio. Audio driver has to probe using this hook and increase the clock even in absence of any display. v2: Use atomic refcount for get_power, put_power so that we can call each once(Abhay). v3: Reset power well 2 to avoid any transaction on iDisp link during cdclk change(Abhay). v4: Remove Power well 2 reset workaround(Ville). v5: Remove unwanted Power well 2 register defined in v4(Abhay). v6: - Use a dedicated flag instead of state->modeset for min CDCLK changes - Make get/put audio power domain symmetric - Rebased on top of intel_wakeref tracking changes. Signed-off-by: Ville Syrjälä Signed-off-by: Abhay Kumar Tested-by: Abhay Kumar Signed-off-by: Imre Deak Reviewed-by: Clint Taylor Link: https://patchwork.freedesktop.org/patch/msgid/20190320135439.12201-1-imre.deak@intel.com Cc: # 5.1.x Signed-off-by: Jian-Hong Pan Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=203623 Buglink: https://bugs.freedesktop.org/show_bug.cgi?id=110916 Link: https://www.spinics.net/lists/stable/msg310910.html Signed-off-by: Greg Kroah-Hartman --- drivers/gpu/drm/i915/i915_drv.h | 3 ++ drivers/gpu/drm/i915/intel_audio.c | 62 ++++++++++++++++++++++++++++++++++-- drivers/gpu/drm/i915/intel_cdclk.c | 30 ++++++----------- drivers/gpu/drm/i915/intel_display.c | 9 +++++- drivers/gpu/drm/i915/intel_drv.h | 3 ++ 5 files changed, 83 insertions(+), 24 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index a67a63b5aa84..5c9bb1b2d1f3 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -1622,6 +1622,8 @@ struct drm_i915_private { struct intel_cdclk_state actual; /* The current hardware cdclk state */ struct intel_cdclk_state hw; + + int force_min_cdclk; } cdclk; /** @@ -1741,6 +1743,7 @@ struct drm_i915_private { * */ struct mutex av_mutex; + int audio_power_refcount; struct { struct mutex mutex; diff --git a/drivers/gpu/drm/i915/intel_audio.c b/drivers/gpu/drm/i915/intel_audio.c index 5104c6bbd66f..2f3fd5bf0f11 100644 --- a/drivers/gpu/drm/i915/intel_audio.c +++ b/drivers/gpu/drm/i915/intel_audio.c @@ -741,15 +741,71 @@ void intel_init_audio_hooks(struct drm_i915_private *dev_priv) } } +static void glk_force_audio_cdclk(struct drm_i915_private *dev_priv, + bool enable) +{ + struct drm_modeset_acquire_ctx ctx; + struct drm_atomic_state *state; + int ret; + + drm_modeset_acquire_init(&ctx, 0); + state = drm_atomic_state_alloc(&dev_priv->drm); + if (WARN_ON(!state)) + return; + + state->acquire_ctx = &ctx; + +retry: + to_intel_atomic_state(state)->cdclk.force_min_cdclk_changed = true; + to_intel_atomic_state(state)->cdclk.force_min_cdclk = + enable ? 2 * 96000 : 0; + + /* + * Protects dev_priv->cdclk.force_min_cdclk + * Need to lock this here in case we have no active pipes + * and thus wouldn't lock it during the commit otherwise. + */ + ret = drm_modeset_lock(&dev_priv->drm.mode_config.connection_mutex, + &ctx); + if (!ret) + ret = drm_atomic_commit(state); + + if (ret == -EDEADLK) { + drm_atomic_state_clear(state); + drm_modeset_backoff(&ctx); + goto retry; + } + + WARN_ON(ret); + + drm_atomic_state_put(state); + + drm_modeset_drop_locks(&ctx); + drm_modeset_acquire_fini(&ctx); +} + static void i915_audio_component_get_power(struct device *kdev) { - intel_display_power_get(kdev_to_i915(kdev), POWER_DOMAIN_AUDIO); + struct drm_i915_private *dev_priv = kdev_to_i915(kdev); + + intel_display_power_get(dev_priv, POWER_DOMAIN_AUDIO); + + /* Force CDCLK to 2*BCLK as long as we need audio to be powered. */ + if (dev_priv->audio_power_refcount++ == 0) + if (IS_CANNONLAKE(dev_priv) || IS_GEMINILAKE(dev_priv)) + glk_force_audio_cdclk(dev_priv, true); } static void i915_audio_component_put_power(struct device *kdev) { - intel_display_power_put_unchecked(kdev_to_i915(kdev), - POWER_DOMAIN_AUDIO); + struct drm_i915_private *dev_priv = kdev_to_i915(kdev); + + /* Stop forcing CDCLK to 2*BCLK if no need for audio to be powered. */ + if (--dev_priv->audio_power_refcount == 0) + if (IS_CANNONLAKE(dev_priv) || IS_GEMINILAKE(dev_priv)) + glk_force_audio_cdclk(dev_priv, false); + + intel_display_power_put_unchecked(dev_priv, POWER_DOMAIN_AUDIO); } static void i915_audio_component_codec_wake_override(struct device *kdev, diff --git a/drivers/gpu/drm/i915/intel_cdclk.c b/drivers/gpu/drm/i915/intel_cdclk.c index 15ba950dee00..553f57ff60f4 100644 --- a/drivers/gpu/drm/i915/intel_cdclk.c +++ b/drivers/gpu/drm/i915/intel_cdclk.c @@ -2187,19 +2187,8 @@ int intel_crtc_compute_min_cdclk(const struct intel_crtc_state *crtc_state) /* * According to BSpec, "The CD clock frequency must be at least twice * the frequency of the Azalia BCLK." and BCLK is 96 MHz by default. - * - * FIXME: Check the actual, not default, BCLK being used. - * - * FIXME: This does not depend on ->has_audio because the higher CDCLK - * is required for audio probe, also when there are no audio capable - * displays connected at probe time. This leads to unnecessarily high - * CDCLK when audio is not required. - * - * FIXME: This limit is only applied when there are displays connected - * at probe time. If we probe without displays, we'll still end up using - * the platform minimum CDCLK, failing audio probe. */ - if (INTEL_GEN(dev_priv) >= 9) + if (crtc_state->has_audio && INTEL_GEN(dev_priv) >= 9) min_cdclk = max(2 * 96000, min_cdclk); /* @@ -2239,7 +2228,7 @@ static int intel_compute_min_cdclk(struct drm_atomic_state *state) intel_state->min_cdclk[i] = min_cdclk; } - min_cdclk = 0; + min_cdclk = intel_state->cdclk.force_min_cdclk; for_each_pipe(dev_priv, pipe) min_cdclk = max(intel_state->min_cdclk[pipe], min_cdclk); @@ -2300,7 +2289,8 @@ static int vlv_modeset_calc_cdclk(struct drm_atomic_state *state) vlv_calc_voltage_level(dev_priv, cdclk); if (!intel_state->active_crtcs) { - cdclk = vlv_calc_cdclk(dev_priv, 0); + cdclk = vlv_calc_cdclk(dev_priv, + intel_state->cdclk.force_min_cdclk); intel_state->cdclk.actual.cdclk = cdclk; intel_state->cdclk.actual.voltage_level = @@ -2333,7 +2323,7 @@ static int bdw_modeset_calc_cdclk(struct drm_atomic_state *state) bdw_calc_voltage_level(cdclk); if (!intel_state->active_crtcs) { - cdclk = bdw_calc_cdclk(0); + cdclk = bdw_calc_cdclk(intel_state->cdclk.force_min_cdclk); intel_state->cdclk.actual.cdclk = cdclk; intel_state->cdclk.actual.voltage_level = @@ -2405,7 +2395,7 @@ static int skl_modeset_calc_cdclk(struct drm_atomic_state *state) skl_calc_voltage_level(cdclk); if (!intel_state->active_crtcs) { - cdclk = skl_calc_cdclk(0, vco); + cdclk = skl_calc_cdclk(intel_state->cdclk.force_min_cdclk, vco); intel_state->cdclk.actual.vco = vco; intel_state->cdclk.actual.cdclk = cdclk; @@ -2444,10 +2434,10 @@ static int bxt_modeset_calc_cdclk(struct drm_atomic_state *state) if (!intel_state->active_crtcs) { if (IS_GEMINILAKE(dev_priv)) { - cdclk = glk_calc_cdclk(0); + cdclk = glk_calc_cdclk(intel_state->cdclk.force_min_cdclk); vco = glk_de_pll_vco(dev_priv, cdclk); } else { - cdclk = bxt_calc_cdclk(0); + cdclk = bxt_calc_cdclk(intel_state->cdclk.force_min_cdclk); vco = bxt_de_pll_vco(dev_priv, cdclk); } @@ -2483,7 +2473,7 @@ static int cnl_modeset_calc_cdclk(struct drm_atomic_state *state) cnl_compute_min_voltage_level(intel_state)); if (!intel_state->active_crtcs) { - cdclk = cnl_calc_cdclk(0); + cdclk = cnl_calc_cdclk(intel_state->cdclk.force_min_cdclk); vco = cnl_cdclk_pll_vco(dev_priv, cdclk); intel_state->cdclk.actual.vco = vco; @@ -2519,7 +2509,7 @@ static int icl_modeset_calc_cdclk(struct drm_atomic_state *state) cnl_compute_min_voltage_level(intel_state)); if (!intel_state->active_crtcs) { - cdclk = icl_calc_cdclk(0, ref); + cdclk = icl_calc_cdclk(intel_state->cdclk.force_min_cdclk, ref); vco = icl_calc_cdclk_pll_vco(dev_priv, cdclk); intel_state->cdclk.actual.vco = vco; diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index be4024f0e3a8..92b8c1c42fd8 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -12770,6 +12770,11 @@ static int intel_modeset_checks(struct drm_atomic_state *state) return -EINVAL; } + /* keep the current setting */ + if (!intel_state->cdclk.force_min_cdclk_changed) + intel_state->cdclk.force_min_cdclk = + dev_priv->cdclk.force_min_cdclk; + intel_state->modeset = true; intel_state->active_crtcs = dev_priv->active_crtcs; intel_state->cdclk.logical = dev_priv->cdclk.logical; @@ -12892,7 +12897,7 @@ static int intel_atomic_check(struct drm_device *dev, struct drm_crtc *crtc; struct drm_crtc_state *old_crtc_state, *crtc_state; int ret, i; - bool any_ms = false; + bool any_ms = intel_state->cdclk.force_min_cdclk_changed; /* Catch I915_MODE_FLAG_INHERITED */ for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, @@ -13480,6 +13485,8 @@ static int intel_atomic_commit(struct drm_device *dev, dev_priv->active_crtcs = intel_state->active_crtcs; dev_priv->cdclk.logical = intel_state->cdclk.logical; dev_priv->cdclk.actual = intel_state->cdclk.actual; + dev_priv->cdclk.force_min_cdclk = + intel_state->cdclk.force_min_cdclk; } drm_atomic_state_get(state); diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h index 18e17b422701..824935b87a12 100644 --- a/drivers/gpu/drm/i915/intel_drv.h +++ b/drivers/gpu/drm/i915/intel_drv.h @@ -480,6 +480,9 @@ struct intel_atomic_state { * state only when all crtc's are DPMS off. */ struct intel_cdclk_state actual; + + int force_min_cdclk; + bool force_min_cdclk_changed; } cdclk; bool dpll_set, modeset; -- cgit v1.2.3 From ca2d66597e7456c0cbfd68067db14239730a4fac Mon Sep 17 00:00:00 2001 From: Imre Deak Date: Wed, 20 Mar 2019 15:54:37 +0200 Subject: drm/i915: Save the old CDCLK atomic state MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit commit 48d9f87ddd2108663fd866b254e05d422243cc56 upstream. The old state will be needed by an upcoming patch to determine if the commit increases or decreases CDCLK, so move the old state to the atomic state (while keeping the new one in dev_priv). cdclk.logical and cdclk.actual in the atomic state isn't used atm anywhere after the atomic check phase, so this should be safe. v2: - Use swap() instead of opencoding it. (Ville) Suggested-by: Ville Syrjälä Cc: Ville Syrjälä Signed-off-by: Imre Deak Reviewed-by: Ville Syrjälä Link: https://patchwork.freedesktop.org/patch/msgid/20190320135439.12201-2-imre.deak@intel.com Signed-off-by: Jian-Hong Pan Signed-off-by: Greg Kroah-Hartman --- drivers/gpu/drm/i915/intel_cdclk.c | 20 ++++++++++++++++++++ drivers/gpu/drm/i915/intel_display.c | 4 ++-- drivers/gpu/drm/i915/intel_drv.h | 1 + 3 files changed, 23 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/intel_cdclk.c b/drivers/gpu/drm/i915/intel_cdclk.c index 553f57ff60f4..9814a6f2b3c4 100644 --- a/drivers/gpu/drm/i915/intel_cdclk.c +++ b/drivers/gpu/drm/i915/intel_cdclk.c @@ -2100,6 +2100,26 @@ bool intel_cdclk_changed(const struct intel_cdclk_state *a, a->voltage_level != b->voltage_level; } +/** + * intel_cdclk_swap_state - make atomic CDCLK configuration effective + * @state: atomic state + * + * This is the CDCLK version of drm_atomic_helper_swap_state() since the + * helper does not handle driver-specific global state. + * + * Similarly to the atomic helpers this function does a complete swap, + * i.e. it also puts the old state into @state. This is used by the commit + * code to determine how CDCLK has changed (for instance did it increase or + * decrease). + */ +void intel_cdclk_swap_state(struct intel_atomic_state *state) +{ + struct drm_i915_private *dev_priv = to_i915(state->base.dev); + + swap(state->cdclk.logical, dev_priv->cdclk.logical); + swap(state->cdclk.actual, dev_priv->cdclk.actual); +} + void intel_dump_cdclk_state(const struct intel_cdclk_state *cdclk_state, const char *context) { diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index 92b8c1c42fd8..7e9b63a8df59 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -13483,10 +13483,10 @@ static int intel_atomic_commit(struct drm_device *dev, intel_state->min_voltage_level, sizeof(intel_state->min_voltage_level)); dev_priv->active_crtcs = intel_state->active_crtcs; - dev_priv->cdclk.logical = intel_state->cdclk.logical; - dev_priv->cdclk.actual = intel_state->cdclk.actual; dev_priv->cdclk.force_min_cdclk = intel_state->cdclk.force_min_cdclk; + + intel_cdclk_swap_state(intel_state); } drm_atomic_state_get(state); diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h index 824935b87a12..2a11dfc28a2a 100644 --- a/drivers/gpu/drm/i915/intel_drv.h +++ b/drivers/gpu/drm/i915/intel_drv.h @@ -1597,6 +1597,7 @@ bool intel_cdclk_needs_modeset(const struct intel_cdclk_state *a, const struct intel_cdclk_state *b); bool intel_cdclk_changed(const struct intel_cdclk_state *a, const struct intel_cdclk_state *b); +void intel_cdclk_swap_state(struct intel_atomic_state *state); void intel_set_cdclk(struct drm_i915_private *dev_priv, const struct intel_cdclk_state *cdclk_state); void intel_dump_cdclk_state(const struct intel_cdclk_state *cdclk_state, -- cgit v1.2.3 From 994f9ddbd1a5fbfea297b5760f862d3e866e1149 Mon Sep 17 00:00:00 2001 From: Imre Deak Date: Wed, 20 Mar 2019 15:54:38 +0200 Subject: drm/i915: Remove redundant store of logical CDCLK state MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit commit 2b21dfbeee725778daed2c3dd45a3fc808176feb upstream. We copied the original state into the atomic state already earlier in the function, so no need to do it a second time. Cc: Ville Syrjälä Signed-off-by: Imre Deak Reviewed-by: Ville Syrjälä Link: https://patchwork.freedesktop.org/patch/msgid/20190320135439.12201-3-imre.deak@intel.com Signed-off-by: Greg Kroah-Hartman Signed-off-by: Jian-Hong Pan --- drivers/gpu/drm/i915/intel_display.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index 7e9b63a8df59..e25032e4192c 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -12828,8 +12828,6 @@ static int intel_modeset_checks(struct drm_atomic_state *state) DRM_DEBUG_KMS("New voltage level calculated to be logical %u, actual %u\n", intel_state->cdclk.logical.voltage_level, intel_state->cdclk.actual.voltage_level); - } else { - to_intel_atomic_state(state)->cdclk.logical = dev_priv->cdclk.logical; } intel_modeset_clear_plls(state); -- cgit v1.2.3 From ec73bed9027f26336ab9f4a40970324e42e478a5 Mon Sep 17 00:00:00 2001 From: Ville Syrjälä Date: Wed, 27 Mar 2019 12:13:21 +0200 Subject: drm/i915: Skip modeset for cdclk changes if possible MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit commit 59f9e9cab3a1e6762fb707d0d829b982930f1349 upstream. If we have only a single active pipe and the cdclk change only requires the cd2x divider to be updated bxt+ can do the update with forcing a full modeset on the pipe. Try to hook that up. v2: - Wait for vblank after an optimized CDCLK change. - Avoid optimization if the pipe needs a modeset (or was disabled). - Split CDCLK change to a pre/post plane update step. v3: - Use correct version of CDCLK state as old state. (Ville) - Remove unused intel_cdclk_can_skip_modeset() v4: - For consistency call intel_set_cdclk_post_plane_update() only during modesets (and not fastsets). v5: - Remove the logic to update the CD2X divider on-the-fly on ICL, since only a divider of 1 is supported there. Clint also noticed that the pipe select bits in CDCLK_CTL are oddly defined on ICL, it's not clear yet whether that's only an error in the specification. Signed-off-by: Ville Syrjälä Signed-off-by: Abhay Kumar Tested-by: Abhay Kumar Signed-off-by: Imre Deak Reviewed-by: Clint Taylor Link: https://patchwork.freedesktop.org/patch/msgid/20190327101321.3095-1-imre.deak@intel.com Signed-off-by: Jian-Hong Pan Signed-off-by: Greg Kroah-Hartman --- drivers/gpu/drm/i915/i915_drv.h | 3 +- drivers/gpu/drm/i915/intel_cdclk.c | 135 +++++++++++++++++++++++++++-------- drivers/gpu/drm/i915/intel_display.c | 42 ++++++++++- drivers/gpu/drm/i915/intel_drv.h | 17 ++++- 4 files changed, 163 insertions(+), 34 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 5c9bb1b2d1f3..0c4a76bca5c6 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -280,7 +280,8 @@ struct drm_i915_display_funcs { void (*get_cdclk)(struct drm_i915_private *dev_priv, struct intel_cdclk_state *cdclk_state); void (*set_cdclk)(struct drm_i915_private *dev_priv, - const struct intel_cdclk_state *cdclk_state); + const struct intel_cdclk_state *cdclk_state, + enum pipe pipe); int (*get_fifo_size)(struct drm_i915_private *dev_priv, enum i9xx_plane_id i9xx_plane); int (*compute_pipe_wm)(struct intel_crtc_state *cstate); diff --git a/drivers/gpu/drm/i915/intel_cdclk.c b/drivers/gpu/drm/i915/intel_cdclk.c index 9814a6f2b3c4..00f76261924c 100644 --- a/drivers/gpu/drm/i915/intel_cdclk.c +++ b/drivers/gpu/drm/i915/intel_cdclk.c @@ -516,7 +516,8 @@ static void vlv_program_pfi_credits(struct drm_i915_private *dev_priv) } static void vlv_set_cdclk(struct drm_i915_private *dev_priv, - const struct intel_cdclk_state *cdclk_state) + const struct intel_cdclk_state *cdclk_state, + enum pipe pipe) { int cdclk = cdclk_state->cdclk; u32 val, cmd = cdclk_state->voltage_level; @@ -598,7 +599,8 @@ static void vlv_set_cdclk(struct drm_i915_private *dev_priv, } static void chv_set_cdclk(struct drm_i915_private *dev_priv, - const struct intel_cdclk_state *cdclk_state) + const struct intel_cdclk_state *cdclk_state, + enum pipe pipe) { int cdclk = cdclk_state->cdclk; u32 val, cmd = cdclk_state->voltage_level; @@ -697,7 +699,8 @@ static void bdw_get_cdclk(struct drm_i915_private *dev_priv, } static void bdw_set_cdclk(struct drm_i915_private *dev_priv, - const struct intel_cdclk_state *cdclk_state) + const struct intel_cdclk_state *cdclk_state, + enum pipe pipe) { int cdclk = cdclk_state->cdclk; u32 val; @@ -987,7 +990,8 @@ static void skl_dpll0_disable(struct drm_i915_private *dev_priv) } static void skl_set_cdclk(struct drm_i915_private *dev_priv, - const struct intel_cdclk_state *cdclk_state) + const struct intel_cdclk_state *cdclk_state, + enum pipe pipe) { int cdclk = cdclk_state->cdclk; int vco = cdclk_state->vco; @@ -1158,7 +1162,7 @@ void skl_init_cdclk(struct drm_i915_private *dev_priv) cdclk_state.cdclk = skl_calc_cdclk(0, cdclk_state.vco); cdclk_state.voltage_level = skl_calc_voltage_level(cdclk_state.cdclk); - skl_set_cdclk(dev_priv, &cdclk_state); + skl_set_cdclk(dev_priv, &cdclk_state, INVALID_PIPE); } /** @@ -1176,7 +1180,7 @@ void skl_uninit_cdclk(struct drm_i915_private *dev_priv) cdclk_state.vco = 0; cdclk_state.voltage_level = skl_calc_voltage_level(cdclk_state.cdclk); - skl_set_cdclk(dev_priv, &cdclk_state); + skl_set_cdclk(dev_priv, &cdclk_state, INVALID_PIPE); } static int bxt_calc_cdclk(int min_cdclk) @@ -1355,7 +1359,8 @@ static void bxt_de_pll_enable(struct drm_i915_private *dev_priv, int vco) } static void bxt_set_cdclk(struct drm_i915_private *dev_priv, - const struct intel_cdclk_state *cdclk_state) + const struct intel_cdclk_state *cdclk_state, + enum pipe pipe) { int cdclk = cdclk_state->cdclk; int vco = cdclk_state->vco; @@ -1408,11 +1413,10 @@ static void bxt_set_cdclk(struct drm_i915_private *dev_priv, bxt_de_pll_enable(dev_priv, vco); val = divider | skl_cdclk_decimal(cdclk); - /* - * FIXME if only the cd2x divider needs changing, it could be done - * without shutting off the pipe (if only one pipe is active). - */ - val |= BXT_CDCLK_CD2X_PIPE_NONE; + if (pipe == INVALID_PIPE) + val |= BXT_CDCLK_CD2X_PIPE_NONE; + else + val |= BXT_CDCLK_CD2X_PIPE(pipe); /* * Disable SSA Precharge when CD clock frequency < 500 MHz, * enable otherwise. @@ -1421,6 +1425,9 @@ static void bxt_set_cdclk(struct drm_i915_private *dev_priv, val |= BXT_CDCLK_SSA_PRECHARGE_ENABLE; I915_WRITE(CDCLK_CTL, val); + if (pipe != INVALID_PIPE) + intel_wait_for_vblank(dev_priv, pipe); + mutex_lock(&dev_priv->pcu_lock); /* * The timeout isn't specified, the 2ms used here is based on @@ -1525,7 +1532,7 @@ void bxt_init_cdclk(struct drm_i915_private *dev_priv) } cdclk_state.voltage_level = bxt_calc_voltage_level(cdclk_state.cdclk); - bxt_set_cdclk(dev_priv, &cdclk_state); + bxt_set_cdclk(dev_priv, &cdclk_state, INVALID_PIPE); } /** @@ -1543,7 +1550,7 @@ void bxt_uninit_cdclk(struct drm_i915_private *dev_priv) cdclk_state.vco = 0; cdclk_state.voltage_level = bxt_calc_voltage_level(cdclk_state.cdclk); - bxt_set_cdclk(dev_priv, &cdclk_state); + bxt_set_cdclk(dev_priv, &cdclk_state, INVALID_PIPE); } static int cnl_calc_cdclk(int min_cdclk) @@ -1663,7 +1670,8 @@ static void cnl_cdclk_pll_enable(struct drm_i915_private *dev_priv, int vco) } static void cnl_set_cdclk(struct drm_i915_private *dev_priv, - const struct intel_cdclk_state *cdclk_state) + const struct intel_cdclk_state *cdclk_state, + enum pipe pipe) { int cdclk = cdclk_state->cdclk; int vco = cdclk_state->vco; @@ -1704,13 +1712,15 @@ static void cnl_set_cdclk(struct drm_i915_private *dev_priv, cnl_cdclk_pll_enable(dev_priv, vco); val = divider | skl_cdclk_decimal(cdclk); - /* - * FIXME if only the cd2x divider needs changing, it could be done - * without shutting off the pipe (if only one pipe is active). - */ - val |= BXT_CDCLK_CD2X_PIPE_NONE; + if (pipe == INVALID_PIPE) + val |= BXT_CDCLK_CD2X_PIPE_NONE; + else + val |= BXT_CDCLK_CD2X_PIPE(pipe); I915_WRITE(CDCLK_CTL, val); + if (pipe != INVALID_PIPE) + intel_wait_for_vblank(dev_priv, pipe); + /* inform PCU of the change */ mutex_lock(&dev_priv->pcu_lock); sandybridge_pcode_write(dev_priv, SKL_PCODE_CDCLK_CONTROL, @@ -1847,7 +1857,8 @@ static int icl_calc_cdclk_pll_vco(struct drm_i915_private *dev_priv, int cdclk) } static void icl_set_cdclk(struct drm_i915_private *dev_priv, - const struct intel_cdclk_state *cdclk_state) + const struct intel_cdclk_state *cdclk_state, + enum pipe pipe) { unsigned int cdclk = cdclk_state->cdclk; unsigned int vco = cdclk_state->vco; @@ -1872,6 +1883,11 @@ static void icl_set_cdclk(struct drm_i915_private *dev_priv, if (dev_priv->cdclk.hw.vco != vco) cnl_cdclk_pll_enable(dev_priv, vco); + /* + * On ICL CD2X_DIV can only be 1, so we'll never end up changing the + * divider here synchronized to a pipe while CDCLK is on, nor will we + * need the corresponding vblank wait. + */ I915_WRITE(CDCLK_CTL, ICL_CDCLK_CD2X_PIPE_NONE | skl_cdclk_decimal(cdclk)); @@ -2002,7 +2018,7 @@ sanitize: sanitized_state.voltage_level = icl_calc_voltage_level(sanitized_state.cdclk); - icl_set_cdclk(dev_priv, &sanitized_state); + icl_set_cdclk(dev_priv, &sanitized_state, INVALID_PIPE); } /** @@ -2020,7 +2036,7 @@ void icl_uninit_cdclk(struct drm_i915_private *dev_priv) cdclk_state.vco = 0; cdclk_state.voltage_level = icl_calc_voltage_level(cdclk_state.cdclk); - icl_set_cdclk(dev_priv, &cdclk_state); + icl_set_cdclk(dev_priv, &cdclk_state, INVALID_PIPE); } /** @@ -2048,7 +2064,7 @@ void cnl_init_cdclk(struct drm_i915_private *dev_priv) cdclk_state.vco = cnl_cdclk_pll_vco(dev_priv, cdclk_state.cdclk); cdclk_state.voltage_level = cnl_calc_voltage_level(cdclk_state.cdclk); - cnl_set_cdclk(dev_priv, &cdclk_state); + cnl_set_cdclk(dev_priv, &cdclk_state, INVALID_PIPE); } /** @@ -2066,7 +2082,7 @@ void cnl_uninit_cdclk(struct drm_i915_private *dev_priv) cdclk_state.vco = 0; cdclk_state.voltage_level = cnl_calc_voltage_level(cdclk_state.cdclk); - cnl_set_cdclk(dev_priv, &cdclk_state); + cnl_set_cdclk(dev_priv, &cdclk_state, INVALID_PIPE); } /** @@ -2085,6 +2101,27 @@ bool intel_cdclk_needs_modeset(const struct intel_cdclk_state *a, a->ref != b->ref; } +/** + * intel_cdclk_needs_cd2x_update - Determine if two CDCLK states require a cd2x divider update + * @a: first CDCLK state + * @b: second CDCLK state + * + * Returns: + * True if the CDCLK states require just a cd2x divider update, false if not. + */ +bool intel_cdclk_needs_cd2x_update(struct drm_i915_private *dev_priv, + const struct intel_cdclk_state *a, + const struct intel_cdclk_state *b) +{ + /* Older hw doesn't have the capability */ + if (INTEL_GEN(dev_priv) < 10 && !IS_GEN9_LP(dev_priv)) + return false; + + return a->cdclk != b->cdclk && + a->vco == b->vco && + a->ref == b->ref; +} + /** * intel_cdclk_changed - Determine if two CDCLK states are different * @a: first CDCLK state @@ -2133,12 +2170,14 @@ void intel_dump_cdclk_state(const struct intel_cdclk_state *cdclk_state, * intel_set_cdclk - Push the CDCLK state to the hardware * @dev_priv: i915 device * @cdclk_state: new CDCLK state + * @pipe: pipe with which to synchronize the update * * Program the hardware based on the passed in CDCLK state, * if necessary. */ -void intel_set_cdclk(struct drm_i915_private *dev_priv, - const struct intel_cdclk_state *cdclk_state) +static void intel_set_cdclk(struct drm_i915_private *dev_priv, + const struct intel_cdclk_state *cdclk_state, + enum pipe pipe) { if (!intel_cdclk_changed(&dev_priv->cdclk.hw, cdclk_state)) return; @@ -2148,7 +2187,7 @@ void intel_set_cdclk(struct drm_i915_private *dev_priv, intel_dump_cdclk_state(cdclk_state, "Changing CDCLK to"); - dev_priv->display.set_cdclk(dev_priv, cdclk_state); + dev_priv->display.set_cdclk(dev_priv, cdclk_state, pipe); if (WARN(intel_cdclk_changed(&dev_priv->cdclk.hw, cdclk_state), "cdclk state doesn't match!\n")) { @@ -2157,6 +2196,46 @@ void intel_set_cdclk(struct drm_i915_private *dev_priv, } } +/** + * intel_set_cdclk_pre_plane_update - Push the CDCLK state to the hardware + * @dev_priv: i915 device + * @old_state: old CDCLK state + * @new_state: new CDCLK state + * @pipe: pipe with which to synchronize the update + * + * Program the hardware before updating the HW plane state based on the passed + * in CDCLK state, if necessary. + */ +void +intel_set_cdclk_pre_plane_update(struct drm_i915_private *dev_priv, + const struct intel_cdclk_state *old_state, + const struct intel_cdclk_state *new_state, + enum pipe pipe) +{ + if (pipe == INVALID_PIPE || old_state->cdclk <= new_state->cdclk) + intel_set_cdclk(dev_priv, new_state, pipe); +} + +/** + * intel_set_cdclk_post_plane_update - Push the CDCLK state to the hardware + * @dev_priv: i915 device + * @old_state: old CDCLK state + * @new_state: new CDCLK state + * @pipe: pipe with which to synchronize the update + * + * Program the hardware after updating the HW plane state based on the passed + * in CDCLK state, if necessary. + */ +void +intel_set_cdclk_post_plane_update(struct drm_i915_private *dev_priv, + const struct intel_cdclk_state *old_state, + const struct intel_cdclk_state *new_state, + enum pipe pipe) +{ + if (pipe != INVALID_PIPE && old_state->cdclk > new_state->cdclk) + intel_set_cdclk(dev_priv, new_state, pipe); +} + static int intel_pixel_rate_to_cdclk(struct drm_i915_private *dev_priv, int pixel_rate) { diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index e25032e4192c..9803d713f746 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -12779,6 +12779,7 @@ static int intel_modeset_checks(struct drm_atomic_state *state) intel_state->active_crtcs = dev_priv->active_crtcs; intel_state->cdclk.logical = dev_priv->cdclk.logical; intel_state->cdclk.actual = dev_priv->cdclk.actual; + intel_state->cdclk.pipe = INVALID_PIPE; for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, new_crtc_state, i) { if (new_crtc_state->active) @@ -12798,6 +12799,8 @@ static int intel_modeset_checks(struct drm_atomic_state *state) * adjusted_mode bits in the crtc directly. */ if (dev_priv->display.modeset_calc_cdclk) { + enum pipe pipe; + ret = dev_priv->display.modeset_calc_cdclk(state); if (ret < 0) return ret; @@ -12814,12 +12817,36 @@ static int intel_modeset_checks(struct drm_atomic_state *state) return ret; } + if (is_power_of_2(intel_state->active_crtcs)) { + struct drm_crtc *crtc; + struct drm_crtc_state *crtc_state; + + pipe = ilog2(intel_state->active_crtcs); + crtc = &intel_get_crtc_for_pipe(dev_priv, pipe)->base; + crtc_state = drm_atomic_get_new_crtc_state(state, crtc); + if (crtc_state && needs_modeset(crtc_state)) + pipe = INVALID_PIPE; + } else { + pipe = INVALID_PIPE; + } + /* All pipes must be switched off while we change the cdclk. */ - if (intel_cdclk_needs_modeset(&dev_priv->cdclk.actual, - &intel_state->cdclk.actual)) { + if (pipe != INVALID_PIPE && + intel_cdclk_needs_cd2x_update(dev_priv, + &dev_priv->cdclk.actual, + &intel_state->cdclk.actual)) { + ret = intel_lock_all_pipes(state); + if (ret < 0) + return ret; + + intel_state->cdclk.pipe = pipe; + } else if (intel_cdclk_needs_modeset(&dev_priv->cdclk.actual, + &intel_state->cdclk.actual)) { ret = intel_modeset_all_pipes(state); if (ret < 0) return ret; + + intel_state->cdclk.pipe = INVALID_PIPE; } DRM_DEBUG_KMS("New cdclk calculated to be logical %u kHz, actual %u kHz\n", @@ -13251,7 +13278,10 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state) if (intel_state->modeset) { drm_atomic_helper_update_legacy_modeset_state(state->dev, state); - intel_set_cdclk(dev_priv, &dev_priv->cdclk.actual); + intel_set_cdclk_pre_plane_update(dev_priv, + &intel_state->cdclk.actual, + &dev_priv->cdclk.actual, + intel_state->cdclk.pipe); /* * SKL workaround: bspec recommends we disable the SAGV when we @@ -13280,6 +13310,12 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state) /* Now enable the clocks, plane, pipe, and connectors that we set up. */ dev_priv->display.update_crtcs(state); + if (intel_state->modeset) + intel_set_cdclk_post_plane_update(dev_priv, + &intel_state->cdclk.actual, + &dev_priv->cdclk.actual, + intel_state->cdclk.pipe); + /* FIXME: We should call drm_atomic_helper_commit_hw_done() here * already, but still need the state for the delayed optimization. To * fix this: diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h index 2a11dfc28a2a..ceaa05dfa0d0 100644 --- a/drivers/gpu/drm/i915/intel_drv.h +++ b/drivers/gpu/drm/i915/intel_drv.h @@ -483,6 +483,8 @@ struct intel_atomic_state { int force_min_cdclk; bool force_min_cdclk_changed; + /* pipe to which cd2x update is synchronized */ + enum pipe pipe; } cdclk; bool dpll_set, modeset; @@ -1593,13 +1595,24 @@ void intel_init_cdclk_hooks(struct drm_i915_private *dev_priv); void intel_update_max_cdclk(struct drm_i915_private *dev_priv); void intel_update_cdclk(struct drm_i915_private *dev_priv); void intel_update_rawclk(struct drm_i915_private *dev_priv); +bool intel_cdclk_needs_cd2x_update(struct drm_i915_private *dev_priv, + const struct intel_cdclk_state *a, + const struct intel_cdclk_state *b); bool intel_cdclk_needs_modeset(const struct intel_cdclk_state *a, const struct intel_cdclk_state *b); bool intel_cdclk_changed(const struct intel_cdclk_state *a, const struct intel_cdclk_state *b); void intel_cdclk_swap_state(struct intel_atomic_state *state); -void intel_set_cdclk(struct drm_i915_private *dev_priv, - const struct intel_cdclk_state *cdclk_state); +void +intel_set_cdclk_pre_plane_update(struct drm_i915_private *dev_priv, + const struct intel_cdclk_state *old_state, + const struct intel_cdclk_state *new_state, + enum pipe pipe); +void +intel_set_cdclk_post_plane_update(struct drm_i915_private *dev_priv, + const struct intel_cdclk_state *old_state, + const struct intel_cdclk_state *new_state, + enum pipe pipe); void intel_dump_cdclk_state(const struct intel_cdclk_state *cdclk_state, const char *context); -- cgit v1.2.3 From 897b17e012adce36976ce76f1ca36f2831927159 Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Fri, 28 Jun 2019 12:06:53 -0700 Subject: mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails commit b38e5962f8ed0d2a2b28a887fc2221f7f41db119 upstream. The pass/fail of soft offline should be judged by checking whether the raw error page was finally contained or not (i.e. the result of set_hwpoison_free_buddy_page()), but current code do not work like that. It might lead us to misjudge the test result when set_hwpoison_free_buddy_page() fails. Without this fix, there are cases where madvise(MADV_SOFT_OFFLINE) may not offline the original page and will not return an error. Link: http://lkml.kernel.org/r/1560154686-18497-2-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining") Reviewed-by: Mike Kravetz Reviewed-by: Oscar Salvador Cc: Michal Hocko Cc: Xishi Qiu Cc: "Chen, Jerry T" Cc: "Zhuo, Qiuxu" Cc: [4.19+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman --- mm/memory-failure.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index fc8b51744579..7ea485eb979c 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1733,6 +1733,8 @@ static int soft_offline_huge_page(struct page *page, int flags) if (!ret) { if (set_hwpoison_free_buddy_page(page)) num_poisoned_pages_inc(); + else + ret = -EBUSY; } } return ret; -- cgit v1.2.3 From 59d44003b0a91800abc17bfe4fdb00682bfdc364 Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Fri, 28 Jun 2019 12:06:56 -0700 Subject: mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge commit faf53def3b143df11062d87c12afe6afeb6f8cc7 upstream. madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline for hugepages with overcommitting enabled. That was caused by the suboptimal code in current soft-offline code. See the following part: ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL, MIGRATE_SYNC, MR_MEMORY_FAILURE); if (ret) { ... } else { /* * We set PG_hwpoison only when the migration source hugepage * was successfully dissolved, because otherwise hwpoisoned * hugepage remains on free hugepage list, then userspace will * find it as SIGBUS by allocation failure. That's not expected * in soft-offlining. */ ret = dissolve_free_huge_page(page); if (!ret) { if (set_hwpoison_free_buddy_page(page)) num_poisoned_pages_inc(); } } return ret; Here dissolve_free_huge_page() returns -EBUSY if the migration source page was freed into buddy in migrate_pages(), but even in that case we actually has a chance that set_hwpoison_free_buddy_page() succeeds. So that means current code gives up offlining too early now. dissolve_free_huge_page() checks that a given hugepage is suitable for dissolving, where we should return success for !PageHuge() case because the given hugepage is considered as already dissolved. This change also affects other callers of dissolve_free_huge_page(), which are cleaned up together. [n-horiguchi@ah.jp.nec.com: v3] Link: http://lkml.kernel.org/r/1560761476-4651-3-git-send-email-n-horiguchi@ah.jp.nec.comLink: http://lkml.kernel.org/r/1560154686-18497-3-git-send-email-n-horiguchi@ah.jp.nec.com Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining") Signed-off-by: Naoya Horiguchi Reported-by: Chen, Jerry T Tested-by: Chen, Jerry T Reviewed-by: Mike Kravetz Reviewed-by: Oscar Salvador Cc: Michal Hocko Cc: Xishi Qiu Cc: "Chen, Jerry T" Cc: "Zhuo, Qiuxu" Cc: [4.19+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman --- mm/hugetlb.c | 29 ++++++++++++++++++++--------- mm/memory-failure.c | 5 +---- 2 files changed, 21 insertions(+), 13 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5b4f00be325d..ec81808830ae 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1491,16 +1491,29 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed, /* * Dissolve a given free hugepage into free buddy pages. This function does - * nothing for in-use (including surplus) hugepages. Returns -EBUSY if the - * dissolution fails because a give page is not a free hugepage, or because - * free hugepages are fully reserved. + * nothing for in-use hugepages and non-hugepages. + * This function returns values like below: + * + * -EBUSY: failed to dissolved free hugepages or the hugepage is in-use + * (allocated or reserved.) + * 0: successfully dissolved free hugepages or the page is not a + * hugepage (considered as already dissolved) */ int dissolve_free_huge_page(struct page *page) { int rc = -EBUSY; + /* Not to disrupt normal path by vainly holding hugetlb_lock */ + if (!PageHuge(page)) + return 0; + spin_lock(&hugetlb_lock); - if (PageHuge(page) && !page_count(page)) { + if (!PageHuge(page)) { + rc = 0; + goto out; + } + + if (!page_count(page)) { struct page *head = compound_head(page); struct hstate *h = page_hstate(head); int nid = page_to_nid(head); @@ -1545,11 +1558,9 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn) for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) { page = pfn_to_page(pfn); - if (PageHuge(page) && !page_count(page)) { - rc = dissolve_free_huge_page(page); - if (rc) - break; - } + rc = dissolve_free_huge_page(page); + if (rc) + break; } return rc; diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 7ea485eb979c..3a83e279cc98 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1859,11 +1859,8 @@ static int soft_offline_in_use_page(struct page *page, int flags) static int soft_offline_free_page(struct page *page) { - int rc = 0; - struct page *head = compound_head(page); + int rc = dissolve_free_huge_page(page); - if (PageHuge(head)) - rc = dissolve_free_huge_page(page); if (!rc) { if (set_hwpoison_free_buddy_page(page)) num_poisoned_pages_inc(); -- cgit v1.2.3 From 00553cdd3377b0ff11a0d4e39e0b824049eab013 Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Fri, 28 Jun 2019 12:07:05 -0700 Subject: mm/page_idle.c: fix oops because end_pfn is larger than max_pfn commit 7298e3b0a149c91323b3205d325e942c3b3b9ef6 upstream. Currently the calcuation of end_pfn can round up the pfn number to more than the actual maximum number of pfns, causing an Oops. Fix this by ensuring end_pfn is never more than max_pfn. This can be easily triggered when on systems where the end_pfn gets rounded up to more than max_pfn using the idle-page stress-ng stress test: sudo stress-ng --idle-page 0 BUG: unable to handle kernel paging request at 00000000000020d8 #PF error: [normal kernel read fault] PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 1 PID: 11039 Comm: stress-ng-idle- Not tainted 5.0.0-5-generic #6-Ubuntu Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:page_idle_get_page+0xc8/0x1a0 Code: 0f b1 0a 75 7d 48 8b 03 48 89 c2 48 c1 e8 33 83 e0 07 48 c1 ea 36 48 8d 0c 40 4c 8d 24 88 49 c1 e4 07 4c 03 24 d5 00 89 c3 be <49> 8b 44 24 58 48 8d b8 80 a1 02 00 e8 07 d5 77 00 48 8b 53 08 48 RSP: 0018:ffffafd7c672fde8 EFLAGS: 00010202 RAX: 0000000000000005 RBX: ffffe36341fff700 RCX: 000000000000000f RDX: 0000000000000284 RSI: 0000000000000275 RDI: 0000000001fff700 RBP: ffffafd7c672fe00 R08: ffffa0bc34056410 R09: 0000000000000276 R10: ffffa0bc754e9b40 R11: ffffa0bc330f6400 R12: 0000000000002080 R13: ffffe36341fff700 R14: 0000000000080000 R15: ffffa0bc330f6400 FS: 00007f0ec1ea5740(0000) GS:ffffa0bc7db00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000020d8 CR3: 0000000077d68000 CR4: 00000000000006e0 Call Trace: page_idle_bitmap_write+0x8c/0x140 sysfs_kf_bin_write+0x5c/0x70 kernfs_fop_write+0x12e/0x1b0 __vfs_write+0x1b/0x40 vfs_write+0xab/0x1b0 ksys_write+0x55/0xc0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x5a/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Link: http://lkml.kernel.org/r/20190618124352.28307-1-colin.king@canonical.com Fixes: 33c3fc71c8cf ("mm: introduce idle page tracking") Signed-off-by: Colin Ian King Reviewed-by: Andrew Morton Acked-by: Vladimir Davydov Cc: Michal Hocko Cc: Mike Rapoport Cc: Mel Gorman Cc: Stephen Rothwell Cc: Andrey Ryabinin Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman --- mm/page_idle.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/page_idle.c b/mm/page_idle.c index 0b39ec0c945c..295512465065 100644 --- a/mm/page_idle.c +++ b/mm/page_idle.c @@ -136,7 +136,7 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj, end_pfn = pfn + count * BITS_PER_BYTE; if (end_pfn > max_pfn) - end_pfn = ALIGN(max_pfn, BITMAP_CHUNK_BITS); + end_pfn = max_pfn; for (; pfn < end_pfn; pfn++) { bit = pfn % BITMAP_CHUNK_BITS; @@ -181,7 +181,7 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj, end_pfn = pfn + count * BITS_PER_BYTE; if (end_pfn > max_pfn) - end_pfn = ALIGN(max_pfn, BITMAP_CHUNK_BITS); + end_pfn = max_pfn; for (; pfn < end_pfn; pfn++) { bit = pfn % BITMAP_CHUNK_BITS; -- cgit v1.2.3 From e3d6fe0b33dfa707a37d0cd750bb2d191c3a64dc Mon Sep 17 00:00:00 2001 From: Huang Ying Date: Fri, 28 Jun 2019 12:07:18 -0700 Subject: mm, swap: fix THP swap out commit 1a5f439c7c02837d943e528d46501564d4226757 upstream. 0-Day test system reported some OOM regressions for several THP (Transparent Huge Page) swap test cases. These regressions are bisected to 6861428921b5 ("block: always define BIO_MAX_PAGES as 256"). In the commit, BIO_MAX_PAGES is set to 256 even when THP swap is enabled. So the bio_alloc(gfp_flags, 512) in get_swap_bio() may fail when swapping out THP. That causes the OOM. As in the patch description of 6861428921b5 ("block: always define BIO_MAX_PAGES as 256"), THP swap should use multi-page bvec to write THP to swap space. So the issue is fixed via doing that in get_swap_bio(). BTW: I remember I have checked the THP swap code when 6861428921b5 ("block: always define BIO_MAX_PAGES as 256") was merged, and thought the THP swap code needn't to be changed. But apparently, I was wrong. I should have done this at that time. Link: http://lkml.kernel.org/r/20190624075515.31040-1-ying.huang@intel.com Fixes: 6861428921b5 ("block: always define BIO_MAX_PAGES as 256") Signed-off-by: "Huang, Ying" Reviewed-by: Ming Lei Cc: Michal Hocko Cc: Johannes Weiner Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Daniel Jordan Cc: Jens Axboe Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman --- mm/page_io.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/mm/page_io.c b/mm/page_io.c index 2e8019d0e048..189415852077 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -29,10 +29,9 @@ static struct bio *get_swap_bio(gfp_t gfp_flags, struct page *page, bio_end_io_t end_io) { - int i, nr = hpage_nr_pages(page); struct bio *bio; - bio = bio_alloc(gfp_flags, nr); + bio = bio_alloc(gfp_flags, 1); if (bio) { struct block_device *bdev; @@ -41,9 +40,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags, bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9; bio->bi_end_io = end_io; - for (i = 0; i < nr; i++) - bio_add_page(bio, page + i, PAGE_SIZE, 0); - VM_BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE * nr); + bio_add_page(bio, page, PAGE_SIZE * hpage_nr_pages(page), 0); } return bio; } -- cgit v1.2.3 From 6e17b11ffedd6e935c507c39b92e9c8ddb88c3c9 Mon Sep 17 00:00:00 2001 From: Gen Zhang Date: Wed, 29 May 2019 09:33:20 +0800 Subject: dm init: fix incorrect uses of kstrndup() commit dec7e6494e1aea6bf676223da3429cd17ce0af79 upstream. Fix 2 kstrndup() calls with incorrect argument order. Fixes: 6bbc923dfcf5 ("dm: add support to directly boot to a mapped device") Cc: stable@vger.kernel.org # v5.1 Signed-off-by: Gen Zhang Signed-off-by: Mike Snitzer Signed-off-by: Greg Kroah-Hartman --- drivers/md/dm-init.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/md/dm-init.c b/drivers/md/dm-init.c index 352e803f566e..64611633e77c 100644 --- a/drivers/md/dm-init.c +++ b/drivers/md/dm-init.c @@ -140,8 +140,8 @@ static char __init *dm_parse_table_entry(struct dm_device *dev, char *str) return ERR_PTR(-EINVAL); } /* target_args */ - dev->target_args_array[n] = kstrndup(field[3], GFP_KERNEL, - DM_MAX_STR_SIZE); + dev->target_args_array[n] = kstrndup(field[3], DM_MAX_STR_SIZE, + GFP_KERNEL); if (!dev->target_args_array[n]) return ERR_PTR(-ENOMEM); @@ -275,7 +275,7 @@ static int __init dm_init_init(void) DMERR("Argument is too big. Limit is %d\n", DM_MAX_STR_SIZE); return -EINVAL; } - str = kstrndup(create, GFP_KERNEL, DM_MAX_STR_SIZE); + str = kstrndup(create, DM_MAX_STR_SIZE, GFP_KERNEL); if (!str) return -ENOMEM; -- cgit v1.2.3 From 25df4ce382c963f0c28fefd0bcbd49bd44ce9fdc Mon Sep 17 00:00:00 2001 From: "zhangyi (F)" Date: Wed, 5 Jun 2019 21:27:08 +0800 Subject: dm log writes: make sure super sector log updates are written in order commit 211ad4b733037f66f9be0a79eade3da7ab11cbb8 upstream. Currently, although we submit super bios in order (and super.nr_entries is incremented by each logged entry), submit_bio() is async so each super sector may not be written to log device in order and then the final nr_entries may be smaller than it should be. This problem can be reproduced by the xfstests generic/455 with ext4: QA output created by 455 -Silence is golden +mark 'end' does not exist Fix this by serializing submission of super sectors to make sure each is written to the log disk in order. Fixes: 0e9cebe724597 ("dm: add log writes target") Cc: stable@vger.kernel.org Signed-off-by: zhangyi (F) Suggested-by: Josef Bacik Reviewed-by: Josef Bacik Signed-off-by: Mike Snitzer Signed-off-by: Greg Kroah-Hartman --- drivers/md/dm-log-writes.c | 23 +++++++++++++++++++++-- 1 file changed, 21 insertions(+), 2 deletions(-) diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c index 9ea2b0291f20..e549392e0ea5 100644 --- a/drivers/md/dm-log-writes.c +++ b/drivers/md/dm-log-writes.c @@ -60,6 +60,7 @@ #define WRITE_LOG_VERSION 1ULL #define WRITE_LOG_MAGIC 0x6a736677736872ULL +#define WRITE_LOG_SUPER_SECTOR 0 /* * The disk format for this is braindead simple. @@ -115,6 +116,7 @@ struct log_writes_c { struct list_head logging_blocks; wait_queue_head_t wait; struct task_struct *log_kthread; + struct completion super_done; }; struct pending_block { @@ -180,6 +182,14 @@ static void log_end_io(struct bio *bio) bio_put(bio); } +static void log_end_super(struct bio *bio) +{ + struct log_writes_c *lc = bio->bi_private; + + complete(&lc->super_done); + log_end_io(bio); +} + /* * Meant to be called if there is an error, it will free all the pages * associated with the block. @@ -215,7 +225,8 @@ static int write_metadata(struct log_writes_c *lc, void *entry, bio->bi_iter.bi_size = 0; bio->bi_iter.bi_sector = sector; bio_set_dev(bio, lc->logdev->bdev); - bio->bi_end_io = log_end_io; + bio->bi_end_io = (sector == WRITE_LOG_SUPER_SECTOR) ? + log_end_super : log_end_io; bio->bi_private = lc; bio_set_op_attrs(bio, REQ_OP_WRITE, 0); @@ -418,11 +429,18 @@ static int log_super(struct log_writes_c *lc) super.nr_entries = cpu_to_le64(lc->logged_entries); super.sectorsize = cpu_to_le32(lc->sectorsize); - if (write_metadata(lc, &super, sizeof(super), NULL, 0, 0)) { + if (write_metadata(lc, &super, sizeof(super), NULL, 0, + WRITE_LOG_SUPER_SECTOR)) { DMERR("Couldn't write super"); return -1; } + /* + * Super sector should be writen in-order, otherwise the + * nr_entries could be rewritten incorrectly by an old bio. + */ + wait_for_completion_io(&lc->super_done); + return 0; } @@ -531,6 +549,7 @@ static int log_writes_ctr(struct dm_target *ti, unsigned int argc, char **argv) INIT_LIST_HEAD(&lc->unflushed_blocks); INIT_LIST_HEAD(&lc->logging_blocks); init_waitqueue_head(&lc->wait); + init_completion(&lc->super_done); atomic_set(&lc->io_blocks, 0); atomic_set(&lc->pending_blocks, 0); -- cgit v1.2.3 From ddae0798dd1183183694977c2632e686ed317b06 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Fri, 21 Jun 2019 10:20:18 -0600 Subject: io_uring: ensure req->file is cleared on allocation commit 60c112b0ada09826cc4ae6a4e55df677f76f1313 upstream. Stephen reports: I hit the following General Protection Fault when testing io_uring via the io_uring engine in fio. This was on a VM running 5.2-rc5 and the latest version of fio. The issue occurs for both null_blk and fake NVMe drives. I have not tested bare metal or real NVMe SSDs. The fio script used is given below. [io_uring] time_based=1 runtime=60 filename=/dev/nvme2n1 (note /dev/nullb0 also fails) ioengine=io_uring bs=4k rw=readwrite direct=1 fixedbufs=1 sqthread_poll=1 sqthread_poll_cpu=0 general protection fault: 0000 [#1] SMP PTI CPU: 0 PID: 872 Comm: io_uring-sq Not tainted 5.2.0-rc5-cpacket-io-uring #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 RIP: 0010:fput_many+0x7/0x90 Code: 01 48 85 ff 74 17 55 48 89 e5 53 48 8b 1f e8 a0 f9 ff ff 48 85 db 48 89 df 75 f0 5b 5d f3 c3 0f 1f 40 00 0f 1f 44 00 00 89 f6 48 29 77 38 74 01 c3 55 48 89 e5 53 48 89 fb 65 48 \ RSP: 0018:ffffadeb817ebc50 EFLAGS: 00010246 RAX: 0000000000000004 RBX: ffff8f46ad477480 RCX: 0000000000001805 RDX: 0000000000000000 RSI: 0000000000000001 RDI: f18b51b9a39552b5 RBP: ffffadeb817ebc58 R08: ffff8f46b7a318c0 R09: 000000000000015d R10: ffffadeb817ebce8 R11: 0000000000000020 R12: ffff8f46ad4cd000 R13: 00000000fffffff7 R14: ffffadeb817ebe30 R15: 0000000000000004 FS: 0000000000000000(0000) GS:ffff8f46b7a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055828f0bbbf0 CR3: 0000000232176004 CR4: 00000000003606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? fput+0x13/0x20 io_free_req+0x20/0x40 io_put_req+0x1b/0x20 io_submit_sqe+0x40a/0x680 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 io_submit_sqes+0xb9/0x160 ? io_submit_sqes+0xb9/0x160 ? __switch_to_asm+0x40/0x70 ? __switch_to_asm+0x34/0x70 ? __schedule+0x3f2/0x6a0 ? __switch_to_asm+0x34/0x70 io_sq_thread+0x1af/0x470 ? __switch_to_asm+0x34/0x70 ? wait_woken+0x80/0x80 ? __switch_to+0x85/0x410 ? __switch_to_asm+0x40/0x70 ? __switch_to_asm+0x34/0x70 ? __schedule+0x3f2/0x6a0 kthread+0x105/0x140 ? io_submit_sqes+0x160/0x160 ? kthread+0x105/0x140 ? io_submit_sqes+0x160/0x160 ? kthread_destroy_worker+0x50/0x50 ret_from_fork+0x35/0x40 which occurs because using a kernel side submission thread isn't valid without using fixed files (registered through io_uring_register()). This causes io_uring to put the request after logging an error, but before the file field is set in the request. If it happens to be non-zero, we attempt to fput() garbage. Fix this by ensuring that req->file is initialized when the request is allocated. Cc: stable@vger.kernel.org # 5.1+ Reported-by: Stephen Bates Tested-by: Stephen Bates Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman --- fs/io_uring.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index e82adbf8adc1..b897695c91c0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -533,6 +533,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, state->cur_req++; } + req->file = NULL; req->ctx = ctx; req->flags = 0; /* one is dropped after submission, the other at completion */ @@ -1684,10 +1685,8 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s, flags = READ_ONCE(s->sqe->flags); fd = READ_ONCE(s->sqe->fd); - if (!io_op_needs_file(s->sqe)) { - req->file = NULL; + if (!io_op_needs_file(s->sqe)) return 0; - } if (flags & IOSQE_FIXED_FILE) { if (unlikely(!ctx->user_files || -- cgit v1.2.3 From e5fb2093f9e86abe7c742465a8e9908951c463d2 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Wed, 19 Jun 2019 09:05:41 +0200 Subject: scsi: vmw_pscsi: Fix use-after-free in pvscsi_queue_lck() commit 240b4cc8fd5db138b675297d4226ec46594d9b3b upstream. Once we unlock adapter->hw_lock in pvscsi_queue_lck() nothing prevents just queued scsi_cmnd from completing and freeing the request. Thus cmd->cmnd[0] dereference can dereference already freed request leading to kernel crashes or other issues (which one of our customers observed). Store cmd->cmnd[0] in a local variable before unlocking adapter->hw_lock to fix the issue. CC: Signed-off-by: Jan Kara Reviewed-by: Ewan D. Milne Signed-off-by: Martin K. Petersen Signed-off-by: Greg Kroah-Hartman --- drivers/scsi/vmw_pvscsi.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/vmw_pvscsi.c b/drivers/scsi/vmw_pvscsi.c index ecee4b3ff073..377b07b2feeb 100644 --- a/drivers/scsi/vmw_pvscsi.c +++ b/drivers/scsi/vmw_pvscsi.c @@ -763,6 +763,7 @@ static int pvscsi_queue_lck(struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd struct pvscsi_adapter *adapter = shost_priv(host); struct pvscsi_ctx *ctx; unsigned long flags; + unsigned char op; spin_lock_irqsave(&adapter->hw_lock, flags); @@ -775,13 +776,14 @@ static int pvscsi_queue_lck(struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd } cmd->scsi_done = done; + op = cmd->cmnd[0]; dev_dbg(&cmd->device->sdev_gendev, - "queued cmd %p, ctx %p, op=%x\n", cmd, ctx, cmd->cmnd[0]); + "queued cmd %p, ctx %p, op=%x\n", cmd, ctx, op); spin_unlock_irqrestore(&adapter->hw_lock, flags); - pvscsi_kick_io(adapter, cmd->cmnd[0]); + pvscsi_kick_io(adapter, op); return 0; } -- cgit v1.2.3 From 6aec2bbd7839c9d45609e78198dac3ddc9b91750 Mon Sep 17 00:00:00 2001 From: Alejandro Jimenez Date: Mon, 10 Jun 2019 13:20:10 -0400 Subject: x86/speculation: Allow guests to use SSBD even if host does not commit c1f7fec1eb6a2c86d01bc22afce772c743451d88 upstream. The bits set in x86_spec_ctrl_mask are used to calculate the guest's value of SPEC_CTRL that is written to the MSR before VMENTRY, and control which mitigations the guest can enable. In the case of SSBD, unless the host has enabled SSBD always on mode (by passing "spec_store_bypass_disable=on" in the kernel parameters), the SSBD bit is not set in the mask and the guest can not properly enable the SSBD always on mitigation mode. This has been confirmed by running the SSBD PoC on a guest using the SSBD always on mitigation mode (booted with kernel parameter "spec_store_bypass_disable=on"), and verifying that the guest is vulnerable unless the host is also using SSBD always on mode. In addition, the guest OS incorrectly reports the SSB vulnerability as mitigated. Always set the SSBD bit in x86_spec_ctrl_mask when the host CPU supports it, allowing the guest to use SSBD whether or not the host has chosen to enable the mitigation in any of its modes. Fixes: be6fcb5478e9 ("x86/bugs: Rework spec_ctrl base and mask logic") Signed-off-by: Alejandro Jimenez Signed-off-by: Thomas Gleixner Reviewed-by: Liam Merwick Reviewed-by: Mark Kanda Reviewed-by: Paolo Bonzini Cc: bp@alien8.de Cc: rkrcmar@redhat.com Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1560187210-11054-1-git-send-email-alejandro.j.jimenez@oracle.com Signed-off-by: Greg Kroah-Hartman --- arch/x86/kernel/cpu/bugs.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 03b4cc0ec3a7..66ca906aa790 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -835,6 +835,16 @@ static enum ssb_mitigation __init __ssb_select_mitigation(void) break; } + /* + * If SSBD is controlled by the SPEC_CTRL MSR, then set the proper + * bit in the mask to allow guests to use the mitigation even in the + * case where the host does not enable it. + */ + if (static_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD) || + static_cpu_has(X86_FEATURE_AMD_SSBD)) { + x86_spec_ctrl_mask |= SPEC_CTRL_SSBD; + } + /* * We have three CPU feature flags that are in play here: * - X86_BUG_SPEC_STORE_BYPASS - CPU is susceptible. @@ -852,7 +862,6 @@ static enum ssb_mitigation __init __ssb_select_mitigation(void) x86_amd_ssb_disable(); } else { x86_spec_ctrl_base |= SPEC_CTRL_SSBD; - x86_spec_ctrl_mask |= SPEC_CTRL_SSBD; wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base); } } -- cgit v1.2.3 From 3c762ccd9d57fbe34d4439cc167ea59803b3fb05 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Tue, 18 Jun 2019 22:31:40 +0200 Subject: x86/microcode: Fix the microcode load on CPU hotplug for real commit 5423f5ce5ca410b3646f355279e4e937d452e622 upstream. A recent change moved the microcode loader hotplug callback into the early startup phase which is running with interrupts disabled. It missed that the callbacks invoke sysfs functions which might sleep causing nice 'might sleep' splats with proper debugging enabled. Split the callbacks and only load the microcode in the early startup phase and move the sysfs handling back into the later threaded and preemptible bringup phase where it was before. Fixes: 78f4e932f776 ("x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback") Signed-off-by: Thomas Gleixner Signed-off-by: Borislav Petkov Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: stable@vger.kernel.org Cc: x86-ml Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1906182228350.1766@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman --- arch/x86/kernel/cpu/microcode/core.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c index f53658dde639..dfc36cd56676 100644 --- a/arch/x86/kernel/cpu/microcode/core.c +++ b/arch/x86/kernel/cpu/microcode/core.c @@ -793,13 +793,16 @@ static struct syscore_ops mc_syscore_ops = { .resume = mc_bp_resume, }; -static int mc_cpu_online(unsigned int cpu) +static int mc_cpu_starting(unsigned int cpu) { - struct device *dev; - - dev = get_cpu_device(cpu); microcode_update_cpu(cpu); pr_debug("CPU%d added\n", cpu); + return 0; +} + +static int mc_cpu_online(unsigned int cpu) +{ + struct device *dev = get_cpu_device(cpu); if (sysfs_create_group(&dev->kobj, &mc_attr_group)) pr_err("Failed to create group for CPU%d\n", cpu); @@ -876,7 +879,9 @@ int __init microcode_init(void) goto out_ucode_group; register_syscore_ops(&mc_syscore_ops); - cpuhp_setup_state_nocalls(CPUHP_AP_MICROCODE_LOADER, "x86/microcode:online", + cpuhp_setup_state_nocalls(CPUHP_AP_MICROCODE_LOADER, "x86/microcode:starting", + mc_cpu_starting, NULL); + cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "x86/microcode:online", mc_cpu_online, mc_cpu_down_prep); pr_info("Microcode Update Driver: v%s.", DRIVER_VERSION); -- cgit v1.2.3 From 9b901ec94de5181f936858dfb2c7f28ebe70e5a7 Mon Sep 17 00:00:00 2001 From: Reinette Chatre Date: Wed, 19 Jun 2019 13:27:16 -0700 Subject: x86/resctrl: Prevent possible overrun during bitmap operations commit 32f010deab575199df4ebe7b6aec20c17bb7eccd upstream. While the DOC at the beginning of lib/bitmap.c explicitly states that "The number of valid bits in a given bitmap does _not_ need to be an exact multiple of BITS_PER_LONG.", some of the bitmap operations do indeed access BITS_PER_LONG portions of the provided bitmap no matter the size of the provided bitmap. For example, if find_first_bit() is provided with an 8 bit bitmap the operation will access BITS_PER_LONG bits from the provided bitmap. While the operation ensures that these extra bits do not affect the result, the memory is still accessed. The capacity bitmasks (CBMs) are typically stored in u32 since they can never exceed 32 bits. A few instances exist where a bitmap_* operation is performed on a CBM by simply pointing the bitmap operation to the stored u32 value. The consequence of this pattern is that some bitmap_* operations will access out-of-bounds memory when interacting with the provided CBM. This same issue has previously been addressed with commit 49e00eee0061 ("x86/intel_rdt: Fix out-of-bounds memory access in CBM tests") but at that time not all instances of the issue were fixed. Fix this by using an unsigned long to store the capacity bitmask data that is passed to bitmap functions. Fixes: e651901187ab ("x86/intel_rdt: Introduce "bit_usage" to display cache allocations details") Fixes: f4e80d67a527 ("x86/intel_rdt: Resctrl files reflect pseudo-locked information") Fixes: 95f0b77efa57 ("x86/intel_rdt: Initialize new resource group with sane defaults") Signed-off-by: Reinette Chatre Signed-off-by: Borislav Petkov Cc: Fenghua Yu Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: stable Cc: Thomas Gleixner Cc: Tony Luck Cc: x86-ml Link: https://lkml.kernel.org/r/58c9b6081fd9bf599af0dfc01a6fdd335768efef.1560975645.git.reinette.chatre@intel.com Signed-off-by: Greg Kroah-Hartman --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 35 ++++++++++++++++------------------ 1 file changed, 16 insertions(+), 19 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index c51b56e29948..f70a617b31b0 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -804,8 +804,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of, struct seq_file *seq, void *v) { struct rdt_resource *r = of->kn->parent->priv; - u32 sw_shareable = 0, hw_shareable = 0; - u32 exclusive = 0, pseudo_locked = 0; + /* + * Use unsigned long even though only 32 bits are used to ensure + * test_bit() is used safely. + */ + unsigned long sw_shareable = 0, hw_shareable = 0; + unsigned long exclusive = 0, pseudo_locked = 0; struct rdt_domain *dom; int i, hwb, swb, excl, psl; enum rdtgrp_mode mode; @@ -850,10 +854,10 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of, } for (i = r->cache.cbm_len - 1; i >= 0; i--) { pseudo_locked = dom->plr ? dom->plr->cbm : 0; - hwb = test_bit(i, (unsigned long *)&hw_shareable); - swb = test_bit(i, (unsigned long *)&sw_shareable); - excl = test_bit(i, (unsigned long *)&exclusive); - psl = test_bit(i, (unsigned long *)&pseudo_locked); + hwb = test_bit(i, &hw_shareable); + swb = test_bit(i, &sw_shareable); + excl = test_bit(i, &exclusive); + psl = test_bit(i, &pseudo_locked); if (hwb && swb) seq_putc(seq, 'X'); else if (hwb && !swb) @@ -2494,26 +2498,19 @@ out_destroy: */ static void cbm_ensure_valid(u32 *_val, struct rdt_resource *r) { - /* - * Convert the u32 _val to an unsigned long required by all the bit - * operations within this function. No more than 32 bits of this - * converted value can be accessed because all bit operations are - * additionally provided with cbm_len that is initialized during - * hardware enumeration using five bits from the EAX register and - * thus never can exceed 32 bits. - */ - unsigned long *val = (unsigned long *)_val; + unsigned long val = *_val; unsigned int cbm_len = r->cache.cbm_len; unsigned long first_bit, zero_bit; - if (*val == 0) + if (val == 0) return; - first_bit = find_first_bit(val, cbm_len); - zero_bit = find_next_zero_bit(val, cbm_len, first_bit); + first_bit = find_first_bit(&val, cbm_len); + zero_bit = find_next_zero_bit(&val, cbm_len, first_bit); /* Clear any remaining bits to ensure contiguous region */ - bitmap_clear(val, zero_bit, cbm_len - zero_bit); + bitmap_clear(&val, zero_bit, cbm_len - zero_bit); + *_val = (u32)val; } /** -- cgit v1.2.3 From 994f9a520c1ba3a5d8d585888da4e27ff9abfeb0 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Fri, 24 May 2019 10:12:46 -0400 Subject: mm: fix page cache convergence regression commit 7b785645e8f13e17cbce492708cf6e7039d32e46 upstream. Since a28334862993 ("page cache: Finish XArray conversion"), on most major Linux distributions, the page cache doesn't correctly transition when the hot data set is changing, and leaves the new pages thrashing indefinitely instead of kicking out the cold ones. On a freshly booted, freshly ssh'd into virtual machine with 1G RAM running stock Arch Linux: [root@ham ~]# ./reclaimtest.sh + dd of=workingset-a bs=1M count=0 seek=600 + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + ./mincore workingset-a 153600/153600 workingset-a + dd of=workingset-b bs=1M count=0 seek=600 + cat workingset-b + cat workingset-b + cat workingset-b + cat workingset-b + ./mincore workingset-a workingset-b 104029/153600 workingset-a 120086/153600 workingset-b + cat workingset-b + cat workingset-b + cat workingset-b + cat workingset-b + ./mincore workingset-a workingset-b 104029/153600 workingset-a 120268/153600 workingset-b workingset-b is a 600M file on a 1G host that is otherwise entirely idle. No matter how often it's being accessed, it won't get cached. While investigating, I noticed that the non-resident information gets aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is a problem because a workingset transition like this relies on the non-resident information tracked in the page cache tree of evicted file ranges: when the cache faults are refaults of recently evicted cache, we challenge the existing active set, and that allows a new workingset to establish itself. Tracing the shrinker that maintains this memory revealed that all page cache tree nodes were allocated to the root cgroup. This is a problem, because 1) the shrinker sizes the amount of non-resident information it keeps to the size of the cgroup's other memory and 2) on most major Linux distributions, only kernel threads live in the root cgroup and everything else gets put into services or session groups: [root@ham ~]# cat /proc/self/cgroup 0::/user.slice/user-0.slice/session-c1.scope As a result, we basically maintain no non-resident information for the workloads running on the system, thus breaking the caching algorithm. Looking through the code, I found the culprit in the above-mentioned patch: when switching from the radix tree to xarray, it dropped the __GFP_ACCOUNT flag from the tree node allocations - the flag that makes sure the allocated memory gets charged to and tracked by the cgroup of the calling process - in this case, the one doing the fault. To fix this, allow xarray users to specify per-tree flag that makes xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache tree annotation to request such cgroup tracking for the cache nodes. With this patch applied, the page cache correctly converges on new workingsets again after just a few iterations: [root@ham ~]# ./reclaimtest.sh + dd of=workingset-a bs=1M count=0 seek=600 + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + ./mincore workingset-a 153600/153600 workingset-a + dd of=workingset-b bs=1M count=0 seek=600 + cat workingset-b + ./mincore workingset-a workingset-b 124607/153600 workingset-a 87876/153600 workingset-b + cat workingset-b + ./mincore workingset-a workingset-b 81313/153600 workingset-a 133321/153600 workingset-b + cat workingset-b + ./mincore workingset-a workingset-b 63036/153600 workingset-a 153600/153600 workingset-b Cc: stable@vger.kernel.org # 4.20+ Signed-off-by: Johannes Weiner Reviewed-by: Shakeel Butt Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Greg Kroah-Hartman --- fs/inode.c | 2 +- include/linux/xarray.h | 1 + lib/xarray.c | 12 ++++++++++-- 3 files changed, 12 insertions(+), 3 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 9a453f3637f8..5bd1dd2e942f 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -349,7 +349,7 @@ EXPORT_SYMBOL(inc_nlink); static void __address_space_init_once(struct address_space *mapping) { - xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ); + xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT); init_rwsem(&mapping->i_mmap_rwsem); INIT_LIST_HEAD(&mapping->private_list); spin_lock_init(&mapping->private_lock); diff --git a/include/linux/xarray.h b/include/linux/xarray.h index 0e01e6129145..5921599b6dc4 100644 --- a/include/linux/xarray.h +++ b/include/linux/xarray.h @@ -265,6 +265,7 @@ enum xa_lock_type { #define XA_FLAGS_TRACK_FREE ((__force gfp_t)4U) #define XA_FLAGS_ZERO_BUSY ((__force gfp_t)8U) #define XA_FLAGS_ALLOC_WRAPPED ((__force gfp_t)16U) +#define XA_FLAGS_ACCOUNT ((__force gfp_t)32U) #define XA_FLAGS_MARK(mark) ((__force gfp_t)((1U << __GFP_BITS_SHIFT) << \ (__force unsigned)(mark))) diff --git a/lib/xarray.c b/lib/xarray.c index 6be3acbb861f..446b956c9188 100644 --- a/lib/xarray.c +++ b/lib/xarray.c @@ -298,6 +298,8 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp) xas_destroy(xas); return false; } + if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT) + gfp |= __GFP_ACCOUNT; xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp); if (!xas->xa_alloc) return false; @@ -325,6 +327,8 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp) xas_destroy(xas); return false; } + if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT) + gfp |= __GFP_ACCOUNT; if (gfpflags_allow_blocking(gfp)) { xas_unlock_type(xas, lock_type); xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp); @@ -358,8 +362,12 @@ static void *xas_alloc(struct xa_state *xas, unsigned int shift) if (node) { xas->xa_alloc = NULL; } else { - node = kmem_cache_alloc(radix_tree_node_cachep, - GFP_NOWAIT | __GFP_NOWARN); + gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN; + + if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT) + gfp |= __GFP_ACCOUNT; + + node = kmem_cache_alloc(radix_tree_node_cachep, gfp); if (!node) { xas_set_err(xas, -ENOMEM); return NULL; -- cgit v1.2.3 From b5961ecad7121fec138ebee7d9eba4863f9dc6e7 Mon Sep 17 00:00:00 2001 From: Ard Biesheuvel Date: Sun, 9 Jun 2019 20:17:44 +0200 Subject: efi/memreserve: deal with memreserve entries in unmapped memory commit 18df7577adae6c6c778bf774b3aebcacbc1fb439 upstream. Ensure that the EFI memreserve entries can be accessed, even if they are located in memory that the kernel (e.g., a crashkernel) omits from the linear map. Fixes: 80424b02d42b ("efi: Reduce the amount of memblock reservations ...") Cc: # 5.0+ Reported-by: Jonathan Richardson Reviewed-by: Jonathan Richardson Tested-by: Jonathan Richardson Signed-off-by: Ard Biesheuvel Signed-off-by: Greg Kroah-Hartman --- drivers/firmware/efi/efi.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index 55b77c576c42..6d9995cbf651 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -1007,14 +1007,16 @@ int __ref efi_mem_reserve_persistent(phys_addr_t addr, u64 size) /* first try to find a slot in an existing linked list entry */ for (prsv = efi_memreserve_root->next; prsv; prsv = rsv->next) { - rsv = __va(prsv); + rsv = memremap(prsv, sizeof(*rsv), MEMREMAP_WB); index = atomic_fetch_add_unless(&rsv->count, 1, rsv->size); if (index < rsv->size) { rsv->entry[index].base = addr; rsv->entry[index].size = size; + memunmap(rsv); return 0; } + memunmap(rsv); } /* no slot found - allocate a new linked list entry */ @@ -1022,7 +1024,13 @@ int __ref efi_mem_reserve_persistent(phys_addr_t addr, u64 size) if (!rsv) return -ENOMEM; - rsv->size = EFI_MEMRESERVE_COUNT(PAGE_SIZE); + /* + * The memremap() call above assumes that a linux_efi_memreserve entry + * never crosses a page boundary, so let's ensure that this remains true + * even when kexec'ing a 4k pages kernel from a >4k pages kernel, by + * using SZ_4K explicitly in the size calculation below. + */ + rsv->size = EFI_MEMRESERVE_COUNT(SZ_4K); atomic_set(&rsv->count, 1); rsv->entry[0].base = addr; rsv->entry[0].size = size; -- cgit v1.2.3 From 82d0f7b68d939aa67b6adcb449a7753cbeee36b8 Mon Sep 17 00:00:00 2001 From: Trond Myklebust Date: Tue, 25 Jun 2019 16:41:16 -0400 Subject: NFS/flexfiles: Use the correct TCP timeout for flexfiles I/O commit 68f461593f76bd5f17e87cdd0bea28f4278c7268 upstream. Fix a typo where we're confusing the default TCP retrans value (NFS_DEF_TCP_RETRANS) for the default TCP timeout value. Fixes: 15d03055cf39f ("pNFS/flexfiles: Set reasonable default ...") Cc: stable@vger.kernel.org # 4.8+ Signed-off-by: Trond Myklebust Signed-off-by: Anna Schumaker Signed-off-by: Greg Kroah-Hartman --- fs/nfs/flexfilelayout/flexfilelayoutdev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/nfs/flexfilelayout/flexfilelayoutdev.c b/fs/nfs/flexfilelayout/flexfilelayoutdev.c index a809989807d6..19f856f45689 100644 --- a/fs/nfs/flexfilelayout/flexfilelayoutdev.c +++ b/fs/nfs/flexfilelayout/flexfilelayoutdev.c @@ -18,7 +18,7 @@ #define NFSDBG_FACILITY NFSDBG_PNFS_LD -static unsigned int dataserver_timeo = NFS_DEF_TCP_RETRANS; +static unsigned int dataserver_timeo = NFS_DEF_TCP_TIMEO; static unsigned int dataserver_retrans; static bool ff_layout_has_available_ds(struct pnfs_layout_segment *lseg); -- cgit v1.2.3 From b187fae6ee297f3bba913cfc8b6c2181bccae3dd Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Thu, 16 May 2019 09:09:35 +0200 Subject: cpu/speculation: Warn on unsupported mitigations= parameter commit 1bf72720281770162c87990697eae1ba2f1d917a upstream. Currently, if the user specifies an unsupported mitigation strategy on the kernel command line, it will be ignored silently. The code will fall back to the default strategy, possibly leaving the system more vulnerable than expected. This may happen due to e.g. a simple typo, or, for a stable kernel release, because not all mitigation strategies have been backported. Inform the user by printing a message. Fixes: 98af8452945c5565 ("cpu/speculation: Add 'mitigations=' cmdline option") Signed-off-by: Geert Uytterhoeven Signed-off-by: Thomas Gleixner Acked-by: Josh Poimboeuf Cc: Peter Zijlstra Cc: Jiri Kosina Cc: Greg Kroah-Hartman Cc: Ben Hutchings Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190516070935.22546-1-geert@linux-m68k.org Signed-off-by: Greg Kroah-Hartman --- kernel/cpu.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/cpu.c b/kernel/cpu.c index 9cc8b6fdb2dc..6170034f4118 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -2315,6 +2315,9 @@ static int __init mitigations_parse_cmdline(char *arg) cpu_mitigations = CPU_MITIGATIONS_AUTO; else if (!strcmp(arg, "auto,nosmt")) cpu_mitigations = CPU_MITIGATIONS_AUTO_NOSMT; + else + pr_crit("Unsupported mitigations=%s, system may still be vulnerable\n", + arg); return 0; } -- cgit v1.2.3 From 5dfe49ca70e10eef208b6d4a673f34d01bff0293 Mon Sep 17 00:00:00 2001 From: Trond Myklebust Date: Mon, 24 Jun 2019 19:15:44 -0400 Subject: SUNRPC: Fix up calculation of client message length commit 7e3d3620974b743b91b1f9d0660061b1de20174c upstream. In the case where a record marker was used, xs_sendpages() needs to return the length of the payload + record marker so that we operate correctly in the case of a partial transmission. When the callers check return value, they therefore need to take into account the record marker length. Fixes: 06b5fc3ad94e ("Merge tag 'nfs-rdma-for-5.1-1'...") Cc: stable@vger.kernel.org # 5.1+ Signed-off-by: Trond Myklebust Signed-off-by: Anna Schumaker Signed-off-by: Greg Kroah-Hartman --- net/sunrpc/xprtsock.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c index 732d4b57411a..a437ee8ae482 100644 --- a/net/sunrpc/xprtsock.c +++ b/net/sunrpc/xprtsock.c @@ -950,6 +950,8 @@ static int xs_local_send_request(struct rpc_rqst *req) struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt); struct xdr_buf *xdr = &req->rq_snd_buf; + rpc_fraghdr rm = xs_stream_record_marker(xdr); + unsigned int msglen = rm ? req->rq_slen + sizeof(rm) : req->rq_slen; int status; int sent = 0; @@ -964,9 +966,7 @@ static int xs_local_send_request(struct rpc_rqst *req) req->rq_xtime = ktime_get(); status = xs_sendpages(transport->sock, NULL, 0, xdr, - transport->xmit.offset, - xs_stream_record_marker(xdr), - &sent); + transport->xmit.offset, rm, &sent); dprintk("RPC: %s(%u) = %d\n", __func__, xdr->len - transport->xmit.offset, status); @@ -976,7 +976,7 @@ static int xs_local_send_request(struct rpc_rqst *req) if (likely(sent > 0) || status == 0) { transport->xmit.offset += sent; req->rq_bytes_sent = transport->xmit.offset; - if (likely(req->rq_bytes_sent >= req->rq_slen)) { + if (likely(req->rq_bytes_sent >= msglen)) { req->rq_xmit_bytes_sent += transport->xmit.offset; transport->xmit.offset = 0; return 0; @@ -1097,6 +1097,8 @@ static int xs_tcp_send_request(struct rpc_rqst *req) struct rpc_xprt *xprt = req->rq_xprt; struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt); struct xdr_buf *xdr = &req->rq_snd_buf; + rpc_fraghdr rm = xs_stream_record_marker(xdr); + unsigned int msglen = rm ? req->rq_slen + sizeof(rm) : req->rq_slen; bool vm_wait = false; int status; int sent; @@ -1122,9 +1124,7 @@ static int xs_tcp_send_request(struct rpc_rqst *req) while (1) { sent = 0; status = xs_sendpages(transport->sock, NULL, 0, xdr, - transport->xmit.offset, - xs_stream_record_marker(xdr), - &sent); + transport->xmit.offset, rm, &sent); dprintk("RPC: xs_tcp_send_request(%u) = %d\n", xdr->len - transport->xmit.offset, status); @@ -1133,7 +1133,7 @@ static int xs_tcp_send_request(struct rpc_rqst *req) * reset the count of bytes sent. */ transport->xmit.offset += sent; req->rq_bytes_sent = transport->xmit.offset; - if (likely(req->rq_bytes_sent >= req->rq_slen)) { + if (likely(req->rq_bytes_sent >= msglen)) { req->rq_xmit_bytes_sent += transport->xmit.offset; transport->xmit.offset = 0; return 0; -- cgit v1.2.3 From 0730644e5602b5573401e091442acbee7e5778a1 Mon Sep 17 00:00:00 2001 From: Paul Burton Date: Wed, 5 Jun 2019 09:34:10 +0100 Subject: irqchip/mips-gic: Use the correct local interrupt map registers commit 6d4d367d0e9ffab4d64a3436256a6a052dc1195d upstream. The MIPS GIC contains a block of registers used to map local interrupts to a particular CPU interrupt pin. Since these registers are found at a consecutive range of addresses we access them using an index, via the (read|write)_gic_v[lo]_map accessor functions. We currently use values from enum mips_gic_local_interrupt as those indices. Unfortunately whilst enum mips_gic_local_interrupt provides the correct offsets for bits in the pending & mask registers, the ordering of the map registers is subtly different... Compared with the ordering of pending & mask bits, the map registers move the FDC from the end of the list to index 3 after the timer interrupt. As a result the performance counter & software interrupts are therefore at indices 4-6 rather than indices 3-5. Notably this causes problems with performance counter interrupts being incorrectly mapped on some systems, and presumably will also cause problems for FDC interrupts. Introduce a function to map from enum mips_gic_local_interrupt to the index of the corresponding map register, and use it to ensure we access the map registers for the correct interrupts. Signed-off-by: Paul Burton Fixes: a0dc5cb5e31b ("irqchip: mips-gic: Simplify gic_local_irq_domain_map()") Fixes: da61fcf9d62a ("irqchip: mips-gic: Use irq_cpu_online to (un)mask all-VP(E) IRQs") Reported-and-tested-by: Archer Yan Cc: Thomas Gleixner Cc: Jason Cooper Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Marc Zyngier Signed-off-by: Greg Kroah-Hartman --- arch/mips/include/asm/mips-gic.h | 30 ++++++++++++++++++++++++++++++ drivers/irqchip/irq-mips-gic.c | 4 ++-- 2 files changed, 32 insertions(+), 2 deletions(-) diff --git a/arch/mips/include/asm/mips-gic.h b/arch/mips/include/asm/mips-gic.h index 558059a8f218..0277b56157af 100644 --- a/arch/mips/include/asm/mips-gic.h +++ b/arch/mips/include/asm/mips-gic.h @@ -314,6 +314,36 @@ static inline bool mips_gic_present(void) return IS_ENABLED(CONFIG_MIPS_GIC) && mips_gic_base; } +/** + * mips_gic_vx_map_reg() - Return GIC_Vx__MAP register offset + * @intr: A GIC local interrupt + * + * Determine the index of the GIC_VL__MAP or GIC_VO__MAP register + * within the block of GIC map registers. This is almost the same as the order + * of interrupts in the pending & mask registers, as used by enum + * mips_gic_local_interrupt, but moves the FDC interrupt & thus offsets the + * interrupts after it... + * + * Return: The map register index corresponding to @intr. + * + * The return value is suitable for use with the (read|write)_gic_v[lo]_map + * accessor functions. + */ +static inline unsigned int +mips_gic_vx_map_reg(enum mips_gic_local_interrupt intr) +{ + /* WD, Compare & Timer are 1:1 */ + if (intr <= GIC_LOCAL_INT_TIMER) + return intr; + + /* FDC moves to after Timer... */ + if (intr == GIC_LOCAL_INT_FDC) + return GIC_LOCAL_INT_TIMER + 1; + + /* As a result everything else is offset by 1 */ + return intr + 1; +} + /** * gic_get_c0_compare_int() - Return cp0 count/compare interrupt virq * diff --git a/drivers/irqchip/irq-mips-gic.c b/drivers/irqchip/irq-mips-gic.c index d32268cc1174..f3985469c221 100644 --- a/drivers/irqchip/irq-mips-gic.c +++ b/drivers/irqchip/irq-mips-gic.c @@ -388,7 +388,7 @@ static void gic_all_vpes_irq_cpu_online(struct irq_data *d) intr = GIC_HWIRQ_TO_LOCAL(d->hwirq); cd = irq_data_get_irq_chip_data(d); - write_gic_vl_map(intr, cd->map); + write_gic_vl_map(mips_gic_vx_map_reg(intr), cd->map); if (cd->mask) write_gic_vl_smask(BIT(intr)); } @@ -517,7 +517,7 @@ static int gic_irq_domain_map(struct irq_domain *d, unsigned int virq, spin_lock_irqsave(&gic_lock, flags); for_each_online_cpu(cpu) { write_gic_vl_other(mips_cm_vp_id(cpu)); - write_gic_vo_map(intr, map); + write_gic_vo_map(mips_gic_vx_map_reg(intr), map); } spin_unlock_irqrestore(&gic_lock, flags); -- cgit v1.2.3 From c14a0de97597f2022e83abb5cbb0144c231a4a33 Mon Sep 17 00:00:00 2001 From: Neil Horman Date: Tue, 25 Jun 2019 17:57:49 -0400 Subject: af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET [ Upstream commit 89ed5b519004a7706f50b70f611edbd3aaacff2c ] When an application is run that: a) Sets its scheduler to be SCHED_FIFO and b) Opens a memory mapped AF_PACKET socket, and sends frames with the MSG_DONTWAIT flag cleared, its possible for the application to hang forever in the kernel. This occurs because when waiting, the code in tpacket_snd calls schedule, which under normal circumstances allows other tasks to run, including ksoftirqd, which in some cases is responsible for freeing the transmitted skb (which in AF_PACKET calls a destructor that flips the status bit of the transmitted frame back to available, allowing the transmitting task to complete). However, when the calling application is SCHED_FIFO, its priority is such that the schedule call immediately places the task back on the cpu, preventing ksoftirqd from freeing the skb, which in turn prevents the transmitting task from detecting that the transmission is complete. We can fix this by converting the schedule call to a completion mechanism. By using a completion queue, we force the calling task, when it detects there are no more frames to send, to schedule itself off the cpu until such time as the last transmitted skb is freed, allowing forward progress to be made. Tested by myself and the reporter, with good results Change Notes: V1->V2: Enhance the sleep logic to support being interruptible and allowing for honoring to SK_SNDTIMEO (Willem de Bruijn) V2->V3: Rearrage the point at which we wait for the completion queue, to avoid needing to check for ph/skb being null at the end of the loop. Also move the complete call to the skb destructor to avoid needing to modify __packet_set_status. Also gate calling complete on packet_read_pending returning zero to avoid multiple calls to complete. (Willem de Bruijn) Move timeo computation within loop, to re-fetch the socket timeout since we also use the timeo variable to record the return code from the wait_for_complete call (Neil Horman) V3->V4: Willem has requested that the control flow be restored to the previous state. Doing so lets us eliminate the need for the po->wait_on_complete flag variable, and lets us get rid of the packet_next_frame function, but introduces another complexity. Specifically, but using the packet pending count, we can, if an applications calls sendmsg multiple times with MSG_DONTWAIT set, each set of transmitted frames, when complete, will cause tpacket_destruct_skb to issue a complete call, for which there will never be a wait_on_completion call. This imbalance will lead to any future call to wait_for_completion here to return early, when the frames they sent may not have completed. To correct this, we need to re-init the completion queue on every call to tpacket_snd before we enter the loop so as to ensure we wait properly for the frames we send in this iteration. Change the timeout and interrupted gotos to out_put rather than out_status so that we don't try to free a non-existant skb Clean up some extra newlines (Willem de Bruijn) Reviewed-by: Willem de Bruijn Signed-off-by: Neil Horman Reported-by: Matteo Croce Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/packet/af_packet.c | 20 +++++++++++++++++--- net/packet/internal.h | 1 + 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 71d5544243d2..3385f4c6eb74 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2409,6 +2409,9 @@ static void tpacket_destruct_skb(struct sk_buff *skb) ts = __packet_set_timestamp(po, ph, skb); __packet_set_status(po, ph, TP_STATUS_AVAILABLE | ts); + + if (!packet_read_pending(&po->tx_ring)) + complete(&po->skb_completion); } sock_wfree(skb); @@ -2593,7 +2596,7 @@ static int tpacket_parse_header(struct packet_sock *po, void *frame, static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) { - struct sk_buff *skb; + struct sk_buff *skb = NULL; struct net_device *dev; struct virtio_net_hdr *vnet_hdr = NULL; struct sockcm_cookie sockc; @@ -2608,6 +2611,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) int len_sum = 0; int status = TP_STATUS_AVAILABLE; int hlen, tlen, copylen = 0; + long timeo = 0; mutex_lock(&po->pg_vec_lock); @@ -2654,12 +2658,21 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) if ((size_max > dev->mtu + reserve + VLAN_HLEN) && !po->has_vnet_hdr) size_max = dev->mtu + reserve + VLAN_HLEN; + reinit_completion(&po->skb_completion); + do { ph = packet_current_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST); if (unlikely(ph == NULL)) { - if (need_wait && need_resched()) - schedule(); + if (need_wait && skb) { + timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); + timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); + if (timeo <= 0) { + err = !timeo ? -ETIMEDOUT : -ERESTARTSYS; + goto out_put; + } + } + /* check for additional frames */ continue; } @@ -3215,6 +3228,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol, sock_init_data(sock, sk); po = pkt_sk(sk); + init_completion(&po->skb_completion); sk->sk_family = PF_PACKET; po->num = proto; po->xmit = dev_queue_xmit; diff --git a/net/packet/internal.h b/net/packet/internal.h index 3bb7c5fb3bff..c70a2794456f 100644 --- a/net/packet/internal.h +++ b/net/packet/internal.h @@ -128,6 +128,7 @@ struct packet_sock { unsigned int tp_hdrlen; unsigned int tp_reserve; unsigned int tp_tstamp; + struct completion skb_completion; struct net_device __rcu *cached_dev; int (*xmit)(struct sk_buff *skb); struct packet_type prot_hook ____cacheline_aligned_in_smp; -- cgit v1.2.3 From bc4fdb7d73ba4b4ecd9e12686ab64a6cdb3f2bb1 Mon Sep 17 00:00:00 2001 From: YueHaibing Date: Wed, 26 Jun 2019 16:08:44 +0800 Subject: bonding: Always enable vlan tx offload [ Upstream commit 30d8177e8ac776d89d387fad547af6a0f599210e ] We build vlan on top of bonding interface, which vlan offload is off, bond mode is 802.3ad (LACP) and xmit_hash_policy is BOND_XMIT_POLICY_ENCAP34. Because vlan tx offload is off, vlan tci is cleared and skb push the vlan header in validate_xmit_vlan() while sending from vlan devices. Then in bond_xmit_hash, __skb_flow_dissect() fails to get information from protocol headers encapsulated within vlan, because 'nhoff' is points to IP header, so bond hashing is based on layer 2 info, which fails to distribute packets across slaves. This patch always enable bonding's vlan tx offload, pass the vlan packets to the slave devices with vlan tci, let them to handle vlan implementation. Fixes: 278339a42a1b ("bonding: propogate vlan_features to bonding master") Suggested-by: Jiri Pirko Signed-off-by: YueHaibing Acked-by: Jiri Pirko Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- drivers/net/bonding/bond_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index f96efa363d34..59e919b92873 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -4321,12 +4321,12 @@ void bond_setup(struct net_device *bond_dev) bond_dev->features |= NETIF_F_NETNS_LOCAL; bond_dev->hw_features = BOND_VLAN_FEATURES | - NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_CTAG_FILTER; bond_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL | NETIF_F_GSO_UDP_L4; bond_dev->features |= bond_dev->hw_features; + bond_dev->features |= NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_STAG_TX; } /* Destroy a bonding device. -- cgit v1.2.3 From c79ab459bea42f66f5c4d161dc41f9d8e40ab4ff Mon Sep 17 00:00:00 2001 From: Stephen Suryaputra Date: Mon, 24 Jun 2019 20:14:06 -0400 Subject: ipv4: Use return value of inet_iif() for __raw_v4_lookup in the while loop [ Upstream commit 38c73529de13e1e10914de7030b659a2f8b01c3b ] In commit 19e4e768064a8 ("ipv4: Fix raw socket lookup for local traffic"), the dif argument to __raw_v4_lookup() is coming from the returned value of inet_iif() but the change was done only for the first lookup. Subsequent lookups in the while loop still use skb->dev->ifIndex. Fixes: 19e4e768064a8 ("ipv4: Fix raw socket lookup for local traffic") Signed-off-by: Stephen Suryaputra Reviewed-by: David Ahern Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/ipv4/raw.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index dc91c27bb788..3f6e95cb21b0 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -201,7 +201,7 @@ static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash) } sk = __raw_v4_lookup(net, sk_next(sk), iph->protocol, iph->saddr, iph->daddr, - skb->dev->ifindex, sdif); + dif, sdif); } out: read_unlock(&raw_v4_hashinfo.lock); -- cgit v1.2.3 From 65b2a8047939229a9e767b82d741bee4f8ac6b53 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Mon, 24 Jun 2019 02:38:20 -0700 Subject: net/packet: fix memory leak in packet_set_ring() [ Upstream commit 55655e3d1197fff16a7a05088fb0e5eba50eac55 ] syzbot found we can leak memory in packet_set_ring(), if user application provides buggy parameters. Fixes: 7f953ab2ba46 ("af_packet: TX_RING support for TPACKET_V3") Signed-off-by: Eric Dumazet Cc: Sowmini Varadhan Reported-by: syzbot Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/packet/af_packet.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 3385f4c6eb74..21814acb862d 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -4341,7 +4341,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, req3->tp_sizeof_priv || req3->tp_feature_req_word) { err = -EINVAL; - goto out; + goto out_free_pg_vec; } } break; @@ -4405,6 +4405,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, prb_shutdown_retire_blk_timer(po, rb_queue); } +out_free_pg_vec: if (pg_vec) free_pg_vec(pg_vec, order, req->tp_block_nr); out: -- cgit v1.2.3 From 505c925823144996daa030e4bbe26660569a0609 Mon Sep 17 00:00:00 2001 From: JingYi Hou Date: Mon, 17 Jun 2019 14:56:05 +0800 Subject: net: remove duplicate fetch in sock_getsockopt [ Upstream commit d0bae4a0e3d8c5690a885204d7eb2341a5b4884d ] In sock_getsockopt(), 'optlen' is fetched the first time from userspace. 'len < 0' is then checked. Then in condition 'SO_MEMINFO', 'optlen' is fetched the second time from userspace. If change it between two fetches may cause security problems or unexpected behaivor, and there is no reason to fetch it a second time. To fix this, we need to remove the second fetch. Signed-off-by: JingYi Hou Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/core/sock.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/net/core/sock.c b/net/core/sock.c index 067878a1e4c5..30afb072eecf 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1482,9 +1482,6 @@ int sock_getsockopt(struct socket *sock, int level, int optname, { u32 meminfo[SK_MEMINFO_VARS]; - if (get_user(len, optlen)) - return -EFAULT; - sk_get_meminfo(sk, meminfo); len = min_t(unsigned int, len, sizeof(meminfo)); -- cgit v1.2.3 From ac086d4c5d0f54c3412472d343268df132b55328 Mon Sep 17 00:00:00 2001 From: Roland Hii Date: Wed, 19 Jun 2019 22:13:48 +0800 Subject: net: stmmac: fixed new system time seconds value calculation [ Upstream commit a1e5388b4d5fc78688e5e9ee6641f779721d6291 ] When ADDSUB bit is set, the system time seconds field is calculated as the complement of the seconds part of the update value. For example, if 3.000000001 seconds need to be subtracted from the system time, this field is calculated as 2^32 - 3 = 4294967296 - 3 = 0x100000000 - 3 = 0xFFFFFFFD Previously, the 0x100000000 is mistakenly written as 100000000. This is further simplified from sec = (0x100000000ULL - sec); to sec = -sec; Fixes: ba1ffd74df74 ("stmmac: fix PTP support for GMAC4") Signed-off-by: Roland Hii Signed-off-by: Ong Boon Leong Signed-off-by: Voon Weifeng Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- drivers/net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c index 8d9cc2157afd..7423262ce590 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c @@ -122,7 +122,7 @@ static int adjust_systime(void __iomem *ioaddr, u32 sec, u32 nsec, * programmed with (2^32 – ) */ if (gmac4) - sec = (100000000ULL - sec); + sec = -sec; value = readl(ioaddr + PTP_TCR); if (value & PTP_TCR_TSCTRLSSR) -- cgit v1.2.3 From a6902fe436d068f8afeb34a0ac5dfdf2d99abd8d Mon Sep 17 00:00:00 2001 From: Roland Hii Date: Wed, 19 Jun 2019 22:41:48 +0800 Subject: net: stmmac: set IC bit when transmitting frames with HW timestamp [ Upstream commit d0bb82fd60183868f46c8ccc595a3d61c3334a18 ] When transmitting certain PTP frames, e.g. SYNC and DELAY_REQ, the PTP daemon, e.g. ptp4l, is polling the driver for the frame transmit hardware timestamp. The polling will most likely timeout if the tx coalesce is enabled due to the Interrupt-on-Completion (IC) bit is not set in tx descriptor for those frames. This patch will ignore the tx coalesce parameter and set the IC bit when transmitting PTP frames which need to report out the frame transmit hardware timestamp to user space. Fixes: f748be531d70 ("net: stmmac: Rework coalesce timer and fix multi-queue races") Signed-off-by: Roland Hii Signed-off-by: Ong Boon Leong Signed-off-by: Voon Weifeng Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index 635d88d82610..a634054dcb11 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -2957,12 +2957,15 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev) /* Manage tx mitigation */ tx_q->tx_count_frames += nfrags + 1; - if (priv->tx_coal_frames <= tx_q->tx_count_frames) { + if (likely(priv->tx_coal_frames > tx_q->tx_count_frames) && + !(priv->synopsys_id >= DWMAC_CORE_4_00 && + (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) && + priv->hwts_tx_en)) { + stmmac_tx_timer_arm(priv, queue); + } else { + tx_q->tx_count_frames = 0; stmmac_set_tx_ic(priv, desc); priv->xstats.tx_set_ic_bit++; - tx_q->tx_count_frames = 0; - } else { - stmmac_tx_timer_arm(priv, queue); } skb_tx_timestamp(skb); @@ -3176,12 +3179,15 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev) * element in case of no SG. */ tx_q->tx_count_frames += nfrags + 1; - if (priv->tx_coal_frames <= tx_q->tx_count_frames) { + if (likely(priv->tx_coal_frames > tx_q->tx_count_frames) && + !(priv->synopsys_id >= DWMAC_CORE_4_00 && + (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) && + priv->hwts_tx_en)) { + stmmac_tx_timer_arm(priv, queue); + } else { + tx_q->tx_count_frames = 0; stmmac_set_tx_ic(priv, desc); priv->xstats.tx_set_ic_bit++; - tx_q->tx_count_frames = 0; - } else { - stmmac_tx_timer_arm(priv, queue); } skb_tx_timestamp(skb); -- cgit v1.2.3 From 0962d139f22a4e889e686b86806767b99eda8086 Mon Sep 17 00:00:00 2001 From: Dirk van der Merwe Date: Sun, 23 Jun 2019 21:26:58 -0700 Subject: net/tls: fix page double free on TX cleanup [ Upstream commit 9354544cbccf68da1b047f8fb7b47630e3c8a59d ] With commit 94850257cf0f ("tls: Fix tls_device handling of partial records") a new path was introduced to cleanup partial records during sk_proto_close. This path does not handle the SW KTLS tx_list cleanup. This is unnecessary though since the free_resources calls for both SW and offload paths will cleanup a partial record. The visible effect is the following warning, but this bug also causes a page double free. WARNING: CPU: 7 PID: 4000 at net/core/stream.c:206 sk_stream_kill_queues+0x103/0x110 RIP: 0010:sk_stream_kill_queues+0x103/0x110 RSP: 0018:ffffb6df87e07bd0 EFLAGS: 00010206 RAX: 0000000000000000 RBX: ffff8c21db4971c0 RCX: 0000000000000007 RDX: ffffffffffffffa0 RSI: 000000000000001d RDI: ffff8c21db497270 RBP: ffff8c21db497270 R08: ffff8c29f4748600 R09: 000000010020001a R10: ffffb6df87e07aa0 R11: ffffffff9a445600 R12: 0000000000000007 R13: 0000000000000000 R14: ffff8c21f03f2900 R15: ffff8c21f03b8df0 Call Trace: inet_csk_destroy_sock+0x55/0x100 tcp_close+0x25d/0x400 ? tcp_check_oom+0x120/0x120 tls_sk_proto_close+0x127/0x1c0 inet_release+0x3c/0x60 __sock_release+0x3d/0xb0 sock_close+0x11/0x20 __fput+0xd8/0x210 task_work_run+0x84/0xa0 do_exit+0x2dc/0xb90 ? release_sock+0x43/0x90 do_group_exit+0x3a/0xa0 get_signal+0x295/0x720 do_signal+0x36/0x610 ? SYSC_recvfrom+0x11d/0x130 exit_to_usermode_loop+0x69/0xb0 do_syscall_64+0x173/0x180 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x7fe9b9abc10d RSP: 002b:00007fe9b19a1d48 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca RAX: fffffffffffffe00 RBX: 0000000000000006 RCX: 00007fe9b9abc10d RDX: 0000000000000002 RSI: 0000000000000080 RDI: 00007fe948003430 RBP: 00007fe948003410 R08: 00007fe948003430 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00005603739d9080 R13: 00007fe9b9ab9f90 R14: 00007fe948003430 R15: 0000000000000000 Fixes: 94850257cf0f ("tls: Fix tls_device handling of partial records") Signed-off-by: Dirk van der Merwe Signed-off-by: Jakub Kicinski Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- include/net/tls.h | 15 --------------- net/tls/tls_main.c | 3 ++- 2 files changed, 2 insertions(+), 16 deletions(-) diff --git a/include/net/tls.h b/include/net/tls.h index 053082d98906..a67ad7d56ff2 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -347,21 +347,6 @@ static inline bool tls_is_partially_sent_record(struct tls_context *ctx) return !!ctx->partially_sent_record; } -static inline int tls_complete_pending_work(struct sock *sk, - struct tls_context *ctx, - int flags, long *timeo) -{ - int rc = 0; - - if (unlikely(sk->sk_write_pending)) - rc = wait_on_pending_writer(sk, timeo); - - if (!rc && tls_is_partially_sent_record(ctx)) - rc = tls_push_partial_record(sk, ctx, flags); - - return rc; -} - static inline bool tls_is_pending_open_record(struct tls_context *tls_ctx) { return tls_ctx->pending_open_record_frags; diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c index 478603f43964..f4f632824247 100644 --- a/net/tls/tls_main.c +++ b/net/tls/tls_main.c @@ -279,7 +279,8 @@ static void tls_sk_proto_close(struct sock *sk, long timeout) goto skip_tx_cleanup; } - if (!tls_complete_pending_work(sk, ctx, 0, &timeo)) + if (unlikely(sk->sk_write_pending) && + !wait_on_pending_writer(sk, &timeo)) tls_handle_open_record(sk, 0); /* We need these for tls_sw_fallback handling of other packets */ -- cgit v1.2.3 From 6c616a135a6d9b8ff2f5aa65b6c0999530228058 Mon Sep 17 00:00:00 2001 From: Xin Long Date: Tue, 25 Jun 2019 00:21:45 +0800 Subject: sctp: change to hold sk after auth shkey is created successfully [ Upstream commit 25bff6d5478b2a02368097015b7d8eb727c87e16 ] Now in sctp_endpoint_init(), it holds the sk then creates auth shkey. But when the creation fails, it doesn't release the sk, which causes a sk defcnf leak, Here to fix it by only holding the sk when auth shkey is created successfully. Fixes: a29a5bd4f5c3 ("[SCTP]: Implement SCTP-AUTH initializations.") Reported-by: syzbot+afabda3890cc2f765041@syzkaller.appspotmail.com Reported-by: syzbot+276ca1c77a19977c0130@syzkaller.appspotmail.com Signed-off-by: Xin Long Acked-by: Neil Horman Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/sctp/endpointola.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c index 0448b68fce74..bcfc81ee153d 100644 --- a/net/sctp/endpointola.c +++ b/net/sctp/endpointola.c @@ -133,10 +133,6 @@ static struct sctp_endpoint *sctp_endpoint_init(struct sctp_endpoint *ep, /* Initialize the bind addr area */ sctp_bind_addr_init(&ep->base.bind_addr, 0); - /* Remember who we are attached to. */ - ep->base.sk = sk; - sock_hold(ep->base.sk); - /* Create the lists of associations. */ INIT_LIST_HEAD(&ep->asocs); @@ -169,6 +165,10 @@ static struct sctp_endpoint *sctp_endpoint_init(struct sctp_endpoint *ep, ep->prsctp_enable = net->sctp.prsctp_enable; ep->reconf_enable = net->sctp.reconf_enable; + /* Remember who we are attached to. */ + ep->base.sk = sk; + sock_hold(ep->base.sk); + return ep; nomem_shkey: -- cgit v1.2.3 From a061216af44be711890d0153b4305553c98d9528 Mon Sep 17 00:00:00 2001 From: YueHaibing Date: Thu, 27 Jun 2019 00:03:39 +0800 Subject: team: Always enable vlan tx offload [ Upstream commit ee4297420d56a0033a8593e80b33fcc93fda8509 ] We should rather have vlan_tci filled all the way down to the transmitting netdevice and let it do the hw/sw vlan implementation. Suggested-by: Jiri Pirko Signed-off-by: YueHaibing Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- drivers/net/team/team.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c index 16963f7a88f7..351f25e1fc48 100644 --- a/drivers/net/team/team.c +++ b/drivers/net/team/team.c @@ -2135,12 +2135,12 @@ static void team_setup(struct net_device *dev) dev->features |= NETIF_F_NETNS_LOCAL; dev->hw_features = TEAM_VLAN_FEATURES | - NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_CTAG_FILTER; dev->hw_features |= NETIF_F_GSO_ENCAP_ALL | NETIF_F_GSO_UDP_L4; dev->features |= dev->hw_features; + dev->features |= NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_STAG_TX; } static int team_newlink(struct net *src_net, struct net_device *dev, -- cgit v1.2.3 From ec7fafa68f287c290c08a07765cbb0772e3d7229 Mon Sep 17 00:00:00 2001 From: Xin Long Date: Thu, 20 Jun 2019 18:39:28 +0800 Subject: tipc: change to use register_pernet_device [ Upstream commit c492d4c74dd3f87559883ffa0f94a8f1ae3fe5f5 ] This patch is to fix a dst defcnt leak, which can be reproduced by doing: # ip net a c; ip net a s; modprobe tipc # ip net e s ip l a n eth1 type veth peer n eth1 netns c # ip net e c ip l s lo up; ip net e c ip l s eth1 up # ip net e s ip l s lo up; ip net e s ip l s eth1 up # ip net e c ip a a 1.1.1.2/8 dev eth1 # ip net e s ip a a 1.1.1.1/8 dev eth1 # ip net e c tipc b e m udp n u1 localip 1.1.1.2 # ip net e s tipc b e m udp n u1 localip 1.1.1.1 # ip net d c; ip net d s; rmmod tipc and it will get stuck and keep logging the error: unregister_netdevice: waiting for lo to become free. Usage count = 1 The cause is that a dst is held by the udp sock's sk_rx_dst set on udp rx path with udp_early_demux == 1, and this dst (eventually holding lo dev) can't be released as bearer's removal in tipc pernet .exit happens after lo dev's removal, default_device pernet .exit. "There are two distinct types of pernet_operations recognized: subsys and device. At creation all subsys init functions are called before device init functions, and at destruction all device exit functions are called before subsys exit function." So by calling register_pernet_device instead to register tipc_net_ops, the pernet .exit() will be invoked earlier than loopback dev's removal when a netns is being destroyed, as fou/gue does. Note that vxlan and geneve udp tunnels don't have this issue, as the udp sock is released in their device ndo_stop(). This fix is also necessary for tipc dst_cache, which will hold dsts on tx path and I will introduce in my next patch. Reported-by: Li Shuang Signed-off-by: Xin Long Acked-by: Jon Maloy Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/tipc/core.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/net/tipc/core.c b/net/tipc/core.c index 3ecca3b88bf8..eb0f701f9bf1 100644 --- a/net/tipc/core.c +++ b/net/tipc/core.c @@ -132,7 +132,7 @@ static int __init tipc_init(void) if (err) goto out_sysctl; - err = register_pernet_subsys(&tipc_net_ops); + err = register_pernet_device(&tipc_net_ops); if (err) goto out_pernet; @@ -140,7 +140,7 @@ static int __init tipc_init(void) if (err) goto out_socket; - err = register_pernet_subsys(&tipc_topsrv_net_ops); + err = register_pernet_device(&tipc_topsrv_net_ops); if (err) goto out_pernet_topsrv; @@ -151,11 +151,11 @@ static int __init tipc_init(void) pr_info("Started in single node mode\n"); return 0; out_bearer: - unregister_pernet_subsys(&tipc_topsrv_net_ops); + unregister_pernet_device(&tipc_topsrv_net_ops); out_pernet_topsrv: tipc_socket_stop(); out_socket: - unregister_pernet_subsys(&tipc_net_ops); + unregister_pernet_device(&tipc_net_ops); out_pernet: tipc_unregister_sysctl(); out_sysctl: @@ -170,9 +170,9 @@ out_netlink: static void __exit tipc_exit(void) { tipc_bearer_cleanup(); - unregister_pernet_subsys(&tipc_topsrv_net_ops); + unregister_pernet_device(&tipc_topsrv_net_ops); tipc_socket_stop(); - unregister_pernet_subsys(&tipc_net_ops); + unregister_pernet_device(&tipc_net_ops); tipc_netlink_stop(); tipc_netlink_compat_stop(); tipc_unregister_sysctl(); -- cgit v1.2.3 From a54c0c1d392138f2012c1f644c434cbeed4cac77 Mon Sep 17 00:00:00 2001 From: Xin Long Date: Tue, 25 Jun 2019 00:28:19 +0800 Subject: tipc: check msg->req data len in tipc_nl_compat_bearer_disable [ Upstream commit 4f07b80c973348a99b5d2a32476a2e7877e94a05 ] This patch is to fix an uninit-value issue, reported by syzbot: BUG: KMSAN: uninit-value in memchr+0xce/0x110 lib/string.c:981 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x191/0x1f0 lib/dump_stack.c:113 kmsan_report+0x130/0x2a0 mm/kmsan/kmsan.c:622 __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:310 memchr+0xce/0x110 lib/string.c:981 string_is_valid net/tipc/netlink_compat.c:176 [inline] tipc_nl_compat_bearer_disable+0x2a1/0x480 net/tipc/netlink_compat.c:449 __tipc_nl_compat_doit net/tipc/netlink_compat.c:327 [inline] tipc_nl_compat_doit+0x3ac/0xb00 net/tipc/netlink_compat.c:360 tipc_nl_compat_handle net/tipc/netlink_compat.c:1178 [inline] tipc_nl_compat_recv+0x1b1b/0x27b0 net/tipc/netlink_compat.c:1281 TLV_GET_DATA_LEN() may return a negtive int value, which will be used as size_t (becoming a big unsigned long) passed into memchr, cause this issue. Similar to what it does in tipc_nl_compat_bearer_enable(), this fix is to return -EINVAL when TLV_GET_DATA_LEN() is negtive in tipc_nl_compat_bearer_disable(), as well as in tipc_nl_compat_link_stat_dump() and tipc_nl_compat_link_reset_stats(). v1->v2: - add the missing Fixes tags per Eric's request. Fixes: 0762216c0ad2 ("tipc: fix uninit-value in tipc_nl_compat_bearer_enable") Fixes: 8b66fee7f8ee ("tipc: fix uninit-value in tipc_nl_compat_link_reset_stats") Reported-by: syzbot+30eaa8bf392f7fafffaf@syzkaller.appspotmail.com Signed-off-by: Xin Long Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/tipc/netlink_compat.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/net/tipc/netlink_compat.c b/net/tipc/netlink_compat.c index 340a6e7c43a7..8836aebd6180 100644 --- a/net/tipc/netlink_compat.c +++ b/net/tipc/netlink_compat.c @@ -445,7 +445,11 @@ static int tipc_nl_compat_bearer_disable(struct tipc_nl_compat_cmd_doit *cmd, if (!bearer) return -EMSGSIZE; - len = min_t(int, TLV_GET_DATA_LEN(msg->req), TIPC_MAX_BEARER_NAME); + len = TLV_GET_DATA_LEN(msg->req); + if (len <= 0) + return -EINVAL; + + len = min_t(int, len, TIPC_MAX_BEARER_NAME); if (!string_is_valid(name, len)) return -EINVAL; @@ -537,7 +541,11 @@ static int tipc_nl_compat_link_stat_dump(struct tipc_nl_compat_msg *msg, name = (char *)TLV_DATA(msg->req); - len = min_t(int, TLV_GET_DATA_LEN(msg->req), TIPC_MAX_LINK_NAME); + len = TLV_GET_DATA_LEN(msg->req); + if (len <= 0) + return -EINVAL; + + len = min_t(int, len, TIPC_MAX_BEARER_NAME); if (!string_is_valid(name, len)) return -EINVAL; @@ -815,7 +823,11 @@ static int tipc_nl_compat_link_reset_stats(struct tipc_nl_compat_cmd_doit *cmd, if (!link) return -EMSGSIZE; - len = min_t(int, TLV_GET_DATA_LEN(msg->req), TIPC_MAX_LINK_NAME); + len = TLV_GET_DATA_LEN(msg->req); + if (len <= 0) + return -EINVAL; + + len = min_t(int, len, TIPC_MAX_BEARER_NAME); if (!string_is_valid(name, len)) return -EINVAL; -- cgit v1.2.3 From 9590d1d1b033cb3c1211a90c66b928a141d6b129 Mon Sep 17 00:00:00 2001 From: Fei Li Date: Mon, 17 Jun 2019 21:26:36 +0800 Subject: tun: wake up waitqueues after IFF_UP is set [ Upstream commit 72b319dc08b4924a29f5e2560ef6d966fa54c429 ] Currently after setting tap0 link up, the tun code wakes tx/rx waited queues up in tun_net_open() when .ndo_open() is called, however the IFF_UP flag has not been set yet. If there's already a wait queue, it would fail to transmit when checking the IFF_UP flag in tun_sendmsg(). Then the saving vhost_poll_start() will add the wq into wqh until it is waken up again. Although this works when IFF_UP flag has been set when tun_chr_poll detects; this is not true if IFF_UP flag has not been set at that time. Sadly the latter case is a fatal error, as the wq will never be waken up in future unless later manually setting link up on purpose. Fix this by moving the wakeup process into the NETDEV_UP event notifying process, this makes sure IFF_UP has been set before all waited queues been waken up. Signed-off-by: Fei Li Acked-by: Jason Wang Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- drivers/net/tun.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index f4c933ac6edf..d1cafb29eca9 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1024,18 +1024,8 @@ static void tun_net_uninit(struct net_device *dev) /* Net device open. */ static int tun_net_open(struct net_device *dev) { - struct tun_struct *tun = netdev_priv(dev); - int i; - netif_tx_start_all_queues(dev); - for (i = 0; i < tun->numqueues; i++) { - struct tun_file *tfile; - - tfile = rtnl_dereference(tun->tfiles[i]); - tfile->socket.sk->sk_write_space(tfile->socket.sk); - } - return 0; } @@ -3636,6 +3626,7 @@ static int tun_device_event(struct notifier_block *unused, { struct net_device *dev = netdev_notifier_info_to_dev(ptr); struct tun_struct *tun = netdev_priv(dev); + int i; if (dev->rtnl_link_ops != &tun_link_ops) return NOTIFY_DONE; @@ -3645,6 +3636,14 @@ static int tun_device_event(struct notifier_block *unused, if (tun_queue_resize(tun)) return NOTIFY_BAD; break; + case NETDEV_UP: + for (i = 0; i < tun->numqueues; i++) { + struct tun_file *tfile; + + tfile = rtnl_dereference(tun->tfiles[i]); + tfile->socket.sk->sk_write_space(tfile->socket.sk); + } + break; default: break; } -- cgit v1.2.3 From 03c3e507e90eb4fc528ed52e77d9f9ae8e4edd64 Mon Sep 17 00:00:00 2001 From: Dmitry Bogdanov Date: Sat, 22 Jun 2019 08:46:37 +0000 Subject: net: aquantia: fix vlans not working over bridged network [ Upstream commit 48dd73d08d4dda47ee31cc8611fb16840fc16803 ] In configuration of vlan over bridge over aquantia device it was found that vlan tagged traffic is dropped on chip. The reason is that bridge device enables promisc mode, but in atlantic chip vlan filters will still apply. So we have to corellate promisc settings with vlan configuration. The solution is to track in a separate state variable the need of vlan forced promisc. And also consider generic promisc configuration when doing vlan filter config. Fixes: 7975d2aff5af ("net: aquantia: add support of rx-vlan-filter offload") Signed-off-by: Dmitry Bogdanov Signed-off-by: Igor Russkikh Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- drivers/net/ethernet/aquantia/atlantic/aq_filters.c | 10 ++++++++-- drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 1 + drivers/net/ethernet/aquantia/atlantic/aq_nic.h | 1 + .../net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c | 19 +++++++++++++------ 4 files changed, 23 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_filters.c b/drivers/net/ethernet/aquantia/atlantic/aq_filters.c index 18bc035da850..1fff462a4175 100644 --- a/drivers/net/ethernet/aquantia/atlantic/aq_filters.c +++ b/drivers/net/ethernet/aquantia/atlantic/aq_filters.c @@ -843,9 +843,14 @@ int aq_filters_vlans_update(struct aq_nic_s *aq_nic) return err; if (aq_nic->ndev->features & NETIF_F_HW_VLAN_CTAG_FILTER) { - if (hweight < AQ_VLAN_MAX_FILTERS) - err = aq_hw_ops->hw_filter_vlan_ctrl(aq_hw, true); + if (hweight < AQ_VLAN_MAX_FILTERS && hweight > 0) { + err = aq_hw_ops->hw_filter_vlan_ctrl(aq_hw, + !(aq_nic->packet_filter & IFF_PROMISC)); + aq_nic->aq_nic_cfg.is_vlan_force_promisc = false; + } else { /* otherwise left in promiscue mode */ + aq_nic->aq_nic_cfg.is_vlan_force_promisc = true; + } } return err; @@ -866,6 +871,7 @@ int aq_filters_vlan_offload_off(struct aq_nic_s *aq_nic) if (unlikely(!aq_hw_ops->hw_filter_vlan_ctrl)) return -EOPNOTSUPP; + aq_nic->aq_nic_cfg.is_vlan_force_promisc = true; err = aq_hw_ops->hw_filter_vlan_ctrl(aq_hw, false); if (err) return err; diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c index ff83667410bd..550abfe6973d 100644 --- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c +++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c @@ -117,6 +117,7 @@ void aq_nic_cfg_start(struct aq_nic_s *self) cfg->link_speed_msk &= cfg->aq_hw_caps->link_speed_msk; cfg->features = cfg->aq_hw_caps->hw_features; + cfg->is_vlan_force_promisc = true; } static int aq_nic_update_link_status(struct aq_nic_s *self) diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.h b/drivers/net/ethernet/aquantia/atlantic/aq_nic.h index 8e34c1e49bf2..65e681be9b5d 100644 --- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.h +++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.h @@ -36,6 +36,7 @@ struct aq_nic_cfg_s { u32 flow_control; u32 link_speed_msk; u32 wol; + bool is_vlan_force_promisc; u16 is_mc_list_enabled; u16 mc_list_count; bool is_autoneg; diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c index ec302fdfec63..a4cc04741115 100644 --- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c +++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c @@ -771,8 +771,15 @@ static int hw_atl_b0_hw_packet_filter_set(struct aq_hw_s *self, unsigned int packet_filter) { unsigned int i = 0U; + struct aq_nic_cfg_s *cfg = self->aq_nic_cfg; + + hw_atl_rpfl2promiscuous_mode_en_set(self, + IS_FILTER_ENABLED(IFF_PROMISC)); + + hw_atl_rpf_vlan_prom_mode_en_set(self, + IS_FILTER_ENABLED(IFF_PROMISC) || + cfg->is_vlan_force_promisc); - hw_atl_rpfl2promiscuous_mode_en_set(self, IS_FILTER_ENABLED(IFF_PROMISC)); hw_atl_rpfl2multicast_flr_en_set(self, IS_FILTER_ENABLED(IFF_ALLMULTI), 0); @@ -781,13 +788,13 @@ static int hw_atl_b0_hw_packet_filter_set(struct aq_hw_s *self, hw_atl_rpfl2broadcast_en_set(self, IS_FILTER_ENABLED(IFF_BROADCAST)); - self->aq_nic_cfg->is_mc_list_enabled = IS_FILTER_ENABLED(IFF_MULTICAST); + cfg->is_mc_list_enabled = IS_FILTER_ENABLED(IFF_MULTICAST); for (i = HW_ATL_B0_MAC_MIN; i < HW_ATL_B0_MAC_MAX; ++i) hw_atl_rpfl2_uc_flr_en_set(self, - (self->aq_nic_cfg->is_mc_list_enabled && - (i <= self->aq_nic_cfg->mc_list_count)) ? - 1U : 0U, i); + (cfg->is_mc_list_enabled && + (i <= cfg->mc_list_count)) ? + 1U : 0U, i); return aq_hw_err_from_flags(self); } @@ -1079,7 +1086,7 @@ static int hw_atl_b0_hw_vlan_set(struct aq_hw_s *self, static int hw_atl_b0_hw_vlan_ctrl(struct aq_hw_s *self, bool enable) { /* set promisc in case of disabing the vland filter */ - hw_atl_rpf_vlan_prom_mode_en_set(self, !!!enable); + hw_atl_rpf_vlan_prom_mode_en_set(self, !enable); return aq_hw_err_from_flags(self); } -- cgit v1.2.3 From 7108c83502e822e52052eea7d13f9d3d8bc8dea0 Mon Sep 17 00:00:00 2001 From: Martynas Pumputis Date: Wed, 12 Jun 2019 18:05:40 +0200 Subject: bpf: simplify definition of BPF_FIB_LOOKUP related flags commit b1d6c15b9d824a58c5415673f374fac19e8eccdf upstream. Previously, the BPF_FIB_LOOKUP_{DIRECT,OUTPUT} flags in the BPF UAPI were defined with the help of BIT macro. This had the following issues: - In order to use any of the flags, a user was required to depend on . - No other flag in bpf.h uses the macro, so it seems that an unwritten convention is to use (1 << (nr)) to define BPF-related flags. Fixes: 87f5fc7e48dd ("bpf: Provide helper to do forwarding lookups in kernel FIB table") Signed-off-by: Martynas Pumputis Acked-by: Andrii Nakryiko Signed-off-by: Daniel Borkmann Signed-off-by: Greg Kroah-Hartman --- include/uapi/linux/bpf.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 929c8e537a14..0aa05f316741 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3104,8 +3104,8 @@ struct bpf_raw_tracepoint_args { /* DIRECT: Skip the FIB rules and go to FIB table associated with device * OUTPUT: Do lookup from egress perspective; default is ingress */ -#define BPF_FIB_LOOKUP_DIRECT BIT(0) -#define BPF_FIB_LOOKUP_OUTPUT BIT(1) +#define BPF_FIB_LOOKUP_DIRECT (1U << 0) +#define BPF_FIB_LOOKUP_OUTPUT (1U << 1) enum { BPF_FIB_LKUP_RET_SUCCESS, /* lookup successful */ -- cgit v1.2.3 From 7cec89761822f911527ba89ffda314fa4c0fad67 Mon Sep 17 00:00:00 2001 From: Jonathan Lemon Date: Sat, 8 Jun 2019 12:54:19 -0700 Subject: bpf: lpm_trie: check left child of last leftmost node for NULL commit da2577fdd0932ea4eefe73903f1130ee366767d2 upstream. If the leftmost parent node of the tree has does not have a child on the left side, then trie_get_next_key (and bpftool map dump) will not look at the child on the right. This leads to the traversal missing elements. Lookup is not affected. Update selftest to handle this case. Reproducer: bpftool map create /sys/fs/bpf/lpm type lpm_trie key 6 \ value 1 entries 256 name test_lpm flags 1 bpftool map update pinned /sys/fs/bpf/lpm key 8 0 0 0 0 0 value 1 bpftool map update pinned /sys/fs/bpf/lpm key 16 0 0 0 0 128 value 2 bpftool map dump pinned /sys/fs/bpf/lpm Returns only 1 element. (2 expected) Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE") Signed-off-by: Jonathan Lemon Acked-by: Martin KaFai Lau Signed-off-by: Daniel Borkmann Signed-off-by: Greg Kroah-Hartman --- kernel/bpf/lpm_trie.c | 9 +++++-- tools/testing/selftests/bpf/test_lpm_map.c | 41 +++++++++++++++++++++++++++--- 2 files changed, 45 insertions(+), 5 deletions(-) diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c index 93a5cbbde421..3b03a7342f3c 100644 --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -715,9 +715,14 @@ find_leftmost: * have exact two children, so this function will never return NULL. */ for (node = search_root; node;) { - if (!(node->flags & LPM_TREE_NODE_FLAG_IM)) + if (node->flags & LPM_TREE_NODE_FLAG_IM) { + node = rcu_dereference(node->child[0]); + } else { next_node = node; - node = rcu_dereference(node->child[0]); + node = rcu_dereference(node->child[0]); + if (!node) + node = rcu_dereference(next_node->child[1]); + } } do_copy: next_key->prefixlen = next_node->prefixlen; diff --git a/tools/testing/selftests/bpf/test_lpm_map.c b/tools/testing/selftests/bpf/test_lpm_map.c index 02d7c871862a..006be3963977 100644 --- a/tools/testing/selftests/bpf/test_lpm_map.c +++ b/tools/testing/selftests/bpf/test_lpm_map.c @@ -573,13 +573,13 @@ static void test_lpm_get_next_key(void) /* add one more element (total two) */ key_p->prefixlen = 24; - inet_pton(AF_INET, "192.168.0.0", key_p->data); + inet_pton(AF_INET, "192.168.128.0", key_p->data); assert(bpf_map_update_elem(map_fd, key_p, &value, 0) == 0); memset(key_p, 0, key_size); assert(bpf_map_get_next_key(map_fd, NULL, key_p) == 0); assert(key_p->prefixlen == 24 && key_p->data[0] == 192 && - key_p->data[1] == 168 && key_p->data[2] == 0); + key_p->data[1] == 168 && key_p->data[2] == 128); memset(next_key_p, 0, key_size); assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == 0); @@ -592,7 +592,7 @@ static void test_lpm_get_next_key(void) /* Add one more element (total three) */ key_p->prefixlen = 24; - inet_pton(AF_INET, "192.168.128.0", key_p->data); + inet_pton(AF_INET, "192.168.0.0", key_p->data); assert(bpf_map_update_elem(map_fd, key_p, &value, 0) == 0); memset(key_p, 0, key_size); @@ -643,6 +643,41 @@ static void test_lpm_get_next_key(void) assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == -1 && errno == ENOENT); + /* Add one more element (total five) */ + key_p->prefixlen = 28; + inet_pton(AF_INET, "192.168.1.128", key_p->data); + assert(bpf_map_update_elem(map_fd, key_p, &value, 0) == 0); + + memset(key_p, 0, key_size); + assert(bpf_map_get_next_key(map_fd, NULL, key_p) == 0); + assert(key_p->prefixlen == 24 && key_p->data[0] == 192 && + key_p->data[1] == 168 && key_p->data[2] == 0); + + memset(next_key_p, 0, key_size); + assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == 0); + assert(next_key_p->prefixlen == 28 && next_key_p->data[0] == 192 && + next_key_p->data[1] == 168 && next_key_p->data[2] == 1 && + next_key_p->data[3] == 128); + + memcpy(key_p, next_key_p, key_size); + assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == 0); + assert(next_key_p->prefixlen == 24 && next_key_p->data[0] == 192 && + next_key_p->data[1] == 168 && next_key_p->data[2] == 1); + + memcpy(key_p, next_key_p, key_size); + assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == 0); + assert(next_key_p->prefixlen == 24 && next_key_p->data[0] == 192 && + next_key_p->data[1] == 168 && next_key_p->data[2] == 128); + + memcpy(key_p, next_key_p, key_size); + assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == 0); + assert(next_key_p->prefixlen == 16 && next_key_p->data[0] == 192 && + next_key_p->data[1] == 168); + + memcpy(key_p, next_key_p, key_size); + assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == -1 && + errno == ENOENT); + /* no exact matching key should return the first one in post order */ key_p->prefixlen = 22; inet_pton(AF_INET, "192.168.1.0", key_p->data); -- cgit v1.2.3 From 2a9fedc1ef4be2acb4fd4674f405c21c811e1505 Mon Sep 17 00:00:00 2001 From: Matt Mullins Date: Tue, 11 Jun 2019 14:53:04 -0700 Subject: bpf: fix nested bpf tracepoints with per-cpu data commit 9594dc3c7e71b9f52bee1d7852eb3d4e3aea9e99 upstream. BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as they do not increment bpf_prog_active while executing. This enables three levels of nesting, to support - a kprobe or raw tp or perf event, - another one of the above that irq context happens to call, and - another one in nmi context (at most one of which may be a kprobe or perf event). Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data") Signed-off-by: Matt Mullins Acked-by: Andrii Nakryiko Acked-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov Signed-off-by: Greg Kroah-Hartman --- kernel/trace/bpf_trace.c | 100 +++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 84 insertions(+), 16 deletions(-) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index d64c00afceb5..568a0e839903 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -402,8 +402,6 @@ static const struct bpf_func_proto bpf_perf_event_read_value_proto = { .arg4_type = ARG_CONST_SIZE, }; -static DEFINE_PER_CPU(struct perf_sample_data, bpf_trace_sd); - static __always_inline u64 __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map, u64 flags, struct perf_sample_data *sd) @@ -434,24 +432,50 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map, return perf_event_output(event, sd, regs); } +/* + * Support executing tracepoints in normal, irq, and nmi context that each call + * bpf_perf_event_output + */ +struct bpf_trace_sample_data { + struct perf_sample_data sds[3]; +}; + +static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_trace_sds); +static DEFINE_PER_CPU(int, bpf_trace_nest_level); BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map, u64, flags, void *, data, u64, size) { - struct perf_sample_data *sd = this_cpu_ptr(&bpf_trace_sd); + struct bpf_trace_sample_data *sds = this_cpu_ptr(&bpf_trace_sds); + int nest_level = this_cpu_inc_return(bpf_trace_nest_level); struct perf_raw_record raw = { .frag = { .size = size, .data = data, }, }; + struct perf_sample_data *sd; + int err; - if (unlikely(flags & ~(BPF_F_INDEX_MASK))) - return -EINVAL; + if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(sds->sds))) { + err = -EBUSY; + goto out; + } + + sd = &sds->sds[nest_level - 1]; + + if (unlikely(flags & ~(BPF_F_INDEX_MASK))) { + err = -EINVAL; + goto out; + } perf_sample_data_init(sd, 0, 0); sd->raw = &raw; - return __bpf_perf_event_output(regs, map, flags, sd); + err = __bpf_perf_event_output(regs, map, flags, sd); + +out: + this_cpu_dec(bpf_trace_nest_level); + return err; } static const struct bpf_func_proto bpf_perf_event_output_proto = { @@ -808,16 +832,48 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) /* * bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp * to avoid potential recursive reuse issue when/if tracepoints are added - * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack + * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack. + * + * Since raw tracepoints run despite bpf_prog_active, support concurrent usage + * in normal, irq, and nmi context. */ -static DEFINE_PER_CPU(struct pt_regs, bpf_raw_tp_regs); +struct bpf_raw_tp_regs { + struct pt_regs regs[3]; +}; +static DEFINE_PER_CPU(struct bpf_raw_tp_regs, bpf_raw_tp_regs); +static DEFINE_PER_CPU(int, bpf_raw_tp_nest_level); +static struct pt_regs *get_bpf_raw_tp_regs(void) +{ + struct bpf_raw_tp_regs *tp_regs = this_cpu_ptr(&bpf_raw_tp_regs); + int nest_level = this_cpu_inc_return(bpf_raw_tp_nest_level); + + if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(tp_regs->regs))) { + this_cpu_dec(bpf_raw_tp_nest_level); + return ERR_PTR(-EBUSY); + } + + return &tp_regs->regs[nest_level - 1]; +} + +static void put_bpf_raw_tp_regs(void) +{ + this_cpu_dec(bpf_raw_tp_nest_level); +} + BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args, struct bpf_map *, map, u64, flags, void *, data, u64, size) { - struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs); + struct pt_regs *regs = get_bpf_raw_tp_regs(); + int ret; + + if (IS_ERR(regs)) + return PTR_ERR(regs); perf_fetch_caller_regs(regs); - return ____bpf_perf_event_output(regs, map, flags, data, size); + ret = ____bpf_perf_event_output(regs, map, flags, data, size); + + put_bpf_raw_tp_regs(); + return ret; } static const struct bpf_func_proto bpf_perf_event_output_proto_raw_tp = { @@ -834,12 +890,18 @@ static const struct bpf_func_proto bpf_perf_event_output_proto_raw_tp = { BPF_CALL_3(bpf_get_stackid_raw_tp, struct bpf_raw_tracepoint_args *, args, struct bpf_map *, map, u64, flags) { - struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs); + struct pt_regs *regs = get_bpf_raw_tp_regs(); + int ret; + + if (IS_ERR(regs)) + return PTR_ERR(regs); perf_fetch_caller_regs(regs); /* similar to bpf_perf_event_output_tp, but pt_regs fetched differently */ - return bpf_get_stackid((unsigned long) regs, (unsigned long) map, - flags, 0, 0); + ret = bpf_get_stackid((unsigned long) regs, (unsigned long) map, + flags, 0, 0); + put_bpf_raw_tp_regs(); + return ret; } static const struct bpf_func_proto bpf_get_stackid_proto_raw_tp = { @@ -854,11 +916,17 @@ static const struct bpf_func_proto bpf_get_stackid_proto_raw_tp = { BPF_CALL_4(bpf_get_stack_raw_tp, struct bpf_raw_tracepoint_args *, args, void *, buf, u32, size, u64, flags) { - struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs); + struct pt_regs *regs = get_bpf_raw_tp_regs(); + int ret; + + if (IS_ERR(regs)) + return PTR_ERR(regs); perf_fetch_caller_regs(regs); - return bpf_get_stack((unsigned long) regs, (unsigned long) buf, - (unsigned long) size, flags, 0); + ret = bpf_get_stack((unsigned long) regs, (unsigned long) buf, + (unsigned long) size, flags, 0); + put_bpf_raw_tp_regs(); + return ret; } static const struct bpf_func_proto bpf_get_stack_proto_raw_tp = { -- cgit v1.2.3 From 591c18e3aed16fde52cfdfc4af094b2cfd5dd0f2 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Fri, 7 Jun 2019 01:48:57 +0200 Subject: bpf: fix unconnected udp hooks commit 983695fa676568fc0fe5ddd995c7267aabc24632 upstream. Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently to applications as also stated in original motivation in 7828f20e3779 ("Merge branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter two hooks into Cilium to enable host based load-balancing with Kubernetes, I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes typically sets up DNS as a service and is thus subject to load-balancing. Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API is currently insufficient and thus not usable as-is for standard applications shipped with most distros. To break down the issue we ran into with a simple example: # cat /etc/resolv.conf nameserver 147.75.207.207 nameserver 147.75.207.208 For the purpose of a simple test, we set up above IPs as service IPs and transparently redirect traffic to a different DNS backend server for that node: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 The attached BPF program is basically selecting one of the backends if the service IP/port matches on the cgroup hook. DNS breaks here, because the hooks are not transparent enough to applications which have built-in msg_name address checks: # nslookup 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ;; connection timed out; no servers could be reached # dig 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; connection timed out; no servers could be reached For comparison, if none of the service IPs is used, and we tell nslookup to use 8.8.8.8 directly it works just fine, of course: # nslookup 1.1.1.1 8.8.8.8 1.1.1.1.in-addr.arpa name = one.one.one.one. In order to fix this and thus act more transparent to the application, this needs reverse translation on recvmsg() side. A minimal fix for this API is to add similar recvmsg() hooks behind the BPF cgroups static key such that the program can track state and replace the current sockaddr_in{,6} with the original service IP. From BPF side, this basically tracks the service tuple plus socket cookie in an LRU map where the reverse NAT can then be retrieved via map value as one example. Side-note: the BPF cgroups static key should be converted to a per-hook static key in future. Same example after this fix: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 Lookups work fine now: # nslookup 1.1.1.1 1.1.1.1.in-addr.arpa name = one.one.one.one. Authoritative answers can be found from: # dig 1.1.1.1 ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550 ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;1.1.1.1. IN A ;; AUTHORITY SECTION: . 23426 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400 ;; Query time: 17 msec ;; SERVER: 147.75.207.207#53(147.75.207.207) ;; WHEN: Tue May 21 12:59:38 UTC 2019 ;; MSG SIZE rcvd: 111 And from an actual packet level it shows that we're using the back end server when talking via 147.75.207.20{7,8} front end: # tcpdump -i any udp [...] 12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) [...] In order to be flexible and to have same semantics as in sendmsg BPF programs, we only allow return codes in [1,1] range. In the sendmsg case the program is called if msg->msg_name is present which can be the case in both, connected and unconnected UDP. The former only relies on the sockaddr_in{,6} passed via connect(2) if passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar way to call into the BPF program whenever a non-NULL msg->msg_name was passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note that for TCP case, the msg->msg_name is ignored in the regular recvmsg path and therefore not relevant. For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE, the hook is not called. This is intentional as it aligns with the same semantics as in case of TCP cgroup BPF hooks right now. This might be better addressed in future through a different bpf_attach_type such that this case can be distinguished from the regular recvmsg paths, for example. Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg") Signed-off-by: Daniel Borkmann Acked-by: Andrey Ignatov Acked-by: Martin KaFai Lau Acked-by: Martynas Pumputis Signed-off-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann Signed-off-by: Greg Kroah-Hartman --- include/linux/bpf-cgroup.h | 8 ++++++++ include/uapi/linux/bpf.h | 2 ++ kernel/bpf/syscall.c | 8 ++++++++ kernel/bpf/verifier.c | 12 ++++++++---- net/core/filter.c | 2 ++ net/ipv4/udp.c | 4 ++++ net/ipv6/udp.c | 4 ++++ 7 files changed, 36 insertions(+), 4 deletions(-) diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index a4c644c1c091..ada0246d43ab 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -230,6 +230,12 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, #define BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk, uaddr, t_ctx) \ BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_SENDMSG, t_ctx) +#define BPF_CGROUP_RUN_PROG_UDP4_RECVMSG_LOCK(sk, uaddr) \ + BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP4_RECVMSG, NULL) + +#define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr) \ + BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_RECVMSG, NULL) + #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) \ ({ \ int __ret = 0; \ @@ -319,6 +325,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map, #define BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr) ({ 0; }) #define BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk, uaddr, t_ctx) ({ 0; }) #define BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk, uaddr, t_ctx) ({ 0; }) +#define BPF_CGROUP_RUN_PROG_UDP4_RECVMSG_LOCK(sk, uaddr) ({ 0; }) +#define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr) ({ 0; }) #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; }) #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; }) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 0aa05f316741..9d01f4788d3e 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -187,6 +187,8 @@ enum bpf_attach_type { BPF_CGROUP_UDP6_SENDMSG, BPF_LIRC_MODE2, BPF_FLOW_DISSECTOR, + BPF_CGROUP_UDP4_RECVMSG = 19, + BPF_CGROUP_UDP6_RECVMSG, __MAX_BPF_ATTACH_TYPE }; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index db6e825e2958..29b8f20d500c 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1520,6 +1520,8 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type, case BPF_CGROUP_INET6_CONNECT: case BPF_CGROUP_UDP4_SENDMSG: case BPF_CGROUP_UDP6_SENDMSG: + case BPF_CGROUP_UDP4_RECVMSG: + case BPF_CGROUP_UDP6_RECVMSG: return 0; default: return -EINVAL; @@ -1809,6 +1811,8 @@ static int bpf_prog_attach(const union bpf_attr *attr) case BPF_CGROUP_INET6_CONNECT: case BPF_CGROUP_UDP4_SENDMSG: case BPF_CGROUP_UDP6_SENDMSG: + case BPF_CGROUP_UDP4_RECVMSG: + case BPF_CGROUP_UDP6_RECVMSG: ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR; break; case BPF_CGROUP_SOCK_OPS: @@ -1891,6 +1895,8 @@ static int bpf_prog_detach(const union bpf_attr *attr) case BPF_CGROUP_INET6_CONNECT: case BPF_CGROUP_UDP4_SENDMSG: case BPF_CGROUP_UDP6_SENDMSG: + case BPF_CGROUP_UDP4_RECVMSG: + case BPF_CGROUP_UDP6_RECVMSG: ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR; break; case BPF_CGROUP_SOCK_OPS: @@ -1939,6 +1945,8 @@ static int bpf_prog_query(const union bpf_attr *attr, case BPF_CGROUP_INET6_CONNECT: case BPF_CGROUP_UDP4_SENDMSG: case BPF_CGROUP_UDP6_SENDMSG: + case BPF_CGROUP_UDP4_RECVMSG: + case BPF_CGROUP_UDP6_RECVMSG: case BPF_CGROUP_SOCK_OPS: case BPF_CGROUP_DEVICE: break; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 950fac024fbb..4ff130ddfbf6 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -5145,9 +5145,12 @@ static int check_return_code(struct bpf_verifier_env *env) struct tnum range = tnum_range(0, 1); switch (env->prog->type) { + case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: + if (env->prog->expected_attach_type == BPF_CGROUP_UDP4_RECVMSG || + env->prog->expected_attach_type == BPF_CGROUP_UDP6_RECVMSG) + range = tnum_range(1, 1); case BPF_PROG_TYPE_CGROUP_SKB: case BPF_PROG_TYPE_CGROUP_SOCK: - case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: case BPF_PROG_TYPE_SOCK_OPS: case BPF_PROG_TYPE_CGROUP_DEVICE: break; @@ -5163,16 +5166,17 @@ static int check_return_code(struct bpf_verifier_env *env) } if (!tnum_in(range, reg->var_off)) { + char tn_buf[48]; + verbose(env, "At program exit the register R0 "); if (!tnum_is_unknown(reg->var_off)) { - char tn_buf[48]; - tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); verbose(env, "has value %s", tn_buf); } else { verbose(env, "has unknown scalar value"); } - verbose(env, " should have been 0 or 1\n"); + tnum_strn(tn_buf, sizeof(tn_buf), range); + verbose(env, " should have been in %s\n", tn_buf); return -EINVAL; } return 0; diff --git a/net/core/filter.c b/net/core/filter.c index 27e61ffd9039..b76f14197128 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -6420,6 +6420,7 @@ static bool sock_addr_is_valid_access(int off, int size, case BPF_CGROUP_INET4_BIND: case BPF_CGROUP_INET4_CONNECT: case BPF_CGROUP_UDP4_SENDMSG: + case BPF_CGROUP_UDP4_RECVMSG: break; default: return false; @@ -6430,6 +6431,7 @@ static bool sock_addr_is_valid_access(int off, int size, case BPF_CGROUP_INET6_BIND: case BPF_CGROUP_INET6_CONNECT: case BPF_CGROUP_UDP6_SENDMSG: + case BPF_CGROUP_UDP6_RECVMSG: break; default: return false; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 3b179ce6170f..7b359012e496 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1783,6 +1783,10 @@ try_again: sin->sin_addr.s_addr = ip_hdr(skb)->saddr; memset(sin->sin_zero, 0, sizeof(sin->sin_zero)); *addr_len = sizeof(*sin); + + if (cgroup_bpf_enabled) + BPF_CGROUP_RUN_PROG_UDP4_RECVMSG_LOCK(sk, + (struct sockaddr *)sin); } if (udp_sk(sk)->gro_enabled) diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 622eeaf5732b..ecb4ae37e3bf 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -370,6 +370,10 @@ try_again: inet6_iif(skb)); } *addr_len = sizeof(*sin6); + + if (cgroup_bpf_enabled) + BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, + (struct sockaddr *)sin6); } if (udp_sk(sk)->gro_enabled) -- cgit v1.2.3 From da6dab6373b223a3f05df6b2236a3ffa81ed7cb8 Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Fri, 31 May 2019 15:29:13 -0700 Subject: bpf: udp: Avoid calling reuseport's bpf_prog from udp_gro commit 257a525fe2e49584842c504a92c27097407f778f upstream. When the commit a6024562ffd7 ("udp: Add GRO functions to UDP socket") added udp[46]_lib_lookup_skb to the udp_gro code path, it broke the reuseport_select_sock() assumption that skb->data is pointing to the transport header. This patch follows an earlier __udp6_lib_err() fix by passing a NULL skb to avoid calling the reuseport's bpf_prog. Fixes: a6024562ffd7 ("udp: Add GRO functions to UDP socket") Cc: Tom Herbert Signed-off-by: Martin KaFai Lau Acked-by: Song Liu Signed-off-by: Alexei Starovoitov Signed-off-by: Greg Kroah-Hartman --- net/ipv4/udp.c | 6 +++++- net/ipv6/udp.c | 2 +- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 7b359012e496..ee999cb5e670 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -503,7 +503,11 @@ static inline struct sock *__udp4_lib_lookup_skb(struct sk_buff *skb, struct sock *udp4_lib_lookup_skb(struct sk_buff *skb, __be16 sport, __be16 dport) { - return __udp4_lib_lookup_skb(skb, sport, dport, &udp_table); + const struct iphdr *iph = ip_hdr(skb); + + return __udp4_lib_lookup(dev_net(skb->dev), iph->saddr, sport, + iph->daddr, dport, inet_iif(skb), + inet_sdif(skb), &udp_table, NULL); } EXPORT_SYMBOL_GPL(udp4_lib_lookup_skb); diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index ecb4ae37e3bf..c5eff697a2f5 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -242,7 +242,7 @@ struct sock *udp6_lib_lookup_skb(struct sk_buff *skb, return __udp6_lib_lookup(dev_net(skb->dev), &iph->saddr, sport, &iph->daddr, dport, inet6_iif(skb), - inet6_sdif(skb), &udp_table, skb); + inet6_sdif(skb), &udp_table, NULL); } EXPORT_SYMBOL_GPL(udp6_lib_lookup_skb); -- cgit v1.2.3 From bb3fb093b41f10315e93ca2974164243958a6f51 Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Fri, 31 May 2019 15:29:11 -0700 Subject: bpf: udp: ipv6: Avoid running reuseport's bpf_prog from __udp6_lib_err commit 4ac30c4b3659efac031818c418beb51e630d512d upstream. __udp6_lib_err() may be called when handling icmpv6 message. For example, the icmpv6 toobig(type=2). __udp6_lib_lookup() is then called which may call reuseport_select_sock(). reuseport_select_sock() will call into a bpf_prog (if there is one). reuseport_select_sock() is expecting the skb->data pointing to the transport header (udphdr in this case). For example, run_bpf_filter() is pulling the transport header. However, in the __udp6_lib_err() path, the skb->data is pointing to the ipv6hdr instead of the udphdr. One option is to pull and push the ipv6hdr in __udp6_lib_err(). Instead of doing this, this patch follows how the original commit 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") was done in IPv4, which has passed a NULL skb pointer to reuseport_select_sock(). Fixes: 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") Cc: Craig Gallek Signed-off-by: Martin KaFai Lau Acked-by: Song Liu Acked-by: Craig Gallek Signed-off-by: Alexei Starovoitov Signed-off-by: Greg Kroah-Hartman --- net/ipv6/udp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index c5eff697a2f5..3a01d3d2f105 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -520,7 +520,7 @@ int __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt, struct net *net = dev_net(skb->dev); sk = __udp6_lib_lookup(net, daddr, uh->dest, saddr, uh->source, - inet6_iif(skb), inet6_sdif(skb), udptable, skb); + inet6_iif(skb), inet6_sdif(skb), udptable, NULL); if (!sk) { /* No socket for error: try tunnels before discarding */ sk = ERR_PTR(-ENOENT); -- cgit v1.2.3 From fac9c64326dd2176d59a97a7046b34de3edce2f9 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Wed, 10 Apr 2019 11:49:11 +0100 Subject: arm64: futex: Avoid copying out uninitialised stack in failed cmpxchg() commit 8e4e0ac02b449297b86498ac24db5786ddd9f647 upstream. Returning an error code from futex_atomic_cmpxchg_inatomic() indicates that the caller should not make any use of *uval, and should instead act upon on the value of the error code. Although this is implemented correctly in our futex code, we needlessly copy uninitialised stack to *uval in the error case, which can easily be avoided. Signed-off-by: Will Deacon Signed-off-by: Greg Kroah-Hartman --- arch/arm64/include/asm/futex.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h index 2d78ea6932b7..bdb3c05070a2 100644 --- a/arch/arm64/include/asm/futex.h +++ b/arch/arm64/include/asm/futex.h @@ -134,7 +134,9 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *_uaddr, : "memory"); uaccess_disable(); - *uval = val; + if (!ret) + *uval = val; + return ret; } -- cgit v1.2.3 From 272ca3913c8eaaa92086df6f78f108f2546e277d Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Fri, 26 Apr 2019 21:48:22 +0200 Subject: bpf, arm64: use more scalable stadd over ldxr / stxr loop in xadd commit 34b8ab091f9ef57a2bb3c8c8359a0a03a8abf2f9 upstream. Since ARMv8.1 supplement introduced LSE atomic instructions back in 2016, lets add support for STADD and use that in favor of LDXR / STXR loop for the XADD mapping if available. STADD is encoded as an alias for LDADD with XZR as the destination register, therefore add LDADD to the instruction encoder along with STADD as special case and use it in the JIT for CPUs that advertise LSE atomics in CPUID register. If immediate offset in the BPF XADD insn is 0, then use dst register directly instead of temporary one. Signed-off-by: Daniel Borkmann Acked-by: Jean-Philippe Brucker Acked-by: Will Deacon Signed-off-by: Alexei Starovoitov Signed-off-by: Greg Kroah-Hartman --- arch/arm64/include/asm/insn.h | 8 ++++++++ arch/arm64/kernel/insn.c | 40 ++++++++++++++++++++++++++++++++++++++++ arch/arm64/net/bpf_jit.h | 4 ++++ arch/arm64/net/bpf_jit_comp.c | 28 +++++++++++++++++++--------- 4 files changed, 71 insertions(+), 9 deletions(-) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 9c01f04db64d..ec894de0ed4e 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -277,6 +277,7 @@ __AARCH64_INSN_FUNCS(adrp, 0x9F000000, 0x90000000) __AARCH64_INSN_FUNCS(prfm, 0x3FC00000, 0x39800000) __AARCH64_INSN_FUNCS(prfm_lit, 0xFF000000, 0xD8000000) __AARCH64_INSN_FUNCS(str_reg, 0x3FE0EC00, 0x38206800) +__AARCH64_INSN_FUNCS(ldadd, 0x3F20FC00, 0xB8200000) __AARCH64_INSN_FUNCS(ldr_reg, 0x3FE0EC00, 0x38606800) __AARCH64_INSN_FUNCS(ldr_lit, 0xBF000000, 0x18000000) __AARCH64_INSN_FUNCS(ldrsw_lit, 0xFF000000, 0x98000000) @@ -394,6 +395,13 @@ u32 aarch64_insn_gen_load_store_ex(enum aarch64_insn_register reg, enum aarch64_insn_register state, enum aarch64_insn_size_type size, enum aarch64_insn_ldst_type type); +u32 aarch64_insn_gen_ldadd(enum aarch64_insn_register result, + enum aarch64_insn_register address, + enum aarch64_insn_register value, + enum aarch64_insn_size_type size); +u32 aarch64_insn_gen_stadd(enum aarch64_insn_register address, + enum aarch64_insn_register value, + enum aarch64_insn_size_type size); u32 aarch64_insn_gen_add_sub_imm(enum aarch64_insn_register dst, enum aarch64_insn_register src, int imm, enum aarch64_insn_variant variant, diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index 7820a4a688fa..9e2b5882cdeb 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -734,6 +734,46 @@ u32 aarch64_insn_gen_load_store_ex(enum aarch64_insn_register reg, state); } +u32 aarch64_insn_gen_ldadd(enum aarch64_insn_register result, + enum aarch64_insn_register address, + enum aarch64_insn_register value, + enum aarch64_insn_size_type size) +{ + u32 insn = aarch64_insn_get_ldadd_value(); + + switch (size) { + case AARCH64_INSN_SIZE_32: + case AARCH64_INSN_SIZE_64: + break; + default: + pr_err("%s: unimplemented size encoding %d\n", __func__, size); + return AARCH64_BREAK_FAULT; + } + + insn = aarch64_insn_encode_ldst_size(size, insn); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RT, insn, + result); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, + address); + + return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RS, insn, + value); +} + +u32 aarch64_insn_gen_stadd(enum aarch64_insn_register address, + enum aarch64_insn_register value, + enum aarch64_insn_size_type size) +{ + /* + * STADD is simply encoded as an alias for LDADD with XZR as + * the destination register. + */ + return aarch64_insn_gen_ldadd(AARCH64_INSN_REG_ZR, address, + value, size); +} + static u32 aarch64_insn_encode_prfm_imm(enum aarch64_insn_prfm_type type, enum aarch64_insn_prfm_target target, enum aarch64_insn_prfm_policy policy, diff --git a/arch/arm64/net/bpf_jit.h b/arch/arm64/net/bpf_jit.h index 6c881659ee8a..76606e87233f 100644 --- a/arch/arm64/net/bpf_jit.h +++ b/arch/arm64/net/bpf_jit.h @@ -100,6 +100,10 @@ #define A64_STXR(sf, Rt, Rn, Rs) \ A64_LSX(sf, Rt, Rn, Rs, STORE_EX) +/* LSE atomics */ +#define A64_STADD(sf, Rn, Rs) \ + aarch64_insn_gen_stadd(Rn, Rs, A64_SIZE(sf)) + /* Add/subtract (immediate) */ #define A64_ADDSUB_IMM(sf, Rd, Rn, imm12, type) \ aarch64_insn_gen_add_sub_imm(Rd, Rn, imm12, \ diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index a1420626fca2..df845cee438e 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -365,7 +365,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx, const bool is64 = BPF_CLASS(code) == BPF_ALU64 || BPF_CLASS(code) == BPF_JMP; const bool isdw = BPF_SIZE(code) == BPF_DW; - u8 jmp_cond; + u8 jmp_cond, reg; s32 jmp_offset; #define check_imm(bits, imm) do { \ @@ -756,18 +756,28 @@ emit_cond_jmp: break; } break; + /* STX XADD: lock *(u32 *)(dst + off) += src */ case BPF_STX | BPF_XADD | BPF_W: /* STX XADD: lock *(u64 *)(dst + off) += src */ case BPF_STX | BPF_XADD | BPF_DW: - emit_a64_mov_i(1, tmp, off, ctx); - emit(A64_ADD(1, tmp, tmp, dst), ctx); - emit(A64_LDXR(isdw, tmp2, tmp), ctx); - emit(A64_ADD(isdw, tmp2, tmp2, src), ctx); - emit(A64_STXR(isdw, tmp2, tmp, tmp3), ctx); - jmp_offset = -3; - check_imm19(jmp_offset); - emit(A64_CBNZ(0, tmp3, jmp_offset), ctx); + if (!off) { + reg = dst; + } else { + emit_a64_mov_i(1, tmp, off, ctx); + emit(A64_ADD(1, tmp, tmp, dst), ctx); + reg = tmp; + } + if (cpus_have_cap(ARM64_HAS_LSE_ATOMICS)) { + emit(A64_STADD(isdw, reg, src), ctx); + } else { + emit(A64_LDXR(isdw, tmp2, reg), ctx); + emit(A64_ADD(isdw, tmp2, tmp2, src), ctx); + emit(A64_STXR(isdw, tmp2, reg, tmp3), ctx); + jmp_offset = -3; + check_imm19(jmp_offset); + emit(A64_CBNZ(0, tmp3, jmp_offset), ctx); + } break; default: -- cgit v1.2.3 From 41dd902f6ec7bfc2185e27c5c0f6ef7cb158bc1f Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Wed, 10 Apr 2019 11:51:54 +0100 Subject: futex: Update comments and docs about return values of arch futex code commit 427503519739e779c0db8afe876c1b33f3ac60ae upstream. The architecture implementations of 'arch_futex_atomic_op_inuser()' and 'futex_atomic_cmpxchg_inatomic()' are permitted to return only -EFAULT, -EAGAIN or -ENOSYS in the case of failure. Update the comments in the asm-generic/ implementation and also a stray reference in the robust futex documentation. Signed-off-by: Will Deacon Signed-off-by: Greg Kroah-Hartman --- Documentation/robust-futexes.txt | 3 +-- include/asm-generic/futex.h | 8 ++++++-- 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/Documentation/robust-futexes.txt b/Documentation/robust-futexes.txt index 6c42c75103eb..6361fb01c9c1 100644 --- a/Documentation/robust-futexes.txt +++ b/Documentation/robust-futexes.txt @@ -218,5 +218,4 @@ All other architectures should build just fine too - but they won't have the new syscalls yet. Architectures need to implement the new futex_atomic_cmpxchg_inatomic() -inline function before writing up the syscalls (that function returns --ENOSYS right now). +inline function before writing up the syscalls. diff --git a/include/asm-generic/futex.h b/include/asm-generic/futex.h index fcb61b4659b3..8666fe7f35d7 100644 --- a/include/asm-generic/futex.h +++ b/include/asm-generic/futex.h @@ -23,7 +23,9 @@ * * Return: * 0 - On success - * <0 - On error + * -EFAULT - User access resulted in a page fault + * -EAGAIN - Atomic operation was unable to complete due to contention + * -ENOSYS - Operation not supported */ static inline int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 __user *uaddr) @@ -85,7 +87,9 @@ out_pagefault_enable: * * Return: * 0 - On success - * <0 - On error + * -EFAULT - User access resulted in a page fault + * -EAGAIN - Atomic operation was unable to complete due to contention + * -ENOSYS - Function not implemented (only if !HAVE_FUTEX_CMPXCHG) */ static inline int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, -- cgit v1.2.3 From 993a0821eb5b810bff67152a8005e1107f07a69d Mon Sep 17 00:00:00 2001 From: Jason Gunthorpe Date: Sun, 12 May 2019 21:57:57 -0300 Subject: RDMA: Directly cast the sockaddr union to sockaddr commit 641114d2af312d39ca9bbc2369d18a5823da51c6 upstream. gcc 9 now does allocation size tracking and thinks that passing the member of a union and then accessing beyond that member's bounds is an overflow. Instead of using the union member, use the entire union with a cast to get to the sockaddr. gcc will now know that the memory extends the full size of the union. Signed-off-by: Jason Gunthorpe Signed-off-by: Greg Kroah-Hartman --- drivers/infiniband/core/addr.c | 16 ++++++++-------- drivers/infiniband/hw/ocrdma/ocrdma_ah.c | 5 ++--- drivers/infiniband/hw/ocrdma/ocrdma_hw.c | 5 ++--- 3 files changed, 12 insertions(+), 14 deletions(-) diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index 0dce94e3c495..d0b04b0d309f 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -730,8 +730,8 @@ int roce_resolve_route_from_path(struct sa_path_rec *rec, if (rec->roce.route_resolved) return 0; - rdma_gid2ip(&sgid._sockaddr, &rec->sgid); - rdma_gid2ip(&dgid._sockaddr, &rec->dgid); + rdma_gid2ip((struct sockaddr *)&sgid, &rec->sgid); + rdma_gid2ip((struct sockaddr *)&dgid, &rec->dgid); if (sgid._sockaddr.sa_family != dgid._sockaddr.sa_family) return -EINVAL; @@ -742,7 +742,7 @@ int roce_resolve_route_from_path(struct sa_path_rec *rec, dev_addr.net = &init_net; dev_addr.sgid_attr = attr; - ret = addr_resolve(&sgid._sockaddr, &dgid._sockaddr, + ret = addr_resolve((struct sockaddr *)&sgid, (struct sockaddr *)&dgid, &dev_addr, false, true, 0); if (ret) return ret; @@ -814,22 +814,22 @@ int rdma_addr_find_l2_eth_by_grh(const union ib_gid *sgid, struct rdma_dev_addr dev_addr; struct resolve_cb_context ctx; union { - struct sockaddr _sockaddr; struct sockaddr_in _sockaddr_in; struct sockaddr_in6 _sockaddr_in6; } sgid_addr, dgid_addr; int ret; - rdma_gid2ip(&sgid_addr._sockaddr, sgid); - rdma_gid2ip(&dgid_addr._sockaddr, dgid); + rdma_gid2ip((struct sockaddr *)&sgid_addr, sgid); + rdma_gid2ip((struct sockaddr *)&dgid_addr, dgid); memset(&dev_addr, 0, sizeof(dev_addr)); dev_addr.net = &init_net; dev_addr.sgid_attr = sgid_attr; init_completion(&ctx.comp); - ret = rdma_resolve_ip(&sgid_addr._sockaddr, &dgid_addr._sockaddr, - &dev_addr, 1000, resolve_cb, true, &ctx); + ret = rdma_resolve_ip((struct sockaddr *)&sgid_addr, + (struct sockaddr *)&dgid_addr, &dev_addr, 1000, + resolve_cb, true, &ctx); if (ret) return ret; diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_ah.c b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c index a7295322efbc..f4500d77f314 100644 --- a/drivers/infiniband/hw/ocrdma/ocrdma_ah.c +++ b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c @@ -83,7 +83,6 @@ static inline int set_av_attr(struct ocrdma_dev *dev, struct ocrdma_ah *ah, struct iphdr ipv4; const struct ib_global_route *ib_grh; union { - struct sockaddr _sockaddr; struct sockaddr_in _sockaddr_in; struct sockaddr_in6 _sockaddr_in6; } sgid_addr, dgid_addr; @@ -133,9 +132,9 @@ static inline int set_av_attr(struct ocrdma_dev *dev, struct ocrdma_ah *ah, ipv4.tot_len = htons(0); ipv4.ttl = ib_grh->hop_limit; ipv4.protocol = nxthdr; - rdma_gid2ip(&sgid_addr._sockaddr, sgid); + rdma_gid2ip((struct sockaddr *)&sgid_addr, sgid); ipv4.saddr = sgid_addr._sockaddr_in.sin_addr.s_addr; - rdma_gid2ip(&dgid_addr._sockaddr, &ib_grh->dgid); + rdma_gid2ip((struct sockaddr*)&dgid_addr, &ib_grh->dgid); ipv4.daddr = dgid_addr._sockaddr_in.sin_addr.s_addr; memcpy((u8 *)ah->av + eth_sz, &ipv4, sizeof(struct iphdr)); } else { diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c index 097e5ab2a19f..94ca0f4a4073 100644 --- a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c +++ b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c @@ -2499,7 +2499,6 @@ static int ocrdma_set_av_params(struct ocrdma_qp *qp, u32 vlan_id = 0xFFFF; u8 mac_addr[6], hdr_type; union { - struct sockaddr _sockaddr; struct sockaddr_in _sockaddr_in; struct sockaddr_in6 _sockaddr_in6; } sgid_addr, dgid_addr; @@ -2541,8 +2540,8 @@ static int ocrdma_set_av_params(struct ocrdma_qp *qp, hdr_type = rdma_gid_attr_network_type(sgid_attr); if (hdr_type == RDMA_NETWORK_IPV4) { - rdma_gid2ip(&sgid_addr._sockaddr, &sgid_attr->gid); - rdma_gid2ip(&dgid_addr._sockaddr, &grh->dgid); + rdma_gid2ip((struct sockaddr *)&sgid_addr, &sgid_attr->gid); + rdma_gid2ip((struct sockaddr *)&dgid_addr, &grh->dgid); memcpy(&cmd->params.dgid[0], &dgid_addr._sockaddr_in.sin_addr.s_addr, 4); memcpy(&cmd->params.sgid[0], -- cgit v1.2.3 From b74063b5f16ab816dece79eec1830870d472aa43 Mon Sep 17 00:00:00 2001 From: Amir Goldstein Date: Wed, 19 Jun 2019 13:34:44 +0300 Subject: fanotify: update connector fsid cache on add mark commit c285a2f01d692ef48d7243cf1072897bbd237407 upstream. When implementing connector fsid cache, we only initialized the cache when the first mark added to object was added by FAN_REPORT_FID group. We forgot to update conn->fsid when the second mark is added by FAN_REPORT_FID group to an already attached connector without fsid cache. Reported-and-tested-by: syzbot+c277e8e2f46414645508@syzkaller.appspotmail.com Fixes: 77115225acc6 ("fanotify: cache fsid in fsnotify_mark_connector") Signed-off-by: Amir Goldstein Signed-off-by: Jan Kara Signed-off-by: Greg Kroah-Hartman --- fs/notify/fanotify/fanotify.c | 4 ++++ fs/notify/mark.c | 14 +++++++++++--- include/linux/fsnotify_backend.h | 4 +++- 3 files changed, 18 insertions(+), 4 deletions(-) diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c index 63c6bb1f8c4d..8c286f8228e5 100644 --- a/fs/notify/fanotify/fanotify.c +++ b/fs/notify/fanotify/fanotify.c @@ -355,6 +355,10 @@ static __kernel_fsid_t fanotify_get_fsid(struct fsnotify_iter_info *iter_info) /* Mark is just getting destroyed or created? */ if (!conn) continue; + if (!(conn->flags & FSNOTIFY_CONN_FLAG_HAS_FSID)) + continue; + /* Pairs with smp_wmb() in fsnotify_add_mark_list() */ + smp_rmb(); fsid = conn->fsid; if (WARN_ON_ONCE(!fsid.val[0] && !fsid.val[1])) continue; diff --git a/fs/notify/mark.c b/fs/notify/mark.c index 22acb0a79b53..e9d49191d39e 100644 --- a/fs/notify/mark.c +++ b/fs/notify/mark.c @@ -495,10 +495,13 @@ static int fsnotify_attach_connector_to_object(fsnotify_connp_t *connp, conn->type = type; conn->obj = connp; /* Cache fsid of filesystem containing the object */ - if (fsid) + if (fsid) { conn->fsid = *fsid; - else + conn->flags = FSNOTIFY_CONN_FLAG_HAS_FSID; + } else { conn->fsid.val[0] = conn->fsid.val[1] = 0; + conn->flags = 0; + } if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) inode = igrab(fsnotify_conn_inode(conn)); /* @@ -573,7 +576,12 @@ restart: if (err) return err; goto restart; - } else if (fsid && (conn->fsid.val[0] || conn->fsid.val[1]) && + } else if (fsid && !(conn->flags & FSNOTIFY_CONN_FLAG_HAS_FSID)) { + conn->fsid = *fsid; + /* Pairs with smp_rmb() in fanotify_get_fsid() */ + smp_wmb(); + conn->flags |= FSNOTIFY_CONN_FLAG_HAS_FSID; + } else if (fsid && (conn->flags & FSNOTIFY_CONN_FLAG_HAS_FSID) && (fsid->val[0] != conn->fsid.val[0] || fsid->val[1] != conn->fsid.val[1])) { /* diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h index 094b38f2d9a1..0f67cabbbec8 100644 --- a/include/linux/fsnotify_backend.h +++ b/include/linux/fsnotify_backend.h @@ -292,7 +292,9 @@ typedef struct fsnotify_mark_connector __rcu *fsnotify_connp_t; */ struct fsnotify_mark_connector { spinlock_t lock; - unsigned int type; /* Type of object [lock] */ + unsigned short type; /* Type of object [lock] */ +#define FSNOTIFY_CONN_FLAG_HAS_FSID 0x01 + unsigned short flags; /* flags [lock] */ __kernel_fsid_t fsid; /* fsid of filesystem containing object */ union { /* Object pointer [lock] */ -- cgit v1.2.3 From cf9513b45f6408f12e84fca6a7bf83f62ac9d1bc Mon Sep 17 00:00:00 2001 From: Xin Long Date: Mon, 17 Jun 2019 21:34:15 +0800 Subject: tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skb commit c3bcde026684c62d7a2b6f626dc7cf763833875c upstream. udp_tunnel(6)_xmit_skb() called by tipc_udp_xmit() expects a tunnel device to count packets on dev->tstats, a perpcu variable. However, TIPC is using udp tunnel with no tunnel device, and pass the lower dev, like veth device that only initializes dev->lstats(a perpcu variable) when creating it. Later iptunnel_xmit_stats() called by ip(6)tunnel_xmit() thinks the dev as a tunnel device, and uses dev->tstats instead of dev->lstats. tstats' each pointer points to a bigger struct than lstats, so when tstats->tx_bytes is increased, other percpu variable's members could be overwritten. syzbot has reported quite a few crashes due to fib_nh_common percpu member 'nhc_pcpu_rth_output' overwritten, call traces are like: BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 __mkroute_output net/ipv4/route.c:2332 [inline] ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564 ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393 __ip_route_output_key include/net/route.h:125 [inline] ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651 ip_route_output_key include/net/route.h:135 [inline] ... or: kasan: GPF could be caused by NULL-ptr deref or user memory access RIP: 0010:dst_dev_put+0x24/0x290 net/core/dst.c:168 rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:200 [inline] free_fib_info_rcu+0x2e1/0x490 net/ipv4/fib_semantics.c:217 __rcu_reclaim kernel/rcu/rcu.h:240 [inline] rcu_do_batch kernel/rcu/tree.c:2437 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline] rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697 ... The issue exists since tunnel stats update is moved to iptunnel_xmit by Commit 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()"), and here to fix it by passing a NULL tunnel dev to udp_tunnel(6)_xmit_skb so that the packets counting won't happen on dev->tstats. Reported-by: syzbot+9d4c12bfd45a58738d0a@syzkaller.appspotmail.com Reported-by: syzbot+a9e23ea2aa21044c2798@syzkaller.appspotmail.com Reported-by: syzbot+c4c4b2bb358bb936ad7e@syzkaller.appspotmail.com Reported-by: syzbot+0290d2290a607e035ba1@syzkaller.appspotmail.com Reported-by: syzbot+a43d8d4e7e8a7a9e149e@syzkaller.appspotmail.com Reported-by: syzbot+a47c5f4c6c00fc1ed16e@syzkaller.appspotmail.com Fixes: 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()") Signed-off-by: Xin Long Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/tipc/udp_media.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c index 4d85d71f16e2..c86f136e5962 100644 --- a/net/tipc/udp_media.c +++ b/net/tipc/udp_media.c @@ -176,7 +176,6 @@ static int tipc_udp_xmit(struct net *net, struct sk_buff *skb, goto tx_error; } - skb->dev = rt->dst.dev; ttl = ip4_dst_hoplimit(&rt->dst); udp_tunnel_xmit_skb(rt, ub->ubsock->sk, skb, src->ipv4.s_addr, dst->ipv4.s_addr, 0, ttl, 0, src->port, @@ -195,10 +194,9 @@ static int tipc_udp_xmit(struct net *net, struct sk_buff *skb, if (err) goto tx_error; ttl = ip6_dst_hoplimit(ndst); - err = udp_tunnel6_xmit_skb(ndst, ub->ubsock->sk, skb, - ndst->dev, &src->ipv6, - &dst->ipv6, 0, ttl, 0, src->port, - dst->port, false); + err = udp_tunnel6_xmit_skb(ndst, ub->ubsock->sk, skb, NULL, + &src->ipv6, &dst->ipv6, 0, ttl, 0, + src->port, dst->port, false); #endif } return err; -- cgit v1.2.3 From 25998210bb2c905d66590703ad11fdcb0cb3e55f Mon Sep 17 00:00:00 2001 From: Jean-Philippe Brucker Date: Fri, 24 May 2019 13:52:19 +0100 Subject: arm64: insn: Fix ldadd instruction encoding commit c5e2edeb01ae9ffbdde95bdcdb6d3614ba1eb195 upstream. GCC 8.1.0 reports that the ldadd instruction encoding, recently added to insn.c, doesn't match the mask and couldn't possibly be identified: linux/arch/arm64/include/asm/insn.h: In function 'aarch64_insn_is_ldadd': linux/arch/arm64/include/asm/insn.h:280:257: warning: bitwise comparison always evaluates to false [-Wtautological-compare] Bits [31:30] normally encode the size of the instruction (1 to 8 bytes) and the current instruction value only encodes the 4- and 8-byte variants. At the moment only the BPF JIT needs this instruction, and doesn't require the 1- and 2-byte variants, but to be consistent with our other ldr and str instruction encodings, clear the size field in the insn value. Fixes: 34b8ab091f9ef57a ("bpf, arm64: use more scalable stadd over ldxr / stxr loop in xadd") Acked-by: Daniel Borkmann Reported-by: Kuninori Morimoto Signed-off-by: Yoshihiro Shimoda Signed-off-by: Jean-Philippe Brucker Signed-off-by: Will Deacon Signed-off-by: Greg Kroah-Hartman --- arch/arm64/include/asm/insn.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index ec894de0ed4e..f71b84d9f294 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -277,7 +277,7 @@ __AARCH64_INSN_FUNCS(adrp, 0x9F000000, 0x90000000) __AARCH64_INSN_FUNCS(prfm, 0x3FC00000, 0x39800000) __AARCH64_INSN_FUNCS(prfm_lit, 0xFF000000, 0xD8000000) __AARCH64_INSN_FUNCS(str_reg, 0x3FE0EC00, 0x38206800) -__AARCH64_INSN_FUNCS(ldadd, 0x3F20FC00, 0xB8200000) +__AARCH64_INSN_FUNCS(ldadd, 0x3F20FC00, 0x38200000) __AARCH64_INSN_FUNCS(ldr_reg, 0x3FE0EC00, 0x38606800) __AARCH64_INSN_FUNCS(ldr_lit, 0xBF000000, 0x18000000) __AARCH64_INSN_FUNCS(ldrsw_lit, 0xFF000000, 0x98000000) -- cgit v1.2.3 From 8584aaf1c3262ca17d1e4a614ede9179ef462bb0 Mon Sep 17 00:00:00 2001 From: Greg Kroah-Hartman Date: Wed, 3 Jul 2019 13:13:45 +0200 Subject: Linux 5.1.16 --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Makefile b/Makefile index d7b3c8e3ff3e..46a0ae537182 100644 --- a/Makefile +++ b/Makefile @@ -1,7 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 VERSION = 5 PATCHLEVEL = 1 -SUBLEVEL = 15 +SUBLEVEL = 16 EXTRAVERSION = NAME = Shy Crocodile -- cgit v1.2.3