summaryrefslogtreecommitdiff
path: root/arch/powerpc/mm
AgeCommit message (Collapse)AuthorFilesLines
2018-02-16powerpc/pseries: Check for zero filled ibm,dynamic-memory propertyNathan Fontenot1-0/+8
Some versions of QEMU will produce an ibm,dynamic-reconfiguration-memory node with a ibm,dynamic-memory property that is zero-filled. This causes the drmem code to oops trying to parse this property. The fix for this is to validate that the property does contain LMB entries before trying to parse it and bail if the count is zero. Oops: Kernel access of bad area, sig: 11 [#1] DAR: 0000000000000010 NIP read_drconf_v1_cell+0x54/0x9c LR read_drconf_v1_cell+0x48/0x9c Call Trace: __param_initcall_debug+0x0/0x28 (unreliable) drmem_init+0x144/0x2f8 do_one_initcall+0x64/0x1d0 kernel_init_freeable+0x298/0x38c kernel_init+0x24/0x160 ret_from_kernel_thread+0x5c/0xb4 The ibm,dynamic-reconfiguration-memory device tree property generated that causes this: ibm,dynamic-reconfiguration-memory { ibm,lmb-size = <0x0 0x10000000>; ibm,memory-flags-mask = <0xff>; ibm,dynamic-memory = <0x0 0x0 0x0 0x0 0x0 0x0>; linux,phandle = <0x7e57eed8>; ibm,associativity-lookup-arrays = <0x1 0x4 0x0 0x0 0x0 0x0>; ibm,memory-preservation-time = <0x0>; }; Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Reviewed-by: Cyril Bur <cyrilbur@gmail.com> Tested-by: Daniel Black <daniel@linux.vnet.ibm.com> [mpe: Trim oops report] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-02-13powerpc/mm/hash64: Store the slot information at the right offset for hugetlbAneesh Kumar K.V4-11/+20
The hugetlb pte entries are at the PMD and PUD level, so we can't use PTRS_PER_PTE to find the second half of the page table. Use the right offset for PUD/PMD to get to the second half of the table. Fixes: bf9a95f9a648 ("powerpc: Free up four 64K PTE bits in 64K backed HPTE pages") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-02-13powerpc/mm: Fix crashes with 16G huge pagesAneesh Kumar K.V4-2/+6
To support memory keys, we moved the hash pte slot information to the second half of the page table. This was ok with PTE entries at level 4 (PTE page) and level 3 (PMD). We already allocate larger page table pages at those levels to accomodate extra details. For level 4 we already have the extra space which was used to track 4k hash page table entry details and at level 3 the extra space was allocated to track the THP details. With hugetlbfs PTE, we used this extra space at the PMD level to store the slot details. But we also support hugetlbfs PTE at PUD level for 16GB pages and PUD level page didn't allocate extra space. This resulted in memory corruption. Fix this by allocating extra space at PUD level when HUGETLB is enabled. Fixes: bf9a95f9a648 ("powerpc: Free up four 64K PTE bits in 64K backed HPTE pages") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-02-13powerpc/mm: Flush radix process translations when setting MMU typeAlexey Kardashevskiy1-0/+2
Radix guests do normally invalidate process-scoped translations when a new pid is allocated but migrated guests do not invalidate these so migrated guests crash sometime, especially easy to reproduce with migration happening within first 10 seconds after the guest boot start on the same machine. This adds the "Invalidate process-scoped translations" flush to fix radix guests migration. Fixes: 2ee13be34b13 ("KVM: PPC: Book3S HV: Update kvmppc_set_arch_compat() for ISA v3.00") Cc: stable@vger.kernel.org # v4.10+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Tested-by: Laurent Vivier <lvivier@redhat.com> Tested-by: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-02-08powerpc/mm/radix: Split linear mapping on hot-unplugBalbir Singh1-21/+74
This patch splits the linear mapping if the hot-unplug range is smaller than the mapping size. The code detects if the mapping needs to be split into a smaller size and if so, uses the stop machine infrastructure to clear the existing mapping and then remap the remaining range using a smaller page size. The code will skip any region of the mapping that overlaps with kernel text and warn about it once. We don't want to remove a mapping where the kernel text and the LMB we intend to remove overlap in the same TLB mapping as it may affect the currently executing code. I've tested these changes under a kvm guest with 2 vcpus, from a split mapping point of view, some of the caveats mentioned above applied to the testing I did. Fixes: 4b5d62ca17a1 ("powerpc/mm: add radix__remove_section_mapping()") Signed-off-by: Balbir Singh <bsingharora@gmail.com> [mpe: Tweak change log to match updated behaviour] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-02-08powerpc/64s/radix: Boot-time NULL pointer protection using a guard-PIDNicholas Piggin1-1/+20
This change restores and formalises the behaviour that access to NULL or other user addresses by the kernel during boot should fault rather than succeed and modify memory. This was inadvertently broken when fixing another bug, because it was previously not well defined and only worked by chance. powerpc/64s/radix uses high address bits to select an address space "quadrant", which determines which PID and LPID are used to translate the rest of the address (effective PID, effective LPID). The kernel mapping at 0xC... selects quadrant 3, which uses PID=0 and LPID=0. So the kernel page tables are installed in the PID 0 process table entry. An address at 0x0... selects quadrant 0, which uses PID=PIDR for translating the rest of the address (that is, it uses the value of the PIDR register as the effective PID). If PIDR=0, then the translation is performed with the PID 0 process table entry page tables. This is the kernel mapping, so we effectively get another copy of the kernel address space at 0. A NULL pointer access will access physical memory address 0. To prevent duplicating the kernel address space in quadrant 0, this patch allocates a guard PID containing no translations, and initializes PIDR with this during boot, before the MMU is switched on. Any kernel access to quadrant 0 will use this guard PID for translation and find no valid mappings, and therefore fault. After boot, this PID will be switchd away to user context PIDs, but those contain user mappings (and usually NULL pointer protection) rather than kernel mapping, which is much safer (and by design). It may be in future this is tightened further, which the guard PID could be used for. Commit 371b8044 ("powerpc/64s: Initialize ISAv3 MMU registers before setting partition table"), introduced this problem because it zeroes PIDR at boot. However previously the value was inherited from firmware or kexec, which is not robust and can be zero (e.g., mambo). Fixes: 371b80447ff3 ("powerpc/64s: Initialize ISAv3 MMU registers before setting partition table") Cc: stable@vger.kernel.org # v4.15+ Reported-by: Florian Weimer <fweimer@redhat.com> Tested-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-02-08powerpc/numa: Invalidate numa_cpu_lookup_table on cpu removeNathan Fontenot1-5/+0
When DLPAR removing a CPU, the unmapping of the cpu from a node in unmap_cpu_from_node() should also invalidate the CPUs entry in the numa_cpu_lookup_table. There is not a guarantee that on a subsequent DLPAR add of the CPU the associativity will be the same and thus could be in a different node. Invalidating the entry in the numa_cpu_lookup_table causes the associativity to be read from the device tree at the time of the add. The current behavior of not invalidating the CPUs entry in the numa_cpu_lookup_table can result in scenarios where the the topology layout of CPUs in the partition does not match the device tree or the topology reported by the HMC. This bug looks like it was introduced in 2004 in the commit titled "ppc64: cpu hotplug notifier for numa", which is 6b15e4e87e32 in the linux-fullhist tree. Hence tag it for all stable releases. Cc: stable@vger.kernel.org Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-02-06Merge branch 'linus' into sched/urgent, to resolve conflictsIngo Molnar25-406/+1473
Conflicts: arch/arm64/kernel/entry.S arch/x86/Kconfig include/linux/sched/mm.h kernel/fork.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06Merge tag 'libnvdimm-for-4.16' of ↵Linus Torvalds2-15/+13
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Ross Zwisler: - Require struct page by default for filesystem DAX to remove a number of surprising failure cases. This includes failures with direct I/O, gdb and fork(2). - Add support for the new Platform Capabilities Structure added to the NFIT in ACPI 6.2a. This new table tells us whether the platform supports flushing of CPU and memory controller caches on unexpected power loss events. - Revamp vmem_altmap and dev_pagemap handling to clean up code and better support future future PCI P2P uses. - Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has become out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL spec, and instead rely on the generic ND_CMD_CALL approach used by the two other IOCTL families, NVDIMM_FAMILY_{HPE,MSFT}. - Enhance nfit_test so we can test some of the new things added in version 1.6 of the DSM specification. This includes testing firmware download and simulating the Last Shutdown State (LSS) status. * tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (37 commits) libnvdimm, namespace: remove redundant initialization of 'nd_mapping' acpi, nfit: fix register dimm error handling libnvdimm, namespace: make min namespace size 4K tools/testing/nvdimm: force nfit_test to depend on instrumented modules libnvdimm/nfit_test: adding support for unit testing enable LSS status libnvdimm/nfit_test: add firmware download emulation nfit-test: Add platform cap support from ACPI 6.2a to test libnvdimm: expose platform persistence attribute for nd_region acpi: nfit: add persistent memory control flag for nd_region acpi: nfit: Add support for detect platform CPU cache flush on power loss device-dax: Fix trailing semicolon libnvdimm, btt: fix uninitialized err_lock dax: require 'struct page' by default for filesystem dax ext2: auto disable dax instead of failing mount ext4: auto disable dax instead of failing mount mm, dax: introduce pfn_t_special() mm: Fix devm_memremap_pages() collision handling mm: Fix memory size alignment in devm_memremap_pages_release() memremap: merge find_dev_pagemap into get_dev_pagemap memremap: change devm_memremap_pages interface to use struct dev_pagemap ...
2018-02-05powerpc, membarrier: Skip memory barrier in switch_mm()Mathieu Desnoyers1-0/+7
Allow PowerPC to skip the full memory barrier in switch_mm(), and only issue the barrier when scheduling into a task belonging to a process that has registered to use expedited private. Threads targeting the same VM but which belong to different thread groups is a tricky case. It has a few consequences: It turns out that we cannot rely on get_nr_threads(p) to count the number of threads using a VM. We can use (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1) instead to skip the synchronize_sched() for cases where the VM only has a single user, and that user only has a single thread. It also turns out that we cannot use for_each_thread() to set thread flags in all threads using a VM, as it only iterates on the thread group. Therefore, test the membarrier state variable directly rather than relying on thread flags. This means membarrier_register_private_expedited() needs to set the MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows private expedited membarrier commands to succeed. membarrier_arch_switch_mm() now tests for the MEMBARRIER_STATE_PRIVATE_EXPEDITED flag. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Andrew Hunter <ahh@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Avi Kivity <avi@scylladb.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Dave Watson <davejwatson@fb.com> Cc: David Sehr <sehr@google.com> Cc: Greg Hackmann <ghackmann@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Maged Michael <maged.michael@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-api@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/20180129202020.8515-3-mathieu.desnoyers@efficios.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-03Merge branch 'for-4.16/nfit' into libnvdimm-for-nextRoss Zwisler1-1/+6
2018-02-02Merge tag 'powerpc-4.16-1' of ↵Linus Torvalds23-367/+1455
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights: - Enable support for memory protection keys aka "pkeys" on Power7/8/9 when using the hash table MMU. - Extend our interrupt soft masking to support masking PMU interrupts as well as "normal" interrupts, and then use that to implement local_t for a ~4x speedup vs the current atomics-based implementation. - A new driver "ocxl" for "Open Coherent Accelerator Processor Interface (OpenCAPI)" devices. - Support for new device tree properties on PowerVM to describe hotpluggable memory and devices. - Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE to the 64-bit VDSO. - Freescale updates from Scott: fixes for CPM GPIO and an FSL PCI erratum workaround, plus a minor cleanup patch. As well as quite a lot of other changes all over the place, and small fixes and cleanups as always. Thanks to: Alan Modra, Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple, Andreas Schwab, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman Khandual, Anton Blanchard, Arnd Bergmann, Balbir Singh, Benjamin Herrenschmidt, Bhaktipriya Shridhar, Bryant G. Ly, Cédric Le Goater, Christophe Leroy, Christophe Lombard, Cyril Bur, David Gibson, Desnes A. Nunes do Rosario, Dmitry Torokhov, Frederic Barrat, Geert Uytterhoeven, Guilherme G. Piccoli, Gustavo A. R. Silva, Gustavo Romero, Ivan Mikhaylov, Joakim Tjernlund, Joe Perches, Josh Poimboeuf, Juan J. Alvarez, Julia Cartwright, Kamalesh Babulal, Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu Malaterre, Michael Bringmann, Michael Hanselmann, Michael Neuling, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Philippe Bergheaud, Ram Pai, Russell Currey, Santosh Sivaraj, Scott Wood, Seth Forshee, Simon Guo, Stewart Smith, Sukadev Bhattiprolu, Thiago Jung Bauermann, Vaibhav Jain, Vasyl Gomonovych" * tag 'powerpc-4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (199 commits) powerpc/mm/radix: Fix build error when RADIX_MMU=n macintosh/ams-input: Use true and false for boolean values macintosh: change some data types from int to bool powerpc/watchdog: Print the NIP in soft_nmi_interrupt() powerpc/watchdog: regs can't be null in soft_nmi_interrupt() powerpc/watchdog: Tweak watchdog printks powerpc/cell: Remove axonram driver rtc-opal: Fix handling of firmware error codes, prevent busy loops powerpc/mpc52xx_gpt: make use of raw_spinlock variants macintosh/adb: Properly mark continued kernel messages powerpc/pseries: Fix cpu hotplug crash with memoryless nodes powerpc/numa: Ensure nodes initialized for hotplug powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes powerpc/kernel: Block interrupts when updating TIDR powerpc/powernv/idoa: Remove unnecessary pcidev from pci_dn powerpc/mm/nohash: do not flush the entire mm when range is a single page powerpc/pseries: Add Initialization of VF Bars powerpc/pseries/pci: Associate PEs to VFs in configure SR-IOV powerpc/eeh: Add EEH notify resume sysfs powerpc/eeh: Add EEH operations to notify resume ...
2018-02-01mm/thp: remove pmd_huge_split_prepare()Aneesh Kumar K.V1-22/+0
Instead of marking the pmd ready for split, invalidate the pmd. This should take care of powerpc requirement. Only side effect is that we mark the pmd invalid early. This can result in us blocking access to the page a bit longer if we race against a thp split. [kirill.shutemov@linux.intel.com: rebased, dirty THP once] Link: http://lkml.kernel.org/r/20171213105756.69879-13-kirill.shutemov@linux.intel.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Daney <david.daney@cavium.com> Cc: David Miller <davem@davemloft.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitin Gupta <nitin.m.gupta@oracle.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01powerpc/mm: update pmdp_invalidate to return old pmd valueAneesh Kumar K.V1-2/+5
It's required to avoid losing dirty and accessed bits. Link: http://lkml.kernel.org/r/20171213105756.69879-7-kirill.shutemov@linux.intel.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-27powerpc/pseries: Fix cpu hotplug crash with memoryless nodesMichael Bringmann1-1/+3
On powerpc systems with shared configurations of CPUs and memory and memoryless nodes at boot, an event ordering problem was observed on a SLES12 build platforms with the hot-add of CPUs to the memoryless nodes. * The most common error occurred when the memory SLAB driver attempted to reference the memoryless node to which a CPU was being added before the kernel had finished initializing all of the data structures for the CPU and exited 'device_online' under DLPAR/hot-add. Normally the memoryless node would be initialized through the call path device_online ... arch_update_cpu_topology ... find_cpu_nid ... try_online_node. This patch ensures that the powerpc node will be initialized as early as possible, even if it was memoryless and CPU-less at the point when we are trying to hot-add a new CPU to it. Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-27powerpc/numa: Ensure nodes initialized for hotplugMichael Bringmann1-10/+37
This patch fixes some problems encountered at runtime with configurations that support memory-less nodes, or that hot-add CPUs into nodes that are memoryless during system execution after boot. The problems of interest include: * Nodes known to powerpc to be memoryless at boot, but to have CPUs in them are allowed to be 'possible' and 'online'. Memory allocations for those nodes are taken from another node that does have memory until and if memory is hot-added to the node. * Nodes which have no resources assigned at boot, but which may still be referenced subsequently by affinity or associativity attributes, are kept in the list of 'possible' nodes for powerpc. Hot-add of memory or CPUs to the system can reference these nodes and bring them online instead of redirecting the references to one of the set of nodes known to have memory at boot. Note that this software operates under the context of CPU hotplug. We are not doing memory hotplug in this code, but rather updating the kernel's CPU topology (i.e. arch_update_cpu_topology / numa_update_cpu_topology). We are initializing a node that may be used by CPUs or memory before it can be referenced as invalid by a CPU hotplug operation. CPU hotplug operations are protected by a range of APIs including cpu_maps_update_begin/cpu_maps_update_done, cpus_read/write_lock / cpus_read/write_unlock, device locks, and more. Memory hotplug operations, including try_online_node, are protected by mem_hotplug_begin/mem_hotplug_done, device locks, and more. In the case of CPUs being hot-added to a previously memoryless node, the try_online_node operation occurs wholly within the CPU locks with no overlap. Using HMC hot-add/hot-remove operations, we have been able to add and remove CPUs to any possible node without failures. HMC operations involve a degree self-serialization, though. Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-27powerpc/numa: Use ibm,max-associativity-domains to discover possible nodesMichael Bringmann1-3/+34
On powerpc systems which allow 'hot-add' of CPU or memory resources, it may occur that the new resources are to be inserted into nodes that were not used for these resources at bootup. In the kernel, any node that is used must be defined and initialized. These empty nodes may occur when, * Dedicated vs. shared resources. Shared resources require information such as the VPHN hcall for CPU assignment to nodes. Associativity decisions made based on dedicated resource rules, such as associativity properties in the device tree, may vary from decisions made using the values returned by the VPHN hcall. * memoryless nodes at boot. Nodes need to be defined as 'possible' at boot for operation with other code modules. Previously, the powerpc code would limit the set of possible nodes to those which have memory assigned at boot, and were thus online. Subsequent add/remove of CPUs or memory would only work with this subset of possible nodes. * memoryless nodes with CPUs at boot. Due to the previous restriction on nodes, nodes that had CPUs but no memory were being collapsed into other nodes that did have memory at boot. In practice this meant that the node assignment presented by the runtime kernel differed from the affinity and associativity attributes presented by the device tree or VPHN hcalls. Nodes that might be known to the pHyp were not 'possible' in the runtime kernel because they did not have memory at boot. This patch ensures that sufficient nodes are defined to support configuration requirements after boot, as well as at boot. This patch set fixes a couple of problems. * Nodes known to powerpc to be memoryless at boot, but to have CPUs in them are allowed to be 'possible' and 'online'. Memory allocations for those nodes are taken from another node that does have memory until and if memory is hot-added to the node. * Nodes which have no resources assigned at boot, but which may still be referenced subsequently by affinity or associativity attributes, are kept in the list of 'possible' nodes for powerpc. Hot-add of memory or CPUs to the system can reference these nodes and bring them online instead of redirecting to one of the set of nodes that were known to have memory at boot. This patch extracts the value of the lowest domain level (number of allocable resources) from the device tree property "ibm,max-associativity-domains" to use as the maximum number of nodes to setup as possibly available in the system. This new setting will override the instruction: nodes_and(node_possible_map, node_possible_map, node_online_map); presently seen in the function arch/powerpc/mm/numa.c:initmem_init(). If the "ibm,max-associativity-domains" property is not present at boot, no operation will be performed to define or enable additional nodes, or enable the above 'nodes_and()'. Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-27powerpc/mm/nohash: do not flush the entire mm when range is a single pageChristophe Leroy1-1/+4
Most of the time, flush_tlb_range() is called on single pages. At the time being, flush_tlb_range() inconditionnaly calls flush_tlb_mm() which flushes at least the entire PID pages and on older CPUs like 4xx or 8xx it flushes the entire TLB table. This patch calls flush_tlb_page() instead of flush_tlb_mm() when the range is a single page. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-21powerpc/radix: Remove trace_tlbie call from radix__flush_tlb_allMahesh Salgaonkar1-2/+0
radix__flush_tlb_all() is called only in kexec path in real mode and any tracepoints at this stage will make kexec to fail if enabled. To verify enable tlbie trace before kexec. $ echo 1 > /sys/kernel/debug/tracing/events/powerpc/tlbie/enable == kexec into new kernel and kexec fails. Fix this by not calling trace_tlbie from radix__flush_tlb_all(). Fixes: 0428491cba92 ("powerpc/mm: Trace tlbie(l) instructions") Cc: stable@vger.kernel.org # v4.13+ Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-21powerpc/pseries: Don't give a warning when HPT resizing isn't availableDavid Gibson1-1/+1
As of 438cc81a41 "powerpc/pseries: Automatically resize HPT for memory hot add/remove" when running on the pseries platform, we always attempt to use the PAPR extension to resize the hashed page table (HPT) when we add or remove memory. This is fine, but when the extension is available we'll give a harmless, but scary warning. This patch suppresses the warning in this case. It will still warn if the feature is supposed to be available, but didn't work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-21powerpc/hash: Skip non initialized page size in init_hpte_page_sizesAneesh Kumar K.V1-1/+1
One of the easiest way to test config with 4K HPTE is to disable 64K hardware page size like below. int __init htab_dt_scan_page_sizes(unsigned long node, size -= 3; prop += 3; base_idx = get_idx_from_shift(base_shift); - if (base_idx < 0) { + if (base_idx < 0 || base_idx == MMU_PAGE_64K) { /* skip the pte encoding also */ prop += lpnum * 2; size -= lpnum * 2; But then this results in error in other part of the code such as MPSS parsing where we look at 4K base page size and 64K actual page size support. This patch fix MPSS parsing by ignoring the actual page sizes marked unsupported. In reality this can happen only with a corrupt device tree. But it is good to tighten the error check. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-21Merge branch 'fixes' into nextMichael Ellerman1-1/+6
Merge our fixes branch from the 4.15 cycle. Unusually the fixes branch saw some significant features merged, notably the RFI flush patches, so we want the code in next to be tested against that, to avoid any surprises when the two are merged. There's also some other work on the panic handling that was reverted in fixes and we now want to do properly in next, which would conflict. And we also fix a few other minor merge conflicts.
2018-01-20powerpc/mm: Invalidate subpage_prot() system call on radix platformsAnshuman Khandual1-0/+3
Radix enabled platforms don't support subpage_prot() system calls. But at present the system call goes through without an error and fails later on while validating expected subpage accesses. Lets not allow the system call on powerpc radix platforms to begin with to prevent this confusion in user space. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: Enable pkey subsystemRam Pai1-9/+55
PAPR defines 'ibm,processor-storage-keys' property. It exports two values. The first value holds the number of data-access keys and the second holds the number of instruction-access keys. Due to a bug in the firmware, instruction-access keys is always reported as zero. However any key can be configured to disable data-access and/or disable execution-access. The inavailablity of the second value is not a big handicap, though it could have been used to determine if the platform supported disable-execution-access. Non-PAPR platforms do not define this property in the device tree yet. Fortunately power8 is the only released Non-PAPR platform that is supported. Here, we hardcode the number of supported pkey to 32, by consulting the PowerISA3.0 This patch calculates the number of keys supported by the platform. Also it determines the platform support for read/write/execution access support for pkeys. Signed-off-by: Ram Pai <linuxram@us.ibm.com> [mpe: Use a PVR check instead of CPU_FTR for execute. Restrict to Power7/8/9 for now until older CPUs are tested.] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: Deliver SEGV signal on pkey violationRam Pai1-12/+27
The value of the pkey, whose protection got violated, is made available in si_pkey field of the siginfo structure. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: introduce get_mm_addr_key() helperRam Pai1-0/+24
get_mm_addr_key() helper returns the pkey associated with an address corresponding to a given mm_struct. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: Handle exceptions caused by pkey violationRam Pai1-0/+22
Handle Data and Instruction exceptions caused by memory protection-key. The CPU will detect the key fault if the HPTE is already programmed with the key. However if the HPTE is not hashed, a key fault will not be detected by the hardware. The software will detect pkey violation in such a case. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: implementation for arch_vma_access_permitted()Ram Pai1-0/+34
This patch provides the implementation for arch_vma_access_permitted(). Returns true if the requested access is allowed by pkey associated with the vma. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: helper to validate key-access permissions of a pteRam Pai1-0/+28
helper function that checks if the read/write/execute is allowed on the pte. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: Program HPTE key protection bitsRam Pai1-0/+1
Map the PTE protection key bits to the HPTE key protection bits, while creating HPTE entries. Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: implementation for arch_override_mprotect_pkey()Ram Pai1-0/+36
arch independent code calls arch_override_mprotect_pkey() to return a pkey that best matches the requested protection. This patch provides the implementation. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: ability to associate pkey to a vmaRam Pai1-0/+8
arch-independent code expects the arch to map a pkey into the vma's protection bit setting. The patch provides that ability. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: introduce execute-only pkeyRam Pai1-0/+58
This patch provides the implementation of execute-only pkey. The architecture-independent layer expects the arch-dependent layer, to support the ability to create and enable a special key which has execute-only permission. Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: store and restore the pkey state across context switchesRam Pai1-1/+51
Store and restore the AMR, IAMR and UAMOR register state of the task before scheduling out and after scheduling in, respectively. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: ability to create execute-disabled pkeysRam Pai1-0/+16
powerpc has hardware support to disable execute on a pkey. This patch enables the ability to create execute-disabled keys. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: implementation for arch_set_user_pkey_access()Ram Pai1-0/+40
This patch provides the detailed implementation for a user to allocate a key and enable it in the hardware. It provides the plumbing, but it cannot be used till the system call is implemented. The next patch will do so. Reviewed-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: helper functions to initialize AMR, IAMR and UAMOR registersRam Pai1-0/+47
Introduce helper functions that can initialize the bits in the AMR, IAMR and UAMOR register; the bits that correspond to the given pkey. Reviewed-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: helper function to read, write AMR, IAMR, UAMOR registersRam Pai1-0/+36
Implements helper functions to read and write the key related registers; AMR, IAMR, UAMOR. AMR register tracks the read,write permission of a key IAMR register tracks the execute permission of a key UAMOR register enables and disables a key Acked-by: Balbir Singh <bsingharora@gmail.com> Reviewed-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: track allocation status of all pkeysRam Pai2-0/+42
Total 32 keys are available on power7 and above. However pkey 0,1 are reserved. So effectively we have 30 pkeys. On 4K kernels, we do not have 5 bits in the PTE to represent all the keys; we only have 3bits. Two of those keys are reserved; pkey 0 and pkey 1. So effectively we have 6 pkeys. This patch keeps track of reserved keys, allocated keys and keys that are currently free. Also it adds skeletal functions and macros, that the architecture-independent code expects to be available. Reviewed-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: initial pkey plumbingRam Pai3-0/+31
Basic plumbing to initialize the pkey system. Nothing is enabled yet. A later patch will enable it once all the infrastructure is in place. Signed-off-by: Ram Pai <linuxram@us.ibm.com> [mpe: Rework copyrights to use SPDX tags] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19powerpc/64: Rename soft_enabled to irq_soft_maskMadhavan Srinivasan1-1/+1
Rename the paca->soft_enabled to paca->irq_soft_mask as it is no longer used as a flag for interrupt state, but a mask. Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19powerpc/64: Add #defines for paca->soft_enabled flagsMadhavan Srinivasan1-1/+1
Two #defines IRQS_ENABLED and IRQS_DISABLED are added to be used when updating paca->soft_enabled. Replace the hardcoded values used when updating paca->soft_enabled with IRQ_(EN|DIS)ABLED #define. No logic change. Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-17powerpc/pseries: lift RTAS limit for hashNicholas Piggin1-3/+5
With the previous patch to switch to 64-bit mode after returning from RTAS and before doing any memory accesses, the RMA limit need not be clamped to 1GB to avoid RTAS bugs. Keep the 1GB limit for older firmware (although this is more of a kernel concern than RTAS), and remove it starting with POWER9. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-17powerpc/pseries: lift RTAS limit for radixNicholas Piggin1-17/+4
With the previous patch to switch to 64-bit mode after returning from RTAS and before doing any memory accesses, the RMA limit need not be clamped to 1GB to avoid RTAS bugs. Keep the 1GB limit for older firmware (although this is more of a kernel concern than RTAS), and remove it starting with POWER9. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-17powerpc/pseries: radix is not subject to RMA limit, remove itNicholas Piggin1-7/+4
The radix guest is not subject to the paravirtualized HPT VRMA limit, so remove that from ppc64_rma_size calculation for that platform. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-17powerpc/powernv: Remove real mode access limit for early allocationsNicholas Piggin2-23/+34
This removes the RMA limit on powernv platform, which constrains early allocations such as PACAs and stacks. There are still other restrictions that must be followed, such as bolted SLB limits, but real mode addressing has no constraints. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-17powerpc/64s: Improve local TLB flush for boot and MCE on POWER9Nicholas Piggin4-0/+177
There are several cases outside the normal address space management where a CPU's entire local TLB is to be flushed: 1. Booting the kernel, in case something has left stale entries in the TLB (e.g., kexec). 2. Machine check, to clean corrupted TLB entries. One other place where the TLB is flushed, is waking from deep idle states. The flush is a side-effect of calling ->cpu_restore with the intention of re-setting various SPRs. The flush itself is unnecessary because in the first case, the TLB should not acquire new corrupted TLB entries as part of sleep/wake (though they may be lost). This type of TLB flush is coded inflexibly, several times for each CPU type, and they have a number of problems with ISA v3.0B: - The current radix mode of the MMU is not taken into account, it is always done as a hash flushn For IS=2 (LPID-matching flush from host) and IS=3 with HV=0 (guest kernel flush), tlbie(l) is undefined if the R field does not match the current radix mode. - ISA v3.0B hash must flush the partition and process table caches as well. - ISA v3.0B radix must flush partition and process scoped translations, partition and process table caches, and also the page walk cache. So consolidate the flushing code and implement it in C and inline asm under the mm/ directory with the rest of the flush code. Add ISA v3.0B cases for radix and hash, and use the radix flush in radix environment. Provide a way for IS=2 (LPID flush) to specify the radix mode of the partition. Have KVM pass in the radix mode of the guest. Take out the flushes from early cputable/dt_cpu_ftrs detection hooks, and move it later in the boot process after, the MMU registers are set up and before relocation is first turned on. The TLB flush is no longer called when restoring from deep idle states. This was not be done as a separate step because booting secondaries uses the same cpu_restore as idle restore, which needs the TLB flush. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16powerpc: Use the TRAP macro whenever comparing a trap numberBenjamin Herrenschmidt1-1/+1
Trap numbers can have extra bits at the bottom that need to be filtered out. There are a few cases where we don't do that. It's possible that we got lucky but better safe than sorry. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAPChristophe Leroy1-1/+1
When CONFIG_SWAP is set, the TLB miss handlers have to also take into account _PAGE_ACCESSED flag. At the moment it is done by anding _PAGE_ACCESSED into _PAGE_PRESENT using 3 instructions. This patch uses APG for handling _PAGE_ACCESSED, allowing to just copy _PAGE_ACCESSED bit into APG field, hence reducing the action to a single instruction. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16powerpc/8xx: Remove _PAGE_USER and handle user access at PMD levelChristophe Leroy1-1/+1
As Linux kernel separates KERNEL and USER address spaces, there is therefore no need to flag USER access at page level. Today, the 8xx TLB handlers already handle user access in the L1 entry through Access Protection Groups, it is then natural to move the user access handling at PMD level once _PAGE_NA allows to handle PAGE_NONE protection without _PAGE_USER In the mean time, as we free up one bit in the PTE, we can use it to include SPS (page size flag) in the PTE and avoid handling it at every TLB miss hence removing special handling based on compiled page size. For _PAGE_EXEC, we rework it to use PP PTE bits, avoiding the copy of _PAGE_EXEC bit into the L1 entry. Unfortunatly we are not able to put it at the correct location as it conflicts with NA/RO/RW bits for data entries. Upper bits of APG in L1 entry overlap with PMD base address. In order to avoid having to filter that out, we set up all groups so that upper bits can have any value. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>