summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2019-03-06numa: make "nr_node_ids" unsigned intAlexey Dobriyan1-2/+2
Number of NUMA nodes can't be negative. This saves a few bytes on x86_64: add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238) Function old new delta hv_synic_alloc.cold 88 110 +22 prealloc_shrinker 260 262 +2 bootstrap 249 251 +2 sched_init_numa 1566 1567 +1 show_slab_objects 778 777 -1 s_show 1201 1200 -1 kmem_cache_init 346 345 -1 __alloc_workqueue_key 1146 1145 -1 mem_cgroup_css_alloc 1614 1612 -2 __do_sys_swapon 4702 4699 -3 __list_lru_init 655 651 -4 nic_probe 2379 2374 -5 store_user_store 118 111 -7 red_zone_store 106 99 -7 poison_store 106 99 -7 wq_numa_init 348 338 -10 __kmem_cache_empty 75 65 -10 task_numa_free 186 173 -13 merge_across_nodes_store 351 336 -15 irq_create_affinity_masks 1261 1246 -15 do_numa_crng_init 343 321 -22 task_numa_fault 4760 4737 -23 swapfile_init 179 156 -23 hv_synic_alloc 536 492 -44 apply_wqattrs_prepare 746 695 -51 Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm: swap: use mem_cgroup_is_root() instead of deferencing css->parentYang Shi1-1/+1
mem_cgroup_is_root() is the preferred API to check if memcg is root or not. Use it instead of deferencing css->parent. Link: http://lkml.kernel.org/r/1547232913-118148-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm: update get_user_pages_longterm to migrate pages allocated from CMA regionAneesh Kumar K.V2-1/+4
This patch updates get_user_pages_longterm to migrate pages allocated out of CMA region. This makes sure that we don't keep non-movable pages (due to page reference count) in the CMA area. This will be used by ppc64 in a later patch to avoid pinning pages in the CMA region. ppc64 uses CMA region for allocation of the hardware page table (hash page table) and not able to migrate pages out of CMA region results in page table allocation failures. One case where we hit this easy is when a guest using a VFIO passthrough device. VFIO locks all the guest's memory and if the guest memory is backed by CMA region, it becomes unmovable resulting in fragmenting the CMA and possibly preventing other guests from allocation a large enough hash page table. NOTE: We allocate the new page without using __GFP_THISNODE Link: http://lkml.kernel.org/r/20190114095438.32470-3-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm/cma: add PF flag to force non cma allocAneesh Kumar K.V2-8/+41
Patch series "mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region", v8. ppc64 uses the CMA area for the allocation of guest page table (hash page table). We won't be able to start guest if we fail to allocate hash page table. We have observed hash table allocation failure because we failed to migrate pages out of CMA region because they were pinned. This happen when we are using VFIO. VFIO on ppc64 pins the entire guest RAM. If the guest RAM pages get allocated out of CMA region, we won't be able to migrate those pages. The pages are also pinned for the lifetime of the guest. Currently we support migration of non-compound pages. With THP and with the addition of hugetlb migration we can end up allocating compound pages from CMA region. This patch series add support for migrating compound pages. This patch (of 4): Add PF_MEMALLOC_NOCMA which make sure any allocation in that context is marked non-movable and hence cannot be satisfied by CMA region. This is useful with get_user_pages_longterm where we want to take a page pin by migrating pages from CMA region. Marking the section PF_MEMALLOC_NOCMA ensures that we avoid unnecessary page migration later. Link: http://lkml.kernel.org/r/20190114095438.32470-2-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm: better document PG_reservedDavid Hildenbrand1-2/+31
The usage of PG_reserved and how PG_reserved pages are to be treated is buried deep down in different parts of the kernel. Let's shine some light onto these details by documenting current users and expected behavior. Especially, clarify on the "Some of them might not even exist" case. These are physical memory gaps that will never be dumped as they are not marked as IORESOURCE_SYSRAM. PG_reserved does in general not hinder anybody from dumping or swapping. In some cases, these pages will not be stored in the hibernation image. Link: http://lkml.kernel.org/r/20190114125903.24845-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: <yi.z.zhang@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm: rid swapoff of quadratic complexityVineeth Remanan Pillai2-1/+9
This patch was initially posted by Kelley Nielsen. Reposting the patch with all review comments addressed and with minor modifications and optimizations. Also, folding in the fixes offered by Hugh Dickins and Huang Ying. Tests were rerun and commit message updated with new results. try_to_unuse() is of quadratic complexity, with a lot of wasted effort. It unuses swap entries one by one, potentially iterating over all the page tables for all the processes in the system for each one. This new proposed implementation of try_to_unuse simplifies its complexity to linear. It iterates over the system's mms once, unusing all the affected entries as it walks each set of page tables. It also makes similar changes to shmem_unuse. Improvement swapoff was called on a swap partition containing about 6G of data, in a VM(8cpu, 16G RAM), and calls to unuse_pte_range() were counted. Present implementation....about 1200M calls(8min, avg 80% cpu util). Prototype.................about 9.0K calls(3min, avg 5% cpu util). Details In shmem_unuse(), iterate over the shmem_swaplist and, for each shmem_inode_info that contains a swap entry, pass it to shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(), iterate over its associated xarray, and store the index and value of each swap entry in an array for passing to shmem_swapin_page() outside of the RCU critical section. In try_to_unuse(), instead of iterating over the entries in the type and unusing them one by one, perhaps walking all the page tables for all the processes for each one, iterate over the mmlist, making one pass. Pass each mm to unuse_mm() to begin its page table walk, and during the walk, unuse all the ptes that have backing store in the swap type received by try_to_unuse(). After the walk, check the type for orphaned swap entries with find_next_to_unuse(), and remove them from the swap cache. If find_next_to_unuse() starts over at the beginning of the type, repeat the check of the shmem_swaplist and the walk a maximum of three times. Change unuse_mm() and the intervening walk functions down to unuse_pte_range() to take the type as a parameter, and to iterate over their entire range, calling the next function down on every iteration. In unuse_pte_range(), make a swap entry from each pte in the range using the passed in type. If it has backing store in the type, call swapin_readahead() to retrieve the page and pass it to unuse_pte(). Pass the count of pages_to_unuse down the page table walks in try_to_unuse(), and return from the walk when the desired number of pages has been swapped back in. Link: http://lkml.kernel.org/r/20190114153129.4852-2-vpillai@digitalocean.com Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com> Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm/hugetlb: add prot_modify_start/commit sequence for hugetlb updateAneesh Kumar K.V1-0/+20
Architectures like ppc64 require to do a conditional tlb flush based on the old and new value of pte. Follow the regular pte change protection sequence for hugetlb too. This allows the architectures to override the update sequence. Link: http://lkml.kernel.org/r/20190116085035.29729-5-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm: fix some typos in mm directoryWei Yang1-1/+1
No functional change. Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm, memcg: create mem_cgroup_from_seqChris Down1-0/+10
This is the start of a series of patches similar to my earlier DEFINE_MEMCG_MAX_OR_VAL work, but with less Macro Magic(tm). There are a bunch of places we go from seq_file to mem_cgroup, which currently requires manually getting the css, then getting the mem_cgroup from the css. It's in enough places now that having mem_cgroup_from_seq makes sense (and also makes the next patch a bit nicer). Link: http://lkml.kernel.org/r/20190124194050.GA31341@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06kernel: cgroup: add poll file operationJohannes Weiner1-0/+4
Cgroup has a standardized poll/notification mechanism for waking all pollers on all fds when a filesystem node changes. To allow polling for custom events, add a .poll callback that can override the default. This is in preparation for pollable cgroup pressure files which have per-fd trigger configurations. Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06fs: kernfs: add poll file operationJohannes Weiner1-0/+6
Patch series "psi: pressure stall monitors", v3. Android is adopting psi to detect and remedy memory pressure that results in stuttering and decreased responsiveness on mobile devices. Psi gives us the stall information, but because we're dealing with latencies in the millisecond range, periodically reading the pressure files to detect stalls in a timely fashion is not feasible. Psi also doesn't aggregate its averages at a high enough frequency right now. This patch series extends the psi interface such that users can configure sensitive latency thresholds and use poll() and friends to be notified when these are breached. As high-frequency aggregation is costly, it implements an aggregation method that is optimized for fast, short-interval averaging, and makes the aggregation frequency adaptive, such that high-frequency updates only happen while monitored stall events are actively occurring. With these patches applied, Android can monitor for, and ward off, mounting memory shortages before they cause problems for the user. For example, using memory stall monitors in userspace low memory killer daemon (lmkd) we can detect mounting pressure and kill less important processes before device becomes visibly sluggish. In our memory stress testing psi memory monitors produce roughly 10x less false positives compared to vmpressure signals. Having ability to specify multiple triggers for the same psi metric allows other parts of Android framework to monitor memory state of the device and act accordingly. The new interface is straightforward. The user opens one of the pressure files for writing and writes a trigger description into the file descriptor that defines the stall state - some or full, and the maximum stall time over a given window of time. E.g.: /* Signal when stall time exceeds 100ms of a 1s window */ char trigger[] = "full 100000 1000000"; fd = open("/proc/pressure/memory"); write(fd, trigger, sizeof(trigger)); while (poll() >= 0) { ... } close(fd); When the monitored stall state is entered, psi adapts its aggregation frequency according to what the configured time window requires in order to emit event signals in a timely fashion. Once the stalling subsides, aggregation reverts back to normal. The trigger is associated with the open file descriptor. To stop monitoring, the user only needs to close the file descriptor and the trigger is discarded. Patches 1-4 prepare the psi code for polling support. Patch 5 implements the adaptive polling logic, the pressure growth detection optimized for short intervals, and hooks up write() and poll() on the pressure files. The patches were developed in collaboration with Johannes Weiner. This patch (of 5): Kernfs has a standardized poll/notification mechanism for waking all pollers on all fds when a filesystem node changes. To allow polling for custom events, add a .poll callback that can override the default. This is in preparation for pollable cgroup pressure files which have per-fd trigger configurations. Link: http://lkml.kernel.org/r/20190124211518.244221-2-surenb@google.com Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm, compaction: capture a page under direct compactionMel Gorman2-1/+6
Compaction is inherently race-prone as a suitable page freed during compaction can be allocated by any parallel task. This patch uses a capture_control structure to isolate a page immediately when it is freed by a direct compactor in the slow path of the page allocator. The intent is to avoid redundant scanning. 5.0.0-rc1 5.0.0-rc1 selective-v3r17 capture-v3r19 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 2582.11 ( 0.00%) 2563.68 ( 0.71%) Amean fault-both-5 4500.26 ( 0.00%) 4233.52 ( 5.93%) Amean fault-both-7 5819.53 ( 0.00%) 6333.65 ( -8.83%) Amean fault-both-12 9321.18 ( 0.00%) 9759.38 ( -4.70%) Amean fault-both-18 9782.76 ( 0.00%) 10338.76 ( -5.68%) Amean fault-both-24 15272.81 ( 0.00%) 13379.55 * 12.40%* Amean fault-both-30 15121.34 ( 0.00%) 16158.25 ( -6.86%) Amean fault-both-32 18466.67 ( 0.00%) 18971.21 ( -2.73%) Latency is only moderately affected but the devil is in the details. A closer examination indicates that base page fault latency is reduced but latency of huge pages is increased as it takes creater care to succeed. Part of the "problem" is that allocation success rates are close to 100% even when under pressure and compaction gets harder 5.0.0-rc1 5.0.0-rc1 selective-v3r17 capture-v3r19 Percentage huge-3 96.70 ( 0.00%) 98.23 ( 1.58%) Percentage huge-5 96.99 ( 0.00%) 95.30 ( -1.75%) Percentage huge-7 94.19 ( 0.00%) 97.24 ( 3.24%) Percentage huge-12 94.95 ( 0.00%) 97.35 ( 2.53%) Percentage huge-18 96.74 ( 0.00%) 97.30 ( 0.58%) Percentage huge-24 97.07 ( 0.00%) 97.55 ( 0.50%) Percentage huge-30 95.69 ( 0.00%) 98.50 ( 2.95%) Percentage huge-32 96.70 ( 0.00%) 99.27 ( 2.65%) And scan rates are reduced as expected by 6% for the migration scanner and 29% for the free scanner indicating that there is less redundant work. Compaction migrate scanned 20815362 19573286 Compaction free scanned 16352612 11510663 [mgorman@techsingularity.net: remove redundant check] Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm, compaction: be selective about what pageblocks to clear skip hintsMel Gorman1-0/+2
Pageblock hints are cleared when compaction restarts or kswapd makes enough progress that it can sleep but it's over-eager in that the bit is cleared for migration sources with no LRU pages and migration targets with no free pages. As pageblock skip hint flushes are relatively rare and out-of-band with respect to kswapd, this patch makes a few more expensive checks to see if it's appropriate to even clear the bit. Every pageblock that is not cleared will avoid 512 pages being scanned unnecessarily on x86-64. The impact is variable with different workloads showing small differences in latency, success rates and scan rates. This is expected as clearing the hints is not that common but doing a small amount of work out-of-band to avoid a large amount of work in-band later is generally a good thing. Link: http://lkml.kernel.org/r/20190118175136.31341-22-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> [cai@lca.pw: no stuck in __reset_isolation_pfn()] Link: http://lkml.kernel.org/r/20190206034732.75687-1-cai@lca.pw Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm, compaction: use free lists to quickly locate a migration sourceMel Gorman1-0/+11
The migration scanner is a linear scan of a zone with a potentiall large search space. Furthermore, many pageblocks are unusable such as those filled with reserved pages or partially filled with pages that cannot migrate. These still get scanned in the common case of allocating a THP and the cost accumulates. The patch uses a partial search of the free lists to locate a migration source candidate that is marked as MOVABLE when allocating a THP. It prefers picking a block with a larger number of free pages already on the basis that there are fewer pages to migrate to free the entire block. The lowest PFN found during searches is tracked as the basis of the start for the linear search after the first search of the free list fails. After the search, the free list is shuffled so that the next search will not encounter the same page. If the search fails then the subsequent searches will be shorter and the linear scanner is used. If this search fails, or if the request is for a small or unmovable/reclaimable allocation then the linear scanner is still used. It is somewhat pointless to use the list search in those cases. Small free pages must be used for the search and there is no guarantee that movable pages are located within that block that are contiguous. 5.0.0-rc1 5.0.0-rc1 noboost-v3r10 findmig-v3r15 Amean fault-both-3 3771.41 ( 0.00%) 3390.40 ( 10.10%) Amean fault-both-5 5409.05 ( 0.00%) 5082.28 ( 6.04%) Amean fault-both-7 7040.74 ( 0.00%) 7012.51 ( 0.40%) Amean fault-both-12 11887.35 ( 0.00%) 11346.63 ( 4.55%) Amean fault-both-18 16718.19 ( 0.00%) 15324.19 ( 8.34%) Amean fault-both-24 21157.19 ( 0.00%) 16088.50 * 23.96%* Amean fault-both-30 21175.92 ( 0.00%) 18723.42 * 11.58%* Amean fault-both-32 21339.03 ( 0.00%) 18612.01 * 12.78%* 5.0.0-rc1 5.0.0-rc1 noboost-v3r10 findmig-v3r15 Percentage huge-3 86.50 ( 0.00%) 89.83 ( 3.85%) Percentage huge-5 92.52 ( 0.00%) 91.96 ( -0.61%) Percentage huge-7 92.44 ( 0.00%) 92.85 ( 0.44%) Percentage huge-12 92.98 ( 0.00%) 92.74 ( -0.25%) Percentage huge-18 91.70 ( 0.00%) 91.71 ( 0.02%) Percentage huge-24 91.59 ( 0.00%) 92.13 ( 0.60%) Percentage huge-30 90.14 ( 0.00%) 93.79 ( 4.04%) Percentage huge-32 90.03 ( 0.00%) 91.27 ( 1.37%) This shows an improvement in allocation latencies with similar allocation success rates. While not presented, there was a 31% reduction in migration scanning and a 8% reduction on system CPU usage. A 2-socket machine showed similar benefits. [mgorman@techsingularity.net: several fixes] Link: http://lkml.kernel.org/r/20190204120111.GL9565@techsingularity.net [vbabka@suse.cz: migrate block that was found-fast, some optimisations] Link: http://lkml.kernel.org/r/20190118175136.31341-10-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <Vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm: shuffle GFP_* flagsAlexey Dobriyan1-15/+15
GFP_KERNEL is one of the most used constant but on archs like arm with fixed length instruction some constants are more equal than the others. Constants with tightly packed bits can be injected directly into instruction stream: 0: e3a00d33 mov r0, #3264 ; 0xcc0 Others require multiple instructions or even loading out of instruction stream: 0: e3a000c0 mov r0, #192 ; 0xc0 4: e3400060 movt r0, #96 ; 0x60 Shuffle GFP_* flags so that GFP_KERNEL/GFP_ATOMIC + __GFP_ZERO bits are close to each other. Savings on arm configs are ~0.1%. Link: http://lkml.kernel.org/r/20190109201838.GA9140@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm/hugetlb: enable arch specific huge page size support for migrationAnshuman Khandual1-2/+13
Architectures like arm64 have HugeTLB page sizes which are different than generic sizes at PMD, PUD, PGD level and implemented via contiguous bits. At present these special size HugeTLB pages cannot be identified through macros like (PMD|PUD|PGDIR)_SHIFT and hence chosen not be migrated. Enabling migration support for these special HugeTLB page sizes along with the generic ones (PMD|PUD|PGD) would require identifying all of them on a given platform. A platform specific hook can precisely enumerate all huge page sizes supported for migration. Instead of comparing against standard huge page orders let hugetlb_migration_support() function call a platform hook arch_hugetlb_migration_support(). Default definition for the platform hook maintains existing semantics which checks standard huge page order. But an architecture can choose to override the default and provide support for a comprehensive set of huge page sizes. Link: http://lkml.kernel.org/r/1545121450-1663-4-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Steve Capper <steve.capper@arm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm/hugetlb: enable PUD level huge page migrationAnshuman Khandual1-1/+2
Architectures like arm64 have PUD level HugeTLB pages for certain configs (1GB huge page is PUD based on ARM64_4K_PAGES base page size) that can be enabled for migration. It can be achieved through checking for PUD_SHIFT order based HugeTLB pages during migration. Link: http://lkml.kernel.org/r/1545121450-1663-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Steve Capper <steve.capper@arm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm/hugetlb: distinguish between migratability and movabilityAnshuman Khandual1-0/+30
Patch series "arm64/mm: Enable HugeTLB migration", v4. This patch series enables HugeTLB migration support for all supported huge page sizes at all levels including contiguous bit implementation. Following HugeTLB migration support matrix has been enabled with this patch series. All permutations have been tested except for the 16GB. CONT PTE PMD CONT PMD PUD -------- --- -------- --- 4K: 64K 2M 32M 1G 16K: 2M 32M 1G 64K: 2M 512M 16G First the series adds migration support for PUD based huge pages. It then adds a platform specific hook to query an architecture if a given huge page size is supported for migration while also providing a default fallback option preserving the existing semantics which just checks for (PMD|PUD|PGDIR)_SHIFT macros. The last two patches enables HugeTLB migration on arm64 and subscribe to this new platform specific hook by defining an override. The second patch differentiates between movability and migratability aspects of huge pages and implements hugepage_movable_supported() which can then be used during allocation to decide whether to place the huge page in movable zone or not. This patch (of 5): During huge page allocation it's migratability is checked to determine if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE. But the movability aspect of the huge page could depend on other factors than just migratability. Movability in itself is a distinct property which should not be tied with migratability alone. This differentiates these two and implements an enhanced movability check which also considers huge page size to determine if it is feasible to be placed under a movable zone. At present it just checks for gigantic pages but going forward it can incorporate other enhanced checks. Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Steve Capper <steve.capper@arm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm: remove sysctl_extfrag_handler()Matthew Wilcox1-2/+0
sysctl_extfrag_handler() neglects to propagate the return value from proc_dointvec_minmax() to its caller. It's a wrapper that doesn't need to exist, so just use proc_dointvec_minmax() directly. Link: http://lkml.kernel.org/r/20190104032557.3056-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reported-by: Aditya Pakki <pakki001@umn.edu> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06memcg: localize memcg_kmem_enabled() checkShakeel Butt1-4/+33
Move the memcg_kmem_enabled() checks into memcg kmem charge/uncharge functions, so, the users don't have to explicitly check that condition. This is purely code cleanup patch without any functional change. Only the order of checks in memcg_charge_slab() can potentially be changed but the functionally it will be same. This should not matter as memcg_charge_slab() is not in the hot path. Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm: reuse only-pte-mapped KSM page in do_wp_page()Kirill Tkhai1-0/+7
Add an optimization for KSM pages almost in the same way that we have for ordinary anonymous pages. If there is a write fault in a page, which is mapped to an only pte, and it is not related to swap cache; the page may be reused without copying its content. [ Note that we do not consider PageSwapCache() pages at least for now, since we don't want to complicate __get_ksm_page(), which has nice optimization based on this (for the migration case). Currenly it is spinning on PageSwapCache() pages, waiting for when they have unfreezed counters (i.e., for the migration finish). But we don't want to make it also spinning on swap cache pages, which we try to reuse, since there is not a very high probability to reuse them. So, for now we do not consider PageSwapCache() pages at all. ] So in reuse_ksm_page() we check for 1) PageSwapCache() and 2) page_stable_node(), to skip a page, which KSM is currently trying to link to stable tree. Then we do page_ref_freeze() to prohibit KSM to merge one more page into the page, we are reusing. After that, nobody can refer to the reusing page: KSM skips !PageSwapCache() pages with zero refcount; and the protection against of all other participants is the same as for reused ordinary anon pages pte lock, page lock and mmap_sem. [akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s] Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Cc: Rik van Riel <riel@surriel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm: replace all open encodings for NUMA_NO_NODEAnshuman Khandual1-1/+1
Patch series "Replace all open encodings for NUMA_NO_NODE", v3. All these places for replacement were found by running the following grep patterns on the entire kernel code. Please let me know if this might have missed some instances. This might also have replaced some false positives. I will appreciate suggestions, inputs and review. 1. git grep "nid == -1" 2. git grep "node == -1" 3. git grep "nid = -1" 4. git grep "node = -1" This patch (of 2): At present there are multiple places where invalid node number is encoded as -1. Even though implicitly understood it is always better to have macros in there. Replace these open encodings for an invalid node number with the global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like 'invalid node' from various places redirecting them to a common definition. Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe] Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx] Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband] Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm: convert PG_balloon to PG_offlineDavid Hildenbrand2-8/+11
PG_balloon was introduced to implement page migration/compaction for pages inflated in virtio-balloon. Nowadays, it is only a marker that a page is part of virtio-balloon and therefore logically offline. We also want to make use of this flag in other balloon drivers - for inflated pages or when onlining a section but keeping some pages offline (e.g. used right now by XEN and Hyper-V via set_online_page_callback()). We are going to expose this flag to dump tools like makedumpfile. But instead of exposing PG_balloon, let's generalize the concept of marking pages as logically offline, so it can be reused for other purposes later on. Rename PG_balloon to PG_offline. This is an indicator that the page is logically offline, the content stale and that it should not be touched (e.g. a hypervisor would have to allocate backing storage in order for the guest to dump an unused page). We can then e.g. exclude such pages from dumps. We replace and reuse KPF_BALLOON (23), as this shouldn't really harm (and for now the semantics stay the same). In following patches, we will make use of this bit also in other balloon drivers. While at it, document PGTABLE. [akpm@linux-foundation.org: fix comment text, per David] Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Pankaj gupta <pagupta@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Christian Hansen <chansen3@cisco.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Dave Young <dyoung@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Juergen Gross <jgross@suse.com> Cc: Julien Freche <jfreche@vmware.com> Cc: Kairui Song <kasong@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Len Brown <len.brown@intel.com> Cc: Lianbo Jiang <lijiang@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xavier Deguillard <xdeguillard@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm: balloon: update comment about isolation/migration/compactionDavid Hildenbrand1-17/+9
Patch series "mm/kdump: allow to exclude pages that are logically offline" Right now, pages inflated as part of a balloon driver will be dumped by dump tools like makedumpfile. While XEN is able to check in the crash kernel whether a certain pfn is actuall backed by memory in the hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of virtio-balloon, hv-balloon and VMWare balloon inflated memory will essentially result in zero pages getting allocated by the hypervisor and the dump getting filled with this data. The allocation and reading of zero pages can directly be avoided if a dumping tool could know which pages only contain stale information not to be dumped. Also for XEN, calling into the kernel and asking the hypervisor if a pfn is backed can be avoided if the duming tool would skip such pages right from the beginning. Dumping tools have no idea whether a given page is part of a balloon driver and shall not be dumped. Esp. PG_reserved cannot be used for that purpose as all memory allocated during early boot is also PG_reserved, see discussion at [1]. So some other way of indication is required and a new page flag is frowned upon. We have PG_balloon (MAPCOUNT value), which is essentially unused now. I suggest renaming it to something more generic (PG_offline) to mark pages as logically offline. This flag can than e.g. also be used by virtio-mem in the future to mark subsections as offline. Or by other code that wants to put pages logically offline (e.g. later maybe poisoned pages that shall no longer be used). This series converts PG_balloon to PG_offline, allows dumping tools to query the value to detect such pages and marks pages in the hv-balloon and XEN balloon properly as PG_offline. Note that virtio-balloon already set pages to PG_balloon (and now PG_offline). Please note that this is also helpful for a problem we were seeing under Hyper-V: Dumping logically offline memory (pages kept fake offline while onlining a section via online_page_callback) would under some condicions result in a kernel panic when dumping them. As I don't have access to neither XEN nor Hyper-V nor VMWare installations, this was only tested with the virtio-balloon and pages were properly skipped when dumping. I'll also attach the makedumpfile patch to this series. [1] https://lkml.org/lkml/2018/7/20/566 This patch (of 8): Commit b1123ea6d3b3 ("mm: balloon: use general non-lru movable page feature") reworked balloon handling to make use of the general non-lru movable page feature. The big comment block in balloon_compaction.h contains quite some outdated information. Let's fix this. Link: http://lkml.kernel.org/r/20181119101616.8901-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Christian Hansen <chansen3@cisco.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juergen Gross <jgross@suse.com> Cc: Julien Freche <jfreche@vmware.com> Cc: Kairui Song <kasong@redhat.com> Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Len Brown <len.brown@intel.com> Cc: Lianbo Jiang <lijiang@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: Nadav Amit <namit@vmware.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Pankaj gupta <pagupta@redhat.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xavier Deguillard <xdeguillard@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06mm/page_alloc.c: memory hotplug: free pages as higher orderArun KS1-1/+1
When freeing pages are done with higher order, time spent on coalescing pages by buddy allocator can be reduced. With section size of 256MB, hot add latency of a single section shows improvement from 50-60 ms to less than 1 ms, hence improving the hot add latency by 60 times. Modify external providers of online callback to align with the change. [arunks@codeaurora.org: v11] Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org [akpm@linux-foundation.org: remove unused local, per Arun] [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar] [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch] [arunks@codeaurora.org: v8] Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org [arunks@codeaurora.org: v9] Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mathieu Malaterre <malat@debian.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06include/linux/slub_def.h: comment fixesTobin C. Harding1-6/+6
Capitialize comment string, use C89 comment style, correct grammar/punctuation in comments. Link: http://lkml.kernel.org/r/20190204005713.9463-2-tobin@kernel.org Link: http://lkml.kernel.org/r/20190204005713.9463-3-tobin@kernel.org Link: http://lkml.kernel.org/r/20190204005713.9463-4-tobin@kernel.org Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06kasan: fix kasan_check_read/write definitionsArnd Bergmann1-1/+1
Building little-endian allmodconfig kernels on arm64 started failing with the generated atomic.h implementation, since we now try to call kasan helpers from the EFI stub: aarch64-linux-gnu-ld: drivers/firmware/efi/libstub/arm-stub.stub.o: in function `atomic_set': include/generated/atomic-instrumented.h:44: undefined reference to `__efistub_kasan_check_write' I suspect that we get similar problems in other files that explicitly disable KASAN for some reason but call atomic_t based helper functions. We can fix this by checking the predefined __SANITIZE_ADDRESS__ macro that the compiler sets instead of checking CONFIG_KASAN, but this in turn requires a small hack in mm/kasan/common.c so we do see the extern declaration there instead of the inline function. Link: http://lkml.kernel.org/r/20181211133453.2835077-1-arnd@arndb.de Fixes: b1864b828644 ("locking/atomics: build atomic headers as required") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reported-by: Anders Roxell <anders.roxell@linaro.org> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au>, Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05Merge tag 's390-5.1-1' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: - A copy of Arnds compat wrapper generation series - Pass information about the KVM guest to the host in form the control program code and the control program version code - Map IOV resources to support PCI physical functions on s390 - Add vector load and store alignment hints to improve performance - Use the "jdd" constraint with gcc 9 to make jump labels working again - Remove amode workaround for old z/VM releases from the DCSS code - Add support for in-kernel performance measurements using the CPU measurement counter facility - Introduce a new PMU device cpum_cf_diag to capture counters and store thenn as event raw data. - Bug fixes and cleanups * tag 's390-5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (54 commits) Revert "s390/cpum_cf: Add kernel message exaplanations" s390/dasd: fix read device characteristic with CONFIG_VMAP_STACK=y s390/suspend: fix prefix register reset in swsusp_arch_resume s390: warn about clearing als implied facilities s390: allow overriding facilities via command line s390: clean up redundant facilities list setup s390/als: remove duplicated in-place implementation of stfle s390/cio: Use cpa range elsewhere within vfio-ccw s390/cio: Fix vfio-ccw handling of recursive TICs s390: vfio_ap: link the vfio_ap devices to the vfio_ap bus subsystem s390/cpum_cf: Handle EBUSY return code from CPU counter facility reservation s390/cpum_cf: Add kernel message exaplanations s390/cpum_cf_diag: Add support for s390 counter facility diagnostic trace s390/cpum_cf: add ctr_stcctm() function s390/cpum_cf: move common functions into a separate file s390/cpum_cf: introduce kernel_cpumcf_avail() function s390/cpu_mf: replace stcctm5() with the stcctm() function s390/cpu_mf: add store cpu counter multiple instruction support s390/cpum_cf: Add minimal in-kernel interface for counter measurements s390/cpum_cf: introduce kernel_cpumcf_alert() to obtain measurement alerts ...
2019-03-05Merge branch 'linus' of ↵Linus Torvalds2-9/+2
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto update from Herbert Xu: "API: - Add helper for simple skcipher modes. - Add helper to register multiple templates. - Set CRYPTO_TFM_NEED_KEY when setkey fails. - Require neither or both of export/import in shash. - AEAD decryption test vectors are now generated from encryption ones. - New option CONFIG_CRYPTO_MANAGER_EXTRA_TESTS that includes random fuzzing. Algorithms: - Conversions to skcipher and helper for many templates. - Add more test vectors for nhpoly1305 and adiantum. Drivers: - Add crypto4xx prng support. - Add xcbc/cmac/ecb support in caam. - Add AES support for Exynos5433 in s5p. - Remove sha384/sha512 from artpec7 as hardware cannot do partial hash" [ There is a merge of the Freescale SoC tree in order to pull in changes required by patches to the caam/qi2 driver. ] * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (174 commits) crypto: s5p - add AES support for Exynos5433 dt-bindings: crypto: document Exynos5433 SlimSSS crypto: crypto4xx - add missing of_node_put after of_device_is_available crypto: cavium/zip - fix collision with generic cra_driver_name crypto: af_alg - use struct_size() in sock_kfree_s() crypto: caam - remove redundant likely/unlikely annotation crypto: s5p - update iv after AES-CBC op end crypto: x86/poly1305 - Clear key material from stack in SSE2 variant crypto: caam - generate hash keys in-place crypto: caam - fix DMA mapping xcbc key twice crypto: caam - fix hash context DMA unmap size hwrng: bcm2835 - fix probe as platform device crypto: s5p-sss - Use AES_BLOCK_SIZE define instead of number crypto: stm32 - drop pointless static qualifier in stm32_hash_remove() crypto: chelsio - Fixed Traffic Stall crypto: marvell - Remove set but not used variable 'ivsize' crypto: ccp - Update driver messages to remove some confusion crypto: adiantum - add 1536 and 4096-byte test vectors crypto: nhpoly1305 - add a test vector with len % 16 != 0 crypto: arm/aes-ce - update IV after partial final CTR block ...
2019-03-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds46-226/+1056
Pull networking updates from David Miller: "Here we go, another merge window full of networking and #ebpf changes: 1) Snoop DHCPACKS in batman-adv to learn MAC/IP pairs in the DHCP range without dealing with floods of ARP traffic, from Linus Lüssing. 2) Throttle buffered multicast packet transmission in mt76, from Felix Fietkau. 3) Support adaptive interrupt moderation in ice, from Brett Creeley. 4) A lot of struct_size conversions, from Gustavo A. R. Silva. 5) Add peek/push/pop commands to bpftool, as well as bash completion, from Stanislav Fomichev. 6) Optimize sk_msg_clone(), from Vakul Garg. 7) Add SO_BINDTOIFINDEX, from David Herrmann. 8) Be more conservative with local resends due to local congestion, from Yuchung Cheng. 9) Allow vetoing of unsupported VXLAN FDBs, from Petr Machata. 10) Add health buffer support to devlink, from Eran Ben Elisha. 11) Add TXQ scheduling API to mac80211, from Toke Høiland-Jørgensen. 12) Add statistics to basic packet scheduler filter, from Cong Wang. 13) Add GRE tunnel support for mlxsw Spectrum-2, from Nir Dotan. 14) Lots of new IP tunneling forwarding tests, also from Nir Dotan. 15) Add 3ad stats to bonding, from Nikolay Aleksandrov. 16) Lots of probing improvements for bpftool, from Quentin Monnet. 17) Various nfp drive #ebpf JIT improvements from Jakub Kicinski. 18) Allow #ebpf programs to access gso_segs from skb shared info, from Eric Dumazet. 19) Add sock_diag support for AF_XDP sockets, from Björn Töpel. 20) Support 22260 iwlwifi devices, from Luca Coelho. 21) Use rbtree for ipv6 defragmentation, from Peter Oskolkov. 22) Add JMP32 instruction class support to #ebpf, from Jiong Wang. 23) Add spinlock support to #ebpf, from Alexei Starovoitov. 24) Support 256-bit keys and TLS 1.3 in ktls, from Dave Watson. 25) Add device infomation API to devlink, from Jakub Kicinski. 26) Add new timestamping socket options which are y2038 safe, from Deepa Dinamani. 27) Add RX checksum offloading for various sh_eth chips, from Sergei Shtylyov. 28) Flow offload infrastructure, from Pablo Neira Ayuso. 29) Numerous cleanups, improvements, and bug fixes to the PHY layer and many drivers from Heiner Kallweit. 30) Lots of changes to try and make packet scheduler classifiers run lockless as much as possible, from Vlad Buslov. 31) Support BCM957504 chip in bnxt_en driver, from Erik Burrows. 32) Add concurrency tests to tc-tests infrastructure, from Vlad Buslov. 33) Add hwmon support to aquantia, from Heiner Kallweit. 34) Allow 64-bit values for SO_MAX_PACING_RATE, from Eric Dumazet. And I would be remiss if I didn't thank the various major networking subsystem maintainers for integrating much of this work before I even saw it. Alexei Starovoitov, Daniel Borkmann, Pablo Neira Ayuso, Johannes Berg, Kalle Valo, and many others. Thank you!" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2207 commits) net/sched: avoid unused-label warning net: ignore sysctl_devconf_inherit_init_net without SYSCTL phy: mdio-mux: fix Kconfig dependencies net: phy: use phy_modify_mmd_changed in genphy_c45_an_config_aneg net: dsa: mv88e6xxx: add call to mv88e6xxx_ports_cmode_init to probe for new DSA framework selftest/net: Remove duplicate header sky2: Disable MSI on Dell Inspiron 1545 and Gateway P-79 net/mlx5e: Update tx reporter status in case channels were successfully opened devlink: Add support for direct reporter health state update devlink: Update reporter state to error even if recover aborted sctp: call iov_iter_revert() after sending ABORT team: Free BPF filter when unregistering netdev ip6mr: Do not call __IP6_INC_STATS() from preemptible context isdn: mISDN: Fix potential NULL pointer dereference of kzalloc net: dsa: mv88e6xxx: support in-band signalling on SGMII ports with external PHYs cxgb4/chtls: Prefix adapter flags with CXGB4 net-sysfs: Switch to bitmap_zalloc() mellanox: Switch to bitmap_zalloc() bpf: add test cases for non-pointer sanitiation logic mlxsw: i2c: Extend initialization by querying resources data ...
2019-03-05Merge tag 'leds-for-5.1-rc1' of ↵Linus Torvalds1-0/+13
git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds Pull LED updates from Jacek Anaszewski: - finalize previously announced support for initialization of pattern triggers from Device Tree - fix for null deref on firmware load failure in leds-lp55xx-common.c * tag 'leds-for-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds: leds: lp55xx: fix null deref on firmware load failure leds: trigger: timer: Add initialization from Device Tree leds: trigger: oneshot: Add initialization from Device Tree leds: trigger: pattern: Add pattern initialization from Device Tree leds: Add helper for getting default pattern from Device Tree dt-bindings: leds: Add pattern initialization from Device Tree
2019-03-05Merge tag 'spi-v5.1' of ↵Linus Torvalds4-6/+51
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull spi updates from Mark Brown: "A fairly quiet release for SPI, the biggest thing is the conversion to use GPIO descriptors which is now 90% done but still needs some stragglers converting. Summary: - Support for inter-word delays - Conversion of the core and most drivers to use GPIO descriptors for GPIO controlled chip selects - New drivers for NXP FlexSPI and QuadSPI, SiFive and Spreadtrum" * tag 'spi-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (104 commits) spi: sh-msiof: Restrict bits per word to 8/16/24/32 on R-Car Gen2/3 spi: sifive: Remove redundant dev_err call in sifive_spi_probe() spi: sifive: Remove spi_master_put in sifive_spi_remove() spi: spi-gpio: fix SPI_CS_HIGH capability spi: pxa2xx: Setup maximum supported DMA transfer length spi: sifive: Add driver for the SiFive SPI controller spi: sifive: Add DT documentation for SiFive SPI controller spi: sprd: Add a prefix for SPI DMA channel macros spi: sprd: spi: sprd: Add DMA mode support dt-bindings: spi: Add the DMA properties for the SPI dma mode spi: sprd: Add the SPI irq function for the SPI DMA mode dt-bindings: spi: imx: Add an entry for the i.MX8QM compatible spi: use gpio[d]_set_value_cansleep for setting chipselect GPIO spi: gpio: Advertise support for SPI_CS_HIGH spi: sh-msiof: Replace spi_master by spi_controller spi: sh-hspi: Replace spi_master by spi_controller spi: rspi: Replace spi_master by spi_controller spi: atmel-quadspi: add support for sam9x60 qspi controller dt-bindings: spi: atmel-quadspi: QuadSPI driver for Microchip SAM9X60 spi: atmel-quadspi: add support for named peripheral clock ...
2019-03-05Merge tag 'regulator-v5.1' of ↵Linus Torvalds4-35/+26
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator updates from Mark Brown: "The bulk of the standout changes in this release are cleanups, with the core work being a combination of factoring out common code into helpers and the completion of the conversion of the core to use GPIO descriptors. Summary: - Addition of helper functions for current limits and conversion of drivers to use them by Axel Lin. - Lots and lots of cleanups from Axel Lin. - Conversion of the core to use GPIO descriptors rather than numbers by Linus Walleij. - New drivers for Maxim MAX77650 and ROHM BD70528" * tag 'regulator-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (131 commits) regulator: mc13xxx: Constify regulator_ops variables regulator: palmas: Constify palmas_smps_ramp_delay array regulator: wm831x-dcdc: Convert to use regulator_set/get_current_limit_regmap regulator: pv88090: Convert to use regulator_set/get_current_limit_regmap regulator: pv88080: Convert to use regulator_set/get_current_limit_regmap regulator: pv88060: Convert to use regulator_set/get_current_limit_regmap regulator: max77650: Convert to use regulator_set/get_current_limit_regmap regulator: lp873x: Convert to use regulator_set/get_current_limit_regmap regulator: lp872x: Convert to use regulator_set/get_current_limit_regmap regulator: da9210: Convert to use regulator_set/get_current_limit_regmap regulator: da9055: Convert to use regulator_set/get_current_limit_regmap regulator: core: Add set/get_current_limit helpers for regmap users regulator: Fix comment for csel_reg and csel_mask regulator: stm32-vrefbuf: add power management support regulator: 88pm8607: Remove unused fields from struct pm8607_regulator_info regulator: 88pm8607: Simplify pm8607_list_voltage implementation regulator: cpcap: Constify omap4_regulators and xoom_regulators regulator: cpcap: Remove unused vsel_shift from struct cpcap_regulator dt-bindings: regulator: tps65218: rectify units of LS3 dt-bindings: regulator: add LS2 load switch documentation ...
2019-03-05Merge tag 'regmap-v5.1' of ↵Linus Torvalds1-0/+31
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap Pull regmap updates from Mark Brown: "There are only two changes here: - fix for conflicting attributes on the rbtree node structure - implementation of main status register support in the interrupt code which supports chips that have a register to cut down on the number of per-interrupt status registers that need to be checked when handling interrupts" * tag 'regmap-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: regmap: Remove attribute packed from struct 'regcache_rbtree_node' regmap: regmap-irq: Add main status register support
2019-03-05Merge tag 'mmc-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmcLinus Torvalds4-8/+15
Pull MMC updates from Ulf Hansson: "MMC core: - Fixup max_discard/trim calculations - Announce SD specs greater than 4.0 - Add discard support for SD cards - Don't do retries for CMD6 (SWITCH command) - Various cleanups and re-structuring MMC host: - cqhci: * Add maintainers for eMMC CQHCI driver - sdhci: * Consolidate WP GPIO code * Add ADMA3 DMA support for V4 enabled host * Fixup card detect support in pci-o2micro driver * Add support for CMDQ and SDMMC pads auto-calibration in tegra driver * Add DCMD support and CMDQ support, support for i.MX6ULL variant, fixup HS400 timing issue and add HS400_ES support for i.MX8QXP to esdhc-imx driver * Avoid CRC errors by adjusting settings to speed mode and fixup card initialization for high speed mode in renesas_sdhi * Fixup timeout settings for omap * Enable 8 bits bus-width support in atmel-mci * Convert some legacy code in jz4740 driver to use modern APIs * Send a CMD12 to clear DPSM at errors for STM32 sdmmc mmci driver" * tag 'mmc-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (69 commits) mmc:fix a bug when max_discard is 0 mmc: core: Add a debug print when the card may have been replaced mmc: core: Add sd discard timeout mmc: core: Add discard support to sd mmc: sdhci-esdhc-imx: clear the HALT bit when enable CQE mmc: core: do not retry CMD6 in __mmc_switch() mmc: core: Convert mmc_align_data_size() into an SDIO specific function mmc: core: Move mmc_of_parse_voltage() to host.c mmc: core: Convert mmc_regulator_get_ocrmask() to static mmc: core: Move regulator helpers to separate file mmc: of_mmc_spi: Convert to mmc_of_parse_voltage() mmc: core: Drop retries as in-parameter to mmc_wait_for_app_cmd() mmc: core: Convert mmc_wait_for_app_cmd() to static mmc: renesas_sdhi: Change HW adjustment register according to speed mode mmc: mmci: Send a CMD12 to clear the DPSM at errors mmc: sdhci-xenon: Fixup already marked switch fall-through mmc: sdhci-tegra: drop ->get_ro() implementation mmc: sdhci-omap: drop ->get_ro() implementation mmc: sdhci: use WP GPIO in sdhci_check_ro() mmc: wmt-sdmmc: Drop unused include ...
2019-03-05Merge tag 'mtd/for-5.1' of git://git.infradead.org/linux-mtdLinus Torvalds2-21/+21
Pull MTD updates from Boris Brezillon: "Core MTD changes: - Use struct_size() where appropriate - mtd_{read,write}() as wrappers around mtd_{read,write}_oob() - Fix misuse of PTR_ERR() in docg3 - Coding style improvements in mtdcore.c SPI NOR changes: Core changes: - Add support of octal mode I/O transfer - Add a bunch of SPI NOR entries to the flash_info table SPI NOR controller driver changes: - cadence-quadspi: * Add support for Octal SPI controller * write upto 8-bytes data in STIG mode - mtk-quadspi: * rename config to a common one * add SNOR_HWCAPS_READ to spi_nor_hwcaps mask - Add Tudor as SPI-NOR co-maintainer NAND changes: NAND core changes: - Fourth batch of fixes/cleanup to the raw NAND core impacting various controller drivers (Sunxi, Marvell, MTK, TMIO, OMAP2). - Check the return code of nand_reset() and nand_readid_op(). - Remove ->legacy.erase and single_erase(). - Simplify the locking. - Several implicit fall through annotations. Raw NAND controllers drivers changes: - Fix various possible object reference leaks (MTK, JZ4780, Atmel) - ST: * Add support for STM32 FMC2 NAND flash controller - Meson: * Add support for Amlogic NAND flash controller - Denali: * Several cleanup patches - Sunxi: * Several cleanup patches - FSMC: * Disable NAND on remove() * Reset NAND timings on resume() SPI-NAND drivers changes: - Toshiba: * Add support for all Toshiba products. - Macronix: * Fix ECC status read. - Gigadevice: * Add support for GD5F1GQ4UExxG" * tag 'mtd/for-5.1' of git://git.infradead.org/linux-mtd: (64 commits) mtd: spi-nor: Fix wrong abbreviation HWCPAS mtd: spi-nor: cadence-quadspi: fix spelling mistake: "Couldnt't" -> "Couldn't" mtd: spi-nor: Add support for en25qh64 mtd: spi-nor: Add support for MX25V8035F mtd: spi-nor: Add support for EN25Q80A mtd: spi-nor: cadence-quadspi: Add support for Octal SPI controller dt-bindings: cadence-quadspi: Add new compatible for AM654 SoC mtd: spi-nor: split s25fl128s into s25fl128s0 and s25fl128s1 mtd: spi-nor: cadence-quadspi: write upto 8-bytes data in STIG mode mtd: spi-nor: Add support for mx25u3235f mtd: rawnand: denali_dt: remove single anonymous clock support mtd: rawnand: mtk: fix possible object reference leak mtd: rawnand: jz4780: fix possible object reference leak mtd: rawnand: atmel: fix possible object reference leak mtd: rawnand: fsmc: Disable NAND on remove() mtd: rawnand: fsmc: Reset NAND timings on resume() mtd: spinand: Add support for GigaDevice GD5F1GQ4UExxG mtd: rawnand: denali: remove unused dma_addr field from denali_nand_info mtd: rawnand: denali: remove unused function argument 'raw' mtd: rawnand: denali: remove unneeded denali_reset_irq() call ...
2019-03-05Merge tag 'vfio-v5.1-rc1' of git://github.com/awilliam/linux-vfioLinus Torvalds1-1/+1
Pull VFIO updates from Alex Williamson: - Switch mdev to generic UUID API (Andy Shevchenko) - Fixup platform reset include paths (Masahiro Yamada) - Fix usage of MINORMASK (Chengguang Xu) - Remove noise from duplicate spapr table unsets (Alexey Kardashevskiy) - Restore device state after PM reset (Alex Williamson) - Ensure memory translation enabled for PCI ROM access (Eric Auger) * tag 'vfio-v5.1-rc1' of git://github.com/awilliam/linux-vfio: vfio_pci: Enable memory accesses before calling pci_map_rom vfio/pci: Restore device state on PM transition vfio/spapr_tce: Skip unsetting already unset table samples/vfio-mdev/mtty: expand minor range when registering chrdev region samples/vfio-mdev/mdpy: expand minor range when registering chrdev region samples/vfio-mdev/mbochs: expand minor range when registering chrdev region vfio: expand minor range when registering chrdev region vfio: platform: reset: fix up include directives to remove ccflags-y vfio-mdev: Switch to use new generic UUID API
2019-03-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-7/+1
2019-03-04aio: simplify - and fix - fget/fput for io_submit()Linus Torvalds1-1/+7
Al Viro root-caused a race where the IOCB_CMD_POLL handling of fget/fput() could cause us to access the file pointer after it had already been freed: "In more details - normally IOCB_CMD_POLL handling looks so: 1) io_submit(2) allocates aio_kiocb instance and passes it to aio_poll() 2) aio_poll() resolves the descriptor to struct file by req->file = fget(iocb->aio_fildes) 3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that aio_kiocb to 2 (bumps by 1, that is). 4) aio_poll() calls vfs_poll(). After sanity checks (basically, "poll_wait() had been called and only once") it locks the queue. That's what the extra reference to iocb had been for - we know we can safely access it. 5) With queue locked, we check if ->woken has already been set to true (by aio_poll_wake()) and, if it had been, we unlock the queue, drop a reference to aio_kiocb and bugger off - at that point it's a responsibility to aio_poll_wake() and the stuff called/scheduled by it. That code will drop the reference to file in req->file, along with the other reference to our aio_kiocb. 6) otherwise, we see whether we need to wait. If we do, we unlock the queue, drop one reference to aio_kiocb and go away - eventual wakeup (or cancel) will deal with the reference to file and with the other reference to aio_kiocb 7) otherwise we remove ourselves from waitqueue (still under the queue lock), so that wakeup won't get us. No async activity will be happening, so we can safely drop req->file and iocb ourselves. If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb won't get freed under us, so we can do all the checks and locking safely. And we don't touch ->file if we detect that case. However, vfs_poll() most certainly *does* touch the file it had been given. So wakeup coming while we are still in ->poll() might end up doing fput() on that file. That case is not too rare, and usually we are saved by the still present reference from descriptor table - that fput() is not the final one. But if another thread closes that descriptor right after our fget() and wakeup does happen before ->poll() returns, we are in trouble - final fput() done while we are in the middle of a method: Al also wrote a patch to take an extra reference to the file descriptor to fix this, but I instead suggested we just streamline the whole file pointer handling by submit_io() so that the generic aio submission code simply keeps the file pointer around until the aio has completed. Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL") Acked-by: Al Viro <viro@zeniv.linux.org.uk> Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller3-3/+40
Daniel Borkmann says: ==================== pull-request: bpf-next 2019-03-04 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add AF_XDP support to libbpf. Rationale is to facilitate writing AF_XDP applications by offering higher-level APIs that hide many of the details of the AF_XDP uapi. Sample programs are converted over to this new interface as well, from Magnus. 2) Introduce a new cant_sleep() macro for annotation of functions that cannot sleep and use it in BPF_PROG_RUN() to assert that BPF programs run under preemption disabled context, from Peter. 3) Introduce per BPF prog stats in order to monitor the usage of BPF; this is controlled by kernel.bpf_stats_enabled sysctl knob where monitoring tools can make use of this to efficiently determine the average cost of programs, from Alexei. 4) Split up BPF selftest's test_progs similarly as we already did with test_verifier. This allows to further reduce merge conflicts in future and to get more structure into our quickly growing BPF selftest suite, from Stanislav. 5) Fix a bug in BTF's dedup algorithm which can cause an infinite loop in some circumstances; also various BPF doc fixes and improvements, from Andrii. 6) Various BPF sample cleanups and migration to libbpf in order to further isolate the old sample loader code (so we can get rid of it at some point), from Jakub. 7) Add a new BPF helper for BPF cgroup skb progs that allows to set ECN CE code point and a Host Bandwidth Manager (HBM) sample program for limiting the bandwidth used by v2 cgroups, from Lawrence. 8) Enable write access to skb->queue_mapping from tc BPF egress programs in order to let BPF pick TX queue, from Jesper. 9) Fix a bug in BPF spinlock handling for map-in-map which did not propagate spin_lock_off to the meta map, from Yonghong. 10) Fix a bug in the new per-CPU BPF prog counters to properly initialize stats for each CPU, from Eric. 11) Add various BPF helper prototypes to selftest's bpf_helpers.h, from Willem. 12) Fix various BPF samples bugs in XDP and tracing progs, from Toke, Daniel and Yonghong. 13) Silence preemption splat in test_bpf after BPF_PROG_RUN() enforces it now everywhere, from Anders. 14) Fix a signedness bug in libbpf's btf_dedup_ref_type() to get error handling working, from Dan. 15) Fix bpftool documentation and auto-completion with regards to stream_{verdict,parser} attach types, from Alban. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-04Merge branch 'spi-5.1' into spi-nextMark Brown4-6/+51
2019-03-04Merge branch 'regulator-5.1' into regulator-nextMark Brown4-35/+26
2019-03-04net: phy: remove gen10g_no_soft_resetHeiner Kallweit1-1/+0
genphy_no_soft_reset and gen10g_no_soft_reset are both the same no-ops, one is enough. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-04net: phy: don't export gen10g_read_statusHeiner Kallweit1-1/+0
gen10g_read_status is deprecated, therefore stop exporting it. We don't want to encourage anybody to use it. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-04net: phy: remove gen10g_config_initHeiner Kallweit1-1/+0
ETHTOOL_LINK_MODE_10000baseT_Full_BIT is set anyway in the supported and advertising bitmap because it's part of PHY_10GBIT_FEATURES. And all users of gen10g_config_init use PHY_10GBIT_FEATURES. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-04net: phy: remove gen10g_suspend and gen10g_resumeHeiner Kallweit1-2/+0
phy_suspend() and phy_resume() are no-ops anyway if no callback is defined. Therefore we don't need these stubs. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-04net: ipv6: add socket option IPV6_ROUTER_ALERT_ISOLATEFrancesco Ruggeri1-1/+2
By default IPv6 socket with IPV6_ROUTER_ALERT socket option set will receive all IPv6 RA packets from all namespaces. IPV6_ROUTER_ALERT_ISOLATE socket option restricts packets received by the socket to be only from the socket's namespace. Signed-off-by: Maxim Martynov <maxim@arista.com> Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-04regulator: core: Add set/get_current_limit helpers for regmap usersAxel Lin1-0/+7
By setting curr_table, n_current_limits, csel_reg and csel_mask, the regmap users can use regulator_set_current_limit_regmap and regulator_get_current_limit_regmap for set/get_current_limit callbacks. Signed-off-by: Axel Lin <axel.lin@ingics.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2019-03-04regulator: Fix comment for csel_reg and csel_maskAxel Lin1-2/+2
The csel_reg and csel_mask fields in struct regulator_desc needs to be generic for drivers. Not just for TPS65218. Signed-off-by: Axel Lin <axel.lin@ingics.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2019-03-04appletalk: Fix use-after-free in atalk_proc_exitYueHaibing1-1/+1
KASAN report this: BUG: KASAN: use-after-free in pde_subdir_find+0x12d/0x150 fs/proc/generic.c:71 Read of size 8 at addr ffff8881f41fe5b0 by task syz-executor.0/2806 CPU: 0 PID: 2806 Comm: syz-executor.0 Not tainted 5.0.0-rc7+ #45 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0xfa/0x1ce lib/dump_stack.c:113 print_address_description+0x65/0x270 mm/kasan/report.c:187 kasan_report+0x149/0x18d mm/kasan/report.c:317 pde_subdir_find+0x12d/0x150 fs/proc/generic.c:71 remove_proc_entry+0xe8/0x420 fs/proc/generic.c:667 atalk_proc_exit+0x18/0x820 [appletalk] atalk_exit+0xf/0x5a [appletalk] __do_sys_delete_module kernel/module.c:1018 [inline] __se_sys_delete_module kernel/module.c:961 [inline] __x64_sys_delete_module+0x3dc/0x5e0 kernel/module.c:961 do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x462e99 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fb2de6b9c58 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000200001c0 RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb2de6ba6bc R13: 00000000004bccaa R14: 00000000006f6bc8 R15: 00000000ffffffff Allocated by task 2806: set_track mm/kasan/common.c:85 [inline] __kasan_kmalloc.constprop.3+0xa0/0xd0 mm/kasan/common.c:496 slab_post_alloc_hook mm/slab.h:444 [inline] slab_alloc_node mm/slub.c:2739 [inline] slab_alloc mm/slub.c:2747 [inline] kmem_cache_alloc+0xcf/0x250 mm/slub.c:2752 kmem_cache_zalloc include/linux/slab.h:730 [inline] __proc_create+0x30f/0xa20 fs/proc/generic.c:408 proc_mkdir_data+0x47/0x190 fs/proc/generic.c:469 0xffffffffc10c01bb 0xffffffffc10c0166 do_one_initcall+0xfa/0x5ca init/main.c:887 do_init_module+0x204/0x5f6 kernel/module.c:3460 load_module+0x66b2/0x8570 kernel/module.c:3808 __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902 do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 2806: set_track mm/kasan/common.c:85 [inline] __kasan_slab_free+0x130/0x180 mm/kasan/common.c:458 slab_free_hook mm/slub.c:1409 [inline] slab_free_freelist_hook mm/slub.c:1436 [inline] slab_free mm/slub.c:2986 [inline] kmem_cache_free+0xa6/0x2a0 mm/slub.c:3002 pde_put+0x6e/0x80 fs/proc/generic.c:647 remove_proc_entry+0x1d3/0x420 fs/proc/generic.c:684 0xffffffffc10c031c 0xffffffffc10c0166 do_one_initcall+0xfa/0x5ca init/main.c:887 do_init_module+0x204/0x5f6 kernel/module.c:3460 load_module+0x66b2/0x8570 kernel/module.c:3808 __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902 do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff8881f41fe500 which belongs to the cache proc_dir_entry of size 256 The buggy address is located 176 bytes inside of 256-byte region [ffff8881f41fe500, ffff8881f41fe600) The buggy address belongs to the page: page:ffffea0007d07f80 count:1 mapcount:0 mapping:ffff8881f6e69a00 index:0x0 flags: 0x2fffc0000000200(slab) raw: 02fffc0000000200 dead000000000100 dead000000000200 ffff8881f6e69a00 raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8881f41fe480: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff8881f41fe500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8881f41fe580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8881f41fe600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ffff8881f41fe680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb It should check the return value of atalk_proc_init fails, otherwise atalk_exit will trgger use-after-free in pde_subdir_find while unload the module.This patch fix error cleanup path of atalk_init Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>