From 5d833062139d290adb8b62c093b654a01a353448 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Feb 2015 14:58:16 -0800 Subject: mm: numa: do not dereference pmd outside of the lock during NUMA hinting fault Automatic NUMA balancing depends on being able to protect PTEs to trap a fault and gather reference locality information. Very broadly speaking it would mark PTEs as not present and use another bit to distinguish between NUMA hinting faults and other types of faults. It was universally loved by everybody and caused no problems whatsoever. That last sentence might be a lie. This series is very heavily based on patches from Linus and Aneesh to replace the existing PTE/PMD NUMA helper functions with normal change protections. I did alter and add parts of it but I consider them relatively minor contributions. At their suggestion, acked-bys are in there but I've no problem converting them to Signed-off-by if requested. AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh for that. I tested trinity under kvm-tool and passed and ran a few other basic tests. At the time of writing, only the short-lived tests have completed but testing of V2 indicated that long-term testing had no surprises. In most cases I'm leaving out detail as it's not that interesting. specjbb single JVM: There was negligible performance difference in the benchmark itself for short runs. However, system activity is higher and interrupts are much higher over time -- possibly TLB flushes. Migrations are also higher. Overall, this is more overhead but considering the problems faced with the old approach I think we just have to suck it up and find another way of reducing the overhead. specjbb multi JVM: Negligible performance difference to the actual benchmark but like the single JVM case, the system overhead is noticeably higher. Again, interrupts are a major factor. autonumabench: This was all over the place and about all that can be reasonably concluded is that it's different but not necessarily better or worse. autonumabench 3.18.0-rc5 3.18.0-rc5 mmotm-20141119 protnone-v3r3 User NUMA01 32380.24 ( 0.00%) 21642.92 ( 33.16%) User NUMA01_THEADLOCAL 22481.02 ( 0.00%) 22283.22 ( 0.88%) User NUMA02 3137.00 ( 0.00%) 3116.54 ( 0.65%) User NUMA02_SMT 1614.03 ( 0.00%) 1543.53 ( 4.37%) System NUMA01 322.97 ( 0.00%) 1465.89 (-353.88%) System NUMA01_THEADLOCAL 91.87 ( 0.00%) 49.32 ( 46.32%) System NUMA02 37.83 ( 0.00%) 14.61 ( 61.38%) System NUMA02_SMT 7.36 ( 0.00%) 7.45 ( -1.22%) Elapsed NUMA01 716.63 ( 0.00%) 599.29 ( 16.37%) Elapsed NUMA01_THEADLOCAL 553.98 ( 0.00%) 539.94 ( 2.53%) Elapsed NUMA02 83.85 ( 0.00%) 83.04 ( 0.97%) Elapsed NUMA02_SMT 86.57 ( 0.00%) 79.15 ( 8.57%) CPU NUMA01 4563.00 ( 0.00%) 3855.00 ( 15.52%) CPU NUMA01_THEADLOCAL 4074.00 ( 0.00%) 4136.00 ( -1.52%) CPU NUMA02 3785.00 ( 0.00%) 3770.00 ( 0.40%) CPU NUMA02_SMT 1872.00 ( 0.00%) 1959.00 ( -4.65%) System CPU usage of NUMA01 is worse but it's an adverse workload on this machine so I'm reluctant to conclude that it's a problem that matters. On the other workloads that are sensible on this machine, system CPU usage is great. Overall time to complete the benchmark is comparable 3.18.0-rc5 3.18.0-rc5 mmotm-20141119protnone-v3r3 User 59612.50 48586.44 System 460.22 1537.45 Elapsed 1442.20 1304.29 NUMA alloc hit 5075182 5743353 NUMA alloc miss 0 0 NUMA interleave hit 0 0 NUMA alloc local 5075174 5743339 NUMA base PTE updates 637061448 443106883 NUMA huge PMD updates 1243434 864747 NUMA page range updates 1273699656 885857347 NUMA hint faults 1658116 1214277 NUMA hint local faults 959487 754113 NUMA hint local percent 57 62 NUMA pages migrated 5467056 61676398 The NUMA pages migrated look terrible but when I looked at a graph of the activity over time I see that the massive spike in migration activity was during NUMA01. This correlates with high system CPU usage and could be simply down to bad luck but any modifications that affect that workload would be related to scan rates and migrations, not the protection mechanism. For all other workloads, migration activity was comparable. Overall, headline performance figures are comparable but the overhead is higher, mostly in interrupts. To some extent, higher overhead from this approach was anticipated but not to this degree. It's going to be necessary to reduce this again with a separate series in the future. It's still worth going ahead with this series though as it's likely to avoid constant headaches with Xen and is probably easier to maintain. This patch (of 10): A transhuge NUMA hinting fault may find the page is migrating and should wait until migration completes. The check is race-prone because the pmd is deferenced outside of the page lock and while the race is tiny, it'll be larger if the PMD is cleared while marking PMDs for hinting fault. This patch closes the race. Signed-off-by: Mel Gorman Cc: Aneesh Kumar K.V Cc: Benjamin Herrenschmidt Cc: Dave Jones Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill Shutemov Cc: Linus Torvalds Cc: Paul Mackerras Cc: Rik van Riel Cc: Sasha Levin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/migrate.h | 4 ---- mm/huge_memory.c | 3 ++- mm/migrate.c | 6 ------ 3 files changed, 2 insertions(+), 11 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index fab9b32ace8e..78baed5f2952 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -67,7 +67,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #ifdef CONFIG_NUMA_BALANCING extern bool pmd_trans_migrating(pmd_t pmd); -extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd); extern int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node); extern bool migrate_ratelimited(int node); @@ -76,9 +75,6 @@ static inline bool pmd_trans_migrating(pmd_t pmd) { return false; } -static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd) -{ -} static inline int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index cb7be110cad3..c6921362c5fc 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1272,8 +1272,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, * check_same as the page may no longer be mapped. */ if (unlikely(pmd_trans_migrating(*pmdp))) { + page = pmd_page(*pmdp); spin_unlock(ptl); - wait_migrate_huge_page(vma->anon_vma, pmdp); + wait_on_page_locked(page); goto out; } diff --git a/mm/migrate.c b/mm/migrate.c index f98067e5d353..5e8f03a8de2a 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1654,12 +1654,6 @@ bool pmd_trans_migrating(pmd_t pmd) return PageLocked(page); } -void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd) -{ - struct page *page = pmd_page(*pmd); - wait_on_page_locked(page); -} - /* * Attempt to migrate a misplaced page to the specified destination * node. Caller is expected to have an elevated reference count on -- cgit v1.2.3 From e7bb4b6d1609cce391af1e7bc6f31d14f1a3a890 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Feb 2015 14:58:19 -0800 Subject: mm: add p[te|md] protnone helpers for use by NUMA balancing This is a preparatory patch that introduces protnone helpers for automatic NUMA balancing. Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar K.V Tested-by: Sasha Levin Cc: Benjamin Herrenschmidt Cc: Dave Jones Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill Shutemov Cc: Paul Mackerras Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/powerpc/include/asm/pgtable.h | 16 ++++++++++++++++ arch/x86/include/asm/pgtable.h | 16 ++++++++++++++++ include/asm-generic/pgtable.h | 20 ++++++++++++++++++++ 3 files changed, 52 insertions(+) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 7e77f2ca5132..1146006d3477 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -40,6 +40,22 @@ static inline int pte_none(pte_t pte) { return (pte_val(pte) & ~_PTE_NONE_MASK) static inline pgprot_t pte_pgprot(pte_t pte) { return __pgprot(pte_val(pte) & PAGE_PROT_BITS); } #ifdef CONFIG_NUMA_BALANCING +/* + * These work without NUMA balancing but the kernel does not care. See the + * comment in include/asm-generic/pgtable.h . On powerpc, this will only + * work for user pages and always return true for kernel pages. + */ +static inline int pte_protnone(pte_t pte) +{ + return (pte_val(pte) & + (_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT; +} + +static inline int pmd_protnone(pmd_t pmd) +{ + return pte_protnone(pmd_pte(pmd)); +} + static inline int pte_present(pte_t pte) { return pte_val(pte) & _PAGE_NUMA_MASK; diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 0fe03f834fb1..f519b0b529dd 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -483,6 +483,22 @@ static inline int pmd_present(pmd_t pmd) _PAGE_NUMA); } +#ifdef CONFIG_NUMA_BALANCING +/* + * These work without NUMA balancing but the kernel does not care. See the + * comment in include/asm-generic/pgtable.h + */ +static inline int pte_protnone(pte_t pte) +{ + return pte_flags(pte) & _PAGE_PROTNONE; +} + +static inline int pmd_protnone(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_PROTNONE; +} +#endif /* CONFIG_NUMA_BALANCING */ + static inline int pmd_none(pmd_t pmd) { /* Only check low word on 32-bit platforms, since it might be diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 129de9204d18..067922c06c29 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -673,6 +673,26 @@ static inline int pmd_trans_unstable(pmd_t *pmd) #endif } +#ifndef CONFIG_NUMA_BALANCING +/* + * Technically a PTE can be PROTNONE even when not doing NUMA balancing but + * the only case the kernel cares is for NUMA balancing and is only ever set + * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked + * _PAGE_PROTNONE so by by default, implement the helper as "always no". It + * is the responsibility of the caller to distinguish between PROT_NONE + * protections and NUMA hinting fault protections. + */ +static inline int pte_protnone(pte_t pte) +{ + return 0; +} + +static inline int pmd_protnone(pmd_t pmd) +{ + return 0; +} +#endif /* CONFIG_NUMA_BALANCING */ + #ifdef CONFIG_NUMA_BALANCING /* * _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that -- cgit v1.2.3 From 8a0516ed8b90c95ffa1363b420caa37418149f21 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Feb 2015 14:58:22 -0800 Subject: mm: convert p[te|md]_numa users to p[te|md]_protnone_numa Convert existing users of pte_numa and friends to the new helper. Note that the kernel is broken after this patch is applied until the other page table modifiers are also altered. This patch layout is to make review easier. Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar Acked-by: Benjamin Herrenschmidt Tested-by: Sasha Levin Cc: Dave Jones Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill Shutemov Cc: Paul Mackerras Cc: Rik van Riel Cc: Sasha Levin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +- arch/powerpc/mm/fault.c | 5 ----- arch/powerpc/mm/pgtable.c | 11 ++++++++--- arch/powerpc/mm/pgtable_64.c | 3 ++- arch/x86/mm/gup.c | 4 ++-- include/uapi/linux/mempolicy.h | 2 +- mm/gup.c | 10 +++++----- mm/huge_memory.c | 16 ++++++++-------- mm/memory.c | 4 ++-- mm/mprotect.c | 38 ++++++++++--------------------------- mm/pgtable-generic.c | 2 +- 11 files changed, 40 insertions(+), 57 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 510bdfbc4073..625407e4d3b0 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -212,7 +212,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, /* Look up the Linux PTE for the backing page */ pte_size = psize; pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size); - if (pte_present(pte) && !pte_numa(pte)) { + if (pte_present(pte) && !pte_protnone(pte)) { if (writing && !pte_write(pte)) /* make the actual HPTE be read-only */ ptel = hpte_make_readonly(ptel); diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 6154b0a2b063..f38327b95f76 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -398,8 +398,6 @@ good_area: * processors use the same I/D cache coherency mechanism * as embedded. */ - if (error_code & DSISR_PROTFAULT) - goto bad_area; #endif /* CONFIG_PPC_STD_MMU */ /* @@ -423,9 +421,6 @@ good_area: flags |= FAULT_FLAG_WRITE; /* a read */ } else { - /* protection fault */ - if (error_code & 0x08000000) - goto bad_area; if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) goto bad_area; } diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index c90e602677c9..83dfcb55ffef 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -172,9 +172,14 @@ static pte_t set_access_flags_filter(pte_t pte, struct vm_area_struct *vma, void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte) { -#ifdef CONFIG_DEBUG_VM - WARN_ON(pte_val(*ptep) & _PAGE_PRESENT); -#endif + /* + * When handling numa faults, we already have the pte marked + * _PAGE_PRESENT, but we can be sure that it is not in hpte. + * Hence we can use set_pte_at for them. + */ + VM_WARN_ON((pte_val(*ptep) & (_PAGE_PRESENT | _PAGE_USER)) == + (_PAGE_PRESENT | _PAGE_USER)); + /* Note: mm->context.id might not yet have been assigned as * this context might not have been activated yet when this * is called. diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c index 4fe5f64cc179..91bb8836825a 100644 --- a/arch/powerpc/mm/pgtable_64.c +++ b/arch/powerpc/mm/pgtable_64.c @@ -718,7 +718,8 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd) { #ifdef CONFIG_DEBUG_VM - WARN_ON(pmd_val(*pmdp) & _PAGE_PRESENT); + WARN_ON((pmd_val(*pmdp) & (_PAGE_PRESENT | _PAGE_USER)) == + (_PAGE_PRESENT | _PAGE_USER)); assert_spin_locked(&mm->page_table_lock); WARN_ON(!pmd_trans_huge(pmd)); #endif diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c index 89df70e0caa6..81bf3d2af3eb 100644 --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -84,7 +84,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, struct page *page; /* Similar to the PMD case, NUMA hinting must take slow path */ - if (pte_numa(pte)) { + if (pte_protnone(pte)) { pte_unmap(ptep); return 0; } @@ -178,7 +178,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, * slowpath for accounting purposes and so that they * can be serialised against THP migration. */ - if (pmd_numa(pmd)) + if (pmd_protnone(pmd)) return 0; if (!gup_huge_pmd(pmd, addr, next, write, pages, nr)) return 0; diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 0d11c3dcd3a1..9cd8b21dddbe 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -67,7 +67,7 @@ enum mpol_rebind_step { #define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */ #define MPOL_F_REBINDING (1 << 2) /* identify policies in rebinding */ #define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */ -#define MPOL_F_MORON (1 << 4) /* Migrate On pte_numa Reference On Node */ +#define MPOL_F_MORON (1 << 4) /* Migrate On protnone Reference On Node */ #endif /* _UAPI_LINUX_MEMPOLICY_H */ diff --git a/mm/gup.c b/mm/gup.c index c2da1163986a..51bf0b06ca7b 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -64,7 +64,7 @@ retry: migration_entry_wait(mm, pmd, address); goto retry; } - if ((flags & FOLL_NUMA) && pte_numa(pte)) + if ((flags & FOLL_NUMA) && pte_protnone(pte)) goto no_page; if ((flags & FOLL_WRITE) && !pte_write(pte)) { pte_unmap_unlock(ptep, ptl); @@ -184,7 +184,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma, return page; return no_page_table(vma, flags); } - if ((flags & FOLL_NUMA) && pmd_numa(*pmd)) + if ((flags & FOLL_NUMA) && pmd_protnone(*pmd)) return no_page_table(vma, flags); if (pmd_trans_huge(*pmd)) { if (flags & FOLL_SPLIT) { @@ -906,10 +906,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, /* * Similar to the PMD case below, NUMA hinting must take slow - * path + * path using the pte_protnone check. */ if (!pte_present(pte) || pte_special(pte) || - pte_numa(pte) || (write && !pte_write(pte))) + pte_protnone(pte) || (write && !pte_write(pte))) goto pte_unmap; VM_BUG_ON(!pfn_valid(pte_pfn(pte))); @@ -1104,7 +1104,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, * slowpath for accounting purposes and so that they * can be serialised against THP migration. */ - if (pmd_numa(pmd)) + if (pmd_protnone(pmd)) return 0; if (!gup_huge_pmd(pmd, pmdp, addr, next, write, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c6921362c5fc..915941c45169 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1211,7 +1211,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, return ERR_PTR(-EFAULT); /* Full NUMA hinting faults to serialise migration in fault paths */ - if ((flags & FOLL_NUMA) && pmd_numa(*pmd)) + if ((flags & FOLL_NUMA) && pmd_protnone(*pmd)) goto out; page = pmd_page(*pmd); @@ -1342,7 +1342,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, /* * Migrate the THP to the requested node, returns with page unlocked - * and pmd_numa cleared. + * and access rights restored. */ spin_unlock(ptl); migrated = migrate_misplaced_transhuge_page(mm, vma, @@ -1357,7 +1357,7 @@ clear_pmdnuma: BUG_ON(!PageLocked(page)); pmd = pmd_mknonnuma(pmd); set_pmd_at(mm, haddr, pmdp, pmd); - VM_BUG_ON(pmd_numa(*pmdp)); + VM_BUG_ON(pmd_protnone(*pmdp)); update_mmu_cache_pmd(vma, addr, pmdp); unlock_page(page); out_unlock: @@ -1483,7 +1483,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, ret = 1; if (!prot_numa) { entry = pmdp_get_and_clear_notify(mm, addr, pmd); - if (pmd_numa(entry)) + if (pmd_protnone(entry)) entry = pmd_mknonnuma(entry); entry = pmd_modify(entry, newprot); ret = HPAGE_PMD_NR; @@ -1499,7 +1499,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, * local vs remote hits on the zero page. */ if (!is_huge_zero_page(page) && - !pmd_numa(*pmd)) { + !pmd_protnone(*pmd)) { pmdp_set_numa(mm, addr, pmd); ret = HPAGE_PMD_NR; } @@ -1767,9 +1767,9 @@ static int __split_huge_page_map(struct page *page, pte_t *pte, entry; BUG_ON(PageCompound(page+i)); /* - * Note that pmd_numa is not transferred deliberately - * to avoid any possibility that pte_numa leaks to - * a PROT_NONE VMA by accident. + * Note that NUMA hinting access restrictions are not + * transferred to avoid any possibility of altering + * permissions across VMAs. */ entry = mk_pte(page + i, vma->vm_page_prot); entry = maybe_mkwrite(pte_mkdirty(entry), vma); diff --git a/mm/memory.c b/mm/memory.c index bbe6a73a899d..92e6a6299e86 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3124,7 +3124,7 @@ static int handle_pte_fault(struct mm_struct *mm, pte, pmd, flags, entry); } - if (pte_numa(entry)) + if (pte_protnone(entry)) return do_numa_page(mm, vma, address, entry, pte, pmd); ptl = pte_lockptr(mm, pmd); @@ -3202,7 +3202,7 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (pmd_trans_splitting(orig_pmd)) return 0; - if (pmd_numa(orig_pmd)) + if (pmd_protnone(orig_pmd)) return do_huge_pmd_numa_page(mm, vma, address, orig_pmd, pmd); diff --git a/mm/mprotect.c b/mm/mprotect.c index 33121662f08b..44ffa698484d 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -75,36 +75,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, oldpte = *pte; if (pte_present(oldpte)) { pte_t ptent; - bool updated = false; - if (!prot_numa) { - ptent = ptep_modify_prot_start(mm, addr, pte); - if (pte_numa(ptent)) - ptent = pte_mknonnuma(ptent); - ptent = pte_modify(ptent, newprot); - /* - * Avoid taking write faults for pages we - * know to be dirty. - */ - if (dirty_accountable && pte_dirty(ptent) && - (pte_soft_dirty(ptent) || - !(vma->vm_flags & VM_SOFTDIRTY))) - ptent = pte_mkwrite(ptent); - ptep_modify_prot_commit(mm, addr, pte, ptent); - updated = true; - } else { - struct page *page; - - page = vm_normal_page(vma, addr, oldpte); - if (page && !PageKsm(page)) { - if (!pte_numa(oldpte)) { - ptep_set_numa(mm, addr, pte); - updated = true; - } - } + ptent = ptep_modify_prot_start(mm, addr, pte); + ptent = pte_modify(ptent, newprot); + + /* Avoid taking write faults for known dirty pages */ + if (dirty_accountable && pte_dirty(ptent) && + (pte_soft_dirty(ptent) || + !(vma->vm_flags & VM_SOFTDIRTY))) { + ptent = pte_mkwrite(ptent); } - if (updated) - pages++; + ptep_modify_prot_commit(mm, addr, pte, ptent); + pages++; } else if (IS_ENABLED(CONFIG_MIGRATION)) { swp_entry_t entry = pte_to_swp_entry(oldpte); diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index dfb79e028ecb..4b8ad760dde3 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -193,7 +193,7 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) { pmd_t entry = *pmdp; - if (pmd_numa(entry)) + if (pmd_protnone(entry)) entry = pmd_mknonnuma(entry); set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry)); flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); -- cgit v1.2.3 From 842915f56667f9eebd85932f08c79565148c26d6 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Feb 2015 14:58:25 -0800 Subject: ppc64: add paranoid warnings for unexpected DSISR_PROTFAULT ppc64 should not be depending on DSISR_PROTFAULT and it's unexpected if they are triggered. This patch adds warnings just in case they are being accidentally depended upon. Signed-off-by: Mel Gorman Acked-by: Aneesh Kumar K.V Tested-by: Sasha Levin Cc: Benjamin Herrenschmidt Cc: Dave Jones Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill Shutemov Cc: Linus Torvalds Cc: Paul Mackerras Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/powerpc/mm/copro_fault.c | 8 ++++++-- arch/powerpc/mm/fault.c | 20 +++++++++----------- 2 files changed, 15 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c index 1b5305d4bdab..f031a47d7701 100644 --- a/arch/powerpc/mm/copro_fault.c +++ b/arch/powerpc/mm/copro_fault.c @@ -64,10 +64,14 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea, if (!(vma->vm_flags & VM_WRITE)) goto out_unlock; } else { - if (dsisr & DSISR_PROTFAULT) - goto out_unlock; if (!(vma->vm_flags & (VM_READ | VM_EXEC))) goto out_unlock; + /* + * protfault should only happen due to us + * mapping a region readonly temporarily. PROT_NONE + * is also covered by the VMA check above. + */ + WARN_ON_ONCE(dsisr & DSISR_PROTFAULT); } ret = 0; diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index f38327b95f76..b396868d2aa7 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -389,17 +389,6 @@ good_area: #endif /* CONFIG_8xx */ if (is_exec) { -#ifdef CONFIG_PPC_STD_MMU - /* Protection fault on exec go straight to failure on - * Hash based MMUs as they either don't support per-page - * execute permission, or if they do, it's handled already - * at the hash level. This test would probably have to - * be removed if we change the way this works to make hash - * processors use the same I/D cache coherency mechanism - * as embedded. - */ -#endif /* CONFIG_PPC_STD_MMU */ - /* * Allow execution from readable areas if the MMU does not * provide separate controls over reading and executing. @@ -414,6 +403,14 @@ good_area: (cpu_has_feature(CPU_FTR_NOEXECUTE) || !(vma->vm_flags & (VM_READ | VM_WRITE)))) goto bad_area; +#ifdef CONFIG_PPC_STD_MMU + /* + * protfault should only happen due to us + * mapping a region readonly temporarily. PROT_NONE + * is also covered by the VMA check above. + */ + WARN_ON_ONCE(error_code & DSISR_PROTFAULT); +#endif /* CONFIG_PPC_STD_MMU */ /* a write */ } else if (is_write) { if (!(vma->vm_flags & VM_WRITE)) @@ -423,6 +420,7 @@ good_area: } else { if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) goto bad_area; + WARN_ON_ONCE(error_code & DSISR_PROTFAULT); } /* -- cgit v1.2.3 From 4d9424669946532be754a6e116618dcb58430cb4 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Feb 2015 14:58:28 -0800 Subject: mm: convert p[te|md]_mknonnuma and remaining page table manipulations With PROT_NONE, the traditional page table manipulation functions are sufficient. [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()] [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS] Signed-off-by: Mel Gorman Acked-by: Linus Torvalds Acked-by: Aneesh Kumar Tested-by: Sasha Levin Cc: Benjamin Herrenschmidt Cc: Dave Jones Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill Shutemov Cc: Paul Mackerras Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm/include/asm/pgtable-3level.h | 5 ++++- include/linux/huge_mm.h | 3 +-- mm/huge_memory.c | 33 +++++++-------------------------- mm/memory.c | 10 ++++++---- mm/mempolicy.c | 2 +- mm/migrate.c | 2 +- mm/mprotect.c | 2 +- mm/pgtable-generic.c | 2 -- 8 files changed, 21 insertions(+), 38 deletions(-) diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h index 18dbc82f85e5..423a5ac09d3a 100644 --- a/arch/arm/include/asm/pgtable-3level.h +++ b/arch/arm/include/asm/pgtable-3level.h @@ -257,7 +257,10 @@ PMD_BIT_FUNC(mkyoung, |= PMD_SECT_AF); #define mk_pmd(page,prot) pfn_pmd(page_to_pfn(page),prot) /* represent a notpresent pmd by zero, this is used by pmdp_invalidate */ -#define pmd_mknotpresent(pmd) (__pmd(0)) +static inline pmd_t pmd_mknotpresent(pmd_t pmd) +{ + return __pmd(0); +} static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) { diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index f10b20f05159..062bd252e994 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -31,8 +31,7 @@ extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long new_addr, unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd); extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot, - int prot_numa); + unsigned long addr, pgprot_t newprot); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 915941c45169..cb9b3e847dac 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1355,9 +1355,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out; clear_pmdnuma: BUG_ON(!PageLocked(page)); - pmd = pmd_mknonnuma(pmd); + pmd = pmd_modify(pmd, vma->vm_page_prot); set_pmd_at(mm, haddr, pmdp, pmd); - VM_BUG_ON(pmd_protnone(*pmdp)); update_mmu_cache_pmd(vma, addr, pmdp); unlock_page(page); out_unlock: @@ -1472,7 +1471,7 @@ out: * - HPAGE_PMD_NR is protections changed and TLB flush necessary */ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot, int prot_numa) + unsigned long addr, pgprot_t newprot) { struct mm_struct *mm = vma->vm_mm; spinlock_t *ptl; @@ -1481,29 +1480,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pmd_t entry; ret = 1; - if (!prot_numa) { - entry = pmdp_get_and_clear_notify(mm, addr, pmd); - if (pmd_protnone(entry)) - entry = pmd_mknonnuma(entry); - entry = pmd_modify(entry, newprot); - ret = HPAGE_PMD_NR; - set_pmd_at(mm, addr, pmd, entry); - BUG_ON(pmd_write(entry)); - } else { - struct page *page = pmd_page(*pmd); - - /* - * Do not trap faults against the zero page. The - * read-only data is likely to be read-cached on the - * local CPU cache and it is less useful to know about - * local vs remote hits on the zero page. - */ - if (!is_huge_zero_page(page) && - !pmd_protnone(*pmd)) { - pmdp_set_numa(mm, addr, pmd); - ret = HPAGE_PMD_NR; - } - } + entry = pmdp_get_and_clear_notify(mm, addr, pmd); + entry = pmd_modify(entry, newprot); + ret = HPAGE_PMD_NR; + set_pmd_at(mm, addr, pmd, entry); + BUG_ON(pmd_write(entry)); spin_unlock(ptl); } diff --git a/mm/memory.c b/mm/memory.c index 92e6a6299e86..d7921760cf79 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3018,9 +3018,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, * validation through pte_unmap_same(). It's of NUMA type but * the pfn may be screwed if the read is non atomic. * - * ptep_modify_prot_start is not called as this is clearing - * the _PAGE_NUMA bit and it is not really expected that there - * would be concurrent hardware modifications to the PTE. + * We can safely just do a "set_pte_at()", because the old + * page table entry is not accessible, so there would be no + * concurrent hardware modifications to the PTE. */ ptl = pte_lockptr(mm, pmd); spin_lock(ptl); @@ -3029,7 +3029,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out; } - pte = pte_mknonnuma(pte); + /* Make it present again */ + pte = pte_modify(pte, vma->vm_page_prot); + pte = pte_mkyoung(pte); set_pte_at(mm, addr, ptep, pte); update_mmu_cache(vma, addr, ptep); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index f1bd23803576..c75f4dcec808 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -569,7 +569,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma, { int nr_updated; - nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1); + nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1); if (nr_updated) count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated); diff --git a/mm/migrate.c b/mm/migrate.c index 5e8f03a8de2a..85e042686031 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1847,7 +1847,7 @@ out_fail: out_dropref: ptl = pmd_lock(mm, pmd); if (pmd_same(*pmd, entry)) { - entry = pmd_mknonnuma(entry); + entry = pmd_modify(entry, vma->vm_page_prot); set_pmd_at(mm, mmun_start, pmd, entry); update_mmu_cache_pmd(vma, address, &entry); } diff --git a/mm/mprotect.c b/mm/mprotect.c index 44ffa698484d..76824d73380d 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -142,7 +142,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, split_huge_page_pmd(vma, addr, pmd); else { int nr_ptes = change_huge_pmd(vma, pmd, addr, - newprot, prot_numa); + newprot); if (nr_ptes) { if (nr_ptes == HPAGE_PMD_NR) { diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index 4b8ad760dde3..c25f94b33811 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -193,8 +193,6 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) { pmd_t entry = *pmdp; - if (pmd_protnone(entry)) - entry = pmd_mknonnuma(entry); set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry)); flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); } -- cgit v1.2.3 From 21d9ee3eda7792c45880b2f11bff8e95c9a061fb Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Feb 2015 14:58:32 -0800 Subject: mm: remove remaining references to NUMA hinting bits and helpers This patch removes the NUMA PTE bits and associated helpers. As a side-effect it increases the maximum possible swap space on x86-64. One potential source of problems is races between the marking of PTEs PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that a PTE being protected is not faulted in parallel, seen as a pte_none and corrupting memory. The base case is safe but transhuge has problems in the past due to an different migration mechanism and a dependance on page lock to serialise migrations and warrants a closer look. task_work hinting update parallel fault ------------------------ -------------- change_pmd_range change_huge_pmd __pmd_trans_huge_lock pmdp_get_and_clear __handle_mm_fault pmd_none do_huge_pmd_anonymous_page read? pmd_lock blocks until hinting complete, fail !pmd_none test write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none pmd_modify set_pmd_at task_work hinting update parallel migration ------------------------ ------------------ change_pmd_range change_huge_pmd __pmd_trans_huge_lock pmdp_get_and_clear __handle_mm_fault do_huge_pmd_numa_page migrate_misplaced_transhuge_page pmd_lock waits for updates to complete, recheck pmd_same pmd_modify set_pmd_at Both of those are safe and the case where a transhuge page is inserted during a protection update is unchanged. The case where two processes try migrating at the same time is unchanged by this series so should still be ok. I could not find a case where we are accidentally depending on the PTE not being cleared and flushed. If one is missed, it'll manifest as corruption problems that start triggering shortly after this series is merged and only happen when NUMA balancing is enabled. Signed-off-by: Mel Gorman Tested-by: Sasha Levin Cc: Aneesh Kumar K.V Cc: Benjamin Herrenschmidt Cc: Dave Jones Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill Shutemov Cc: Linus Torvalds Cc: Paul Mackerras Cc: Rik van Riel Cc: Mark Brown Cc: Stephen Rothwell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/powerpc/include/asm/pgtable.h | 54 +----------- arch/powerpc/include/asm/pte-common.h | 5 -- arch/powerpc/include/asm/pte-hash64.h | 6 -- arch/x86/include/asm/pgtable.h | 22 +---- arch/x86/include/asm/pgtable_64.h | 5 -- arch/x86/include/asm/pgtable_types.h | 41 +-------- include/asm-generic/pgtable.h | 155 ---------------------------------- include/linux/swapops.h | 2 +- 8 files changed, 7 insertions(+), 283 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 1146006d3477..79fee2eb8d56 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -55,64 +55,12 @@ static inline int pmd_protnone(pmd_t pmd) { return pte_protnone(pmd_pte(pmd)); } - -static inline int pte_present(pte_t pte) -{ - return pte_val(pte) & _PAGE_NUMA_MASK; -} - -#define pte_present_nonuma pte_present_nonuma -static inline int pte_present_nonuma(pte_t pte) -{ - return pte_val(pte) & (_PAGE_PRESENT); -} - -#define ptep_set_numa ptep_set_numa -static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr, - pte_t *ptep) -{ - if ((pte_val(*ptep) & _PAGE_PRESENT) == 0) - VM_BUG_ON(1); - - pte_update(mm, addr, ptep, _PAGE_PRESENT, _PAGE_NUMA, 0); - return; -} - -#define pmdp_set_numa pmdp_set_numa -static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr, - pmd_t *pmdp) -{ - if ((pmd_val(*pmdp) & _PAGE_PRESENT) == 0) - VM_BUG_ON(1); - - pmd_hugepage_update(mm, addr, pmdp, _PAGE_PRESENT, _PAGE_NUMA); - return; -} - -/* - * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist - * which was inherited from x86. For the purposes of powerpc pte_basic_t and - * pmd_t are equivalent - */ -#define pteval_t pte_basic_t -#define pmdval_t pmd_t -static inline pteval_t ptenuma_flags(pte_t pte) -{ - return pte_val(pte) & _PAGE_NUMA_MASK; -} - -static inline pmdval_t pmdnuma_flags(pmd_t pmd) -{ - return pmd_val(pmd) & _PAGE_NUMA_MASK; -} - -# else +#endif /* CONFIG_NUMA_BALANCING */ static inline int pte_present(pte_t pte) { return pte_val(pte) & _PAGE_PRESENT; } -#endif /* CONFIG_NUMA_BALANCING */ /* Conversion functions: convert a page and protection to a page entry, * and a page entry and page directory to the page they refer to. diff --git a/arch/powerpc/include/asm/pte-common.h b/arch/powerpc/include/asm/pte-common.h index 2aef9b7a0eb2..c5a755ef7011 100644 --- a/arch/powerpc/include/asm/pte-common.h +++ b/arch/powerpc/include/asm/pte-common.h @@ -104,11 +104,6 @@ extern unsigned long bad_call_to_PMD_PAGE_SIZE(void); _PAGE_USER | _PAGE_ACCESSED | _PAGE_RO | \ _PAGE_RW | _PAGE_HWWRITE | _PAGE_DIRTY | _PAGE_EXEC) -#ifdef CONFIG_NUMA_BALANCING -/* Mask of bits that distinguish present and numa ptes */ -#define _PAGE_NUMA_MASK (_PAGE_NUMA|_PAGE_PRESENT) -#endif - /* * We define 2 sets of base prot bits, one for basic pages (ie, * cacheable kernel and user pages) and one for non cacheable diff --git a/arch/powerpc/include/asm/pte-hash64.h b/arch/powerpc/include/asm/pte-hash64.h index 2505d8eab15c..55aea0caf95e 100644 --- a/arch/powerpc/include/asm/pte-hash64.h +++ b/arch/powerpc/include/asm/pte-hash64.h @@ -27,12 +27,6 @@ #define _PAGE_RW 0x0200 /* software: user write access allowed */ #define _PAGE_BUSY 0x0800 /* software: PTE & hash are busy */ -/* - * Used for tracking numa faults - */ -#define _PAGE_NUMA 0x00000010 /* Gather numa placement stats */ - - /* No separate kernel read-only */ #define _PAGE_KERNEL_RW (_PAGE_RW | _PAGE_DIRTY) /* user access blocked by key */ #define _PAGE_KERNEL_RO _PAGE_KERNEL_RW diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index f519b0b529dd..34d42a7d5595 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -300,7 +300,7 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd) static inline pmd_t pmd_mknotpresent(pmd_t pmd) { - return pmd_clear_flags(pmd, _PAGE_PRESENT); + return pmd_clear_flags(pmd, _PAGE_PRESENT | _PAGE_PROTNONE); } #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY @@ -442,13 +442,6 @@ static inline int pte_same(pte_t a, pte_t b) } static inline int pte_present(pte_t a) -{ - return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE | - _PAGE_NUMA); -} - -#define pte_present_nonuma pte_present_nonuma -static inline int pte_present_nonuma(pte_t a) { return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE); } @@ -459,7 +452,7 @@ static inline bool pte_accessible(struct mm_struct *mm, pte_t a) if (pte_flags(a) & _PAGE_PRESENT) return true; - if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) && + if ((pte_flags(a) & _PAGE_PROTNONE) && mm_tlb_flush_pending(mm)) return true; @@ -479,8 +472,7 @@ static inline int pmd_present(pmd_t pmd) * the _PAGE_PSE flag will remain set at all times while the * _PAGE_PRESENT bit is clear). */ - return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE | - _PAGE_NUMA); + return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE); } #ifdef CONFIG_NUMA_BALANCING @@ -555,11 +547,6 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address) static inline int pmd_bad(pmd_t pmd) { -#ifdef CONFIG_NUMA_BALANCING - /* pmd_numa check */ - if ((pmd_flags(pmd) & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA) - return 0; -#endif return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE; } @@ -878,19 +865,16 @@ static inline void update_mmu_cache_pmd(struct vm_area_struct *vma, #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY static inline pte_t pte_swp_mksoft_dirty(pte_t pte) { - VM_BUG_ON(pte_present_nonuma(pte)); return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY); } static inline int pte_swp_soft_dirty(pte_t pte) { - VM_BUG_ON(pte_present_nonuma(pte)); return pte_flags(pte) & _PAGE_SWP_SOFT_DIRTY; } static inline pte_t pte_swp_clear_soft_dirty(pte_t pte) { - VM_BUG_ON(pte_present_nonuma(pte)); return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY); } #endif diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index e227970f983e..2ee781114d34 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -142,12 +142,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; } /* Encode and de-code a swap entry */ #define SWP_TYPE_BITS 5 -#ifdef CONFIG_NUMA_BALANCING -/* Automatic NUMA balancing needs to be distinguishable from swap entries */ -#define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 2) -#else #define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 1) -#endif #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS) diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 3e0230c94cff..8c7c10802e9c 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -27,14 +27,6 @@ #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ #define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */ -/* - * Swap offsets on configurations that allow automatic NUMA balancing use the - * bits after _PAGE_BIT_GLOBAL. To uniquely distinguish NUMA hinting PTEs from - * swap entries, we use the first bit after _PAGE_BIT_GLOBAL and shrink the - * maximum possible swap space from 16TB to 8TB. - */ -#define _PAGE_BIT_NUMA (_PAGE_BIT_GLOBAL+1) - /* If _PAGE_BIT_PRESENT is clear, we use these: */ /* - if the user mapped it with PROT_NONE; pte_present gives true */ #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL @@ -75,21 +67,6 @@ #define _PAGE_SOFT_DIRTY (_AT(pteval_t, 0)) #endif -/* - * _PAGE_NUMA distinguishes between a numa hinting minor fault and a page - * that is not present. The hinting fault gathers numa placement statistics - * (see pte_numa()). The bit is always zero when the PTE is not present. - * - * The bit picked must be always zero when the pmd is present and not - * present, so that we don't lose information when we set it while - * atomically clearing the present bit. - */ -#ifdef CONFIG_NUMA_BALANCING -#define _PAGE_NUMA (_AT(pteval_t, 1) << _PAGE_BIT_NUMA) -#else -#define _PAGE_NUMA (_AT(pteval_t, 0)) -#endif - /* * Tracking soft dirty bit when a page goes to a swap is tricky. * We need a bit which can be stored in pte _and_ not conflict @@ -122,8 +99,8 @@ /* Set of bits not changed in pte_modify */ #define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \ - _PAGE_SOFT_DIRTY | _PAGE_NUMA) -#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE | _PAGE_NUMA) + _PAGE_SOFT_DIRTY) +#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE) /* * The cache modes defined here are used to translate between pure SW usage @@ -324,20 +301,6 @@ static inline pteval_t pte_flags(pte_t pte) return native_pte_val(pte) & PTE_FLAGS_MASK; } -#ifdef CONFIG_NUMA_BALANCING -/* Set of bits that distinguishes present, prot_none and numa ptes */ -#define _PAGE_NUMA_MASK (_PAGE_NUMA|_PAGE_PROTNONE|_PAGE_PRESENT) -static inline pteval_t ptenuma_flags(pte_t pte) -{ - return pte_flags(pte) & _PAGE_NUMA_MASK; -} - -static inline pmdval_t pmdnuma_flags(pmd_t pmd) -{ - return pmd_flags(pmd) & _PAGE_NUMA_MASK; -} -#endif /* CONFIG_NUMA_BALANCING */ - #define pgprot_val(x) ((x).pgprot) #define __pgprot(x) ((pgprot_t) { (x) } ) diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 067922c06c29..4d46085c1b90 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -244,10 +244,6 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b) # define pte_accessible(mm, pte) ((void)(pte), 1) #endif -#ifndef pte_present_nonuma -#define pte_present_nonuma(pte) pte_present(pte) -#endif - #ifndef flush_tlb_fix_spurious_fault #define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address) #endif @@ -693,157 +689,6 @@ static inline int pmd_protnone(pmd_t pmd) } #endif /* CONFIG_NUMA_BALANCING */ -#ifdef CONFIG_NUMA_BALANCING -/* - * _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that - * is protected for PROT_NONE and a NUMA hinting fault entry. If the - * architecture defines __PAGE_PROTNONE then it should take that into account - * but those that do not can rely on the fact that the NUMA hinting scanner - * skips inaccessible VMAs. - * - * pte/pmd_present() returns true if pte/pmd_numa returns true. Page - * fault triggers on those regions if pte/pmd_numa returns true - * (because _PAGE_PRESENT is not set). - */ -#ifndef pte_numa -static inline int pte_numa(pte_t pte) -{ - return ptenuma_flags(pte) == _PAGE_NUMA; -} -#endif - -#ifndef pmd_numa -static inline int pmd_numa(pmd_t pmd) -{ - return pmdnuma_flags(pmd) == _PAGE_NUMA; -} -#endif - -/* - * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically - * because they're called by the NUMA hinting minor page fault. If we - * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler - * would be forced to set it later while filling the TLB after we - * return to userland. That would trigger a second write to memory - * that we optimize away by setting _PAGE_ACCESSED here. - */ -#ifndef pte_mknonnuma -static inline pte_t pte_mknonnuma(pte_t pte) -{ - pteval_t val = pte_val(pte); - - val &= ~_PAGE_NUMA; - val |= (_PAGE_PRESENT|_PAGE_ACCESSED); - return __pte(val); -} -#endif - -#ifndef pmd_mknonnuma -static inline pmd_t pmd_mknonnuma(pmd_t pmd) -{ - pmdval_t val = pmd_val(pmd); - - val &= ~_PAGE_NUMA; - val |= (_PAGE_PRESENT|_PAGE_ACCESSED); - - return __pmd(val); -} -#endif - -#ifndef pte_mknuma -static inline pte_t pte_mknuma(pte_t pte) -{ - pteval_t val = pte_val(pte); - - VM_BUG_ON(!(val & _PAGE_PRESENT)); - - val &= ~_PAGE_PRESENT; - val |= _PAGE_NUMA; - - return __pte(val); -} -#endif - -#ifndef ptep_set_numa -static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr, - pte_t *ptep) -{ - pte_t ptent = *ptep; - - ptent = pte_mknuma(ptent); - set_pte_at(mm, addr, ptep, ptent); - return; -} -#endif - -#ifndef pmd_mknuma -static inline pmd_t pmd_mknuma(pmd_t pmd) -{ - pmdval_t val = pmd_val(pmd); - - val &= ~_PAGE_PRESENT; - val |= _PAGE_NUMA; - - return __pmd(val); -} -#endif - -#ifndef pmdp_set_numa -static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr, - pmd_t *pmdp) -{ - pmd_t pmd = *pmdp; - - pmd = pmd_mknuma(pmd); - set_pmd_at(mm, addr, pmdp, pmd); - return; -} -#endif -#else -static inline int pmd_numa(pmd_t pmd) -{ - return 0; -} - -static inline int pte_numa(pte_t pte) -{ - return 0; -} - -static inline pte_t pte_mknonnuma(pte_t pte) -{ - return pte; -} - -static inline pmd_t pmd_mknonnuma(pmd_t pmd) -{ - return pmd; -} - -static inline pte_t pte_mknuma(pte_t pte) -{ - return pte; -} - -static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr, - pte_t *ptep) -{ - return; -} - - -static inline pmd_t pmd_mknuma(pmd_t pmd) -{ - return pmd; -} - -static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr, - pmd_t *pmdp) -{ - return ; -} -#endif /* CONFIG_NUMA_BALANCING */ - #endif /* CONFIG_MMU */ #endif /* !__ASSEMBLY__ */ diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 831a3168ab35..cedf3d3c373f 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -54,7 +54,7 @@ static inline pgoff_t swp_offset(swp_entry_t entry) /* check whether a pte points to a swap entry */ static inline int is_swap_pte(pte_t pte) { - return !pte_none(pte) && !pte_present_nonuma(pte); + return !pte_none(pte) && !pte_present(pte); } #endif -- cgit v1.2.3 From e944fd67b625c02bda4a78ddf85e413c5e401474 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Feb 2015 14:58:35 -0800 Subject: mm: numa: do not trap faults on the huge zero page Faults on the huge zero page are pointless and there is a BUG_ON to catch them during fault time. This patch reintroduces a check that avoids marking the zero page PAGE_NONE. Signed-off-by: Mel Gorman Cc: Aneesh Kumar K.V Cc: Benjamin Herrenschmidt Cc: Dave Jones Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill Shutemov Cc: Linus Torvalds Cc: Paul Mackerras Cc: Rik van Riel Cc: Sasha Levin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/huge_mm.h | 3 ++- mm/huge_memory.c | 13 ++++++++++++- mm/memory.c | 1 - mm/mprotect.c | 14 +++++++++++++- 4 files changed, 27 insertions(+), 4 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 062bd252e994..f10b20f05159 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -31,7 +31,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long new_addr, unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd); extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot); + unsigned long addr, pgprot_t newprot, + int prot_numa); enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_FLAG, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index cb9b3e847dac..8e791a3db6b6 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1471,7 +1471,7 @@ out: * - HPAGE_PMD_NR is protections changed and TLB flush necessary */ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot) + unsigned long addr, pgprot_t newprot, int prot_numa) { struct mm_struct *mm = vma->vm_mm; spinlock_t *ptl; @@ -1479,6 +1479,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pmd_t entry; + + /* + * Avoid trapping faults against the zero page. The read-only + * data is likely to be read-cached on the local CPU and + * local/remote hits to the zero page are not interesting. + */ + if (prot_numa && is_huge_zero_pmd(*pmd)) { + spin_unlock(ptl); + return 0; + } + ret = 1; entry = pmdp_get_and_clear_notify(mm, addr, pmd); entry = pmd_modify(entry, newprot); diff --git a/mm/memory.c b/mm/memory.c index d7921760cf79..bf244f56b05a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3040,7 +3040,6 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, pte_unmap_unlock(ptep, ptl); return 0; } - BUG_ON(is_zero_pfn(page_to_pfn(page))); /* * Avoid grouping on DSO/COW pages in specific and RO pages diff --git a/mm/mprotect.c b/mm/mprotect.c index 76824d73380d..dd599fc235c2 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -76,6 +76,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, if (pte_present(oldpte)) { pte_t ptent; + /* + * Avoid trapping faults against the zero or KSM + * pages. See similar comment in change_huge_pmd. + */ + if (prot_numa) { + struct page *page; + + page = vm_normal_page(vma, addr, oldpte); + if (!page || PageKsm(page)) + continue; + } + ptent = ptep_modify_prot_start(mm, addr, pte); ptent = pte_modify(ptent, newprot); @@ -142,7 +154,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, split_huge_page_pmd(vma, addr, pmd); else { int nr_ptes = change_huge_pmd(vma, pmd, addr, - newprot); + newprot, prot_numa); if (nr_ptes) { if (nr_ptes == HPAGE_PMD_NR) { -- cgit v1.2.3 From c819f37e7e174d68cd013abf33725b4e07ced023 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Feb 2015 14:58:38 -0800 Subject: x86: mm: restore original pte_special check Commit b38af4721f59 ("x86,mm: fix pte_special versus pte_numa") adjusted the pte_special check to take into account that a special pte had SPECIAL and neither PRESENT nor PROTNONE. Now that NUMA hinting PTEs are no longer modifying _PAGE_PRESENT it should be safe to restore the original pte_special behaviour. Signed-off-by: Mel Gorman Cc: Aneesh Kumar K.V Cc: Benjamin Herrenschmidt Cc: Dave Jones Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill Shutemov Cc: Linus Torvalds Cc: Paul Mackerras Cc: Rik van Riel Cc: Sasha Levin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/x86/include/asm/pgtable.h | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 34d42a7d5595..67fc3d2b0aab 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -132,13 +132,7 @@ static inline int pte_exec(pte_t pte) static inline int pte_special(pte_t pte) { - /* - * See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h. - * On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 == - * __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL. - */ - return (pte_flags(pte) & _PAGE_SPECIAL) && - (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE)); + return pte_flags(pte) & _PAGE_SPECIAL; } static inline unsigned long pte_pfn(pte_t pte) -- cgit v1.2.3 From c0e7cad9f2390087b53e26e7b98958d8793ee02d Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Feb 2015 14:58:41 -0800 Subject: mm: numa: add paranoid check around pte_protnone_numa pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are complete. Treating a real PROT_NONE PTE as a NUMA hinting fault is going to result in strangeness so add a check for it. BUG_ON looks like overkill but if this is hit then it's a serious bug that could result in corruption so do not even try recovering. It would have been more comprehensive to check VMA flags in pte_protnone_numa but it would have made the API ugly just for a debugging check. Signed-off-by: Mel Gorman Cc: Aneesh Kumar K.V Cc: Benjamin Herrenschmidt Cc: Dave Jones Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill Shutemov Cc: Linus Torvalds Cc: Paul Mackerras Cc: Rik van Riel Cc: Sasha Levin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 3 +++ mm/memory.c | 3 +++ 2 files changed, 6 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8e791a3db6b6..8e07342b52c0 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1262,6 +1262,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, bool migrated = false; int flags = 0; + /* A PROT_NONE fault should not end up here */ + BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))); + ptl = pmd_lock(mm, pmdp); if (unlikely(!pmd_same(pmd, *pmdp))) goto out_unlock; diff --git a/mm/memory.c b/mm/memory.c index bf244f56b05a..f7886ab036e7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3013,6 +3013,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, bool migrated = false; int flags = 0; + /* A PROT_NONE fault should not end up here */ + BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))); + /* * The "pte" at this point cannot be used safely without * validation through pte_unmap_same(). It's of NUMA type but -- cgit v1.2.3 From 10c1045f28e86ac90589a188f0be2d7a4347efdf Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Feb 2015 14:58:44 -0800 Subject: mm: numa: avoid unnecessary TLB flushes when setting NUMA hinting entries If a PTE or PMD is already marked NUMA when scanning to mark entries for NUMA hinting then it is not necessary to update the entry and incur a TLB flush penalty. Avoid the avoidhead where possible. Signed-off-by: Mel Gorman Cc: Aneesh Kumar K.V Cc: Benjamin Herrenschmidt Cc: Dave Jones Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill Shutemov Cc: Linus Torvalds Cc: Paul Mackerras Cc: Rik van Riel Cc: Sasha Levin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 14 ++++++++------ mm/mprotect.c | 4 ++++ 2 files changed, 12 insertions(+), 6 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8e07342b52c0..fc00c8cb5a82 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1493,12 +1493,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, return 0; } - ret = 1; - entry = pmdp_get_and_clear_notify(mm, addr, pmd); - entry = pmd_modify(entry, newprot); - ret = HPAGE_PMD_NR; - set_pmd_at(mm, addr, pmd, entry); - BUG_ON(pmd_write(entry)); + if (!prot_numa || !pmd_protnone(*pmd)) { + ret = 1; + entry = pmdp_get_and_clear_notify(mm, addr, pmd); + entry = pmd_modify(entry, newprot); + ret = HPAGE_PMD_NR; + set_pmd_at(mm, addr, pmd, entry); + BUG_ON(pmd_write(entry)); + } spin_unlock(ptl); } diff --git a/mm/mprotect.c b/mm/mprotect.c index dd599fc235c2..44727811bf4c 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -86,6 +86,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, page = vm_normal_page(vma, addr, oldpte); if (!page || PageKsm(page)) continue; + + /* Avoid TLB flush if possible */ + if (pte_protnone(oldpte)) + continue; } ptent = ptep_modify_prot_start(mm, addr, pte); -- cgit v1.2.3 From 503c358cf1925853195ee39ec437e51138bbb7df Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:58:47 -0800 Subject: list_lru: introduce list_lru_shrink_{count,walk} Kmem accounting of memcg is unusable now, because it lacks slab shrinker support. That means when we hit the limit we will get ENOMEM w/o any chance to recover. What we should do then is to call shrink_slab, which would reclaim old inode/dentry caches from this cgroup. This is what this patch set is intended to do. Basically, it does two things. First, it introduces the notion of per-memcg slab shrinker. A shrinker that wants to reclaim objects per cgroup should mark itself as SHRINKER_MEMCG_AWARE. Then it will be passed the memory cgroup to scan from in shrink_control->memcg. For such shrinkers shrink_slab iterates over the whole cgroup subtree under the target cgroup and calls the shrinker for each kmem-active memory cgroup. Secondly, this patch set makes the list_lru structure per-memcg. It's done transparently to list_lru users - everything they have to do is to tell list_lru_init that they want memcg-aware list_lru. Then the list_lru will automatically distribute objects among per-memcg lists basing on which cgroup the object is accounted to. This way to make FS shrinkers (icache, dcache) memcg-aware we only need to make them use memcg-aware list_lru, and this is what this patch set does. As before, this patch set only enables per-memcg kmem reclaim when the pressure goes from memory.limit, not from memory.kmem.limit. Handling memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and it is still unclear whether we will have this knob in the unified hierarchy. This patch (of 9): NUMA aware slab shrinkers use the list_lru structure to distribute objects coming from different NUMA nodes to different lists. Whenever such a shrinker needs to count or scan objects from a particular node, it issues commands like this: count = list_lru_count_node(lru, sc->nid); freed = list_lru_walk_node(lru, sc->nid, isolate_func, isolate_arg, &sc->nr_to_scan); where sc is an instance of the shrink_control structure passed to it from vmscan. To simplify this, let's add special list_lru functions to be used by shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which consolidate the nid and nr_to_scan arguments in the shrink_control structure. This will also allow us to avoid patching shrinkers that use list_lru when we make shrink_slab() per-memcg - all we will have to do is extend the shrink_control structure to include the target memcg and make list_lru_shrink_{count,walk} handle this appropriately. Signed-off-by: Vladimir Davydov Suggested-by: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/dcache.c | 14 ++++++-------- fs/gfs2/quota.c | 6 +++--- fs/inode.c | 7 +++---- fs/internal.h | 7 +++---- fs/super.c | 24 +++++++++++------------- fs/xfs/xfs_buf.c | 7 +++---- fs/xfs/xfs_qm.c | 7 +++---- include/linux/list_lru.h | 16 ++++++++++++++++ mm/workingset.c | 6 +++--- 9 files changed, 51 insertions(+), 43 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index e368d4f412f9..56c5da89f58a 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -930,24 +930,22 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) /** * prune_dcache_sb - shrink the dcache * @sb: superblock - * @nr_to_scan : number of entries to try to free - * @nid: which node to scan for freeable entities + * @sc: shrink control, passed to list_lru_shrink_walk() * - * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is - * done when we need more memory an called from the superblock shrinker + * Attempt to shrink the superblock dcache LRU by @sc->nr_to_scan entries. This + * is done when we need more memory and called from the superblock shrinker * function. * * This function may fail to free any resources if all the dentries are in * use. */ -long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan, - int nid) +long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc) { LIST_HEAD(dispose); long freed; - freed = list_lru_walk_node(&sb->s_dentry_lru, nid, dentry_lru_isolate, - &dispose, &nr_to_scan); + freed = list_lru_shrink_walk(&sb->s_dentry_lru, sc, + dentry_lru_isolate, &dispose); shrink_dentry_list(&dispose); return freed; } diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c index 3e193cb36996..c15d6b216d0b 100644 --- a/fs/gfs2/quota.c +++ b/fs/gfs2/quota.c @@ -171,8 +171,8 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink, if (!(sc->gfp_mask & __GFP_FS)) return SHRINK_STOP; - freed = list_lru_walk_node(&gfs2_qd_lru, sc->nid, gfs2_qd_isolate, - &dispose, &sc->nr_to_scan); + freed = list_lru_shrink_walk(&gfs2_qd_lru, sc, + gfs2_qd_isolate, &dispose); gfs2_qd_dispose(&dispose); @@ -182,7 +182,7 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink, static unsigned long gfs2_qd_shrink_count(struct shrinker *shrink, struct shrink_control *sc) { - return vfs_pressure_ratio(list_lru_count_node(&gfs2_qd_lru, sc->nid)); + return vfs_pressure_ratio(list_lru_shrink_count(&gfs2_qd_lru, sc)); } struct shrinker gfs2_qd_shrinker = { diff --git a/fs/inode.c b/fs/inode.c index 3a53b1da3fb8..524a32c2b0c6 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -751,14 +751,13 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) * to trim from the LRU. Inodes to be freed are moved to a temporary list and * then are freed outside inode_lock by dispose_list(). */ -long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan, - int nid) +long prune_icache_sb(struct super_block *sb, struct shrink_control *sc) { LIST_HEAD(freeable); long freed; - freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate, - &freeable, &nr_to_scan); + freed = list_lru_shrink_walk(&sb->s_inode_lru, sc, + inode_lru_isolate, &freeable); dispose_list(&freeable); return freed; } diff --git a/fs/internal.h b/fs/internal.h index e9a61fe67575..d92c346a793d 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -14,6 +14,7 @@ struct file_system_type; struct linux_binprm; struct path; struct mount; +struct shrink_control; /* * block_dev.c @@ -111,8 +112,7 @@ extern int open_check_o_direct(struct file *f); * inode.c */ extern spinlock_t inode_sb_list_lock; -extern long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan, - int nid); +extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc); extern void inode_add_lru(struct inode *inode); /* @@ -129,8 +129,7 @@ extern int invalidate_inodes(struct super_block *, bool); */ extern struct dentry *__d_alloc(struct super_block *, const struct qstr *); extern int d_set_mounted(struct dentry *dentry); -extern long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan, - int nid); +extern long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc); /* * read_write.c diff --git a/fs/super.c b/fs/super.c index eae088f6aaae..4554ac257647 100644 --- a/fs/super.c +++ b/fs/super.c @@ -77,8 +77,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink, if (sb->s_op->nr_cached_objects) fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid); - inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid); - dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid); + inodes = list_lru_shrink_count(&sb->s_inode_lru, sc); + dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc); total_objects = dentries + inodes + fs_objects + 1; if (!total_objects) total_objects = 1; @@ -86,20 +86,20 @@ static unsigned long super_cache_scan(struct shrinker *shrink, /* proportion the scan between the caches */ dentries = mult_frac(sc->nr_to_scan, dentries, total_objects); inodes = mult_frac(sc->nr_to_scan, inodes, total_objects); + fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects); /* * prune the dcache first as the icache is pinned by it, then * prune the icache, followed by the filesystem specific caches */ - freed = prune_dcache_sb(sb, dentries, sc->nid); - freed += prune_icache_sb(sb, inodes, sc->nid); + sc->nr_to_scan = dentries; + freed = prune_dcache_sb(sb, sc); + sc->nr_to_scan = inodes; + freed += prune_icache_sb(sb, sc); - if (fs_objects) { - fs_objects = mult_frac(sc->nr_to_scan, fs_objects, - total_objects); + if (fs_objects) freed += sb->s_op->free_cached_objects(sb, fs_objects, sc->nid); - } drop_super(sb); return freed; @@ -118,17 +118,15 @@ static unsigned long super_cache_count(struct shrinker *shrink, * scalability bottleneck. The counts could get updated * between super_cache_count and super_cache_scan anyway. * Call to super_cache_count with shrinker_rwsem held - * ensures the safety of call to list_lru_count_node() and + * ensures the safety of call to list_lru_shrink_count() and * s_op->nr_cached_objects(). */ if (sb->s_op && sb->s_op->nr_cached_objects) total_objects = sb->s_op->nr_cached_objects(sb, sc->nid); - total_objects += list_lru_count_node(&sb->s_dentry_lru, - sc->nid); - total_objects += list_lru_count_node(&sb->s_inode_lru, - sc->nid); + total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc); + total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc); total_objects = vfs_pressure_ratio(total_objects); return total_objects; diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index bb502a391792..15c9d224c721 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -1583,10 +1583,9 @@ xfs_buftarg_shrink_scan( struct xfs_buftarg, bt_shrinker); LIST_HEAD(dispose); unsigned long freed; - unsigned long nr_to_scan = sc->nr_to_scan; - freed = list_lru_walk_node(&btp->bt_lru, sc->nid, xfs_buftarg_isolate, - &dispose, &nr_to_scan); + freed = list_lru_shrink_walk(&btp->bt_lru, sc, + xfs_buftarg_isolate, &dispose); while (!list_empty(&dispose)) { struct xfs_buf *bp; @@ -1605,7 +1604,7 @@ xfs_buftarg_shrink_count( { struct xfs_buftarg *btp = container_of(shrink, struct xfs_buftarg, bt_shrinker); - return list_lru_count_node(&btp->bt_lru, sc->nid); + return list_lru_shrink_count(&btp->bt_lru, sc); } void diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c index 3e8186279541..4f4b1274e144 100644 --- a/fs/xfs/xfs_qm.c +++ b/fs/xfs/xfs_qm.c @@ -523,7 +523,6 @@ xfs_qm_shrink_scan( struct xfs_qm_isolate isol; unsigned long freed; int error; - unsigned long nr_to_scan = sc->nr_to_scan; if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT)) return 0; @@ -531,8 +530,8 @@ xfs_qm_shrink_scan( INIT_LIST_HEAD(&isol.buffers); INIT_LIST_HEAD(&isol.dispose); - freed = list_lru_walk_node(&qi->qi_lru, sc->nid, xfs_qm_dquot_isolate, &isol, - &nr_to_scan); + freed = list_lru_shrink_walk(&qi->qi_lru, sc, + xfs_qm_dquot_isolate, &isol); error = xfs_buf_delwri_submit(&isol.buffers); if (error) @@ -557,7 +556,7 @@ xfs_qm_shrink_count( struct xfs_quotainfo *qi = container_of(shrink, struct xfs_quotainfo, qi_shrinker); - return list_lru_count_node(&qi->qi_lru, sc->nid); + return list_lru_shrink_count(&qi->qi_lru, sc); } /* diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index f3434533fbf8..f500a2e39b13 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -9,6 +9,7 @@ #include #include +#include /* list_lru_walk_cb has to always return one of those */ enum lru_status { @@ -81,6 +82,13 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item); * Callers that want such a guarantee need to provide an outer lock. */ unsigned long list_lru_count_node(struct list_lru *lru, int nid); + +static inline unsigned long list_lru_shrink_count(struct list_lru *lru, + struct shrink_control *sc) +{ + return list_lru_count_node(lru, sc->nid); +} + static inline unsigned long list_lru_count(struct list_lru *lru) { long count = 0; @@ -119,6 +127,14 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate, void *cb_arg, unsigned long *nr_to_walk); +static inline unsigned long +list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc, + list_lru_walk_cb isolate, void *cb_arg) +{ + return list_lru_walk_node(lru, sc->nid, isolate, cb_arg, + &sc->nr_to_scan); +} + static inline unsigned long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate, void *cb_arg, unsigned long nr_to_walk) diff --git a/mm/workingset.c b/mm/workingset.c index f7216fa7da27..d4fa7fb10a52 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -275,7 +275,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, /* list_lru lock nests inside IRQ-safe mapping->tree_lock */ local_irq_disable(); - shadow_nodes = list_lru_count_node(&workingset_shadow_nodes, sc->nid); + shadow_nodes = list_lru_shrink_count(&workingset_shadow_nodes, sc); local_irq_enable(); pages = node_present_pages(sc->nid); @@ -376,8 +376,8 @@ static unsigned long scan_shadow_nodes(struct shrinker *shrinker, /* list_lru lock nests inside IRQ-safe mapping->tree_lock */ local_irq_disable(); - ret = list_lru_walk_node(&workingset_shadow_nodes, sc->nid, - shadow_lru_isolate, NULL, &sc->nr_to_scan); + ret = list_lru_shrink_walk(&workingset_shadow_nodes, sc, + shadow_lru_isolate, NULL); local_irq_enable(); return ret; } -- cgit v1.2.3 From 4101b624352fddb5ed72e7a1b6f8be8cffaa20fa Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:58:51 -0800 Subject: fs: consolidate {nr,free}_cached_objects args in shrink_control We are going to make FS shrinkers memcg-aware. To achieve that, we will have to pass the memcg to scan to the nr_cached_objects and free_cached_objects VFS methods, which currently take only the NUMA node to scan. Since the shrink_control structure already holds the node, and the memcg to scan will be added to it when we introduce memcg-aware vmscan, let us consolidate the methods' arguments in this structure to keep things clean. Signed-off-by: Vladimir Davydov Suggested-by: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/super.c | 12 ++++++------ fs/xfs/xfs_super.c | 7 +++---- include/linux/fs.h | 6 ++++-- 3 files changed, 13 insertions(+), 12 deletions(-) diff --git a/fs/super.c b/fs/super.c index 4554ac257647..a2b735a42e74 100644 --- a/fs/super.c +++ b/fs/super.c @@ -75,7 +75,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink, return SHRINK_STOP; if (sb->s_op->nr_cached_objects) - fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid); + fs_objects = sb->s_op->nr_cached_objects(sb, sc); inodes = list_lru_shrink_count(&sb->s_inode_lru, sc); dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc); @@ -97,9 +97,10 @@ static unsigned long super_cache_scan(struct shrinker *shrink, sc->nr_to_scan = inodes; freed += prune_icache_sb(sb, sc); - if (fs_objects) - freed += sb->s_op->free_cached_objects(sb, fs_objects, - sc->nid); + if (fs_objects) { + sc->nr_to_scan = fs_objects; + freed += sb->s_op->free_cached_objects(sb, sc); + } drop_super(sb); return freed; @@ -122,8 +123,7 @@ static unsigned long super_cache_count(struct shrinker *shrink, * s_op->nr_cached_objects(). */ if (sb->s_op && sb->s_op->nr_cached_objects) - total_objects = sb->s_op->nr_cached_objects(sb, - sc->nid); + total_objects = sb->s_op->nr_cached_objects(sb, sc); total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc); total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc); diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index f2449fd86926..8fcc4ccc5c79 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1537,7 +1537,7 @@ xfs_fs_mount( static long xfs_fs_nr_cached_objects( struct super_block *sb, - int nid) + struct shrink_control *sc) { return xfs_reclaim_inodes_count(XFS_M(sb)); } @@ -1545,10 +1545,9 @@ xfs_fs_nr_cached_objects( static long xfs_fs_free_cached_objects( struct super_block *sb, - long nr_to_scan, - int nid) + struct shrink_control *sc) { - return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan); + return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan); } static const struct super_operations xfs_super_operations = { diff --git a/include/linux/fs.h b/include/linux/fs.h index cdcb1e9d9613..a20d6586e517 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1635,8 +1635,10 @@ struct super_operations { struct dquot **(*get_dquots)(struct inode *); #endif int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t); - long (*nr_cached_objects)(struct super_block *, int); - long (*free_cached_objects)(struct super_block *, long, int); + long (*nr_cached_objects)(struct super_block *, + struct shrink_control *); + long (*free_cached_objects)(struct super_block *, + struct shrink_control *); }; /* -- cgit v1.2.3 From cb731d6c62bbc2f890b08ea3d0386d5dad887326 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:58:54 -0800 Subject: vmscan: per memory cgroup slab shrinkers This patch adds SHRINKER_MEMCG_AWARE flag. If a shrinker has this flag set, it will be called per memory cgroup. The memory cgroup to scan objects from is passed in shrink_control->memcg. If the memory cgroup is NULL, a memcg aware shrinker is supposed to scan objects from the global list. Unaware shrinkers are only called on global pressure with memcg=NULL. Signed-off-by: Vladimir Davydov Cc: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/drop_caches.c | 14 -------- include/linux/memcontrol.h | 7 ++++ include/linux/mm.h | 5 ++- include/linux/shrinker.h | 6 +++- mm/memcontrol.c | 2 +- mm/memory-failure.c | 11 ++---- mm/vmscan.c | 85 ++++++++++++++++++++++++++++++++++------------ 7 files changed, 80 insertions(+), 50 deletions(-) diff --git a/fs/drop_caches.c b/fs/drop_caches.c index 2bc2c87f35e7..5718cb9f7273 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -37,20 +37,6 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused) iput(toput_inode); } -static void drop_slab(void) -{ - int nr_objects; - - do { - int nid; - - nr_objects = 0; - for_each_online_node(nid) - nr_objects += shrink_node_slabs(GFP_KERNEL, nid, - 1000, 1000); - } while (nr_objects > 10); -} - int drop_caches_sysctl_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 6cfd934c7c9b..54992fe0959f 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -413,6 +413,8 @@ static inline bool memcg_kmem_enabled(void) return static_key_false(&memcg_kmem_enabled_key); } +bool memcg_kmem_is_active(struct mem_cgroup *memcg); + /* * In general, we'll do everything in our power to not incur in any overhead * for non-memcg users for the kmem functions. Not even a function call, if we @@ -542,6 +544,11 @@ static inline bool memcg_kmem_enabled(void) return false; } +static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg) +{ + return false; +} + static inline bool memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order) { diff --git a/include/linux/mm.h b/include/linux/mm.h index a4d24f3c5430..af4ff88a11e0 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2168,9 +2168,8 @@ int drop_caches_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); #endif -unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid, - unsigned long nr_scanned, - unsigned long nr_eligible); +void drop_slab(void); +void drop_slab_node(int nid); #ifndef CONFIG_MMU #define randomize_va_space 0 diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h index f4aee75f00b1..4fcacd915d45 100644 --- a/include/linux/shrinker.h +++ b/include/linux/shrinker.h @@ -20,6 +20,9 @@ struct shrink_control { /* current node being shrunk (for NUMA aware shrinkers) */ int nid; + + /* current memcg being shrunk (for memcg aware shrinkers) */ + struct mem_cgroup *memcg; }; #define SHRINK_STOP (~0UL) @@ -61,7 +64,8 @@ struct shrinker { #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */ /* Flags */ -#define SHRINKER_NUMA_AWARE (1 << 0) +#define SHRINKER_NUMA_AWARE (1 << 0) +#define SHRINKER_MEMCG_AWARE (1 << 1) extern int register_shrinker(struct shrinker *); extern void unregister_shrinker(struct shrinker *); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 095c1f96fbec..3c2a1a8286ac 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -352,7 +352,7 @@ struct mem_cgroup { }; #ifdef CONFIG_MEMCG_KMEM -static bool memcg_kmem_is_active(struct mem_cgroup *memcg) +bool memcg_kmem_is_active(struct mem_cgroup *memcg) { return memcg->kmemcg_id >= 0; } diff --git a/mm/memory-failure.c b/mm/memory-failure.c index feb803bf3443..1a735fad2a13 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -242,15 +242,8 @@ void shake_page(struct page *p, int access) * Only call shrink_node_slabs here (which would also shrink * other caches) if access is not potentially fatal. */ - if (access) { - int nr; - int nid = page_to_nid(p); - do { - nr = shrink_node_slabs(GFP_KERNEL, nid, 1000, 1000); - if (page_count(p) == 1) - break; - } while (nr > 10); - } + if (access) + drop_slab_node(page_to_nid(p)); } EXPORT_SYMBOL_GPL(shake_page); diff --git a/mm/vmscan.c b/mm/vmscan.c index 8e645ee52045..803886b8e353 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -232,10 +232,10 @@ EXPORT_SYMBOL(unregister_shrinker); #define SHRINK_BATCH 128 -static unsigned long shrink_slabs(struct shrink_control *shrinkctl, - struct shrinker *shrinker, - unsigned long nr_scanned, - unsigned long nr_eligible) +static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, + struct shrinker *shrinker, + unsigned long nr_scanned, + unsigned long nr_eligible) { unsigned long freed = 0; unsigned long long delta; @@ -344,9 +344,10 @@ static unsigned long shrink_slabs(struct shrink_control *shrinkctl, } /** - * shrink_node_slabs - shrink slab caches of a given node + * shrink_slab - shrink slab caches * @gfp_mask: allocation context * @nid: node whose slab caches to target + * @memcg: memory cgroup whose slab caches to target * @nr_scanned: pressure numerator * @nr_eligible: pressure denominator * @@ -355,6 +356,12 @@ static unsigned long shrink_slabs(struct shrink_control *shrinkctl, * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set, * unaware shrinkers will receive a node id of 0 instead. * + * @memcg specifies the memory cgroup to target. If it is not NULL, + * only shrinkers with SHRINKER_MEMCG_AWARE set will be called to scan + * objects from the memory cgroup specified. Otherwise all shrinkers + * are called, and memcg aware shrinkers are supposed to scan the + * global list then. + * * @nr_scanned and @nr_eligible form a ratio that indicate how much of * the available objects should be scanned. Page reclaim for example * passes the number of pages scanned and the number of pages on the @@ -365,13 +372,17 @@ static unsigned long shrink_slabs(struct shrink_control *shrinkctl, * * Returns the number of reclaimed slab objects. */ -unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid, - unsigned long nr_scanned, - unsigned long nr_eligible) +static unsigned long shrink_slab(gfp_t gfp_mask, int nid, + struct mem_cgroup *memcg, + unsigned long nr_scanned, + unsigned long nr_eligible) { struct shrinker *shrinker; unsigned long freed = 0; + if (memcg && !memcg_kmem_is_active(memcg)) + return 0; + if (nr_scanned == 0) nr_scanned = SWAP_CLUSTER_MAX; @@ -390,12 +401,16 @@ unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid, struct shrink_control sc = { .gfp_mask = gfp_mask, .nid = nid, + .memcg = memcg, }; + if (memcg && !(shrinker->flags & SHRINKER_MEMCG_AWARE)) + continue; + if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) sc.nid = 0; - freed += shrink_slabs(&sc, shrinker, nr_scanned, nr_eligible); + freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible); } up_read(&shrinker_rwsem); @@ -404,6 +419,29 @@ out: return freed; } +void drop_slab_node(int nid) +{ + unsigned long freed; + + do { + struct mem_cgroup *memcg = NULL; + + freed = 0; + do { + freed += shrink_slab(GFP_KERNEL, nid, memcg, + 1000, 1000); + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL); + } while (freed > 10); +} + +void drop_slab(void) +{ + int nid; + + for_each_online_node(nid) + drop_slab_node(nid); +} + static inline int is_page_cache_freeable(struct page *page) { /* @@ -2276,6 +2314,7 @@ static inline bool should_continue_reclaim(struct zone *zone, static bool shrink_zone(struct zone *zone, struct scan_control *sc, bool is_classzone) { + struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long nr_reclaimed, nr_scanned; bool reclaimable = false; @@ -2294,6 +2333,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, memcg = mem_cgroup_iter(root, NULL, &reclaim); do { unsigned long lru_pages; + unsigned long scanned; struct lruvec *lruvec; int swappiness; @@ -2305,10 +2345,16 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, lruvec = mem_cgroup_zone_lruvec(zone, memcg); swappiness = mem_cgroup_swappiness(memcg); + scanned = sc->nr_scanned; shrink_lruvec(lruvec, swappiness, sc, &lru_pages); zone_lru_pages += lru_pages; + if (memcg && is_classzone) + shrink_slab(sc->gfp_mask, zone_to_nid(zone), + memcg, sc->nr_scanned - scanned, + lru_pages); + /* * Direct reclaim and kswapd have to scan all memory * cgroups to fulfill the overall scan target for the @@ -2330,19 +2376,14 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, * Shrink the slab caches in the same proportion that * the eligible LRU pages were scanned. */ - if (global_reclaim(sc) && is_classzone) { - struct reclaim_state *reclaim_state; - - shrink_node_slabs(sc->gfp_mask, zone_to_nid(zone), - sc->nr_scanned - nr_scanned, - zone_lru_pages); - - reclaim_state = current->reclaim_state; - if (reclaim_state) { - sc->nr_reclaimed += - reclaim_state->reclaimed_slab; - reclaim_state->reclaimed_slab = 0; - } + if (global_reclaim(sc) && is_classzone) + shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL, + sc->nr_scanned - nr_scanned, + zone_lru_pages); + + if (reclaim_state) { + sc->nr_reclaimed += reclaim_state->reclaimed_slab; + reclaim_state->reclaimed_slab = 0; } vmpressure(sc->gfp_mask, sc->target_mem_cgroup, -- cgit v1.2.3 From dbcf73e26cd0b3d66e6db65ab595e664a55e58ff Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:58:57 -0800 Subject: memcg: rename some cache id related variables memcg_limited_groups_array_size, which defines the size of memcg_caches arrays, sounds rather cumbersome. Also it doesn't point anyhow that it's related to kmem/caches stuff. So let's rename it to memcg_nr_cache_ids. It's concise and points us directly to memcg_cache_id. Also, rename kmem_limited_groups to memcg_cache_ida. Signed-off-by: Vladimir Davydov Cc: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 4 ++-- mm/memcontrol.c | 19 +++++++++---------- mm/slab_common.c | 4 ++-- 3 files changed, 13 insertions(+), 14 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 54992fe0959f..2607c91230af 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -398,7 +398,7 @@ static inline void sock_release_memcg(struct sock *sk) #ifdef CONFIG_MEMCG_KMEM extern struct static_key memcg_kmem_enabled_key; -extern int memcg_limited_groups_array_size; +extern int memcg_nr_cache_ids; /* * Helper macro to loop through all memcg-specific caches. Callers must still @@ -406,7 +406,7 @@ extern int memcg_limited_groups_array_size; * the slab_mutex must be held when looping through those caches */ #define for_each_memcg_cache_index(_idx) \ - for ((_idx) = 0; (_idx) < memcg_limited_groups_array_size; (_idx)++) + for ((_idx) = 0; (_idx) < memcg_nr_cache_ids; (_idx)++) static inline bool memcg_kmem_enabled(void) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3c2a1a8286ac..8608fa543b84 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -538,12 +538,11 @@ static void disarm_sock_keys(struct mem_cgroup *memcg) * memcgs, and none but the 200th is kmem-limited, we'd have to have a * 200 entry array for that. * - * The current size of the caches array is stored in - * memcg_limited_groups_array_size. It will double each time we have to - * increase it. + * The current size of the caches array is stored in memcg_nr_cache_ids. It + * will double each time we have to increase it. */ -static DEFINE_IDA(kmem_limited_groups); -int memcg_limited_groups_array_size; +static DEFINE_IDA(memcg_cache_ida); +int memcg_nr_cache_ids; /* * MIN_SIZE is different than 1, because we would like to avoid going through @@ -2538,12 +2537,12 @@ static int memcg_alloc_cache_id(void) int id, size; int err; - id = ida_simple_get(&kmem_limited_groups, + id = ida_simple_get(&memcg_cache_ida, 0, MEMCG_CACHES_MAX_SIZE, GFP_KERNEL); if (id < 0) return id; - if (id < memcg_limited_groups_array_size) + if (id < memcg_nr_cache_ids) return id; /* @@ -2559,7 +2558,7 @@ static int memcg_alloc_cache_id(void) err = memcg_update_all_caches(size); if (err) { - ida_simple_remove(&kmem_limited_groups, id); + ida_simple_remove(&memcg_cache_ida, id); return err; } return id; @@ -2567,7 +2566,7 @@ static int memcg_alloc_cache_id(void) static void memcg_free_cache_id(int id) { - ida_simple_remove(&kmem_limited_groups, id); + ida_simple_remove(&memcg_cache_ida, id); } /* @@ -2577,7 +2576,7 @@ static void memcg_free_cache_id(int id) */ void memcg_update_array_size(int num) { - memcg_limited_groups_array_size = num; + memcg_nr_cache_ids = num; } struct memcg_kmem_cache_create_work { diff --git a/mm/slab_common.c b/mm/slab_common.c index 6e1e4cf65836..f8899eedab68 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -116,7 +116,7 @@ static int memcg_alloc_cache_params(struct mem_cgroup *memcg, if (!memcg) { size = offsetof(struct memcg_cache_params, memcg_caches); - size += memcg_limited_groups_array_size * sizeof(void *); + size += memcg_nr_cache_ids * sizeof(void *); } else size = sizeof(struct memcg_cache_params); @@ -154,7 +154,7 @@ static int memcg_update_cache_params(struct kmem_cache *s, int num_memcgs) cur_params = s->memcg_params; memcpy(new_params->memcg_caches, cur_params->memcg_caches, - memcg_limited_groups_array_size * sizeof(void *)); + memcg_nr_cache_ids * sizeof(void *)); new_params->is_root_cache = true; -- cgit v1.2.3 From 05257a1a3dcc196c197714b5c9a8dd35b7f6aefc Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:01 -0800 Subject: memcg: add rwsem to synchronize against memcg_caches arrays relocation We need a stable value of memcg_nr_cache_ids in kmem_cache_create() (memcg_alloc_cache_params() wants it for root caches), where we only hold the slab_mutex and no memcg-related locks. As a result, we have to update memcg_nr_cache_ids under the slab_mutex, which we can only take on the slab's side (see memcg_update_array_size). This looks awkward and will become even worse when per-memcg list_lru is introduced, which also wants stable access to memcg_nr_cache_ids. To get rid of this dependency between the memcg_nr_cache_ids and the slab_mutex, this patch introduces a special rwsem. The rwsem is held for writing during memcg_caches arrays relocation and memcg_nr_cache_ids updates. Therefore one can take it for reading to get a stable access to memcg_caches arrays and/or memcg_nr_cache_ids. Currently the semaphore is taken for reading only from kmem_cache_create, right before taking the slab_mutex, so right now there's no much point in using rwsem instead of mutex. However, once list_lru is made per-memcg it will allow list_lru initializations to proceed concurrently. Signed-off-by: Vladimir Davydov Cc: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 12 ++++++++++-- mm/memcontrol.c | 29 +++++++++++++++++++---------- mm/slab_common.c | 9 ++++----- 3 files changed, 33 insertions(+), 17 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 2607c91230af..dbc4baa3619c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -399,6 +399,8 @@ static inline void sock_release_memcg(struct sock *sk) extern struct static_key memcg_kmem_enabled_key; extern int memcg_nr_cache_ids; +extern void memcg_get_cache_ids(void); +extern void memcg_put_cache_ids(void); /* * Helper macro to loop through all memcg-specific caches. Callers must still @@ -434,8 +436,6 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order); int memcg_cache_id(struct mem_cgroup *memcg); -void memcg_update_array_size(int num_groups); - struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep); void __memcg_kmem_put_cache(struct kmem_cache *cachep); @@ -569,6 +569,14 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg) return -1; } +static inline void memcg_get_cache_ids(void) +{ +} + +static inline void memcg_put_cache_ids(void) +{ +} + static inline struct kmem_cache * memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8608fa543b84..6706e5fa5ac0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -544,6 +544,19 @@ static void disarm_sock_keys(struct mem_cgroup *memcg) static DEFINE_IDA(memcg_cache_ida); int memcg_nr_cache_ids; +/* Protects memcg_nr_cache_ids */ +static DECLARE_RWSEM(memcg_cache_ids_sem); + +void memcg_get_cache_ids(void) +{ + down_read(&memcg_cache_ids_sem); +} + +void memcg_put_cache_ids(void) +{ + up_read(&memcg_cache_ids_sem); +} + /* * MIN_SIZE is different than 1, because we would like to avoid going through * the alloc/free process all the time. In a small machine, 4 kmem-limited @@ -2549,6 +2562,7 @@ static int memcg_alloc_cache_id(void) * There's no space for the new id in memcg_caches arrays, * so we have to grow them. */ + down_write(&memcg_cache_ids_sem); size = 2 * (id + 1); if (size < MEMCG_CACHES_MIN_SIZE) @@ -2557,6 +2571,11 @@ static int memcg_alloc_cache_id(void) size = MEMCG_CACHES_MAX_SIZE; err = memcg_update_all_caches(size); + if (!err) + memcg_nr_cache_ids = size; + + up_write(&memcg_cache_ids_sem); + if (err) { ida_simple_remove(&memcg_cache_ida, id); return err; @@ -2569,16 +2588,6 @@ static void memcg_free_cache_id(int id) ida_simple_remove(&memcg_cache_ida, id); } -/* - * We should update the current array size iff all caches updates succeed. This - * can only be done from the slab side. The slab mutex needs to be held when - * calling this. - */ -void memcg_update_array_size(int num) -{ - memcg_nr_cache_ids = num; -} - struct memcg_kmem_cache_create_work { struct mem_cgroup *memcg; struct kmem_cache *cachep; diff --git a/mm/slab_common.c b/mm/slab_common.c index f8899eedab68..23f5fcde6043 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -169,8 +169,8 @@ int memcg_update_all_caches(int num_memcgs) { struct kmem_cache *s; int ret = 0; - mutex_lock(&slab_mutex); + mutex_lock(&slab_mutex); list_for_each_entry(s, &slab_caches, list) { if (!is_root_cache(s)) continue; @@ -181,11 +181,8 @@ int memcg_update_all_caches(int num_memcgs) * up to this point in an updated state. */ if (ret) - goto out; + break; } - - memcg_update_array_size(num_memcgs); -out: mutex_unlock(&slab_mutex); return ret; } @@ -369,6 +366,7 @@ kmem_cache_create(const char *name, size_t size, size_t align, get_online_cpus(); get_online_mems(); + memcg_get_cache_ids(); mutex_lock(&slab_mutex); @@ -407,6 +405,7 @@ kmem_cache_create(const char *name, size_t size, size_t align, out_unlock: mutex_unlock(&slab_mutex); + memcg_put_cache_ids(); put_online_mems(); put_online_cpus(); -- cgit v1.2.3 From ff0b67ef5b1687692bc1fd3ce4bc3d1ff83587c7 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:04 -0800 Subject: list_lru: get rid of ->active_nodes The active_nodes mask allows us to skip empty nodes when walking over list_lru items from all nodes in list_lru_count/walk. However, these functions are never called from hot paths, so it doesn't seem we need such kind of optimization there. OTOH, removing the mask will make it easier to make list_lru per-memcg. Signed-off-by: Vladimir Davydov Cc: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/list_lru.h | 5 ++--- mm/list_lru.c | 10 +++------- 2 files changed, 5 insertions(+), 10 deletions(-) diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index f500a2e39b13..53c1d6b78270 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -31,7 +31,6 @@ struct list_lru_node { struct list_lru { struct list_lru_node *node; - nodemask_t active_nodes; }; void list_lru_destroy(struct list_lru *lru); @@ -94,7 +93,7 @@ static inline unsigned long list_lru_count(struct list_lru *lru) long count = 0; int nid; - for_each_node_mask(nid, lru->active_nodes) + for_each_node_state(nid, N_NORMAL_MEMORY) count += list_lru_count_node(lru, nid); return count; @@ -142,7 +141,7 @@ list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate, long isolated = 0; int nid; - for_each_node_mask(nid, lru->active_nodes) { + for_each_node_state(nid, N_NORMAL_MEMORY) { isolated += list_lru_walk_node(lru, nid, isolate, cb_arg, &nr_to_walk); if (nr_to_walk <= 0) diff --git a/mm/list_lru.c b/mm/list_lru.c index f1a0db194173..07e198c77888 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -19,8 +19,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item) WARN_ON_ONCE(nlru->nr_items < 0); if (list_empty(item)) { list_add_tail(item, &nlru->list); - if (nlru->nr_items++ == 0) - node_set(nid, lru->active_nodes); + nlru->nr_items++; spin_unlock(&nlru->lock); return true; } @@ -37,8 +36,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item) spin_lock(&nlru->lock); if (!list_empty(item)) { list_del_init(item); - if (--nlru->nr_items == 0) - node_clear(nid, lru->active_nodes); + nlru->nr_items--; WARN_ON_ONCE(nlru->nr_items < 0); spin_unlock(&nlru->lock); return true; @@ -90,8 +88,7 @@ restart: case LRU_REMOVED_RETRY: assert_spin_locked(&nlru->lock); case LRU_REMOVED: - if (--nlru->nr_items == 0) - node_clear(nid, lru->active_nodes); + nlru->nr_items--; WARN_ON_ONCE(nlru->nr_items < 0); isolated++; /* @@ -133,7 +130,6 @@ int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key) if (!lru->node) return -ENOMEM; - nodes_clear(lru->active_nodes); for (i = 0; i < nr_node_ids; i++) { spin_lock_init(&lru->node[i].lock); if (key) -- cgit v1.2.3 From c0a5b560938a0f2fd2fbf66ddc446c7c2b41383a Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:07 -0800 Subject: list_lru: organize all list_lrus to list To make list_lru memcg aware, we need all list_lrus to be kept on a list protected by a mutex, so that we could sleep while walking over the list. Therefore after this change list_lru_destroy may sleep. Fortunately, there is only one user that calls it from an atomic context - it's put_super - and we can easily fix it by calling list_lru_destroy before put_super in destroy_locked_super - anyway we don't longer need lrus by that time. Another point that should be noted is that list_lru_destroy is allowed to be called on an uninitialized zeroed-out object, in which case it is a no-op. Before this patch this was guaranteed by kfree, but now we need an explicit check there. Signed-off-by: Vladimir Davydov Cc: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/super.c | 8 ++++++++ include/linux/list_lru.h | 3 +++ mm/list_lru.c | 34 ++++++++++++++++++++++++++++++++++ 3 files changed, 45 insertions(+) diff --git a/fs/super.c b/fs/super.c index a2b735a42e74..b027849d92d2 100644 --- a/fs/super.c +++ b/fs/super.c @@ -282,6 +282,14 @@ void deactivate_locked_super(struct super_block *s) unregister_shrinker(&s->s_shrink); fs->kill_sb(s); + /* + * Since list_lru_destroy() may sleep, we cannot call it from + * put_super(), where we hold the sb_lock. Therefore we destroy + * the lru lists right now. + */ + list_lru_destroy(&s->s_dentry_lru); + list_lru_destroy(&s->s_inode_lru); + put_filesystem(fs); put_super(s); } else { diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index 53c1d6b78270..ee9486ac0621 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -31,6 +31,9 @@ struct list_lru_node { struct list_lru { struct list_lru_node *node; +#ifdef CONFIG_MEMCG_KMEM + struct list_head list; +#endif }; void list_lru_destroy(struct list_lru *lru); diff --git a/mm/list_lru.c b/mm/list_lru.c index 07e198c77888..a9021cb3ccde 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -9,6 +9,34 @@ #include #include #include +#include + +#ifdef CONFIG_MEMCG_KMEM +static LIST_HEAD(list_lrus); +static DEFINE_MUTEX(list_lrus_mutex); + +static void list_lru_register(struct list_lru *lru) +{ + mutex_lock(&list_lrus_mutex); + list_add(&lru->list, &list_lrus); + mutex_unlock(&list_lrus_mutex); +} + +static void list_lru_unregister(struct list_lru *lru) +{ + mutex_lock(&list_lrus_mutex); + list_del(&lru->list); + mutex_unlock(&list_lrus_mutex); +} +#else +static void list_lru_register(struct list_lru *lru) +{ +} + +static void list_lru_unregister(struct list_lru *lru) +{ +} +#endif /* CONFIG_MEMCG_KMEM */ bool list_lru_add(struct list_lru *lru, struct list_head *item) { @@ -137,12 +165,18 @@ int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key) INIT_LIST_HEAD(&lru->node[i].list); lru->node[i].nr_items = 0; } + list_lru_register(lru); return 0; } EXPORT_SYMBOL_GPL(list_lru_init_key); void list_lru_destroy(struct list_lru *lru) { + /* Already destroyed or not yet initialized? */ + if (!lru->node) + return; + list_lru_unregister(lru); kfree(lru->node); + lru->node = NULL; } EXPORT_SYMBOL_GPL(list_lru_destroy); -- cgit v1.2.3 From 60d3fd32a7a9da4c8c93a9f89cfda22a0b4c65ce Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:10 -0800 Subject: list_lru: introduce per-memcg lists There are several FS shrinkers, including super_block::s_shrink, that keep reclaimable objects in the list_lru structure. Hence to turn them to memcg-aware shrinkers, it is enough to make list_lru per-memcg. This patch does the trick. It adds an array of lru lists to the list_lru_node structure (per-node part of the list_lru), one for each kmem-active memcg, and dispatches every item addition or removal to the list corresponding to the memcg which the item is accounted to. So now the list_lru structure is not just per node, but per node and per memcg. Not all list_lrus need this feature, so this patch also adds a new method, list_lru_init_memcg, which initializes a list_lru as memcg aware. Otherwise (i.e. if initialized with old list_lru_init), the list_lru won't have per memcg lists. Just like per memcg caches arrays, the arrays of per-memcg lists are indexed by memcg_cache_id, so we must grow them whenever memcg_nr_cache_ids is increased. So we introduce a callback, memcg_update_all_list_lrus, invoked by memcg_alloc_cache_id if the id space is full. The locking is implemented in a manner similar to lruvecs, i.e. we have one lock per node that protects all lists (both global and per cgroup) on the node. Signed-off-by: Vladimir Davydov Cc: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/list_lru.h | 52 +++++-- include/linux/memcontrol.h | 14 ++ mm/list_lru.c | 374 ++++++++++++++++++++++++++++++++++++++++++--- mm/memcontrol.c | 20 +++ 4 files changed, 424 insertions(+), 36 deletions(-) diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index ee9486ac0621..305b598abac2 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -11,6 +11,8 @@ #include #include +struct mem_cgroup; + /* list_lru_walk_cb has to always return one of those */ enum lru_status { LRU_REMOVED, /* item removed from list */ @@ -22,11 +24,26 @@ enum lru_status { internally, but has to return locked. */ }; -struct list_lru_node { - spinlock_t lock; +struct list_lru_one { struct list_head list; /* kept as signed so we can catch imbalance bugs */ long nr_items; +}; + +struct list_lru_memcg { + /* array of per cgroup lists, indexed by memcg_cache_id */ + struct list_lru_one *lru[0]; +}; + +struct list_lru_node { + /* protects all lists on the node, including per cgroup */ + spinlock_t lock; + /* global list, used for the root cgroup in cgroup aware lrus */ + struct list_lru_one lru; +#ifdef CONFIG_MEMCG_KMEM + /* for cgroup aware lrus points to per cgroup lists, otherwise NULL */ + struct list_lru_memcg *memcg_lrus; +#endif } ____cacheline_aligned_in_smp; struct list_lru { @@ -37,11 +54,14 @@ struct list_lru { }; void list_lru_destroy(struct list_lru *lru); -int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key); -static inline int list_lru_init(struct list_lru *lru) -{ - return list_lru_init_key(lru, NULL); -} +int __list_lru_init(struct list_lru *lru, bool memcg_aware, + struct lock_class_key *key); + +#define list_lru_init(lru) __list_lru_init((lru), false, NULL) +#define list_lru_init_key(lru, key) __list_lru_init((lru), false, (key)) +#define list_lru_init_memcg(lru) __list_lru_init((lru), true, NULL) + +int memcg_update_all_list_lrus(int num_memcgs); /** * list_lru_add: add an element to the lru list's tail @@ -75,20 +95,23 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item); bool list_lru_del(struct list_lru *lru, struct list_head *item); /** - * list_lru_count_node: return the number of objects currently held by @lru + * list_lru_count_one: return the number of objects currently held by @lru * @lru: the lru pointer. * @nid: the node id to count from. + * @memcg: the cgroup to count from. * * Always return a non-negative number, 0 for empty lists. There is no * guarantee that the list is not updated while the count is being computed. * Callers that want such a guarantee need to provide an outer lock. */ +unsigned long list_lru_count_one(struct list_lru *lru, + int nid, struct mem_cgroup *memcg); unsigned long list_lru_count_node(struct list_lru *lru, int nid); static inline unsigned long list_lru_shrink_count(struct list_lru *lru, struct shrink_control *sc) { - return list_lru_count_node(lru, sc->nid); + return list_lru_count_one(lru, sc->nid, sc->memcg); } static inline unsigned long list_lru_count(struct list_lru *lru) @@ -105,9 +128,10 @@ static inline unsigned long list_lru_count(struct list_lru *lru) typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg); /** - * list_lru_walk_node: walk a list_lru, isolating and disposing freeable items. + * list_lru_walk_one: walk a list_lru, isolating and disposing freeable items. * @lru: the lru pointer. * @nid: the node id to scan from. + * @memcg: the cgroup to scan from. * @isolate: callback function that is resposible for deciding what to do with * the item currently being scanned * @cb_arg: opaque type that will be passed to @isolate @@ -125,6 +149,10 @@ typedef enum lru_status * * Return value: the number of objects effectively removed from the LRU. */ +unsigned long list_lru_walk_one(struct list_lru *lru, + int nid, struct mem_cgroup *memcg, + list_lru_walk_cb isolate, void *cb_arg, + unsigned long *nr_to_walk); unsigned long list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate, void *cb_arg, unsigned long *nr_to_walk); @@ -133,8 +161,8 @@ static inline unsigned long list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc, list_lru_walk_cb isolate, void *cb_arg) { - return list_lru_walk_node(lru, sc->nid, isolate, cb_arg, - &sc->nr_to_scan); + return list_lru_walk_one(lru, sc->nid, sc->memcg, isolate, cb_arg, + &sc->nr_to_scan); } static inline unsigned long diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index dbc4baa3619c..72dff5fb0d0c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -439,6 +439,8 @@ int memcg_cache_id(struct mem_cgroup *memcg); struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep); void __memcg_kmem_put_cache(struct kmem_cache *cachep); +struct mem_cgroup *__mem_cgroup_from_kmem(void *ptr); + int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, unsigned long nr_pages); void memcg_uncharge_kmem(struct mem_cgroup *memcg, unsigned long nr_pages); @@ -535,6 +537,13 @@ static __always_inline void memcg_kmem_put_cache(struct kmem_cache *cachep) if (memcg_kmem_enabled()) __memcg_kmem_put_cache(cachep); } + +static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr) +{ + if (!memcg_kmem_enabled()) + return NULL; + return __mem_cgroup_from_kmem(ptr); +} #else #define for_each_memcg_cache_index(_idx) \ for (; NULL; ) @@ -586,6 +595,11 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp) static inline void memcg_kmem_put_cache(struct kmem_cache *cachep) { } + +static inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr) +{ + return NULL; +} #endif /* CONFIG_MEMCG_KMEM */ #endif /* _LINUX_MEMCONTROL_H */ diff --git a/mm/list_lru.c b/mm/list_lru.c index a9021cb3ccde..79aee70c3b9d 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -10,6 +10,7 @@ #include #include #include +#include #ifdef CONFIG_MEMCG_KMEM static LIST_HEAD(list_lrus); @@ -38,16 +39,71 @@ static void list_lru_unregister(struct list_lru *lru) } #endif /* CONFIG_MEMCG_KMEM */ +#ifdef CONFIG_MEMCG_KMEM +static inline bool list_lru_memcg_aware(struct list_lru *lru) +{ + return !!lru->node[0].memcg_lrus; +} + +static inline struct list_lru_one * +list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx) +{ + /* + * The lock protects the array of per cgroup lists from relocation + * (see memcg_update_list_lru_node). + */ + lockdep_assert_held(&nlru->lock); + if (nlru->memcg_lrus && idx >= 0) + return nlru->memcg_lrus->lru[idx]; + + return &nlru->lru; +} + +static inline struct list_lru_one * +list_lru_from_kmem(struct list_lru_node *nlru, void *ptr) +{ + struct mem_cgroup *memcg; + + if (!nlru->memcg_lrus) + return &nlru->lru; + + memcg = mem_cgroup_from_kmem(ptr); + if (!memcg) + return &nlru->lru; + + return list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg)); +} +#else +static inline bool list_lru_memcg_aware(struct list_lru *lru) +{ + return false; +} + +static inline struct list_lru_one * +list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx) +{ + return &nlru->lru; +} + +static inline struct list_lru_one * +list_lru_from_kmem(struct list_lru_node *nlru, void *ptr) +{ + return &nlru->lru; +} +#endif /* CONFIG_MEMCG_KMEM */ + bool list_lru_add(struct list_lru *lru, struct list_head *item) { int nid = page_to_nid(virt_to_page(item)); struct list_lru_node *nlru = &lru->node[nid]; + struct list_lru_one *l; spin_lock(&nlru->lock); - WARN_ON_ONCE(nlru->nr_items < 0); + l = list_lru_from_kmem(nlru, item); + WARN_ON_ONCE(l->nr_items < 0); if (list_empty(item)) { - list_add_tail(item, &nlru->list); - nlru->nr_items++; + list_add_tail(item, &l->list); + l->nr_items++; spin_unlock(&nlru->lock); return true; } @@ -60,12 +116,14 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item) { int nid = page_to_nid(virt_to_page(item)); struct list_lru_node *nlru = &lru->node[nid]; + struct list_lru_one *l; spin_lock(&nlru->lock); + l = list_lru_from_kmem(nlru, item); if (!list_empty(item)) { list_del_init(item); - nlru->nr_items--; - WARN_ON_ONCE(nlru->nr_items < 0); + l->nr_items--; + WARN_ON_ONCE(l->nr_items < 0); spin_unlock(&nlru->lock); return true; } @@ -74,33 +132,58 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item) } EXPORT_SYMBOL_GPL(list_lru_del); -unsigned long -list_lru_count_node(struct list_lru *lru, int nid) +static unsigned long __list_lru_count_one(struct list_lru *lru, + int nid, int memcg_idx) { - unsigned long count = 0; struct list_lru_node *nlru = &lru->node[nid]; + struct list_lru_one *l; + unsigned long count; spin_lock(&nlru->lock); - WARN_ON_ONCE(nlru->nr_items < 0); - count += nlru->nr_items; + l = list_lru_from_memcg_idx(nlru, memcg_idx); + WARN_ON_ONCE(l->nr_items < 0); + count = l->nr_items; spin_unlock(&nlru->lock); return count; } + +unsigned long list_lru_count_one(struct list_lru *lru, + int nid, struct mem_cgroup *memcg) +{ + return __list_lru_count_one(lru, nid, memcg_cache_id(memcg)); +} +EXPORT_SYMBOL_GPL(list_lru_count_one); + +unsigned long list_lru_count_node(struct list_lru *lru, int nid) +{ + long count = 0; + int memcg_idx; + + count += __list_lru_count_one(lru, nid, -1); + if (list_lru_memcg_aware(lru)) { + for_each_memcg_cache_index(memcg_idx) + count += __list_lru_count_one(lru, nid, memcg_idx); + } + return count; +} EXPORT_SYMBOL_GPL(list_lru_count_node); -unsigned long -list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate, - void *cb_arg, unsigned long *nr_to_walk) +static unsigned long +__list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx, + list_lru_walk_cb isolate, void *cb_arg, + unsigned long *nr_to_walk) { - struct list_lru_node *nlru = &lru->node[nid]; + struct list_lru_node *nlru = &lru->node[nid]; + struct list_lru_one *l; struct list_head *item, *n; unsigned long isolated = 0; spin_lock(&nlru->lock); + l = list_lru_from_memcg_idx(nlru, memcg_idx); restart: - list_for_each_safe(item, n, &nlru->list) { + list_for_each_safe(item, n, &l->list) { enum lru_status ret; /* @@ -116,8 +199,8 @@ restart: case LRU_REMOVED_RETRY: assert_spin_locked(&nlru->lock); case LRU_REMOVED: - nlru->nr_items--; - WARN_ON_ONCE(nlru->nr_items < 0); + l->nr_items--; + WARN_ON_ONCE(l->nr_items < 0); isolated++; /* * If the lru lock has been dropped, our list @@ -128,7 +211,7 @@ restart: goto restart; break; case LRU_ROTATE: - list_move_tail(item, &nlru->list); + list_move_tail(item, &l->list); break; case LRU_SKIP: break; @@ -147,36 +230,279 @@ restart: spin_unlock(&nlru->lock); return isolated; } + +unsigned long +list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg, + list_lru_walk_cb isolate, void *cb_arg, + unsigned long *nr_to_walk) +{ + return __list_lru_walk_one(lru, nid, memcg_cache_id(memcg), + isolate, cb_arg, nr_to_walk); +} +EXPORT_SYMBOL_GPL(list_lru_walk_one); + +unsigned long list_lru_walk_node(struct list_lru *lru, int nid, + list_lru_walk_cb isolate, void *cb_arg, + unsigned long *nr_to_walk) +{ + long isolated = 0; + int memcg_idx; + + isolated += __list_lru_walk_one(lru, nid, -1, isolate, cb_arg, + nr_to_walk); + if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) { + for_each_memcg_cache_index(memcg_idx) { + isolated += __list_lru_walk_one(lru, nid, memcg_idx, + isolate, cb_arg, nr_to_walk); + if (*nr_to_walk <= 0) + break; + } + } + return isolated; +} EXPORT_SYMBOL_GPL(list_lru_walk_node); -int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key) +static void init_one_lru(struct list_lru_one *l) +{ + INIT_LIST_HEAD(&l->list); + l->nr_items = 0; +} + +#ifdef CONFIG_MEMCG_KMEM +static void __memcg_destroy_list_lru_node(struct list_lru_memcg *memcg_lrus, + int begin, int end) +{ + int i; + + for (i = begin; i < end; i++) + kfree(memcg_lrus->lru[i]); +} + +static int __memcg_init_list_lru_node(struct list_lru_memcg *memcg_lrus, + int begin, int end) +{ + int i; + + for (i = begin; i < end; i++) { + struct list_lru_one *l; + + l = kmalloc(sizeof(struct list_lru_one), GFP_KERNEL); + if (!l) + goto fail; + + init_one_lru(l); + memcg_lrus->lru[i] = l; + } + return 0; +fail: + __memcg_destroy_list_lru_node(memcg_lrus, begin, i - 1); + return -ENOMEM; +} + +static int memcg_init_list_lru_node(struct list_lru_node *nlru) +{ + int size = memcg_nr_cache_ids; + + nlru->memcg_lrus = kmalloc(size * sizeof(void *), GFP_KERNEL); + if (!nlru->memcg_lrus) + return -ENOMEM; + + if (__memcg_init_list_lru_node(nlru->memcg_lrus, 0, size)) { + kfree(nlru->memcg_lrus); + return -ENOMEM; + } + + return 0; +} + +static void memcg_destroy_list_lru_node(struct list_lru_node *nlru) +{ + __memcg_destroy_list_lru_node(nlru->memcg_lrus, 0, memcg_nr_cache_ids); + kfree(nlru->memcg_lrus); +} + +static int memcg_update_list_lru_node(struct list_lru_node *nlru, + int old_size, int new_size) +{ + struct list_lru_memcg *old, *new; + + BUG_ON(old_size > new_size); + + old = nlru->memcg_lrus; + new = kmalloc(new_size * sizeof(void *), GFP_KERNEL); + if (!new) + return -ENOMEM; + + if (__memcg_init_list_lru_node(new, old_size, new_size)) { + kfree(new); + return -ENOMEM; + } + + memcpy(new, old, old_size * sizeof(void *)); + + /* + * The lock guarantees that we won't race with a reader + * (see list_lru_from_memcg_idx). + * + * Since list_lru_{add,del} may be called under an IRQ-safe lock, + * we have to use IRQ-safe primitives here to avoid deadlock. + */ + spin_lock_irq(&nlru->lock); + nlru->memcg_lrus = new; + spin_unlock_irq(&nlru->lock); + + kfree(old); + return 0; +} + +static void memcg_cancel_update_list_lru_node(struct list_lru_node *nlru, + int old_size, int new_size) +{ + /* do not bother shrinking the array back to the old size, because we + * cannot handle allocation failures here */ + __memcg_destroy_list_lru_node(nlru->memcg_lrus, old_size, new_size); +} + +static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) +{ + int i; + + for (i = 0; i < nr_node_ids; i++) { + if (!memcg_aware) + lru->node[i].memcg_lrus = NULL; + else if (memcg_init_list_lru_node(&lru->node[i])) + goto fail; + } + return 0; +fail: + for (i = i - 1; i >= 0; i--) + memcg_destroy_list_lru_node(&lru->node[i]); + return -ENOMEM; +} + +static void memcg_destroy_list_lru(struct list_lru *lru) +{ + int i; + + if (!list_lru_memcg_aware(lru)) + return; + + for (i = 0; i < nr_node_ids; i++) + memcg_destroy_list_lru_node(&lru->node[i]); +} + +static int memcg_update_list_lru(struct list_lru *lru, + int old_size, int new_size) +{ + int i; + + if (!list_lru_memcg_aware(lru)) + return 0; + + for (i = 0; i < nr_node_ids; i++) { + if (memcg_update_list_lru_node(&lru->node[i], + old_size, new_size)) + goto fail; + } + return 0; +fail: + for (i = i - 1; i >= 0; i--) + memcg_cancel_update_list_lru_node(&lru->node[i], + old_size, new_size); + return -ENOMEM; +} + +static void memcg_cancel_update_list_lru(struct list_lru *lru, + int old_size, int new_size) +{ + int i; + + if (!list_lru_memcg_aware(lru)) + return; + + for (i = 0; i < nr_node_ids; i++) + memcg_cancel_update_list_lru_node(&lru->node[i], + old_size, new_size); +} + +int memcg_update_all_list_lrus(int new_size) +{ + int ret = 0; + struct list_lru *lru; + int old_size = memcg_nr_cache_ids; + + mutex_lock(&list_lrus_mutex); + list_for_each_entry(lru, &list_lrus, list) { + ret = memcg_update_list_lru(lru, old_size, new_size); + if (ret) + goto fail; + } +out: + mutex_unlock(&list_lrus_mutex); + return ret; +fail: + list_for_each_entry_continue_reverse(lru, &list_lrus, list) + memcg_cancel_update_list_lru(lru, old_size, new_size); + goto out; +} +#else +static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) +{ + return 0; +} + +static void memcg_destroy_list_lru(struct list_lru *lru) +{ +} +#endif /* CONFIG_MEMCG_KMEM */ + +int __list_lru_init(struct list_lru *lru, bool memcg_aware, + struct lock_class_key *key) { int i; size_t size = sizeof(*lru->node) * nr_node_ids; + int err = -ENOMEM; + + memcg_get_cache_ids(); lru->node = kzalloc(size, GFP_KERNEL); if (!lru->node) - return -ENOMEM; + goto out; for (i = 0; i < nr_node_ids; i++) { spin_lock_init(&lru->node[i].lock); if (key) lockdep_set_class(&lru->node[i].lock, key); - INIT_LIST_HEAD(&lru->node[i].list); - lru->node[i].nr_items = 0; + init_one_lru(&lru->node[i].lru); + } + + err = memcg_init_list_lru(lru, memcg_aware); + if (err) { + kfree(lru->node); + goto out; } + list_lru_register(lru); - return 0; +out: + memcg_put_cache_ids(); + return err; } -EXPORT_SYMBOL_GPL(list_lru_init_key); +EXPORT_SYMBOL_GPL(__list_lru_init); void list_lru_destroy(struct list_lru *lru) { /* Already destroyed or not yet initialized? */ if (!lru->node) return; + + memcg_get_cache_ids(); + list_lru_unregister(lru); + + memcg_destroy_list_lru(lru); kfree(lru->node); lru->node = NULL; + + memcg_put_cache_ids(); } EXPORT_SYMBOL_GPL(list_lru_destroy); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6706e5fa5ac0..afa55bb38cbd 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2571,6 +2571,8 @@ static int memcg_alloc_cache_id(void) size = MEMCG_CACHES_MAX_SIZE; err = memcg_update_all_caches(size); + if (!err) + err = memcg_update_all_list_lrus(size); if (!err) memcg_nr_cache_ids = size; @@ -2765,6 +2767,24 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order) memcg_uncharge_kmem(memcg, 1 << order); page->mem_cgroup = NULL; } + +struct mem_cgroup *__mem_cgroup_from_kmem(void *ptr) +{ + struct mem_cgroup *memcg = NULL; + struct kmem_cache *cachep; + struct page *page; + + page = virt_to_head_page(ptr); + if (PageSlab(page)) { + cachep = page->slab_cache; + if (!is_root_cache(cachep)) + memcg = cachep->memcg_params->memcg; + } else + /* page allocated by alloc_kmem_pages */ + memcg = page->mem_cgroup; + + return memcg; +} #endif /* CONFIG_MEMCG_KMEM */ #ifdef CONFIG_TRANSPARENT_HUGEPAGE -- cgit v1.2.3 From 2acb60a0461fec8a4516686c913a295ce872697f Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:14 -0800 Subject: fs: make shrinker memcg aware Now, to make any list_lru-based shrinker memcg aware we should only initialize its list_lru as memcg aware. Let's do it for the general FS shrinker (super_block::s_shrink). There are other FS-specific shrinkers that use list_lru for storing objects, such as XFS and GFS2 dquot cache shrinkers, but since they reclaim objects that are shared among different cgroups, there is no point making them memcg aware. It's a big question whether we should account them to memcg at all. Signed-off-by: Vladimir Davydov Cc: Dave Chinner Cc: Johannes Weiner Cc: Michal Hocko Cc: Greg Thelen Cc: Glauber Costa Cc: Alexander Viro Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/super.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/super.c b/fs/super.c index b027849d92d2..482b4071f4de 100644 --- a/fs/super.c +++ b/fs/super.c @@ -189,9 +189,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags) INIT_HLIST_BL_HEAD(&s->s_anon); INIT_LIST_HEAD(&s->s_inodes); - if (list_lru_init(&s->s_dentry_lru)) + if (list_lru_init_memcg(&s->s_dentry_lru)) goto fail; - if (list_lru_init(&s->s_inode_lru)) + if (list_lru_init_memcg(&s->s_inode_lru)) goto fail; init_rwsem(&s->s_umount); @@ -227,7 +227,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags) s->s_shrink.scan_objects = super_cache_scan; s->s_shrink.count_objects = super_cache_count; s->s_shrink.batch = 1024; - s->s_shrink.flags = SHRINKER_NUMA_AWARE; + s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE; return s; fail: -- cgit v1.2.3 From 49e7e7ff8d551b5b1e2f8da8497b9058cfa25672 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:17 -0800 Subject: fs: shrinker: always scan at least one object of each type In super_cache_scan() we divide the number of objects of particular type by the total number of objects in order to distribute pressure among As a result, in some corner cases we can get nr_to_scan=0 even if there are some objects to reclaim, e.g. dentries=1, inodes=1, fs_objects=1, nr_to_scan=1/3=0. This is unacceptable for per memcg kmem accounting, because this means that some objects may never get reclaimed after memcg death, preventing it from being freed. This patch therefore assures that super_cache_scan() will scan at least one object of each type if any. [akpm@linux-foundation.org: add comment] Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Cc: Alexander Viro Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/super.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/fs/super.c b/fs/super.c index 482b4071f4de..a89d6250bab8 100644 --- a/fs/super.c +++ b/fs/super.c @@ -91,14 +91,17 @@ static unsigned long super_cache_scan(struct shrinker *shrink, /* * prune the dcache first as the icache is pinned by it, then * prune the icache, followed by the filesystem specific caches + * + * Ensure that we always scan at least one object - memcg kmem + * accounting uses this to fully empty the caches. */ - sc->nr_to_scan = dentries; + sc->nr_to_scan = dentries + 1; freed = prune_dcache_sb(sb, sc); - sc->nr_to_scan = inodes; + sc->nr_to_scan = inodes + 1; freed += prune_icache_sb(sb, sc); if (fs_objects) { - sc->nr_to_scan = fs_objects; + sc->nr_to_scan = fs_objects + 1; freed += sb->s_op->free_cached_objects(sb, sc); } -- cgit v1.2.3 From f7ce3190c4a35bf887adb7a1aa1ba899b679872d Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:20 -0800 Subject: slab: embed memcg_cache_params to kmem_cache Currently, kmem_cache stores a pointer to struct memcg_cache_params instead of embedding it. The rationale is to save memory when kmem accounting is disabled. However, the memcg_cache_params has shrivelled drastically since it was first introduced: * Initially: struct memcg_cache_params { bool is_root_cache; union { struct kmem_cache *memcg_caches[0]; struct { struct mem_cgroup *memcg; struct list_head list; struct kmem_cache *root_cache; bool dead; atomic_t nr_pages; struct work_struct destroy; }; }; }; * Now: struct memcg_cache_params { bool is_root_cache; union { struct { struct rcu_head rcu_head; struct kmem_cache *memcg_caches[0]; }; struct { struct mem_cgroup *memcg; struct kmem_cache *root_cache; }; }; }; So the memory saving does not seem to be a clear win anymore. OTOH, keeping a pointer to memcg_cache_params struct instead of embedding it results in touching one more cache line on kmem alloc/free hot paths. Besides, it makes linking kmem caches in a list chained by a field of struct memcg_cache_params really painful due to a level of indirection, while I want to make them linked in the following patch. That said, let us embed it. Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Dave Chinner Cc: Dan Carpenter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/slab.h | 17 +++---- include/linux/slab_def.h | 2 +- include/linux/slub_def.h | 2 +- mm/memcontrol.c | 11 ++-- mm/slab.h | 48 +++++++++--------- mm/slab_common.c | 129 ++++++++++++++++++++++++++--------------------- mm/slub.c | 5 +- 7 files changed, 111 insertions(+), 103 deletions(-) diff --git a/include/linux/slab.h b/include/linux/slab.h index 2e3b448cfa2d..1e03c11bbfbd 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -473,14 +473,14 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node) #ifndef ARCH_SLAB_MINALIGN #define ARCH_SLAB_MINALIGN __alignof__(unsigned long long) #endif + +struct memcg_cache_array { + struct rcu_head rcu; + struct kmem_cache *entries[0]; +}; + /* * This is the main placeholder for memcg-related information in kmem caches. - * struct kmem_cache will hold a pointer to it, so the memory cost while - * disabled is 1 pointer. The runtime cost while enabled, gets bigger than it - * would otherwise be if that would be bundled in kmem_cache: we'll need an - * extra pointer chase. But the trade off clearly lays in favor of not - * penalizing non-users. - * * Both the root cache and the child caches will have it. For the root cache, * this will hold a dynamically allocated array large enough to hold * information about the currently limited memcgs in the system. To allow the @@ -495,10 +495,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node) struct memcg_cache_params { bool is_root_cache; union { - struct { - struct rcu_head rcu_head; - struct kmem_cache *memcg_caches[0]; - }; + struct memcg_cache_array __rcu *memcg_caches; struct { struct mem_cgroup *memcg; struct kmem_cache *root_cache; diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h index b869d1662ba3..33d049066c3d 100644 --- a/include/linux/slab_def.h +++ b/include/linux/slab_def.h @@ -70,7 +70,7 @@ struct kmem_cache { int obj_offset; #endif /* CONFIG_DEBUG_SLAB */ #ifdef CONFIG_MEMCG_KMEM - struct memcg_cache_params *memcg_params; + struct memcg_cache_params memcg_params; #endif struct kmem_cache_node *node[MAX_NUMNODES]; diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index d82abd40a3c0..9abf04ed0999 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -85,7 +85,7 @@ struct kmem_cache { struct kobject kobj; /* For sysfs */ #endif #ifdef CONFIG_MEMCG_KMEM - struct memcg_cache_params *memcg_params; + struct memcg_cache_params memcg_params; int max_attr_size; /* for propagation, maximum size of a stored attr */ #ifdef CONFIG_SYSFS struct kset *memcg_kset; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index afa55bb38cbd..6f3c0fcd7a2d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -332,7 +332,7 @@ struct mem_cgroup { struct cg_proto tcp_mem; #endif #if defined(CONFIG_MEMCG_KMEM) - /* Index in the kmem_cache->memcg_params->memcg_caches array */ + /* Index in the kmem_cache->memcg_params.memcg_caches array */ int kmemcg_id; #endif @@ -531,7 +531,7 @@ static void disarm_sock_keys(struct mem_cgroup *memcg) #ifdef CONFIG_MEMCG_KMEM /* - * This will be the memcg's index in each cache's ->memcg_params->memcg_caches. + * This will be the memcg's index in each cache's ->memcg_params.memcg_caches. * The main reason for not using cgroup id for this: * this works better in sparse environments, where we have a lot of memcgs, * but only a few kmem-limited. Or also, if we have, for instance, 200 @@ -2667,8 +2667,7 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep) struct mem_cgroup *memcg; struct kmem_cache *memcg_cachep; - VM_BUG_ON(!cachep->memcg_params); - VM_BUG_ON(!cachep->memcg_params->is_root_cache); + VM_BUG_ON(!is_root_cache(cachep)); if (current->memcg_kmem_skip_account) return cachep; @@ -2702,7 +2701,7 @@ out: void __memcg_kmem_put_cache(struct kmem_cache *cachep) { if (!is_root_cache(cachep)) - css_put(&cachep->memcg_params->memcg->css); + css_put(&cachep->memcg_params.memcg->css); } /* @@ -2778,7 +2777,7 @@ struct mem_cgroup *__mem_cgroup_from_kmem(void *ptr) if (PageSlab(page)) { cachep = page->slab_cache; if (!is_root_cache(cachep)) - memcg = cachep->memcg_params->memcg; + memcg = cachep->memcg_params.memcg; } else /* page allocated by alloc_kmem_pages */ memcg = page->mem_cgroup; diff --git a/mm/slab.h b/mm/slab.h index 90430d6f665e..53a623f85931 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -86,8 +86,6 @@ extern struct kmem_cache *create_kmalloc_cache(const char *name, size_t size, extern void create_boot_cache(struct kmem_cache *, const char *name, size_t size, unsigned long flags); -struct mem_cgroup; - int slab_unmergeable(struct kmem_cache *s); struct kmem_cache *find_mergeable(size_t size, size_t align, unsigned long flags, const char *name, void (*ctor)(void *)); @@ -167,14 +165,13 @@ ssize_t slabinfo_write(struct file *file, const char __user *buffer, #ifdef CONFIG_MEMCG_KMEM static inline bool is_root_cache(struct kmem_cache *s) { - return !s->memcg_params || s->memcg_params->is_root_cache; + return s->memcg_params.is_root_cache; } static inline bool slab_equal_or_root(struct kmem_cache *s, - struct kmem_cache *p) + struct kmem_cache *p) { - return (p == s) || - (s->memcg_params && (p == s->memcg_params->root_cache)); + return p == s || p == s->memcg_params.root_cache; } /* @@ -185,37 +182,30 @@ static inline bool slab_equal_or_root(struct kmem_cache *s, static inline const char *cache_name(struct kmem_cache *s) { if (!is_root_cache(s)) - return s->memcg_params->root_cache->name; + s = s->memcg_params.root_cache; return s->name; } /* * Note, we protect with RCU only the memcg_caches array, not per-memcg caches. - * That said the caller must assure the memcg's cache won't go away. Since once - * created a memcg's cache is destroyed only along with the root cache, it is - * true if we are going to allocate from the cache or hold a reference to the - * root cache by other means. Otherwise, we should hold either the slab_mutex - * or the memcg's slab_caches_mutex while calling this function and accessing - * the returned value. + * That said the caller must assure the memcg's cache won't go away by either + * taking a css reference to the owner cgroup, or holding the slab_mutex. */ static inline struct kmem_cache * cache_from_memcg_idx(struct kmem_cache *s, int idx) { struct kmem_cache *cachep; - struct memcg_cache_params *params; - - if (!s->memcg_params) - return NULL; + struct memcg_cache_array *arr; rcu_read_lock(); - params = rcu_dereference(s->memcg_params); + arr = rcu_dereference(s->memcg_params.memcg_caches); /* * Make sure we will access the up-to-date value. The code updating * memcg_caches issues a write barrier to match this (see - * memcg_register_cache()). + * memcg_create_kmem_cache()). */ - cachep = lockless_dereference(params->memcg_caches[idx]); + cachep = lockless_dereference(arr->entries[idx]); rcu_read_unlock(); return cachep; @@ -225,7 +215,7 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s) { if (is_root_cache(s)) return s; - return s->memcg_params->root_cache; + return s->memcg_params.root_cache; } static __always_inline int memcg_charge_slab(struct kmem_cache *s, @@ -235,7 +225,7 @@ static __always_inline int memcg_charge_slab(struct kmem_cache *s, return 0; if (is_root_cache(s)) return 0; - return memcg_charge_kmem(s->memcg_params->memcg, gfp, 1 << order); + return memcg_charge_kmem(s->memcg_params.memcg, gfp, 1 << order); } static __always_inline void memcg_uncharge_slab(struct kmem_cache *s, int order) @@ -244,9 +234,13 @@ static __always_inline void memcg_uncharge_slab(struct kmem_cache *s, int order) return; if (is_root_cache(s)) return; - memcg_uncharge_kmem(s->memcg_params->memcg, 1 << order); + memcg_uncharge_kmem(s->memcg_params.memcg, 1 << order); } -#else + +extern void slab_init_memcg_params(struct kmem_cache *); + +#else /* !CONFIG_MEMCG_KMEM */ + static inline bool is_root_cache(struct kmem_cache *s) { return true; @@ -282,7 +276,11 @@ static inline int memcg_charge_slab(struct kmem_cache *s, gfp_t gfp, int order) static inline void memcg_uncharge_slab(struct kmem_cache *s, int order) { } -#endif + +static inline void slab_init_memcg_params(struct kmem_cache *s) +{ +} +#endif /* CONFIG_MEMCG_KMEM */ static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x) { diff --git a/mm/slab_common.c b/mm/slab_common.c index 23f5fcde6043..7cc32cf126ef 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -106,62 +106,66 @@ static inline int kmem_cache_sanity_check(const char *name, size_t size) #endif #ifdef CONFIG_MEMCG_KMEM -static int memcg_alloc_cache_params(struct mem_cgroup *memcg, - struct kmem_cache *s, struct kmem_cache *root_cache) +void slab_init_memcg_params(struct kmem_cache *s) { - size_t size; + s->memcg_params.is_root_cache = true; + RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL); +} + +static int init_memcg_params(struct kmem_cache *s, + struct mem_cgroup *memcg, struct kmem_cache *root_cache) +{ + struct memcg_cache_array *arr; - if (!memcg_kmem_enabled()) + if (memcg) { + s->memcg_params.is_root_cache = false; + s->memcg_params.memcg = memcg; + s->memcg_params.root_cache = root_cache; return 0; + } - if (!memcg) { - size = offsetof(struct memcg_cache_params, memcg_caches); - size += memcg_nr_cache_ids * sizeof(void *); - } else - size = sizeof(struct memcg_cache_params); + slab_init_memcg_params(s); - s->memcg_params = kzalloc(size, GFP_KERNEL); - if (!s->memcg_params) - return -ENOMEM; + if (!memcg_nr_cache_ids) + return 0; - if (memcg) { - s->memcg_params->memcg = memcg; - s->memcg_params->root_cache = root_cache; - } else - s->memcg_params->is_root_cache = true; + arr = kzalloc(sizeof(struct memcg_cache_array) + + memcg_nr_cache_ids * sizeof(void *), + GFP_KERNEL); + if (!arr) + return -ENOMEM; + RCU_INIT_POINTER(s->memcg_params.memcg_caches, arr); return 0; } -static void memcg_free_cache_params(struct kmem_cache *s) +static void destroy_memcg_params(struct kmem_cache *s) { - kfree(s->memcg_params); + if (is_root_cache(s)) + kfree(rcu_access_pointer(s->memcg_params.memcg_caches)); } -static int memcg_update_cache_params(struct kmem_cache *s, int num_memcgs) +static int update_memcg_params(struct kmem_cache *s, int new_array_size) { - int size; - struct memcg_cache_params *new_params, *cur_params; + struct memcg_cache_array *old, *new; - BUG_ON(!is_root_cache(s)); - - size = offsetof(struct memcg_cache_params, memcg_caches); - size += num_memcgs * sizeof(void *); + if (!is_root_cache(s)) + return 0; - new_params = kzalloc(size, GFP_KERNEL); - if (!new_params) + new = kzalloc(sizeof(struct memcg_cache_array) + + new_array_size * sizeof(void *), GFP_KERNEL); + if (!new) return -ENOMEM; - cur_params = s->memcg_params; - memcpy(new_params->memcg_caches, cur_params->memcg_caches, - memcg_nr_cache_ids * sizeof(void *)); - - new_params->is_root_cache = true; - - rcu_assign_pointer(s->memcg_params, new_params); - if (cur_params) - kfree_rcu(cur_params, rcu_head); + old = rcu_dereference_protected(s->memcg_params.memcg_caches, + lockdep_is_held(&slab_mutex)); + if (old) + memcpy(new->entries, old->entries, + memcg_nr_cache_ids * sizeof(void *)); + rcu_assign_pointer(s->memcg_params.memcg_caches, new); + if (old) + kfree_rcu(old, rcu); return 0; } @@ -172,10 +176,7 @@ int memcg_update_all_caches(int num_memcgs) mutex_lock(&slab_mutex); list_for_each_entry(s, &slab_caches, list) { - if (!is_root_cache(s)) - continue; - - ret = memcg_update_cache_params(s, num_memcgs); + ret = update_memcg_params(s, num_memcgs); /* * Instead of freeing the memory, we'll just leave the caches * up to this point in an updated state. @@ -187,13 +188,13 @@ int memcg_update_all_caches(int num_memcgs) return ret; } #else -static inline int memcg_alloc_cache_params(struct mem_cgroup *memcg, - struct kmem_cache *s, struct kmem_cache *root_cache) +static inline int init_memcg_params(struct kmem_cache *s, + struct mem_cgroup *memcg, struct kmem_cache *root_cache) { return 0; } -static inline void memcg_free_cache_params(struct kmem_cache *s) +static inline void destroy_memcg_params(struct kmem_cache *s) { } #endif /* CONFIG_MEMCG_KMEM */ @@ -311,7 +312,7 @@ do_kmem_cache_create(char *name, size_t object_size, size_t size, size_t align, s->align = align; s->ctor = ctor; - err = memcg_alloc_cache_params(memcg, s, root_cache); + err = init_memcg_params(s, memcg, root_cache); if (err) goto out_free_cache; @@ -327,7 +328,7 @@ out: return s; out_free_cache: - memcg_free_cache_params(s); + destroy_memcg_params(s); kmem_cache_free(kmem_cache, s); goto out; } @@ -439,11 +440,15 @@ static int do_kmem_cache_shutdown(struct kmem_cache *s, #ifdef CONFIG_MEMCG_KMEM if (!is_root_cache(s)) { - struct kmem_cache *root_cache = s->memcg_params->root_cache; - int memcg_id = memcg_cache_id(s->memcg_params->memcg); - - BUG_ON(root_cache->memcg_params->memcg_caches[memcg_id] != s); - root_cache->memcg_params->memcg_caches[memcg_id] = NULL; + int idx; + struct memcg_cache_array *arr; + + idx = memcg_cache_id(s->memcg_params.memcg); + arr = rcu_dereference_protected(s->memcg_params.root_cache-> + memcg_params.memcg_caches, + lockdep_is_held(&slab_mutex)); + BUG_ON(arr->entries[idx] != s); + arr->entries[idx] = NULL; } #endif list_move(&s->list, release); @@ -481,27 +486,32 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, struct kmem_cache *root_cache) { static char memcg_name_buf[NAME_MAX + 1]; /* protected by slab_mutex */ - int memcg_id = memcg_cache_id(memcg); + struct memcg_cache_array *arr; struct kmem_cache *s = NULL; char *cache_name; + int idx; get_online_cpus(); get_online_mems(); mutex_lock(&slab_mutex); + idx = memcg_cache_id(memcg); + arr = rcu_dereference_protected(root_cache->memcg_params.memcg_caches, + lockdep_is_held(&slab_mutex)); + /* * Since per-memcg caches are created asynchronously on first * allocation (see memcg_kmem_get_cache()), several threads can try to * create the same cache, but only one of them may succeed. */ - if (cache_from_memcg_idx(root_cache, memcg_id)) + if (arr->entries[idx]) goto out_unlock; cgroup_name(mem_cgroup_css(memcg)->cgroup, memcg_name_buf, sizeof(memcg_name_buf)); cache_name = kasprintf(GFP_KERNEL, "%s(%d:%s)", root_cache->name, - memcg_cache_id(memcg), memcg_name_buf); + idx, memcg_name_buf); if (!cache_name) goto out_unlock; @@ -525,7 +535,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, * initialized. */ smp_wmb(); - root_cache->memcg_params->memcg_caches[memcg_id] = s; + arr->entries[idx] = s; out_unlock: mutex_unlock(&slab_mutex); @@ -545,7 +555,7 @@ void memcg_destroy_kmem_caches(struct mem_cgroup *memcg) mutex_lock(&slab_mutex); list_for_each_entry_safe(s, s2, &slab_caches, list) { - if (is_root_cache(s) || s->memcg_params->memcg != memcg) + if (is_root_cache(s) || s->memcg_params.memcg != memcg) continue; /* * The cgroup is about to be freed and therefore has no charges @@ -564,7 +574,7 @@ void memcg_destroy_kmem_caches(struct mem_cgroup *memcg) void slab_kmem_cache_release(struct kmem_cache *s) { - memcg_free_cache_params(s); + destroy_memcg_params(s); kfree(s->name); kmem_cache_free(kmem_cache, s); } @@ -640,6 +650,9 @@ void __init create_boot_cache(struct kmem_cache *s, const char *name, size_t siz s->name = name; s->size = s->object_size = size; s->align = calculate_alignment(flags, ARCH_KMALLOC_MINALIGN, size); + + slab_init_memcg_params(s); + err = __kmem_cache_create(s, flags); if (err) @@ -980,7 +993,7 @@ int memcg_slab_show(struct seq_file *m, void *p) if (p == slab_caches.next) print_slabinfo_header(m); - if (!is_root_cache(s) && s->memcg_params->memcg == memcg) + if (!is_root_cache(s) && s->memcg_params.memcg == memcg) cache_show(s, m); return 0; } diff --git a/mm/slub.c b/mm/slub.c index 8b8508adf9c2..75d55fdfe3a1 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3577,6 +3577,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache) p->slab_cache = s; #endif } + slab_init_memcg_params(s); list_add(&s->list, &slab_caches); return s; } @@ -4964,7 +4965,7 @@ static void memcg_propagate_slab_attrs(struct kmem_cache *s) if (is_root_cache(s)) return; - root_cache = s->memcg_params->root_cache; + root_cache = s->memcg_params.root_cache; /* * This mean this cache had no attribute written. Therefore, no point @@ -5044,7 +5045,7 @@ static inline struct kset *cache_kset(struct kmem_cache *s) { #ifdef CONFIG_MEMCG_KMEM if (!is_root_cache(s)) - return s->memcg_params->root_cache->memcg_kset; + return s->memcg_params.root_cache->memcg_kset; #endif return slab_kset; } -- cgit v1.2.3 From 426589f571f7d6d5ab2ca33ece73164149279ca1 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:23 -0800 Subject: slab: link memcg caches of the same kind into a list Sometimes, we need to iterate over all memcg copies of a particular root kmem cache. Currently, we use memcg_cache_params->memcg_caches array for that, because it contains all existing memcg caches. However, it's a bad practice to keep all caches, including those that belong to offline cgroups, in this array, because it will be growing beyond any bounds then. I'm going to wipe away dead caches from it to save space. To still be able to perform iterations over all memcg caches of the same kind, let us link them into a list. Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/slab.h | 4 ++++ mm/slab.c | 13 +++++-------- mm/slab.h | 17 +++++++++++++++++ mm/slab_common.c | 21 ++++++++++----------- mm/slub.c | 19 +++++-------------- 5 files changed, 41 insertions(+), 33 deletions(-) diff --git a/include/linux/slab.h b/include/linux/slab.h index 1e03c11bbfbd..26d99f41b410 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -491,9 +491,13 @@ struct memcg_cache_array { * * @memcg: pointer to the memcg this cache belongs to * @root_cache: pointer to the global, root cache, this cache was derived from + * + * Both root and child caches of the same kind are linked into a list chained + * through @list. */ struct memcg_cache_params { bool is_root_cache; + struct list_head list; union { struct memcg_cache_array __rcu *memcg_caches; struct { diff --git a/mm/slab.c b/mm/slab.c index 65b5dcb6f671..7894017bc160 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3708,8 +3708,7 @@ static int do_tune_cpucache(struct kmem_cache *cachep, int limit, int batchcount, int shared, gfp_t gfp) { int ret; - struct kmem_cache *c = NULL; - int i = 0; + struct kmem_cache *c; ret = __do_tune_cpucache(cachep, limit, batchcount, shared, gfp); @@ -3719,12 +3718,10 @@ static int do_tune_cpucache(struct kmem_cache *cachep, int limit, if ((ret < 0) || !is_root_cache(cachep)) return ret; - VM_BUG_ON(!mutex_is_locked(&slab_mutex)); - for_each_memcg_cache_index(i) { - c = cache_from_memcg_idx(cachep, i); - if (c) - /* return value determined by the parent cache only */ - __do_tune_cpucache(c, limit, batchcount, shared, gfp); + lockdep_assert_held(&slab_mutex); + for_each_memcg_cache(c, cachep) { + /* return value determined by the root cache only */ + __do_tune_cpucache(c, limit, batchcount, shared, gfp); } return ret; diff --git a/mm/slab.h b/mm/slab.h index 53a623f85931..0a56d76ac0e9 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -163,6 +163,18 @@ ssize_t slabinfo_write(struct file *file, const char __user *buffer, size_t count, loff_t *ppos); #ifdef CONFIG_MEMCG_KMEM +/* + * Iterate over all memcg caches of the given root cache. The caller must hold + * slab_mutex. + */ +#define for_each_memcg_cache(iter, root) \ + list_for_each_entry(iter, &(root)->memcg_params.list, \ + memcg_params.list) + +#define for_each_memcg_cache_safe(iter, tmp, root) \ + list_for_each_entry_safe(iter, tmp, &(root)->memcg_params.list, \ + memcg_params.list) + static inline bool is_root_cache(struct kmem_cache *s) { return s->memcg_params.is_root_cache; @@ -241,6 +253,11 @@ extern void slab_init_memcg_params(struct kmem_cache *); #else /* !CONFIG_MEMCG_KMEM */ +#define for_each_memcg_cache(iter, root) \ + for ((void)(iter), (void)(root); 0; ) +#define for_each_memcg_cache_safe(iter, tmp, root) \ + for ((void)(iter), (void)(tmp), (void)(root); 0; ) + static inline bool is_root_cache(struct kmem_cache *s) { return true; diff --git a/mm/slab_common.c b/mm/slab_common.c index 7cc32cf126ef..989784bd88be 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -109,6 +109,7 @@ static inline int kmem_cache_sanity_check(const char *name, size_t size) void slab_init_memcg_params(struct kmem_cache *s) { s->memcg_params.is_root_cache = true; + INIT_LIST_HEAD(&s->memcg_params.list); RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL); } @@ -449,6 +450,7 @@ static int do_kmem_cache_shutdown(struct kmem_cache *s, lockdep_is_held(&slab_mutex)); BUG_ON(arr->entries[idx] != s); arr->entries[idx] = NULL; + list_del(&s->memcg_params.list); } #endif list_move(&s->list, release); @@ -529,6 +531,8 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, goto out_unlock; } + list_add(&s->memcg_params.list, &root_cache->memcg_params.list); + /* * Since readers won't lock (see cache_from_memcg_idx()), we need a * barrier here to ensure nobody will see the kmem_cache partially @@ -581,11 +585,13 @@ void slab_kmem_cache_release(struct kmem_cache *s) void kmem_cache_destroy(struct kmem_cache *s) { - int i; + struct kmem_cache *c, *c2; LIST_HEAD(release); bool need_rcu_barrier = false; bool busy = false; + BUG_ON(!is_root_cache(s)); + get_online_cpus(); get_online_mems(); @@ -595,10 +601,8 @@ void kmem_cache_destroy(struct kmem_cache *s) if (s->refcount) goto out_unlock; - for_each_memcg_cache_index(i) { - struct kmem_cache *c = cache_from_memcg_idx(s, i); - - if (c && do_kmem_cache_shutdown(c, &release, &need_rcu_barrier)) + for_each_memcg_cache_safe(c, c2, s) { + if (do_kmem_cache_shutdown(c, &release, &need_rcu_barrier)) busy = true; } @@ -932,16 +936,11 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info) { struct kmem_cache *c; struct slabinfo sinfo; - int i; if (!is_root_cache(s)) return; - for_each_memcg_cache_index(i) { - c = cache_from_memcg_idx(s, i); - if (!c) - continue; - + for_each_memcg_cache(c, s) { memset(&sinfo, 0, sizeof(sinfo)); get_slabinfo(c, &sinfo); diff --git a/mm/slub.c b/mm/slub.c index 75d55fdfe3a1..1e5a4636cb23 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3636,13 +3636,10 @@ struct kmem_cache * __kmem_cache_alias(const char *name, size_t size, size_t align, unsigned long flags, void (*ctor)(void *)) { - struct kmem_cache *s; + struct kmem_cache *s, *c; s = find_mergeable(size, align, flags, name, ctor); if (s) { - int i; - struct kmem_cache *c; - s->refcount++; /* @@ -3652,10 +3649,7 @@ __kmem_cache_alias(const char *name, size_t size, size_t align, s->object_size = max(s->object_size, (int)size); s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *))); - for_each_memcg_cache_index(i) { - c = cache_from_memcg_idx(s, i); - if (!c) - continue; + for_each_memcg_cache(c, s) { c->object_size = s->object_size; c->inuse = max_t(int, c->inuse, ALIGN(size, sizeof(void *))); @@ -4921,7 +4915,7 @@ static ssize_t slab_attr_store(struct kobject *kobj, err = attribute->store(s, buf, len); #ifdef CONFIG_MEMCG_KMEM if (slab_state >= FULL && err >= 0 && is_root_cache(s)) { - int i; + struct kmem_cache *c; mutex_lock(&slab_mutex); if (s->max_attr_size < len) @@ -4944,11 +4938,8 @@ static ssize_t slab_attr_store(struct kobject *kobj, * directly either failed or succeeded, in which case we loop * through the descendants with best-effort propagation. */ - for_each_memcg_cache_index(i) { - struct kmem_cache *c = cache_from_memcg_idx(s, i); - if (c) - attribute->store(c, buf, len); - } + for_each_memcg_cache(c, s) + attribute->store(c, buf, len); mutex_unlock(&slab_mutex); } #endif -- cgit v1.2.3 From 01e586598b224d1a427acd8a7afa0b21e879d3a7 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:26 -0800 Subject: cgroup: release css->id after css_free Currently, we release css->id in css_release_work_fn, right before calling css_free callback, so that when css_free is called, the id may have already been reused for a new cgroup. I am going to use css->id to create unique names for per memcg kmem caches. Since kmem caches are destroyed only on css_free, I need css->id to be freed after css_free was called to avoid name clashes. This patch therefore moves css->id removal to css_free_work_fn. To prevent css_from_id from returning a pointer to a stale css, it makes css_release_work_fn replace the css ptr at css_idr:css->id with NULL. Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Acked-by: Tejun Heo Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cgroup.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 04cfe8ace520..d5f6ec251fb2 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -4373,16 +4373,20 @@ static void css_free_work_fn(struct work_struct *work) { struct cgroup_subsys_state *css = container_of(work, struct cgroup_subsys_state, destroy_work); + struct cgroup_subsys *ss = css->ss; struct cgroup *cgrp = css->cgroup; percpu_ref_exit(&css->refcnt); - if (css->ss) { + if (ss) { /* css free path */ + int id = css->id; + if (css->parent) css_put(css->parent); - css->ss->css_free(css); + ss->css_free(css); + cgroup_idr_remove(&ss->css_idr, id); cgroup_put(cgrp); } else { /* cgroup free path */ @@ -4434,7 +4438,7 @@ static void css_release_work_fn(struct work_struct *work) if (ss) { /* css release path */ - cgroup_idr_remove(&ss->css_idr, css->id); + cgroup_idr_replace(&ss->css_idr, NULL, css->id); if (ss->css_released) ss->css_released(css); } else { -- cgit v1.2.3 From f1008365bbe4931d6a94dcfc11cf4cdada359664 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:29 -0800 Subject: slab: use css id for naming per memcg caches Currently, we use mem_cgroup->kmemcg_id to guarantee kmem_cache->name uniqueness. This is correct, because kmemcg_id is only released on css free after destroying all per memcg caches. However, I am going to change that and release kmemcg_id on css offline, because it is not wise to keep it for so long, wasting valuable entries of memcg_cache_params->memcg_caches arrays. Therefore, to preserve cache name uniqueness, let us switch to css->id. Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slab_common.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/slab_common.c b/mm/slab_common.c index 989784bd88be..6087b1f9a385 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -488,6 +488,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, struct kmem_cache *root_cache) { static char memcg_name_buf[NAME_MAX + 1]; /* protected by slab_mutex */ + struct cgroup_subsys_state *css = mem_cgroup_css(memcg); struct memcg_cache_array *arr; struct kmem_cache *s = NULL; char *cache_name; @@ -510,10 +511,9 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, if (arr->entries[idx]) goto out_unlock; - cgroup_name(mem_cgroup_css(memcg)->cgroup, - memcg_name_buf, sizeof(memcg_name_buf)); + cgroup_name(css->cgroup, memcg_name_buf, sizeof(memcg_name_buf)); cache_name = kasprintf(GFP_KERNEL, "%s(%d:%s)", root_cache->name, - idx, memcg_name_buf); + css->id, memcg_name_buf); if (!cache_name) goto out_unlock; -- cgit v1.2.3 From 2a4db7eb9391a544ff58f4fa11d35246e87c87af Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:32 -0800 Subject: memcg: free memcg_caches slot on css offline We need to look up a kmem_cache in ->memcg_params.memcg_caches arrays only on allocations, so there is no need to have the array entries set until css free - we can clear them on css offline. This will allow us to reuse array entries more efficiently and avoid costly array relocations. Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/slab.h | 10 +++++----- mm/memcontrol.c | 38 ++++++++++++++++++++++++++++++++------ mm/slab_common.c | 39 ++++++++++++++++++++++++++++----------- 3 files changed, 65 insertions(+), 22 deletions(-) diff --git a/include/linux/slab.h b/include/linux/slab.h index 26d99f41b410..ed2ffaab59ea 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -115,13 +115,12 @@ int slab_is_available(void); struct kmem_cache *kmem_cache_create(const char *, size_t, size_t, unsigned long, void (*)(void *)); -#ifdef CONFIG_MEMCG_KMEM -void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *); -void memcg_destroy_kmem_caches(struct mem_cgroup *); -#endif void kmem_cache_destroy(struct kmem_cache *); int kmem_cache_shrink(struct kmem_cache *); -void kmem_cache_free(struct kmem_cache *, void *); + +void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *); +void memcg_deactivate_kmem_caches(struct mem_cgroup *); +void memcg_destroy_kmem_caches(struct mem_cgroup *); /* * Please use this macro to create slab caches. Simply specify the @@ -288,6 +287,7 @@ static __always_inline int kmalloc_index(size_t size) void *__kmalloc(size_t size, gfp_t flags); void *kmem_cache_alloc(struct kmem_cache *, gfp_t flags); +void kmem_cache_free(struct kmem_cache *, void *); #ifdef CONFIG_NUMA void *__kmalloc_node(size_t size, gfp_t flags, int node); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6f3c0fcd7a2d..abfe0135bfdc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -334,6 +334,7 @@ struct mem_cgroup { #if defined(CONFIG_MEMCG_KMEM) /* Index in the kmem_cache->memcg_params.memcg_caches array */ int kmemcg_id; + bool kmem_acct_active; #endif int last_scanned_node; @@ -354,7 +355,7 @@ struct mem_cgroup { #ifdef CONFIG_MEMCG_KMEM bool memcg_kmem_is_active(struct mem_cgroup *memcg) { - return memcg->kmemcg_id >= 0; + return memcg->kmem_acct_active; } #endif @@ -585,7 +586,7 @@ static void memcg_free_cache_id(int id); static void disarm_kmem_keys(struct mem_cgroup *memcg) { - if (memcg_kmem_is_active(memcg)) { + if (memcg->kmemcg_id >= 0) { static_key_slow_dec(&memcg_kmem_enabled_key); memcg_free_cache_id(memcg->kmemcg_id); } @@ -2666,6 +2667,7 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep) { struct mem_cgroup *memcg; struct kmem_cache *memcg_cachep; + int kmemcg_id; VM_BUG_ON(!is_root_cache(cachep)); @@ -2673,10 +2675,11 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep) return cachep; memcg = get_mem_cgroup_from_mm(current->mm); - if (!memcg_kmem_is_active(memcg)) + kmemcg_id = ACCESS_ONCE(memcg->kmemcg_id); + if (kmemcg_id < 0) goto out; - memcg_cachep = cache_from_memcg_idx(cachep, memcg_cache_id(memcg)); + memcg_cachep = cache_from_memcg_idx(cachep, kmemcg_id); if (likely(memcg_cachep)) return memcg_cachep; @@ -3318,8 +3321,8 @@ static int memcg_activate_kmem(struct mem_cgroup *memcg, int err = 0; int memcg_id; - if (memcg_kmem_is_active(memcg)) - return 0; + BUG_ON(memcg->kmemcg_id >= 0); + BUG_ON(memcg->kmem_acct_active); /* * For simplicity, we won't allow this to be disabled. It also can't @@ -3362,6 +3365,7 @@ static int memcg_activate_kmem(struct mem_cgroup *memcg, * patched. */ memcg->kmemcg_id = memcg_id; + memcg->kmem_acct_active = true; out: return err; } @@ -4041,6 +4045,22 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) return mem_cgroup_sockets_init(memcg, ss); } +static void memcg_deactivate_kmem(struct mem_cgroup *memcg) +{ + if (!memcg->kmem_acct_active) + return; + + /* + * Clear the 'active' flag before clearing memcg_caches arrays entries. + * Since we take the slab_mutex in memcg_deactivate_kmem_caches(), it + * guarantees no cache will be created for this cgroup after we are + * done (see memcg_create_kmem_cache()). + */ + memcg->kmem_acct_active = false; + + memcg_deactivate_kmem_caches(memcg); +} + static void memcg_destroy_kmem(struct mem_cgroup *memcg) { memcg_destroy_kmem_caches(memcg); @@ -4052,6 +4072,10 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) return 0; } +static void memcg_deactivate_kmem(struct mem_cgroup *memcg) +{ +} + static void memcg_destroy_kmem(struct mem_cgroup *memcg) { } @@ -4608,6 +4632,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) spin_unlock(&memcg->event_list_lock); vmpressure_cleanup(&memcg->vmpressure); + + memcg_deactivate_kmem(memcg); } static void mem_cgroup_css_free(struct cgroup_subsys_state *css) diff --git a/mm/slab_common.c b/mm/slab_common.c index 6087b1f9a385..0873bcc61c7a 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -440,18 +440,8 @@ static int do_kmem_cache_shutdown(struct kmem_cache *s, *need_rcu_barrier = true; #ifdef CONFIG_MEMCG_KMEM - if (!is_root_cache(s)) { - int idx; - struct memcg_cache_array *arr; - - idx = memcg_cache_id(s->memcg_params.memcg); - arr = rcu_dereference_protected(s->memcg_params.root_cache-> - memcg_params.memcg_caches, - lockdep_is_held(&slab_mutex)); - BUG_ON(arr->entries[idx] != s); - arr->entries[idx] = NULL; + if (!is_root_cache(s)) list_del(&s->memcg_params.list); - } #endif list_move(&s->list, release); return 0; @@ -499,6 +489,13 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, mutex_lock(&slab_mutex); + /* + * The memory cgroup could have been deactivated while the cache + * creation work was pending. + */ + if (!memcg_kmem_is_active(memcg)) + goto out_unlock; + idx = memcg_cache_id(memcg); arr = rcu_dereference_protected(root_cache->memcg_params.memcg_caches, lockdep_is_held(&slab_mutex)); @@ -548,6 +545,26 @@ out_unlock: put_online_cpus(); } +void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg) +{ + int idx; + struct memcg_cache_array *arr; + struct kmem_cache *s; + + idx = memcg_cache_id(memcg); + + mutex_lock(&slab_mutex); + list_for_each_entry(s, &slab_caches, list) { + if (!is_root_cache(s)) + continue; + + arr = rcu_dereference_protected(s->memcg_params.memcg_caches, + lockdep_is_held(&slab_mutex)); + arr->entries[idx] = NULL; + } + mutex_unlock(&slab_mutex); +} + void memcg_destroy_kmem_caches(struct mem_cgroup *memcg) { LIST_HEAD(release); -- cgit v1.2.3 From 3f97b163207c67a3b35931494ad3db1de66356f0 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:35 -0800 Subject: list_lru: add helpers to isolate items Currently, the isolate callback passed to the list_lru_walk family of functions is supposed to just delete an item from the list upon returning LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by __list_lru_walk_one after the callback returns. Since the callback is allowed to drop the lock after removing an item (it has to return LRU_REMOVED_RETRY then), the nr_items can be less than the actual number of elements on the list even if we check them under the lock. This makes it difficult to move items from one list_lru_one to another, which is required for per-memcg list_lru reparenting - we can't just splice the lists, we have to move entries one by one. This patch therefore introduces helpers that must be used by callback functions to isolate items instead of raw list_del/list_move. These are list_lru_isolate and list_lru_isolate_move. They not only remove the entry from the list, but also fix the nr_items counter, making sure nr_items always reflects the actual number of elements on the list if checked under the appropriate lock. Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/dcache.c | 21 +++++++++++---------- fs/gfs2/quota.c | 5 +++-- fs/inode.c | 8 ++++---- fs/xfs/xfs_buf.c | 6 ++++-- fs/xfs/xfs_qm.c | 5 +++-- include/linux/list_lru.h | 9 +++++++-- mm/list_lru.c | 19 ++++++++++++++++--- mm/workingset.c | 3 ++- 8 files changed, 50 insertions(+), 26 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index 56c5da89f58a..d04be762b216 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -400,19 +400,20 @@ static void d_shrink_add(struct dentry *dentry, struct list_head *list) * LRU lists entirely, while shrink_move moves it to the indicated * private list. */ -static void d_lru_isolate(struct dentry *dentry) +static void d_lru_isolate(struct list_lru_one *lru, struct dentry *dentry) { D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST); dentry->d_flags &= ~DCACHE_LRU_LIST; this_cpu_dec(nr_dentry_unused); - list_del_init(&dentry->d_lru); + list_lru_isolate(lru, &dentry->d_lru); } -static void d_lru_shrink_move(struct dentry *dentry, struct list_head *list) +static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry, + struct list_head *list) { D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST); dentry->d_flags |= DCACHE_SHRINK_LIST; - list_move_tail(&dentry->d_lru, list); + list_lru_isolate_move(lru, &dentry->d_lru, list); } /* @@ -869,8 +870,8 @@ static void shrink_dentry_list(struct list_head *list) } } -static enum lru_status -dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) +static enum lru_status dentry_lru_isolate(struct list_head *item, + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) { struct list_head *freeable = arg; struct dentry *dentry = container_of(item, struct dentry, d_lru); @@ -890,7 +891,7 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) * another pass through the LRU. */ if (dentry->d_lockref.count) { - d_lru_isolate(dentry); + d_lru_isolate(lru, dentry); spin_unlock(&dentry->d_lock); return LRU_REMOVED; } @@ -921,7 +922,7 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) return LRU_ROTATE; } - d_lru_shrink_move(dentry, freeable); + d_lru_shrink_move(lru, dentry, freeable); spin_unlock(&dentry->d_lock); return LRU_REMOVED; @@ -951,7 +952,7 @@ long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc) } static enum lru_status dentry_lru_isolate_shrink(struct list_head *item, - spinlock_t *lru_lock, void *arg) + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) { struct list_head *freeable = arg; struct dentry *dentry = container_of(item, struct dentry, d_lru); @@ -964,7 +965,7 @@ static enum lru_status dentry_lru_isolate_shrink(struct list_head *item, if (!spin_trylock(&dentry->d_lock)) return LRU_SKIP; - d_lru_shrink_move(dentry, freeable); + d_lru_shrink_move(lru, dentry, freeable); spin_unlock(&dentry->d_lock); return LRU_REMOVED; diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c index c15d6b216d0b..3aa17d4d1cfc 100644 --- a/fs/gfs2/quota.c +++ b/fs/gfs2/quota.c @@ -145,7 +145,8 @@ static void gfs2_qd_dispose(struct list_head *list) } -static enum lru_status gfs2_qd_isolate(struct list_head *item, spinlock_t *lock, void *arg) +static enum lru_status gfs2_qd_isolate(struct list_head *item, + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) { struct list_head *dispose = arg; struct gfs2_quota_data *qd = list_entry(item, struct gfs2_quota_data, qd_lru); @@ -155,7 +156,7 @@ static enum lru_status gfs2_qd_isolate(struct list_head *item, spinlock_t *lock, if (qd->qd_lockref.count == 0) { lockref_mark_dead(&qd->qd_lockref); - list_move(&qd->qd_lru, dispose); + list_lru_isolate_move(lru, &qd->qd_lru, dispose); } spin_unlock(&qd->qd_lockref.lock); diff --git a/fs/inode.c b/fs/inode.c index 524a32c2b0c6..86c612b92c6f 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -685,8 +685,8 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) * LRU does not have strict ordering. Hence we don't want to reclaim inodes * with this flag set because they are the inodes that are out of order. */ -static enum lru_status -inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) +static enum lru_status inode_lru_isolate(struct list_head *item, + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) { struct list_head *freeable = arg; struct inode *inode = container_of(item, struct inode, i_lru); @@ -704,7 +704,7 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) */ if (atomic_read(&inode->i_count) || (inode->i_state & ~I_REFERENCED)) { - list_del_init(&inode->i_lru); + list_lru_isolate(lru, &inode->i_lru); spin_unlock(&inode->i_lock); this_cpu_dec(nr_unused); return LRU_REMOVED; @@ -738,7 +738,7 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg) WARN_ON(inode->i_state & I_NEW); inode->i_state |= I_FREEING; - list_move(&inode->i_lru, freeable); + list_lru_isolate_move(lru, &inode->i_lru, freeable); spin_unlock(&inode->i_lock); this_cpu_dec(nr_unused); diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index 15c9d224c721..1790b00bea7a 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -1488,6 +1488,7 @@ xfs_buf_iomove( static enum lru_status xfs_buftarg_wait_rele( struct list_head *item, + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) @@ -1509,7 +1510,7 @@ xfs_buftarg_wait_rele( */ atomic_set(&bp->b_lru_ref, 0); bp->b_state |= XFS_BSTATE_DISPOSE; - list_move(item, dispose); + list_lru_isolate_move(lru, item, dispose); spin_unlock(&bp->b_lock); return LRU_REMOVED; } @@ -1546,6 +1547,7 @@ xfs_wait_buftarg( static enum lru_status xfs_buftarg_isolate( struct list_head *item, + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) { @@ -1569,7 +1571,7 @@ xfs_buftarg_isolate( } bp->b_state |= XFS_BSTATE_DISPOSE; - list_move(item, dispose); + list_lru_isolate_move(lru, item, dispose); spin_unlock(&bp->b_lock); return LRU_REMOVED; } diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c index 4f4b1274e144..53cc2aaf8d2b 100644 --- a/fs/xfs/xfs_qm.c +++ b/fs/xfs/xfs_qm.c @@ -430,6 +430,7 @@ struct xfs_qm_isolate { static enum lru_status xfs_qm_dquot_isolate( struct list_head *item, + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) __releases(lru_lock) __acquires(lru_lock) @@ -450,7 +451,7 @@ xfs_qm_dquot_isolate( XFS_STATS_INC(xs_qm_dqwants); trace_xfs_dqreclaim_want(dqp); - list_del_init(&dqp->q_lru); + list_lru_isolate(lru, &dqp->q_lru); XFS_STATS_DEC(xs_qm_dquot_unused); return LRU_REMOVED; } @@ -494,7 +495,7 @@ xfs_qm_dquot_isolate( xfs_dqunlock(dqp); ASSERT(dqp->q_nrefs == 0); - list_move_tail(&dqp->q_lru, &isol->dispose); + list_lru_isolate_move(lru, &dqp->q_lru, &isol->dispose); XFS_STATS_DEC(xs_qm_dquot_unused); trace_xfs_dqreclaim_done(dqp); XFS_STATS_INC(xs_qm_dqreclaims); diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index 305b598abac2..7edf9c9ab9eb 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -125,8 +125,13 @@ static inline unsigned long list_lru_count(struct list_lru *lru) return count; } -typedef enum lru_status -(*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg); +void list_lru_isolate(struct list_lru_one *list, struct list_head *item); +void list_lru_isolate_move(struct list_lru_one *list, struct list_head *item, + struct list_head *head); + +typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, + struct list_lru_one *list, spinlock_t *lock, void *cb_arg); + /** * list_lru_walk_one: walk a list_lru, isolating and disposing freeable items. * @lru: the lru pointer. diff --git a/mm/list_lru.c b/mm/list_lru.c index 79aee70c3b9d..8d9d168c6c38 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -132,6 +132,21 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item) } EXPORT_SYMBOL_GPL(list_lru_del); +void list_lru_isolate(struct list_lru_one *list, struct list_head *item) +{ + list_del_init(item); + list->nr_items--; +} +EXPORT_SYMBOL_GPL(list_lru_isolate); + +void list_lru_isolate_move(struct list_lru_one *list, struct list_head *item, + struct list_head *head) +{ + list_move(item, head); + list->nr_items--; +} +EXPORT_SYMBOL_GPL(list_lru_isolate_move); + static unsigned long __list_lru_count_one(struct list_lru *lru, int nid, int memcg_idx) { @@ -194,13 +209,11 @@ restart: break; --*nr_to_walk; - ret = isolate(item, &nlru->lock, cb_arg); + ret = isolate(item, l, &nlru->lock, cb_arg); switch (ret) { case LRU_REMOVED_RETRY: assert_spin_locked(&nlru->lock); case LRU_REMOVED: - l->nr_items--; - WARN_ON_ONCE(l->nr_items < 0); isolated++; /* * If the lru lock has been dropped, our list diff --git a/mm/workingset.c b/mm/workingset.c index d4fa7fb10a52..aa017133744b 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -302,6 +302,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, } static enum lru_status shadow_lru_isolate(struct list_head *item, + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) { @@ -332,7 +333,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, goto out; } - list_del_init(item); + list_lru_isolate(lru, item); spin_unlock(lru_lock); /* -- cgit v1.2.3 From 2788cf0c401c268b4819c5407493a8769b7007aa Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:38 -0800 Subject: memcg: reparent list_lrus and free kmemcg_id on css offline Now, the only reason to keep kmemcg_id till css free is list_lru, which uses it to distribute elements between per-memcg lists. However, it can be easily sorted out - we only need to change kmemcg_id of an offline cgroup to its parent's id, making further list_lru_add()'s add elements to the parent's list, and then move all elements from the offline cgroup's list to the one of its parent. It will work, because a racing list_lru_del() does not need to know the list it is deleting the element from. It can decrement the wrong nr_items counter though, but the ongoing reparenting will fix it. After list_lru reparenting is done we are free to release kmemcg_id saving a valuable slot in a per-memcg array for new cgroups. Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/list_lru.h | 3 ++- mm/list_lru.c | 46 +++++++++++++++++++++++++++++++++++++++++++--- mm/memcontrol.c | 39 ++++++++++++++++++++++++++++++++++----- 3 files changed, 79 insertions(+), 9 deletions(-) diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index 7edf9c9ab9eb..2a6b9947aaa3 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -26,7 +26,7 @@ enum lru_status { struct list_lru_one { struct list_head list; - /* kept as signed so we can catch imbalance bugs */ + /* may become negative during memcg reparenting */ long nr_items; }; @@ -62,6 +62,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware, #define list_lru_init_memcg(lru) __list_lru_init((lru), true, NULL) int memcg_update_all_list_lrus(int num_memcgs); +void memcg_drain_all_list_lrus(int src_idx, int dst_idx); /** * list_lru_add: add an element to the lru list's tail diff --git a/mm/list_lru.c b/mm/list_lru.c index 8d9d168c6c38..909eca2c820e 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -100,7 +100,6 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item) spin_lock(&nlru->lock); l = list_lru_from_kmem(nlru, item); - WARN_ON_ONCE(l->nr_items < 0); if (list_empty(item)) { list_add_tail(item, &l->list); l->nr_items++; @@ -123,7 +122,6 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item) if (!list_empty(item)) { list_del_init(item); l->nr_items--; - WARN_ON_ONCE(l->nr_items < 0); spin_unlock(&nlru->lock); return true; } @@ -156,7 +154,6 @@ static unsigned long __list_lru_count_one(struct list_lru *lru, spin_lock(&nlru->lock); l = list_lru_from_memcg_idx(nlru, memcg_idx); - WARN_ON_ONCE(l->nr_items < 0); count = l->nr_items; spin_unlock(&nlru->lock); @@ -458,6 +455,49 @@ fail: memcg_cancel_update_list_lru(lru, old_size, new_size); goto out; } + +static void memcg_drain_list_lru_node(struct list_lru_node *nlru, + int src_idx, int dst_idx) +{ + struct list_lru_one *src, *dst; + + /* + * Since list_lru_{add,del} may be called under an IRQ-safe lock, + * we have to use IRQ-safe primitives here to avoid deadlock. + */ + spin_lock_irq(&nlru->lock); + + src = list_lru_from_memcg_idx(nlru, src_idx); + dst = list_lru_from_memcg_idx(nlru, dst_idx); + + list_splice_init(&src->list, &dst->list); + dst->nr_items += src->nr_items; + src->nr_items = 0; + + spin_unlock_irq(&nlru->lock); +} + +static void memcg_drain_list_lru(struct list_lru *lru, + int src_idx, int dst_idx) +{ + int i; + + if (!list_lru_memcg_aware(lru)) + return; + + for (i = 0; i < nr_node_ids; i++) + memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_idx); +} + +void memcg_drain_all_list_lrus(int src_idx, int dst_idx) +{ + struct list_lru *lru; + + mutex_lock(&list_lrus_mutex); + list_for_each_entry(lru, &list_lrus, list) + memcg_drain_list_lru(lru, src_idx, dst_idx); + mutex_unlock(&list_lrus_mutex); +} #else static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index abfe0135bfdc..419c06b1794a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -334,6 +334,7 @@ struct mem_cgroup { #if defined(CONFIG_MEMCG_KMEM) /* Index in the kmem_cache->memcg_params.memcg_caches array */ int kmemcg_id; + bool kmem_acct_activated; bool kmem_acct_active; #endif @@ -582,14 +583,10 @@ void memcg_put_cache_ids(void) struct static_key memcg_kmem_enabled_key; EXPORT_SYMBOL(memcg_kmem_enabled_key); -static void memcg_free_cache_id(int id); - static void disarm_kmem_keys(struct mem_cgroup *memcg) { - if (memcg->kmemcg_id >= 0) { + if (memcg->kmem_acct_activated) static_key_slow_dec(&memcg_kmem_enabled_key); - memcg_free_cache_id(memcg->kmemcg_id); - } /* * This check can't live in kmem destruction function, * since the charges will outlive the cgroup @@ -3322,6 +3319,7 @@ static int memcg_activate_kmem(struct mem_cgroup *memcg, int memcg_id; BUG_ON(memcg->kmemcg_id >= 0); + BUG_ON(memcg->kmem_acct_activated); BUG_ON(memcg->kmem_acct_active); /* @@ -3365,6 +3363,7 @@ static int memcg_activate_kmem(struct mem_cgroup *memcg, * patched. */ memcg->kmemcg_id = memcg_id; + memcg->kmem_acct_activated = true; memcg->kmem_acct_active = true; out: return err; @@ -4047,6 +4046,10 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) static void memcg_deactivate_kmem(struct mem_cgroup *memcg) { + struct cgroup_subsys_state *css; + struct mem_cgroup *parent, *child; + int kmemcg_id; + if (!memcg->kmem_acct_active) return; @@ -4059,6 +4062,32 @@ static void memcg_deactivate_kmem(struct mem_cgroup *memcg) memcg->kmem_acct_active = false; memcg_deactivate_kmem_caches(memcg); + + kmemcg_id = memcg->kmemcg_id; + BUG_ON(kmemcg_id < 0); + + parent = parent_mem_cgroup(memcg); + if (!parent) + parent = root_mem_cgroup; + + /* + * Change kmemcg_id of this cgroup and all its descendants to the + * parent's id, and then move all entries from this cgroup's list_lrus + * to ones of the parent. After we have finished, all list_lrus + * corresponding to this cgroup are guaranteed to remain empty. The + * ordering is imposed by list_lru_node->lock taken by + * memcg_drain_all_list_lrus(). + */ + css_for_each_descendant_pre(css, &memcg->css) { + child = mem_cgroup_from_css(css); + BUG_ON(child->kmemcg_id != kmemcg_id); + child->kmemcg_id = parent->kmemcg_id; + if (!memcg->use_hierarchy) + break; + } + memcg_drain_all_list_lrus(kmemcg_id, parent->kmemcg_id); + + memcg_free_cache_id(kmemcg_id); } static void memcg_destroy_kmem(struct mem_cgroup *memcg) -- cgit v1.2.3 From 832f37f5d5f5c7281880c21eb09508750b67f540 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:41 -0800 Subject: slub: never fail to shrink cache SLUB's version of __kmem_cache_shrink() not only removes empty slabs, but also tries to rearrange the partial lists to place slabs filled up most to the head to cope with fragmentation. To achieve that, it allocates a temporary array of lists used to sort slabs by the number of objects in use. If the allocation fails, the whole procedure is aborted. This is unacceptable for the kernel memory accounting extension of the memory cgroup, where we want to make sure that kmem_cache_shrink() successfully discarded empty slabs. Although the allocation failure is utterly unlikely with the current page allocator implementation, which retries GFP_KERNEL allocations of order <= 2 infinitely, it is better not to rely on that. This patch therefore makes __kmem_cache_shrink() allocate the array on stack instead of calling kmalloc, which may fail. The array size is chosen to be equal to 32, because most SLUB caches store not more than 32 objects per slab page. Slab pages with <= 32 free objects are sorted using the array by the number of objects in use and promoted to the head of the partial list, while slab pages with > 32 free objects are left in the end of the list without any ordering imposed on them. Signed-off-by: Vladimir Davydov Acked-by: Christoph Lameter Acked-by: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Huang Ying Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slub.c | 58 +++++++++++++++++++++++++++++++--------------------------- 1 file changed, 31 insertions(+), 27 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 1e5a4636cb23..d97b692165d2 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3358,11 +3358,12 @@ void kfree(const void *x) } EXPORT_SYMBOL(kfree); +#define SHRINK_PROMOTE_MAX 32 + /* - * kmem_cache_shrink removes empty slabs from the partial lists and sorts - * the remaining slabs by the number of items in use. The slabs with the - * most items in use come first. New allocations will then fill those up - * and thus they can be removed from the partial lists. + * kmem_cache_shrink discards empty slabs and promotes the slabs filled + * up most to the head of the partial lists. New allocations will then + * fill those up and thus they can be removed from the partial lists. * * The slabs with the least items are placed last. This results in them * being allocated from last increasing the chance that the last objects @@ -3375,51 +3376,57 @@ int __kmem_cache_shrink(struct kmem_cache *s) struct kmem_cache_node *n; struct page *page; struct page *t; - int objects = oo_objects(s->max); - struct list_head *slabs_by_inuse = - kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL); + struct list_head discard; + struct list_head promote[SHRINK_PROMOTE_MAX]; unsigned long flags; - if (!slabs_by_inuse) - return -ENOMEM; - flush_all(s); for_each_kmem_cache_node(s, node, n) { if (!n->nr_partial) continue; - for (i = 0; i < objects; i++) - INIT_LIST_HEAD(slabs_by_inuse + i); + INIT_LIST_HEAD(&discard); + for (i = 0; i < SHRINK_PROMOTE_MAX; i++) + INIT_LIST_HEAD(promote + i); spin_lock_irqsave(&n->list_lock, flags); /* - * Build lists indexed by the items in use in each slab. + * Build lists of slabs to discard or promote. * * Note that concurrent frees may occur while we hold the * list_lock. page->inuse here is the upper limit. */ list_for_each_entry_safe(page, t, &n->partial, lru) { - list_move(&page->lru, slabs_by_inuse + page->inuse); - if (!page->inuse) + int free = page->objects - page->inuse; + + /* Do not reread page->inuse */ + barrier(); + + /* We do not keep full slabs on the list */ + BUG_ON(free <= 0); + + if (free == page->objects) { + list_move(&page->lru, &discard); n->nr_partial--; + } else if (free <= SHRINK_PROMOTE_MAX) + list_move(&page->lru, promote + free - 1); } /* - * Rebuild the partial list with the slabs filled up most - * first and the least used slabs at the end. + * Promote the slabs filled up most to the head of the + * partial list. */ - for (i = objects - 1; i > 0; i--) - list_splice(slabs_by_inuse + i, n->partial.prev); + for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--) + list_splice(promote + i, &n->partial); spin_unlock_irqrestore(&n->list_lock, flags); /* Release empty slabs */ - list_for_each_entry_safe(page, t, slabs_by_inuse, lru) + list_for_each_entry_safe(page, t, &discard, lru) discard_slab(s, page); } - kfree(slabs_by_inuse); return 0; } @@ -4686,12 +4693,9 @@ static ssize_t shrink_show(struct kmem_cache *s, char *buf) static ssize_t shrink_store(struct kmem_cache *s, const char *buf, size_t length) { - if (buf[0] == '1') { - int rc = kmem_cache_shrink(s); - - if (rc) - return rc; - } else + if (buf[0] == '1') + kmem_cache_shrink(s); + else return -EINVAL; return length; } -- cgit v1.2.3 From ce3712d74d8ed531a9fd0fbb711ff8fefbacdd9f Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:44 -0800 Subject: slub: fix kmem_cache_shrink return value It is supposed to return 0 if the cache has no remaining objects and 1 otherwise, while currently it always returns 0. Fix it. Signed-off-by: Vladimir Davydov Acked-by: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slub.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index d97b692165d2..7fa27aee9b6e 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3379,6 +3379,7 @@ int __kmem_cache_shrink(struct kmem_cache *s) struct list_head discard; struct list_head promote[SHRINK_PROMOTE_MAX]; unsigned long flags; + int ret = 0; flush_all(s); for_each_kmem_cache_node(s, node, n) { @@ -3425,9 +3426,12 @@ int __kmem_cache_shrink(struct kmem_cache *s) /* Release empty slabs */ list_for_each_entry_safe(page, t, &discard, lru) discard_slab(s, page); + + if (slabs_node(s, node)) + ret = 1; } - return 0; + return ret; } static int slab_mem_going_offline_callback(void *arg) -- cgit v1.2.3 From d6e0b7fa11862433773d986b5f995ffdf47ce672 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:47 -0800 Subject: slub: make dead caches discard free slabs immediately To speed up further allocations SLUB may store empty slabs in per cpu/node partial lists instead of freeing them immediately. This prevents per memcg caches destruction, because kmem caches created for a memory cgroup are only destroyed after the last page charged to the cgroup is freed. To fix this issue, this patch resurrects approach first proposed in [1]. It forbids SLUB to cache empty slabs after the memory cgroup that the cache belongs to was destroyed. It is achieved by setting kmem_cache's cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so that it would drop frozen empty slabs immediately if cpu_partial = 0. The runtime overhead is minimal. From all the hot functions, we only touch relatively cold put_cpu_partial(): we make it call unfreeze_partials() after freezing a slab that belongs to an offline memory cgroup. Since slab freezing exists to avoid moving slabs from/to a partial list on free/alloc, and there can't be allocations from dead caches, it shouldn't cause any overhead. We do have to disable preemption for put_cpu_partial() to achieve that though. The original patch was accepted well and even merged to the mm tree. However, I decided to withdraw it due to changes happening to the memcg core at that time. I had an idea of introducing per-memcg shrinkers for kmem caches, but now, as memcg has finally settled down, I do not see it as an option, because SLUB shrinker would be too costly to call since SLUB does not keep free slabs on a separate list. Besides, we currently do not even call per-memcg shrinkers for offline memcgs. Overall, it would introduce much more complexity to both SLUB and memcg than this small patch. Regarding to SLAB, there's no problem with it, because it shrinks per-cpu/node caches periodically. Thanks to list_lru reparenting, we no longer keep entries for offline cgroups in per-memcg arrays (such as memcg_cache_params->memcg_caches), so we do not have to bother if a per-memcg cache will be shrunk a bit later than it could be. [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650 Signed-off-by: Vladimir Davydov Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slab.c | 4 ++-- mm/slab.h | 2 +- mm/slab_common.c | 15 +++++++++++++-- mm/slob.c | 2 +- mm/slub.c | 31 ++++++++++++++++++++++++++----- 5 files changed, 43 insertions(+), 11 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 7894017bc160..c4b89eaf4c96 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2382,7 +2382,7 @@ out: return nr_freed; } -int __kmem_cache_shrink(struct kmem_cache *cachep) +int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate) { int ret = 0; int node; @@ -2404,7 +2404,7 @@ int __kmem_cache_shutdown(struct kmem_cache *cachep) { int i; struct kmem_cache_node *n; - int rc = __kmem_cache_shrink(cachep); + int rc = __kmem_cache_shrink(cachep, false); if (rc) return rc; diff --git a/mm/slab.h b/mm/slab.h index 0a56d76ac0e9..4c3ac12dd644 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -138,7 +138,7 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size, #define CACHE_CREATE_MASK (SLAB_CORE_FLAGS | SLAB_DEBUG_FLAGS | SLAB_CACHE_FLAGS) int __kmem_cache_shutdown(struct kmem_cache *); -int __kmem_cache_shrink(struct kmem_cache *); +int __kmem_cache_shrink(struct kmem_cache *, bool); void slab_kmem_cache_release(struct kmem_cache *); struct seq_file; diff --git a/mm/slab_common.c b/mm/slab_common.c index 0873bcc61c7a..1a1cc89acaa3 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -549,10 +549,13 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg) { int idx; struct memcg_cache_array *arr; - struct kmem_cache *s; + struct kmem_cache *s, *c; idx = memcg_cache_id(memcg); + get_online_cpus(); + get_online_mems(); + mutex_lock(&slab_mutex); list_for_each_entry(s, &slab_caches, list) { if (!is_root_cache(s)) @@ -560,9 +563,17 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg) arr = rcu_dereference_protected(s->memcg_params.memcg_caches, lockdep_is_held(&slab_mutex)); + c = arr->entries[idx]; + if (!c) + continue; + + __kmem_cache_shrink(c, true); arr->entries[idx] = NULL; } mutex_unlock(&slab_mutex); + + put_online_mems(); + put_online_cpus(); } void memcg_destroy_kmem_caches(struct mem_cgroup *memcg) @@ -649,7 +660,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep) get_online_cpus(); get_online_mems(); - ret = __kmem_cache_shrink(cachep); + ret = __kmem_cache_shrink(cachep, false); put_online_mems(); put_online_cpus(); return ret; diff --git a/mm/slob.c b/mm/slob.c index 96a86206a26b..94a7fede6d48 100644 --- a/mm/slob.c +++ b/mm/slob.c @@ -618,7 +618,7 @@ int __kmem_cache_shutdown(struct kmem_cache *c) return 0; } -int __kmem_cache_shrink(struct kmem_cache *d) +int __kmem_cache_shrink(struct kmem_cache *d, bool deactivate) { return 0; } diff --git a/mm/slub.c b/mm/slub.c index 7fa27aee9b6e..06cdb1829dc9 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2007,6 +2007,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) int pages; int pobjects; + preempt_disable(); do { pages = 0; pobjects = 0; @@ -2040,6 +2041,14 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) != oldpage); + if (unlikely(!s->cpu_partial)) { + unsigned long flags; + + local_irq_save(flags); + unfreeze_partials(s, this_cpu_ptr(s->cpu_slab)); + local_irq_restore(flags); + } + preempt_enable(); #endif } @@ -3369,7 +3378,7 @@ EXPORT_SYMBOL(kfree); * being allocated from last increasing the chance that the last objects * are freed in them. */ -int __kmem_cache_shrink(struct kmem_cache *s) +int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate) { int node; int i; @@ -3381,11 +3390,23 @@ int __kmem_cache_shrink(struct kmem_cache *s) unsigned long flags; int ret = 0; + if (deactivate) { + /* + * Disable empty slabs caching. Used to avoid pinning offline + * memory cgroups by kmem pages that can be freed. + */ + s->cpu_partial = 0; + s->min_partial = 0; + + /* + * s->cpu_partial is checked locklessly (see put_cpu_partial), + * so we have to make sure the change is visible. + */ + kick_all_cpus_sync(); + } + flush_all(s); for_each_kmem_cache_node(s, node, n) { - if (!n->nr_partial) - continue; - INIT_LIST_HEAD(&discard); for (i = 0; i < SHRINK_PROMOTE_MAX; i++) INIT_LIST_HEAD(promote + i); @@ -3440,7 +3461,7 @@ static int slab_mem_going_offline_callback(void *arg) mutex_lock(&slab_mutex); list_for_each_entry(s, &slab_caches, list) - __kmem_cache_shrink(s); + __kmem_cache_shrink(s, false); mutex_unlock(&slab_mutex); return 0; -- cgit v1.2.3 From 372549c2a3778fd3df445819811c944ad54609ca Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Thu, 12 Feb 2015 14:59:50 -0800 Subject: mm/compaction: fix wrong order check in compact_finished() What we want to check here is whether there is highorder freepage in buddy list of other migratetype in order to steal it without fragmentation. But, current code just checks cc->order which means allocation request order. So, this is wrong. Without this fix, non-movable synchronous compaction below pageblock order would not stopped until compaction is complete, because migratetype of most pageblocks are movable and high order freepage made by compaction is usually on movable type buddy list. There is some report related to this bug. See below link. http://www.spinics.net/lists/linux-mm/msg81666.html Although the issued system still has load spike comes from compaction, this makes that system completely stable and responsive according to his report. stress-highalloc test in mmtests with non movable order 7 allocation doesn't show any notable difference in allocation success rate, but, it shows more compaction success rate. Compaction success rate (Compaction success * 100 / Compaction stalls, %) 18.47 : 28.94 Fixes: 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page immediately when it is made available") Signed-off-by: Joonsoo Kim Acked-by: Vlastimil Babka Reviewed-by: Zhang Yanfei Cc: Mel Gorman Cc: David Rientjes Cc: Rik van Riel Cc: [3.7+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/compaction.c b/mm/compaction.c index b68736c8a1ce..4954e196680c 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1173,7 +1173,7 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc, return COMPACT_PARTIAL; /* Job done if allocation would set block type */ - if (cc->order >= pageblock_order && area->nr_free) + if (order >= pageblock_order && area->nr_free) return COMPACT_PARTIAL; } -- cgit v1.2.3 From 932ff6bbbdcadd85b309ef4fd59d4d8a77329b8b Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Thu, 12 Feb 2015 14:59:53 -0800 Subject: mm/compaction: stop the isolation when we isolate enough freepage Currently, freepage isolation in one pageblock doesn't consider how many freepages we isolate. When I traced flow of compaction, compaction sometimes isolates more than 256 freepages to migrate just 32 pages. In this patch, freepage isolation is stopped at the point that we have more isolated freepage than isolated page for migration. This results in slowing down free page scanner and make compaction success rate higher. stress-highalloc test in mmtests with non movable order 7 allocation shows increase of compaction success rate. Compaction success rate (Compaction success * 100 / Compaction stalls, %) 27.13 : 31.82 pfn where both scanners meets on compaction complete (separate test due to enormous tracepoint buffer) (zone_start=4096, zone_end=1048576) 586034 : 654378 In fact, I didn't fully understand why this patch results in such good result. There was a guess that not used freepages are released to pcp list and on next compaction trial we won't isolate them again so compaction success rate would decrease. To prevent this effect, I tested with adding pcp drain code on release_freepages(), but, it has no good effect. Anyway, this patch reduces waste time to isolate unneeded freepages so seems reasonable. Vlastimil said: : I briefly tried it on top of the pivot-changing series and with order-9 : allocations it reduced free page scanned counter by almost 10%. No effect : on success rates (maybe because pivot changing already took care of the : scanners meeting problem) but the scanning reduction is good on its own. : : It also explains why e14c720efdd7 ("mm, compaction: remember position : within pageblock in free pages scanner") had less than expected : improvements. It would only actually stop within pageblock in case of : async compaction detecting contention. I guess that's also why the : infinite loop problem fixed by 1d5bfe1ffb5b affected so relatively few : people. Signed-off-by: Joonsoo Kim Acked-by: Vlastimil Babka Tested-by: Vlastimil Babka Reviewed-by: Zhang Yanfei Cc: Mel Gorman Cc: David Rientjes Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 4954e196680c..782772df62c8 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -490,6 +490,13 @@ static unsigned long isolate_freepages_block(struct compact_control *cc, /* If a page was split, advance to the end of it */ if (isolated) { + cc->nr_freepages += isolated; + if (!strict && + cc->nr_migratepages <= cc->nr_freepages) { + blockpfn += isolated; + break; + } + blockpfn += isolated - 1; cursor += isolated - 1; continue; @@ -899,7 +906,6 @@ static void isolate_freepages(struct compact_control *cc) unsigned long isolate_start_pfn; /* exact pfn we start at */ unsigned long block_end_pfn; /* end of current pageblock */ unsigned long low_pfn; /* lowest pfn scanner is able to scan */ - int nr_freepages = cc->nr_freepages; struct list_head *freelist = &cc->freepages; /* @@ -924,11 +930,11 @@ static void isolate_freepages(struct compact_control *cc) * pages on cc->migratepages. We stop searching if the migrate * and free page scanners meet or enough free pages are isolated. */ - for (; block_start_pfn >= low_pfn && cc->nr_migratepages > nr_freepages; + for (; block_start_pfn >= low_pfn && + cc->nr_migratepages > cc->nr_freepages; block_end_pfn = block_start_pfn, block_start_pfn -= pageblock_nr_pages, isolate_start_pfn = block_start_pfn) { - unsigned long isolated; /* * This can iterate a massively long zone without finding any @@ -953,9 +959,8 @@ static void isolate_freepages(struct compact_control *cc) continue; /* Found a block suitable for isolating free pages from. */ - isolated = isolate_freepages_block(cc, &isolate_start_pfn, + isolate_freepages_block(cc, &isolate_start_pfn, block_end_pfn, freelist, false); - nr_freepages += isolated; /* * Remember where the free scanner should restart next time, @@ -987,8 +992,6 @@ static void isolate_freepages(struct compact_control *cc) */ if (block_start_pfn < low_pfn) cc->free_pfn = cc->migrate_pfn; - - cc->nr_freepages = nr_freepages; } /* -- cgit v1.2.3 From f48b80a5e22200347e91f96b8b237b24b93c7192 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Thu, 12 Feb 2015 14:59:56 -0800 Subject: memcg: cleanup static keys decrement Move memcg_socket_limit_enabled decrement to tcp_destroy_cgroup (called from memcg_destroy_kmem -> mem_cgroup_sockets_destroy) and zap a bunch of wrapper functions. Although this patch moves static keys decrement from __mem_cgroup_free to mem_cgroup_css_free, it does not introduce any functional changes, because the keys are incremented on setting the limit (tcp or kmem), which can only happen after successful mem_cgroup_css_online. Signed-off-by: Vladimir Davydov Cc: Glauber Costa Cc: KAMEZAWA Hiroyuki Cc: Eric W. Biederman Cc: David S. Miller Cc: Johannes Weiner Acked-by: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/net/sock.h | 5 ----- mm/memcontrol.c | 38 +++++--------------------------------- net/ipv4/tcp_memcontrol.c | 4 ++++ 3 files changed, 9 insertions(+), 38 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index e13824570b0f..ab186b1d31ff 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1077,11 +1077,6 @@ static inline bool memcg_proto_active(struct cg_proto *cg_proto) return test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); } -static inline bool memcg_proto_activated(struct cg_proto *cg_proto) -{ - return test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags); -} - #ifdef SOCK_REFCNT_DEBUG static inline void sk_refcnt_debug_inc(struct sock *sk) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 419c06b1794a..d18d3a6e7337 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -519,16 +519,6 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg) } EXPORT_SYMBOL(tcp_proto_cgroup); -static void disarm_sock_keys(struct mem_cgroup *memcg) -{ - if (!memcg_proto_activated(&memcg->tcp_mem)) - return; - static_key_slow_dec(&memcg_socket_limit_enabled); -} -#else -static void disarm_sock_keys(struct mem_cgroup *memcg) -{ -} #endif #ifdef CONFIG_MEMCG_KMEM @@ -583,28 +573,8 @@ void memcg_put_cache_ids(void) struct static_key memcg_kmem_enabled_key; EXPORT_SYMBOL(memcg_kmem_enabled_key); -static void disarm_kmem_keys(struct mem_cgroup *memcg) -{ - if (memcg->kmem_acct_activated) - static_key_slow_dec(&memcg_kmem_enabled_key); - /* - * This check can't live in kmem destruction function, - * since the charges will outlive the cgroup - */ - WARN_ON(page_counter_read(&memcg->kmem)); -} -#else -static void disarm_kmem_keys(struct mem_cgroup *memcg) -{ -} #endif /* CONFIG_MEMCG_KMEM */ -static void disarm_static_keys(struct mem_cgroup *memcg) -{ - disarm_sock_keys(memcg); - disarm_kmem_keys(memcg); -} - static struct mem_cgroup_per_zone * mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone) { @@ -4092,7 +4062,11 @@ static void memcg_deactivate_kmem(struct mem_cgroup *memcg) static void memcg_destroy_kmem(struct mem_cgroup *memcg) { - memcg_destroy_kmem_caches(memcg); + if (memcg->kmem_acct_activated) { + memcg_destroy_kmem_caches(memcg); + static_key_slow_dec(&memcg_kmem_enabled_key); + WARN_ON(page_counter_read(&memcg->kmem)); + } mem_cgroup_sockets_destroy(memcg); } #else @@ -4523,8 +4497,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) free_mem_cgroup_per_zone_info(memcg, node); free_percpu(memcg->stat); - - disarm_static_keys(memcg); kfree(memcg); } diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c index c2a75c6957a1..2379c1b4efb2 100644 --- a/net/ipv4/tcp_memcontrol.c +++ b/net/ipv4/tcp_memcontrol.c @@ -47,6 +47,10 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg) return; percpu_counter_destroy(&cg_proto->sockets_allocated); + + if (test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags)) + static_key_slow_dec(&memcg_socket_limit_enabled); + } EXPORT_SYMBOL(tcp_destroy_cgroup); -- cgit v1.2.3 From 2d2f5119b8bb057595e18f5b2f07aa097ea1b233 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Thu, 12 Feb 2015 14:59:59 -0800 Subject: mm: do not use mm->nr_pmds on !MMU configurations mm->nr_pmds doesn't make sense on !MMU configurations Signed-off-by: Kirill A. Shutemov Cc: Guenter Roeck Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 9 ++++++++- kernel/fork.c | 4 +--- 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index af4ff88a11e0..bd52e2f14027 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1447,13 +1447,15 @@ static inline int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address); #endif -#ifdef __PAGETABLE_PMD_FOLDED +#if defined(__PAGETABLE_PMD_FOLDED) || !defined(CONFIG_MMU) static inline int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) { return 0; } +static inline void mm_nr_pmds_init(struct mm_struct *mm) {} + static inline unsigned long mm_nr_pmds(struct mm_struct *mm) { return 0; @@ -1465,6 +1467,11 @@ static inline void mm_dec_nr_pmds(struct mm_struct *mm) {} #else int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address); +static inline void mm_nr_pmds_init(struct mm_struct *mm) +{ + atomic_long_set(&mm->nr_pmds, 0); +} + static inline unsigned long mm_nr_pmds(struct mm_struct *mm) { return atomic_long_read(&mm->nr_pmds); diff --git a/kernel/fork.c b/kernel/fork.c index 66e19c251581..cf65139615a0 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -555,9 +555,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p) INIT_LIST_HEAD(&mm->mmlist); mm->core_state = NULL; atomic_long_set(&mm->nr_ptes, 0); -#ifndef __PAGETABLE_PMD_FOLDED - atomic_long_set(&mm->nr_pmds, 0); -#endif + mm_nr_pmds_init(mm); mm->map_count = 0; mm->locked_vm = 0; mm->pinned_vm = 0; -- cgit v1.2.3 From fc5199d1a9c9ed14e22651d0fd3b10c79e7e1f6d Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:00:02 -0800 Subject: mm/internal.h: don't split printk call in two All users of mminit_dprintk pass a compile-time constant as level, so this just makes gcc emit a single printk call instead of two. Signed-off-by: Rasmus Villemoes Cc: Vlastimil Babka Cc: Rik van Riel Cc: Joonsoo Kim Cc: David Rientjes Cc: Vishnu Pratap Singh Cc: Pintu Kumar Cc: Michal Nazarewicz Cc: Mel Gorman Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Tim Chen Cc: Hugh Dickins Cc: Li Zefan Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/internal.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index c4d6c9b43491..a96da5b0029d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -351,8 +351,10 @@ extern int mminit_loglevel; #define mminit_dprintk(level, prefix, fmt, arg...) \ do { \ if (level < mminit_loglevel) { \ - printk(level <= MMINIT_WARNING ? KERN_WARNING : KERN_DEBUG); \ - printk(KERN_CONT "mminit::" prefix " " fmt, ##arg); \ + if (level <= MMINIT_WARNING) \ + printk(KERN_WARNING "mminit::" prefix " " fmt, ##arg); \ + else \ + printk(KERN_DEBUG "mminit::" prefix " " fmt, ##arg); \ } \ } while (0) -- cgit v1.2.3 From 061f67bc4d053e03970a268fca99a55b6859f301 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:00:06 -0800 Subject: mm/page_alloc.c: pull out init code from build_all_zonelists Pulling the code protected by if (system_state == SYSTEM_BOOTING) into its own helper allows us to shrink .text a little. This relies on build_all_zonelists already having a __ref annotation. Add a comment explaining why so one doesn't have to track it down through git log. The real saving comes in 3/5, ("mm/mm_init.c: Mark mminit_verify_zonelist as __init"), where we save about 400 bytes Signed-off-by: Rasmus Villemoes Cc: Vlastimil Babka Cc: Rik van Riel Cc: Joonsoo Kim Cc: David Rientjes Cc: Vishnu Pratap Singh Cc: Pintu Kumar Cc: Michal Nazarewicz Cc: Mel Gorman Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Tim Chen Cc: Hugh Dickins Cc: Li Zefan Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8d52ab18fe0d..375327b8e932 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3871,18 +3871,29 @@ static int __build_all_zonelists(void *data) return 0; } +static noinline void __init +build_all_zonelists_init(void) +{ + __build_all_zonelists(NULL); + mminit_verify_zonelist(); + cpuset_init_current_mems_allowed(); +} + /* * Called with zonelists_mutex held always * unless system_state == SYSTEM_BOOTING. + * + * __ref due to (1) call of __meminit annotated setup_zone_pageset + * [we're only called with non-NULL zone through __meminit paths] and + * (2) call of __init annotated helper build_all_zonelists_init + * [protected by SYSTEM_BOOTING]. */ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone) { set_zonelist_order(); if (system_state == SYSTEM_BOOTING) { - __build_all_zonelists(NULL); - mminit_verify_zonelist(); - cpuset_init_current_mems_allowed(); + build_all_zonelists_init(); } else { #ifdef CONFIG_MEMORY_HOTPLUG if (zone) -- cgit v1.2.3 From 0e2342c709aa568b90cde3387d6e588ca862a0ba Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:00:09 -0800 Subject: mm/mm_init.c: park mminit_verify_zonelist as __init The only caller of mminit_verify_zonelist is build_all_zonelists_init, which is annotated with __init, so it should be safe to also mark the former as __init, saving ~400 bytes of .text. Signed-off-by: Rasmus Villemoes Cc: Vlastimil Babka Cc: Rik van Riel Cc: Joonsoo Kim Cc: David Rientjes Cc: Vishnu Pratap Singh Cc: Pintu Kumar Cc: Michal Nazarewicz Cc: Mel Gorman Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Tim Chen Cc: Hugh Dickins Cc: Li Zefan Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mm_init.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index 4074caf9936b..e17c758b27bf 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -21,7 +21,7 @@ int mminit_loglevel; #endif /* The zonelists are simply reported, validation is manual. */ -void mminit_verify_zonelist(void) +void __init mminit_verify_zonelist(void) { int nid; -- cgit v1.2.3 From 194e81512063e96763025a990f841f5ab24815ee Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:00:12 -0800 Subject: mm/mm_init.c: mark mminit_loglevel __meminitdata mminit_loglevel is only referenced from __init and __meminit functions, so we can mark it __meminitdata. Signed-off-by: Rasmus Villemoes Cc: Vlastimil Babka Cc: Rik van Riel Cc: Joonsoo Kim Cc: David Rientjes Cc: Vishnu Pratap Singh Cc: Pintu Kumar Cc: Michal Nazarewicz Cc: Mel Gorman Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Tim Chen Cc: Hugh Dickins Cc: Li Zefan Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mm_init.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index e17c758b27bf..5f420f7fafa1 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -14,7 +14,7 @@ #include "internal.h" #ifdef CONFIG_DEBUG_MEMORY_INIT -int mminit_loglevel; +int __meminitdata mminit_loglevel; #ifndef SECTIONS_SHIFT #define SECTIONS_SHIFT 0 -- cgit v1.2.3 From 8f4ab07f4bf1b1069c01b7c6758a7d444406996b Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:00:16 -0800 Subject: kernel/cpuset.c: Mark cpuset_init_current_mems_allowed as __init The only caller of cpuset_init_current_mems_allowed is the __init annotated build_all_zonelists_init, so we can also make the former __init. Signed-off-by: Rasmus Villemoes Cc: Vlastimil Babka Cc: Rik van Riel Cc: Joonsoo Kim Cc: David Rientjes Cc: Vishnu Pratap Singh Cc: Pintu Kumar Cc: Michal Nazarewicz Cc: Mel Gorman Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Tim Chen Cc: Hugh Dickins Cc: Li Zefan Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 64b257f6bca2..c54a1dae6c11 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -2400,7 +2400,7 @@ void cpuset_cpus_allowed_fallback(struct task_struct *tsk) */ } -void cpuset_init_current_mems_allowed(void) +void __init cpuset_init_current_mems_allowed(void) { nodes_setall(current->mems_allowed); } -- cgit v1.2.3 From 9cb12d7b4ccaa976f97ce0c5fd0f1b6a83bc2a75 Mon Sep 17 00:00:00 2001 From: Grazvydas Ignotas Date: Thu, 12 Feb 2015 15:00:19 -0800 Subject: mm/memory.c: actually remap enough memory For whatever reason, generic_access_phys() only remaps one page, but actually allows to access arbitrary size. It's quite easy to trigger large reads, like printing out large structure with gdb, which leads to a crash. Fix it by remapping correct size. Fixes: 28b2ee20c7cb ("access_process_vm device memory infrastructure") Signed-off-by: Grazvydas Ignotas Cc: Rik van Riel Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory.c b/mm/memory.c index f7886ab036e7..99275325f303 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3462,7 +3462,7 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr, if (follow_phys(vma, addr, write, &prot, &phys_addr)) return -EINVAL; - maddr = ioremap_prot(phys_addr, PAGE_SIZE, prot); + maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot); if (write) memcpy_toio(maddr + offset, buf, len); else -- cgit v1.2.3 From 84109e15ddc8fe00c832d5228ca2aedf95d13b97 Mon Sep 17 00:00:00 2001 From: Yaowei Bai Date: Thu, 12 Feb 2015 15:00:22 -0800 Subject: mm/page_alloc: fix comment Add a necessary 'leave'. Signed-off-by: Yaowei Bai Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 375327b8e932..cb4758263f6b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -172,7 +172,7 @@ static void __free_pages_ok(struct page *page, unsigned int order); * 1G machine -> (16M dma, 784M normal, 224M high) * NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA * HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL - * HIGHMEM allocation will (224M+784M)/256 of ram reserved in ZONE_DMA + * HIGHMEM allocation will leave (224M+784M)/256 of ram reserved in ZONE_DMA * * TBD: should special case ZONE_DMA32 machines here - in those we normally * don't need any ZONE_NORMAL reservation -- cgit v1.2.3 From 9ab3b598d2dfbdb0153ffa7e4b1456bbff59a25d Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Thu, 12 Feb 2015 15:00:25 -0800 Subject: mm: hwpoison: drop lru_add_drain_all() in __soft_offline_page() A race condition starts to be visible in recent mmotm, where a PG_hwpoison flag is set on a migration source page *before* it's back in buddy page poo= l. This is problematic because no page flag is supposed to be set when freeing (see __free_one_page().) So the user-visible effect of this race is that it could trigger the BUG_ON() when soft-offlining is called. The root cause is that we call lru_add_drain_all() to make sure that the page is in buddy, but that doesn't work because this function just schedule= s a work item and doesn't wait its completion. drain_all_pages() does drainin= g directly, so simply dropping lru_add_drain_all() solves this problem. Fixes: f15bdfa802bf ("mm/memory-failure.c: fix memory leak in successful soft offlining") Signed-off-by: Naoya Horiguchi Cc: Andi Kleen Cc: Tony Luck Cc: Chen Gong Cc: [3.11+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory-failure.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 1a735fad2a13..d487f8dc6d39 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1646,8 +1646,6 @@ static int __soft_offline_page(struct page *page, int flags) * source page should be freed back to buddy before * setting PG_hwpoison. */ - if (!is_free_buddy_page(page)) - lru_add_drain_all(); if (!is_free_buddy_page(page)) drain_all_pages(page_zone(page)); SetPageHWPoison(page); -- cgit v1.2.3 From ff59909a077b3c51c168cb658601c6b63136a347 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Thu, 12 Feb 2015 15:00:28 -0800 Subject: mm: fix negative nr_isolated counts The vmstat interfaces are good at hiding negative counts (at least when CONFIG_SMP); but if you peer behind the curtain, you find that nr_isolated_anon and nr_isolated_file soon go negative, and grow ever more negative: so they can absorb larger and larger numbers of isolated pages, yet still appear to be zero. I'm happy to avoid a congestion_wait() when too_many_isolated() myself; but I guess it's there for a good reason, in which case we ought to get too_many_isolated() working again. The imbalance comes from isolate_migratepages()'s ISOLATE_ABORT case: putback_movable_pages() decrements the NR_ISOLATED counts, but we forgot to call acct_isolated() to increment them. It is possible that the bug whcih this patch fixes could cause OOM kills when the system still has a lot of reclaimable page cache. Fixes: edc2ca612496 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()") Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka Acked-by: Joonsoo Kim Cc: [3.18+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/compaction.c b/mm/compaction.c index 782772df62c8..d50d6de6f1b6 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1103,8 +1103,10 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone, low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn, isolate_mode); - if (!low_pfn || cc->contended) + if (!low_pfn || cc->contended) { + acct_isolated(zone, cc); return ISOLATE_ABORT; + } /* * Either we isolated something and proceed with migration. Or -- cgit v1.2.3 From b8179958327a1f513efca095ba782a1986c7c4fb Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Thu, 12 Feb 2015 15:00:31 -0800 Subject: zram: clean up zram_meta_alloc() A trivial cleanup of zram_meta_alloc() error handling. Signed-off-by: Sergey Senozhatsky Acked-by: Minchan Kim Cc: Nitin Gupta Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/block/zram/zram_drv.c | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index bd8bda386e02..369fe5642799 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -318,31 +318,29 @@ static struct zram_meta *zram_meta_alloc(u64 disksize) { size_t num_pages; struct zram_meta *meta = kmalloc(sizeof(*meta), GFP_KERNEL); + if (!meta) - goto out; + return NULL; num_pages = disksize >> PAGE_SHIFT; meta->table = vzalloc(num_pages * sizeof(*meta->table)); if (!meta->table) { pr_err("Error allocating zram address table\n"); - goto free_meta; + goto out_error; } meta->mem_pool = zs_create_pool(GFP_NOIO | __GFP_HIGHMEM); if (!meta->mem_pool) { pr_err("Error creating memory pool\n"); - goto free_table; + goto out_error; } return meta; -free_table: +out_error: vfree(meta->table); -free_meta: kfree(meta); - meta = NULL; -out: - return meta; + return NULL; } static void update_position(u32 *index, int *offset, struct bio_vec *bvec) -- cgit v1.2.3 From 1fec117281d9f5349c35279c9521f4096fa33357 Mon Sep 17 00:00:00 2001 From: Ganesh Mahendran Date: Thu, 12 Feb 2015 15:00:33 -0800 Subject: zram: free meta table in zram_meta_free zram_meta_alloc() and zram_meta_free() are a pair. In zram_meta_alloc(), meta table is allocated. So it it better to free it in zram_meta_free(). Signed-off-by: Ganesh Mahendran Acked-by: Minchan Kim Acked-by: Sergey Senozhatsky Cc: Nitin Gupta Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/block/zram/zram_drv.c | 33 ++++++++++++++++----------------- 1 file changed, 16 insertions(+), 17 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 369fe5642799..0e07652cf7c1 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -307,8 +307,21 @@ static inline int valid_io_request(struct zram *zram, return 1; } -static void zram_meta_free(struct zram_meta *meta) +static void zram_meta_free(struct zram_meta *meta, u64 disksize) { + size_t num_pages = disksize >> PAGE_SHIFT; + size_t index; + + /* Free all pages that are still in this zram device */ + for (index = 0; index < num_pages; index++) { + unsigned long handle = meta->table[index].handle; + + if (!handle) + continue; + + zs_free(meta->mem_pool, handle); + } + zs_destroy_pool(meta->mem_pool); vfree(meta->table); kfree(meta); @@ -704,9 +717,6 @@ static void zram_bio_discard(struct zram *zram, u32 index, static void zram_reset_device(struct zram *zram, bool reset_capacity) { - size_t index; - struct zram_meta *meta; - down_write(&zram->init_lock); zram->limit_pages = 0; @@ -716,20 +726,9 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity) return; } - meta = zram->meta; - /* Free all pages that are still in this zram device */ - for (index = 0; index < zram->disksize >> PAGE_SHIFT; index++) { - unsigned long handle = meta->table[index].handle; - if (!handle) - continue; - - zs_free(meta->mem_pool, handle); - } - zcomp_destroy(zram->comp); zram->max_comp_streams = 1; - - zram_meta_free(zram->meta); + zram_meta_free(zram->meta, zram->disksize); zram->meta = NULL; /* Reset stats */ memset(&zram->stats, 0, sizeof(zram->stats)); @@ -801,7 +800,7 @@ out_destroy_comp: up_write(&zram->init_lock); zcomp_destroy(comp); out_free_meta: - zram_meta_free(meta); + zram_meta_free(meta, disksize); return err; } -- cgit v1.2.3 From ba6b17d68c8e3aa8d55d0474299cb931965c5ea5 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Thu, 12 Feb 2015 15:00:36 -0800 Subject: zram: fix umount-reset_store-mount race condition Ganesh Mahendran was the first one who proposed to use bdev->bd_mutex to avoid ->bd_holders race condition: CPU0 CPU1 umount /* zram->init_done is true */ reset_store() bdev->bd_holders == 0 mount ... zram_make_request() zram_reset_device() However, his solution required some considerable amount of code movement, which we can avoid. Apart from using bdev->bd_mutex in reset_store(), this patch also simplifies zram_reset_device(). zram_reset_device() has a bool parameter reset_capacity which tells it whether disk capacity and itself disk should be reset. There are two zram_reset_device() callers: -- zram_exit() passes reset_capacity=false -- reset_store() passes reset_capacity=true So we can move reset_capacity-sensitive work out of zram_reset_device() and perform it unconditionally in reset_store(). This also lets us drop reset_capacity parameter from zram_reset_device() and pass zram pointer only. Signed-off-by: Sergey Senozhatsky Reported-by: Ganesh Mahendran Cc: Minchan Kim Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/block/zram/zram_drv.c | 23 +++++++++-------------- 1 file changed, 9 insertions(+), 14 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 0e07652cf7c1..2607bd9f4955 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -715,7 +715,7 @@ static void zram_bio_discard(struct zram *zram, u32 index, } } -static void zram_reset_device(struct zram *zram, bool reset_capacity) +static void zram_reset_device(struct zram *zram) { down_write(&zram->init_lock); @@ -734,18 +734,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity) memset(&zram->stats, 0, sizeof(zram->stats)); zram->disksize = 0; - if (reset_capacity) - set_capacity(zram->disk, 0); - up_write(&zram->init_lock); - - /* - * Revalidate disk out of the init_lock to avoid lockdep splat. - * It's okay because disk's capacity is protected by init_lock - * so that revalidate_disk always sees up-to-date capacity. - */ - if (reset_capacity) - revalidate_disk(zram->disk); } static ssize_t disksize_store(struct device *dev, @@ -818,6 +807,7 @@ static ssize_t reset_store(struct device *dev, if (!bdev) return -ENOMEM; + mutex_lock(&bdev->bd_mutex); /* Do not reset an active device! */ if (bdev->bd_holders) { ret = -EBUSY; @@ -835,12 +825,17 @@ static ssize_t reset_store(struct device *dev, /* Make sure all pending I/O is finished */ fsync_bdev(bdev); + zram_reset_device(zram); + set_capacity(zram->disk, 0); + + mutex_unlock(&bdev->bd_mutex); + revalidate_disk(zram->disk); bdput(bdev); - zram_reset_device(zram, true); return len; out: + mutex_unlock(&bdev->bd_mutex); bdput(bdev); return ret; } @@ -1186,7 +1181,7 @@ static void __exit zram_exit(void) * Shouldn't access zram->disk after destroy_device * because destroy_device already released zram->disk. */ - zram_reset_device(zram, false); + zram_reset_device(zram); } unregister_blkdev(zram_major, "zram"); -- cgit v1.2.3 From a096cafc31862c54da0b56c8441dc14023437008 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Thu, 12 Feb 2015 15:00:39 -0800 Subject: zram: rework reset and destroy path We need to return set_capacity(disk, 0) from reset_store() back to zram_reset_device(), a catch by Ganesh Mahendran. Potentially, we can race set_capacity() calls from init and reset paths. The problem is that zram_reset_device() is also getting called from zram_exit(), which performs operations in misleading reversed order -- we first create_device() and then init it, while zram_exit() perform destroy_device() first and then does zram_reset_device(). This is done to remove sysfs group before we reset device, so we can continue with device reset/destruction not being raced by sysfs attr write (f.e. disksize). Apart from that, destroy_device() releases zram->disk (but we still have ->disk pointer), so we cannot acces zram->disk in later zram_reset_device() call, which may cause additional errors in the future. So, this patch rework and cleanup destroy path. 1) remove several unneeded goto labels in zram_init() 2) factor out zram_init() error path and zram_exit() into destroy_devices() function, which takes the number of devices to destroy as its argument. 3) remove sysfs group in destroy_devices() first, so we can reorder operations -- reset device (as expected) goes before disk destroy and queue cleanup. So we can always access ->disk in zram_reset_device(). 4) and, finally, return set_capacity() back under ->init_lock. [akpm@linux-foundation.org: tweak comment] Signed-off-by: Sergey Senozhatsky Reported-by: Ganesh Mahendran Cc: Minchan Kim Cc: Jerome Marchand Cc: Nitin Gupta Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/block/zram/zram_drv.c | 75 +++++++++++++++++++------------------------ 1 file changed, 33 insertions(+), 42 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 2607bd9f4955..81ac8fd53340 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -732,8 +732,9 @@ static void zram_reset_device(struct zram *zram) zram->meta = NULL; /* Reset stats */ memset(&zram->stats, 0, sizeof(zram->stats)); - zram->disksize = 0; + set_capacity(zram->disk, 0); + up_write(&zram->init_lock); } @@ -826,7 +827,6 @@ static ssize_t reset_store(struct device *dev, /* Make sure all pending I/O is finished */ fsync_bdev(bdev); zram_reset_device(zram); - set_capacity(zram->disk, 0); mutex_unlock(&bdev->bd_mutex); revalidate_disk(zram->disk); @@ -1112,15 +1112,31 @@ out: return ret; } -static void destroy_device(struct zram *zram) +static void destroy_devices(unsigned int nr) { - sysfs_remove_group(&disk_to_dev(zram->disk)->kobj, - &zram_disk_attr_group); + struct zram *zram; + unsigned int i; - del_gendisk(zram->disk); - put_disk(zram->disk); + for (i = 0; i < nr; i++) { + zram = &zram_devices[i]; + /* + * Remove sysfs first, so no one will perform a disksize + * store while we destroy the devices + */ + sysfs_remove_group(&disk_to_dev(zram->disk)->kobj, + &zram_disk_attr_group); - blk_cleanup_queue(zram->queue); + zram_reset_device(zram); + + del_gendisk(zram->disk); + put_disk(zram->disk); + + blk_cleanup_queue(zram->queue); + } + + kfree(zram_devices); + unregister_blkdev(zram_major, "zram"); + pr_info("Destroyed %u device(s)\n", nr); } static int __init zram_init(void) @@ -1130,64 +1146,39 @@ static int __init zram_init(void) if (num_devices > max_num_devices) { pr_warn("Invalid value for num_devices: %u\n", num_devices); - ret = -EINVAL; - goto out; + return -EINVAL; } zram_major = register_blkdev(0, "zram"); if (zram_major <= 0) { pr_warn("Unable to get major number\n"); - ret = -EBUSY; - goto out; + return -EBUSY; } /* Allocate the device array and initialize each one */ zram_devices = kzalloc(num_devices * sizeof(struct zram), GFP_KERNEL); if (!zram_devices) { - ret = -ENOMEM; - goto unregister; + unregister_blkdev(zram_major, "zram"); + return -ENOMEM; } for (dev_id = 0; dev_id < num_devices; dev_id++) { ret = create_device(&zram_devices[dev_id], dev_id); if (ret) - goto free_devices; + goto out_error; } - pr_info("Created %u device(s) ...\n", num_devices); - + pr_info("Created %u device(s)\n", num_devices); return 0; -free_devices: - while (dev_id) - destroy_device(&zram_devices[--dev_id]); - kfree(zram_devices); -unregister: - unregister_blkdev(zram_major, "zram"); -out: +out_error: + destroy_devices(dev_id); return ret; } static void __exit zram_exit(void) { - int i; - struct zram *zram; - - for (i = 0; i < num_devices; i++) { - zram = &zram_devices[i]; - - destroy_device(zram); - /* - * Shouldn't access zram->disk after destroy_device - * because destroy_device already released zram->disk. - */ - zram_reset_device(zram); - } - - unregister_blkdev(zram_major, "zram"); - - kfree(zram_devices); - pr_debug("Cleanup done!\n"); + destroy_devices(num_devices); } module_init(zram_init); -- cgit v1.2.3 From 2b269ce6fcbfafc6cae37254cab4bf2309bfed0e Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Thu, 12 Feb 2015 15:00:42 -0800 Subject: zram: check bd_openers instead of bd_holders bd_holders is increased only when user open the device file as FMODE_EXCL so if something opens zram0 as !FMODE_EXCL and request I/O while another user reset zram0, we can see following warning. zram0: detected capacity change from 0 to 64424509440 Buffer I/O error on dev zram0, logical block 180823, lost async page write Buffer I/O error on dev zram0, logical block 180824, lost async page write Buffer I/O error on dev zram0, logical block 180825, lost async page write Buffer I/O error on dev zram0, logical block 180826, lost async page write Buffer I/O error on dev zram0, logical block 180827, lost async page write Buffer I/O error on dev zram0, logical block 180828, lost async page write Buffer I/O error on dev zram0, logical block 180829, lost async page write Buffer I/O error on dev zram0, logical block 180830, lost async page write Buffer I/O error on dev zram0, logical block 180831, lost async page write Buffer I/O error on dev zram0, logical block 180832, lost async page write ------------[ cut here ]------------ WARNING: CPU: 11 PID: 1996 at fs/block_dev.c:57 __blkdev_put+0x1d7/0x210() Modules linked in: CPU: 11 PID: 1996 Comm: dd Not tainted 3.19.0-rc6-next-20150202+ #1125 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x45/0x57 warn_slowpath_common+0x8a/0xc0 warn_slowpath_null+0x1a/0x20 __blkdev_put+0x1d7/0x210 blkdev_put+0x50/0x130 blkdev_close+0x25/0x30 __fput+0xdf/0x1e0 ____fput+0xe/0x10 task_work_run+0xa7/0xe0 do_notify_resume+0x49/0x60 int_signal+0x12/0x17 ---[ end trace 274fbbc5664827d2 ]--- The warning comes from bdev_write_node in blkdev_put path. static void bdev_write_inode(struct inode *inode) { spin_lock(&inode->i_lock); while (inode->i_state & I_DIRTY) { spin_unlock(&inode->i_lock); WARN_ON_ONCE(write_inode_now(inode, true)); <========= here. spin_lock(&inode->i_lock); } spin_unlock(&inode->i_lock); } The reason is dd process encounters I/O fails due to sudden block device disappear so in filemap_check_errors in __writeback_single_inode returns -EIO. If we check bd_openers instead of bd_holders, we could address the problem. When I see the brd, it already have used it rather than bd_holders so although I'm not a expert of block layer, it seems to be better. I can make following warning with below simple script. In addition, I added msleep(2000) below set_capacity(zram->disk, 0) after applying your patch to make window huge(Kudos to Ganesh!) script: echo $((60<<30)) > /sys/block/zram0/disksize setsid dd if=/dev/zero of=/dev/zram0 & sleep 1 setsid echo 1 > /sys/block/zram0/reset Signed-off-by: Minchan Kim Acked-by: Sergey Senozhatsky Cc: Nitin Gupta Cc: Jerome Marchand Cc: Ganesh Mahendran Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/block/zram/zram_drv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 81ac8fd53340..0abcf4a31c78 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -810,7 +810,7 @@ static ssize_t reset_store(struct device *dev, mutex_lock(&bdev->bd_mutex); /* Do not reset an active device! */ - if (bdev->bd_holders) { + if (bdev->bd_openers) { ret = -EBUSY; goto out; } -- cgit v1.2.3 From 08eee69fcf6baea543a2b4d2a2fcba0e61aa3160 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Thu, 12 Feb 2015 15:00:45 -0800 Subject: zram: remove init_lock in zram_make_request Admin could reset zram during I/O operation going on so we have used zram->init_lock as read-side lock in I/O path to prevent sudden zram meta freeing. However, the init_lock is really troublesome. We can't do call zram_meta_alloc under init_lock due to lockdep splat because zram_rw_page is one of the function under reclaim path and hold it as read_lock while other places in process context hold it as write_lock. So, we have used allocation out of the lock to avoid lockdep warn but it's not good for readability and fainally, I met another lockdep splat between init_lock and cpu_hotplug from kmem_cache_destroy during working zsmalloc compaction. :( Yes, the ideal is to remove horrible init_lock of zram in rw path. This patch removes it in rw path and instead, add atomic refcount for meta lifetime management and completion to free meta in process context. It's important to free meta in process context because some of resource destruction needs mutex lock, which could be held if we releases the resource in reclaim context so it's deadlock, again. As a bonus, we could remove init_done check in rw path because zram_meta_get will do a role for it, instead. Signed-off-by: Sergey Senozhatsky Signed-off-by: Minchan Kim Cc: Nitin Gupta Cc: Jerome Marchand Cc: Ganesh Mahendran Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/block/zram/zram_drv.c | 76 ++++++++++++++++++++++++++++++------------- drivers/block/zram/zram_drv.h | 20 +++++++----- 2 files changed, 64 insertions(+), 32 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 0abcf4a31c78..db94572b35c4 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -53,9 +53,9 @@ static ssize_t name##_show(struct device *d, \ } \ static DEVICE_ATTR_RO(name); -static inline int init_done(struct zram *zram) +static inline bool init_done(struct zram *zram) { - return zram->meta != NULL; + return zram->disksize; } static inline struct zram *dev_to_zram(struct device *dev) @@ -356,6 +356,18 @@ out_error: return NULL; } +static inline bool zram_meta_get(struct zram *zram) +{ + if (atomic_inc_not_zero(&zram->refcount)) + return true; + return false; +} + +static inline void zram_meta_put(struct zram *zram) +{ + atomic_dec(&zram->refcount); +} + static void update_position(u32 *index, int *offset, struct bio_vec *bvec) { if (*offset + bvec->bv_len >= PAGE_SIZE) @@ -717,6 +729,10 @@ static void zram_bio_discard(struct zram *zram, u32 index, static void zram_reset_device(struct zram *zram) { + struct zram_meta *meta; + struct zcomp *comp; + u64 disksize; + down_write(&zram->init_lock); zram->limit_pages = 0; @@ -726,16 +742,31 @@ static void zram_reset_device(struct zram *zram) return; } - zcomp_destroy(zram->comp); - zram->max_comp_streams = 1; - zram_meta_free(zram->meta, zram->disksize); - zram->meta = NULL; + meta = zram->meta; + comp = zram->comp; + disksize = zram->disksize; + /* + * Refcount will go down to 0 eventually and r/w handler + * cannot handle further I/O so it will bail out by + * check zram_meta_get. + */ + zram_meta_put(zram); + /* + * We want to free zram_meta in process context to avoid + * deadlock between reclaim path and any other locks. + */ + wait_event(zram->io_done, atomic_read(&zram->refcount) == 0); + /* Reset stats */ memset(&zram->stats, 0, sizeof(zram->stats)); zram->disksize = 0; + zram->max_comp_streams = 1; set_capacity(zram->disk, 0); up_write(&zram->init_lock); + /* I/O operation under all of CPU are done so let's free */ + zram_meta_free(meta, disksize); + zcomp_destroy(comp); } static ssize_t disksize_store(struct device *dev, @@ -771,6 +802,8 @@ static ssize_t disksize_store(struct device *dev, goto out_destroy_comp; } + init_waitqueue_head(&zram->io_done); + atomic_set(&zram->refcount, 1); zram->meta = meta; zram->comp = comp; zram->disksize = disksize; @@ -901,23 +934,21 @@ static void zram_make_request(struct request_queue *queue, struct bio *bio) { struct zram *zram = queue->queuedata; - down_read(&zram->init_lock); - if (unlikely(!init_done(zram))) + if (unlikely(!zram_meta_get(zram))) goto error; if (!valid_io_request(zram, bio->bi_iter.bi_sector, bio->bi_iter.bi_size)) { atomic64_inc(&zram->stats.invalid_io); - goto error; + goto put_zram; } __zram_make_request(zram, bio); - up_read(&zram->init_lock); - + zram_meta_put(zram); return; - +put_zram: + zram_meta_put(zram); error: - up_read(&zram->init_lock); bio_io_error(bio); } @@ -939,21 +970,19 @@ static void zram_slot_free_notify(struct block_device *bdev, static int zram_rw_page(struct block_device *bdev, sector_t sector, struct page *page, int rw) { - int offset, err; + int offset, err = -EIO; u32 index; struct zram *zram; struct bio_vec bv; zram = bdev->bd_disk->private_data; + if (unlikely(!zram_meta_get(zram))) + goto out; + if (!valid_io_request(zram, sector, PAGE_SIZE)) { atomic64_inc(&zram->stats.invalid_io); - return -EINVAL; - } - - down_read(&zram->init_lock); - if (unlikely(!init_done(zram))) { - err = -EIO; - goto out_unlock; + err = -EINVAL; + goto put_zram; } index = sector >> SECTORS_PER_PAGE_SHIFT; @@ -964,8 +993,9 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector, bv.bv_offset = 0; err = zram_bvec_rw(zram, &bv, index, offset, rw); -out_unlock: - up_read(&zram->init_lock); +put_zram: + zram_meta_put(zram); +out: /* * If I/O fails, just return error(ie, non-zero) without * calling page_endio. diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h index b05a816b09ac..5249f51ccdb3 100644 --- a/drivers/block/zram/zram_drv.h +++ b/drivers/block/zram/zram_drv.h @@ -100,24 +100,26 @@ struct zram_meta { struct zram { struct zram_meta *meta; + struct zcomp *comp; struct request_queue *queue; struct gendisk *disk; - struct zcomp *comp; - - /* Prevent concurrent execution of device init, reset and R/W request */ + /* Prevent concurrent execution of device init */ struct rw_semaphore init_lock; /* - * This is the limit on amount of *uncompressed* worth of data - * we can store in a disk. + * the number of pages zram can consume for storing compressed data */ - u64 disksize; /* bytes */ + unsigned long limit_pages; int max_comp_streams; + struct zram_stats stats; + atomic_t refcount; /* refcount for zram_meta */ + /* wait all IO under all of cpu are done */ + wait_queue_head_t io_done; /* - * the number of pages zram can consume for storing compressed data + * This is the limit on amount of *uncompressed* worth of data + * we can store in a disk. */ - unsigned long limit_pages; - + u64 disksize; /* bytes */ char compressor[10]; }; #endif -- cgit v1.2.3 From ee98016010ae036a5b27300d83bd99ef3fd5776e Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Thu, 12 Feb 2015 15:00:48 -0800 Subject: zram: remove request_queue from struct zram `struct zram' contains both `struct gendisk' and `struct request_queue'. the latter can be deleted, because zram->disk carries ->queue pointer, and ->queue carries zram pointer: create_device() zram->queue->queuedata = zram zram->disk->queue = zram->queue zram->disk->private_data = zram so zram->queue is not needed, we can access all necessary data anyway. Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Jerome Marchand Cc: Nitin Gupta Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/block/zram/zram_drv.c | 16 ++++++++-------- drivers/block/zram/zram_drv.h | 1 - 2 files changed, 8 insertions(+), 9 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index db94572b35c4..eca4b67274c1 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -1061,19 +1061,19 @@ static struct attribute_group zram_disk_attr_group = { static int create_device(struct zram *zram, int device_id) { + struct request_queue *queue; int ret = -ENOMEM; init_rwsem(&zram->init_lock); - zram->queue = blk_alloc_queue(GFP_KERNEL); - if (!zram->queue) { + queue = blk_alloc_queue(GFP_KERNEL); + if (!queue) { pr_err("Error allocating disk queue for device %d\n", device_id); goto out; } - blk_queue_make_request(zram->queue, zram_make_request); - zram->queue->queuedata = zram; + blk_queue_make_request(queue, zram_make_request); /* gendisk structure */ zram->disk = alloc_disk(1); @@ -1086,7 +1086,8 @@ static int create_device(struct zram *zram, int device_id) zram->disk->major = zram_major; zram->disk->first_minor = device_id; zram->disk->fops = &zram_devops; - zram->disk->queue = zram->queue; + zram->disk->queue = queue; + zram->disk->queue->queuedata = zram; zram->disk->private_data = zram; snprintf(zram->disk->disk_name, 16, "zram%d", device_id); @@ -1137,7 +1138,7 @@ out_free_disk: del_gendisk(zram->disk); put_disk(zram->disk); out_free_queue: - blk_cleanup_queue(zram->queue); + blk_cleanup_queue(queue); out: return ret; } @@ -1158,10 +1159,9 @@ static void destroy_devices(unsigned int nr) zram_reset_device(zram); + blk_cleanup_queue(zram->disk->queue); del_gendisk(zram->disk); put_disk(zram->disk); - - blk_cleanup_queue(zram->queue); } kfree(zram_devices); diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h index 5249f51ccdb3..17056e589146 100644 --- a/drivers/block/zram/zram_drv.h +++ b/drivers/block/zram/zram_drv.h @@ -101,7 +101,6 @@ struct zram_meta { struct zram { struct zram_meta *meta; struct zcomp *comp; - struct request_queue *queue; struct gendisk *disk; /* Prevent concurrent execution of device init */ struct rw_semaphore init_lock; -- cgit v1.2.3 From 3eba0c6a56c04f2b017b43641a821f1ebfb7fb4c Mon Sep 17 00:00:00 2001 From: Ganesh Mahendran Date: Thu, 12 Feb 2015 15:00:51 -0800 Subject: mm/zpool: add name argument to create zpool Currently the underlay of zpool: zsmalloc/zbud, do not know who creates them. There is not a method to let zsmalloc/zbud find which caller they belong to. Now we want to add statistics collection in zsmalloc. We need to name the debugfs dir for each pool created. The way suggested by Minchan Kim is to use a name passed by caller(such as zram) to create the zsmalloc pool. /sys/kernel/debug/zsmalloc/zram0 This patch adds an argument `name' to zs_create_pool() and other related functions. Signed-off-by: Ganesh Mahendran Acked-by: Minchan Kim Cc: Seth Jennings Cc: Nitin Gupta Cc: Dan Streetman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/block/zram/zram_drv.c | 8 +++++--- include/linux/zpool.h | 5 +++-- include/linux/zsmalloc.h | 2 +- mm/zbud.c | 3 ++- mm/zpool.c | 6 ++++-- mm/zsmalloc.c | 6 +++--- mm/zswap.c | 5 +++-- 7 files changed, 21 insertions(+), 14 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index eca4b67274c1..8e233edd7a09 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -327,9 +327,10 @@ static void zram_meta_free(struct zram_meta *meta, u64 disksize) kfree(meta); } -static struct zram_meta *zram_meta_alloc(u64 disksize) +static struct zram_meta *zram_meta_alloc(int device_id, u64 disksize) { size_t num_pages; + char pool_name[8]; struct zram_meta *meta = kmalloc(sizeof(*meta), GFP_KERNEL); if (!meta) @@ -342,7 +343,8 @@ static struct zram_meta *zram_meta_alloc(u64 disksize) goto out_error; } - meta->mem_pool = zs_create_pool(GFP_NOIO | __GFP_HIGHMEM); + snprintf(pool_name, sizeof(pool_name), "zram%d", device_id); + meta->mem_pool = zs_create_pool(pool_name, GFP_NOIO | __GFP_HIGHMEM); if (!meta->mem_pool) { pr_err("Error creating memory pool\n"); goto out_error; @@ -783,7 +785,7 @@ static ssize_t disksize_store(struct device *dev, return -EINVAL; disksize = PAGE_ALIGN(disksize); - meta = zram_meta_alloc(disksize); + meta = zram_meta_alloc(zram->disk->first_minor, disksize); if (!meta) return -ENOMEM; diff --git a/include/linux/zpool.h b/include/linux/zpool.h index f14bd75f08b3..56529b34dc63 100644 --- a/include/linux/zpool.h +++ b/include/linux/zpool.h @@ -36,7 +36,8 @@ enum zpool_mapmode { ZPOOL_MM_DEFAULT = ZPOOL_MM_RW }; -struct zpool *zpool_create_pool(char *type, gfp_t gfp, struct zpool_ops *ops); +struct zpool *zpool_create_pool(char *type, char *name, + gfp_t gfp, struct zpool_ops *ops); char *zpool_get_type(struct zpool *pool); @@ -80,7 +81,7 @@ struct zpool_driver { atomic_t refcount; struct list_head list; - void *(*create)(gfp_t gfp, struct zpool_ops *ops); + void *(*create)(char *name, gfp_t gfp, struct zpool_ops *ops); void (*destroy)(void *pool); int (*malloc)(void *pool, size_t size, gfp_t gfp, diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h index 05c214760977..3283c6a55425 100644 --- a/include/linux/zsmalloc.h +++ b/include/linux/zsmalloc.h @@ -36,7 +36,7 @@ enum zs_mapmode { struct zs_pool; -struct zs_pool *zs_create_pool(gfp_t flags); +struct zs_pool *zs_create_pool(char *name, gfp_t flags); void zs_destroy_pool(struct zs_pool *pool); unsigned long zs_malloc(struct zs_pool *pool, size_t size); diff --git a/mm/zbud.c b/mm/zbud.c index 4e387bea702e..2ee4e4520493 100644 --- a/mm/zbud.c +++ b/mm/zbud.c @@ -130,7 +130,8 @@ static struct zbud_ops zbud_zpool_ops = { .evict = zbud_zpool_evict }; -static void *zbud_zpool_create(gfp_t gfp, struct zpool_ops *zpool_ops) +static void *zbud_zpool_create(char *name, gfp_t gfp, + struct zpool_ops *zpool_ops) { return zbud_create_pool(gfp, zpool_ops ? &zbud_zpool_ops : NULL); } diff --git a/mm/zpool.c b/mm/zpool.c index 739cdf0d183a..bacdab6e47de 100644 --- a/mm/zpool.c +++ b/mm/zpool.c @@ -129,6 +129,7 @@ static void zpool_put_driver(struct zpool_driver *driver) /** * zpool_create_pool() - Create a new zpool * @type The type of the zpool to create (e.g. zbud, zsmalloc) + * @name The name of the zpool (e.g. zram0, zswap) * @gfp The GFP flags to use when allocating the pool. * @ops The optional ops callback. * @@ -140,7 +141,8 @@ static void zpool_put_driver(struct zpool_driver *driver) * * Returns: New zpool on success, NULL on failure. */ -struct zpool *zpool_create_pool(char *type, gfp_t gfp, struct zpool_ops *ops) +struct zpool *zpool_create_pool(char *type, char *name, gfp_t gfp, + struct zpool_ops *ops) { struct zpool_driver *driver; struct zpool *zpool; @@ -168,7 +170,7 @@ struct zpool *zpool_create_pool(char *type, gfp_t gfp, struct zpool_ops *ops) zpool->type = driver->type; zpool->driver = driver; - zpool->pool = driver->create(gfp, ops); + zpool->pool = driver->create(name, gfp, ops); zpool->ops = ops; if (!zpool->pool) { diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index b72403927aa4..2359e61b02bf 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -246,9 +246,9 @@ struct mapping_area { #ifdef CONFIG_ZPOOL -static void *zs_zpool_create(gfp_t gfp, struct zpool_ops *zpool_ops) +static void *zs_zpool_create(char *name, gfp_t gfp, struct zpool_ops *zpool_ops) { - return zs_create_pool(gfp); + return zs_create_pool(name, gfp); } static void zs_zpool_destroy(void *pool) @@ -1148,7 +1148,7 @@ EXPORT_SYMBOL_GPL(zs_free); * On success, a pointer to the newly created pool is returned, * otherwise NULL. */ -struct zs_pool *zs_create_pool(gfp_t flags) +struct zs_pool *zs_create_pool(char *name, gfp_t flags) { int i; struct zs_pool *pool; diff --git a/mm/zswap.c b/mm/zswap.c index 0cfce9bc51e4..4249e82ff934 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -906,11 +906,12 @@ static int __init init_zswap(void) pr_info("loading zswap\n"); - zswap_pool = zpool_create_pool(zswap_zpool_type, gfp, &zswap_zpool_ops); + zswap_pool = zpool_create_pool(zswap_zpool_type, "zswap", gfp, + &zswap_zpool_ops); if (!zswap_pool && strcmp(zswap_zpool_type, ZSWAP_ZPOOL_DEFAULT)) { pr_info("%s zpool not available\n", zswap_zpool_type); zswap_zpool_type = ZSWAP_ZPOOL_DEFAULT; - zswap_pool = zpool_create_pool(zswap_zpool_type, gfp, + zswap_pool = zpool_create_pool(zswap_zpool_type, "zswap", gfp, &zswap_zpool_ops); } if (!zswap_pool) { -- cgit v1.2.3 From 0f050d997e275cf0e47ddc7006284eaa3c6fe049 Mon Sep 17 00:00:00 2001 From: Ganesh Mahendran Date: Thu, 12 Feb 2015 15:00:54 -0800 Subject: mm/zsmalloc: add statistics support Keeping fragmentation of zsmalloc in a low level is our target. But now we still need to add the debug code in zsmalloc to get the quantitative data. This patch adds a new configuration CONFIG_ZSMALLOC_STAT to enable the statistics collection for developers. Currently only the objects statatitics in each class are collected. User can get the information via debugfs. cat /sys/kernel/debug/zsmalloc/zram0/... For example: After I copied "jdk-8u25-linux-x64.tar.gz" to zram with ext4 filesystem: class size obj_allocated obj_used pages_used 0 32 0 0 0 1 48 256 12 3 2 64 64 14 1 3 80 51 7 1 4 96 128 5 3 5 112 73 5 2 6 128 32 4 1 7 144 0 0 0 8 160 0 0 0 9 176 0 0 0 10 192 0 0 0 11 208 0 0 0 12 224 0 0 0 13 240 0 0 0 14 256 16 1 1 15 272 15 9 1 16 288 0 0 0 17 304 0 0 0 18 320 0 0 0 19 336 0 0 0 20 352 0 0 0 21 368 0 0 0 22 384 0 0 0 23 400 0 0 0 24 416 0 0 0 25 432 0 0 0 26 448 0 0 0 27 464 0 0 0 28 480 0 0 0 29 496 33 1 4 30 512 0 0 0 31 528 0 0 0 32 544 0 0 0 33 560 0 0 0 34 576 0 0 0 35 592 0 0 0 36 608 0 0 0 37 624 0 0 0 38 640 0 0 0 40 672 0 0 0 42 704 0 0 0 43 720 17 1 3 44 736 0 0 0 46 768 0 0 0 49 816 0 0 0 51 848 0 0 0 52 864 14 1 3 54 896 0 0 0 57 944 13 1 3 58 960 0 0 0 62 1024 4 1 1 66 1088 15 2 4 67 1104 0 0 0 71 1168 0 0 0 74 1216 0 0 0 76 1248 0 0 0 83 1360 3 1 1 91 1488 11 1 4 94 1536 0 0 0 100 1632 5 1 2 107 1744 0 0 0 111 1808 9 1 4 126 2048 4 4 2 144 2336 7 3 4 151 2448 0 0 0 168 2720 15 15 10 190 3072 28 27 21 202 3264 0 0 0 254 4096 36209 36209 36209 Total 37022 36326 36288 We can calculate the overall fragentation by the last line: Total 37022 36326 36288 (37022 - 36326) / 37022 = 1.87% Also by analysing objects alocated in every class we know why we got so low fragmentation: Most of the allocated objects is in . And there is only 1 page in class 254 zspage. So, No fragmentation will be introduced by allocating objs in class 254. And in future, we can collect other zsmalloc statistics as we need and analyse them. Signed-off-by: Ganesh Mahendran Suggested-by: Minchan Kim Acked-by: Minchan Kim Cc: Nitin Gupta Cc: Seth Jennings Cc: Dan Streetman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/Kconfig | 10 +++ mm/zsmalloc.c | 233 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 239 insertions(+), 4 deletions(-) diff --git a/mm/Kconfig b/mm/Kconfig index 4395b12869c8..de5239c152f9 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -602,6 +602,16 @@ config PGTABLE_MAPPING You can check speed with zsmalloc benchmark: https://github.com/spartacus06/zsmapbench +config ZSMALLOC_STAT + bool "Export zsmalloc statistics" + depends on ZSMALLOC + select DEBUG_FS + help + This option enables code in the zsmalloc to collect various + statistics about whats happening in zsmalloc and exports that + information to userspace via debugfs. + If unsure, say N. + config GENERIC_EARLY_IOREMAP bool diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 2359e61b02bf..0dec1fa5f656 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -91,6 +91,7 @@ #include #include #include +#include #include #include @@ -168,6 +169,22 @@ enum fullness_group { ZS_FULL }; +enum zs_stat_type { + OBJ_ALLOCATED, + OBJ_USED, + NR_ZS_STAT_TYPE, +}; + +#ifdef CONFIG_ZSMALLOC_STAT + +static struct dentry *zs_stat_root; + +struct zs_size_stat { + unsigned long objs[NR_ZS_STAT_TYPE]; +}; + +#endif + /* * number of size_classes */ @@ -200,6 +217,10 @@ struct size_class { /* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */ int pages_per_zspage; +#ifdef CONFIG_ZSMALLOC_STAT + struct zs_size_stat stats; +#endif + spinlock_t lock; struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS]; @@ -217,10 +238,16 @@ struct link_free { }; struct zs_pool { + char *name; + struct size_class **size_class; gfp_t flags; /* allocation flags used when growing pool */ atomic_long_t pages_allocated; + +#ifdef CONFIG_ZSMALLOC_STAT + struct dentry *stat_dentry; +#endif }; /* @@ -942,6 +969,166 @@ static bool can_merge(struct size_class *prev, int size, int pages_per_zspage) return true; } +#ifdef CONFIG_ZSMALLOC_STAT + +static inline void zs_stat_inc(struct size_class *class, + enum zs_stat_type type, unsigned long cnt) +{ + class->stats.objs[type] += cnt; +} + +static inline void zs_stat_dec(struct size_class *class, + enum zs_stat_type type, unsigned long cnt) +{ + class->stats.objs[type] -= cnt; +} + +static inline unsigned long zs_stat_get(struct size_class *class, + enum zs_stat_type type) +{ + return class->stats.objs[type]; +} + +static int __init zs_stat_init(void) +{ + if (!debugfs_initialized()) + return -ENODEV; + + zs_stat_root = debugfs_create_dir("zsmalloc", NULL); + if (!zs_stat_root) + return -ENOMEM; + + return 0; +} + +static void __exit zs_stat_exit(void) +{ + debugfs_remove_recursive(zs_stat_root); +} + +static int zs_stats_size_show(struct seq_file *s, void *v) +{ + int i; + struct zs_pool *pool = s->private; + struct size_class *class; + int objs_per_zspage; + unsigned long obj_allocated, obj_used, pages_used; + unsigned long total_objs = 0, total_used_objs = 0, total_pages = 0; + + seq_printf(s, " %5s %5s %13s %10s %10s\n", "class", "size", + "obj_allocated", "obj_used", "pages_used"); + + for (i = 0; i < zs_size_classes; i++) { + class = pool->size_class[i]; + + if (class->index != i) + continue; + + spin_lock(&class->lock); + obj_allocated = zs_stat_get(class, OBJ_ALLOCATED); + obj_used = zs_stat_get(class, OBJ_USED); + spin_unlock(&class->lock); + + objs_per_zspage = get_maxobj_per_zspage(class->size, + class->pages_per_zspage); + pages_used = obj_allocated / objs_per_zspage * + class->pages_per_zspage; + + seq_printf(s, " %5u %5u %10lu %10lu %10lu\n", i, + class->size, obj_allocated, obj_used, pages_used); + + total_objs += obj_allocated; + total_used_objs += obj_used; + total_pages += pages_used; + } + + seq_puts(s, "\n"); + seq_printf(s, " %5s %5s %10lu %10lu %10lu\n", "Total", "", + total_objs, total_used_objs, total_pages); + + return 0; +} + +static int zs_stats_size_open(struct inode *inode, struct file *file) +{ + return single_open(file, zs_stats_size_show, inode->i_private); +} + +static const struct file_operations zs_stat_size_ops = { + .open = zs_stats_size_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static int zs_pool_stat_create(char *name, struct zs_pool *pool) +{ + struct dentry *entry; + + if (!zs_stat_root) + return -ENODEV; + + entry = debugfs_create_dir(name, zs_stat_root); + if (!entry) { + pr_warn("debugfs dir <%s> creation failed\n", name); + return -ENOMEM; + } + pool->stat_dentry = entry; + + entry = debugfs_create_file("obj_in_classes", S_IFREG | S_IRUGO, + pool->stat_dentry, pool, &zs_stat_size_ops); + if (!entry) { + pr_warn("%s: debugfs file entry <%s> creation failed\n", + name, "obj_in_classes"); + return -ENOMEM; + } + + return 0; +} + +static void zs_pool_stat_destroy(struct zs_pool *pool) +{ + debugfs_remove_recursive(pool->stat_dentry); +} + +#else /* CONFIG_ZSMALLOC_STAT */ + +static inline void zs_stat_inc(struct size_class *class, + enum zs_stat_type type, unsigned long cnt) +{ +} + +static inline void zs_stat_dec(struct size_class *class, + enum zs_stat_type type, unsigned long cnt) +{ +} + +static inline unsigned long zs_stat_get(struct size_class *class, + enum zs_stat_type type) +{ + return 0; +} + +static int __init zs_stat_init(void) +{ + return 0; +} + +static void __exit zs_stat_exit(void) +{ +} + +static inline int zs_pool_stat_create(char *name, struct zs_pool *pool) +{ + return 0; +} + +static inline void zs_pool_stat_destroy(struct zs_pool *pool) +{ +} + +#endif + unsigned long zs_get_total_pages(struct zs_pool *pool) { return atomic_long_read(&pool->pages_allocated); @@ -1074,7 +1261,10 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size) set_zspage_mapping(first_page, class->index, ZS_EMPTY); atomic_long_add(class->pages_per_zspage, &pool->pages_allocated); + spin_lock(&class->lock); + zs_stat_inc(class, OBJ_ALLOCATED, get_maxobj_per_zspage( + class->size, class->pages_per_zspage)); } obj = (unsigned long)first_page->freelist; @@ -1088,6 +1278,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size) kunmap_atomic(vaddr); first_page->inuse++; + zs_stat_inc(class, OBJ_USED, 1); /* Now move the zspage to another fullness group, if required */ fix_fullness_group(pool, first_page); spin_unlock(&class->lock); @@ -1128,6 +1319,12 @@ void zs_free(struct zs_pool *pool, unsigned long obj) first_page->inuse--; fullness = fix_fullness_group(pool, first_page); + + zs_stat_dec(class, OBJ_USED, 1); + if (fullness == ZS_EMPTY) + zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage( + class->size, class->pages_per_zspage)); + spin_unlock(&class->lock); if (fullness == ZS_EMPTY) { @@ -1158,9 +1355,16 @@ struct zs_pool *zs_create_pool(char *name, gfp_t flags) if (!pool) return NULL; + pool->name = kstrdup(name, GFP_KERNEL); + if (!pool->name) { + kfree(pool); + return NULL; + } + pool->size_class = kcalloc(zs_size_classes, sizeof(struct size_class *), GFP_KERNEL); if (!pool->size_class) { + kfree(pool->name); kfree(pool); return NULL; } @@ -1210,6 +1414,9 @@ struct zs_pool *zs_create_pool(char *name, gfp_t flags) pool->flags = flags; + if (zs_pool_stat_create(name, pool)) + goto err; + return pool; err: @@ -1222,6 +1429,8 @@ void zs_destroy_pool(struct zs_pool *pool) { int i; + zs_pool_stat_destroy(pool); + for (i = 0; i < zs_size_classes; i++) { int fg; struct size_class *class = pool->size_class[i]; @@ -1242,6 +1451,7 @@ void zs_destroy_pool(struct zs_pool *pool) } kfree(pool->size_class); + kfree(pool->name); kfree(pool); } EXPORT_SYMBOL_GPL(zs_destroy_pool); @@ -1250,17 +1460,30 @@ static int __init zs_init(void) { int ret = zs_register_cpu_notifier(); - if (ret) { - zs_unregister_cpu_notifier(); - return ret; - } + if (ret) + goto notifier_fail; init_zs_size_classes(); #ifdef CONFIG_ZPOOL zpool_register_driver(&zs_zpool_driver); #endif + + ret = zs_stat_init(); + if (ret) { + pr_err("zs stat initialization failed\n"); + goto stat_fail; + } return 0; + +stat_fail: +#ifdef CONFIG_ZPOOL + zpool_unregister_driver(&zs_zpool_driver); +#endif +notifier_fail: + zs_unregister_cpu_notifier(); + + return ret; } static void __exit zs_exit(void) @@ -1269,6 +1492,8 @@ static void __exit zs_exit(void) zpool_unregister_driver(&zs_zpool_driver); #endif zs_unregister_cpu_notifier(); + + zs_stat_exit(); } module_init(zs_init); -- cgit v1.2.3 From d170cf4674768db22b66a1be4be7877d3b2c3874 Mon Sep 17 00:00:00 2001 From: Rickard Strandqvist Date: Thu, 12 Feb 2015 15:00:57 -0800 Subject: arch/frv/mm/extable.c: remove unused function Remove the function search_one_table() that is not used anywhere. This was partially found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist Cc: David Howells Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/frv/mm/extable.c | 23 ----------------------- 1 file changed, 23 deletions(-) diff --git a/arch/frv/mm/extable.c b/arch/frv/mm/extable.c index 2fb9b3ab57b9..8863d6c1df6e 100644 --- a/arch/frv/mm/extable.c +++ b/arch/frv/mm/extable.c @@ -10,29 +10,6 @@ extern const void __memset_end, __memset_user_error_lr, __memset_user_error_hand extern const void __memcpy_end, __memcpy_user_error_lr, __memcpy_user_error_handler; extern spinlock_t modlist_lock; -/*****************************************************************************/ -/* - * - */ -static inline unsigned long search_one_table(const struct exception_table_entry *first, - const struct exception_table_entry *last, - unsigned long value) -{ - while (first <= last) { - const struct exception_table_entry __attribute__((aligned(8))) *mid; - long diff; - - mid = (last - first) / 2 + first; - diff = mid->insn - value; - if (diff == 0) - return mid->fixup; - else if (diff < 0) - first = mid + 1; - else - last = mid - 1; - } - return 0; -} /* end search_one_table() */ /*****************************************************************************/ /* -- cgit v1.2.3 From 695f055936938c674473ea071ca7359a863551e7 Mon Sep 17 00:00:00 2001 From: Petr Cermak Date: Thu, 12 Feb 2015 15:01:00 -0800 Subject: fs/proc/task_mmu.c: add user-space support for resetting mm->hiwater_rss (peak RSS) Peak resident size of a process can be reset back to the process's current rss value by writing "5" to /proc/pid/clear_refs. The driving use-case for this would be getting the peak RSS value, which can be retrieved from the VmHWM field in /proc/pid/status, per benchmark iteration or test scenario. [akpm@linux-foundation.org: clarify behaviour in documentation] Signed-off-by: Petr Cermak Cc: Bjorn Helgaas Cc: Primiano Tucci Cc: Petr Cermak Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 4 ++++ fs/proc/task_mmu.c | 14 ++++++++++++++ include/linux/mm.h | 5 +++++ 3 files changed, 23 insertions(+) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index cf8fc2f0b34b..0b3448dba9ec 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -489,6 +489,10 @@ To clear the bits for the file mapped pages associated with the process To clear the soft-dirty bit > echo 4 > /proc/PID/clear_refs +To reset the peak resident set size ("high water mark") to the process's +current value: + > echo 5 > /proc/PID/clear_refs + Any other value written to /proc/PID/clear_refs will have no effect. The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 0e36c1e49fe3..1359a911d194 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -732,6 +732,7 @@ enum clear_refs_types { CLEAR_REFS_ANON, CLEAR_REFS_MAPPED, CLEAR_REFS_SOFT_DIRTY, + CLEAR_REFS_MM_HIWATER_RSS, CLEAR_REFS_LAST, }; @@ -907,6 +908,18 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, .mm = mm, .private = &cp, }; + + if (type == CLEAR_REFS_MM_HIWATER_RSS) { + /* + * Writing 5 to /proc/pid/clear_refs resets the peak + * resident set size to this mm's current rss value. + */ + down_write(&mm->mmap_sem); + reset_mm_hiwater_rss(mm); + up_write(&mm->mmap_sem); + goto out_mm; + } + down_read(&mm->mmap_sem); if (type == CLEAR_REFS_SOFT_DIRTY) { for (vma = mm->mmap; vma; vma = vma->vm_next) { @@ -928,6 +941,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, mmu_notifier_invalidate_range_end(mm, 0, -1); flush_tlb_mm(mm); up_read(&mm->mmap_sem); +out_mm: mmput(mm); } put_task_struct(task); diff --git a/include/linux/mm.h b/include/linux/mm.h index bd52e2f14027..9bee7ec0c31f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1408,6 +1408,11 @@ static inline void update_hiwater_vm(struct mm_struct *mm) mm->hiwater_vm = mm->total_vm; } +static inline void reset_mm_hiwater_rss(struct mm_struct *mm) +{ + mm->hiwater_rss = get_mm_rss(mm); +} + static inline void setmax_mm_hiwater_rss(unsigned long *maxrss, struct mm_struct *mm) { -- cgit v1.2.3 From 6bee55f94f971ccf87c8c95c10fa1047ade0f122 Mon Sep 17 00:00:00 2001 From: Alexander Kuleshov Date: Thu, 12 Feb 2015 15:01:03 -0800 Subject: fs: proc: use PDE() to get proc_dir_entry Use the PDE() helper to get proc_dir_entry instead of coding it directly. Signed-off-by: Alexander Kuleshov Acked-by: Nicolas Dichtel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/generic.c | 2 +- fs/proc/inode.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/proc/generic.c b/fs/proc/generic.c index 7fea13229f33..de14e46fd807 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -122,7 +122,7 @@ static int proc_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat) { struct inode *inode = dentry->d_inode; - struct proc_dir_entry *de = PROC_I(inode)->pde; + struct proc_dir_entry *de = PDE(inode); if (de && de->nlink) set_nlink(inode, de->nlink); diff --git a/fs/proc/inode.c b/fs/proc/inode.c index 8420a2f80811..13a50a32652d 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -40,7 +40,7 @@ static void proc_evict_inode(struct inode *inode) put_pid(PROC_I(inode)->pid); /* Let go of any associated proc directory entry */ - de = PROC_I(inode)->pde; + de = PDE(inode); if (de) pde_put(de); head = PROC_I(inode)->sysctl; -- cgit v1.2.3 From 0c3697118bb4f0991b11dafea038e4457813cae0 Mon Sep 17 00:00:00 2001 From: Rafael Aquini Date: Thu, 12 Feb 2015 15:01:05 -0800 Subject: Documentation/filesystems/proc.txt: add /proc/pid/numa_maps interface explanation snippet Add a small section to proc.txt doc in order to document its /proc/pid/numa_maps interface. It does not introduce any functional changes, just documentation. Signed-off-by: Rafael Aquini Cc: Johannes Weiner Cc: Dave Hansen Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 0b3448dba9ec..ff3eb2380831 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -145,6 +145,8 @@ Table 1-1: Process specific entries in /proc stack Report full stack trace, enable via CONFIG_STACKTRACE smaps a extension based on maps, showing the memory consumption of each mapping and flags associated with it + numa_maps an extension based on maps, showing the memory locality and + binding policy as well as mem usage (in pages) of each mapping. .............................................................................. For example, to get the status information of a process, all you have to do is @@ -499,6 +501,37 @@ The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags using /proc/kpageflags and number of times a page is mapped using /proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt. +The /proc/pid/numa_maps is an extension based on maps, showing the memory +locality and binding policy, as well as the memory usage (in pages) of +each mapping. The output follows a general format where mapping details get +summarized separated by blank spaces, one mapping per each file line: + +address policy mapping details + +00400000 default file=/usr/local/bin/app kernelpagesize_kB=4 mapped=1 active=0 N3=1 +00600000 default file=/usr/local/bin/app kernelpagesize_kB=4 anon=1 dirty=1 N3=1 +3206000000 default file=/lib64/ld-2.12.so kernelpagesize_kB=4 mapped=26 mapmax=6 N0=24 N3=2 +320621f000 default file=/lib64/ld-2.12.so kernelpagesize_kB=4 anon=1 dirty=1 N3=1 +3206220000 default file=/lib64/ld-2.12.so kernelpagesize_kB=4 anon=1 dirty=1 N3=1 +3206221000 default kernelpagesize_kB=4 anon=1 dirty=1 N3=1 +3206800000 default file=/lib64/libc-2.12.so kernelpagesize_kB=4 mapped=59 mapmax=21 active=55 N0=41 N3=18 +320698b000 default file=/lib64/libc-2.12.so +3206b8a000 default file=/lib64/libc-2.12.so kernelpagesize_kB=4 anon=2 dirty=2 N3=2 +3206b8e000 default file=/lib64/libc-2.12.so kernelpagesize_kB=4 anon=1 dirty=1 N3=1 +3206b8f000 default kernelpagesize_kB=4 anon=3 dirty=3 active=1 N3=3 +7f4dc10a2000 default kernelpagesize_kB=4 anon=3 dirty=3 N3=3 +7f4dc10b4000 default kernelpagesize_kB=4 anon=2 dirty=2 active=1 N3=2 +7f4dc1200000 default file=/anon_hugepage\040(deleted) huge kernelpagesize_kB=2048 anon=1 dirty=1 N3=1 +7fff335f0000 default stack kernelpagesize_kB=4 anon=3 dirty=3 N3=3 +7fff3369d000 default kernelpagesize_kB=4 mapped=1 mapmax=35 active=0 N3=1 + +Where: +"address" is the starting address for the mapping; +"policy" reports the NUMA memory policy set for the mapping (see vm/numa_memory_policy.txt); +"mapping details" summarizes mapping data such as mapping type, page usage counters, +node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page +size, in KB, that is backing the mapping up. + 1.2 Kernel data --------------- -- cgit v1.2.3 From 198d1597cc5a12d04af18b69338a5b1d66ee7020 Mon Sep 17 00:00:00 2001 From: Rafael Aquini Date: Thu, 12 Feb 2015 15:01:08 -0800 Subject: fs: proc: task_mmu: show page size in /proc//numa_maps The output of /proc/$pid/numa_maps is in terms of number of pages like anon=22 or dirty=54. Here's some output: 7f4680000000 default file=/hugetlb/bigfile anon=50 dirty=50 N0=50 7f7659600000 default file=/anon_hugepage\040(deleted) anon=50 dirty=50 N0=50 7fff8d425000 default stack anon=50 dirty=50 N0=50 Looks like we have a stack and a couple of anonymous hugetlbfs areas page which both use the same amount of memory. They don't. The 'bigfile' uses 1GB pages and takes up ~50GB of space. The anon_hugepage uses 2MB pages and takes up ~100MB of space while the stack uses normal 4k pages. You can go over to smaps to figure out what the page size _really_ is with KernelPageSize or MMUPageSize. But, I think this is a pretty nasty and counterintuitive interface as it stands. This patch introduces 'kernelpagesize_kB' line element to /proc//numa_maps report file in order to help identifying the size of pages that are backing memory areas mapped by a given task. This is specially useful to help differentiating between HUGE and GIGANTIC page backed VMAs. This patch is based on Dave Hansen's proposal and reviewer's follow-ups taken from the following dicussion threads: * https://lkml.org/lkml/2011/9/21/454 * https://lkml.org/lkml/2014/12/20/66 Signed-off-by: Rafael Aquini Cc: Johannes Weiner Cc: Dave Hansen Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 30 +++++++++++++++--------------- fs/proc/task_mmu.c | 2 ++ 2 files changed, 17 insertions(+), 15 deletions(-) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index ff3eb2380831..a07ba61662ed 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -508,22 +508,22 @@ summarized separated by blank spaces, one mapping per each file line: address policy mapping details -00400000 default file=/usr/local/bin/app kernelpagesize_kB=4 mapped=1 active=0 N3=1 -00600000 default file=/usr/local/bin/app kernelpagesize_kB=4 anon=1 dirty=1 N3=1 -3206000000 default file=/lib64/ld-2.12.so kernelpagesize_kB=4 mapped=26 mapmax=6 N0=24 N3=2 -320621f000 default file=/lib64/ld-2.12.so kernelpagesize_kB=4 anon=1 dirty=1 N3=1 -3206220000 default file=/lib64/ld-2.12.so kernelpagesize_kB=4 anon=1 dirty=1 N3=1 -3206221000 default kernelpagesize_kB=4 anon=1 dirty=1 N3=1 -3206800000 default file=/lib64/libc-2.12.so kernelpagesize_kB=4 mapped=59 mapmax=21 active=55 N0=41 N3=18 +00400000 default file=/usr/local/bin/app mapped=1 active=0 N3=1 kernelpagesize_kB=4 +00600000 default file=/usr/local/bin/app anon=1 dirty=1 N3=1 kernelpagesize_kB=4 +3206000000 default file=/lib64/ld-2.12.so mapped=26 mapmax=6 N0=24 N3=2 kernelpagesize_kB=4 +320621f000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 +3206220000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 +3206221000 default anon=1 dirty=1 N3=1 kernelpagesize_kB=4 +3206800000 default file=/lib64/libc-2.12.so mapped=59 mapmax=21 active=55 N0=41 N3=18 kernelpagesize_kB=4 320698b000 default file=/lib64/libc-2.12.so -3206b8a000 default file=/lib64/libc-2.12.so kernelpagesize_kB=4 anon=2 dirty=2 N3=2 -3206b8e000 default file=/lib64/libc-2.12.so kernelpagesize_kB=4 anon=1 dirty=1 N3=1 -3206b8f000 default kernelpagesize_kB=4 anon=3 dirty=3 active=1 N3=3 -7f4dc10a2000 default kernelpagesize_kB=4 anon=3 dirty=3 N3=3 -7f4dc10b4000 default kernelpagesize_kB=4 anon=2 dirty=2 active=1 N3=2 -7f4dc1200000 default file=/anon_hugepage\040(deleted) huge kernelpagesize_kB=2048 anon=1 dirty=1 N3=1 -7fff335f0000 default stack kernelpagesize_kB=4 anon=3 dirty=3 N3=3 -7fff3369d000 default kernelpagesize_kB=4 mapped=1 mapmax=35 active=0 N3=1 +3206b8a000 default file=/lib64/libc-2.12.so anon=2 dirty=2 N3=2 kernelpagesize_kB=4 +3206b8e000 default file=/lib64/libc-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 +3206b8f000 default anon=3 dirty=3 active=1 N3=3 kernelpagesize_kB=4 +7f4dc10a2000 default anon=3 dirty=3 N3=3 kernelpagesize_kB=4 +7f4dc10b4000 default anon=2 dirty=2 active=1 N3=2 kernelpagesize_kB=4 +7f4dc1200000 default file=/anon_hugepage\040(deleted) huge anon=1 dirty=1 N3=1 kernelpagesize_kB=2048 +7fff335f0000 default stack anon=3 dirty=3 N3=3 kernelpagesize_kB=4 +7fff3369d000 default mapped=1 mapmax=35 active=0 N3=1 kernelpagesize_kB=4 Where: "address" is the starting address for the mapping; diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 1359a911d194..956b75d61809 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1557,6 +1557,8 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) for_each_node_state(nid, N_MEMORY) if (md->node[nid]) seq_printf(m, " N%d=%lu", nid, md->node[nid]); + + seq_printf(m, " kernelpagesize_kB=%lu", vma_kernel_pagesize(vma) >> 10); out: seq_putc(m, '\n'); m_cache_vma(m, vma); -- cgit v1.2.3 From edc924e023ae2022fc29f1246e115c610c9abf38 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Thu, 12 Feb 2015 15:01:11 -0800 Subject: fs/proc/array.c: convert to use string_escape_str() Instead of custom approach let's use string_escape_str() to escape a given string (task_name in this case). Signed-off-by: Andy Shevchenko Cc: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/array.c | 34 +++++++--------------------------- 1 file changed, 7 insertions(+), 27 deletions(-) diff --git a/fs/proc/array.c b/fs/proc/array.c index bd117d065b82..a3ccf4c4ce70 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -81,6 +81,7 @@ #include #include #include +#include #include #include @@ -89,39 +90,18 @@ static inline void task_name(struct seq_file *m, struct task_struct *p) { - int i; - char *buf, *end; - char *name; + char *buf; char tcomm[sizeof(p->comm)]; get_task_comm(tcomm, p); seq_puts(m, "Name:\t"); - end = m->buf + m->size; buf = m->buf + m->count; - name = tcomm; - i = sizeof(tcomm); - while (i && (buf < end)) { - unsigned char c = *name; - name++; - i--; - *buf = c; - if (!c) - break; - if (c == '\\') { - buf++; - if (buf < end) - *buf++ = c; - continue; - } - if (c == '\n') { - *buf++ = '\\'; - if (buf < end) - *buf++ = 'n'; - continue; - } - buf++; - } + + /* Ignore error for now */ + string_escape_str(tcomm, &buf, m->size - m->count, + ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\"); + m->count = buf - m->buf; seq_putc(m, '\n'); } -- cgit v1.2.3 From f56141e3e2d9aabf7e6b89680ab572c2cdbb2a24 Mon Sep 17 00:00:00 2001 From: Andy Lutomirski Date: Thu, 12 Feb 2015 15:01:14 -0800 Subject: all arches, signal: move restart_block to struct task_struct If an attacker can cause a controlled kernel stack overflow, overwriting the restart block is a very juicy exploit target. This is because the restart_block is held in the same memory allocation as the kernel stack. Moving the restart block to struct task_struct prevents this exploit by making the restart_block harder to locate. Note that there are other fields in thread_info that are also easy targets, at least on some architectures. It's also a decent simplification, since the restart code is more or less identical on all architectures. [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack] Signed-off-by: Andy Lutomirski Cc: Thomas Gleixner Cc: Al Viro Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Kees Cook Cc: David Miller Acked-by: Richard Weinberger Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner Cc: Vineet Gupta Cc: Russell King Cc: Catalin Marinas Cc: Will Deacon Cc: Haavard Skinnemoen Cc: Hans-Christian Egtvedt Cc: Steven Miao Cc: Mark Salter Cc: Aurelien Jacquiot Cc: Mikael Starvik Cc: Jesper Nilsson Cc: David Howells Cc: Richard Kuo Cc: "Luck, Tony" Cc: Geert Uytterhoeven Cc: Michal Simek Cc: Ralf Baechle Cc: Jonas Bonn Cc: "James E.J. Bottomley" Cc: Helge Deller Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Acked-by: Michael Ellerman (powerpc) Tested-by: Michael Ellerman (powerpc) Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Chen Liqin Cc: Lennox Wu Cc: Chris Metcalf Cc: Guan Xuetao Cc: Chris Zankel Cc: Max Filippov Cc: Oleg Nesterov Cc: Guenter Roeck Signed-off-by: James Hogan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/alpha/include/asm/thread_info.h | 5 ----- arch/alpha/kernel/signal.c | 2 +- arch/arc/include/asm/thread_info.h | 4 ---- arch/arc/kernel/signal.c | 2 +- arch/arm/include/asm/thread_info.h | 4 ---- arch/arm/kernel/signal.c | 4 ++-- arch/arm64/include/asm/thread_info.h | 4 ---- arch/arm64/kernel/signal.c | 2 +- arch/arm64/kernel/signal32.c | 4 ++-- arch/avr32/include/asm/thread_info.h | 4 ---- arch/avr32/kernel/asm-offsets.c | 1 - arch/avr32/kernel/signal.c | 2 +- arch/blackfin/include/asm/thread_info.h | 4 ---- arch/blackfin/kernel/signal.c | 2 +- arch/c6x/include/asm/thread_info.h | 4 ---- arch/c6x/kernel/signal.c | 2 +- arch/cris/arch-v10/kernel/signal.c | 2 +- arch/cris/arch-v32/kernel/signal.c | 2 +- arch/cris/include/asm/thread_info.h | 4 ---- arch/frv/include/asm/thread_info.h | 4 ---- arch/frv/kernel/asm-offsets.c | 1 - arch/frv/kernel/signal.c | 2 +- arch/hexagon/include/asm/thread_info.h | 4 ---- arch/hexagon/kernel/signal.c | 2 +- arch/ia64/include/asm/thread_info.h | 4 ---- arch/ia64/kernel/signal.c | 2 +- arch/m32r/include/asm/thread_info.h | 5 ----- arch/m32r/kernel/signal.c | 2 +- arch/m68k/include/asm/thread_info.h | 4 ---- arch/m68k/kernel/signal.c | 4 ++-- arch/metag/include/asm/thread_info.h | 6 +----- arch/metag/kernel/signal.c | 2 +- arch/microblaze/include/asm/thread_info.h | 4 ---- arch/microblaze/kernel/signal.c | 2 +- arch/mips/include/asm/thread_info.h | 4 ---- arch/mips/kernel/asm-offsets.c | 1 - arch/mips/kernel/signal.c | 2 +- arch/mips/kernel/signal32.c | 2 +- arch/mn10300/include/asm/thread_info.h | 4 ---- arch/mn10300/kernel/asm-offsets.c | 1 - arch/mn10300/kernel/signal.c | 2 +- arch/openrisc/include/asm/thread_info.h | 4 ---- arch/openrisc/kernel/signal.c | 2 +- arch/parisc/include/asm/thread_info.h | 4 ---- arch/parisc/kernel/signal.c | 2 +- arch/powerpc/include/asm/thread_info.h | 4 ---- arch/powerpc/kernel/signal_32.c | 4 ++-- arch/powerpc/kernel/signal_64.c | 2 +- arch/s390/include/asm/thread_info.h | 4 ---- arch/s390/kernel/compat_signal.c | 2 +- arch/s390/kernel/signal.c | 2 +- arch/score/include/asm/thread_info.h | 4 ---- arch/score/kernel/asm-offsets.c | 1 - arch/score/kernel/signal.c | 2 +- arch/sh/include/asm/thread_info.h | 4 ---- arch/sh/kernel/asm-offsets.c | 1 - arch/sh/kernel/signal_32.c | 4 ++-- arch/sh/kernel/signal_64.c | 4 ++-- arch/sparc/include/asm/thread_info_32.h | 6 ------ arch/sparc/include/asm/thread_info_64.h | 12 +++--------- arch/sparc/kernel/signal32.c | 4 ++-- arch/sparc/kernel/signal_32.c | 2 +- arch/sparc/kernel/signal_64.c | 2 +- arch/sparc/kernel/traps_64.c | 2 -- arch/tile/include/asm/thread_info.h | 4 ---- arch/tile/kernel/signal.c | 2 +- arch/um/include/asm/thread_info.h | 4 ---- arch/unicore32/include/asm/thread_info.h | 4 ---- arch/unicore32/kernel/signal.c | 2 +- arch/x86/ia32/ia32_signal.c | 2 +- arch/x86/include/asm/thread_info.h | 4 ---- arch/x86/kernel/signal.c | 2 +- arch/x86/um/signal.c | 2 +- arch/xtensa/include/asm/thread_info.h | 5 ----- arch/xtensa/kernel/signal.c | 2 +- fs/select.c | 2 +- include/linux/init_task.h | 3 +++ include/linux/sched.h | 2 ++ kernel/compat.c | 5 ++--- kernel/futex.c | 2 +- kernel/signal.c | 2 +- kernel/time/alarmtimer.c | 2 +- kernel/time/hrtimer.c | 2 +- kernel/time/posix-cpu-timers.c | 3 +-- 84 files changed, 62 insertions(+), 194 deletions(-) diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h index 48bbea6898b3..d5b98ab514bb 100644 --- a/arch/alpha/include/asm/thread_info.h +++ b/arch/alpha/include/asm/thread_info.h @@ -27,8 +27,6 @@ struct thread_info { int bpt_nsaved; unsigned long bpt_addr[2]; /* breakpoint handling */ unsigned int bpt_insn[2]; - - struct restart_block restart_block; }; /* @@ -40,9 +38,6 @@ struct thread_info { .exec_domain = &default_exec_domain, \ .addr_limit = KERNEL_DS, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/alpha/kernel/signal.c b/arch/alpha/kernel/signal.c index 6cec2881acbf..8dbfb15f1745 100644 --- a/arch/alpha/kernel/signal.c +++ b/arch/alpha/kernel/signal.c @@ -150,7 +150,7 @@ restore_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs) struct switch_stack *sw = (struct switch_stack *)regs - 1; long i, err = __get_user(regs->pc, &sc->sc_pc); - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; sw->r26 = (unsigned long) ret_from_sys_call; diff --git a/arch/arc/include/asm/thread_info.h b/arch/arc/include/asm/thread_info.h index 02bc5ec0fb2e..1163a1838ac1 100644 --- a/arch/arc/include/asm/thread_info.h +++ b/arch/arc/include/asm/thread_info.h @@ -46,7 +46,6 @@ struct thread_info { struct exec_domain *exec_domain;/* execution domain */ __u32 cpu; /* current CPU */ unsigned long thr_ptr; /* TLS ptr */ - struct restart_block restart_block; }; /* @@ -62,9 +61,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/arc/kernel/signal.c b/arch/arc/kernel/signal.c index cb3142a2d40b..114234e83caa 100644 --- a/arch/arc/kernel/signal.c +++ b/arch/arc/kernel/signal.c @@ -104,7 +104,7 @@ SYSCALL_DEFINE0(rt_sigreturn) struct pt_regs *regs = current_pt_regs(); /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* Since we stacked the signal on a word boundary, * then 'sp' should be word aligned here. If it's diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h index d890e41f5520..72812a1f3d1c 100644 --- a/arch/arm/include/asm/thread_info.h +++ b/arch/arm/include/asm/thread_info.h @@ -68,7 +68,6 @@ struct thread_info { #ifdef CONFIG_ARM_THUMBEE unsigned long thumbee_state; /* ThumbEE Handler Base register */ #endif - struct restart_block restart_block; }; #define INIT_THREAD_INFO(tsk) \ @@ -81,9 +80,6 @@ struct thread_info { .cpu_domain = domain_val(DOMAIN_USER, DOMAIN_MANAGER) | \ domain_val(DOMAIN_KERNEL, DOMAIN_MANAGER) | \ domain_val(DOMAIN_IO, DOMAIN_CLIENT), \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c index 8aa6f1b87c9e..023ac905e4c3 100644 --- a/arch/arm/kernel/signal.c +++ b/arch/arm/kernel/signal.c @@ -191,7 +191,7 @@ asmlinkage int sys_sigreturn(struct pt_regs *regs) struct sigframe __user *frame; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Since we stacked the signal on a 64-bit boundary, @@ -221,7 +221,7 @@ asmlinkage int sys_rt_sigreturn(struct pt_regs *regs) struct rt_sigframe __user *frame; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Since we stacked the signal on a 64-bit boundary, diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 459bf8e53208..702e1e6a0d80 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -48,7 +48,6 @@ struct thread_info { mm_segment_t addr_limit; /* address limit */ struct task_struct *task; /* main task structure */ struct exec_domain *exec_domain; /* execution domain */ - struct restart_block restart_block; int preempt_count; /* 0 => preemptable, <0 => bug */ int cpu; /* cpu */ }; @@ -60,9 +59,6 @@ struct thread_info { .flags = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index 6fa792137eda..660ccf9f7524 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -131,7 +131,7 @@ asmlinkage long sys_rt_sigreturn(struct pt_regs *regs) struct rt_sigframe __user *frame; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Since we stacked the signal on a 128-bit boundary, then 'sp' should diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c index e299de396e9b..c20a300e2213 100644 --- a/arch/arm64/kernel/signal32.c +++ b/arch/arm64/kernel/signal32.c @@ -347,7 +347,7 @@ asmlinkage int compat_sys_sigreturn(struct pt_regs *regs) struct compat_sigframe __user *frame; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Since we stacked the signal on a 64-bit boundary, @@ -381,7 +381,7 @@ asmlinkage int compat_sys_rt_sigreturn(struct pt_regs *regs) struct compat_rt_sigframe __user *frame; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Since we stacked the signal on a 64-bit boundary, diff --git a/arch/avr32/include/asm/thread_info.h b/arch/avr32/include/asm/thread_info.h index a978f3fe7c25..d56afa99a514 100644 --- a/arch/avr32/include/asm/thread_info.h +++ b/arch/avr32/include/asm/thread_info.h @@ -30,7 +30,6 @@ struct thread_info { saved by debug handler when setting up trampoline */ - struct restart_block restart_block; __u8 supervisor_stack[0]; }; @@ -41,9 +40,6 @@ struct thread_info { .flags = 0, \ .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall \ - } \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/avr32/kernel/asm-offsets.c b/arch/avr32/kernel/asm-offsets.c index d6a8193a1d2f..e41c84516e5d 100644 --- a/arch/avr32/kernel/asm-offsets.c +++ b/arch/avr32/kernel/asm-offsets.c @@ -18,7 +18,6 @@ void foo(void) OFFSET(TI_preempt_count, thread_info, preempt_count); OFFSET(TI_rar_saved, thread_info, rar_saved); OFFSET(TI_rsr_saved, thread_info, rsr_saved); - OFFSET(TI_restart_block, thread_info, restart_block); BLANK(); OFFSET(TSK_active_mm, task_struct, active_mm); BLANK(); diff --git a/arch/avr32/kernel/signal.c b/arch/avr32/kernel/signal.c index d309fbcc3bd6..8f1c63b9b983 100644 --- a/arch/avr32/kernel/signal.c +++ b/arch/avr32/kernel/signal.c @@ -69,7 +69,7 @@ asmlinkage int sys_rt_sigreturn(struct pt_regs *regs) sigset_t set; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; frame = (struct rt_sigframe __user *)regs->sp; pr_debug("SIG return: frame = %p\n", frame); diff --git a/arch/blackfin/include/asm/thread_info.h b/arch/blackfin/include/asm/thread_info.h index 55f473bdad36..57c3a8bd583d 100644 --- a/arch/blackfin/include/asm/thread_info.h +++ b/arch/blackfin/include/asm/thread_info.h @@ -42,7 +42,6 @@ struct thread_info { int cpu; /* cpu we're on */ int preempt_count; /* 0 => preemptable, <0 => BUG */ mm_segment_t addr_limit; /* address limit */ - struct restart_block restart_block; #ifndef CONFIG_SMP struct l1_scratch_task_info l1_task_info; #endif @@ -58,9 +57,6 @@ struct thread_info { .flags = 0, \ .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) #define init_stack (init_thread_union.stack) diff --git a/arch/blackfin/kernel/signal.c b/arch/blackfin/kernel/signal.c index ef275571d885..f2a8b5493bd3 100644 --- a/arch/blackfin/kernel/signal.c +++ b/arch/blackfin/kernel/signal.c @@ -44,7 +44,7 @@ rt_restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc, int *p int err = 0; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; #define RESTORE(x) err |= __get_user(regs->x, &sc->sc_##x) diff --git a/arch/c6x/include/asm/thread_info.h b/arch/c6x/include/asm/thread_info.h index d4e9ef87076d..584e253f3217 100644 --- a/arch/c6x/include/asm/thread_info.h +++ b/arch/c6x/include/asm/thread_info.h @@ -45,7 +45,6 @@ struct thread_info { int cpu; /* cpu we're on */ int preempt_count; /* 0 = preemptable, <0 = BUG */ mm_segment_t addr_limit; /* thread address space */ - struct restart_block restart_block; }; /* @@ -61,9 +60,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/c6x/kernel/signal.c b/arch/c6x/kernel/signal.c index fe68226f6c4d..3c4bb5a5c382 100644 --- a/arch/c6x/kernel/signal.c +++ b/arch/c6x/kernel/signal.c @@ -68,7 +68,7 @@ asmlinkage int do_rt_sigreturn(struct pt_regs *regs) sigset_t set; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Since we stacked the signal on a dword boundary, diff --git a/arch/cris/arch-v10/kernel/signal.c b/arch/cris/arch-v10/kernel/signal.c index 9b32d338838b..74d7ba35120d 100644 --- a/arch/cris/arch-v10/kernel/signal.c +++ b/arch/cris/arch-v10/kernel/signal.c @@ -67,7 +67,7 @@ restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc) unsigned long old_usp; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* restore the regs from &sc->regs (same as sc, since regs is first) * (sc is already checked for VERIFY_READ since the sigframe was diff --git a/arch/cris/arch-v32/kernel/signal.c b/arch/cris/arch-v32/kernel/signal.c index 78ce3b1c9bcb..870e3e069318 100644 --- a/arch/cris/arch-v32/kernel/signal.c +++ b/arch/cris/arch-v32/kernel/signal.c @@ -59,7 +59,7 @@ restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc) unsigned long old_usp; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Restore the registers from &sc->regs. sc is already checked diff --git a/arch/cris/include/asm/thread_info.h b/arch/cris/include/asm/thread_info.h index 55dede18c032..7286db5ed90e 100644 --- a/arch/cris/include/asm/thread_info.h +++ b/arch/cris/include/asm/thread_info.h @@ -38,7 +38,6 @@ struct thread_info { 0-0xBFFFFFFF for user-thead 0-0xFFFFFFFF for kernel-thread */ - struct restart_block restart_block; __u8 supervisor_stack[0]; }; @@ -56,9 +55,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/frv/include/asm/thread_info.h b/arch/frv/include/asm/thread_info.h index af29e17c0181..6b917f1c2955 100644 --- a/arch/frv/include/asm/thread_info.h +++ b/arch/frv/include/asm/thread_info.h @@ -41,7 +41,6 @@ struct thread_info { * 0-0xBFFFFFFF for user-thead * 0-0xFFFFFFFF for kernel-thread */ - struct restart_block restart_block; __u8 supervisor_stack[0]; }; @@ -65,9 +64,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/frv/kernel/asm-offsets.c b/arch/frv/kernel/asm-offsets.c index 9de96843a278..446e89d500cc 100644 --- a/arch/frv/kernel/asm-offsets.c +++ b/arch/frv/kernel/asm-offsets.c @@ -40,7 +40,6 @@ void foo(void) OFFSET(TI_CPU, thread_info, cpu); OFFSET(TI_PREEMPT_COUNT, thread_info, preempt_count); OFFSET(TI_ADDR_LIMIT, thread_info, addr_limit); - OFFSET(TI_RESTART_BLOCK, thread_info, restart_block); BLANK(); /* offsets into register file storage */ diff --git a/arch/frv/kernel/signal.c b/arch/frv/kernel/signal.c index dc3d59de0870..336713ab4745 100644 --- a/arch/frv/kernel/signal.c +++ b/arch/frv/kernel/signal.c @@ -62,7 +62,7 @@ static int restore_sigcontext(struct sigcontext __user *sc, int *_gr8) unsigned long tbr, psr; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; tbr = user->i.tbr; psr = user->i.psr; diff --git a/arch/hexagon/include/asm/thread_info.h b/arch/hexagon/include/asm/thread_info.h index a59dad3b3695..bacd3d6030c5 100644 --- a/arch/hexagon/include/asm/thread_info.h +++ b/arch/hexagon/include/asm/thread_info.h @@ -56,7 +56,6 @@ struct thread_info { * used for syscalls somehow; * seems to have a function pointer and four arguments */ - struct restart_block restart_block; /* Points to the current pt_regs frame */ struct pt_regs *regs; /* @@ -83,9 +82,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = 1, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ .sp = 0, \ .regs = NULL, \ } diff --git a/arch/hexagon/kernel/signal.c b/arch/hexagon/kernel/signal.c index eadd70e47e7e..b039a624c170 100644 --- a/arch/hexagon/kernel/signal.c +++ b/arch/hexagon/kernel/signal.c @@ -239,7 +239,7 @@ asmlinkage int sys_rt_sigreturn(void) sigset_t blocked; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; frame = (struct rt_sigframe __user *)pt_psp(regs); if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h index 5b17418b4223..c16f21a068ff 100644 --- a/arch/ia64/include/asm/thread_info.h +++ b/arch/ia64/include/asm/thread_info.h @@ -27,7 +27,6 @@ struct thread_info { __u32 status; /* Thread synchronous flags */ mm_segment_t addr_limit; /* user-level address space limit */ int preempt_count; /* 0=premptable, <0=BUG; will also serve as bh-counter */ - struct restart_block restart_block; #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE __u64 ac_stamp; __u64 ac_leave; @@ -46,9 +45,6 @@ struct thread_info { .cpu = 0, \ .addr_limit = KERNEL_DS, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #ifndef ASM_OFFSETS_C diff --git a/arch/ia64/kernel/signal.c b/arch/ia64/kernel/signal.c index 6d92170be457..b3a124da71e5 100644 --- a/arch/ia64/kernel/signal.c +++ b/arch/ia64/kernel/signal.c @@ -46,7 +46,7 @@ restore_sigcontext (struct sigcontext __user *sc, struct sigscratch *scr) long err; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* restore scratch that always needs gets updated during signal delivery: */ err = __get_user(flags, &sc->sc_flags); diff --git a/arch/m32r/include/asm/thread_info.h b/arch/m32r/include/asm/thread_info.h index 00171703402f..32422d0211c3 100644 --- a/arch/m32r/include/asm/thread_info.h +++ b/arch/m32r/include/asm/thread_info.h @@ -34,7 +34,6 @@ struct thread_info { 0-0xBFFFFFFF for user-thread 0-0xFFFFFFFF for kernel-thread */ - struct restart_block restart_block; __u8 supervisor_stack[0]; }; @@ -49,7 +48,6 @@ struct thread_info { #define TI_CPU 0x00000010 #define TI_PRE_COUNT 0x00000014 #define TI_ADDR_LIMIT 0x00000018 -#define TI_RESTART_BLOCK 0x000001C #endif @@ -68,9 +66,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/m32r/kernel/signal.c b/arch/m32r/kernel/signal.c index 95408b8f130a..7736c6660a15 100644 --- a/arch/m32r/kernel/signal.c +++ b/arch/m32r/kernel/signal.c @@ -48,7 +48,7 @@ restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc, unsigned int err = 0; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; #define COPY(x) err |= __get_user(regs->x, &sc->sc_##x) COPY(r4); diff --git a/arch/m68k/include/asm/thread_info.h b/arch/m68k/include/asm/thread_info.h index 21a4784ca5a1..c54256e69e64 100644 --- a/arch/m68k/include/asm/thread_info.h +++ b/arch/m68k/include/asm/thread_info.h @@ -31,7 +31,6 @@ struct thread_info { int preempt_count; /* 0 => preemptable, <0 => BUG */ __u32 cpu; /* should always be 0 on m68k */ unsigned long tp_value; /* thread pointer */ - struct restart_block restart_block; }; #endif /* __ASSEMBLY__ */ @@ -41,9 +40,6 @@ struct thread_info { .exec_domain = &default_exec_domain, \ .addr_limit = KERNEL_DS, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_stack (init_thread_union.stack) diff --git a/arch/m68k/kernel/signal.c b/arch/m68k/kernel/signal.c index 967a8b7e1527..d7179281e74a 100644 --- a/arch/m68k/kernel/signal.c +++ b/arch/m68k/kernel/signal.c @@ -655,7 +655,7 @@ restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *usc, void __u int err = 0; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* get previous context */ if (copy_from_user(&context, usc, sizeof(context))) @@ -693,7 +693,7 @@ rt_restore_ucontext(struct pt_regs *regs, struct switch_stack *sw, int err; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; err = __get_user(temp, &uc->uc_mcontext.version); if (temp != MCONTEXT_VERSION) diff --git a/arch/metag/include/asm/thread_info.h b/arch/metag/include/asm/thread_info.h index 47711336119e..afb3ca4776d1 100644 --- a/arch/metag/include/asm/thread_info.h +++ b/arch/metag/include/asm/thread_info.h @@ -35,9 +35,8 @@ struct thread_info { int preempt_count; /* 0 => preemptable, <0 => BUG */ mm_segment_t addr_limit; /* thread address space */ - struct restart_block restart_block; - u8 supervisor_stack[0]; + u8 supervisor_stack[0] __aligned(8); }; #else /* !__ASSEMBLY__ */ @@ -74,9 +73,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/metag/kernel/signal.c b/arch/metag/kernel/signal.c index 0d100d5c1407..ce49d429c74a 100644 --- a/arch/metag/kernel/signal.c +++ b/arch/metag/kernel/signal.c @@ -48,7 +48,7 @@ static int restore_sigcontext(struct pt_regs *regs, int err; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; err = metag_gp_regs_copyin(regs, 0, sizeof(struct user_gp_regs), NULL, &sc->regs); diff --git a/arch/microblaze/include/asm/thread_info.h b/arch/microblaze/include/asm/thread_info.h index 8c9d36591a03..b699fbd7de4a 100644 --- a/arch/microblaze/include/asm/thread_info.h +++ b/arch/microblaze/include/asm/thread_info.h @@ -71,7 +71,6 @@ struct thread_info { __u32 cpu; /* current CPU */ __s32 preempt_count; /* 0 => preemptable,< 0 => BUG*/ mm_segment_t addr_limit; /* thread address space */ - struct restart_block restart_block; struct cpu_context cpu_context; }; @@ -87,9 +86,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/microblaze/kernel/signal.c b/arch/microblaze/kernel/signal.c index 235706055b7f..a1cbaf90e2ea 100644 --- a/arch/microblaze/kernel/signal.c +++ b/arch/microblaze/kernel/signal.c @@ -89,7 +89,7 @@ asmlinkage long sys_rt_sigreturn(struct pt_regs *regs) int rval; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) goto badframe; diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h index e4440f92b366..9e1295f874f0 100644 --- a/arch/mips/include/asm/thread_info.h +++ b/arch/mips/include/asm/thread_info.h @@ -34,7 +34,6 @@ struct thread_info { * 0x7fffffff for user-thead * 0xffffffff for kernel-thread */ - struct restart_block restart_block; struct pt_regs *regs; long syscall; /* syscall number */ }; @@ -50,9 +49,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/mips/kernel/asm-offsets.c b/arch/mips/kernel/asm-offsets.c index b1d84bd4efb3..3b2dfdb4865f 100644 --- a/arch/mips/kernel/asm-offsets.c +++ b/arch/mips/kernel/asm-offsets.c @@ -98,7 +98,6 @@ void output_thread_info_defines(void) OFFSET(TI_CPU, thread_info, cpu); OFFSET(TI_PRE_COUNT, thread_info, preempt_count); OFFSET(TI_ADDR_LIMIT, thread_info, addr_limit); - OFFSET(TI_RESTART_BLOCK, thread_info, restart_block); OFFSET(TI_REGS, thread_info, regs); DEFINE(_THREAD_SIZE, THREAD_SIZE); DEFINE(_THREAD_MASK, THREAD_MASK); diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c index 545bf11bd2ed..6a28c792d862 100644 --- a/arch/mips/kernel/signal.c +++ b/arch/mips/kernel/signal.c @@ -243,7 +243,7 @@ int restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc) int i; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; err |= __get_user(regs->cp0_epc, &sc->sc_pc); diff --git a/arch/mips/kernel/signal32.c b/arch/mips/kernel/signal32.c index d69179c0d49d..19a7705f2a01 100644 --- a/arch/mips/kernel/signal32.c +++ b/arch/mips/kernel/signal32.c @@ -220,7 +220,7 @@ static int restore_sigcontext32(struct pt_regs *regs, int i; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; err |= __get_user(regs->cp0_epc, &sc->sc_pc); err |= __get_user(regs->hi, &sc->sc_mdhi); diff --git a/arch/mn10300/include/asm/thread_info.h b/arch/mn10300/include/asm/thread_info.h index bf280eaccd36..c1c374f0ec12 100644 --- a/arch/mn10300/include/asm/thread_info.h +++ b/arch/mn10300/include/asm/thread_info.h @@ -50,7 +50,6 @@ struct thread_info { 0-0xBFFFFFFF for user-thead 0-0xFFFFFFFF for kernel-thread */ - struct restart_block restart_block; __u8 supervisor_stack[0]; }; @@ -80,9 +79,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/mn10300/kernel/asm-offsets.c b/arch/mn10300/kernel/asm-offsets.c index 47b3bb0c04ff..d780670cbaf3 100644 --- a/arch/mn10300/kernel/asm-offsets.c +++ b/arch/mn10300/kernel/asm-offsets.c @@ -28,7 +28,6 @@ void foo(void) OFFSET(TI_cpu, thread_info, cpu); OFFSET(TI_preempt_count, thread_info, preempt_count); OFFSET(TI_addr_limit, thread_info, addr_limit); - OFFSET(TI_restart_block, thread_info, restart_block); BLANK(); OFFSET(REG_D0, pt_regs, d0); diff --git a/arch/mn10300/kernel/signal.c b/arch/mn10300/kernel/signal.c index a6c0858592c3..8609845f12c5 100644 --- a/arch/mn10300/kernel/signal.c +++ b/arch/mn10300/kernel/signal.c @@ -40,7 +40,7 @@ static int restore_sigcontext(struct pt_regs *regs, unsigned int err = 0; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (is_using_fpu(current)) fpu_kill_state(current); diff --git a/arch/openrisc/include/asm/thread_info.h b/arch/openrisc/include/asm/thread_info.h index d797acc901e4..875f0845a707 100644 --- a/arch/openrisc/include/asm/thread_info.h +++ b/arch/openrisc/include/asm/thread_info.h @@ -57,7 +57,6 @@ struct thread_info { 0-0x7FFFFFFF for user-thead 0-0xFFFFFFFF for kernel-thread */ - struct restart_block restart_block; __u8 supervisor_stack[0]; /* saved context data */ @@ -79,9 +78,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = 1, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ .ksp = 0, \ } diff --git a/arch/openrisc/kernel/signal.c b/arch/openrisc/kernel/signal.c index 7d1b8235bf90..4112175bf803 100644 --- a/arch/openrisc/kernel/signal.c +++ b/arch/openrisc/kernel/signal.c @@ -46,7 +46,7 @@ static int restore_sigcontext(struct pt_regs *regs, int err = 0; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Restore the regs from &sc->regs. diff --git a/arch/parisc/include/asm/thread_info.h b/arch/parisc/include/asm/thread_info.h index a84611835549..fb13e3865563 100644 --- a/arch/parisc/include/asm/thread_info.h +++ b/arch/parisc/include/asm/thread_info.h @@ -14,7 +14,6 @@ struct thread_info { mm_segment_t addr_limit; /* user-level address space limit */ __u32 cpu; /* current CPU */ int preempt_count; /* 0=premptable, <0=BUG; will also serve as bh-counter */ - struct restart_block restart_block; }; #define INIT_THREAD_INFO(tsk) \ @@ -25,9 +24,6 @@ struct thread_info { .cpu = 0, \ .addr_limit = KERNEL_DS, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall \ - } \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/parisc/kernel/signal.c b/arch/parisc/kernel/signal.c index 012d4fa63d97..9b910a0251b8 100644 --- a/arch/parisc/kernel/signal.c +++ b/arch/parisc/kernel/signal.c @@ -99,7 +99,7 @@ sys_rt_sigreturn(struct pt_regs *regs, int in_syscall) sigframe_size = PARISC_RT_SIGFRAME_SIZE32; #endif - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* Unwind the user stack to get the rt_sigframe structure. */ frame = (struct rt_sigframe __user *) diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h index e8abc83e699f..72489799cf02 100644 --- a/arch/powerpc/include/asm/thread_info.h +++ b/arch/powerpc/include/asm/thread_info.h @@ -43,7 +43,6 @@ struct thread_info { int cpu; /* cpu we're on */ int preempt_count; /* 0 => preemptable, <0 => BUG */ - struct restart_block restart_block; unsigned long local_flags; /* private flags for thread */ /* low level flags - has atomic operations done on it */ @@ -59,9 +58,6 @@ struct thread_info { .exec_domain = &default_exec_domain, \ .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ .flags = 0, \ } diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c index b171001698ff..d3a831ac0f92 100644 --- a/arch/powerpc/kernel/signal_32.c +++ b/arch/powerpc/kernel/signal_32.c @@ -1231,7 +1231,7 @@ long sys_rt_sigreturn(int r3, int r4, int r5, int r6, int r7, int r8, int tm_restore = 0; #endif /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; rt_sf = (struct rt_sigframe __user *) (regs->gpr[1] + __SIGNAL_FRAMESIZE + 16); @@ -1504,7 +1504,7 @@ long sys_sigreturn(int r3, int r4, int r5, int r6, int r7, int r8, #endif /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; sf = (struct sigframe __user *)(regs->gpr[1] + __SIGNAL_FRAMESIZE); sc = &sf->sctx; diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c index 2cb0c94cafa5..c7c24d2e2bdb 100644 --- a/arch/powerpc/kernel/signal_64.c +++ b/arch/powerpc/kernel/signal_64.c @@ -666,7 +666,7 @@ int sys_rt_sigreturn(unsigned long r3, unsigned long r4, unsigned long r5, #endif /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (!access_ok(VERIFY_READ, uc, sizeof(*uc))) goto badframe; diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h index 4d62fd5b56e5..ef1df718642d 100644 --- a/arch/s390/include/asm/thread_info.h +++ b/arch/s390/include/asm/thread_info.h @@ -39,7 +39,6 @@ struct thread_info { unsigned long sys_call_table; /* System call table address */ unsigned int cpu; /* current CPU */ int preempt_count; /* 0 => preemptable, <0 => BUG */ - struct restart_block restart_block; unsigned int system_call; __u64 user_timer; __u64 system_timer; @@ -56,9 +55,6 @@ struct thread_info { .flags = 0, \ .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/s390/kernel/compat_signal.c b/arch/s390/kernel/compat_signal.c index 34d5fa7b01b5..bc1df12dd4f8 100644 --- a/arch/s390/kernel/compat_signal.c +++ b/arch/s390/kernel/compat_signal.c @@ -209,7 +209,7 @@ static int restore_sigregs32(struct pt_regs *regs,_sigregs32 __user *sregs) int i; /* Alwys make any pending restarted system call return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (__copy_from_user(&user_sregs, &sregs->regs, sizeof(user_sregs))) return -EFAULT; diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c index 6a2ac257d98f..b3ae6f70c6d6 100644 --- a/arch/s390/kernel/signal.c +++ b/arch/s390/kernel/signal.c @@ -162,7 +162,7 @@ static int restore_sigregs(struct pt_regs *regs, _sigregs __user *sregs) _sigregs user_sregs; /* Alwys make any pending restarted system call return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (__copy_from_user(&user_sregs, sregs, sizeof(user_sregs))) return -EFAULT; diff --git a/arch/score/include/asm/thread_info.h b/arch/score/include/asm/thread_info.h index 656b7ada9326..33864fa2a8d4 100644 --- a/arch/score/include/asm/thread_info.h +++ b/arch/score/include/asm/thread_info.h @@ -42,7 +42,6 @@ struct thread_info { * 0-0xFFFFFFFF for kernel-thread */ mm_segment_t addr_limit; - struct restart_block restart_block; struct pt_regs *regs; }; @@ -58,9 +57,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = 1, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/score/kernel/asm-offsets.c b/arch/score/kernel/asm-offsets.c index 57788f44c6fb..b4d5214a7a7e 100644 --- a/arch/score/kernel/asm-offsets.c +++ b/arch/score/kernel/asm-offsets.c @@ -106,7 +106,6 @@ void output_thread_info_defines(void) OFFSET(TI_CPU, thread_info, cpu); OFFSET(TI_PRE_COUNT, thread_info, preempt_count); OFFSET(TI_ADDR_LIMIT, thread_info, addr_limit); - OFFSET(TI_RESTART_BLOCK, thread_info, restart_block); OFFSET(TI_REGS, thread_info, regs); DEFINE(KERNEL_STACK_SIZE, THREAD_SIZE); DEFINE(KERNEL_STACK_MASK, THREAD_MASK); diff --git a/arch/score/kernel/signal.c b/arch/score/kernel/signal.c index 1651807774ad..e381c8c4ff65 100644 --- a/arch/score/kernel/signal.c +++ b/arch/score/kernel/signal.c @@ -141,7 +141,7 @@ score_rt_sigreturn(struct pt_regs *regs) int sig; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; frame = (struct rt_sigframe __user *) regs->regs[0]; if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) diff --git a/arch/sh/include/asm/thread_info.h b/arch/sh/include/asm/thread_info.h index ad27ffa65e2e..657c03919627 100644 --- a/arch/sh/include/asm/thread_info.h +++ b/arch/sh/include/asm/thread_info.h @@ -33,7 +33,6 @@ struct thread_info { __u32 cpu; int preempt_count; /* 0 => preemptable, <0 => BUG */ mm_segment_t addr_limit; /* thread address space */ - struct restart_block restart_block; unsigned long previous_sp; /* sp of previous stack in case of nested IRQ stacks */ __u8 supervisor_stack[0]; @@ -63,9 +62,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/sh/kernel/asm-offsets.c b/arch/sh/kernel/asm-offsets.c index 08a2be775b6c..542225fedb11 100644 --- a/arch/sh/kernel/asm-offsets.c +++ b/arch/sh/kernel/asm-offsets.c @@ -25,7 +25,6 @@ int main(void) DEFINE(TI_FLAGS, offsetof(struct thread_info, flags)); DEFINE(TI_CPU, offsetof(struct thread_info, cpu)); DEFINE(TI_PRE_COUNT, offsetof(struct thread_info, preempt_count)); - DEFINE(TI_RESTART_BLOCK,offsetof(struct thread_info, restart_block)); DEFINE(TI_SIZE, sizeof(struct thread_info)); #ifdef CONFIG_HIBERNATION diff --git a/arch/sh/kernel/signal_32.c b/arch/sh/kernel/signal_32.c index 2f002b24fb92..0b34f2a704fe 100644 --- a/arch/sh/kernel/signal_32.c +++ b/arch/sh/kernel/signal_32.c @@ -156,7 +156,7 @@ asmlinkage int sys_sigreturn(void) int r0; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) goto badframe; @@ -186,7 +186,7 @@ asmlinkage int sys_rt_sigreturn(void) int r0; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) goto badframe; diff --git a/arch/sh/kernel/signal_64.c b/arch/sh/kernel/signal_64.c index 897abe7b871e..71993c6a7d94 100644 --- a/arch/sh/kernel/signal_64.c +++ b/arch/sh/kernel/signal_64.c @@ -260,7 +260,7 @@ asmlinkage int sys_sigreturn(unsigned long r2, unsigned long r3, long long ret; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) goto badframe; @@ -294,7 +294,7 @@ asmlinkage int sys_rt_sigreturn(unsigned long r2, unsigned long r3, long long ret; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) goto badframe; diff --git a/arch/sparc/include/asm/thread_info_32.h b/arch/sparc/include/asm/thread_info_32.h index 025c98446b1e..fd7bd0a440ca 100644 --- a/arch/sparc/include/asm/thread_info_32.h +++ b/arch/sparc/include/asm/thread_info_32.h @@ -47,8 +47,6 @@ struct thread_info { struct reg_window32 reg_window[NSWINS]; /* align for ldd! */ unsigned long rwbuf_stkptrs[NSWINS]; unsigned long w_saved; - - struct restart_block restart_block; }; /* @@ -62,9 +60,6 @@ struct thread_info { .flags = 0, \ .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) @@ -103,7 +98,6 @@ register struct thread_info *current_thread_info_reg asm("g6"); #define TI_REG_WINDOW 0x30 #define TI_RWIN_SPTRS 0x230 #define TI_W_SAVED 0x250 -/* #define TI_RESTART_BLOCK 0x25n */ /* Nobody cares */ /* * thread information flag bit numbers diff --git a/arch/sparc/include/asm/thread_info_64.h b/arch/sparc/include/asm/thread_info_64.h index 798f0279a4b5..ff455164732a 100644 --- a/arch/sparc/include/asm/thread_info_64.h +++ b/arch/sparc/include/asm/thread_info_64.h @@ -58,8 +58,6 @@ struct thread_info { unsigned long gsr[7]; unsigned long xfsr[7]; - struct restart_block restart_block; - struct pt_regs *kern_una_regs; unsigned int kern_una_insn; @@ -92,10 +90,9 @@ struct thread_info { #define TI_RWIN_SPTRS 0x000003c8 #define TI_GSR 0x00000400 #define TI_XFSR 0x00000438 -#define TI_RESTART_BLOCK 0x00000470 -#define TI_KUNA_REGS 0x000004a0 -#define TI_KUNA_INSN 0x000004a8 -#define TI_FPREGS 0x000004c0 +#define TI_KUNA_REGS 0x00000470 +#define TI_KUNA_INSN 0x00000478 +#define TI_FPREGS 0x00000480 /* We embed this in the uppermost byte of thread_info->flags */ #define FAULT_CODE_WRITE 0x01 /* Write access, implies D-TLB */ @@ -124,9 +121,6 @@ struct thread_info { .current_ds = ASI_P, \ .exec_domain = &default_exec_domain, \ .preempt_count = INIT_PREEMPT_COUNT, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c index 62deba7be1a9..4eed773a7735 100644 --- a/arch/sparc/kernel/signal32.c +++ b/arch/sparc/kernel/signal32.c @@ -150,7 +150,7 @@ void do_sigreturn32(struct pt_regs *regs) int err, i; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; synchronize_user_stack(); @@ -235,7 +235,7 @@ asmlinkage void do_rt_sigreturn32(struct pt_regs *regs) int err, i; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; synchronize_user_stack(); regs->u_regs[UREG_FP] &= 0x00000000ffffffffUL; diff --git a/arch/sparc/kernel/signal_32.c b/arch/sparc/kernel/signal_32.c index 9ee72fc8e0e4..52aa5e4ce5e7 100644 --- a/arch/sparc/kernel/signal_32.c +++ b/arch/sparc/kernel/signal_32.c @@ -70,7 +70,7 @@ asmlinkage void do_sigreturn(struct pt_regs *regs) int err; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; synchronize_user_stack(); diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c index 1a6999868031..d88beff47bab 100644 --- a/arch/sparc/kernel/signal_64.c +++ b/arch/sparc/kernel/signal_64.c @@ -254,7 +254,7 @@ void do_rt_sigreturn(struct pt_regs *regs) int err; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; synchronize_user_stack (); sf = (struct rt_signal_frame __user *) diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c index 981a769b9558..a27651e866e7 100644 --- a/arch/sparc/kernel/traps_64.c +++ b/arch/sparc/kernel/traps_64.c @@ -2730,8 +2730,6 @@ void __init trap_init(void) TI_NEW_CHILD != offsetof(struct thread_info, new_child) || TI_CURRENT_DS != offsetof(struct thread_info, current_ds) || - TI_RESTART_BLOCK != offsetof(struct thread_info, - restart_block) || TI_KUNA_REGS != offsetof(struct thread_info, kern_una_regs) || TI_KUNA_INSN != offsetof(struct thread_info, diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h index 48e4fd0f38e4..96c14c1430d8 100644 --- a/arch/tile/include/asm/thread_info.h +++ b/arch/tile/include/asm/thread_info.h @@ -36,7 +36,6 @@ struct thread_info { mm_segment_t addr_limit; /* thread address space (KERNEL_DS or USER_DS) */ - struct restart_block restart_block; struct single_step_state *step_state; /* single step state (if non-zero) */ int align_ctl; /* controls unaligned access */ @@ -57,9 +56,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ .step_state = NULL, \ .align_ctl = 0, \ } diff --git a/arch/tile/kernel/signal.c b/arch/tile/kernel/signal.c index bb0a9ce7ae23..8a524e332c1a 100644 --- a/arch/tile/kernel/signal.c +++ b/arch/tile/kernel/signal.c @@ -48,7 +48,7 @@ int restore_sigcontext(struct pt_regs *regs, int err; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Enforce that sigcontext is like pt_regs, and doesn't mess diff --git a/arch/um/include/asm/thread_info.h b/arch/um/include/asm/thread_info.h index 1c5b2a83046a..e04114c4fcd9 100644 --- a/arch/um/include/asm/thread_info.h +++ b/arch/um/include/asm/thread_info.h @@ -22,7 +22,6 @@ struct thread_info { mm_segment_t addr_limit; /* thread address space: 0-0xBFFFFFFF for user 0-0xFFFFFFFF for kernel */ - struct restart_block restart_block; struct thread_info *real_thread; /* Points to non-IRQ stack */ }; @@ -34,9 +33,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ .real_thread = NULL, \ } diff --git a/arch/unicore32/include/asm/thread_info.h b/arch/unicore32/include/asm/thread_info.h index af36d8eabdf1..63e2839dfeb8 100644 --- a/arch/unicore32/include/asm/thread_info.h +++ b/arch/unicore32/include/asm/thread_info.h @@ -79,7 +79,6 @@ struct thread_info { #ifdef CONFIG_UNICORE_FPU_F64 struct fp_state fpstate __attribute__((aligned(8))); #endif - struct restart_block restart_block; }; #define INIT_THREAD_INFO(tsk) \ @@ -89,9 +88,6 @@ struct thread_info { .flags = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/unicore32/kernel/signal.c b/arch/unicore32/kernel/signal.c index 7c8fb7018dc6..d329f85766cc 100644 --- a/arch/unicore32/kernel/signal.c +++ b/arch/unicore32/kernel/signal.c @@ -105,7 +105,7 @@ asmlinkage int __sys_rt_sigreturn(struct pt_regs *regs) struct rt_sigframe __user *frame; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; /* * Since we stacked the signal on a 64-bit boundary, diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c index f9e181aaba97..d0165c9a2932 100644 --- a/arch/x86/ia32/ia32_signal.c +++ b/arch/x86/ia32/ia32_signal.c @@ -169,7 +169,7 @@ static int ia32_restore_sigcontext(struct pt_regs *regs, u32 tmp; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; get_user_try { /* diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index e82e95abc92b..1d4e4f279a32 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -31,7 +31,6 @@ struct thread_info { __u32 cpu; /* current CPU */ int saved_preempt_count; mm_segment_t addr_limit; - struct restart_block restart_block; void __user *sysenter_return; unsigned int sig_on_uaccess_error:1; unsigned int uaccess_err:1; /* uaccess failed */ @@ -45,9 +44,6 @@ struct thread_info { .cpu = 0, \ .saved_preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 2a33c8f68319..e5042463c1bc 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -69,7 +69,7 @@ int restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc, unsigned int err = 0; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; get_user_try { diff --git a/arch/x86/um/signal.c b/arch/x86/um/signal.c index 79d824551c1a..0c8c32bfd792 100644 --- a/arch/x86/um/signal.c +++ b/arch/x86/um/signal.c @@ -157,7 +157,7 @@ static int copy_sc_from_user(struct pt_regs *regs, int err, pid; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; err = copy_from_user(&sc, from, sizeof(sc)); if (err) diff --git a/arch/xtensa/include/asm/thread_info.h b/arch/xtensa/include/asm/thread_info.h index 470153e8547c..a9b5d3ba196c 100644 --- a/arch/xtensa/include/asm/thread_info.h +++ b/arch/xtensa/include/asm/thread_info.h @@ -51,7 +51,6 @@ struct thread_info { __s32 preempt_count; /* 0 => preemptable,< 0 => BUG*/ mm_segment_t addr_limit; /* thread address space */ - struct restart_block restart_block; unsigned long cpenable; @@ -72,7 +71,6 @@ struct thread_info { #define TI_CPU 0x00000010 #define TI_PRE_COUNT 0x00000014 #define TI_ADDR_LIMIT 0x00000018 -#define TI_RESTART_BLOCK 0x000001C #endif @@ -90,9 +88,6 @@ struct thread_info { .cpu = 0, \ .preempt_count = INIT_PREEMPT_COUNT, \ .addr_limit = KERNEL_DS, \ - .restart_block = { \ - .fn = do_no_restart_syscall, \ - }, \ } #define init_thread_info (init_thread_union.thread_info) diff --git a/arch/xtensa/kernel/signal.c b/arch/xtensa/kernel/signal.c index 4612321c73cc..3d733ba16f28 100644 --- a/arch/xtensa/kernel/signal.c +++ b/arch/xtensa/kernel/signal.c @@ -245,7 +245,7 @@ asmlinkage long xtensa_rt_sigreturn(long a0, long a1, long a2, long a3, int ret; /* Always make any pending restarted system calls return -EINTR */ - current_thread_info()->restart_block.fn = do_no_restart_syscall; + current->restart_block.fn = do_no_restart_syscall; if (regs->depc > 64) panic("rt_sigreturn in double exception!\n"); diff --git a/fs/select.c b/fs/select.c index 467bb1cb3ea5..f684c750e08a 100644 --- a/fs/select.c +++ b/fs/select.c @@ -971,7 +971,7 @@ SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds, if (ret == -EINTR) { struct restart_block *restart_block; - restart_block = ¤t_thread_info()->restart_block; + restart_block = ¤t->restart_block; restart_block->fn = do_restart_poll; restart_block->poll.ufds = ufds; restart_block->poll.nfds = nfds; diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 3037fc085e8e..d3d43ecf148c 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -193,6 +193,9 @@ extern struct task_group root_task_group; .nr_cpus_allowed= NR_CPUS, \ .mm = NULL, \ .active_mm = &init_mm, \ + .restart_block = { \ + .fn = do_no_restart_syscall, \ + }, \ .se = { \ .group_node = LIST_HEAD_INIT(tsk.se.group_node), \ }, \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 8db31ef98d2f..22ee0d5d7f8c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1370,6 +1370,8 @@ struct task_struct { unsigned long atomic_flags; /* Flags needing atomic access. */ + struct restart_block restart_block; + pid_t pid; pid_t tgid; diff --git a/kernel/compat.c b/kernel/compat.c index ebb3c369d03d..24f00610c575 100644 --- a/kernel/compat.c +++ b/kernel/compat.c @@ -276,8 +276,7 @@ COMPAT_SYSCALL_DEFINE2(nanosleep, struct compat_timespec __user *, rqtp, * core implementation decides to return random nonsense. */ if (ret == -ERESTART_RESTARTBLOCK) { - struct restart_block *restart - = ¤t_thread_info()->restart_block; + struct restart_block *restart = ¤t->restart_block; restart->fn = compat_nanosleep_restart; restart->nanosleep.compat_rmtp = rmtp; @@ -860,7 +859,7 @@ COMPAT_SYSCALL_DEFINE4(clock_nanosleep, clockid_t, which_clock, int, flags, return -EFAULT; if (err == -ERESTART_RESTARTBLOCK) { - restart = ¤t_thread_info()->restart_block; + restart = ¤t->restart_block; restart->fn = compat_clock_nanosleep_restart; restart->nanosleep.compat_rmtp = rmtp; } diff --git a/kernel/futex.c b/kernel/futex.c index 4eeb63de7e54..2a5e3830e953 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2217,7 +2217,7 @@ retry: if (!abs_time) goto out; - restart = ¤t_thread_info()->restart_block; + restart = ¤t->restart_block; restart->fn = futex_wait_restart; restart->futex.uaddr = uaddr; restart->futex.val = val; diff --git a/kernel/signal.c b/kernel/signal.c index 16a305295256..33a52759cc0e 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2501,7 +2501,7 @@ EXPORT_SYMBOL(unblock_all_signals); */ SYSCALL_DEFINE0(restart_syscall) { - struct restart_block *restart = ¤t_thread_info()->restart_block; + struct restart_block *restart = ¤t->restart_block; return restart->fn(restart); } diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c index a7077d3ae52f..1b001ed1edb9 100644 --- a/kernel/time/alarmtimer.c +++ b/kernel/time/alarmtimer.c @@ -788,7 +788,7 @@ static int alarm_timer_nsleep(const clockid_t which_clock, int flags, goto out; } - restart = ¤t_thread_info()->restart_block; + restart = ¤t->restart_block; restart->fn = alarm_timer_nsleep_restart; restart->nanosleep.clockid = type; restart->nanosleep.expires = exp.tv64; diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 3f5e183c3d97..bee0c1f78091 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -1583,7 +1583,7 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp, goto out; } - restart = ¤t_thread_info()->restart_block; + restart = ¤t->restart_block; restart->fn = hrtimer_nanosleep_restart; restart->nanosleep.clockid = t.timer.base->clockid; restart->nanosleep.rmtp = rmtp; diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c index a16b67859e2a..0075da74abf0 100644 --- a/kernel/time/posix-cpu-timers.c +++ b/kernel/time/posix-cpu-timers.c @@ -1334,8 +1334,7 @@ static long posix_cpu_nsleep_restart(struct restart_block *restart_block); static int posix_cpu_nsleep(const clockid_t which_clock, int flags, struct timespec *rqtp, struct timespec __user *rmtp) { - struct restart_block *restart_block = - ¤t_thread_info()->restart_block; + struct restart_block *restart_block = ¤t->restart_block; struct itimerspec it; int error; -- cgit v1.2.3 From dd10ca6c95600c0c357e724da4e5eb47066e3854 Mon Sep 17 00:00:00 2001 From: Andrey Skvortsov Date: Thu, 12 Feb 2015 15:01:19 -0800 Subject: gitignore: ignore tar-install build directory Have git ignore the Debian directory created when running: make tar-pkg / targz-pkg / tarbz2-pkg / tarxz-pkg Signed-off-by: Andrey Skvortsov Cc: Michal Marek Cc: Greg Kroah-Hartman Cc: Boaz Harrosh Cc: Andi Kleen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- .gitignore | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/.gitignore b/.gitignore index ce57b79670a5..9ac91060ea64 100644 --- a/.gitignore +++ b/.gitignore @@ -52,6 +52,11 @@ Module.symvers # /debian/ +# +# tar directory (make tar*-pkg) +# +/tar-install/ + # # git files that we don't want to ignore even it they are dot-files # -- cgit v1.2.3 From dd4a5c1e6531b9c6ba6ee1f3cb11f0aeea25881e Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Thu, 12 Feb 2015 15:01:22 -0800 Subject: linux/types.h: Always use unsigned long for pgoff_t Everybody uses unsigned long for pgoff_t, and no one ever overrode the definition of pgoff_t. Keep it that way, and remove the option of overriding it. Signed-off-by: Geert Uytterhoeven Cc: Randy Dunlap Cc: Arnd Bergmann Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/types.h | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/include/linux/types.h b/include/linux/types.h index 62323825cff9..6747247e3f9f 100644 --- a/include/linux/types.h +++ b/include/linux/types.h @@ -135,12 +135,9 @@ typedef unsigned long blkcnt_t; #endif /* - * The type of an index into the pagecache. Use a #define so asm/types.h - * can override it. + * The type of an index into the pagecache. */ -#ifndef pgoff_t #define pgoff_t unsigned long -#endif /* A dma_addr_t can hold any valid DMA or bus address for the platform */ #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT -- cgit v1.2.3 From 545a2bf742fb41f17d03486dd8a8c74ad511dec2 Mon Sep 17 00:00:00 2001 From: Cyril Bur Date: Thu, 12 Feb 2015 15:01:24 -0800 Subject: kernel/sched/clock.c: add another clock for use with the soft lockup watchdog When the hypervisor pauses a virtualised kernel the kernel will observe a jump in timebase, this can cause spurious messages from the softlockup detector. Whilst these messages are harmless, they are accompanied with a stack trace which causes undue concern and more problematically the stack trace in the guest has nothing to do with the observed problem and can only be misleading. Futhermore, on POWER8 this is completely avoidable with the introduction of the Virtual Time Base (VTB) register. This patch (of 2): This permits the use of arch specific clocks for which virtualised kernels can use their notion of 'running' time, not the elpased wall time which will include host execution time. Signed-off-by: Cyril Bur Cc: Michael Ellerman Cc: Andrew Jones Acked-by: Don Zickus Cc: Ingo Molnar Cc: Ulrich Obergfell Cc: chai wen Cc: Fabian Frederick Cc: Aaron Tomlin Cc: Ben Zhang Cc: Martin Schwidefsky Cc: John Stultz Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/sched.h | 1 + kernel/sched/clock.c | 13 +++++++++++++ kernel/watchdog.c | 2 +- 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 22ee0d5d7f8c..048b91b983ed 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2147,6 +2147,7 @@ extern unsigned long long notrace sched_clock(void); */ extern u64 cpu_clock(int cpu); extern u64 local_clock(void); +extern u64 running_clock(void); extern u64 sched_clock_cpu(int cpu); diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c index c27e4f8f4879..c0a205101c23 100644 --- a/kernel/sched/clock.c +++ b/kernel/sched/clock.c @@ -420,3 +420,16 @@ u64 local_clock(void) EXPORT_SYMBOL_GPL(cpu_clock); EXPORT_SYMBOL_GPL(local_clock); + +/* + * Running clock - returns the time that has elapsed while a guest has been + * running. + * On a guest this value should be local_clock minus the time the guest was + * suspended by the hypervisor (for any reason). + * On bare metal this function should return the same as local_clock. + * Architectures and sub-architectures can override this. + */ +u64 __weak running_clock(void) +{ + return local_clock(); +} diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 70bf11815f84..3174bf8e3538 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -154,7 +154,7 @@ static int get_softlockup_thresh(void) */ static unsigned long get_timestamp(void) { - return local_clock() >> 30LL; /* 2^30 ~= 10^9 */ + return running_clock() >> 30LL; /* 2^30 ~= 10^9 */ } static void set_sample_period(void) -- cgit v1.2.3 From 4be1b29795d692d512bb67b770665d6f8ea5cb0b Mon Sep 17 00:00:00 2001 From: Cyril Bur Date: Thu, 12 Feb 2015 15:01:28 -0800 Subject: powerpc: add running_clock for powerpc to prevent spurious softlockup warnings On POWER8 virtualised kernels the VTB register can be read to have a view of time that only increases while the guest is running. This will prevent guests from seeing time jump if a guest is paused for significant amounts of time. On POWER7 and below virtualised kernels stolen time is subtracted from local_clock as a best effort approximation. This will not eliminate spurious warnings in the case of a suspended guest but may reduce the occurance in the case of softlockups due to host over commit. Bare metal kernels should avoid reading the VTB as KVM does not restore sane values when not executing, the approxmation is fine as host kernels won't observe any stolen time. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Cyril Bur Cc: Michael Ellerman Cc: Andrew Jones Acked-by: Don Zickus Cc: Ingo Molnar Cc: Ulrich Obergfell Cc: chai wen Cc: Fabian Frederick Cc: Aaron Tomlin Cc: Ben Zhang Cc: Martin Schwidefsky Cc: John Stultz Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/powerpc/kernel/time.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index fa7c4f12104f..7316dd15278a 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -621,6 +621,38 @@ unsigned long long sched_clock(void) return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift; } + +#ifdef CONFIG_PPC_PSERIES + +/* + * Running clock - attempts to give a view of time passing for a virtualised + * kernels. + * Uses the VTB register if available otherwise a next best guess. + */ +unsigned long long running_clock(void) +{ + /* + * Don't read the VTB as a host since KVM does not switch in host + * timebase into the VTB when it takes a guest off the CPU, reading the + * VTB would result in reading 'last switched out' guest VTB. + * + * Host kernels are often compiled with CONFIG_PPC_PSERIES checked, it + * would be unsafe to rely only on the #ifdef above. + */ + if (firmware_has_feature(FW_FEATURE_LPAR) && + cpu_has_feature(CPU_FTR_ARCH_207S)) + return mulhdu(get_vtb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift; + + /* + * This is a next best approximation without a VTB. + * On a host which is running bare metal there should never be any stolen + * time and on a host which doesn't do any virtualisation TB *should* equal + * VTB so it makes no difference anyway. + */ + return local_clock() - cputime_to_nsecs(kcpustat_this_cpu->cpustat[CPUTIME_STEAL]); +} +#endif + static int __init get_freq(char *name, int cells, unsigned long *val) { struct device_node *cpu; -- cgit v1.2.3 From 02f1f2170d2831b3233e91091c60a66622f29e82 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:01:31 -0800 Subject: kernel.h: remove ancient __FUNCTION__ hack __FUNCTION__ hasn't been treated as a string literal since gcc 3.4, so this only helps people who only test-compile using 3.3 (compiler-gcc3.h barks at anything older than that). Besides, there are almost no occurrences of __FUNCTION__ left in the tree. [akpm@linux-foundation.org: convert remaining __FUNCTION__ references] Signed-off-by: Rasmus Villemoes Cc: Michal Nazarewicz Cc: Joe Perches Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/x86/kernel/hpet.c | 2 +- arch/x86/kernel/rtc.c | 4 ++-- arch/x86/platform/intel-mid/intel_mid_vrtc.c | 2 +- drivers/acpi/acpica/utdebug.c | 4 ++-- drivers/block/xen-blkfront.c | 2 +- include/acpi/acoutput.h | 6 +++--- include/linux/kernel.h | 3 --- 7 files changed, 10 insertions(+), 13 deletions(-) diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c index 319bcb9372fe..3acbff4716b0 100644 --- a/arch/x86/kernel/hpet.c +++ b/arch/x86/kernel/hpet.c @@ -168,7 +168,7 @@ static void _hpet_print_config(const char *function, int line) #define hpet_print_config() \ do { \ if (hpet_verbose) \ - _hpet_print_config(__FUNCTION__, __LINE__); \ + _hpet_print_config(__func__, __LINE__); \ } while (0) /* diff --git a/arch/x86/kernel/rtc.c b/arch/x86/kernel/rtc.c index fe3dbfe0c4a5..cd9685235df9 100644 --- a/arch/x86/kernel/rtc.c +++ b/arch/x86/kernel/rtc.c @@ -49,11 +49,11 @@ int mach_set_rtc_mmss(const struct timespec *now) retval = set_rtc_time(&tm); if (retval) printk(KERN_ERR "%s: RTC write failed with error %d\n", - __FUNCTION__, retval); + __func__, retval); } else { printk(KERN_ERR "%s: Invalid RTC value: write of %lx to RTC failed\n", - __FUNCTION__, nowtime); + __func__, nowtime); retval = -EINVAL; } return retval; diff --git a/arch/x86/platform/intel-mid/intel_mid_vrtc.c b/arch/x86/platform/intel-mid/intel_mid_vrtc.c index 4762cff7facd..32947ba0f62d 100644 --- a/arch/x86/platform/intel-mid/intel_mid_vrtc.c +++ b/arch/x86/platform/intel-mid/intel_mid_vrtc.c @@ -110,7 +110,7 @@ int vrtc_set_mmss(const struct timespec *now) spin_unlock_irqrestore(&rtc_lock, flags); } else { pr_err("%s: Invalid vRTC value: write of %lx to vRTC failed\n", - __FUNCTION__, now->tv_sec); + __func__, now->tv_sec); retval = -EINVAL; } return retval; diff --git a/drivers/acpi/acpica/utdebug.c b/drivers/acpi/acpica/utdebug.c index 57078e3ea9b7..4f3f888d33bb 100644 --- a/drivers/acpi/acpica/utdebug.c +++ b/drivers/acpi/acpica/utdebug.c @@ -111,8 +111,8 @@ void acpi_ut_track_stack_ptr(void) * RETURN: Updated pointer to the function name * * DESCRIPTION: Remove the "Acpi" prefix from the function name, if present. - * This allows compiler macros such as __FUNCTION__ to be used - * with no change to the debug output. + * This allows compiler macros such as __func__ to be used with no + * change to the debug output. * ******************************************************************************/ diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c index 2236c6f31608..d2cae5fc211a 100644 --- a/drivers/block/xen-blkfront.c +++ b/drivers/block/xen-blkfront.c @@ -1391,7 +1391,7 @@ static int blkfront_probe(struct xenbus_device *dev, if (major != XENVBD_MAJOR) { printk(KERN_INFO "%s: HVM does not support vbd %d as xen block device\n", - __FUNCTION__, vdevice); + __func__, vdevice); return -ENODEV; } } diff --git a/include/acpi/acoutput.h b/include/acpi/acoutput.h index 9318a87ee39a..a8f344363e77 100644 --- a/include/acpi/acoutput.h +++ b/include/acpi/acoutput.h @@ -240,7 +240,7 @@ /* * If ACPI_GET_FUNCTION_NAME was not defined in the compiler-dependent header, * define it now. This is the case where there the compiler does not support - * a __FUNCTION__ macro or equivalent. + * a __func__ macro or equivalent. */ #ifndef ACPI_GET_FUNCTION_NAME #define ACPI_GET_FUNCTION_NAME _acpi_function_name @@ -249,12 +249,12 @@ * The Name parameter should be the procedure name as a quoted string. * The function name is also used by the function exit macros below. * Note: (const char) is used to be compatible with the debug interfaces - * and macros such as __FUNCTION__. + * and macros such as __func__. */ #define ACPI_FUNCTION_NAME(name) static const char _acpi_function_name[] = #name; #else -/* Compiler supports __FUNCTION__ (or equivalent) -- Ignore this macro */ +/* Compiler supports __func__ (or equivalent) -- Ignore this macro */ #define ACPI_FUNCTION_NAME(name) #endif /* ACPI_GET_FUNCTION_NAME */ diff --git a/include/linux/kernel.h b/include/linux/kernel.h index e42e7dc34c68..d6d630d31ef3 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -800,9 +800,6 @@ static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { } const typeof( ((type *)0)->member ) *__mptr = (ptr); \ (type *)( (char *)__mptr - offsetof(type,member) );}) -/* Trap pasters of __FUNCTION__ at compile-time */ -#define __FUNCTION__ (__func__) - /* Rebuild everything on CONFIG_FTRACE_MCOUNT_RECORD */ #ifdef CONFIG_FTRACE_MCOUNT_RECORD # define REBUILD_DUE_TO_FTRACE_MCOUNT_RECORD -- cgit v1.2.3 From 205bd3d23e938530cb89ff9e14afda389ac85dc3 Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Thu, 12 Feb 2015 15:01:34 -0800 Subject: printk: correct timeout comment, neaten MODULE_PARM_DESC Neaten the MODULE_PARAM_DESC message. Use 30 seconds in the comment for the zap console locks timeout. Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/printk/printk.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 02d6b6d28796..c06df7de0963 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -935,8 +935,8 @@ static int __init ignore_loglevel_setup(char *str) early_param("ignore_loglevel", ignore_loglevel_setup); module_param(ignore_loglevel, bool, S_IRUGO | S_IWUSR); -MODULE_PARM_DESC(ignore_loglevel, "ignore loglevel setting, to" - "print all kernel messages to the console."); +MODULE_PARM_DESC(ignore_loglevel, + "ignore loglevel setting (prints all kernel messages to the console)"); #ifdef CONFIG_BOOT_PRINTK_DELAY @@ -1419,16 +1419,16 @@ static void call_console_drivers(int level, const char *text, size_t len) } /* - * Zap console related locks when oopsing. Only zap at most once - * every 10 seconds, to leave time for slow consoles to print a - * full oops. + * Zap console related locks when oopsing. + * To leave time for slow consoles to print a full oops, + * only zap at most once every 30 seconds. */ static void zap_locks(void) { static unsigned long oops_timestamp; if (time_after_eq(jiffies, oops_timestamp) && - !time_after(jiffies, oops_timestamp + 30 * HZ)) + !time_after(jiffies, oops_timestamp + 30 * HZ)) return; oops_timestamp = jiffies; -- cgit v1.2.3 From ffbfed03b4bdd229b99c2611f5ace1fbc912caaa Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:01:37 -0800 Subject: lib/vsprintf.c: consume 'p' in format_decode It seems a little simpler to consume the p from a %p specifier in format_decode, just as it is done for the surrounding %c, %s and %% cases. While there, delete a redundant and misplaced comment. Signed-off-by: Rasmus Villemoes Cc: Jiri Kosina Cc: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/vsprintf.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/lib/vsprintf.c b/lib/vsprintf.c index ec337f64f52d..98ad170b10e0 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -1604,8 +1604,7 @@ qualifier: case 'p': spec->type = FORMAT_TYPE_PTR; - return fmt - start; - /* skip alnum */ + return ++fmt - start; case '%': spec->type = FORMAT_TYPE_PERCENT_CHAR; @@ -1794,7 +1793,7 @@ int vsnprintf(char *buf, size_t size, const char *fmt, va_list args) break; case FORMAT_TYPE_PTR: - str = pointer(fmt+1, str, end, va_arg(args, void *), + str = pointer(fmt, str, end, va_arg(args, void *), spec); while (isalnum(*fmt)) fmt++; @@ -2232,7 +2231,7 @@ int bstr_printf(char *buf, size_t size, const char *fmt, const u32 *bin_buf) } case FORMAT_TYPE_PTR: - str = pointer(fmt+1, str, end, get_arg(void *), spec); + str = pointer(fmt, str, end, get_arg(void *), spec); while (isalnum(*fmt)) fmt++; break; -- cgit v1.2.3 From 2aa2f9e21e4eb25c720b2e7d80f8929638f6ad73 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:01:39 -0800 Subject: lib/vsprintf.c: improve sanity check in vsnprintf() On 64 bit, size may very well be huge even if bit 31 happens to be 0. Somehow it doesn't feel right that one can pass a 5 GiB buffer but not a 3 GiB one. So cap at INT_MAX as was probably the intention all along. This is also the made-up value passed by sprintf and vsprintf. Signed-off-by: Rasmus Villemoes Cc: Jiri Kosina Cc: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/vsprintf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/vsprintf.c b/lib/vsprintf.c index 98ad170b10e0..cf12ba86205c 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -1727,7 +1727,7 @@ int vsnprintf(char *buf, size_t size, const char *fmt, va_list args) /* Reject out-of-range values early. Large positive sizes are used for unknown buffer sizes. */ - if (WARN_ON_ONCE((int) size < 0)) + if (WARN_ON_ONCE(size > INT_MAX)) return 0; str = buf; -- cgit v1.2.3 From 43e5b666cf25516b5c27cd10c47d287dc9d1f376 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:01:42 -0800 Subject: lib/vsprintf.c: replace while with do-while in skip_atoi All callers of skip_atoi have already checked for the first character being a digit. In this case, gcc generates simpler code for a do while-loop. Signed-off-by: Rasmus Villemoes Cc: Jiri Kosina Cc: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/vsprintf.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/lib/vsprintf.c b/lib/vsprintf.c index cf12ba86205c..602d2081e713 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -114,8 +114,9 @@ int skip_atoi(const char **s) { int i = 0; - while (isdigit(**s)) + do { i = i*10 + *((*s)++) - '0'; + } while (isdigit(**s)); return i; } -- cgit v1.2.3 From 7eed8fde021b4e169e325e5f50d9f12320668bf2 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:01:45 -0800 Subject: lib/string_helpers.c:string_get_size(): remove redundant prefixes While commit 3c9f3681d0b4 ("[SCSI] lib: add generic helper to print sizes rounded to the correct SI range") says that Z and Y are included in preparation for 128 bit computers, they just waste .text currently. If and when we get u128, string_get_size needs updating anyway (and ISO needs to come up with four more prefixes). Also there's no need to include and test for the NULL sentinel; once we reach "E" size is at most 18. [The test is also wrong; it should be units_str[units][i+1]; if we've reached NULL we're already doomed.] Signed-off-by: Rasmus Villemoes Cc: James Bottomley Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/string_helpers.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/lib/string_helpers.c b/lib/string_helpers.c index 58b78ba57439..0d25f7aa732c 100644 --- a/lib/string_helpers.c +++ b/lib/string_helpers.c @@ -28,11 +28,10 @@ int string_get_size(u64 size, const enum string_size_units units, char *buf, int len) { static const char *const units_10[] = { - "B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB", NULL + "B", "kB", "MB", "GB", "TB", "PB", "EB" }; static const char *const units_2[] = { - "B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB", - NULL + "B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB" }; static const char *const *const units_str[] = { [STRING_UNITS_10] = units_10, @@ -49,7 +48,7 @@ int string_get_size(u64 size, const enum string_size_units units, tmp[0] = '\0'; i = 0; if (size >= divisor[units]) { - while (size >= divisor[units] && units_str[units][i]) { + while (size >= divisor[units]) { remainder = do_div(size, divisor[units]); i++; } -- cgit v1.2.3 From 84b9fbedf54a6ea4fba62ef8a167138233586ad3 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:01:48 -0800 Subject: lib/string_helpers.c:string_get_size(): use 32 bit arithmetic when possible The remainder from do_div is always a u32, and after size has been reduced to be below 1000 (or 1024), it certainly fits in u32. So both remainder and sf_cap can be made u32s, the format specifiers can be simplified (%lld wasn't the right thing to use for _unsigned_ long long anyway), and we can replace a do_div with an ordinary 32/32 bit division. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/string_helpers.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/lib/string_helpers.c b/lib/string_helpers.c index 0d25f7aa732c..2b3757f84b3b 100644 --- a/lib/string_helpers.c +++ b/lib/string_helpers.c @@ -42,7 +42,7 @@ int string_get_size(u64 size, const enum string_size_units units, [STRING_UNITS_2] = 1024, }; int i, j; - u64 remainder = 0, sf_cap; + u32 remainder = 0, sf_cap; char tmp[8]; tmp[0] = '\0'; @@ -59,14 +59,13 @@ int string_get_size(u64 size, const enum string_size_units units, if (j) { remainder *= 1000; - do_div(remainder, divisor[units]); - snprintf(tmp, sizeof(tmp), ".%03lld", - (unsigned long long)remainder); + remainder /= divisor[units]; + snprintf(tmp, sizeof(tmp), ".%03u", remainder); tmp[j+1] = '\0'; } } - snprintf(buf, len, "%lld%s %s", (unsigned long long)size, + snprintf(buf, len, "%u%s %s", (u32)size, tmp, units_str[units][i]); return 0; -- cgit v1.2.3 From d1214c65c02d503330ce86bd38e344a36599e055 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:01:50 -0800 Subject: libstring_helpers.c:string_get_size(): return void string_get_size() was documented to return an error, but in fact always returned 0. Since the output always fits in 9 bytes, just document that and let callers do what they do now: pass a small stack buffer and ignore the return value. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/string_helpers.h | 4 ++-- lib/string_helpers.c | 10 ++++------ 2 files changed, 6 insertions(+), 8 deletions(-) diff --git a/include/linux/string_helpers.h b/include/linux/string_helpers.h index 6eb567ac56bc..657571817260 100644 --- a/include/linux/string_helpers.h +++ b/include/linux/string_helpers.h @@ -10,8 +10,8 @@ enum string_size_units { STRING_UNITS_2, /* use binary powers of 2^10 */ }; -int string_get_size(u64 size, enum string_size_units units, - char *buf, int len); +void string_get_size(u64 size, enum string_size_units units, + char *buf, int len); #define UNESCAPE_SPACE 0x01 #define UNESCAPE_OCTAL 0x02 diff --git a/lib/string_helpers.c b/lib/string_helpers.c index 2b3757f84b3b..8f8c4417f228 100644 --- a/lib/string_helpers.c +++ b/lib/string_helpers.c @@ -20,12 +20,12 @@ * @len: length of buffer * * This function returns a string formatted to 3 significant figures - * giving the size in the required units. Returns 0 on success or - * error on failure. @buf is always zero terminated. + * giving the size in the required units. @buf should have room for + * at least 9 bytes and will always be zero terminated. * */ -int string_get_size(u64 size, const enum string_size_units units, - char *buf, int len) +void string_get_size(u64 size, const enum string_size_units units, + char *buf, int len) { static const char *const units_10[] = { "B", "kB", "MB", "GB", "TB", "PB", "EB" @@ -67,8 +67,6 @@ int string_get_size(u64 size, const enum string_size_units units, snprintf(buf, len, "%u%s %s", (u32)size, tmp, units_str[units][i]); - - return 0; } EXPORT_SYMBOL(string_get_size); -- cgit v1.2.3 From 8b4daad52fee7731ea4bae22a99ece2d4ba7ba43 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:01:53 -0800 Subject: lib/bitmap.c: more signed->unsigned conversions For consistency with the other bitmap_* functions, also make the nbits parameter of bitmap_zero, bitmap_fill and bitmap_copy unsigned. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/bitmap.h | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h index 202e4034fe26..1406d5453781 100644 --- a/include/linux/bitmap.h +++ b/include/linux/bitmap.h @@ -185,33 +185,33 @@ extern int bitmap_print_to_pagebuf(bool list, char *buf, #define small_const_nbits(nbits) \ (__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG) -static inline void bitmap_zero(unsigned long *dst, int nbits) +static inline void bitmap_zero(unsigned long *dst, unsigned int nbits) { if (small_const_nbits(nbits)) *dst = 0UL; else { - int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long); + unsigned int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long); memset(dst, 0, len); } } -static inline void bitmap_fill(unsigned long *dst, int nbits) +static inline void bitmap_fill(unsigned long *dst, unsigned int nbits) { - size_t nlongs = BITS_TO_LONGS(nbits); + unsigned int nlongs = BITS_TO_LONGS(nbits); if (!small_const_nbits(nbits)) { - int len = (nlongs - 1) * sizeof(unsigned long); + unsigned int len = (nlongs - 1) * sizeof(unsigned long); memset(dst, 0xff, len); } dst[nlongs - 1] = BITMAP_LAST_WORD_MASK(nbits); } static inline void bitmap_copy(unsigned long *dst, const unsigned long *src, - int nbits) + unsigned int nbits) { if (small_const_nbits(nbits)) *dst = *src; else { - int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long); + unsigned int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long); memcpy(dst, src, len); } } -- cgit v1.2.3 From 33c4fa8c6763f1ba9f4ea64079882eaa6d7957b7 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:01:56 -0800 Subject: linux/nodemask.h: update bitmap wrappers to take unsigned int Since the various bitmap_* functions now take an unsigned int as nbits parameter, it makes sense to also update the various wrappers. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/nodemask.h | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h index 83a6aeda899d..21cef483dc1b 100644 --- a/include/linux/nodemask.h +++ b/include/linux/nodemask.h @@ -120,13 +120,13 @@ static inline void __node_clear(int node, volatile nodemask_t *dstp) } #define nodes_setall(dst) __nodes_setall(&(dst), MAX_NUMNODES) -static inline void __nodes_setall(nodemask_t *dstp, int nbits) +static inline void __nodes_setall(nodemask_t *dstp, unsigned int nbits) { bitmap_fill(dstp->bits, nbits); } #define nodes_clear(dst) __nodes_clear(&(dst), MAX_NUMNODES) -static inline void __nodes_clear(nodemask_t *dstp, int nbits) +static inline void __nodes_clear(nodemask_t *dstp, unsigned int nbits) { bitmap_zero(dstp->bits, nbits); } @@ -144,7 +144,7 @@ static inline int __node_test_and_set(int node, nodemask_t *addr) #define nodes_and(dst, src1, src2) \ __nodes_and(&(dst), &(src1), &(src2), MAX_NUMNODES) static inline void __nodes_and(nodemask_t *dstp, const nodemask_t *src1p, - const nodemask_t *src2p, int nbits) + const nodemask_t *src2p, unsigned int nbits) { bitmap_and(dstp->bits, src1p->bits, src2p->bits, nbits); } @@ -152,7 +152,7 @@ static inline void __nodes_and(nodemask_t *dstp, const nodemask_t *src1p, #define nodes_or(dst, src1, src2) \ __nodes_or(&(dst), &(src1), &(src2), MAX_NUMNODES) static inline void __nodes_or(nodemask_t *dstp, const nodemask_t *src1p, - const nodemask_t *src2p, int nbits) + const nodemask_t *src2p, unsigned int nbits) { bitmap_or(dstp->bits, src1p->bits, src2p->bits, nbits); } @@ -160,7 +160,7 @@ static inline void __nodes_or(nodemask_t *dstp, const nodemask_t *src1p, #define nodes_xor(dst, src1, src2) \ __nodes_xor(&(dst), &(src1), &(src2), MAX_NUMNODES) static inline void __nodes_xor(nodemask_t *dstp, const nodemask_t *src1p, - const nodemask_t *src2p, int nbits) + const nodemask_t *src2p, unsigned int nbits) { bitmap_xor(dstp->bits, src1p->bits, src2p->bits, nbits); } @@ -168,7 +168,7 @@ static inline void __nodes_xor(nodemask_t *dstp, const nodemask_t *src1p, #define nodes_andnot(dst, src1, src2) \ __nodes_andnot(&(dst), &(src1), &(src2), MAX_NUMNODES) static inline void __nodes_andnot(nodemask_t *dstp, const nodemask_t *src1p, - const nodemask_t *src2p, int nbits) + const nodemask_t *src2p, unsigned int nbits) { bitmap_andnot(dstp->bits, src1p->bits, src2p->bits, nbits); } @@ -176,7 +176,7 @@ static inline void __nodes_andnot(nodemask_t *dstp, const nodemask_t *src1p, #define nodes_complement(dst, src) \ __nodes_complement(&(dst), &(src), MAX_NUMNODES) static inline void __nodes_complement(nodemask_t *dstp, - const nodemask_t *srcp, int nbits) + const nodemask_t *srcp, unsigned int nbits) { bitmap_complement(dstp->bits, srcp->bits, nbits); } @@ -184,7 +184,7 @@ static inline void __nodes_complement(nodemask_t *dstp, #define nodes_equal(src1, src2) \ __nodes_equal(&(src1), &(src2), MAX_NUMNODES) static inline int __nodes_equal(const nodemask_t *src1p, - const nodemask_t *src2p, int nbits) + const nodemask_t *src2p, unsigned int nbits) { return bitmap_equal(src1p->bits, src2p->bits, nbits); } @@ -192,7 +192,7 @@ static inline int __nodes_equal(const nodemask_t *src1p, #define nodes_intersects(src1, src2) \ __nodes_intersects(&(src1), &(src2), MAX_NUMNODES) static inline int __nodes_intersects(const nodemask_t *src1p, - const nodemask_t *src2p, int nbits) + const nodemask_t *src2p, unsigned int nbits) { return bitmap_intersects(src1p->bits, src2p->bits, nbits); } @@ -200,25 +200,25 @@ static inline int __nodes_intersects(const nodemask_t *src1p, #define nodes_subset(src1, src2) \ __nodes_subset(&(src1), &(src2), MAX_NUMNODES) static inline int __nodes_subset(const nodemask_t *src1p, - const nodemask_t *src2p, int nbits) + const nodemask_t *src2p, unsigned int nbits) { return bitmap_subset(src1p->bits, src2p->bits, nbits); } #define nodes_empty(src) __nodes_empty(&(src), MAX_NUMNODES) -static inline int __nodes_empty(const nodemask_t *srcp, int nbits) +static inline int __nodes_empty(const nodemask_t *srcp, unsigned int nbits) { return bitmap_empty(srcp->bits, nbits); } #define nodes_full(nodemask) __nodes_full(&(nodemask), MAX_NUMNODES) -static inline int __nodes_full(const nodemask_t *srcp, int nbits) +static inline int __nodes_full(const nodemask_t *srcp, unsigned int nbits) { return bitmap_full(srcp->bits, nbits); } #define nodes_weight(nodemask) __nodes_weight(&(nodemask), MAX_NUMNODES) -static inline int __nodes_weight(const nodemask_t *srcp, int nbits) +static inline int __nodes_weight(const nodemask_t *srcp, unsigned int nbits) { return bitmap_weight(srcp->bits, nbits); } -- cgit v1.2.3 From f5ac1f55204fec9d9c63644bc1de0ab6a59af9f1 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:01:59 -0800 Subject: linux/cpumask.h: update bitmap wrappers to take unsigned int Since the various bitmap_* functions now take an unsigned int as nbits parameter, it makes sense to also update the various wrappers, even though they're marked as obsolete. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/cpumask.h | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index b950e9d6008b..ff9044286d88 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -905,13 +905,13 @@ static inline void __cpu_clear(int cpu, volatile cpumask_t *dstp) } #define cpus_setall(dst) __cpus_setall(&(dst), NR_CPUS) -static inline void __cpus_setall(cpumask_t *dstp, int nbits) +static inline void __cpus_setall(cpumask_t *dstp, unsigned int nbits) { bitmap_fill(dstp->bits, nbits); } #define cpus_clear(dst) __cpus_clear(&(dst), NR_CPUS) -static inline void __cpus_clear(cpumask_t *dstp, int nbits) +static inline void __cpus_clear(cpumask_t *dstp, unsigned int nbits) { bitmap_zero(dstp->bits, nbits); } @@ -927,21 +927,21 @@ static inline int __cpu_test_and_set(int cpu, cpumask_t *addr) #define cpus_and(dst, src1, src2) __cpus_and(&(dst), &(src1), &(src2), NR_CPUS) static inline int __cpus_and(cpumask_t *dstp, const cpumask_t *src1p, - const cpumask_t *src2p, int nbits) + const cpumask_t *src2p, unsigned int nbits) { return bitmap_and(dstp->bits, src1p->bits, src2p->bits, nbits); } #define cpus_or(dst, src1, src2) __cpus_or(&(dst), &(src1), &(src2), NR_CPUS) static inline void __cpus_or(cpumask_t *dstp, const cpumask_t *src1p, - const cpumask_t *src2p, int nbits) + const cpumask_t *src2p, unsigned int nbits) { bitmap_or(dstp->bits, src1p->bits, src2p->bits, nbits); } #define cpus_xor(dst, src1, src2) __cpus_xor(&(dst), &(src1), &(src2), NR_CPUS) static inline void __cpus_xor(cpumask_t *dstp, const cpumask_t *src1p, - const cpumask_t *src2p, int nbits) + const cpumask_t *src2p, unsigned int nbits) { bitmap_xor(dstp->bits, src1p->bits, src2p->bits, nbits); } @@ -949,40 +949,40 @@ static inline void __cpus_xor(cpumask_t *dstp, const cpumask_t *src1p, #define cpus_andnot(dst, src1, src2) \ __cpus_andnot(&(dst), &(src1), &(src2), NR_CPUS) static inline int __cpus_andnot(cpumask_t *dstp, const cpumask_t *src1p, - const cpumask_t *src2p, int nbits) + const cpumask_t *src2p, unsigned int nbits) { return bitmap_andnot(dstp->bits, src1p->bits, src2p->bits, nbits); } #define cpus_equal(src1, src2) __cpus_equal(&(src1), &(src2), NR_CPUS) static inline int __cpus_equal(const cpumask_t *src1p, - const cpumask_t *src2p, int nbits) + const cpumask_t *src2p, unsigned int nbits) { return bitmap_equal(src1p->bits, src2p->bits, nbits); } #define cpus_intersects(src1, src2) __cpus_intersects(&(src1), &(src2), NR_CPUS) static inline int __cpus_intersects(const cpumask_t *src1p, - const cpumask_t *src2p, int nbits) + const cpumask_t *src2p, unsigned int nbits) { return bitmap_intersects(src1p->bits, src2p->bits, nbits); } #define cpus_subset(src1, src2) __cpus_subset(&(src1), &(src2), NR_CPUS) static inline int __cpus_subset(const cpumask_t *src1p, - const cpumask_t *src2p, int nbits) + const cpumask_t *src2p, unsigned int nbits) { return bitmap_subset(src1p->bits, src2p->bits, nbits); } #define cpus_empty(src) __cpus_empty(&(src), NR_CPUS) -static inline int __cpus_empty(const cpumask_t *srcp, int nbits) +static inline int __cpus_empty(const cpumask_t *srcp, unsigned int nbits) { return bitmap_empty(srcp->bits, nbits); } #define cpus_weight(cpumask) __cpus_weight(&(cpumask), NR_CPUS) -static inline int __cpus_weight(const cpumask_t *srcp, int nbits) +static inline int __cpus_weight(const cpumask_t *srcp, unsigned int nbits) { return bitmap_weight(srcp->bits, nbits); } -- cgit v1.2.3 From eb5698837881687841c9e477e4162ac3387c6b59 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:01 -0800 Subject: lib/bitmap.c: update bitmap_onto to unsigned Change the nbits parameter of bitmap_onto to unsigned int for consistency with other bitmap_* functions. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/bitmap.h | 2 +- lib/bitmap.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h index 1406d5453781..d0c6214eb190 100644 --- a/include/linux/bitmap.h +++ b/include/linux/bitmap.h @@ -164,7 +164,7 @@ extern void bitmap_remap(unsigned long *dst, const unsigned long *src, extern int bitmap_bitremap(int oldbit, const unsigned long *old, const unsigned long *new, int bits); extern void bitmap_onto(unsigned long *dst, const unsigned long *orig, - const unsigned long *relmap, int bits); + const unsigned long *relmap, unsigned int bits); extern void bitmap_fold(unsigned long *dst, const unsigned long *orig, int sz, int bits); extern int bitmap_find_free_region(unsigned long *bitmap, unsigned int bits, int order); diff --git a/lib/bitmap.c b/lib/bitmap.c index 324ea9eab8c1..ee598a496895 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -1006,9 +1006,9 @@ EXPORT_SYMBOL(bitmap_bitremap); * All bits in @dst not set by the above rule are cleared. */ void bitmap_onto(unsigned long *dst, const unsigned long *orig, - const unsigned long *relmap, int bits) + const unsigned long *relmap, unsigned int bits) { - int n, m; /* same meaning as in above comment */ + unsigned int n, m; /* same meaning as in above comment */ if (dst == orig) /* following doesn't handle inplace mappings */ return; -- cgit v1.2.3 From b26ad5836c3a0a9d456eb60b9f841ca15403ee59 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:04 -0800 Subject: lib/bitmap.c: change parameters of bitmap_fold to unsigned Change the sz and nbits parameters of bitmap_fold to unsigned int for consistency with other bitmap_* functions, and to save another few bytes in the generated code. [akpm@linux-foundation.org: fix kerneldoc] Signed-off-by: Rasmus Villemoes Cc: Wu Fengguang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/bitmap.h | 2 +- lib/bitmap.c | 10 +++++----- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h index d0c6214eb190..95dcd2f76e1a 100644 --- a/include/linux/bitmap.h +++ b/include/linux/bitmap.h @@ -166,7 +166,7 @@ extern int bitmap_bitremap(int oldbit, extern void bitmap_onto(unsigned long *dst, const unsigned long *orig, const unsigned long *relmap, unsigned int bits); extern void bitmap_fold(unsigned long *dst, const unsigned long *orig, - int sz, int bits); + unsigned int sz, unsigned int nbits); extern int bitmap_find_free_region(unsigned long *bitmap, unsigned int bits, int order); extern void bitmap_release_region(unsigned long *bitmap, unsigned int pos, int order); extern int bitmap_allocate_region(unsigned long *bitmap, unsigned int pos, int order); diff --git a/lib/bitmap.c b/lib/bitmap.c index ee598a496895..b0d3e823dab3 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -1039,22 +1039,22 @@ EXPORT_SYMBOL(bitmap_onto); * @dst: resulting smaller bitmap * @orig: original larger bitmap * @sz: specified size - * @bits: number of bits in each of these bitmaps + * @nbits: number of bits in each of these bitmaps * * For each bit oldbit in @orig, set bit oldbit mod @sz in @dst. * Clear all other bits in @dst. See further the comment and * Example [2] for bitmap_onto() for why and how to use this. */ void bitmap_fold(unsigned long *dst, const unsigned long *orig, - int sz, int bits) + unsigned int sz, unsigned int nbits) { - int oldbit; + unsigned int oldbit; if (dst == orig) /* following doesn't handle inplace mappings */ return; - bitmap_zero(dst, bits); + bitmap_zero(dst, nbits); - for_each_set_bit(oldbit, orig, bits) + for_each_set_bit(oldbit, orig, nbits) set_bit(oldbit % sz, dst); } EXPORT_SYMBOL(bitmap_fold); -- cgit v1.2.3 From df1d80a9eb16d98002673f68a7ebbe881f6e6946 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:07 -0800 Subject: lib/bitmap.c: simplify bitmap_pos_to_ord The ordinal of a set bit is simply the number of set bits before it; counting those doesn't need to be done one bit at a time. While at it, update the parameters to unsigned int. It is not completely unthinkable that gcc would see pos as compile-time constant 0 in one of the uses of bitmap_pos_to_ord. Since the static inline frontend bitmap_weight doesn't handle nbits==0 correctly (it would behave exactly as if nbits==BITS_PER_LONG), use __bitmap_weight. Alternatively, the last line could be spelled bitmap_weight(buf, pos+1)-1, but this is simpler. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/bitmap.c | 22 ++++++---------------- 1 file changed, 6 insertions(+), 16 deletions(-) diff --git a/lib/bitmap.c b/lib/bitmap.c index b0d3e823dab3..84d20b5c6bf1 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -744,10 +744,10 @@ EXPORT_SYMBOL(bitmap_parselist_user); /** * bitmap_pos_to_ord - find ordinal of set bit at given position in bitmap * @buf: pointer to a bitmap - * @pos: a bit position in @buf (0 <= @pos < @bits) - * @bits: number of valid bit positions in @buf + * @pos: a bit position in @buf (0 <= @pos < @nbits) + * @nbits: number of valid bit positions in @buf * - * Map the bit at position @pos in @buf (of length @bits) to the + * Map the bit at position @pos in @buf (of length @nbits) to the * ordinal of which set bit it is. If it is not set or if @pos * is not a valid bit position, map to -1. * @@ -759,22 +759,12 @@ EXPORT_SYMBOL(bitmap_parselist_user); * * The bit positions 0 through @bits are valid positions in @buf. */ -static int bitmap_pos_to_ord(const unsigned long *buf, int pos, int bits) +static int bitmap_pos_to_ord(const unsigned long *buf, unsigned int pos, unsigned int nbits) { - int i, ord; - - if (pos < 0 || pos >= bits || !test_bit(pos, buf)) + if (pos >= nbits || !test_bit(pos, buf)) return -1; - i = find_first_bit(buf, bits); - ord = 0; - while (i < pos) { - i = find_next_bit(buf, bits, i + 1); - ord++; - } - BUG_ON(i != pos); - - return ord; + return __bitmap_weight(buf, pos); } /** -- cgit v1.2.3 From f6a1f5db8d7a7a94ff07251996959d27daba4ee7 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:10 -0800 Subject: lib/bitmap.c: simplify bitmap_ord_to_pos Make the return value and the ord and nbits parameters of bitmap_ord_to_pos unsigned. Also, simplify the implementation and as a side effect make the result fully defined, returning nbits for ord >= weight, in analogy with what find_{first,next}_bit does. This is a better sentinel than the former ("unofficial") 0. No current users are affected by this change. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/bitmap.h | 2 +- lib/bitmap.c | 28 +++++++++++----------------- 2 files changed, 12 insertions(+), 18 deletions(-) diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h index 95dcd2f76e1a..1e74fe7aa167 100644 --- a/include/linux/bitmap.h +++ b/include/linux/bitmap.h @@ -171,7 +171,7 @@ extern int bitmap_find_free_region(unsigned long *bitmap, unsigned int bits, int extern void bitmap_release_region(unsigned long *bitmap, unsigned int pos, int order); extern int bitmap_allocate_region(unsigned long *bitmap, unsigned int pos, int order); extern void bitmap_copy_le(void *dst, const unsigned long *src, int nbits); -extern int bitmap_ord_to_pos(const unsigned long *bitmap, int n, int bits); +extern unsigned int bitmap_ord_to_pos(const unsigned long *bitmap, unsigned int ord, unsigned int nbits); extern int bitmap_print_to_pagebuf(bool list, char *buf, const unsigned long *maskp, int nmaskbits); diff --git a/lib/bitmap.c b/lib/bitmap.c index 84d20b5c6bf1..e8a38bde7af9 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -771,34 +771,28 @@ static int bitmap_pos_to_ord(const unsigned long *buf, unsigned int pos, unsigne * bitmap_ord_to_pos - find position of n-th set bit in bitmap * @buf: pointer to bitmap * @ord: ordinal bit position (n-th set bit, n >= 0) - * @bits: number of valid bit positions in @buf + * @nbits: number of valid bit positions in @buf * * Map the ordinal offset of bit @ord in @buf to its position in @buf. - * Value of @ord should be in range 0 <= @ord < weight(buf), else - * results are undefined. + * Value of @ord should be in range 0 <= @ord < weight(buf). If @ord + * >= weight(buf), returns @nbits. * * If for example, just bits 4 through 7 are set in @buf, then @ord * values 0 through 3 will get mapped to 4 through 7, respectively, - * and all other @ord values return undefined values. When @ord value 3 + * and all other @ord values returns @nbits. When @ord value 3 * gets mapped to (returns) @pos value 7 in this example, that means * that the 3rd set bit (starting with 0th) is at position 7 in @buf. * - * The bit positions 0 through @bits are valid positions in @buf. + * The bit positions 0 through @nbits-1 are valid positions in @buf. */ -int bitmap_ord_to_pos(const unsigned long *buf, int ord, int bits) +unsigned int bitmap_ord_to_pos(const unsigned long *buf, unsigned int ord, unsigned int nbits) { - int pos = 0; + unsigned int pos; - if (ord >= 0 && ord < bits) { - int i; - - for (i = find_first_bit(buf, bits); - i < bits && ord > 0; - i = find_next_bit(buf, bits, i + 1)) - ord--; - if (i < bits && ord == 0) - pos = i; - } + for (pos = find_first_bit(buf, nbits); + pos < nbits && ord; + pos = find_next_bit(buf, nbits, pos + 1)) + ord--; return pos; } -- cgit v1.2.3 From 9814ec135dedb70a9daa41e68798d540d7ba71f2 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:13 -0800 Subject: lib/bitmap.c: make the bits parameter of bitmap_remap unsigned Also, rename bits to nbits. Both changes for consistency with other bitmap_* functions. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/bitmap.h | 2 +- lib/bitmap.c | 16 ++++++++-------- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h index 1e74fe7aa167..5f5c00de39f0 100644 --- a/include/linux/bitmap.h +++ b/include/linux/bitmap.h @@ -160,7 +160,7 @@ extern int bitmap_parselist(const char *buf, unsigned long *maskp, extern int bitmap_parselist_user(const char __user *ubuf, unsigned int ulen, unsigned long *dst, int nbits); extern void bitmap_remap(unsigned long *dst, const unsigned long *src, - const unsigned long *old, const unsigned long *new, int bits); + const unsigned long *old, const unsigned long *new, unsigned int nbits); extern int bitmap_bitremap(int oldbit, const unsigned long *old, const unsigned long *new, int bits); extern void bitmap_onto(unsigned long *dst, const unsigned long *orig, diff --git a/lib/bitmap.c b/lib/bitmap.c index e8a38bde7af9..ad161a6c82db 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -803,7 +803,7 @@ unsigned int bitmap_ord_to_pos(const unsigned long *buf, unsigned int ord, unsig * @src: subset to be remapped * @old: defines domain of map * @new: defines range of map - * @bits: number of bits in each of these bitmaps + * @nbits: number of bits in each of these bitmaps * * Let @old and @new define a mapping of bit positions, such that * whatever position is held by the n-th set bit in @old is mapped @@ -831,22 +831,22 @@ unsigned int bitmap_ord_to_pos(const unsigned long *buf, unsigned int ord, unsig */ void bitmap_remap(unsigned long *dst, const unsigned long *src, const unsigned long *old, const unsigned long *new, - int bits) + unsigned int nbits) { - int oldbit, w; + unsigned int oldbit, w; if (dst == src) /* following doesn't handle inplace remaps */ return; - bitmap_zero(dst, bits); + bitmap_zero(dst, nbits); - w = bitmap_weight(new, bits); - for_each_set_bit(oldbit, src, bits) { - int n = bitmap_pos_to_ord(old, oldbit, bits); + w = bitmap_weight(new, nbits); + for_each_set_bit(oldbit, src, nbits) { + int n = bitmap_pos_to_ord(old, oldbit, nbits); if (n < 0 || w == 0) set_bit(oldbit, dst); /* identity map */ else - set_bit(bitmap_ord_to_pos(new, n % w, bits), dst); + set_bit(bitmap_ord_to_pos(new, n % w, nbits), dst); } } EXPORT_SYMBOL(bitmap_remap); -- cgit v1.2.3 From af3cd13501eb04ca61d017ff4406f1cbffafdc04 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:15 -0800 Subject: lib/string.c: remove strnicmp() Now that all in-tree users of strnicmp have been converted to strncasecmp, the wrapper can be removed. Signed-off-by: Rasmus Villemoes Cc: David Howells Cc: Heiko Carstens Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/frv/include/asm/string.h | 1 - arch/s390/include/asm/string.h | 1 - include/linux/string.h | 3 --- lib/string.c | 8 -------- 4 files changed, 13 deletions(-) diff --git a/arch/frv/include/asm/string.h b/arch/frv/include/asm/string.h index 5ed310f64b7e..1f6c35990439 100644 --- a/arch/frv/include/asm/string.h +++ b/arch/frv/include/asm/string.h @@ -33,7 +33,6 @@ extern void *memcpy(void *, const void *, __kernel_size_t); #define __HAVE_ARCH_STRNCAT 1 #define __HAVE_ARCH_STRCMP 1 #define __HAVE_ARCH_STRNCMP 1 -#define __HAVE_ARCH_STRNICMP 1 #define __HAVE_ARCH_STRCHR 1 #define __HAVE_ARCH_STRRCHR 1 #define __HAVE_ARCH_STRSTR 1 diff --git a/arch/s390/include/asm/string.h b/arch/s390/include/asm/string.h index 7e2dcd7c57ef..8662f5c8e17f 100644 --- a/arch/s390/include/asm/string.h +++ b/arch/s390/include/asm/string.h @@ -44,7 +44,6 @@ extern char *strstr(const char *, const char *); #undef __HAVE_ARCH_STRCHR #undef __HAVE_ARCH_STRNCHR #undef __HAVE_ARCH_STRNCMP -#undef __HAVE_ARCH_STRNICMP #undef __HAVE_ARCH_STRPBRK #undef __HAVE_ARCH_STRSEP #undef __HAVE_ARCH_STRSPN diff --git a/include/linux/string.h b/include/linux/string.h index 2e22a2e58f3a..b9bc9a5d9e21 100644 --- a/include/linux/string.h +++ b/include/linux/string.h @@ -40,9 +40,6 @@ extern int strcmp(const char *,const char *); #ifndef __HAVE_ARCH_STRNCMP extern int strncmp(const char *,const char *,__kernel_size_t); #endif -#ifndef __HAVE_ARCH_STRNICMP -#define strnicmp strncasecmp -#endif #ifndef __HAVE_ARCH_STRCASECMP extern int strcasecmp(const char *s1, const char *s2); #endif diff --git a/lib/string.c b/lib/string.c index 10063300b830..3206d0178296 100644 --- a/lib/string.c +++ b/lib/string.c @@ -58,14 +58,6 @@ int strncasecmp(const char *s1, const char *s2, size_t len) } EXPORT_SYMBOL(strncasecmp); #endif -#ifndef __HAVE_ARCH_STRNICMP -#undef strnicmp -int strnicmp(const char *s1, const char *s2, size_t len) -{ - return strncasecmp(s1, s2, len); -} -EXPORT_SYMBOL(strnicmp); -#endif #ifndef __HAVE_ARCH_STRCASECMP int strcasecmp(const char *s1, const char *s2) -- cgit v1.2.3 From ad3d5d2f7deca6d0fd72163573bcb0cca6337e33 Mon Sep 17 00:00:00 2001 From: Toshi Kikuchi Date: Thu, 12 Feb 2015 15:02:18 -0800 Subject: lib/genalloc.c: fix the end addr check in addr_in_gen_pool() Since chunk->end_addr is (chunk->start_addr + size - 1), the end address to compare should be (start + size - 1). Signed-off-by: Toshi Kikuchi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/genalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/genalloc.c b/lib/genalloc.c index 2e65d206b01c..42a95e99b754 100644 --- a/lib/genalloc.c +++ b/lib/genalloc.c @@ -415,7 +415,7 @@ bool addr_in_gen_pool(struct gen_pool *pool, unsigned long start, size_t size) { bool found = false; - unsigned long end = start + size; + unsigned long end = start + size - 1; struct gen_pool_chunk *chunk; rcu_read_lock(); -- cgit v1.2.3 From 64d1d77a44697af8e314939ecef30642c68309cb Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Thu, 12 Feb 2015 15:02:21 -0800 Subject: hexdump: introduce test suite Test different scenarios of function calls located in lib/hexdump.c. Currently hex_dump_to_buffer() is only tested and test data is provided for little endian CPUs. Signed-off-by: Andy Shevchenko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/Kconfig.debug | 3 ++ lib/Makefile | 4 +- lib/test-hexdump.c | 135 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 141 insertions(+), 1 deletion(-) create mode 100644 lib/test-hexdump.c diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index e5ea3ab856bf..79a9bb67aeaf 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1580,6 +1580,9 @@ config ASYNC_RAID6_TEST If unsure, say N. +config TEST_HEXDUMP + tristate "Test functions located in the hexdump module at runtime" + config TEST_STRING_HELPERS tristate "Test functions located in the string_helpers module at runtime" diff --git a/lib/Makefile b/lib/Makefile index 25c061f77df7..e456defd1021 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -23,12 +23,14 @@ lib-y += kobject.o klist.o obj-y += lockref.o obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \ - bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \ + bust_spinlocks.o kasprintf.o bitmap.o scatterlist.o \ gcd.o lcm.o list_sort.o uuid.o flex_array.o clz_ctz.o \ bsearch.o find_last_bit.o find_next_bit.o llist.o memweight.o kfifo.o \ percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o obj-y += string_helpers.o obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o +obj-y += hexdump.o +obj-$(CONFIG_TEST_HEXDUMP) += test-hexdump.o obj-y += kstrtox.o obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o obj-$(CONFIG_TEST_LKM) += test_module.o diff --git a/lib/test-hexdump.c b/lib/test-hexdump.c new file mode 100644 index 000000000000..9d3bd1e9ae48 --- /dev/null +++ b/lib/test-hexdump.c @@ -0,0 +1,135 @@ +/* + * Test cases for lib/hexdump.c module. + */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include + +static const unsigned char data_b[] = { + '\xbe', '\x32', '\xdb', '\x7b', '\x0a', '\x18', '\x93', '\xb2', /* 00 - 07 */ + '\x70', '\xba', '\xc4', '\x24', '\x7d', '\x83', '\x34', '\x9b', /* 08 - 0f */ + '\xa6', '\x9c', '\x31', '\xad', '\x9c', '\x0f', '\xac', '\xe9', /* 10 - 17 */ + '\x4c', '\xd1', '\x19', '\x99', '\x43', '\xb1', '\xaf', '\x0c', /* 18 - 1f */ +}; + +static const unsigned char data_a[] = ".2.{....p..$}.4...1.....L...C..."; + +static const char *test_data_1_le[] __initconst = { + "be", "32", "db", "7b", "0a", "18", "93", "b2", + "70", "ba", "c4", "24", "7d", "83", "34", "9b", + "a6", "9c", "31", "ad", "9c", "0f", "ac", "e9", + "4c", "d1", "19", "99", "43", "b1", "af", "0c", +}; + +static const char *test_data_2_le[] __initconst = { + "32be", "7bdb", "180a", "b293", + "ba70", "24c4", "837d", "9b34", + "9ca6", "ad31", "0f9c", "e9ac", + "d14c", "9919", "b143", "0caf", +}; + +static const char *test_data_4_le[] __initconst = { + "7bdb32be", "b293180a", "24c4ba70", "9b34837d", + "ad319ca6", "e9ac0f9c", "9919d14c", "0cafb143", +}; + +static const char *test_data_8_le[] __initconst = { + "b293180a7bdb32be", "9b34837d24c4ba70", + "e9ac0f9cad319ca6", "0cafb1439919d14c", +}; + +static void __init test_hexdump(size_t len, int rowsize, int groupsize, + bool ascii) +{ + char test[32 * 3 + 2 + 32 + 1]; + char real[32 * 3 + 2 + 32 + 1]; + char *p; + const char **result; + size_t l = len; + int gs = groupsize, rs = rowsize; + unsigned int i; + + hex_dump_to_buffer(data_b, l, rs, gs, real, sizeof(real), ascii); + + if (rs != 16 && rs != 32) + rs = 16; + + if (l > rs) + l = rs; + + if (!is_power_of_2(gs) || gs > 8 || (len % gs != 0)) + gs = 1; + + if (gs == 8) + result = test_data_8_le; + else if (gs == 4) + result = test_data_4_le; + else if (gs == 2) + result = test_data_2_le; + else + result = test_data_1_le; + + memset(test, ' ', sizeof(test)); + + /* hex dump */ + p = test; + for (i = 0; i < l / gs; i++) { + const char *q = *result++; + size_t amount = strlen(q); + + strncpy(p, q, amount); + p += amount + 1; + } + if (i) + p--; + + /* ASCII part */ + if (ascii) { + p = test + rs * 2 + rs / gs + 1; + strncpy(p, data_a, l); + p += l; + } + + *p = '\0'; + + if (strcmp(test, real)) { + pr_err("Len: %zu row: %d group: %d\n", len, rowsize, groupsize); + pr_err("Result: '%s'\n", real); + pr_err("Expect: '%s'\n", test); + } +} + +static void __init test_hexdump_set(int rowsize, bool ascii) +{ + size_t d = min_t(size_t, sizeof(data_b), rowsize); + size_t len = get_random_int() % d + 1; + + test_hexdump(len, rowsize, 4, ascii); + test_hexdump(len, rowsize, 2, ascii); + test_hexdump(len, rowsize, 8, ascii); + test_hexdump(len, rowsize, 1, ascii); +} + +static int __init test_hexdump_init(void) +{ + unsigned int i; + int rowsize; + + pr_info("Running tests...\n"); + + rowsize = (get_random_int() % 2 + 1) * 16; + for (i = 0; i < 16; i++) + test_hexdump_set(rowsize, false); + + rowsize = (get_random_int() % 2 + 1) * 16; + for (i = 0; i < 16; i++) + test_hexdump_set(rowsize, true); + + return -EINVAL; +} +module_init(test_hexdump_init); +MODULE_LICENSE("Dual BSD/GPL"); -- cgit v1.2.3 From 6f6f3fcb87a53c720941f4bd039ec2c0bce66625 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Thu, 12 Feb 2015 15:02:24 -0800 Subject: hexdump: fix ascii column for the tail of a dump In the current implementation we have a floating ascii column in the tail of the dump. For example, for row size equal to 16 the ascii column as in following table group size \ length 8 12 16 1 50 50 50 2 22 32 42 4 20 29 38 8 19 - 36 This patch makes it the same independently of amount of bytes dumped. The change is safe since all current users, which use ASCII part of the dump, rely on the group size equal to 1. The patch doesn't change behaviour for such group size (see the table above). Signed-off-by: Andy Shevchenko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/hexdump.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/lib/hexdump.c b/lib/hexdump.c index 270773b91923..a61cb6b1f011 100644 --- a/lib/hexdump.c +++ b/lib/hexdump.c @@ -126,7 +126,7 @@ void hex_dump_to_buffer(const void *buf, size_t len, int rowsize, lx += scnprintf(linebuf + lx, linebuflen - lx, "%s%16.16llx", j ? " " : "", (unsigned long long)*(ptr8 + j)); - ascii_column = 17 * ngroups + 2; + ascii_column = rowsize * 2 + rowsize / 8 + 2; break; } @@ -137,7 +137,7 @@ void hex_dump_to_buffer(const void *buf, size_t len, int rowsize, for (j = 0; j < ngroups; j++) lx += scnprintf(linebuf + lx, linebuflen - lx, "%s%8.8x", j ? " " : "", *(ptr4 + j)); - ascii_column = 9 * ngroups + 2; + ascii_column = rowsize * 2 + rowsize / 4 + 2; break; } @@ -148,7 +148,7 @@ void hex_dump_to_buffer(const void *buf, size_t len, int rowsize, for (j = 0; j < ngroups; j++) lx += scnprintf(linebuf + lx, linebuflen - lx, "%s%4.4x", j ? " " : "", *(ptr2 + j)); - ascii_column = 5 * ngroups + 2; + ascii_column = rowsize * 2 + rowsize / 2 + 2; break; } -- cgit v1.2.3 From 5d909c8d54b114eddb7c50506f03bf7309a9192e Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Thu, 12 Feb 2015 15:02:26 -0800 Subject: hexdump: do a few calculations ahead Instead of doing calculations in each case of different groupsize let's do them beforehand. While there, change the switch to an if-else-if construction. Signed-off-by: Andy Shevchenko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/hexdump.c | 34 ++++++++++------------------------ 1 file changed, 10 insertions(+), 24 deletions(-) diff --git a/lib/hexdump.c b/lib/hexdump.c index a61cb6b1f011..4af53f73c7cc 100644 --- a/lib/hexdump.c +++ b/lib/hexdump.c @@ -103,6 +103,7 @@ void hex_dump_to_buffer(const void *buf, size_t len, int rowsize, bool ascii) { const u8 *ptr = buf; + int ngroups; u8 ch; int j, lx = 0; int ascii_column; @@ -114,45 +115,33 @@ void hex_dump_to_buffer(const void *buf, size_t len, int rowsize, goto nil; if (len > rowsize) /* limit to one line at a time */ len = rowsize; + if (!is_power_of_2(groupsize) || groupsize > 8) + groupsize = 1; if ((len % groupsize) != 0) /* no mixed size output */ groupsize = 1; - switch (groupsize) { - case 8: { + ngroups = len / groupsize; + ascii_column = rowsize * 2 + rowsize / groupsize + 1; + if (groupsize == 8) { const u64 *ptr8 = buf; - int ngroups = len / groupsize; for (j = 0; j < ngroups; j++) lx += scnprintf(linebuf + lx, linebuflen - lx, "%s%16.16llx", j ? " " : "", (unsigned long long)*(ptr8 + j)); - ascii_column = rowsize * 2 + rowsize / 8 + 2; - break; - } - - case 4: { + } else if (groupsize == 4) { const u32 *ptr4 = buf; - int ngroups = len / groupsize; for (j = 0; j < ngroups; j++) lx += scnprintf(linebuf + lx, linebuflen - lx, "%s%8.8x", j ? " " : "", *(ptr4 + j)); - ascii_column = rowsize * 2 + rowsize / 4 + 2; - break; - } - - case 2: { + } else if (groupsize == 2) { const u16 *ptr2 = buf; - int ngroups = len / groupsize; for (j = 0; j < ngroups; j++) lx += scnprintf(linebuf + lx, linebuflen - lx, "%s%4.4x", j ? " " : "", *(ptr2 + j)); - ascii_column = rowsize * 2 + rowsize / 2 + 2; - break; - } - - default: + } else { for (j = 0; (j < len) && (lx + 3) <= linebuflen; j++) { ch = ptr[j]; linebuf[lx++] = hex_asc_hi(ch); @@ -161,14 +150,11 @@ void hex_dump_to_buffer(const void *buf, size_t len, int rowsize, } if (j) lx--; - - ascii_column = 3 * rowsize + 2; - break; } if (!ascii) goto nil; - while (lx < (linebuflen - 1) && lx < (ascii_column - 1)) + while (lx < (linebuflen - 1) && lx < ascii_column) linebuf[lx++] = ' '; for (j = 0; (j < len) && (lx + 2) < linebuflen; j++) { ch = ptr[j]; -- cgit v1.2.3 From 114fc1afb2de7dec40da137dc2a55cd38fc220f2 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Thu, 12 Feb 2015 15:02:29 -0800 Subject: hexdump: make it return number of bytes placed in buffer This patch makes hexdump return the number of bytes placed in the buffer excluding trailing NUL. In the case of overflow it returns the desired amount of bytes to produce the entire dump. Thus, it mimics snprintf(). This will be useful for users that would like to repeat with a bigger buffer. [akpm@linux-foundation.org: fix printk warning] Signed-off-by: Andy Shevchenko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/printk.h | 6 ++--- lib/hexdump.c | 73 +++++++++++++++++++++++++++++++++++++------------- lib/test-hexdump.c | 45 +++++++++++++++++++++++++++++++ 3 files changed, 103 insertions(+), 21 deletions(-) diff --git a/include/linux/printk.h b/include/linux/printk.h index 4d5bf5726578..baa3f97d8ce8 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -417,9 +417,9 @@ enum { DUMP_PREFIX_ADDRESS, DUMP_PREFIX_OFFSET }; -extern void hex_dump_to_buffer(const void *buf, size_t len, - int rowsize, int groupsize, - char *linebuf, size_t linebuflen, bool ascii); +extern int hex_dump_to_buffer(const void *buf, size_t len, int rowsize, + int groupsize, char *linebuf, size_t linebuflen, + bool ascii); #ifdef CONFIG_PRINTK extern void print_hex_dump(const char *level, const char *prefix_str, int prefix_type, int rowsize, int groupsize, diff --git a/lib/hexdump.c b/lib/hexdump.c index 4af53f73c7cc..7ea09699855d 100644 --- a/lib/hexdump.c +++ b/lib/hexdump.c @@ -97,22 +97,26 @@ EXPORT_SYMBOL(bin2hex); * * example output buffer: * 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f @ABCDEFGHIJKLMNO + * + * Return: + * The amount of bytes placed in the buffer without terminating NUL. If the + * output was truncated, then the return value is the number of bytes + * (excluding the terminating NUL) which would have been written to the final + * string if enough space had been available. */ -void hex_dump_to_buffer(const void *buf, size_t len, int rowsize, - int groupsize, char *linebuf, size_t linebuflen, - bool ascii) +int hex_dump_to_buffer(const void *buf, size_t len, int rowsize, int groupsize, + char *linebuf, size_t linebuflen, bool ascii) { const u8 *ptr = buf; int ngroups; u8 ch; int j, lx = 0; int ascii_column; + int ret; if (rowsize != 16 && rowsize != 32) rowsize = 16; - if (!len) - goto nil; if (len > rowsize) /* limit to one line at a time */ len = rowsize; if (!is_power_of_2(groupsize) || groupsize > 8) @@ -122,27 +126,50 @@ void hex_dump_to_buffer(const void *buf, size_t len, int rowsize, ngroups = len / groupsize; ascii_column = rowsize * 2 + rowsize / groupsize + 1; + + if (!linebuflen) + goto overflow1; + + if (!len) + goto nil; + if (groupsize == 8) { const u64 *ptr8 = buf; - for (j = 0; j < ngroups; j++) - lx += scnprintf(linebuf + lx, linebuflen - lx, - "%s%16.16llx", j ? " " : "", - (unsigned long long)*(ptr8 + j)); + for (j = 0; j < ngroups; j++) { + ret = snprintf(linebuf + lx, linebuflen - lx, + "%s%16.16llx", j ? " " : "", + (unsigned long long)*(ptr8 + j)); + if (ret >= linebuflen - lx) + goto overflow1; + lx += ret; + } } else if (groupsize == 4) { const u32 *ptr4 = buf; - for (j = 0; j < ngroups; j++) - lx += scnprintf(linebuf + lx, linebuflen - lx, - "%s%8.8x", j ? " " : "", *(ptr4 + j)); + for (j = 0; j < ngroups; j++) { + ret = snprintf(linebuf + lx, linebuflen - lx, + "%s%8.8x", j ? " " : "", + *(ptr4 + j)); + if (ret >= linebuflen - lx) + goto overflow1; + lx += ret; + } } else if (groupsize == 2) { const u16 *ptr2 = buf; - for (j = 0; j < ngroups; j++) - lx += scnprintf(linebuf + lx, linebuflen - lx, - "%s%4.4x", j ? " " : "", *(ptr2 + j)); + for (j = 0; j < ngroups; j++) { + ret = snprintf(linebuf + lx, linebuflen - lx, + "%s%4.4x", j ? " " : "", + *(ptr2 + j)); + if (ret >= linebuflen - lx) + goto overflow1; + lx += ret; + } } else { - for (j = 0; (j < len) && (lx + 3) <= linebuflen; j++) { + for (j = 0; j < len; j++) { + if (linebuflen < lx + 3) + goto overflow2; ch = ptr[j]; linebuf[lx++] = hex_asc_hi(ch); linebuf[lx++] = hex_asc_lo(ch); @@ -154,14 +181,24 @@ void hex_dump_to_buffer(const void *buf, size_t len, int rowsize, if (!ascii) goto nil; - while (lx < (linebuflen - 1) && lx < ascii_column) + while (lx < ascii_column) { + if (linebuflen < lx + 2) + goto overflow2; linebuf[lx++] = ' '; - for (j = 0; (j < len) && (lx + 2) < linebuflen; j++) { + } + for (j = 0; j < len; j++) { + if (linebuflen < lx + 2) + goto overflow2; ch = ptr[j]; linebuf[lx++] = (isascii(ch) && isprint(ch)) ? ch : '.'; } nil: + linebuf[lx] = '\0'; + return lx; +overflow2: linebuf[lx++] = '\0'; +overflow1: + return ascii ? ascii_column + len : (groupsize * 2 + 1) * ngroups - 1; } EXPORT_SYMBOL(hex_dump_to_buffer); diff --git a/lib/test-hexdump.c b/lib/test-hexdump.c index 9d3bd1e9ae48..daf29a390a89 100644 --- a/lib/test-hexdump.c +++ b/lib/test-hexdump.c @@ -114,6 +114,45 @@ static void __init test_hexdump_set(int rowsize, bool ascii) test_hexdump(len, rowsize, 1, ascii); } +static void __init test_hexdump_overflow(bool ascii) +{ + char buf[56]; + const char *t = test_data_1_le[0]; + size_t l = get_random_int() % sizeof(buf); + bool a; + int e, r; + + memset(buf, ' ', sizeof(buf)); + + r = hex_dump_to_buffer(data_b, 1, 16, 1, buf, l, ascii); + + if (ascii) + e = 50; + else + e = 2; + buf[e + 2] = '\0'; + + if (!l) { + a = r == e && buf[0] == ' '; + } else if (l < 3) { + a = r == e && buf[0] == '\0'; + } else if (l < 4) { + a = r == e && !strcmp(buf, t); + } else if (ascii) { + if (l < 51) + a = r == e && buf[l - 1] == '\0' && buf[l - 2] == ' '; + else + a = r == e && buf[50] == '\0' && buf[49] == '.'; + } else { + a = r == e && buf[e] == '\0'; + } + + if (!a) { + pr_err("Len: %zu rc: %u strlen: %zu\n", l, r, strlen(buf)); + pr_err("Result: '%s'\n", buf); + } +} + static int __init test_hexdump_init(void) { unsigned int i; @@ -129,6 +168,12 @@ static int __init test_hexdump_init(void) for (i = 0; i < 16; i++) test_hexdump_set(rowsize, true); + for (i = 0; i < 16; i++) + test_hexdump_overflow(false); + + for (i = 0; i < 16; i++) + test_hexdump_overflow(true); + return -EINVAL; } module_init(test_hexdump_init); -- cgit v1.2.3 From 85c5e27c4a7d085e8a0e112f659f6375c6f309e1 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:32 -0800 Subject: lib/interval_tree.c: simplify includes The file uses nothing from init.h, and also doesn't need the full module.h machinery; export.h is sufficient. The latter requires the user to ensure compiler.h is included, so do that explicitly instead of relying on some other header pulling it in. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/interval_tree.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/interval_tree.c b/lib/interval_tree.c index f367f9ad544c..c85f6600a5f8 100644 --- a/lib/interval_tree.c +++ b/lib/interval_tree.c @@ -1,7 +1,7 @@ -#include #include #include -#include +#include +#include #define START(node) ((node)->start) #define LAST(node) ((node)->last) -- cgit v1.2.3 From 42cf809654e4ea2fa16dd73608e153f1c6f7c2ed Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:35 -0800 Subject: lib/sort.c: use simpler includes sort.c doesn't use facilities from kernel.h, but does use some types defined in linux/types.h. Include the latter directly instead of relying on some other header doing it. Similarly, include linux/export.h directly instead of through module.h. This removes 80 lines from the dependency file .sort.o.cmd. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/sort.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/sort.c b/lib/sort.c index 926d00429ed2..14fc1dfadb3f 100644 --- a/lib/sort.c +++ b/lib/sort.c @@ -4,8 +4,8 @@ * Jan 23 2005 Matt Mackall */ -#include -#include +#include +#include #include #include -- cgit v1.2.3 From 565ac23b81ebbcfad4111fa9c743ebc7fc25a4f7 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:37 -0800 Subject: lib/dynamic_queue_limits.c: simplify includes The file doesn't use anything from ctype.h. Instead of module.h, just use export.h for EXPORT_SYMBOL. The latter requires the user to include compiler.h, so do that explicitly instead of relying on some other header pulling it in. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/dynamic_queue_limits.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c index 0777c5a45fa0..f346715e2255 100644 --- a/lib/dynamic_queue_limits.c +++ b/lib/dynamic_queue_limits.c @@ -3,12 +3,12 @@ * * Copyright (c) 2011, Tom Herbert */ -#include #include -#include #include #include #include +#include +#include #define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0) #define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0) -- cgit v1.2.3 From 3248340d3f10871d2ae2d1df9e323f284fa1fdb6 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:40 -0800 Subject: lib/halfmd4.c: simplify includes We only need EXPORT_SYMBOL, so compiler.h and export.h suffice. This means linux/types.h is no longer implicitly included, so add an include of uapi/linux/types.h to linux/cryptohash.h for __u32. Other users of cryptohash.h cannot be affected, since they must already have been including uapi/linux/types.h in order for gcc not to complain about unknown types. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/cryptohash.h | 2 ++ lib/halfmd4.c | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/cryptohash.h b/include/linux/cryptohash.h index 2cd9f1cf9fa3..f4754282c9c2 100644 --- a/include/linux/cryptohash.h +++ b/include/linux/cryptohash.h @@ -1,6 +1,8 @@ #ifndef __CRYPTOHASH_H #define __CRYPTOHASH_H +#include + #define SHA_DIGEST_WORDS 5 #define SHA_MESSAGE_BYTES (512 /*bits*/ / 8) #define SHA_WORKSPACE_WORDS 16 diff --git a/lib/halfmd4.c b/lib/halfmd4.c index 66d0ee8b7776..a8fe6274a13c 100644 --- a/lib/halfmd4.c +++ b/lib/halfmd4.c @@ -1,4 +1,4 @@ -#include +#include #include #include -- cgit v1.2.3 From 87d1d16937f64dd7822aee8b2e35b2f3ed3200b4 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:43 -0800 Subject: lib/idr.c: remove redundant include idr.c doesn't seem to use anything from hardirq.h (or anything included from that). Removing it produces identical objdump -d output, and gives 44 fewer lines in the .idr.o.cmd dependency file. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/idr.c | 1 - 1 file changed, 1 deletion(-) diff --git a/lib/idr.c b/lib/idr.c index e654aebd5f80..5335c43adf46 100644 --- a/lib/idr.c +++ b/lib/idr.c @@ -30,7 +30,6 @@ #include #include #include -#include #define MAX_IDR_SHIFT (sizeof(int) * 8 - 1) #define MAX_IDR_BIT (1U << MAX_IDR_SHIFT) -- cgit v1.2.3 From 18fa6d2e4574d2a44e0c2cc5ae1c1812ec8019d8 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:46 -0800 Subject: lib/genalloc.c: remove redundant include Removing this include produces byte-identical output, and thus removes a false dependency. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/genalloc.c | 1 - 1 file changed, 1 deletion(-) diff --git a/lib/genalloc.c b/lib/genalloc.c index 42a95e99b754..0fe1cbe87700 100644 --- a/lib/genalloc.c +++ b/lib/genalloc.c @@ -34,7 +34,6 @@ #include #include #include -#include #include static inline size_t chunk_size(const struct gen_pool_chunk *chunk) -- cgit v1.2.3 From 7259fa0424208fb7ab19a914f10e2502d2f6d18b Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:48 -0800 Subject: lib/list_sort.c: rearrange includes Memory allocation only happens in the self test, just as random numbers are only used there. So move the inclusion of slab.h inside the CONFIG_TEST_LIST_SORT. We don't need module.h and all of the stuff it carries with it, so replace with export.h and compiler.h. Unfortunately, the ARRAY_SIZE macro from kernel.h requires the user to ensure bug.h is also included (for BUILD_BUG_ON_ZERO, used by __must_be_array). We used to get that through some maze of nested includes, but just include it explicitly. linux/string.h is then only included implicitly through kernel.h->printk.h->dynamic_debug.h, but only if !CONFIG_DYNAMIC_DEBUG, so just include it explicitly (for memset). objdump -d says the generated code is the same, and wc -l says that lib/.list_sort.o.cmd went from 579 to 165 lines. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/list_sort.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/lib/list_sort.c b/lib/list_sort.c index 12bcba1c8612..b29015102698 100644 --- a/lib/list_sort.c +++ b/lib/list_sort.c @@ -2,9 +2,11 @@ #define pr_fmt(fmt) "list_sort_test: " fmt #include -#include +#include +#include +#include +#include #include -#include #include #define MAX_LIST_LENGTH_BITS 20 @@ -146,6 +148,7 @@ EXPORT_SYMBOL(list_sort); #ifdef CONFIG_TEST_LIST_SORT +#include #include /* -- cgit v1.2.3 From 9a29ae84c147a348c3cb7aef249b0d40ed6da1ed Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:51 -0800 Subject: lib/md5.c: simplify include md5.c doesn't use anything from kernel.h, except that that pulls in compiler.h, which is needed for the export.h to work. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/md5.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/md5.c b/lib/md5.c index 958a3c15923c..bb0cd01d356d 100644 --- a/lib/md5.c +++ b/lib/md5.c @@ -1,4 +1,4 @@ -#include +#include #include #include -- cgit v1.2.3 From 9b40570bd986128bb8e3c5f6f1abc60e1ad89794 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:54 -0800 Subject: lib/llist.c: remove redundant include This file doesn't seem to use anything provided by linux/interrupt.h or anything recursively included through that. Removing it produces byte-identical output, while reducing .llist.o.cmd from 541 to 156 lines. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/llist.c | 1 - 1 file changed, 1 deletion(-) diff --git a/lib/llist.c b/lib/llist.c index f76196d07409..0b0e9779d675 100644 --- a/lib/llist.c +++ b/lib/llist.c @@ -24,7 +24,6 @@ */ #include #include -#include #include -- cgit v1.2.3 From a69ae45c260d24a4497ed38ec87c1e5ba461cae4 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:57 -0800 Subject: lib/kobject_uevent.c: remove redundant include The file doesn't seem to use anything from linux/user_namespace.h, and removing it yields byte-identical object code and strictly fewer dependencies in the .cmd file. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/kobject_uevent.c | 1 - 1 file changed, 1 deletion(-) diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c index 9ebf9e20de53..f6c2c1e7779c 100644 --- a/lib/kobject_uevent.c +++ b/lib/kobject_uevent.c @@ -20,7 +20,6 @@ #include #include #include -#include #include #include #include -- cgit v1.2.3 From fb41f9d71c4b15474640d1c01ae2e90982d8d07e Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:02:59 -0800 Subject: lib/nlattr.c: remove redundant include nlattr.c doesn't seem to rely on anything from netdevice.h. Removing it yields identical objdump -d output for each of {allyes,allno,def}config, and eliminates more than 200 lines from the generated dependency file. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/nlattr.c | 1 - 1 file changed, 1 deletion(-) diff --git a/lib/nlattr.c b/lib/nlattr.c index 9c3e85ff0a6c..76a1b59523ab 100644 --- a/lib/nlattr.c +++ b/lib/nlattr.c @@ -9,7 +9,6 @@ #include #include #include -#include #include #include #include -- cgit v1.2.3 From 7f1ce3c86411f5b836f84358e2fc4e225c7e145e Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:03:02 -0800 Subject: lib/plist.c: remove redundant include Removing the include of linux/spinlock.h produces byte-identical output for {allno,def}config, and identical objdump -d output for allyesconfig. In the former two cases, more than a 100 lines are eliminated from the generated dependency file. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/plist.c | 1 - 1 file changed, 1 deletion(-) diff --git a/lib/plist.c b/lib/plist.c index d408e774b746..3a30c53db061 100644 --- a/lib/plist.c +++ b/lib/plist.c @@ -25,7 +25,6 @@ #include #include -#include #ifdef CONFIG_DEBUG_PI_LIST -- cgit v1.2.3 From 886d3dfa85d5aa6f11813c319e50f5402c7cf4e4 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:03:05 -0800 Subject: lib/radix-tree.c: change to simpler include The comment helpfully explains why hardirq.h is included, but since commit 2d4b84739f0a ("hardirq: Split preempt count mask definitions") in_interrupt() has been provided by preempt_mask.h. Use that instead, saving around 40 lines in the generated dependency file. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/radix-tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 3291a8e37490..3d2aa27b845b 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -33,7 +33,7 @@ #include #include #include -#include /* in_interrupt() */ +#include /* in_interrupt() */ /* -- cgit v1.2.3 From b8b6db1793551fac131834d413199197f86b6b5f Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:03:08 -0800 Subject: lib/show_mem.c: remove redundant include show_mem.c doesn't use anything from nmi.h. Removing it yields identical objdump -d output for each of {allyes,allno,def}config and eliminates more than 100 lines in the dependency file. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/show_mem.c | 1 - 1 file changed, 1 deletion(-) diff --git a/lib/show_mem.c b/lib/show_mem.c index 7de89f4a36cf..adc98e1825ba 100644 --- a/lib/show_mem.c +++ b/lib/show_mem.c @@ -6,7 +6,6 @@ */ #include -#include #include #include -- cgit v1.2.3 From 2ddae683bf36d50b960402a94a55047ab0c73e2c Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:03:10 -0800 Subject: lib/sort.c: move include inside #if 0 The sort function and its helpers don't do memory allocation, so the slab.h include is redundant. Move it inside the #if 0 protecting the self-test, similar to how it is done in lib/list_sort.c. This removes over 450 lines from the generated dependency file. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/sort.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/sort.c b/lib/sort.c index 14fc1dfadb3f..43c9fe73ae2e 100644 --- a/lib/sort.c +++ b/lib/sort.c @@ -7,7 +7,6 @@ #include #include #include -#include static void u32_swap(void *a, void *b, int size) { @@ -85,6 +84,7 @@ void sort(void *base, size_t num, size_t size, EXPORT_SYMBOL(sort); #if 0 +#include /* a simple boot-time regression test */ int cmpint(const void *a, const void *b) -- cgit v1.2.3 From b6d4f3221d7fe00d440edaeeb6cde7257ae63a73 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:03:13 -0800 Subject: lib/stmp_device.c: replace module.h include stmp_device.c only needs EXPORT_SYMBOL, so just include compiler.h and export.h instead of the whole module.h machinery. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/stmp_device.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/lib/stmp_device.c b/lib/stmp_device.c index 8ac9bcc4289a..a904656f4fd7 100644 --- a/lib/stmp_device.c +++ b/lib/stmp_device.c @@ -15,7 +15,8 @@ #include #include #include -#include +#include +#include #include #define STMP_MODULE_CLKGATE (1 << 30) -- cgit v1.2.3 From bf3c2d6d2f9649f1cdfac96781e98a097e871147 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:03:16 -0800 Subject: lib/strncpy_from_user.c: replace module.h include strncpy_from_user.c only needs EXPORT_SYMBOL, so just include compiler.h and export.h instead of the whole module.h machinery. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/strncpy_from_user.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c index bb2b201d6ad0..e0af6ff73d14 100644 --- a/lib/strncpy_from_user.c +++ b/lib/strncpy_from_user.c @@ -1,4 +1,5 @@ -#include +#include +#include #include #include #include -- cgit v1.2.3 From 6918584aad8e9c6b9a9be74b15bf9e7dafc2d691 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:03:19 -0800 Subject: lib/percpu_ida.c: remove redundant includes These three #includes seem to be completely redundant: Removing them yields identical objdump -d output for each of {allyes,allno,def}config, and neither included file end up in the generated dependency file through some recursive include. In total, about 50 lines are eliminated from .percpu.o.cmd. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/percpu_ida.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c index 93d145e5539c..f75715131f20 100644 --- a/lib/percpu_ida.c +++ b/lib/percpu_ida.c @@ -19,13 +19,10 @@ #include #include #include -#include -#include #include #include #include #include -#include #include #include #include -- cgit v1.2.3 From 6016daed58ee482a2f7684e93342e89139cf4419 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Thu, 12 Feb 2015 15:03:21 -0800 Subject: lib/lcm.c: replace include We don't need all the stuff kernel.h pulls in; just compiler.h since export.h doesn't do necessary #includes. This removes more than 100 dependencies. Signed-off-by: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/lcm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/lcm.c b/lib/lcm.c index 51cc6b13cd52..e97dbd51e756 100644 --- a/lib/lcm.c +++ b/lib/lcm.c @@ -1,4 +1,4 @@ -#include +#include #include #include #include -- cgit v1.2.3