summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2015-06-26zpool: remove zpool_evict()Dan Streetman4-36/+24
Remove zpool_evict() helper function. As zbud is currently the only zpool implementation that supports eviction, add zpool and zpool_ops references to struct zbud_pool and directly call zpool_ops->evict(zpool, handle) on eviction. Currently zpool provides the zpool_evict helper which locks the zpool list lock and searches through all pools to find the specific one matching the caller, and call the corresponding zpool_ops->evict function. However, this is unnecessary, as the zbud pool can simply keep a reference to the zpool that created it, as well as the zpool_ops, and directly call the zpool_ops->evict function, when it needs to evict a page. This avoids a spinlock and list search in zpool for each eviction. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zpool: change pr_info to pr_debugDan Streetman1-3/+3
Change the pr_info() calls to pr_debug(). There's no need for the extra verbosity in the log. Also change the msg formats to be consistent. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Kees Cook <keescook@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ganesh Mahendran <opensource.ganesh@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zswap: runtime enable/disableDan Streetman2-9/+21
Change the "enabled" parameter to be configurable at runtime. Remove the enabled check from init(), and move it to the frontswap store() function; when enabled, pages will be stored, and when disabled, pages won't be stored. This is almost identical to Seth's patch from 2 years ago: http://lkml.iu.edu/hypermail/linux/kernel/1307.2/04289.html [akpm@linux-foundation.org: tweak documentation] Signed-off-by: Dan Streetman <ddstreet@ieee.org> Suggested-by: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zram: check comp algorithm availability earlierSergey Senozhatsky3-0/+9
Improvement idea by Marcin Jabrzyk. comp_algorithm_store() silently accepts any supplied algorithm name, because zram performs algorithm availability check later, during the device configuration phase in disksize_store() and emits the following error: "zram: Cannot initialise %s compressing backend" this error line is somewhat generic and, besides, can indicate a failed attempt to allocate compression backend's working buffers. add algorithm availability check to comp_algorithm_store(): echo lzz > /sys/block/zram0/comp_algorithm -bash: echo: write error: Invalid argument Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Marcin Jabrzyk <m.jabrzyk@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zram: cut trailing newline in algorithm nameSergey Senozhatsky2-1/+9
Supplied sysfs values sometimes contain new-line symbols (echo vs. echo -n), which we also copy as a compression algorithm name. it works fine when we lookup for compression algorithm, because we use sysfs_streq() which takes care of new line symbols. however, it doesn't look nice when we print compression algorithm name if zcomp_create() failed: zram: Cannot initialise LXZ compressing backend cut trailing new-line, so the error string will look like zram: Cannot initialise LXZ compressing backend we also now can replace sysfs_streq() in zcomp_available_show() with strcmp(). Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zram: cosmetic zram_bvec_write() cleanupSergey Senozhatsky1-5/+3
`bool locked' local variable tells us if we should perform zcomp_strm_release() or not (jumped to `out' label before zcomp_strm_find() occurred), which is equivalent to `zstrm' being or not being NULL. remove `locked' and check `zstrm' instead. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zram: add dynamic device add/remove functionalitySergey Senozhatsky3-6/+141
We currently don't support on-demand device creation. The one and only way to have N zram devices is to specify num_devices module parameter (default value: 1). IOW if, for some reason, at some point, user wants to have N + 1 devies he/she must umount all the existing devices, unload the module, load the module passing num_devices equals to N + 1. And do this again, if needed. This patch introduces zram control sysfs class, which has two sysfs attrs: - hot_add -- add a new zram device - hot_remove -- remove a specific (device_id) zram device hot_add sysfs attr is read-only and has only automatic device id assignment mode (as requested by Minchan Kim). read operation performed on this attr creates a new zram device and returns back its device_id or error status. Usage example: # add a new specific zram device cat /sys/class/zram-control/hot_add 2 # remove a specific zram device echo 4 > /sys/class/zram-control/hot_remove Returning zram_add() error code back to user (-ENOMEM in this case) cat /sys/class/zram-control/hot_add cat: /sys/class/zram-control/hot_add: Cannot allocate memory NOTE, there might be users who already depend on the fact that at least zram0 device gets always created by zram_init(). Preserve this behavior. [minchan@kernel.org: use zram->claim to avoid lockdep splat] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zram: close race by open overridingSergey Senozhatsky2-19/+38
[ Original patch from Minchan Kim <minchan@kernel.org> ] Commit ba6b17d68c8e ("zram: fix umount-reset_store-mount race condition") introduced bdev->bd_mutex to protect a race between mount and reset. At that time, we don't have dynamic zram-add/remove feature so it was okay. However, as we introduce dynamic device feature, bd_mutex became trouble. CPU 0 echo 1 > /sys/block/zram<id>/reset -> kernfs->s_active(A) -> zram:reset_store->bd_mutex(B) CPU 1 echo <id> > /sys/class/zram/zram-remove ->zram:zram_remove: bd_mutex(B) -> sysfs_remove_group -> kernfs->s_active(A) IOW, AB -> BA deadlock The reason we are holding bd_mutex for zram_remove is to prevent any incoming open /dev/zram[0-9]. Otherwise, we could remove zram others already have opened. But it causes above deadlock problem. To fix the problem, this patch overrides block_device.open and it returns -EBUSY if zram asserts he claims zram to reset so any incoming open will be failed so we don't need to hold bd_mutex for zram_remove ayn more. This patch is to prepare for zram-add/remove feature. [sergey.senozhatsky@gmail.com: simplify reset_store()] Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zram: return zram device_id from zram_add()Sergey Senozhatsky1-9/+14
This patch prepares zram to enable on-demand device creation. zram_add() performs automatic device_id assignment and returns new device id (>= 0) or error code (< 0). Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zram: trivial: correct flag operations commentSergey Senozhatsky1-1/+1
We don't have meta->tb_lock anymore and use meta table entry bit_spin_lock instead. update corresponding comment. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zram: report every added and removed deviceSergey Senozhatsky1-2/+3
With dynamic device creation/removal (which will be introduced later in the series) printing num_devices in zram_init() will not make a lot of sense, as well as printing the number of destroyed devices in destroy_devices(). Print per-device action (added/removed) in zram_add() and zram_remove() instead. Example: [ 3645.259652] zram: Added device: zram5 [ 3646.152074] zram: Added device: zram6 [ 3650.585012] zram: Removed device: zram5 [ 3655.845584] zram: Added device: zram8 [ 3660.975223] zram: Removed device: zram6 Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zram: remove max_num_devices limitationSergey Senozhatsky3-14/+4
Limiting the number of zram devices to 32 (default max_num_devices value) is confusing, let's drop it. A user with 2TB or 4TB of RAM, for example, can request as many devices as he can handle. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zram: reorganize code layoutSergey Senozhatsky1-363/+362
This patch looks big, but basically it just moves code blocks. No functional changes. Our current code layout looks like a sandwitch. For example, a) between read/write handlers, we have update_used_max() helper function: static int zram_decompress_page static int zram_bvec_read static inline void update_used_max static int zram_bvec_write static int zram_bvec_rw b) RW request handlers __zram_make_request/zram_bio_discard are divided by sysfs attr reset_store() function and corresponding zram_reset_device() handler: static void zram_bio_discard static void zram_reset_device static ssize_t disksize_store static ssize_t reset_store static void __zram_make_request c) we first a bunch of sysfs read/store functions. then a number of one-liners, then helper functions, RW functions, sysfs functions, helper functions again, and so on. Reorganize layout to be more logically grouped (a brief description, `cat zram_drv.c | grep static` gives a bigger picture): -- one-liners: zram_test_flag/etc. -- helpers: is_partial_io/update_position/etc -- sysfs attr show/store functions + ZRAM_ATTR_RO() generated stats show() functions exception: reset and disksize store functions are required to be after meta() functions. because we do device create/destroy actions in these sysfs handlers. -- "mm" functions: meta get/put, meta alloc/free, page free static inline bool zram_meta_get static inline void zram_meta_put static void zram_meta_free static struct zram_meta *zram_meta_alloc static void zram_free_page -- a block of I/O functions static int zram_decompress_page static int zram_bvec_read static int zram_bvec_write static void zram_bio_discard static int zram_bvec_rw static void __zram_make_request static void zram_make_request static void zram_slot_free_notify static int zram_rw_page -- device contol: add/remove/init/reset functions (+zram-control class will sit here) static int zram_reset_device static ssize_t reset_store static ssize_t disksize_store static int zram_add static void zram_remove static int __init zram_init static void __exit zram_exit Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zram: use idr instead of `zram_devices' arraySergey Senozhatsky1-37/+49
This patch makes some preparations for on-demand device add/remove functionality. Remove `zram_devices' array and switch to id-to-pointer translation (idr). idr doesn't bloat zram struct with additional members, f.e. list_head, yet still provides ability to match the device_id with the device pointer. No user-space visible changes. [Julia.Lawall@lip6.fr: return -ENOMEM when `queue' alloc fails] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Julia Lawall <Julia.Lawall@lip6.fr> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zram: cosmetic ZRAM_ATTR_RO code formatting tweakSergey Senozhatsky1-1/+1
Fix a misplaced backslash. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zram: add `compact` sysfs entry to documentationSergey Senozhatsky1-1/+1
We currently don't support zram on-demand device creation. The only way to have N zram devices is to specify num_devices module parameter (default value 1). That means that if, for some reason, at some point, user wants to have N + 1 devies he/she must umount all the existing devices, unload the module, load the module passing num_devices equals to N + 1. This patchset introduces zram-control sysfs class, which has two sysfs attrs: - hot_add -- add a new zram device - hot_remove -- remove a specific (device_id) zram device Usage example: # add a new specific zram device cat /sys/class/zram-control/hot_add 1 # remove a specific zram device echo 4 > /sys/class/zram-control/hot_remove This patch (of 10): Briefly describe missing `compact` sysfs entry. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zsmalloc: remove obsolete ZSMALLOC_DEBUGMarcin Jabrzyk1-4/+0
The DEBUG define in zsmalloc is useless, there is no usage of it at all. Signed-off-by: Marcin Jabrzyk <m.jabrzyk@samsung.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26zram: remove obsolete ZRAM_DEBUG optionMarcin Jabrzyk2-13/+1
This config option doesn't provide any usage for zram. Signed-off-by: Marcin Jabrzyk <m.jabrzyk@samsung.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26s390/mm: change HPAGE_SHIFT type to intDominik Dingel2-2/+2
With making HPAGE_SHIFT an unsigned integer we also accidentally changed pageblock_order. In order to avoid compiler warnings we make HPAGE_SHFIT an int again. Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26s390/mm: forward check for huge pmds to pmd_large()Dominik Dingel1-4/+1
We already do the check in pmd_large, so we can just forward the call. Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26s390/hugetlb: remove dead code for sw emulated huge pagesDominik Dingel2-60/+3
We now support only hugepages on hardware with EDAT1 support. So we remove the prepare/release_hugepage hooks and simplify set_huge_pte_at and huge_ptep_get. Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26mm/hugetlb: remove arch_prepare/release_hugepage from arch headersDominik Dingel10-90/+0
Nobody used these hooks so they were removed from common code, and can now be removed from the architectures. Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: Ralf Baechle <ralf@linux-mips.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26mm/hugetlb: remove unused arch hook prepare/release_hugepageDominik Dingel1-10/+0
With s390 dropping support for emulated hugepages, the last user of arch_prepare_hugepage and arch_release_hugepage is gone. Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26s390/mm: make hugepages_supported a boot time decisionDominik Dingel3-4/+8
There is a potential bug with KVM and hugetlbfs if the hardware does not support hugepages (EDAT1). We fix this by making EDAT1 a hard requirement for hugepages and therefore removing and simplifying code. As s390, with the sw-emulated hugepages, was the only user of arch_prepare/release_hugepage I also removed theses calls from common and other architecture code. This patch (of 5): By dropping support for hugepages on machines which do not have the hardware feature EDAT1, we fix a potential s390 KVM bug. The bug would happen if a guest is backed by hugetlbfs (not supported currently), but does not get pagetables with PGSTE. This would lead to random memory overwrites. Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25Merge branch 'akpm' (patches from Andrew)Linus Torvalds151-1321/+2277
Merge first patchbomb from Andrew Morton: - a few misc things - ocfs2 udpates - kernel/watchdog.c feature work (took ages to get right) - most of MM. A few tricky bits are held up and probably won't make 4.2. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (91 commits) mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc() mm, thp: respect MPOL_PREFERRED policy with non-local node tmpfs: truncate prealloc blocks past i_size mm/memory hotplug: print the last vmemmap region at the end of hot add memory mm/mmap.c: optimization of do_mmap_pgoff function mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan mm: kmemleak: avoid deadlock on the kmemleak object insertion error path mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup() mm: kmemleak: fix delete_object_*() race when called on the same memory block mm: kmemleak: allow safe memory scanning during kmemleak disabling memcg: convert mem_cgroup->under_oom from atomic_t to int memcg: remove unused mem_cgroup->oom_wakeups frontswap: allow multiple backends x86, mirror: x86 enabling - find mirrored memory ranges mm/memblock: allocate boot time data structures from mirrored memory mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute mm: do not ignore mapping_gfp_mask in page cache allocation paths mm/cma.c: fix typos in comments mm/oom_kill.c: print points as unsigned int mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages ...
2015-06-25Merge tag 'please-pull-pstore' of ↵Linus Torvalds2-19/+39
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux Pull pstore updates from Tony Luck: "Miscellaneous pstore improvements" * tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: ramoops: make it possible to change mem_type param. pstore/ram: verify ramoops header before saving record fs/pstore: Optimization function ramoops_init_przs fs/pstore: update the backend parameter in pstore module pstore: do not use message compression without lock
2015-06-25Merge tag 'for-f2fs-4.2' of ↵Linus Torvalds31-630/+3802
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "New features: - per-file encryption (e.g., ext4) - FALLOC_FL_ZERO_RANGE - FALLOC_FL_COLLAPSE_RANGE - RENAME_WHITEOUT Major enhancement/fixes: - recovery broken superblocks - enhance f2fs_trim_fs with a discard_map - fix a race condition on dentry block allocation - fix a deadlock during summary operation - fix a missing fiemap result .. and many minor bug fixes and clean-ups were done" * tag 'for-f2fs-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (83 commits) f2fs: do not trim preallocated blocks when truncating after i_size f2fs crypto: add alloc_bounce_page f2fs crypto: fix to handle errors likewise ext4 f2fs: drop the volatile_write flag only f2fs: skip committing valid superblock f2fs: setting discard option in parse_options() f2fs: fix to return exact trimmed size f2fs: support FALLOC_FL_INSERT_RANGE f2fs: hide common code in f2fs_replace_block f2fs: disable the discard option when device doesn't support f2fs crypto: remove alloc_page for bounce_page f2fs: fix a deadlock for summary page lock vs. sentry_lock f2fs crypto: clean up error handling in f2fs_fname_setup_filename f2fs crypto: avoid f2fs_inherit_context for symlink f2fs crypto: do not set encryption policy for non-directory by ioctl f2fs crypto: allow setting encryption policy once f2fs crypto: check context consistent for rename2 f2fs: avoid duplicated code by reusing f2fs_read_end_io f2fs crypto: use per-inode tfm structure f2fs: recovering broken superblock during mount ...
2015-06-25Merge branch 'for_linus' of ↵Linus Torvalds8-60/+124
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull UDF fixes and cleanups from Jan Kara: "The contains some small fixes and improvements in error handling for UDF. Bundled is also one ext3 coding style fix and a fix in quota documentation" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: fix udf_load_pvoldesc() udf: remove double err declaration in udf_file_write_iter() UDF: support NFSv2 export fs: ext3: super: fixed a space coding style issue quota: Update documentation udf: Return error from udf_find_entry() udf: Make udf_get_filename() return error instead of 0 length file name udf: bug on exotic flag in udf_get_filename() udf: improve error management in udf_CS0toNLS() udf: improve error management in udf_CS0toUTF8() udf: unicode: update function name in comments udf: remove unnecessary test in udf_build_ustr_exact() udf: Return -ENOMEM when allocation fails in udf_get_filename()
2015-06-25Merge tag 'docs-for-linus' of git://git.lwn.net/linux-2.6Linus Torvalds88-79/+1957
Pull documentation updates from Jonathan Corbet: "The main thing here is Ingo's big subdirectory documenting feature support for each architecture. Beyond that, it's the usual pile of fixes, tweaks, and small additions" * tag 'docs-for-linus' of git://git.lwn.net/linux-2.6: (79 commits) doc:md: fix typo in md.txt. Documentation/mic/mpssd: don't build x86 userspace when cross compiling Documentation/prctl: don't build tsc tests when cross compiling Documentation/vDSO: don't build tests when cross compiling Doc:ABI/testing: Fix typo in sysfs-bus-fcoe Doc: Docbook: Change wikipedia's URL from http to https in scsi.tmpl Doc: Change wikipedia's URL from http to https Documentation/kernel-parameters: add missing pciserial to the earlyprintk Doc:pps: Fix typo in pps.txt kbuild : Fix documentation of INSTALL_HDR_PATH Documentation: filesystems: updated struct file_operations documentation in vfs.txt kbuild: edit explanation of clean-files variable Doc: ja_JP: Fix typo in HOWTO Move freefall program from Documentation/ to tools/ Documentation: ARM: EXYNOS: Describe boot loaders interface Doc:nfc: Fix typo in nfc-hci.txt vfs: Minor documentation fix Doc: networking: txtimestamp: fix printf format warning Documentation, intel_pstate: Improve legacy mode internal governors description Documentation: extend use case for EXPORT_SYMBOL_GPL() ...
2015-06-25Merge branch 'for-linus' of ↵Linus Torvalds54-375/+1278
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input subsystem updates from Dmitry Torokhov: "Thanks to Samuel Thibault input device (keyboard) LEDs are no longer hardwired within the input core but use LED subsystem and so allow use of different triggers; Hans de Goede did a large update for the ALPS touchpad driver; we have new TI drv2665 haptics driver and DA9063 OnKey driver, and host of other drivers got various fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (55 commits) Input: pixcir_i2c_ts - fix receive error MAINTAINERS: remove non existent input mt git tree Input: improve usage of gpiod API tty/vt/keyboard: define LED triggers for VT keyboard lock states tty/vt/keyboard: define LED triggers for VT LED states Input: export LEDs as class devices in sysfs Input: cyttsp4 - use swap() in cyttsp4_get_touch() Input: goodix - do not explicitly set evbits in input device Input: goodix - export id and version read from device Input: goodix - fix variable length array warning Input: goodix - fix alignment issues Input: add OnKey driver for DA9063 MFD part Input: elan_i2c - add product IDs FW names Input: elan_i2c - add support for multi IC type and iap format Input: focaltech - report finger width to userspace tty: remove platform_sysrq_reset_seq Input: synaptics_i2c - use proper boolean values Input: psmouse - use true instead of 1 for boolean values Input: cyapa - fix a few typos in comments Input: stmpe-ts - enforce device tree only mode ...
2015-06-25Merge tag 'edac_for_4.2_2' of ↵Linus Torvalds27-377/+2175
git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp Pull EDAC updates from Borislav Petkov: - New APM X-Gene SoC EDAC driver (Loc Ho) - AMD error injection module improvements (Aravind Gopalakrishnan) - Altera Arria 10 support (Thor Thayer) - misc fixes and cleanups all over the place * tag 'edac_for_4.2_2' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (28 commits) EDAC: Update Documentation/edac.txt EDAC: Fix typos in Documentation/edac.txt EDAC, mce_amd_inj: Set MISCV on injection EDAC, mce_amd_inj: Move bit preparations before the injection EDAC, mce_amd_inj: Cleanup and simplify README EDAC, altera: Do not allow suspend when EDAC is enabled EDAC, mce_amd_inj: Make inj_type static arm: socfpga: dts: Add Arria10 SDRAM EDAC DTS support EDAC, altera: Add Arria10 EDAC support EDAC, altera: Refactor for Altera CycloneV SoC EDAC, altera: Generalize driver to use DT Memory size EDAC, mce_amd_inj: Add README file EDAC, mce_amd_inj: Add individual permissions field to dfs_node EDAC, mce_amd_inj: Modify flags attribute to use string arguments EDAC, mce_amd_inj: Read out number of MCE banks from the hardware EDAC, mce_amd_inj: Use MCE_INJECT_GET macro for bank node too EDAC, xgene: Fix cpuid abuse EDAC, mpc85xx: Extend error address to 64 bit EDAC, mpc8xxx: Adapt for FSL SoC EDAC, edac_stub: Drop arch-specific include ...
2015-06-25Merge tag 'pinctrl-v4.2-1' of ↵Linus Torvalds113-843/+18261
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control updates from Linus Walleij: "Here is the bulk of pin control changes for the v4.2 series: Quite a lot of new SoC subdrivers and two new main drivers this time, apart from that business as usual. Details: Core functionality: - Enable exclusive pin ownership: it is possible to flag a pin controller so that GPIO and other functions cannot use a single pin simultaneously. New drivers: - NXP LPC18xx System Control Unit pin controller - Imagination Pistachio SoC pin controller New subdrivers: - Freescale i.MX7d SoC - Intel Sunrisepoint-H PCH - Renesas PFC R8A7793 - Renesas PFC R8A7794 - Mediatek MT6397, MT8127 - SiRF Atlas 7 - Allwinner A33 - Qualcomm MSM8660 - Marvell Armada 395 - Rockchip RK3368 Cleanups: - A big cleanup of the Marvell MVEBU driver rectifying it to correspond to reality - Drop platform device probing from the SH PFC driver, we are now a DT only shop for SuperH - Drop obsolte multi-platform check for SH PFC - Various janitorial: constification, grammar etc Improvements: - The AT91 GPIO portions now supports the set_multiple() feature - Split out SPI pins on the Xilinx Zynq - Support DTs without specific function nodes in the i.MX driver" * tag 'pinctrl-v4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (99 commits) pinctrl: rockchip: add support for the rk3368 pinctrl: rockchip: generalize perpin driver-strength setting pinctrl: sh-pfc: r8a7794: add SDHI pin groups pinctrl: sh-pfc: r8a7794: add MMCIF pin groups pinctrl: sh-pfc: add R8A7794 PFC support pinctrl: make pinctrl_register() return proper error code pinctrl: mvebu: armada-39x: add support for Armada 395 variant pinctrl: mvebu: armada-39x: add missing SATA functions pinctrl: mvebu: armada-39x: add missing PCIe functions pinctrl: mvebu: armada-38x: add ptp functions pinctrl: mvebu: armada-38x: add ua1 functions pinctrl: mvebu: armada-38x: add nand functions pinctrl: mvebu: armada-38x: add sata functions pinctrl: mvebu: armada-xp: add dram functions pinctrl: mvebu: armada-xp: add nand rb function pinctrl: mvebu: armada-xp: add spi1 function pinctrl: mvebu: armada-39x: normalize ref clock naming pinctrl: mvebu: armada-xp: rename spi to spi0 pinctrl: mvebu: armada-370: align spi1 clock pin naming pinctrl: mvebu: armada-370: align VDD cpu-pd pin naming with datasheet ...
2015-06-25Merge tag 'backlight-for-linus-4.2' of ↵Linus Torvalds6-22/+22
git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight Pull backlight updates from Lee Jones: "Changes to existing drivers: - supply MODULE_DEVICE_TABLE() to ensure probing - constify struct; da9052_bl - enable compile test; lcd_l4f00242t03, lcd_lms283fg05, backlight_gpio - suspend/resume bugfix; lp855x_bl - devm_gpiod_get_optional() API fixup; pwm_bl - error handling fixup; backlight" * tag 'backlight-for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight: backlight: Change the return type of backlight_update_status() to int backlight: pwm_bl: Simplify usage of devm_gpiod_get_optional backlight: lp855x: Don't clear level on suspend/blank backlight: Allow compile test of GPIO consumers if !GPIOLIB video: backlight: da9052: Constify platform_device_id gpio-backlight: Discover driver during boot time
2015-06-25mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()Larry Finger3-7/+10
Beginning at commit d52d3997f843 ("ipv6: Create percpu rt6_info"), the following INFO splat is logged: =============================== [ INFO: suspicious RCU usage. ] 4.1.0-rc7-next-20150612 #1 Not tainted ------------------------------- kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 3 locks held by systemd/1: #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff815f0c8f>] rtnetlink_rcv+0x1f/0x40 #1: (rcu_read_lock_bh){......}, at: [<ffffffff816a34e2>] ipv6_add_addr+0x62/0x540 #2: (addrconf_hash_lock){+...+.}, at: [<ffffffff816a3604>] ipv6_add_addr+0x184/0x540 stack backtrace: CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1 Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20 04/17/2014 Call Trace: dump_stack+0x4c/0x6e lockdep_rcu_suspicious+0xe7/0x120 ___might_sleep+0x1d5/0x1f0 __might_sleep+0x4d/0x90 kmem_cache_alloc+0x47/0x250 create_object+0x39/0x2e0 kmemleak_alloc_percpu+0x61/0xe0 pcpu_alloc+0x370/0x630 Additional backtrace lines are truncated. In addition, the above splat is followed by several "BUG: sleeping function called from invalid context at mm/slub.c:1268" outputs. As suggested by Martin KaFai Lau, these are the clue to the fix. Routine kmemleak_alloc_percpu() always uses GFP_KERNEL for its allocations, whereas it should follow the gfp from its callers. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@vger.kernel.org> [3.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25mm, thp: respect MPOL_PREFERRED policy with non-local nodeVlastimil Babka1-16/+22
Since commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node"), we handle THP allocations on page fault in a special way - for non-interleave memory policies, the allocation is only attempted on the node local to the current CPU, if the policy's nodemask allows the node. This is motivated by the assumption that THP benefits cannot offset the cost of remote accesses, so it's better to fallback to base pages on the local node (which might still be available, while huge pages are not due to fragmentation) than to allocate huge pages on a remote node. The nodemask check prevents us from violating e.g. MPOL_BIND policies where the local node is not among the allowed nodes. However, the current implementation can still give surprising results for the MPOL_PREFERRED policy when the preferred node is different than the current CPU's local node. In such case we should honor the preferred node and not use the local node, which is what this patch does. If hugepage allocation on the preferred node fails, we fall back to base pages and don't try other nodes, with the same motivation as is done for the local node hugepage allocations. The patch also moves the MPOL_INTERLEAVE check around to simplify the hugepage specific test. The difference can be demonstrated using in-tree transhuge-stress test on the following 2-node machine where half memory on one node was occupied to show the difference. > numactl --hardware available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35 node 0 size: 7878 MB node 0 free: 3623 MB node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47 node 1 size: 8045 MB node 1 free: 7818 MB node distances: node 0 1 0: 10 21 1: 21 10 Before the patch: > numactl -p0 -C0 ./transhuge-stress transhuge-stress: 2.197 s/loop, 0.276 ms/page, 7249.168 MiB/s 7962 succeed, 0 failed, 1786 different pages > numactl -p0 -C12 ./transhuge-stress transhuge-stress: 2.962 s/loop, 0.372 ms/page, 5376.172 MiB/s 7962 succeed, 0 failed, 3873 different pages Number of successful THP allocations corresponds to free memory on node 0 in the first case and node 1 in the second case, i.e. -p parameter is ignored and cpu binding "wins". After the patch: > numactl -p0 -C0 ./transhuge-stress transhuge-stress: 2.183 s/loop, 0.274 ms/page, 7295.516 MiB/s 7962 succeed, 0 failed, 1760 different pages > numactl -p0 -C12 ./transhuge-stress transhuge-stress: 2.878 s/loop, 0.361 ms/page, 5533.638 MiB/s 7962 succeed, 0 failed, 1750 different pages > numactl -p1 -C0 ./transhuge-stress transhuge-stress: 4.628 s/loop, 0.581 ms/page, 3440.893 MiB/s 7962 succeed, 0 failed, 3918 different pages The -p parameter is respected regardless of cpu binding. > numactl -C0 ./transhuge-stress transhuge-stress: 2.202 s/loop, 0.277 ms/page, 7230.003 MiB/s 7962 succeed, 0 failed, 1750 different pages > numactl -C12 ./transhuge-stress transhuge-stress: 3.020 s/loop, 0.379 ms/page, 5273.324 MiB/s 7962 succeed, 0 failed, 3916 different pages Without -p parameter, hugepage restriction to CPU-local node works as before. Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25tmpfs: truncate prealloc blocks past i_sizeJosef Bacik1-1/+1
One of the rocksdb people noticed that when you do something like this fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 10M) pwrite(fd, buf, 5M, 0) ftruncate(5M) on tmpfs, the file would still take up 10M: which led to super fun issues because we were getting ENOSPC before we thought we should be getting ENOSPC. This patch fixes the problem, and mirrors what all the other fs'es do (and was agreed to be the correct behaviour at LSF). I tested it locally to make sure it worked properly with the following xfs_io -f -c "falloc -k 0 10M" -c "pwrite 0 5M" -c "truncate 5M" file Without the patch we have "Blocks: 20480", with the patch we have the correct value of "Blocks: 10240". Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25mm/memory hotplug: print the last vmemmap region at the end of hot add memoryZhu Guihua1-0/+1
When hot add two nodes continuously, we found the vmemmap region info is a bit messed. The last region of node 2 is printed when node 3 hot added, like the following: Initmem setup node 2 [mem 0x0000000000000000-0xffffffffffffffff] On node 2 totalpages: 0 Built 2 zonelists in Node order, mobility grouping on. Total pages: 16090539 Policy zone: Normal init_memory_mapping: [mem 0x40000000000-0x407ffffffff] [mem 0x40000000000-0x407ffffffff] page 1G [ffffea1000000000-ffffea10001fffff] PMD -> [ffff8a077d800000-ffff8a077d9fffff] on node 2 [ffffea1000200000-ffffea10003fffff] PMD -> [ffff8a077de00000-ffff8a077dffffff] on node 2 ... [ffffea101f600000-ffffea101f9fffff] PMD -> [ffff8a074ac00000-ffff8a074affffff] on node 2 [ffffea101fa00000-ffffea101fdfffff] PMD -> [ffff8a074a800000-ffff8a074abfffff] on node 2 Initmem setup node 3 [mem 0x0000000000000000-0xffffffffffffffff] On node 3 totalpages: 0 Built 3 zonelists in Node order, mobility grouping on. Total pages: 16090539 Policy zone: Normal init_memory_mapping: [mem 0x60000000000-0x607ffffffff] [mem 0x60000000000-0x607ffffffff] page 1G [ffffea101fe00000-ffffea101fffffff] PMD -> [ffff8a074a400000-ffff8a074a5fffff] on node 2 <=== node 2 ??? [ffffea1800000000-ffffea18001fffff] PMD -> [ffff8a074a600000-ffff8a074a7fffff] on node 3 [ffffea1800200000-ffffea18005fffff] PMD -> [ffff8a074a000000-ffff8a074a3fffff] on node 3 [ffffea1800600000-ffffea18009fffff] PMD -> [ffff8a0749c00000-ffff8a0749ffffff] on node 3 ... The cause is the last region was missed at the and of hot add memory, and p_start, p_end, node_start were not reset, so when hot add memory to a new node, it will consider they are not contiguous blocks and print the previous one. So we print the last vmemmap region at the end of hot add memory to avoid the confusion. Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25mm/mmap.c: optimization of do_mmap_pgoff functionPiotr Kwapulinski1-3/+3
The simple check for zero length memory mapping may be performed earlier. So that in case of zero length memory mapping some unnecessary code is not executed at all. It does not make the code less readable and saves some CPU cycles. Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scanCatalin Marinas1-34/+56
The kmemleak memory scanning uses finer grained object->lock spinlocks primarily to avoid races with the memory block freeing. However, the pointer lookup in the rb tree requires the kmemleak_lock to be held. This is currently done in the find_and_get_object() function for each pointer-like location read during scanning. While this allows a low latency on kmemleak_*() callbacks on other CPUs, the memory scanning is slower. This patch moves the kmemleak_lock outside the scan_block() loop, acquiring/releasing it only once per scanned memory block. The allow_resched logic is moved outside scan_block() and a new scan_large_block() function is implemented which splits large blocks in MAX_SCAN_SIZE chunks with cond_resched() calls in-between. A redundant (object->flags & OBJECT_NO_SCAN) check is also removed from scan_object(). With this patch, the kmemleak scanning performance is significantly improved: at least 50% with lock debugging disabled and over an order of magnitude with lock proving enabled (on an arm64 system). Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25mm: kmemleak: avoid deadlock on the kmemleak object insertion error pathCatalin Marinas1-4/+11
While very unlikely (usually kmemleak or sl*b bug), the create_object() function in mm/kmemleak.c may fail to insert a newly allocated object into the rb tree. When this happens, kmemleak disables itself and prints additional information about the object already found in the rb tree. Such printing is done with the parent->lock acquired, however the kmemleak_lock is already held. This is a potential race with the scanning thread which acquires object->lock and kmemleak_lock in a This patch removes the locking around the 'parent' object information printing. Such object cannot be freed or removed from object_tree_root and object_list since kmemleak_lock is already held. There is a very small risk that some of the object data is being modified on another CPU but the only downside is inconsistent information printing. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup()Catalin Marinas1-2/+0
The kmemleak_do_cleanup() work thread already waits for the kmemleak_scan thread to finish via kthread_stop(). Waiting in kthread_stop() while scan_mutex is held may lead to deadlock if kmemleak_scan_thread() also waits to acquire for scan_mutex. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25mm: kmemleak: fix delete_object_*() race when called on the same memory blockCatalin Marinas1-13/+26
Calling delete_object_*() on the same pointer is not a standard use case (unless there is a bug in the code calling kmemleak_free()). However, during kmemleak disabling (error or user triggered via /sys), there is a potential race between kmemleak_free() calls on a CPU and __kmemleak_do_cleanup() on a different CPU. The current delete_object_*() implementation first performs a look-up holding kmemleak_lock, increments the object->use_count and then re-acquires kmemleak_lock to remove the object from object_tree_root and object_list. This patch simplifies the delete_object_*() mechanism to both look up and remove an object from the object_tree_root and object_list atomically (guarded by kmemleak_lock). This allows safe concurrent calls to delete_object_*() on the same pointer without additional locking for synchronising the kmemleak_free_enabled flag. A side effect is a slight improvement in the delete_object_*() performance by avoiding acquiring kmemleak_lock twice and incrementing/decrementing object->use_count. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25mm: kmemleak: allow safe memory scanning during kmemleak disablingCatalin Marinas1-3/+16
The kmemleak scanning thread can run for minutes. Callbacks like kmemleak_free() are allowed during this time, the race being taken care of by the object->lock spinlock. Such lock also prevents a memory block from being freed or unmapped while it is being scanned by blocking the kmemleak_free() -> ... -> __delete_object() function until the lock is released in scan_object(). When a kmemleak error occurs (e.g. it fails to allocate its metadata), kmemleak_enabled is set and __delete_object() is no longer called on freed objects. If kmemleak_scan is running at the same time, kmemleak_free() no longer waits for the object scanning to complete, allowing the corresponding memory block to be freed or unmapped (in the case of vfree()). This leads to kmemleak_scan potentially triggering a page fault. This patch separates the kmemleak_free() enabling/disabling from the overall kmemleak_enabled nob so that we can defer the disabling of the object freeing tracking until the scanning thread completed. The kmemleak_free_part() is deliberately ignored by this patch since this is only called during boot before the scanning thread started. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org> Tested-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25memcg: convert mem_cgroup->under_oom from atomic_t to intTejun Heo1-8/+21
memcg->under_oom tracks whether the memcg is under OOM conditions and is an atomic_t counter managed with mem_cgroup_[un]mark_under_oom(). While atomic_t appears to be simple synchronization-wise, when used as a synchronization construct like here, it's trickier and more error-prone due to weak memory ordering rules, especially around atomic_read(), and false sense of security. For example, both non-trivial read sites of memcg->under_oom are a bit problematic although not being actually broken. * mem_cgroup_oom_register_event() It isn't explicit what guarantees the memory ordering between event addition and memcg->under_oom check. This isn't broken only because memcg_oom_lock is used for both event list and memcg->oom_lock. * memcg_oom_recover() The lockless test doesn't have any explanation why this would be safe. mem_cgroup_[un]mark_under_oom() are very cold paths and there's no point in avoiding locking memcg_oom_lock there. This patch converts memcg->under_oom from atomic_t to int, puts their modifications under memcg_oom_lock and documents why the lockless test in memcg_oom_recover() is safe. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25memcg: remove unused mem_cgroup->oom_wakeupsTejun Heo1-9/+1
Since commit 4942642080ea ("mm: memcg: handle non-error OOM situations more gracefully"), nobody uses mem_cgroup->oom_wakeups. Remove it. While at it, also fold memcg_wakeup_oom() into memcg_oom_recover() which is its only user. This cleanup was suggested by Michal. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25frontswap: allow multiple backendsDan Streetman3-98/+139
Change frontswap single pointer to a singly linked list of frontswap implementations. Update Xen tmem implementation as register no longer returns anything. Frontswap only keeps track of a single implementation; any implementation that registers second (or later) will replace the previously registered implementation, and gets a pointer to the previous implementation that the new implementation is expected to pass all frontswap functions to if it can't handle the function itself. However that method doesn't really make much sense, as passing that work on to every implementation adds unnecessary work to implementations; instead, frontswap should simply keep a list of all registered implementations and try each implementation for any function. Most importantly, neither of the two currently existing frontswap implementations in the kernel actually do anything with any previous frontswap implementation that they replace when registering. This allows frontswap to successfully manage multiple implementations by keeping a list of them all. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25x86, mirror: x86 enabling - find mirrored memory rangesTony Luck3-0/+27
UEFI GetMemoryMap() uses a new attribute bit to mark mirrored memory address ranges. See UEFI 2.5 spec pages 157-158: http://www.uefi.org/sites/default/files/resources/UEFI%202_5.pdf On EFI enabled systems scan the memory map and tell memblock about any mirrored ranges. Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Xiexiuqi <xiexiuqi@huawei.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25mm/memblock: allocate boot time data structures from mirrored memoryTony Luck3-12/+84
Try to allocate all boot time kernel data structures from mirrored memory. If we run out of mirrored memory print warnings, but fall back to using non-mirrored memory to make sure that we still boot. By number of bytes, most of what we allocate at boot time is the page structures. 64 bytes per 4K page on x86_64 ... or about 1.5% of total system memory. For workloads where the bulk of memory is allocated to applications this may represent a useful improvement to system availability since 1.5% of total memory might be a third of the memory allocated to the kernel. Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Xiexiuqi <xiexiuqi@huawei.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25mm/memblock: add extra "flags" to memblock to allow selection of memory ↵Tony Luck10-47/+83
based on attribute Some high end Intel Xeon systems report uncorrectable memory errors as a recoverable machine check. Linux has included code for some time to process these and just signal the affected processes (or even recover completely if the error was in a read only page that can be replaced by reading from disk). But we have no recovery path for errors encountered during kernel code execution. Except for some very specific cases were are unlikely to ever be able to recover. Enter memory mirroring. Actually 3rd generation of memory mirroing. Gen1: All memory is mirrored Pro: No s/w enabling - h/w just gets good data from other side of the mirror Con: Halves effective memory capacity available to OS/applications Gen2: Partial memory mirror - just mirror memory begind some memory controllers Pro: Keep more of the capacity Con: Nightmare to enable. Have to choose between allocating from mirrored memory for safety vs. NUMA local memory for performance Gen3: Address range partial memory mirror - some mirror on each memory controller Pro: Can tune the amount of mirror and keep NUMA performance Con: I have to write memory management code to implement The current plan is just to use mirrored memory for kernel allocations. This has been broken into two phases: 1) This patch series - find the mirrored memory, use it for boot time allocations 2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the unused mirrored memory from mm/memblock.c and only give it out to select kernel allocations (this is still being scoped because page_alloc.c is scary). This patch (of 3): Add extra "flags" to memblock to allow selection of memory based on attribute. No functional changes Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Xiexiuqi <xiexiuqi@huawei.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25mm: do not ignore mapping_gfp_mask in page cache allocation pathsMichal Hocko3-5/+7
page_cache_read, do_generic_file_read, __generic_file_splice_read and __ntfs_grab_cache_pages currently ignore mapping_gfp_mask when calling add_to_page_cache_lru which might cause recursion into fs down in the direct reclaim path if the mapping really relies on GFP_NOFS semantic. This doesn't seem to be the case now because page_cache_read (page fault path) doesn't seem to suffer from the reclaim recursion issues and do_generic_file_read and __generic_file_splice_read also shouldn't be called under fs locks which would deadlock in the reclaim path. Anyway it is better to obey mapping gfp mask and prevent from later breakage. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Neil Brown <neilb@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Anton Altaparmakov <anton@tuxera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>