summaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)AuthorFilesLines
2014-08-11Merge branch 'for-linus' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Stuff in here: - acct.c fixes and general rework of mnt_pin mechanism. That allows to go for delayed-mntput stuff, which will permit mntput() on deep stack without worrying about stack overflows - fs shutdown will happen on shallow stack. IOW, we can do Eric's umount-on-rmdir series without introducing tons of stack overflows on new mntput() call chains it introduces. - Bruce's d_splice_alias() patches - more Miklos' rename() stuff. - a couple of regression fixes (stable fodder, in the end of branch) and a fix for API idiocy in iov_iter.c. There definitely will be another pile, maybe even two. I'd like to get Eric's series in this time, but even if we miss it, it'll go right in the beginning of for-next in the next cycle - the tricky part of prereqs is in this pile" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits) fix copy_tree() regression __generic_file_write_iter(): fix handling of sync error after DIO switch iov_iter_get_pages() to passing maximal number of pages fs: mark __d_obtain_alias static dcache: d_splice_alias should detect loops exportfs: update Exporting documentation dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED dcache: remove unused d_find_alias parameter dcache: d_obtain_alias callers don't all want DISCONNECTED dcache: d_splice_alias should ignore DCACHE_DISCONNECTED dcache: d_splice_alias mustn't create directory aliases dcache: close d_move race in d_splice_alias dcache: move d_splice_alias namei: trivial fix to vfs_rename_dir comment VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode. cifs: support RENAME_NOREPLACE hostfs: support rename flags shmem: support RENAME_EXCHANGE shmem: support RENAME_NOREPLACE btrfs: add RENAME_NOREPLACE ...
2014-08-07fs: call rename2 if existsMiklos Szeredi1-1/+0
Christoph Hellwig suggests: 1) make vfs_rename call ->rename2 if it exists instead of ->rename 2) switch all filesystems that you're adding NOREPLACE support for to use ->rename2 3) see how many ->rename instances we'll have left after a few iterations of 2. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-07-31ext4: fix ext4_discard_allocated_blocks() if we can't allocate the pa structTheodore Ts'o1-1/+20
If there is a failure while allocating the preallocation structure, a number of blocks can end up getting marked in the in-memory buddy bitmap, and then not getting released. This can result in the following corruption getting reported by the kernel: EXT4-fs error (device sda3): ext4_mb_generate_buddy:758: group 1126, 12793 clusters in bitmap, 12729 in gd In that case, we need to release the blocks using mb_free_blocks(). Tested: fs smoke test; also demonstrated that with injected errors, the file system is no longer getting corrupted Google-Bug-Id: 16657874 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-07-29ext4: fix COLLAPSE RANGE test for bigalloc file systemsNamjae Jeon1-5/+2
Blocks in collapse range should be collapsed per cluster unit when bigalloc is enable. If bigalloc is not enable, EXT4_CLUSTER_SIZE will be same with EXT4_BLOCK_SIZE. With this bug fixed, patch enables COLLAPSE_RANGE for bigalloc, which fixes a large number of xfstest failures which use fsx. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-28ext4: check inline directory before convertingDarrick J. Wong3-0/+39
Before converting an inline directory to a regular directory, check the directory entries to make sure they're not obviously broken. This helps us to avoid a BUG_ON if one of the dirents is trashed. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2014-07-28ext4: fix incorrect locking in move_extent_per_pageDmitry Monakhov1-1/+2
If we have to copy data we must drop i_data_sem because of get_blocks() will be called inside mext_page_mkuptodate(), but later we must reacquire it again because we are about to change extent's tree Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-07-28ext4: use correct depth valueDmitry Monakhov1-1/+1
Inode's depth can be changed from here: ext4_ext_try_to_merge() ->ext4_ext_try_to_merge_up() We must use correct value. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-07-28ext4: add i_data_sem sanity checkDmitry Monakhov2-0/+9
Each caller of ext4_ext_dirty must hold i_data_sem, The only exception is migration code, let's make it convenient. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-07-28ext4: fix wrong size computation in ext4_mb_normalize_request()Xiaoguang Wang1-2/+3
As the member fe_len defined in struct ext4_free_extent is expressed as number of clusters, the variable "size" computation is wrong, we need to first translate fe_len to block number, then to bytes. Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-15ext4: make ext4_has_inline_data() as a inline functionZheng Liu2-7/+6
Now ext4_has_inline_data() is used in wide spread codepaths. So we need to make it as a inline function to avoid burning some CPU cycles. Change in text size: text data bss dec hex filename before: 326110 19258 5528 350896 55ab0 fs/ext4/ext4.o after: 326227 19258 5528 351013 55b25 fs/ext4/ext4.o I use the following script to measure the CPU usage. #!/bin/bash shm_base='/dev/shm' img=${shm_base}/ext4-img mnt=/mnt/loop e2fsprgs_base=$HOME/e2fsprogs mkfs=${e2fsprgs_base}/misc/mke2fs fsck=${e2fsprgs_base}/e2fsck/e2fsck sudo umount $mnt dd if=/dev/zero of=$img bs=4k count=3145728 ${mkfs} -t ext4 -O inline_data -F $img sudo mount -t ext4 -o loop $img $mnt # start testing... testdir="${mnt}/testdir" mkdir $testdir cd $testdir echo "start testing..." for ((cnt=0;cnt<100;cnt++)); do for ((i=0;i<5;i++)); do for ((j=0;j<5;j++)); do for ((k=0;k<5;k++)); do for ((l=0;l<5;l++)); do mkdir -p $i/$j/$k/$l echo "$i-$j-$k-$l" > $i/$j/$k/$l/testfile done done done done ls -R $testdir > /dev/null rm -rf $testdir/* done The result of `perf top -G -U` is as below. vanilla: 13.92% [ext4] [k] ext4_do_update_inode 9.36% [ext4] [k] __ext4_get_inode_loc 4.07% [ext4] [k] ftrace_define_fields_ext4_writepages 3.83% [ext4] [k] __ext4_handle_dirty_metadata 3.42% [ext4] [k] ext4_get_inode_flags 2.71% [ext4] [k] ext4_mark_iloc_dirty 2.46% [ext4] [k] ftrace_define_fields_ext4_direct_IO_enter 2.26% [ext4] [k] ext4_get_inode_loc 2.22% [ext4] [k] ext4_has_inline_data [...] After applied the patch, we don't see ext4_has_inline_data() because it has been inlined and perf couldn't sample it. Although it doesn't mean that the CPU cycles can be saved but at least the overhead of function calls can be eliminated. So IMHO we'd better inline this function. Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15ext4: remove readpage() check in ext4_mmap_file()Zhang Zhen1-4/+0
There is no kind of file which does not supply a page reading function. Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15ext4: fix punch hole on files with indirect mappingLukas Czerner3-82/+205
Currently punch hole code on files with direct/indirect mapping has some problems which may lead to a data loss. For example (from Jan Kara): fallocate -n -p 10240000 4096 will punch the range 10240000 - 12632064 instead of the range 1024000 - 10244096. Also the code is a bit weird and it's not using infrastructure provided by indirect.c, but rather creating it's own way. This patch fixes the issues as well as making the operation to run 4 times faster from my testing (punching out 60GB file). It uses similar approach used in ext4_ind_truncate() which takes advantage of ext4_free_branches() function. Also rename the ext4_free_hole_blocks() to something more sensible, like the equivalent we have for extent mapped files. Call it ext4_ind_remove_space(). This has been tested mostly with fsx and some xfstests which are testing punch hole but does not require unwritten extents which are not supported with direct/indirect mapping. Not problems showed up even with 1024k block size. CC: stable@vger.kernel.org Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15ext4: remove metadata reservation checksTheodore Ts'o5-141/+7
Commit 27dd43854227b ("ext4: introduce reserved space") reserves 2% of the file system space to make sure metadata allocations will always succeed. Given that, tracking the reservation of metadata blocks is no longer necessary. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15ext4: rearrange initialization to fix EXT4FS_DEBUGTheodore Ts'o1-49/+39
The EXT4FS_DEBUG is a *very* developer specific #ifdef designed for ext4 developers only. (You have to modify fs/ext4/ext4.h to enable it.) Rearrange how we initialize data structures to avoid calling ext4_count_free_clusters() until the multiblock allocator has been initialized. This also allows us to only call ext4_count_free_clusters() once, and simplifies the code somewhat. (Thanks to Chen Gang <gang.chen.5i5j@gmail.com> for pointing out a !CONFIG_SMP compile breakage in the original patch.) Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-13ext4: fix potential null pointer dereference in ext4_free_inodeNamjae Jeon1-1/+1
Fix potential null pointer dereferencing problem caused by e43bb4e612 ("ext4: decrement free clusters/inodes counters when block group declared bad") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-12ext4: fix a potential deadlock in __ext4_es_shrink()Theodore Ts'o1-2/+2
This fixes the following lockdep complaint: [ INFO: possible circular locking dependency detected ] 3.16.0-rc2-mm1+ #7 Tainted: G O ------------------------------------------------------- kworker/u24:0/4356 is trying to acquire lock: (&(&sbi->s_es_lru_lock)->rlock){+.+.-.}, at: [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0 but task is already holding lock: (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180 which lock already depends on the new lock. Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ei->i_es_lock); lock(&(&sbi->s_es_lru_lock)->rlock); lock(&ei->i_es_lock); lock(&(&sbi->s_es_lru_lock)->rlock); *** DEADLOCK *** 6 locks held by kworker/u24:0/4356: #0: ("writeback"){.+.+.+}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560 #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560 #2: (&type->s_umount_key#22){++++++}, at: [<ffffffff811a9c74>] grab_super_passive+0x44/0x90 #3: (jbd2_handle){+.+...}, at: [<ffffffff812979f9>] start_this_handle+0x189/0x5f0 #4: (&ei->i_data_sem){++++..}, at: [<ffffffff81247062>] ext4_map_blocks+0x132/0x550 #5: (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180 stack backtrace: CPU: 0 PID: 4356 Comm: kworker/u24:0 Tainted: G O 3.16.0-rc2-mm1+ #7 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: writeback bdi_writeback_workfn (flush-253:0) ffffffff8213dce0 ffff880014b07538 ffffffff815df0bb 0000000000000007 ffffffff8213e040 ffff880014b07588 ffffffff815db3dd ffff880014b07568 ffff880014b07610 ffff88003b868930 ffff88003b868908 ffff88003b868930 Call Trace: [<ffffffff815df0bb>] dump_stack+0x4e/0x68 [<ffffffff815db3dd>] print_circular_bug+0x1fb/0x20c [<ffffffff810a7a3e>] __lock_acquire+0x163e/0x1d00 [<ffffffff815e89dc>] ? retint_restore_args+0xe/0xe [<ffffffff815ddc7b>] ? __slab_alloc+0x4a8/0x4ce [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0 [<ffffffff810a8707>] lock_acquire+0x87/0x120 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0 [<ffffffff8128592d>] ? ext4_es_free_extent+0x5d/0x70 [<ffffffff815e6f09>] _raw_spin_lock+0x39/0x50 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0 [<ffffffff8119760b>] ? kmem_cache_alloc+0x18b/0x1a0 [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0 [<ffffffff812869b8>] ext4_es_insert_extent+0xc8/0x180 [<ffffffff812470f4>] ext4_map_blocks+0x1c4/0x550 [<ffffffff8124c4c4>] ext4_writepages+0x6d4/0xd00 ... Reported-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Minchan Kim <minchan@kernel.org> Cc: stable@vger.kernel.org Cc: Zheng Liu <gnehzuil.liu@gmail.com>
2014-07-11ext4: revert commit which was causing fs corruption after journal replaysTheodore Ts'o1-27/+24
Commit 007649375f6af2 ("ext4: initialize multi-block allocator before checking block descriptors") causes the block group descriptor's count of the number of free blocks to become inconsistent with the number of free blocks in the allocation bitmap. This is a harmless form of fs corruption, but it causes the kernel to potentially remount the file system read-only, or to panic, depending on the file systems's error behavior. Thanks to Eric Whitney for his tireless work to reproduce and to find the guilty commit. Fixes: 007649375f6af2 ("ext4: initialize multi-block allocator before checking block descriptors" Cc: stable@vger.kernel.org # 3.15 Reported-by: David Jander <david@protonic.nl> Reported-by: Matteo Croce <technoboy85@gmail.com> Tested-by: Eric Whitney <enwlinux@gmail.com> Suggested-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-06ext4: disable synchronous transaction batching if max_batch_time==0Eric Sandeen1-2/+0
The mount manpage says of the max_batch_time option, This optimization can be turned off entirely by setting max_batch_time to 0. But the code doesn't do that. So fix the code to do that. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-07-06ext4: clarify ext4_error message in ext4_mb_generate_buddy_error()Theodore Ts'o1-2/+2
We are spending a lot of time explaining to users what this error means. Let's try to improve the message to avoid this problem. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-07-06ext4: clarify error count warning messagesTheodore Ts'o1-3/+4
Make it clear that values printed are times, and that it is error since last fsck. Also add note about fsck version required. Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@vger.kernel.org
2014-07-06ext4: fix unjournalled bg descriptor while initializing inode bitmapTheodore Ts'o1-7/+7
The first time that we allocate from an uninitialized inode allocation bitmap, if the block allocation bitmap is also uninitalized, we need to get write access to the block group descriptor before we start modifying the block group descriptor flags and updating the free block count, etc. Otherwise, there is the potential of a bad journal checksum (if journal checksums are enabled), and of the file system becoming inconsistent if we crash at exactly the wrong time. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-06-26ext4: Fix hole punching for files with indirect blocksJan Kara1-2/+10
Hole punching code for files with indirect blocks wrongly computed number of blocks which need to be cleared when traversing the indirect block tree. That could result in punching more blocks than actually requested and thus effectively cause a data loss. For example: fallocate -n -p 10240000 4096 will punch the range 10240000 - 12632064 instead of the range 1024000 - 10244096. Fix the calculation. CC: stable@vger.kernel.org Fixes: 8bad6fc813a3a5300f51369c39d315679fd88c72 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-06-26ext4: Fix block zeroing when punching holes in indirect block filesJan Kara1-2/+2
free_holes_block() passed local variable as a block pointer to ext4_clear_blocks(). Thus ext4_clear_blocks() zeroed out this local variable instead of proper place in inode / indirect block. We later zero out proper place in inode / indirect block but don't dirty the inode / buffer again which can lead to subtle issues (some changes e.g. to inode can be lost). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-06-26ext4: decrement free clusters/inodes counters when block group declared badNamjae Jeon3-0/+47
We should decrement free clusters counter when block bitmap is marked as corrupt and free inodes counter when the allocation bitmap is marked as corrupt to avoid misunderstanding due to incorrect available size in statfs result. User can get immediately ENOSPC error from write begin without reaching for the writepages. Cc: Darrick J. Wong<darrick.wong@oracle.com> Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
2014-06-16ext4: Fix buffer double free in ext4_alloc_branch()Jan Kara1-1/+7
Error recovery in ext4_alloc_branch() calls ext4_forget() even for buffer corresponding to indirect block it did not allocate. This leads to brelse() being called twice for that buffer (once from ext4_forget() and once from cleanup in ext4_ind_map_blocks()) leading to buffer use count misaccounting. Eventually (but often much later because there are other users of the buffer) we will see messages like: VFS: brelse: Trying to free free buffer Another manifestation of this problem is an error: JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh); inconsistent data on disk The fix is easy - don't forget buffer we did not allocate. Also add an explanatory comment because the indexing at ext4_alloc_branch() is somewhat subtle. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-06-12Merge branch 'for-linus' of ↵Linus Torvalds4-44/+32
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "This the bunch that sat in -next + lock_parent() fix. This is the minimal set; there's more pending stuff. In particular, I really hope to get acct.c fixes merged this cycle - we need that to deal sanely with delayed-mntput stuff. In the next pile, hopefully - that series is fairly short and localized (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more iov_iter work. Most of prereqs for ->splice_write with sane locking order are there and Kent's dio rewrite would also fit nicely on top of this pile" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits) lock_parent: don't step on stale ->d_parent of all-but-freed one kill generic_file_splice_write() ceph: switch to iter_file_splice_write() shmem: switch to iter_file_splice_write() nfs: switch to iter_splice_write_file() fs/splice.c: remove unneeded exports ocfs2: switch to iter_file_splice_write() ->splice_write() via ->write_iter() bio_vec-backed iov_iter optimize copy_page_{to,from}_iter() bury generic_file_aio_{read,write} lustre: get rid of messing with iovecs ceph: switch to ->write_iter() ceph_sync_direct_write: stop poking into iov_iter guts ceph_sync_read: stop poking into iov_iter guts new helper: copy_page_from_iter() fuse: switch to ->write_iter() btrfs: switch to ->write_iter() ocfs2: switch to ->write_iter() xfs: switch to ->write_iter() ...
2014-06-12->splice_write() via ->write_iter()Al Viro1-1/+1
iter_file_splice_write() - a ->splice_write() instance that gathers the pipe buffers, builds a bio_vec-based iov_iter covering those and feeds it to ->write_iter(). A bunch of simple cases coverted to that... [AV: fixed the braino spotted by Cyrill] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-09Merge tag 'ext4_for_linus' of ↵Linus Torvalds20-429/+494
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Clean ups and miscellaneous bug fixes, in particular for the new collapse_range and zero_range fallocate functions. In addition, improve the scalability of adding and remove inodes from the orphan list" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits) ext4: handle symlink properly with inline_data ext4: fix wrong assert in ext4_mb_normalize_request() ext4: fix zeroing of page during writeback ext4: remove unused local variable "stored" from ext4_readdir(...) ext4: fix ZERO_RANGE test failure in data journalling ext4: reduce contention on s_orphan_lock ext4: use sbi in ext4_orphan_{add|del}() ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged() ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access ext4: remove unnecessary double parentheses ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails ext4: make local functions static ext4: fix block bitmap validation when bigalloc, ^flex_bg ext4: fix block bitmap initialization under sparse_super2 ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem ext4: avoid unneeded lookup when xattr name is invalid ext4: fix data integrity sync in ordered mode ext4: remove obsoleted check ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode ext4: fix locking for O_APPEND writes ...
2014-06-05mm: non-atomically mark page accessed during page cache allocation where ↵Mel Gorman1-6/+8
possible aops->write_begin may allocate a new page and make it visible only to have mark_page_accessed called almost immediately after. Once the page is visible the atomic operations are necessary which is noticable overhead when writing to an in-memory filesystem like tmpfs but should also be noticable with fast storage. The objective of the patch is to initialse the accessed information with non-atomic operations before the page is visible. The bulk of filesystems directly or indirectly use grab_cache_page_write_begin or find_or_create_page for the initial allocation of a page cache page. This patch adds an init_page_accessed() helper which behaves like the first call to mark_page_accessed() but may called before the page is visible and can be done non-atomically. The primary APIs of concern in this care are the following and are used by most filesystems. find_get_page find_lock_page find_or_create_page grab_cache_page_nowait grab_cache_page_write_begin All of them are very similar in detail to the patch creates a core helper pagecache_get_page() which takes a flags parameter that affects its behavior such as whether the page should be marked accessed or not. Then old API is preserved but is basically a thin wrapper around this core function. Each of the filesystems are then updated to avoid calling mark_page_accessed when it is known that the VM interfaces have already done the job. There is a slight snag in that the timing of the mark_page_accessed() has now changed so in rare cases it's possible a page gets to the end of the LRU as PageReferenced where as previously it might have been repromoted. This is expected to be rare but it's worth the filesystem people thinking about it in case they see a problem with the timing change. It is also the case that some filesystems may be marking pages accessed that previously did not but it makes sense that filesystems have consistent behaviour in this regard. The test case used to evaulate this is a simple dd of a large file done multiple times with the file deleted on each iterations. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it will be possible that the workload completes without even hitting the disk and will have variable results but highlight the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs where the normal case is for the "IO" to not hit the disk. The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO, only wall times are shown for async as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable do to writback timings, I'm only reporting the maximum figures. The sync results were stable enough to make the mean and stddev uninteresting. The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running. async dd 3.15.0-rc3 3.15.0-rc3 vanilla accessed-v2 ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%) tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%) btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%) ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%) xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%) The XFS figure is a bit strange as it managed to avoid a worst case by sheer luck but the average figures looked reasonable. samples percentage ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05fs/buffer.c: remove block_write_full_page_endio()Matthew Wilcox1-1/+1
The last in-tree caller of block_write_full_page_endio() was removed in January 2013. It's time to remove the EXPORT_SYMBOL, which leaves block_write_full_page() as the only caller of block_write_full_page_endio(), so inline block_write_full_page_endio() into block_write_full_page(). Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dheeraj Reddy <dheeraj.reddy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-03Merge branch 'locking-core-for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull core locking updates from Ingo Molnar: "The main changes in this cycle were: - reduced/streamlined smp_mb__*() interface that allows more usecases and makes the existing ones less buggy, especially in rarer architectures - add rwsem implementation comments - bump up lockdep limits" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) rwsem: Add comments to explain the meaning of the rwsem's count field lockdep: Increase static allocations arch: Mass conversion of smp_mb__*() arch,doc: Convert smp_mb__*() arch,xtensa: Convert smp_mb__*() arch,x86: Convert smp_mb__*() arch,tile: Convert smp_mb__*() arch,sparc: Convert smp_mb__*() arch,sh: Convert smp_mb__*() arch,score: Convert smp_mb__*() arch,s390: Convert smp_mb__*() arch,powerpc: Convert smp_mb__*() arch,parisc: Convert smp_mb__*() arch,openrisc: Convert smp_mb__*() arch,mn10300: Convert smp_mb__*() arch,mips: Convert smp_mb__*() arch,metag: Convert smp_mb__*() arch,m68k: Convert smp_mb__*() arch,m32r: Convert smp_mb__*() arch,ia64: Convert smp_mb__*() ...
2014-06-02ext4: handle symlink properly with inline_dataZheng Liu1-0/+3
This commit tries to fix a bug that we can't read symlink properly with inline data feature when the length of symlink is greater than 60 bytes but less than extra space. The key issue is in ext4_inode_is_fast_symlink() that it doesn't check whether or not an inode has inline data. When the user creates a new symlink, an inode will be allocated with MAY_INLINE_DATA flag. Then symlink will be stored in ->i_block and extended attribute space. In the mean time, this inode is with inline data flag. After remounting it, ext4_inode_is_fast_symlink() function thinks that this inode is a fast symlink so that the data in ->i_block is copied to the user, and the data in extra space is trimmed. In fact this inode should be as a normal symlink. The following script can hit this bug. #!/bin/bash cd ${MNT} filename=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 rm -rf test mkdir test cd test echo "hello" >$filename ln -s $filename symlinkfile cd sudo umount /mnt/sda1 sudo mount -t ext4 /dev/sda1 /mnt/sda1 readlink /mnt/sda1/test/symlinkfile After applying this patch, it will break the assumption in e2fsck because the original implementation doesn't want to support symlink with inline data. Reported-by: "Darrick J. Wong" <darrick.wong@oracle.com> Reported-by: Ian Nartowicz <claws@nartowicz.co.uk> Cc: Ian Nartowicz <claws@nartowicz.co.uk> Cc: Tao Ma <tm@tao.ma> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-05-27ext4: fix wrong assert in ext4_mb_normalize_request()Maurizio Lombardi1-1/+1
The variable "size" is expressed as number of blocks and not as number of clusters, this could trigger a kernel panic when using ext4 with the size of a cluster different from the size of a block. Cc: stable@vger.kernel.org Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-05-27ext4: fix zeroing of page during writebackJan Kara1-13/+11
Tail of a page straddling inode size must be zeroed when being written out due to POSIX requirement that modifications of mmaped page beyond inode size must not be written to the file. ext4_bio_write_page() did this only for blocks fully beyond inode size but didn't properly zero blocks partially beyond inode size. Fix this. The problem has been uncovered by mmap_11-4 test in openposix test suite (part of LTP). Reported-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com> Fixes: 5a0dc7365c240 Fixes: bd2d0210cf22f CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-05-27ext4: remove unused local variable "stored" from ext4_readdir(...)Giedrius Rekasius1-2/+1
Remove local variable "stored" from ext4_readdir(...). This variable gets initialized but is never used inside the function. Signed-off-by: Giedrius Rekasius <giedrius.rekasius@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-05-27ext4: fix ZERO_RANGE test failure in data journallingNamjae Jeon1-0/+7
xfstests generic/091 is failing when mounting ext4 with data=journal. I think that this regression is same problem that occurred prior to collapse range issue. So ZERO RANGE also need to call ext4_force_commit as collapse range. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-05-26ext4: reduce contention on s_orphan_lockJan Kara1-44/+65
Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that we avoid taking global s_orphan_lock in some cases and hold it for shorter time in other cases. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-26ext4: use sbi in ext4_orphan_{add|del}()Jan Kara1-16/+15
Use sbi pointer consistently in ext4_orphan_del() instead of opencoding it sometimes. Also ext4_orphan_add() uses EXT4_SB(sb) often so create sbi variable for it as well and use it. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-13ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged()Lukas Czerner1-1/+7
In ext4_es_can_be_merged() when checking whether we can merge two extents we should use EXT_MAX_BLOCKS instead of defining it manually. Also if it is really the case we should notify userspace because clearly there is a bug in extent status tree implementation since this should never happen. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2014-05-13ext4: add missing BUFFER_TRACE before ext4_journal_get_write_accessliang xie10-0/+33
Make them more consistently Signed-off-by: xieliang <xieliang@xiaomi.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12ext4: remove unnecessary double parenthesesLukas Czerner5-11/+11
Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() failsAndrey Tsyvarev1-3/+1
Caches from 'ext4_groupinfo_caches' may be in use by other mounts, which have already existed. So, it is incorrect to destroy them when newly requested mount fails. Found by Linux File System Verification project (linuxtesting.org). Signed-off-by: Andrey Tsyvarev <tsyvarev@ispras.ru> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-05-12ext4: make local functions staticStephen Hemminger9-54/+26
I have been running make namespacecheck to look for unneeded globals, and found these in ext4. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12ext4: fix block bitmap validation when bigalloc, ^flex_bgDarrick J. Wong1-5/+7
On a bigalloc,^flex_bg filesystem, the ext4_valid_block_bitmap function fails to convert from blocks to clusters when spot-checking the validity of the bitmap block that we've just read from disk. This causes ext4 to think that the bitmap is garbage, which results in the block group being taken offline when it's not necessary. Add in the necessary EXT4_B2C() calls to perform the conversions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12ext4: fix block bitmap initialization under sparse_super2Darrick J. Wong2-15/+22
The ext4_bg_has_super() function doesn't know about the new rules for where backup superblocks go on a sparse_super2 filesystem. Therefore, block bitmap initialization doesn't know that it shouldn't reserve space for backups in groups that are never going to contain backups. The result of this is e2fsck complaining about the block bitmap being incorrect (fortunately not in a way that results in cross-linked files), so fix the whole thing. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystemDarrick J. Wong1-0/+10
On a filesystem with a 1k block size, the group descriptors live in block 2, not block 1. If the filesystem has bigalloc,meta_bg set, however, the calculation of the group descriptor table location does not take this into account and returns the wrong block number. Fix the calculation to return the correct value for this case. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12ext4: avoid unneeded lookup when xattr name is invalidZhang Zhen1-0/+3
In ext4_xattr_set_handle() we have checked the xattr name's length. So we should also check it in ext4_xattr_get() to avoid unneeded lookup caused by invalid name. Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12ext4: fix data integrity sync in ordered modeNamjae Jeon3-5/+12
When we perform a data integrity sync we tag all the dirty pages with PAGECACHE_TAG_TOWRITE at start of ext4_da_writepages. Later we check for this tag in write_cache_pages_da and creates a struct mpage_da_data containing contiguously indexed pages tagged with this tag and sync these pages with a call to mpage_da_map_and_submit. This process is done in while loop until all the PAGECACHE_TAG_TOWRITE pages are synced. We also do journal start and stop in each iteration. journal_stop could initiate journal commit which would call ext4_writepage which in turn will call ext4_bio_write_page even for delayed OR unwritten buffers. When ext4_bio_write_page is called for such buffers, even though it does not sync them but it clears the PAGECACHE_TAG_TOWRITE of the corresponding page and hence these pages are also not synced by the currently running data integrity sync. We will end up with dirty pages although sync is completed. This could cause a potential data loss when the sync call is followed by a truncate_pagecache call, which is exactly the case in collapse_range. (It will cause generic/127 failure in xfstests) To avoid this issue, we can use set_page_writeback_keepwrite instead of set_page_writeback, which doesn't clear TOWRITE tag. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-05-07convert ext4 to ->write_iter()Al Viro1-18/+11
unfortunately, Ted's changes to ext4_file_write() are *still* an incomplete fix - playing with rlimits can let you smuggle an unaligned request past the checks. So there almost certainly will be more merge PITA around that place... [fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-07Merge ext4 changes in ext4_file_write() into for-nextAl Viro8-245/+240
From ext4.git#dev, needed for switch of ext4 to ->write_iter() ;-/