summaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)AuthorFilesLines
2015-09-09ext4: start transaction before calling into DAXMatthew Wilcox1-3/+52
Jan Kara pointed out that in the case where we are writing to a hole, we can end up with a lock inversion between the page lock and the journal lock. We can avoid this by starting the transaction in ext4 before calling into DAX. The journal lock nests inside the superblock pagefault lock, so we have to duplicate that code from dax_fault, like XFS does. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09ext4: add ext4_get_block_dax()Matthew Wilcox3-3/+16
DAX wants different semantics from any currently-existing ext4 get_block callback. Unlike ext4_get_block_write(), it needs to honour the 'create' flag, and unlike ext4_get_block(), it needs to be able to return unwritten extents. So introduce a new ext4_get_block_dax() which has those semantics. We could also change ext4_get_block_write() to honour the 'create' flag, but that might have consequences on other users that I do not currently understand. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09ext4: use ext4_get_block_write() for DAXMatthew Wilcox1-4/+4
DAX relies on the get_block function either zeroing newly allocated blocks before they're findable by subsequent calls to get_block, or marking newly allocated blocks as unwritten. ext4_get_block() cannot create unwritten extents, but ext4_get_block_write() can. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reported-by: Andy Rudoff <andy.rudoff@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09ext4: huge page fault supportMatthew Wilcox1-1/+9
Use DAX to provide support for huge pages. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09dax: move DAX-related functions to a new headerMatthew Wilcox3-0/+3
In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is defined in linux/mm.h. Given that we don't want to include <linux/mm.h> in <linux/fs.h>, the easiest solution is to move the DAX-related functions to a new header, <linux/dax.h>. We could also have moved VM_FAULT_* definitions to a new header, or a different header that isn't quite such a boil-the-ocean header as <linux/mm.h>, but this felt like the best option. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-05fs: create and use seq_show_option for escapingKees Cook1-2/+2
Many file systems that implement the show_options hook fail to correctly escape their output which could lead to unescaped characters (e.g. new lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This could lead to confusion, spoofed entries (resulting in things like systemd issuing false d-bus "mount" notifications), and who knows what else. This looks like it would only be the root user stepping on themselves, but it's possible weird things could happen in containers or in other situations with delegated mount privileges. Here's an example using overlay with setuid fusermount trusting the contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use of "sudo" is something more sneaky: $ BASE="ovl" $ MNT="$BASE/mnt" $ LOW="$BASE/lower" $ UP="$BASE/upper" $ WORK="$BASE/work/ 0 0 none /proc fuse.pwn user_id=1000" $ mkdir -p "$LOW" "$UP" "$WORK" $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt $ cat /proc/mounts none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0 none /proc fuse.pwn user_id=1000 0 0 $ fusermount -u /proc $ cat /proc/mounts cat: /proc/mounts: No such file or directory This fixes the problem by adding new seq_show_option and seq_show_option_n helpers, and updating the vulnerable show_option handlers to use them as needed. Some, like SELinux, need to be open coded due to unusual existing escape mechanisms. [akpm@linux-foundation.org: add lost chunk, per Kees] [keescook@chromium.org: seq_show_option should be using const parameters] Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Paul Moore <paul@paul-moore.com> Cc: J. R. Okajima <hooanon05g@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-03Merge tag 'ext4_for_linus' of ↵Linus Torvalds8-41/+101
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Pretty much all bug fixes and clean ups for 4.3, after a lot of features and other churn going into 4.2" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: Revert "ext4: remove block_device_ejected" ext4: ratelimit the file system mounted message ext4: silence a format string false positive ext4: simplify some code in read_mmp_block() ext4: don't manipulate recovery flag when freezing no-journal fs jbd2: limit number of reserved credits ext4 crypto: remove duplicate header file ext4: update c/mtime on truncate up jbd2: avoid infinite loop when destroying aborted journal ext4, jbd2: add REQ_FUA flag when recording an error in the superblock ext4 crypto: fix spelling typo in comment ext4 crypto: exit cleanly if ext4_derive_key_aes() fails ext4: reject journal options for ext2 mounts ext4: implement cgroup writeback support ext4: replace ext4_io_submit->io_op with ->io_wbc ext4 crypto: check for too-short encrypted file names ext4 crypto: use a jbd2 transaction when adding a crypto policy jbd2: speedup jbd2_journal_dirty_metadata()
2015-09-03Merge branch 'for_linus' of ↵Linus Torvalds5-42/+102
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext3 removal, quota & udf fixes from Jan Kara: "The biggest change in the pull is the removal of ext3 filesystem driver (~28k lines removed). Ext4 driver is a full featured replacement these days and both RH and SUSE use it for several years without issues. Also there are some workarounds in VM & block layer mainly for ext3 which we could eventually get rid of. Other larger change is addition of proper error handling for dquot_initialize(). The rest is small fixes and cleanups" [ I wasn't convinced about the ext3 removal and worried about things falling through the cracks for legacy users, but ext4 maintainers piped up and were all unanimously in favor of removal, and maintaining all legacy ext3 support inside ext4. - Linus ] * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Don't modify filesystem for read-only mounts quota: remove an unneeded condition ext4: memory leak on error in ext4_symlink() mm/Kconfig: NEED_BOUNCE_POOL: clean-up condition ext4: Improve ext4 Kconfig test block: Remove forced page bouncing under IO fs: Remove ext3 filesystem driver doc: Update doc about journalling layer jfs: Handle error from dquot_initialize() reiserfs: Handle error from dquot_initialize() ocfs2: Handle error from dquot_initialize() ext4: Handle error from dquot_initialize() ext2: Handle error from dquot_initalize() quota: Propagate error from ->acquire_dquot()
2015-08-16Revert "ext4: remove block_device_ejected"Theodore Ts'o1-1/+17
This reverts commit 08439fec266c3cc5702953b4f54bdf5649357de0. Unfortunately we still need to test for bdi->dev to avoid a crash when a USB stick is yanked out while a file system is mounted: usb 2-2: USB disconnect, device number 2 Buffer I/O error on dev sdb1, logical block 15237120, lost sync page write JBD2: Error -5 detected when updating journal superblock for sdb1-8. BUG: unable to handle kernel paging request at 34beb000 IP: [<c136ce88>] __percpu_counter_add+0x18/0xc0 *pdpt = 0000000023db9001 *pde = 0000000000000000 Oops: 0000 [#1] SMP CPU: 0 PID: 4083 Comm: umount Tainted: G U OE 4.1.1-040101-generic #201507011435 Hardware name: LENOVO 7675CTO/7675CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011 task: ebf06b50 ti: ebebc000 task.ti: ebebc000 EIP: 0060:[<c136ce88>] EFLAGS: 00010082 CPU: 0 EIP is at __percpu_counter_add+0x18/0xc0 EAX: f21c8e88 EBX: f21c8e88 ECX: 00000000 EDX: 00000001 ESI: 00000001 EDI: 00000000 EBP: ebebde60 ESP: ebebde40 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: 34beb000 CR3: 33354200 CR4: 000007f0 Stack: c1abe100 edcb0098 edcb00ec ffffffff f21c8e68 ffffffff f21c8e68 f286d160 ebebde84 c1160454 00000010 00000282 f72a77f8 00000984 f72a77f8 f286d160 f286d170 ebebdea0 c11e613f 00000000 00000282 f72a77f8 edd7f4d0 00000000 Call Trace: [<c1160454>] account_page_dirtied+0x74/0x110 [<c11e613f>] __set_page_dirty+0x3f/0xb0 [<c11e6203>] mark_buffer_dirty+0x53/0xc0 [<c124a0cb>] ext4_commit_super+0x17b/0x250 [<c124ac71>] ext4_put_super+0xc1/0x320 [<c11f04ba>] ? fsnotify_unmount_inodes+0x1aa/0x1c0 [<c11cfeda>] ? evict_inodes+0xca/0xe0 [<c11b925a>] generic_shutdown_super+0x6a/0xe0 [<c10a1df0>] ? prepare_to_wait_event+0xd0/0xd0 [<c1165a50>] ? unregister_shrinker+0x40/0x50 [<c11b92f6>] kill_block_super+0x26/0x70 [<c11b94f5>] deactivate_locked_super+0x45/0x80 [<c11ba007>] deactivate_super+0x47/0x60 [<c11d2b39>] cleanup_mnt+0x39/0x80 [<c11d2bc0>] __cleanup_mnt+0x10/0x20 [<c1080b51>] task_work_run+0x91/0xd0 [<c1011e3c>] do_notify_resume+0x7c/0x90 [<c1720da5>] work_notify Code: 8b 55 e8 e9 f4 fe ff ff 90 90 90 90 90 90 90 90 90 90 90 55 89 e5 83 ec 20 89 5d f4 89 c3 89 75 f8 89 d6 89 7d fc 89 cf 8b 48 14 <64> 8b 01 89 45 ec 89 c2 8b 45 08 c1 fa 1f 01 75 ec 89 55 f0 89 EIP: [<c136ce88>] __percpu_counter_add+0x18/0xc0 SS:ESP 0068:ebebde40 CR2: 0000000034beb000 ---[ end trace dd564a7bea834ecd ]--- Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=101011 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-08-15ext4: ratelimit the file system mounted messageTheodore Ts'o1-3/+6
The xfstests ext4/305 will mount and unmount the same file system over 4,000 times, and each one of these will cause a system log message. Ratelimit this message since if we are getting more than a few dozen of these messages, they probably aren't going to be helpful. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-08-15ext4: silence a format string false positiveDan Carpenter1-1/+1
Static checkers complain that the format string should be "%s". It does not make a difference for the current code. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-08-15ext4: simplify some code in read_mmp_block()Dan Carpenter1-21/+25
My static check complains because we have: if (!*bh) return -ENOMEM; if (*bh) { The second check is unnecessary. I've simplified this code by moving the "if (!*bh)" checks around. Also Andreas Dilger says we should probably print a warning if sb_getblk() fails. [ Restructured the code so that we print a warning message as well if the mmp block doesn't check out, and to print the error code to disambiguate between the error cases. - TYT ] Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-08-15ext4: don't manipulate recovery flag when freezing no-journal fsEric Sandeen1-4/+8
At some point along this sequence of changes: f6e63f9 ext4: fold ext4_nojournal_sops into ext4_sops bb04457 ext4: support freezing ext2 (nojournal) file systems 9ca9238 ext4: Use separate super_operations structure for no_journal filesystems ext4 started setting needs_recovery on filesystems without journals when they are unfrozen. This makes no sense, and in fact confuses blkid to the point where it doesn't recognize the filesystem at all. (freeze ext2; unfreeze ext2; run blkid; see no output; run dumpe2fs, see needs_recovery set on fs w/ no journal). To fix this, don't manipulate the INCOMPAT_RECOVER feature on filesystems without journals. Reported-by: Stu Mark <smark@datto.com> Reviewed-by: Jan Kara <jack@suse.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-08-13block: remove bio_get_nr_vecs()Kent Overstreet2-3/+2
We can always fill up the bio now, no need to estimate the possible size based on queue parameters. Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [hch: rebased and wrote a changelog] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29block: add a bi_error field to struct bioChristoph Hellwig2-12/+9
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-28ext4 crypto: remove duplicate header filezilong.liu1-1/+0
Remove key.h which is included twice in crypto_fname.c Signed-off-by: zilong.liu <liuziloong@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-28ext4: update c/mtime on truncate upEryu Guan1-0/+8
Commit 3da40c7b0898 ("ext4: only call ext4_truncate when size <= isize") introduced a bug that c/mtime is not updated on truncate up. Fix the issue by setting c/mtime explicitly in the truncate up case. Note that ftruncate(2) is not affected, so you won't see this bug using truncate(1) and xfs_io(1). Signed-off-by: Zirong Lang <zorro.lang@gmail.com> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-27ext4: memory leak on error in ext4_symlink()Dan Carpenter1-1/+1
We should release "sd" before returning. Fixes: 0fa12ad1b285 ('ext4: Handle error from dquot_initialize()') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jan Kara <jack@suse.com>
2015-07-23ext4: Improve ext4 Kconfig testJan Kara1-6/+7
Now that ext4 driver must be used to access ext3 filesystems, improve the Kconfig help text to better explain that using ext4 driver to access the filesystem is fully compatible with the old ext3 driver. Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.com>
2015-07-23fs: Remove ext3 filesystem driverJan Kara2-16/+39
The functionality of ext3 is fully supported by ext4 driver. Major distributions (SUSE, RedHat) already use ext4 driver to handle ext3 filesystems for quite some time. There is some ugliness in mm resulting from jbd cleaning buffers in a dirty page without cleaning page dirty bit and also support for buffer bouncing in the block layer when stable pages are required is there only because of jbd. So let's remove the ext3 driver. This saves us some 28k lines of duplicated code. Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz>
2015-07-23ext4: Handle error from dquot_initialize()Jan Kara3-20/+56
dquot_initialize() can now return error. Handle it where possible. Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.com>
2015-07-23ext4, jbd2: add REQ_FUA flag when recording an error in the superblockDaeho Jeong1-1/+2
When an error condition is detected, an error status should be recorded into superblocks of EXT4 or JBD2. However, the write request is submitted now without REQ_FUA flag, even in "barrier=1" mode, which is followed by panic() function in "errors=panic" mode. On mobile devices which make whole system reset as soon as kernel panic occurs, this write request containing an error flag will disappear just from storage cache without written to the physical cells. Therefore, when next start, even forever, the error flag cannot be shown in both superblocks, and e2fsck cannot fix the filesystem problems automatically, unless e2fsck is executed in force checking mode. [ Changed use test_opt(sb, BARRIER) of checking the journal flags -- TYT ] Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-22ext4 crypto: fix spelling typo in commentLaurent Navet1-1/+1
Signed-off-by: Laurent Navet <laurent.navet@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-22ext4 crypto: exit cleanly if ext4_derive_key_aes() failsLaurent Navet1-0/+2
Return value of ext4_derive_key_aes() is stored but not used. Add test to exit cleanly if ext4_derive_key_aes() fail. Also fix coverity CID 1309760. Signed-off-by: Laurent Navet <laurent.navet@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-22ext4: reject journal options for ext2 mountsCarlos Maiolino1-3/+3
There is no reason to allow ext2 filesystems be mounted with journal mount options. So, this patch adds them to the MOPT_NO_EXT2 mount options list. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-22ext4: implement cgroup writeback supportTejun Heo2-0/+4
For ordered and writeback data modes, all data IOs go through ext4_io_submit. This patch adds cgroup writeback support by invoking wbc_init_bio() from io_submit_init_bio() and wbc_account_io() in io_submit_add_bh(). Journal data which is written by jbd2 worker is left alone by this patch and will always be written out from the root cgroup. ext4_fill_super() is updated to set MS_CGROUPWB when data mode is either ordered or writeback. In journaled data mode, most IOs become synchronous through the journal and enabling cgroup writeback support doesn't make much sense or difference. Journaled data mode is left alone. Lightly tested with sequential data write workload. Behaves as expected. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-22ext4: replace ext4_io_submit->io_op with ->io_wbcTejun Heo2-3/+5
ext4_io_submit_init() takes the pointer to writeback_control to test its sync_mode and determine between WRITE and WRITE_SYNC and records the result in ->io_op. This patch makes it record the pointer directly and moves the test to ext4_io_submit(). This doesn't cause any noticeable differences now but having writeback_control available throughout IO submission path will be depended upon by the planned cgroup writeback support. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-17ext4 crypto: check for too-short encrypted file namesTheodore Ts'o1-0/+4
An encrypted file name should never be shorter than an 16 bytes, the AES block size. The 3.10 crypto layer will oops and crash the kernel if ciphertext shorter than the block size is passed to it. Fortunately, in modern kernels the crypto layer will not crash the kernel in this scenario, but nevertheless, it represents a corrupted directory, and we should detect it and mark the file system as corrupted so that e2fsck can fix this. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-17ext4 crypto: use a jbd2 transaction when adding a crypto policyTheodore Ts'o1-2/+15
Start a jbd2 transaction, and mark the inode dirty on the inode under that transaction after setting the encrypt flag. Otherwise if the directory isn't modified after setting the crypto policy, the encrypted flag might not survive the inode getting pushed out from memory, or the the file system getting unmounted and remounted. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-09ioctl_compat: handle FITRIMMikulas Patocka1-1/+0
The FITRIM ioctl has the same arguments on 32-bit and 64-bit architectures, so we can add it to the list of compatible ioctls and drop it from compat_ioctl method of various filesystems. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ted Ts'o <tytso@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-06Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds4-21/+40
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "Bug fixes (all for stable kernels) for ext4: - address corner cases for indirect blocks->extent migration - fix reserved block accounting invalidate_page when page_size != block_size (i.e., ppc or 1k block size file systems) - fix deadlocks when a memcg is under heavy memory pressure - fix fencepost error in lazytime optimization" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: replace open coded nofail allocation in ext4_free_blocks() ext4: correctly migrate a file with a hole at the beginning ext4: be more strict when migrating to non-extent based file ext4: fix reservation release on invalidatepage for delalloc fs ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp bufferhead: Add _gfp version for sb_getblk() ext4: fix fencepost error in lazytime optimization
2015-07-05ext4: replace open coded nofail allocation in ext4_free_blocks()Michal Hocko1-11/+5
ext4_free_blocks is looping around the allocation request and mimics __GFP_NOFAIL behavior without any allocation fallback strategy. Let's remove the open coded loop and replace it with __GFP_NOFAIL. Without the flag the allocator has no way to find out never-fail requirement and cannot help in any way. Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-07-05Merge branch 'for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Assorted VFS fixes and related cleanups (IMO the most interesting in that part are f_path-related things and Eric's descriptor-related stuff). UFS regression fixes (it got broken last cycle). 9P fixes. fs-cache series, DAX patches, Jan's file_remove_suid() work" [ I'd say this is much more than "fixes and related cleanups". The file_table locking rule change by Eric Dumazet is a rather big and fundamental update even if the patch isn't huge. - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits) 9p: cope with bogus responses from server in p9_client_{read,write} p9_client_write(): avoid double p9_free_req() 9p: forgetting to cancel request on interrupted zero-copy RPC dax: bdev_direct_access() may sleep block: Add support for DAX reads/writes to block devices dax: Use copy_from_iter_nocache dax: Add block size note to documentation fs/file.c: __fget() and dup2() atomicity rules fs/file.c: don't acquire files->file_lock in fd_install() fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation vfs: avoid creation of inode number 0 in get_next_ino namei: make set_root_rcu() return void make simple_positive() public ufs: use dir_pages instead of ufs_dir_pages() pagemap.h: move dir_pages() over there remove the pointless include of lglock.h fs: cleanup slight list_entry abuse xfs: Correctly lock inode when removing suid and file capabilities fs: Call security_ops->inode_killpriv on truncate fs: Provide function telling whether file_remove_privs() will do anything ...
2015-07-04ext4: correctly migrate a file with a hole at the beginningEryu Guan1-4/+5
Currently ext4_ind_migrate() doesn't correctly handle a file which contains a hole at the beginning of the file. This caused the migration to be done incorrectly, and then if there is a subsequent following delayed allocation write to the "hole", this would reclaim the same data blocks again and results in fs corruption. # assmuing 4k block size ext4, with delalloc enabled # skip the first block and write to the second block xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile # converting to indirect-mapped file, which would move the data blocks # to the beginning of the file, but extent status cache still marks # that region as a hole chattr -e /mnt/ext4/testfile # delayed allocation writes to the "hole", reclaim the same data block # again, results in i_blocks corruption xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile umount /mnt/ext4 e2fsck -nf /dev/sda6 ... Inode 53, i_blocks is 16, should be 8. Fix? no ... Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-07-04ext4: be more strict when migrating to non-extent based fileEryu Guan1-1/+11
Currently the check in ext4_ind_migrate() is not enough before doing the real conversion: a) delayed allocated extents could bypass the check on eh->eh_entries and eh->eh_depth This can be demonstrated by this script xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile chattr -e /mnt/ext4/testfile where testfile has two extents but still be converted to non-extent based file format. b) only extent length is checked but not the offset, which would result in data lose (delalloc) or fs corruption (nodelalloc), because non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30) blocks This can be demostrated by xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile chattr -e /mnt/ext4/testfile sync If delalloc is enabled, dmesg prints EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53 EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5 EXT4-fs (dm-4): This should not happen!! Data will be lost If delalloc is disabled, e2fsck -nf shows corruption Inode 53, i_size is 5497558142976, should be 4096. Fix? no Fix the two issues by a) forcing all delayed allocation blocks to be allocated before checking eh->eh_depth and eh->eh_entries b) limiting the last logical block of the extent is within direct map Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-07-04ext4: fix reservation release on invalidatepage for delalloc fsLukas Czerner1-3/+12
On delalloc enabled file system on invalidatepage operation in ext4_da_page_release_reservation() we want to clear the delayed buffer and remove the extent covering the delayed buffer from the extent status tree. However currently there is a bug where on the systems with page size > block size we will always remove extents from the start of the page regardless where the actual delayed buffers are positioned in the page. This leads to the errors like this: EXT4-fs warning (device loop0): ext4_da_release_space:1225: ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data blocks This however can cause data loss on writeback time if the file system is in ENOSPC condition because we're releasing reservation for someones else delayed buffer. Fix this by only removing extents that corresponds to the part of the page we want to invalidate. This problem is reproducible by the following fio receipt (however I was only able to reproduce it with fio-2.1 or older. [global] bs=8k iodepth=1024 iodepth_batch=60 randrepeat=1 size=1m directory=/mnt/test numjobs=20 [job1] ioengine=sync bs=1k direct=1 rw=randread filename=file1:file2 [job2] ioengine=libaio rw=randwrite direct=1 filename=file1:file2 [job3] bs=1k ioengine=posixaio rw=randwrite direct=1 filename=file1:file2 [job5] bs=1k ioengine=sync rw=randread filename=file1:file2 [job7] ioengine=libaio rw=randwrite filename=file1:file2 [job8] ioengine=posixaio rw=randwrite filename=file1:file2 [job10] ioengine=mmap rw=randwrite bs=1k filename=file1:file2 [job11] ioengine=mmap rw=randwrite direct=1 filename=file1:file2 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org
2015-07-02ext4: avoid deadlocks in the writeback path by using sb_getblk_gfpNikolay Borisov1-3/+3
Switch ext4 to using sb_getblk_gfp with GFP_NOFS added to fix possible deadlocks in the page writeback path. Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-07-02ext4: fix fencepost error in lazytime optimizationTheodore Ts'o1-1/+6
Commit 8f4d8558391: "ext4: fix lazytime optimization" was not a complete fix. In the case where the inode number is a multiple of 16, and we could still end up updating an inode with dirty timestamps written to the wrong inode on disk. Oops. This can be easily reproduced by using generic/005 with a file system with metadata_csum and lazytime enabled. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-07-01Merge tag 'xfs-for-linus-4.2-rc1' of ↵Linus Torvalds2-16/+21
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pul xfs updates from Dave Chinner: "There's a couple of small API changes to the core DAX code which required small changes to the ext2 and ext4 code bases, but otherwise everything is within the XFS codebase. This update contains: - A new sparse on-disk inode record format to allow small extents to be used for inode allocation when free space is fragmented. - DAX support. This includes minor changes to the DAX core code to fix problems with lock ordering and bufferhead mapping abuse. - transaction commit interface cleanup - removal of various unnecessary XFS specific type definitions - cleanup and optimisation of freelist preparation before allocation - various minor cleanups - bug fixes for - transaction reservation leaks - incorrect inode logging in unwritten extent conversion - mmap lock vs freeze ordering - remote symlink mishandling - attribute fork removal issues" * tag 'xfs-for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits) xfs: don't truncate attribute extents if no extents exist xfs: clean up XFS_MIN_FREELIST macros xfs: sanitise error handling in xfs_alloc_fix_freelist xfs: factor out free space extent length check xfs: xfs_alloc_fix_freelist() can use incore perag structures xfs: remove xfs_caddr_t xfs: use void pointers in log validation helpers xfs: return a void pointer from xfs_buf_offset xfs: remove inst_t xfs: remove __psint_t and __psunsigned_t xfs: fix remote symlinks on V5/CRC filesystems xfs: fix xfs_log_done interface xfs: saner xfs_trans_commit interface xfs: remove the flags argument to xfs_trans_cancel xfs: pass a boolean flag to xfs_trans_free_items xfs: switch remaining xfs_trans_dup users to xfs_trans_roll xfs: check min blks for random debug mode sparse allocations xfs: fix sparse inodes 32-bit compile failure xfs: add initial DAX support xfs: add DAX IO path support ...
2015-06-26Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-3/+1
Merge second patchbomb from Andrew Morton: - most of the rest of MM - lots of misc things - procfs updates - printk feature work - updates to get_maintainer, MAINTAINERS, checkpatch - lib/ updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits) exit,stats: /* obey this comment */ coredump: add __printf attribute to cn_*printf functions coredump: use from_kuid/kgid when formatting corename fs/reiserfs: remove unneeded cast NILFS2: support NFSv2 export fs/befs/btree.c: remove unneeded initializations fs/minix: remove unneeded cast init/do_mounts.c: add create_dev() failure log kasan: remove duplicate definition of the macro KASAN_FREE_PAGE fs/efs: femove unneeded cast checkpatch: emit "NOTE: <types>" message only once after multiple files checkpatch: emit an error when there's a diff in a changelog checkpatch: validate MODULE_LICENSE content checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr() checkpatch: fix processing of MEMSET issues checkpatch: suggest using ether_addr_equal*() checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files checkpatch: remove local from codespell path checkpatch: add --showfile to allow input via pipe to show filenames ...
2015-06-26fs/ext4/super.c: use strreplace() in ext4_fill_super()Rasmus Villemoes1-3/+1
This makes a very large function a little smaller. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-26Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-blockLinus Torvalds3-0/+3
Pull cgroup writeback support from Jens Axboe: "This is the big pull request for adding cgroup writeback support. This code has been in development for a long time, and it has been simmering in for-next for a good chunk of this cycle too. This is one of those problems that has been talked about for at least half a decade, finally there's a solution and code to go with it. Also see last weeks writeup on LWN: http://lwn.net/Articles/648292/" * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits) writeback, blkio: add documentation for cgroup writeback support vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB writeback: do foreign inode detection iff cgroup writeback is enabled v9fs: fix error handling in v9fs_session_init() bdi: fix wrong error return value in cgwb_create() buffer: remove unusued 'ret' variable writeback: disassociate inodes from dying bdi_writebacks writeback: implement foreign cgroup inode bdi_writeback switching writeback: add lockdep annotation to inode_to_wb() writeback: use unlocked_inode_to_wb transaction in inode_congested() writeback: implement unlocked_inode_to_wb transaction and use it for stat updates writeback: implement [locked_]inode_to_wb_and_lock_list() writeback: implement foreign cgroup inode detection writeback: make writeback_control track the inode being written back writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use writeback: implement memcg writeback domain based throttling writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes writeback: implement memcg wb_domain writeback: update wb_over_bg_thresh() to use wb_domain aware operations ...
2015-06-26Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+0
Pull core block IO update from Jens Axboe: "Nothing really major in here, mostly a collection of smaller optimizations and cleanups, mixed with various fixes. In more detail, this contains: - Addition of policy specific data to blkcg for block cgroups. From Arianna Avanzini. - Various cleanups around command types from Christoph. - Cleanup of the suspend block I/O path from Christoph. - Plugging updates from Shaohua and Jeff Moyer, for blk-mq. - Eliminating atomic inc/dec of both remaining IO count and reference count in a bio. From me. - Fixes for SG gap and chunk size support for data-less (discards) IO, so we can merge these better. From me. - Small restructuring of blk-mq shared tag support, freeing drivers from iterating hardware queues. From Keith Busch. - A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the IOPS mode the default for non-rotational storage" * 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits) cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL cfq-iosched: fix sysfs oops when attempting to read unconfigured weights cfq-iosched: move group scheduling functions under ifdef cfq-iosched: fix the setting of IOPS mode on SSDs blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file block, cgroup: implement policy-specific per-blkcg data block: Make CFQ default to IOPS mode on SSDs block: add blk_set_queue_dying() to blkdev.h blk-mq: Shared tag enhancements block: don't honor chunk sizes for data-less IO block: only honor SG gap prevention for merges that contain data block: fix returnvar.cocci warnings block, dm: don't copy bios for request clones block: remove management of bi_remaining when restoring original bi_end_io block: replace trylock with mutex_lock in blkdev_reread_part() block: export blkdev_reread_part() and __blkdev_reread_part() suspend: simplify block I/O handling block: collapse bio bit space block: remove unused BIO_RW_BLOCK and BIO_EOF flags block: remove BIO_EOPNOTSUPP ...
2015-06-26Merge tag 'ext4_for_linus' of ↵Linus Torvalds23-1195/+1227
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A very large number of cleanups and bug fixes --- in particular for the ext4 encryption patches, which is a new feature added in the last merge window. Also fix a number of long-standing xfstest failures. (Quota writes failing due to ENOSPC, a race between truncate and writepage in data=journalled mode that was causing generic/068 to fail, and other corner cases.) Also add support for FALLOC_FL_INSERT_RANGE, and improve jbd2 performance eliminating locking when a buffer is modified more than once during a transaction (which is very common for allocation bitmaps, for example), in which case the state of the journalled buffer head doesn't need to change" [ I renamed "ext4_follow_link()" to "ext4_encrypted_follow_link()" in the merge resolution, to make it clear that that function is _only_ used for encrypted symlinks. The function doesn't actually work for non-encrypted symlinks at all, and they use the generic helpers - Linus ] * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (52 commits) ext4: set lazytime on remount if MS_LAZYTIME is set by mount ext4: only call ext4_truncate when size <= isize ext4: make online defrag error reporting consistent ext4: minor cleanup of ext4_da_reserve_space() ext4: don't retry file block mapping on bigalloc fs with non-extent file ext4: prevent ext4_quota_write() from failing due to ENOSPC ext4: call sync_blockdev() before invalidate_bdev() in put_super() jbd2: speedup jbd2_journal_dirty_metadata() jbd2: get rid of open coded allocation retry loop ext4: improve warning directory handling messages jbd2: fix ocfs2 corrupt when updating journal superblock fails ext4: mballoc: avoid 20-argument function call ext4: wait for existing dio workers in ext4_alloc_file_blocks() ext4: recalculate journal credits as inode depth changes jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail() ext4: use swap() in mext_page_double_lock() ext4: use swap() in memswap() ext4: fix race between truncate and __ext4_journalled_writepage() ext4 crypto: fail the mount if blocksize != pagesize ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate ...
2015-06-24vfs: add file_path() helperMiklos Szeredi1-1/+1
Turn d_path(&file->f_path, ...); into file_path(file, ...); Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-23ext4: set lazytime on remount if MS_LAZYTIME is set by mountTheodore Ts'o1-0/+3
Newer versions of mount parse the lazytime feature and pass it to the mount system call via the flags field in the mount system call, removing the lazytime string from the mount options list. So we need to check for the presence of MS_LAZYTIME and set it in sb->s_flags in order for this flag to be set on a remount. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-06-22Merge branch 'for-linus-1' of ↵Linus Torvalds4-39/+28
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "In this pile: pathname resolution rewrite. - recursion in link_path_walk() is gone. - nesting limits on symlinks are gone (the only limit remaining is that the total amount of symlinks is no more than 40, no matter how nested). - "fast" (inline) symlinks are handled without leaving rcuwalk mode. - stack footprint (independent of the nesting) is below kilobyte now, about on par with what it used to be with one level of nested symlinks and ~2.8 times lower than it used to be in the worst case. - struct nameidata is entirely private to fs/namei.c now (not even opaque pointers are being passed around). - ->follow_link() and ->put_link() calling conventions had been changed; all in-tree filesystems converted, out-of-tree should be able to follow reasonably easily. For out-of-tree conversions, see Documentation/filesystems/porting for details (and in-tree filesystems for examples of conversion). That has sat in -next since mid-May, seems to survive all testing without regressions and merges clean with v4.1" * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (131 commits) turn user_{path_at,path,lpath,path_dir}() into static inlines namei: move saved_nd pointer into struct nameidata inline user_path_create() inline user_path_parent() namei: trim do_last() arguments namei: stash dfd and name into nameidata namei: fold path_cleanup() into terminate_walk() namei: saner calling conventions for filename_parentat() namei: saner calling conventions for filename_create() namei: shift nameidata down into filename_parentat() namei: make filename_lookup() reject ERR_PTR() passed as name namei: shift nameidata inside filename_lookup() namei: move putname() call into filename_lookup() namei: pass the struct path to store the result down into path_lookupat() namei: uninline set_root{,_rcu}() namei: be careful with mountpoint crossings in follow_dotdot_rcu() Documentation: remove outdated information from automount-support.txt get rid of assorted nameidata-related debris lustre: kill unused helper lustre: kill unused macro (LOOKUP_CONTINUE) ...
2015-06-22ext4: only call ext4_truncate when size <= isizeJosef Bacik1-20/+18
At LSF we decided that if we truncate up from isize we shouldn't trim fallocated blocks that were fallocated with KEEP_SIZE and are past the new i_size. This patch fixes ext4 to do this. [ Completely reworked patch so that i_disksize would actually get set when truncating up. Also reworked the code for handling truncate so that it's easier to handle. -- tytso ] Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2015-06-22ext4: make online defrag error reporting consistentEric Whitney1-3/+7
Make the error reporting behavior resulting from the unsupported use of online defrag on files with data journaling enabled consistent with that implemented for bigalloc file systems. Difference found with ext4/308. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2015-06-22ext4: minor cleanup of ext4_da_reserve_space()Eric Whitney1-17/+5
Remove outdated comments and dead code from ext4_da_reserve_space. Clean up its trace point, and relocate it to make it more useful. While we're at it, fix a nearby conditional used to determine if we have a non-bigalloc file system. It doesn't match usage elsewhere in the code, and misleadingly suggests that an s_cluster_ratio value of 0 would be legal. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>