summaryrefslogtreecommitdiff
path: root/fs/ext4/inode.c
AgeCommit message (Collapse)AuthorFilesLines
2013-09-05Merge branch 'for-linus' of ↵Linus Torvalds1-21/+7
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "Unfortunately, this merge window it'll have a be a lot of small piles - my fault, actually, for not keeping #for-next in anything that would resemble a sane shape ;-/ This pile: assorted fixes (the first 3 are -stable fodder, IMO) and cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last components) + several long-standing patches from various folks. There definitely will be a lot more (starting with Miklos' check_submount_and_drop() series)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) direct-io: Handle O_(D)SYNC AIO direct-io: Implement generic deferred AIO completions add formats for dentry/file pathnames kvm eventfd: switch to fdget powerpc kvm: use fdget switch fchmod() to fdget switch epoll_ctl() to fdget switch copy_module_from_fd() to fdget git simplify nilfs check for busy subtree ibmasmfs: don't bother passing superblock when not needed don't pass superblock to hypfs_{mkdir,create*} don't pass superblock to hypfs_diag_create_files don't pass superblock to hypfs_vm_create_files() oprofile: get rid of pointless forward declarations of struct super_block oprofilefs_create_...() do not need superblock argument oprofilefs_mkdir() doesn't need superblock argument don't bother with passing superblock to oprofile_create_stats_files() oprofile: don't bother with passing superblock to ->create_files() don't bother passing sb to oprofile_create_files() coh901318: don't open-code simple_read_from_buffer() ...
2013-09-04direct-io: Implement generic deferred AIO completionsChristoph Hellwig1-21/+7
Add support to the core direct-io code to defer AIO completions to user context using a workqueue. This replaces opencoded and less efficient code in XFS and ext4 (we save a memory allocation for each direct IO) and will be needed to properly support O_(D)SYNC for AIO. The communication between the filesystem and the direct I/O code requires a new buffer head flag, which is a bit ugly but not avoidable until the direct I/O code stops abusing the buffer_head structure for communicating with the filesystems. Currently this creates a per-superblock unbound workqueue for these completions, which is taken from an earlier patch by Jan Kara. I'm not really convinced about this use and would prefer a "normal" global workqueue with a high concurrency limit, but this needs further discussion. JK: Fixed ext4 part, dynamic allocation of the workqueue. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-28ext4: Fix misspellings using 'codespell' toolAnatol Pomozov1-4/+4
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28ext4: convert write_begin methods to stable_page_writes semanticsDmitry Monakhov1-2/+3
Use wait_for_stable_page() instead of wait_on_page_writeback() Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-08-17ext4: fix lost truncate due to race with writebackJan Kara1-5/+12
The following race can lead to a loss of i_disksize update from truncate thus resulting in a wrong inode size if the inode size isn't updated again before inode is reclaimed: ext4_setattr() mpage_map_and_submit_extent() EXT4_I(inode)->i_disksize = attr->ia_size; ... ... disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT /* False because i_size isn't * updated yet */ if (disksize > i_size_read(inode)) /* True, because i_disksize is * already truncated */ if (disksize > EXT4_I(inode)->i_disksize) /* Overwrite i_disksize * update from truncate */ ext4_update_i_disksize() i_size_write(inode, attr->ia_size); For other places updating i_disksize such race cannot happen because i_mutex prevents these races. Writeback is the only place where we do not hold i_mutex and we cannot grab it there because of lock ordering. We fix the race by doing both i_disksize and i_size update in truncate atomically under i_data_sem and in mpage_map_and_submit_extent() we move the check against i_size under i_data_sem as well. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-08-17ext4: simplify truncation code in ext4_setattr()Jan Kara1-60/+49
Merge conditions in ext4_setattr() handling inode size changes, also move ext4_begin_ordered_truncate() call somewhat earlier because it simplifies error recovery in case of failure. Also add error handling in case i_disksize update fails. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-08-17ext4: fix ext4_writepages() in presence of truncateJan Kara1-41/+66
Inode size can arbitrarily change while writeback is in progress. When ext4_writepages() has prepared a long extent for mapping and truncate then reduces i_size, mpage_map_and_submit_buffers() will always map just one buffer in a page instead of all of them due to lblk < blocks check. So we end up not using all blocks we've allocated (thus leaking them) and also delalloc accounting goes wrong manifesting as a warning like: ext4_da_release_space:1333: ext4_da_release_space: ino 12, to_free 1 with only 0 reserved data blocks Note that the problem can happen only when blocksize < pagesize because otherwise we have only a single buffer in the page. Fix the problem by removing the size check from the mapping loop. We have an extent allocated so we have to use it all before checking for i_size. We also rename add_page_bufs_to_extent() to mpage_process_page_bufs() and make that function submit the page for IO if all buffers (upto EOF) in it are mapped. Reported-by: Dave Jones <davej@redhat.com> Reported-by: Zheng Liu <gnehzuil.liu@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-08-17ext4: move test whether extent to map can be extended to one placeJan Kara1-20/+29
Currently the logic whether the current buffer can be added to an extent of buffers to map is split between mpage_add_bh_to_extent() and add_page_bufs_to_extent(). Move the whole logic to mpage_add_bh_to_extent() which makes things a bit more straightforward and make following i_size fixes easier. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-08-17ext4: use unsigned int for es_status valuesTheodore Ts'o1-3/+3
Don't use an unsigned long long for the es_status flags; this requires that we pass 64-bit values around which is painful on 32-bit systems. Instead pass the extent status flags around using the low 4 bits of an unsigned int, and shift them into place when we are reading or writing es_pblk. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-08-17jbd2: Fix oops in jbd2_journal_file_inode()Jan Kara1-0/+43
Commit 0713ed0cde76438d05849f1537d3aab46e099475 added jbd2_journal_file_inode() call into ext4_block_zero_page_range(). However that function gets called from truncate path and thus inode needn't have jinode attached - that happens in ext4_file_open() but the file needn't be ever open since mount. Calling jbd2_journal_file_inode() without jinode attached results in the oops. We fix the problem by attaching jinode to inode also in ext4_truncate() and ext4_punch_hole() when we are going to zero out partial blocks. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-29ext4: add WARN_ON to check the length of allocated blocksZheng Liu1-21/+18
In commit 921f266b: ext4: add self-testing infrastructure to do a sanity check, some sanity checks were added in map_blocks to make sure 'retval == map->m_len'. Enable these checks by default and report any assertion failures using ext4_warning() and WARN_ON() since they can help us to figure out some bugs that are otherwise hard to hit. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-16ext4: call ext4_es_lru_add() after handling cache missTheodore Ts'o1-5/+2
If there are no items in the extent status tree, ext4_es_lru_add() is a no-op. So it is not sufficient to call ext4_es_lru_add() before we try to lookup an entry in the extent status tree. We also need to call it at the end of ext4_ext_map_blocks(), after items have been added to the extent status tree. This could lead to inodes with that have extent status trees but which are not in the LRU list, which means they won't get considered for eviction by the es_shrinker. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Zheng Liu <wenqing.lz@taobao.com> Cc: stable@vger.kernel.org
2013-07-13ext4: fix spelling errors and a comment in extent_status treeTheodore Ts'o1-4/+4
Replace "assertation" with "assertion" in lots and lots of debugging messages. Correct the comment stating when ext4_es_insert_extent() is used. It was no doubt tree at one point, but it is no longer true... Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Zheng Liu <gnehzuil.liu@gmail.com>
2013-07-06ext4: silence warning in ext4_writepages()Jan Kara1-2/+2
The loop in mpage_map_and_submit_extent() is guaranteed to always run at least once since the caller of mpage_map_and_submit_extent() makes sure map->m_len > 0. So make that explicit using do-while instead of pure while which also silences the compiler warning about uninitialized 'err' variable. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2013-07-01ext4: fix up error handling for mpage_map_and_submit_extent()Theodore Ts'o1-24/+28
The function mpage_released_unused_page() must only be called once; otherwise the kernel will BUG() when the second call to mpage_released_unused_page() tries to unlock the pages which had been unlocked by the first call. Also restructure the error handling so that we only give up on writing the dirty pages in the case of ENOSPC where retrying the allocation won't help. Otherwise, a transient failure, such as a kmalloc() failure in calling ext4_map_blocks() might cause us to give up on those pages, leading to a scary message in /var/log/messages plus data loss. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-07-01ext4: only zero partial blocks in ext4_zero_partial_blocks()Lukas Czerner1-7/+10
Currently if we pass range into ext4_zero_partial_blocks() which covers entire block we would attempt to zero it even though we should only zero unaligned part of the block. Fix this by checking whether the range covers the whole block skip zeroing if so. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01ext4: check error return from ext4_write_inline_data_end()Theodore Ts'o1-4/+7
The function ext4_write_inline_data_end() can return an error. So we need to assign it to a signed integer variable to check for an error return (since copied is an unsigned int). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Zheng Liu <wenqing.lz@taobao.com> Cc: stable@vger.kernel.org
2013-07-01ext4: delete unnecessary C statementsjon ernst1-9/+1
Comparing unsigned variable with 0 always returns false. err = 0 is duplicated and unnecessary. [ tytso: Also cleaned up error handling in ext4_block_zero_page_range() ] Signed-off-by: "Jon Ernst" <jonernst07@gmx.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01ext4: pass inode pointer instead of file pointer to punch holeAshish Sangwan1-2/+1
No need to pass file pointer when we can directly pass inode pointer. Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01ext4: improve extent cache shrink mechanism to avoid to burn CPU timeZheng Liu1-0/+4
Now we maintain an proper in-order LRU list in ext4 to reclaim entries from extent status tree when we are under heavy memory pressure. For keeping this order, a spin lock is used to protect this list. But this lock burns a lot of CPU time. We can use the following steps to trigger it. % cd /dev/shm % dd if=/dev/zero of=ext4-img bs=1M count=2k % mkfs.ext4 ext4-img % mount -t ext4 -o loop ext4-img /mnt % cd /mnt % for ((i=0;i<160;i++)); do truncate -s 64g $i; done % for ((i=0;i<160;i++)); do cp $i /dev/null &; done % perf record -a -g % perf report This commit tries to fix this problem. Now a new member called i_touch_when is added into ext4_inode_info to record the last access time for an inode. Meanwhile we never need to keep a proper in-order LRU list. So this can avoid to burns some CPU time. When we try to reclaim some entries from extent status tree, we use list_sort() to get a proper in-order list. Then we traverse this list to discard some entries. In ext4_sb_info, we use s_es_last_sorted to record the last time of sorting this list. When we traverse the list, we skip the inode that is newer than this time, and move this inode to the tail of LRU list. When the head of the list is newer than s_es_last_sorted, we will sort the LRU list again. In this commit, we break the loop if s_extent_cache_cnt == 0 because that means that all extents in extent status tree have been reclaimed. Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is changed to save a local variable in these functions. Reported-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06ext4: use ext4_da_writepages() for all modesTheodore Ts'o1-10/+31
Rename ext4_da_writepages() to ext4_writepages() and use it for all modes. We still need to iterate over all the pages in the case of data=journalling, but in the case of nodelalloc/data=ordered (which is what file systems mounted using ext3 backwards compatibility will use) this will allow us to use a much more efficient I/O submission path. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: remove ext4_ioend_wait()Jan Kara1-2/+3
Now that we clear PageWriteback after extent conversion, there's no need to wait for io_end processing in ext4_evict_inode(). Running AIO/DIO keeps file reference until aio_complete() is called so ext4_evict_inode() cannot be called. For io_end structures resulting from buffered IO waiting is happening because we wait for PageWriteback in truncate_inode_pages(). Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: don't wait for extent conversion in ext4_punch_hole()Jan Kara1-3/+0
We don't have to wait for extent conversion in ext4_punch_hole() as buffered IO for the punched range has been flushed and waited upon (thus all extent conversions for that range have completed). Also we wait for all DIO to finish using inode_dio_wait() so there cannot be any extent conversions pending due to direct IO. Also remove ext4_flush_unwritten_io() since it's unused now. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: remove wait for unwritten extent conversion from ext4_truncate()Jan Kara1-6/+0
Since PageWriteback bit is now cleared after extents are converted from unwritten to written ones, we have full exclusion of writeback path from truncate (truncate_inode_pages() waits for PageWriteback bits to get cleared on all invalidated pages). Exclusion from DIO path is achieved by inode_dio_wait() call in ext4_setattr(). So there's no need to wait for extent convertion in ext4_truncate() anymore. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: protect extent conversion after DIO with i_dio_countJan Kara1-2/+10
Make sure extent conversion after DIO happens while i_dio_count is still elevated so that inode_dio_wait() waits until extent conversion is done. This removes the need for explicit waiting for extent conversion in some cases. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: use transaction reservation for extent conversion in ext4_end_ioJan Kara1-5/+20
Later we would like to clear PageWriteback bit only after extent conversion from unwritten to written extents is performed. However it is not possible to start a transaction after PageWriteback is set because that violates lock ordering (and is easy to deadlock). So we have to reserve a transaction before locking pages and sending them for IO and later we use the transaction for extent conversion from ext4_end_io(). Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: remove buffer_uninit handlingJan Kara1-2/+2
There isn't any need for setting BH_Uninit on buffers anymore. It was only used to signal we need to mark io_end as needing extent conversion in add_bh_to_extent() but now we can mark the io_end directly when mapping extent. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: restructure writeback pathJan Kara1-524/+487
There are two issues with current writeback path in ext4. For one we don't necessarily map complete pages when blocksize < pagesize and thus needn't do any writeback in one iteration. We always map some blocks though so we will eventually finish mapping the page. Just if writeback races with other operations on the file, forward progress is not really guaranteed. The second problem is that current code structure makes it hard to associate all the bios to some range of pages with one io_end structure so that unwritten extents can be converted after all the bios are finished. This will be especially difficult later when io_end will be associated with reserved transaction handle. We restructure the writeback path to a relatively simple loop which first prepares extent of pages, then maps one or more extents so that no page is partially mapped, and once page is fully mapped it is submitted for IO. We keep all the mapping and IO submission information in mpage_da_data structure to somewhat reduce stack usage. Resulting code is somewhat shorter than the old one and hopefully also easier to read. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: better estimate credits needed for ext4_da_writepages()Jan Kara1-35/+27
We limit the number of blocks written in a single loop of ext4_da_writepages() to 64 when inode uses indirect blocks. That is unnecessary as credit estimates for mapping logically continguous run of blocks is rather low even for inode with indirect blocks. So just lift this limitation and properly calculate the number of necessary credits. This better credit estimate will also later allow us to always write at least a single page in one iteration. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: improve writepage credit estimate for files with indirect blocksJan Kara1-1/+1
ext4_ind_trans_blocks() wrongly used 'chunk' argument to decide whether blocks mapped are logically contiguous. That is wrong since the argument informs whether the blocks are physically contiguous. As the blocks mapped are always logically contiguous and that's all ext4_ind_trans_blocks() cares about, just remove the 'chunk' argument. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: stop messing with nr_to_write in ext4_da_writepages()Jan Kara1-96/+0
Writeback code got better in how it submits IO and now the number of pages requested to be written is usually higher than original 1024. The number is now dynamically computed based on observed throughput and is set to be about 0.5 s worth of writeback. E.g. on ordinary SATA drive this ends up somewhere around 10000 as my testing shows. So remove the unnecessary smarts from ext4_da_writepages(). Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: use io_end for multiple biosJan Kara1-38/+60
Change writeback path to create just one io_end structure for the extent to which we submit IO and share it among bios writing that extent. This prevents needless splitting and joining of unwritten extents when they cannot be submitted as a single bio. Bugs in ENOMEM handling found by Linux File System Verification project (linuxtesting.org) and fixed by Alexey Khoroshilov <khoroshilov@ispras.ru>. CC: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-01ext4: fix overflow when counting used blocks on 32-bit architecturesJan Kara1-2/+2
The arithmetics adding delalloc blocks to the number of used blocks in ext4_getattr() can easily overflow on 32-bit archs as we first multiply number of blocks by blocksize and then divide back by 512. Make the arithmetics more clever and also use proper type (unsigned long long instead of unsigned long). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-28ext4: remove unused discard_partial_page_buffersLukas Czerner1-206/+0
The discard_partial_page_buffers is no longer used anywhere so we can simply remove it including the *_no_lock variant and EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED define. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-28ext4: use ext4_zero_partial_blocks in punch_holeLukas Czerner1-72/+46
We're doing to get rid of ext4_discard_partial_page_buffers() since it is duplicating some code and also partially duplicating work of truncate_pagecache_range(), moreover the old implementation was much clearer. Now when the truncate_inode_pages_range() can handle truncating non page aligned regions we can use this to invalidate and zero out block aligned region of the punched out range and then use ext4_block_truncate_page() to zero the unaligned blocks on the start and end of the range. This will greatly simplify the punch hole code. Moreover after this commit we can get rid of the ext4_discard_partial_page_buffers() completely. We also introduce function ext4_prepare_punch_hole() to do come common operations before we attempt to do the actual punch hole on indirect or extent file which saves us some code duplication. This has been tested on ppc64 with 1k block size with fsx and xfstests without any problems. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-28Revert "ext4: fix fsx truncate failure"Lukas Czerner1-9/+2
This reverts commit 189e868fa8fdca702eb9db9d8afc46b5cb9144c9. This commit reintroduces the use of ext4_block_truncate_page() in ext4 truncate operation instead of ext4_discard_partial_page_buffers(). The statement in the commit description that the truncate operation only zero block unaligned portion of the last page is not exactly right, since truncate_pagecache_range() also zeroes and invalidate the unaligned portion of the page. Then there is no need to zero and unmap it once more and ext4_block_truncate_page() was doing the right job, although we still need to update the buffer head containing the last block, which is exactly what ext4_block_truncate_page() is doing. Moreover the problem described in the commit is fixed more properly with commit 15291164b22a357cb211b618adfef4fa82fc0de3 jbd2: clear BH_Delay & BH_Unwritten in journal_unmap_buffer This was tested on ppc64 machine with block size of 1024 bytes without any problems. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-28ext4: Call ext4_jbd2_file_inode() after zeroing blockLukas Czerner1-1/+4
In data=ordered mode we should call ext4_jbd2_file_inode() so that crash after the truncate transaction has committed does not expose stall data in the tail of the block. Thanks Jan Kara for pointing that out. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-28Revert "ext4: remove no longer used functions in inode.c"Lukas Czerner1-0/+120
This reverts commit ccb4d7af914e0fe9b2f1022f8ea6c300463fd5e6. This commit reintroduces functions ext4_block_truncate_page() and ext4_block_zero_page_range() which has been previously removed in favour of ext4_discard_partial_page_buffers(). In future commits we want to reintroduce those function and remove ext4_discard_partial_page_buffers() since it is duplicating some code and also partially duplicating work of truncate_pagecache_range(), moreover the old implementation was much clearer. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-22ext4: use ->invalidatepage() length argumentLukas Czerner1-11/+19
->invalidatepage() aop now accepts range to invalidate so we can make use of it in all ext4 invalidatepage routines. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
2013-05-22jbd2: change jbd2_journal_invalidatepage to accept lengthLukas Czerner1-1/+2
invalidatepage now accepts range to invalidate and there are two file system using jbd2 also implementing punch hole feature which can benefit from this. We need to implement the same thing for jbd2 layer in order to allow those file system take benefit of this functionality. This commit adds length argument to the jbd2_journal_invalidatepage() and updates all instances in ext4 and ocfs2. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
2013-05-22mm: change invalidatepage prototype to accept lengthLukas Czerner1-7/+11
Currently there is no way to truncate partial page where the end truncate point is not at the end of the page. This is because it was not needed and the functionality was enough for file system truncate operation to work properly. However more file systems now support punch hole feature and it can benefit from mm supporting truncating page just up to the certain point. Specifically, with this functionality truncate_inode_pages_range() can be changed so it supports truncating partial page at the end of the range (currently it will BUG_ON() if 'end' is not at the end of the page). This commit changes the invalidatepage() address space operation prototype to accept range to be invalidated and update all the instances for it. We also change the block_invalidatepage() in the same way and actually make a use of the new length argument implementing range invalidation. Actual file system implementations will follow except the file systems where the changes are really simple and should not change the behaviour in any way .Implementation for truncate_page_range() which will be able to accept page unaligned ranges will follow as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com>
2013-05-14Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds1-47/+38
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 update from Ted Ts'o: "Fixed regressions (two stability regressions and a performance regression) introduced during the 3.10-rc1 merge window. Also included is a bug fix relating to allocating blocks after resizing an ext3 file system when using the ext4 file system driver" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: jbd,jbd2: fix oops in jbd2_journal_put_journal_head() ext4: revert "ext4: use io_end for multiple bios" ext4: limit group search loop for non-extent files ext4: fix fio regression
2013-05-12ext4: revert "ext4: use io_end for multiple bios"Theodore Ts'o1-47/+38
This reverts commit 4eec708d263f0ee10861d69251708a225b64cac7. Multiple users have reported crashes which is apparently caused by this commit. Thanks to Dmitry Monakhov for bisecting it. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Dmitry Monakhov <dmonakhov@openvz.org> Cc: Jan Kara <jack@suse.cz>
2013-05-08aio: don't include aio.h in sched.hKent Overstreet1-0/+1
Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-23ext4: fix type-widening bug in inode table readahead codeTheodore Ts'o1-2/+3
Due to a missing cast, the high 32-bits of a 64-bit block number used when calculating the readahead block for inode tables can get lost. This means we can end up fetching the wrong blocks for readahead for file systems > 16TB. Linus found this when experimenting with an enhacement to the sparse static code checker which checks for missing widening casts before binary "not" operators. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-22ext4: mark metadata blocks using bh flagsTheodore Ts'o1-1/+5
This allows metadata writebacks which are issued via block device writeback to be sent with the current write request flags. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-12ext4: use io_end for multiple biosJan Kara1-38/+47
Change writeback path to create just one io_end structure for the extent to which we submit IO and share it among bios writing that extent. This prevents needless splitting and joining of unwritten extents when they cannot be submitted as a single bio. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-04-10ext4: fix big-endian bug in metadata checksum calculationsDmitry Monakhov1-4/+4
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-04-10ext4: introduce reserved spaceLukas Czerner1-1/+10
Currently in ENOSPC condition when writing into unwritten space, or punching a hole, we might need to split the extent and grow extent tree. However since we can not allocate any new metadata blocks we'll have to zero out unwritten part of extent or punched out part of extent, or in the worst case return ENOSPC even though use actually does not allocate any space. Also in delalloc path we do reserve metadata and data blocks for the time we're going to write out, however metadata block reservation is very tricky especially since we expect that logical connectivity implies physical connectivity, however that might not be the case and hence we might end up allocating more metadata blocks than previously reserved. So in future, metadata reservation checks should be removed since we can not assure that we do not under reserve. And this is where reserved space comes into the picture. When mounting the file system we slice off a little bit of the file system space (2% or 4096 clusters, whichever is smaller) which can be then used for the cases mentioned above to prevent costly zeroout, or unexpected ENOSPC. The number of reserved clusters can be set via sysfs, however it can never be bigger than number of free clusters in the file system. Note that this patch fixes the failure of xfstest 274 as expected. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-04-09ext4: fix free space estimate in ext4_nonda_switch()Eric Whitney1-7/+8
Values stored in s_freeclusters_counter and s_dirtyclusters_counter are both in cluster units. Remove the cluster to block conversion applied to s_freeclusters_counter causing an inflated estimate of free space because s_dirtyclusters_counter is not similarly converted. Rename free_blocks and dirty_blocks to better reflect the units these variables contain to avoid future confusion. This fix corrects ENOSPC failures for xfstests 127 and 231 on bigalloc file systems. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>