summaryrefslogtreecommitdiff
path: root/fs/xfs
AgeCommit message (Collapse)AuthorFilesLines
2013-08-13xfs: separate dquot on disk format definitions out of xfs_quota.hDave Chinner29-137/+175
The on disk format definitions of the on-disk dquot, log formats and quota off log formats are all intertwined with other definitions for quotas. Separate them out into their own header file so they can easily be shared with userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13xfs: split out EFI/EFD log item format definitionDave Chinner2-86/+85
The EFI/EFD item format definitions are shared with userspace. Split the out of header files that contain kernel only defintions to make it simple to shared them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13xfs: split out buf log item format definitionsDave Chinner2-97/+100
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13xfs: split out inode log item format definitionDave Chinner7-183/+200
The log item format definitions are shared with userspace. Split them out of header files that contain kernel only defintions to make it simple to shared them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13xfs: separate out log format definitionsDave Chinner4-208/+207
The on-disk format definitions for the log are spread randoms through a couple of header files. Consolidate it all in a single file that can be shared easily with userspace. This means that xfs_log.h and xfs_log_priv.h no longer need to be shared with userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-30xfs: WQ_NON_REENTRANT is meaningless and going awayTejun Heo1-3/+3
dbf2576e37 ("workqueue: make all workqueues non-reentrant") made WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages. This patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: xfs@oss.sgi.com Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-24xfs: di_flushiter considered harmfulDave Chinner3-11/+36
When we made all inode updates transactional, we no longer needed the log recovery detection for inodes being newer on disk than the transaction being replayed - it was redundant as replay of the log would always result in the latest version of the inode would be on disk. It was redundant, but left in place because it wasn't considered to be a problem. However, with the new "don't read inodes on create" optimisation, flushiter has come back to bite us. Essentially, the optimisation made always initialises flushiter to zero in the create transaction, and so if we then crash and run recovery and the inode already on disk has a non-zero flushiter it will skip recovery of that inode. As a result, log recovery does the wrong thing and we end up with a corrupt filesystem. Because we have to support old kernel to new kernel upgrades, we can't just get rid of the flushiter support in log recovery as we might be upgrading from a kernel that doesn't have fully transactional inode updates. Unfortunately, for v4 superblocks there is no way to guarantee that log recovery knows about this fact. We cannot add a new inode format flag to say it's a "special inode create" because it won't be understood by older kernels and so recovery could do the wrong thing on downgrade. We cannot specially detect the combination of zero mode/non-zero flushiter on disk to non-zero mode, zero flushiter in the log item during recovery because wrapping of the flushiter can result in false detection. Hence that makes this "don't use flushiter" optimisation limited to a disk format that guarantees that we don't need it. And that means the only fix here is to limit the "no read IO on create" optimisation to version 5 superblocks.... Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-22xfs: Start using pquotaino from the superblock.Chandra Seetharaman5-43/+158
Start using pquotino and define a macro to check if the superblock has pquotino. Keep backward compatibilty by alowing mount of older superblock with no separate pquota inode. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-22xfs: Initialize all quota inodes to be NULLFSINOChandra Seetharaman1-0/+18
mkfs doesn't initialize the quota inodes to NULLFSINO as it does for the other internal inodes. This leads to two in-core values (0 and NULLFSINO) to be checked against, to make sure if a quota inode is valid. Solve that problem by initializing the in-core values of all quotaino values to NULLFSINO if they are 0 in the disk. Note that these values are not written back to on-disk superblock unless some quota is enabled on the filesystem. Even in that case sb_pquotino is written to disk only if the on-disk superblock supports pquotino Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-22xfs: Fix a deadlock in xfs_log_commit_cil() code pathChandra Seetharaman1-1/+1
While testing and rearranging pquota/gquota code, I stumbled on a xfs_shutdown() during a mount. But the mount just hung. Debugged and found that there is a deadlock involving &log->l_cilp->xc_ctx_lock. It is in a code path where &log->l_cilp->xc_ctx_lock is first acquired in read mode and some levels down the same semaphore is being acquired in write mode causing a deadlock. This is the stack: xfs_log_commit_cil -> acquires &log->l_cilp->xc_ctx_lock in read mode xlog_print_tic_res xfs_force_shutdown xfs_log_force_umount xlog_cil_force xlog_cil_force_lsn xlog_cil_push_foreground xlog_cil_push - tries to acquire same semaphore in write mode This patch fixes the deadlock by changing the reason code for xfs_force_shutdown in xlog_print_tic_res() to SHUTDOWN_LOG_IO_ERROR. SHUTDOWN_LOG_IO_ERROR is the right reason code to be set since we are in the log path. Thanks to Dave for suggesting this solution. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-22xfs: fix assertion failure in xfs_vm_write_failed()Jie Liu1-1/+14
In xfs_vm_write_failed(), we evaluate the block_offset of pos with PAGE_MASK which is an unsigned long. That is fine on 64-bit platforms regardless of whether the request pos is 32-bit or 64-bit. However, on 32-bit platforms the value is 0xfffff000 and so the high 32 bits in it will be masked off with (pos & PAGE_MASK) for a 64-bit pos. As a result, the evaluated block_offset is incorrect which will cause this failure ASSERT(block_offset + from == pos); and potentially pass the wrong block to xfs_vm_kill_delalloc_range(). In this case, we can get a kernel panic if CONFIG_XFS_DEBUG is enabled: XFS: Assertion failed: block_offset + from == pos, file: fs/xfs/xfs_aops.c, line: 1504 ------------[ cut here ]------------ kernel BUG at fs/xfs/xfs_message.c:100! invalid opcode: 0000 [#1] SMP ........ Pid: 4057, comm: mkfs.xfs Tainted: G O 3.9.0-rc2 #1 EIP: 0060:[<f94a7e8b>] EFLAGS: 00010282 CPU: 0 EIP is at assfail+0x2b/0x30 [xfs] EAX: 00000056 EBX: f6ef28a0 ECX: 00000007 EDX: f57d22a4 ESI: 1c2fb000 EDI: 00000000 EBP: ea6b5d30 ESP: ea6b5d1c DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: 094f3ff4 CR3: 2bcb4000 CR4: 000006f0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 Process mkfs.xfs (pid: 4057, ti=ea6b4000 task=ea5799e0 task.ti=ea6b4000) Stack: 00000000 f9525c48 f951fa80 f951f96b 000005e4 ea6b5d7c f9494b34 c19b0ea2 00000066 f3d6c620 c19b0ea2 00000000 e9a91458 00001000 00000000 00000000 00000000 c15c7e89 00000000 1c2fb000 00000000 00000000 1c2fb000 00000080 Call Trace: [<f9494b34>] xfs_vm_write_failed+0x74/0x1b0 [xfs] [<c15c7e89>] ? printk+0x4d/0x4f [<f9494d7d>] xfs_vm_write_begin+0x10d/0x170 [xfs] [<c110a34c>] generic_file_buffered_write+0xdc/0x210 [<f949b669>] xfs_file_buffered_aio_write+0xf9/0x190 [xfs] [<f949b7f3>] xfs_file_aio_write+0xf3/0x160 [xfs] [<c115e504>] do_sync_write+0x94/0xd0 [<c115ed1f>] vfs_write+0x8f/0x160 [<c115e470>] ? wait_on_retry_sync_kiocb+0x50/0x50 [<c115f017>] sys_write+0x47/0x80 [<c15d860d>] sysenter_do_call+0x12/0x28 ............. EIP: [<f94a7e8b>] assfail+0x2b/0x30 [xfs] SS:ESP 0068:ea6b5d1c ---[ end trace cdd9af4f4ecab42f ]--- Kernel panic - not syncing: Fatal exception In order to avoid this, we can evaluate the block_offset of the start of the page by using shifts rather than masks the mismatch problem. Thanks Dave Chinner for help finding and fixing this bug. Reported-by: Michael L. Semon <mlsemon35@gmail.com> Reviewed-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-13Merge tag 'for-linus-v3.11-rc1-2' of git://oss.sgi.com/xfs/xfsLinus Torvalds21-316/+435
Pull more xfs updates from Ben Myers: "Here are a fix for xfs_fsr, a cleanup in bulkstat, a cleanup in xfs_open_by_handle, updated mount options documentation, a cleanup in xfs_bmapi_write, a fix for the size of dquot log reservations, a fix for sgid inheritance when acls are in use, a fix for cleaning up quotainfo structures, and some more of the work which allows group and project quotas to be used together. We had a few more in this last quota category that we might have liked to get in, but it looks there are still a few items that need to be addressed. - fix for xfs_fsr returning -EINVAL - cleanup in xfs_bulkstat - cleanup in xfs_open_by_handle - update mount options documentation - clean up local format handling in xfs_bmapi_write - fix dquot log reservations which were too small - fix sgid inheritance for subdirectories when default acls are in use - add project quota fields to various structures - fix teardown of quotainfo structures when quotas are turned off" * tag 'for-linus-v3.11-rc1-2' of git://oss.sgi.com/xfs/xfs: xfs: Fix the logic check for all quotas being turned off xfs: Add pquota fields where gquota is used. xfs: fix sgid inheritance for subdirectories inheriting default acls [V3] xfs: dquot log reservations are too small xfs: remove local fork format handling from xfs_bmapi_write() xfs: update mount options documentation xfs: use get_unused_fd_flags(0) instead of get_unused_fd() xfs: clean up unused codes at xfs_bulkstat() xfs: use XFS_BMAP_BMDR_SPACE vs. XFS_BROOT_SIZE_ADJ
2013-07-12xfs: Fix the logic check for all quotas being turned offChandra Seetharaman2-11/+2
During the review of seperate pquota inode patches, David noticed that the test to detect all quotas being turned off was incorrect, and hence the block was not freeing all the quota information. The check made sense in Irix, but in Linux, quota is turned off one at a time, which makes the test invalid for Linux. This problem existed since XFS was ported to Linux. David suggested to fix the problem by detecting when all quotas are turned off by checking m_qflags. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-11xfs: Add pquota fields where gquota is used.Chandra Seetharaman14-124/+291
Add project quota changes to all the places where group quota field is used: * add separate project quota members into various structures * split project quota and group quotas so that instead of overriding the group quota members incore, the new project quota members are used instead * get rid of usage of the OQUOTA flag incore, in favor of separate group and project quota flags. * add a project dquot argument to various functions. Not using the pquotino field from superblock yet. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-10xfs: fix sgid inheritance for subdirectories inheriting default acls [V3]Carlos Maiolino1-10/+10
XFS removes sgid bits of subdirectories under a directory containing a default acl. When a default acl is set, it implies xfs to call xfs_setattr_nonsize() in its code path. Such function is shared among mkdir and chmod system calls, and does some checks unneeded by mkdir (calling inode_change_ok()). Such checks remove sgid bit from the inode after it has been granted. With this patch, we extend the meaning of XFS_ATTR_NOACL flag to avoid these checks when acls are being inherited (thanks hch). Also, xfs_setattr_mode, doesn't need to re-check for group id and capabilities permissions, this only implies in another try to remove sgid bit from the directories. Such check is already done either on inode_change_ok() or xfs_setattr_nonsize(). Changelog: V2: Extends the meaning of XFS_ATTR_NOACL instead of wrap the tests into another function V3: Remove S_ISDIR check in xfs_setattr_nonsize() from the patch Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-10xfs: dquot log reservations are too smallDave Chinner2-9/+25
During review of the separate project quota inode patches, it became obvious that the dquot log reservation calculation underestimated the number dquots that can be modified in a transaction. This has it's roots way back in the Irix quota implementation. That is, when quotas were first implemented in XFS, it only supported user and project quotas as Irix did not have group quotas. Hence the worst case operation involving dquot modification was calculated to involve 2 user dquots and 1 project dquot or 1 user dequot and 2 project dquots. i.e. 3 dquots. This was determined back in 1996, and has remained unchanged ever since. However, back in 2001, the Linux XFS port dropped all support for project quota and implmented group quotas over the top. This was effectively done with a search-and-replace of project with group, and as such the log reservation was not changed. However, with the advent of group quotas, chmod and rename now could modify more than 3 dquots in a single transaction - both could modify 4 dquots. Hence this log reservation has been wrong for a long time. In 2005, project quota support was reintroduced into Linux, but it was implemented to be mutually exclusive to group quotas and so this didn't add any new changes to the dquot log reservation. Hence when project quotas were in use (rather than group quotas) the log reservation was again valid, just like in the Irix days. Now, with the addition of the separate project quota inode, group and project quotas are no longer mutually exclusive, and hence operations can now modify three dquots per inode where previously it was only two. The worst case here is the rename transaction, which can allocate/free space on two different directory inodes, and if they have different uid/gid/prid configurations and are world writeable, then rename can actually modify 6 different dquots now. Further, the dquot log reservation doesn't take into account the space used by the dquot log format structure that precedes the dquot that is logged, and hence further underestimates the worst case log space required by dquots during a transaction. This has been missing since the first commit in 1996. Hence the worst case log reservation needs to be increased from 3 to 6, and it needs to take into account a log format header for each of those dquots. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-10xfs: remove local fork format handling from xfs_bmapi_write()Dave Chinner4-124/+98
The conversion from local format to extent format requires interpretation of the data in the fork being converted, so it cannot be done in a generic way. It is up to the caller to convert the fork format to extent format before calling into xfs_bmapi_write() so format conversion can be done correctly. The code in xfs_bmapi_write() to convert the format is used implicitly by the attribute and directory code, but they specifically zero the fork size so that the conversion does not do any allocation or manipulation. Move this conversion into the shortform to leaf functions for the dir/attr code so the conversions are explicitly controlled by all callers. Now we can remove the conversion code in xfs_bmapi_write. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-10xfs: use get_unused_fd_flags(0) instead of get_unused_fd()Yann Droneaud1-1/+1
Macro get_unused_fd() is used to allocate a file descriptor with default flags. Those default flags (0) can be "unsafe": O_CLOEXEC must be used by default to not leak file descriptor across exec(). Instead of macro get_unused_fd(), functions anon_inode_getfd() or get_unused_fd_flags() should be used with flags given by userspace. If not possible, flags should be set to O_CLOEXEC to provide userspace with a default safe behavor. In a further patch, get_unused_fd() will be removed so that new code start using anon_inode_getfd() or get_unused_fd_flags() with correct flags. This patch replaces calls to get_unused_fd() with equivalent call to get_unused_fd_flags(0) to preserve current behavor for existing code. The hard coded flag value (0) should be reviewed on a per-subsystem basis, and, if possible, set to O_CLOEXEC. Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-10xfs: clean up unused codes at xfs_bulkstat()Jie Liu1-27/+1
There are some unused codes at xfs_bulkstat(): - Variable bp is defined to point to the on-disk inode cluster buffer, but it proved to be of no practical help. - We process the chunks of good inodes which were fetched by iterating btree records from an AG. When processing inodes from each chunk, the code recomputing agbno if run into the first inode of a cluster, however, the agbno is not being used thereafter. This patch tries to clean up those things. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-09Merge tag 'for-linus-v3.11-rc1' of git://oss.sgi.com/xfs/xfsLinus Torvalds43-477/+1180
Pull xfs update from Ben Myers: "This includes several bugfixes, part of the work for project quotas and group quotas to be used together, performance improvements for inode creation/deletion, buffer readahead, and bulkstat, implementation of the inode change count, an inode create transaction, and the removal of a bunch of dead code. There are also some duplicate commits that you already have from the 3.10-rc series. - part of the work to allow project quotas and group quotas to be used together - inode change count - inode create transaction - block queue plugging in buffer readahead and bulkstat - ordered log vector support - removal of dead code in and around xfs_sync_inode_grab, xfs_ialloc_get_rec, XFS_MOUNT_RETERR, XFS_ALLOCFREE_LOG_RES, XFS_DIROP_LOG_RES, xfs_chash, ctl_table, and xfs_growfs_data_private - don't keep silent if sunit/swidth can not be changed via mount - fix a leak of remote symlink blocks into the filesystem when xattrs are used on symlinks - fix for fiemap to return FIEMAP_EXTENT_UNKOWN flag on delay extents - part of a fix for xfs_fsr - disable speculative preallocation with small files - performance improvements for inode creates and deletes" * tag 'for-linus-v3.11-rc1' of git://oss.sgi.com/xfs/xfs: (61 commits) xfs: Remove incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD xfs: Change xfs_dquot_acct to be a 2-dimensional array xfs: Code cleanup and removal of some typedef usage xfs: Replace macro XFS_DQ_TO_QIP with a function xfs: Replace macro XFS_DQUOT_TREE with a function xfs: Define a new function xfs_is_quota_inode() xfs: implement inode change count xfs: Use inode create transaction xfs: Inode create item recovery xfs: Inode create transaction reservations xfs: Inode create log items xfs: Introduce an ordered buffer item xfs: Introduce ordered log vector support xfs: xfs_ifree doesn't need to modify the inode buffer xfs: don't do IO when creating an new inode xfs: don't use speculative prealloc for small files xfs: plug directory buffer readahead xfs: add pluging for bulkstat readahead xfs: Remove dead function prototype xfs_sync_inode_grab() xfs: Remove the left function variable from xfs_ialloc_get_rec() ...
2013-07-09xfs: use XFS_BMAP_BMDR_SPACE vs. XFS_BROOT_SIZE_ADJEric Sandeen2-10/+7
XFS_BROOT_SIZE_ADJ is an undocumented macro which accounts for the difference in size between the on-disk and in-core btree root. It's much clearer to just use the newly-added XFS_BMAP_BMDR_SPACE macro which gives us the on-disk size directly. In one case, we must test that the if_broot exists before applying the macro, however. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-03Merge branch 'for-linus' of ↵Linus Torvalds1-4/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second set of VFS changes from Al Viro: "Assorted f_pos race fixes, making do_splice_direct() safe to call with i_mutex on parent, O_TMPFILE support, Jeff's locks.c series, ->d_hash/->d_compare calling conventions changes from Linus, misc stuff all over the place." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) Document ->tmpfile() ext4: ->tmpfile() support vfs: export lseek_execute() to modules lseek_execute() doesn't need an inode passed to it block_dev: switch to fixed_size_llseek() cpqphp_sysfs: switch to fixed_size_llseek() tile-srom: switch to fixed_size_llseek() proc_powerpc: switch to fixed_size_llseek() ubi/cdev: switch to fixed_size_llseek() pci/proc: switch to fixed_size_llseek() isapnp: switch to fixed_size_llseek() lpfc: switch to fixed_size_llseek() locks: give the blocked_hash its own spinlock locks: add a new "lm_owner_key" lock operation locks: turn the blocked_list into a hashtable locks: convert fl_link to a hlist_node locks: avoid taking global lock if possible when waking up blocked waiters locks: protect most of the file_lock handling with i_lock locks: encapsulate the fl_link list handling locks: make "added" in __posix_lock_file a bool ...
2013-07-03vfs: export lseek_execute() to modulesJie Liu1-4/+2
For those file systems(btrfs/ext4/ocfs2/tmpfs) that support SEEK_DATA/SEEK_HOLE functions, we end up handling the similar matter in lseek_execute() to update the current file offset to the desired offset if it is valid, ceph also does the simliar things at ceph_llseek(). To reduce the duplications, this patch make lseek_execute() public accessible so that we can call it directly from the underlying file systems. Thanks Dave Chinner for this suggestion. [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back] v2->v1: - Add kernel-doc comments for lseek_execute() - Call lseek_execute() in ceph->llseek() Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Mason <chris.mason@fusionio.com> Cc: Josef Bacik <jbacik@fusionio.com> Cc: Ben Myers <bpm@sgi.com> Cc: Ted Tso <tytso@mit.edu> Cc: Hugh Dickins <hughd@google.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Sage Weil <sage@inktank.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-02Merge tag 'ext4_for_linus' of ↵Linus Torvalds2-11/+18
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 update from Ted Ts'o: "Lots of bug fixes, cleanups and optimizations. In the bug fixes category, of note is a fix for on-line resizing file systems where the block size is smaller than the page size (i.e., file systems 1k blocks on x86, or more interestingly file systems with 4k blocks on Power or ia64 systems.) In the cleanup category, the ext4's punch hole implementation was significantly improved by Lukas Czerner, and now supports bigalloc file systems. In addition, Jan Kara significantly cleaned up the write submission code path. We also improved error checking and added a few sanity checks. In the optimizations category, two major optimizations deserve mention. The first is that ext4_writepages() is now used for nodelalloc and ext3 compatibility mode. This allows writes to be submitted much more efficiently as a single bio request, instead of being sent as individual 4k writes into the block layer (which then relied on the elevator code to coalesce the requests in the block queue). Secondly, the extent cache shrink mechanism, which was introduce in 3.9, no longer has a scalability bottleneck caused by the i_es_lru spinlock. Other optimizations include some changes to reduce CPU usage and to avoid issuing empty commits unnecessarily." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits) ext4: optimize starting extent in ext4_ext_rm_leaf() jbd2: invalidate handle if jbd2_journal_restart() fails ext4: translate flag bits to strings in tracepoints ext4: fix up error handling for mpage_map_and_submit_extent() jbd2: fix theoretical race in jbd2__journal_restart ext4: only zero partial blocks in ext4_zero_partial_blocks() ext4: check error return from ext4_write_inline_data_end() ext4: delete unnecessary C statements ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree() jbd2: move superblock checksum calculation to jbd2_write_superblock() ext4: pass inode pointer instead of file pointer to punch hole ext4: improve free space calculation for inline_data ext4: reduce object size when !CONFIG_PRINTK ext4: improve extent cache shrink mechanism to avoid to burn CPU time ext4: implement error handling of ext4_mb_new_preallocation() ext4: fix corruption when online resizing a fs with 1K block size ext4: delete unused variables ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents jbd2: remove debug dependency on debug_fs and update Kconfig help text jbd2: use a single printk for jbd_debug() ...
2013-06-29[readdir] convert xfsAl Viro7-61/+44
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29xfs: Remove incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKDChandra Seetharaman7-53/+123
Remove all incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD. Instead, start using XFS_GQUOTA_.* XFS_PQUOTA_.* counterparts for GQUOTA and PQUOTA respectively. On-disk copy still uses XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD. Read and write of the superblock does the conversion from *OQUOTA* to *[PG]QUOTA*. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-28xfs: Change xfs_dquot_acct to be a 2-dimensional arrayChandra Seetharaman2-23/+20
In preparation for combined pquota/gquota support, for the sake of readability, change xfs_dquot_acct to be a 2-dimensional array. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-28xfs: Code cleanup and removal of some typedef usageChandra Seetharaman7-126/+135
In preparation for combined pquota/gquota support, for the sake of readability, do some code cleanup surrounding the affected code. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-28xfs: Replace macro XFS_DQ_TO_QIP with a functionChandra Seetharaman3-5/+17
In preparation for combined pquota/gquota support, for the sake of readability, change the macro to an inline function. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-28xfs: Replace macro XFS_DQUOT_TREE with a functionChandra Seetharaman3-10/+20
In preparation for combined pquota/gquota support, for the sake of readability, change the macro to an inline function. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-28xfs: Define a new function xfs_is_quota_inode()Chandra Seetharaman4-10/+12
In preparation for combined pquota/gquota support, define a new function to check if the given inode is a quota inode. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-28xfs: implement inode change countDave Chinner2-0/+15
For CRC enabled filesystems, add support for the monotonic inode version change counter that is needed by protocols like NFSv4 for determining if the inode has changed in any way at all between two unrelated operations on the inode. This bumps the change count the first time an inode is dirtied in a transaction. Since all modifications to the inode are logged, this will catch all changes that are made to the inode, including timestamp updates that occur during data writes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27xfs: Use inode create transactionDave Chinner2-11/+33
Replace the use of buffer based logging of inode initialisation, uses the new logical form to describe the range to be initialised in recovery. We continue to "log" the inode buffers to push them into the AIL and ensure that the inode create transaction is not removed from the log before the inode buffers are written to disk. Update the transaction identifier and reservations to match the changed implementation. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27xfs: Inode create item recoveryDave Chinner3-14/+145
When we find a icreate transaction, we need to get and initialise the buffers in the range that has been passed. Extract and verify the information in the item record, then loop over the range initialising and issuing the buffer writes delayed. Support an arbitrary size range to initialise so that in future when we allocate inodes in much larger chunks all kernels that understand this transaction can still recover them. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27xfs: Inode create transaction reservationsDave Chinner1-41/+77
Define the log and space transaction sizes. Factor the current create log reservation macro into the two logical halves and reuse one half for the new icreate transactions. The icreate transaction is transparent to all the high level create code - the pre-calculated reservations will correctly set the reservations dependent on whether the filesystem supports the icreate transaction. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27xfs: Inode create log itemsDave Chinner6-2/+261
Introduce the inode create log item type for logical inode create logging. Instead of logging the changes in buffers, pass the range to be initialised through the log by a new transaction type. This reduces the amount of log space required to record initialisation during allocation from about 128 bytes per inode to a small fixed amount per inode extent to be initialised. This requires a new log item type to track it through the log and the AIL. This is a relatively simple item - most callbacks are noops as this item has the same life cycle as the transaction. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27xfs: Introduce an ordered buffer itemDave Chinner5-31/+87
If we have a buffer that we have modified but we do not wish to physically log in a transaction (e.g. we've logged a logical change), we still need to ensure that transactional integrity is maintained. Hence we must not move the tail of the log past the transaction that the buffer is associated with before the buffer is written to disk. This means these special buffers still need to be included in the transaction and added to the AIL just like a normal buffer, but we do not want the modifications to the buffer written into the transaction. IOWs, what we want is an "ordered buffer" that maintains the same transactional life cycle as a physically logged buffer, just without the transcribing of the modifications to the log. Hence we need to flag the buffer as an "ordered buffer" to avoid including it in vector size calculations or formatting during the transaction. Once the transaction is committed, the buffer appears for all intents to be the same as a physically logged buffer as it transitions through the log and AIL. Relogging will also work just fine for such an ordered buffer - the logical transaction will be replayed before the subsequent modifications that relog the buffer, so everything will be reconstructed correctly by recovery. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27xfs: Introduce ordered log vector supportDave Chinner3-28/+71
And "ordered log vector" is a log vector that is used for tracking a log item through the CIL and into the AIL as part of the log checkpointing. These ordered log vectors are special in that they are not written to to journal in any way, and are not accounted to the checkpoint being written. The reason for this behaviour is to allow operations to attach items to transactions and have them follow the normal transactional lifecycle without actually having to write them to the journal. This allows logging of items that track high level logical changes and writing them to the log, while the physical items being modified pass through into the AIL and pin the tail of the log (and therefore the logical item in the log) until all the modified items are physically written to disk. IOWs, it allows us to write metadata without physically logging every individual change but still maintain the full transactional integrity guarantees we currently have w.r.t. crash recovery. This change modifies some of the CIL item insertion loops, as ordered log vectors introduce some new constraints as they don't track any data. One advantage of this change is that it combines two log vector chain walks into a single pass, so there is less overhead in the transaction commit pass as well. It also kills some unused code in the log vector walk loop when committing the CIL. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27xfs: xfs_ifree doesn't need to modify the inode bufferDave Chinner1-28/+4
Long ago, bulkstat used to read inodes directly from the backing buffer for speed. This had the unfortunate problem of being cache incoherent with unlinks, and so xfs_ifree() had to mark the inode as free directly in the backing buffer. bulkstat was changed some time ago to use inode cache coherent lookups, and so will never see unlinked inodes in it's lookups. Hence xfs_ifree() does not need to touch the inode backing buffer anymore. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27xfs: don't do IO when creating an new inodeDave Chinner1-8/+28
When we are allocating a new inode, we read the inode cluster off disk to increment the generation number. We are already using a random generation number for newly allocated inodes, so if we are not using the ikeep mode, we can just generate a new generation number when we initialise the newly allocated inode. This avoids the need for reading the inode buffer during inode creation. This will speed up allocation of inodes in cold, partially allocated clusters as they will no longer need to be read from disk during allocation. It will also reduce the CPU overhead of inode allocation by not having the process the buffer read, even on cache hits. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27xfs: don't use speculative prealloc for small filesDave Chinner1-0/+13
Dedicated small file workloads have been seeing significant free space fragmentation causing premature inode allocation failure when large inode sizes are in use. A particular test case showed that a workload that runs to a real ENOSPC on 256 byte inodes would fail inode allocation with ENOSPC about about 80% full with 512 byte inodes, and at about 50% full with 1024 byte inodes. The same workload, when run with -o allocsize=4096 on 1024 byte inodes would run to being 100% full before giving ENOSPC. That is, no freespace fragmentation at all. The issue was caused by the specific IO pattern the application had - the framework it was using did not support direct IO, and so it was emulating it by using fadvise(DONT_NEED). The result was that the data was getting written back before the speculative prealloc had been trimmed from memory by the close(), and so small single block files were being allocated with 2 blocks, and then having one truncated away. The result was lots of small 4k free space extents, and hence each new 8k allocation would take another 8k from contiguous free space and turn it into 4k of allocated space and 4k of free space. Hence inode allocation, which requires contiguous, aligned allocation of 16k (256 byte inodes), 32k (512 byte inodes) or 64k (1024 byte inodes) can fail to find sufficiently large freespace and hence fail while there is still lots of free space available. There's a simple fix for this, and one that has precendence in the allocator code already - don't do speculative allocation unless the size of the file is larger than a certain size. In this case, that size is the minimum default preallocation size: mp->m_writeio_blocks. And to keep with the concept of being nice to people when the files are still relatively small, cap the prealloc to mp->m_writeio_blocks until the file goes over a stripe unit is size, at which point we'll fall back to the current behaviour based on the last extent size. This will effectively turn off speculative prealloc for very small files, keep preallocation low for small files, and behave as it currently does for any file larger than a stripe unit. This completely avoids the freespace fragmentation problem this particular IO pattern was causing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27xfs: plug directory buffer readaheadDave Chinner1-0/+3
Similar to bulkstat inode chunk readahead, we need to plug directory data buffer readahead during getdents to ensure that we can merge adjacent readahead requests and sort out of order requests optimally before they are dispatched. This improves the readahead efficiency and reduces the IO load it generates as the IO patterns are significantly better for both contiguous and fragmented directories. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27xfs: add pluging for bulkstat readaheadDave Chinner1-0/+3
I was running some tests on bulkstat on CRC enabled filesystems when I noticed that all the IO being issued was 8k in size, regardless of the fact taht we are issuing sequential 8k buffers for inodes clusters. The IO size should be 16k for 256 byte inodes, and 32k for 512 byte inodes, but this wasn't happening. blktrace showed that there was an explict plug and unplug happening around each readahead IO from _xfs_buf_ioapply, and the unplug was causing the IO to be issued immediately. Hence no opportunity was being given to the elevator to merge adjacent readahead requests and dispatch them as a single IO. Add plugging around the inode chunk readahead dispatch loop in bulkstat to ensure that we don't unplug the queue between adjacent inode buffer readahead IOs and so we get fewer, larger IO requests hitting the storage subsystem for bulkstat. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-26xfs: Remove dead function prototype xfs_sync_inode_grab()Jie Liu1-1/+0
Remove dead function prototype xfs_sync_inode_grab() from xfs_icache.h. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-26xfs: Remove the left function variable from xfs_ialloc_get_rec()Jie Liu1-4/+3
This patch clean out the left function variable as it is useless to xfs_ialloc_get_rec(). Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-20xfs: check on-disk (not incore) btree root size in dfrag.cEric Sandeen2-3/+7
xfs_swap_extents_check_format() contains checks to make sure that original and the temporary files during defrag are compatible; Gabriel VLASIU ran into a case where xfs_fsr returned EINVAL because the tests found the btree root to be of size 120, while the fork offset was only 104; IOW, they overlapped. However, this is just due to an error in the xfs_swap_extents_check_format() tests, because it is checking the in-memory btree root size against the on-disk fork offset. We should be checking the on-disk sizes in both cases. This patch adds a new macro to calculate this size, and uses it in the tests. With this change, the filesystem image provided by Gabriel allows for proper file degragmentation. Reported-by: Gabriel VLASIU <gabriel@vlasiu.net> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-19xfs: Remove XFS_MOUNT_RETERRJie Liu3-36/+14
XFS_MOUNT_RETERR is going to be set at xfs_parseargs() if mp->m_dalign is enabled, so any time we enter "if (mp->m_dalign)" branch in xfs_update_alignment(), XFS_MOUNT_RETERR is set and so we always be emitting a warning and returning an error. Hence, we can remove it and get rid of a couple of redundant check up against it at xfs_upate_alignment(). Thanks Dave Chinner for the suggestions of simplify the code in xfs_parseargs(). Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-19xfs: Remove two dead transaction log reservaion macrosJie Liu1-8/+3
Upstream commit 5b292ae3a951a58e32119d73c7ac8f5bec7395a3 xfs: make use of xfs_calc_buf_res() in xfs_trans.c Beginning from above commit, neither XFS_ALLOCFREE_LOG_RES() nor XFS_DIROP_LOG_RES() is used by those routines for calculating transaction space reservations, so it's safe to remove them now. Also, with a slightly update for the relevant comments to reflect the ideas of why those log count numbers should be. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-19xfs: return FIEMAP_EXTENT_UNKNOWN for delayed allocation extentJie Liu1-1/+2
For FIEMAP ioctl(2), if an extent is in delayed allocation state, we need to return the FIEMAP_EXTENT_UNKNOWN flag except the FIEMAP_EXTENT_DELALLOC because its data location is unknown. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-19xfs: fix the symbolic link assert in xfs_ifreeMark Tinguely4-15/+51
Adding an extended attribute to a symbolic link can force that link to an remote extent. xfs_inactive() incorrectly assumes that any symbolic link small enough to be in the inode core is incore, resulting in the remote extent to not be removed. xfs_ifree() will assert on presence of this leaked remote extent. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>