summaryrefslogtreecommitdiff
path: root/fs/ceph/file.c
AgeCommit message (Collapse)AuthorFilesLines
2020-03-30ceph: re-org copy_file_range and fix some error pathsLuis Henriques1-73/+100
This patch re-organizes copy_file_range, trying to fix a few issues in the error handling. Here's the summary: - Abort copy if initial do_splice_direct() returns fewer bytes than requested. - Move the 'size' initialization (with i_size_read()) further down in the code, after the initial call to do_splice_direct(). This avoids issues with a possibly stale value if a manual copy is done. - Move the object copy loop into a separate function. This makes it easier to handle errors (e.g, dirtying caps and updating the MDS metadata if only some objects have been copied before an error has occurred). - Added calls to ceph_oloc_destroy() to avoid leaking memory with src_oloc and dst_oloc - After the object copy loop, the new file size to be reported to the MDS (if there's file size change) is now the actual file size, and not the size after an eventual extra manual copy. - Added a few dout() to show the number of bytes copied in the two manual copies and in the object copy loop. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-23ceph: check POOL_FLAG_FULL/NEARFULL in addition to OSDMAP_FULL/NEARFULLIlya Dryomov1-3/+11
CEPH_OSDMAP_FULL/NEARFULL aren't set since mimic, so we need to consult per-pool flags as well. Unfortunately the backwards compatibility here is lacking: - the change that deprecated OSDMAP_FULL/NEARFULL went into mimic, but was guarded by require_osd_release >= RELEASE_LUMINOUS - it was subsequently backported to luminous in v12.2.2, but that makes no difference to clients that only check OSDMAP_FULL/NEARFULL because require_osd_release is not client-facing -- it is for OSDs Since all kernels are affected, the best we can do here is just start checking both map flags and pool flags and send that to stable. These checks are best effort, so take osdc->lock and look up pool flags just once. Remove the FIXME, since filesystem quotas are checked above and RADOS quotas are reflected in POOL_FLAG_FULL: when the pool reaches its quota, both POOL_FLAG_FULL and POOL_FLAG_FULL_QUOTA are set. Cc: stable@vger.kernel.org Reported-by: Yanhu Cao <gmayyyha@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Sage Weil <sage@redhat.com>
2020-02-11ceph: do not execute direct write in parallel if O_APPEND is specifiedXiubo Li1-6/+11
In O_APPEND & O_DIRECT mode, the data from different writers will be possibly overlapping each other since they take the shared lock. For example, both Writer1 and Writer2 are in O_APPEND and O_DIRECT mode: Writer1 Writer2 shared_lock() shared_lock() getattr(CAP_SIZE) getattr(CAP_SIZE) iocb->ki_pos = EOF iocb->ki_pos = EOF write(data1) write(data2) shared_unlock() shared_unlock() The data2 will overlap the data1 from the same file offset, the old EOF. Switch to exclusive lock instead when O_APPEND is specified. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27ceph: use copy-from2 op in copy_file_rangeLuis Henriques1-1/+10
Instead of using the copy-from operation, switch copy_file_range to the new copy-from2 operation, which allows to send the truncate_seq and truncate_size parameters. If an OSD does not support the copy-from2 operation it will return -EOPNOTSUPP. In that case, the kernel client will stop trying to do remote object copies for this fs client and will always use the generic VFS copy_file_range. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-12-02Merge tag 'compat-ioctl-5.5' of ↵Linus Torvalds1-1/+1
git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann: "As part of the cleanup of some remaining y2038 issues, I came to fs/compat_ioctl.c, which still has a couple of commands that need support for time64_t. In completely unrelated work, I spent time on cleaning up parts of this file in the past, moving things out into drivers instead. After Al Viro reviewed an earlier version of this series and did a lot more of that cleanup, I decided to try to completely eliminate the rest of it and move it all into drivers. This series incorporates some of Al's work and many patches of my own, but in the end stops short of actually removing the last part, which is the scsi ioctl handlers. I have patches for those as well, but they need more testing or possibly a rewrite" * tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits) scsi: sd: enable compat ioctls for sed-opal pktcdvd: add compat_ioctl handler compat_ioctl: move SG_GET_REQUEST_TABLE handling compat_ioctl: ppp: move simple commands into ppp_generic.c compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic compat_ioctl: unify copy-in of ppp filters tty: handle compat PPP ioctls compat_ioctl: move SIOCOUTQ out of compat_ioctl.c compat_ioctl: handle SIOCOUTQNSD af_unix: add compat_ioctl support compat_ioctl: reimplement SG_IO handling compat_ioctl: move WDIOC handling into wdt drivers fs: compat_ioctl: move FITRIM emulation into file systems gfs2: add compat_ioctl support compat_ioctl: remove unused convert_in_user macro compat_ioctl: remove last RAID handling code compat_ioctl: remove /dev/raw ioctl translation compat_ioctl: remove PCI ioctl translation compat_ioctl: remove joystick ioctl translation ...
2019-11-14ceph: increment/decrement dio counter on async requestsJeff Layton1-0/+4
Ceph can in some cases issue an async DIO request, in which case we can end up calling ceph_end_io_direct before the I/O is actually complete. That may allow buffered operations to proceed while DIO requests are still in flight. Fix this by incrementing the i_dio_count when issuing an async DIO request, and decrement it when tearing down the aio_req. Fixes: 321fe13c9398 ("ceph: add buffered/direct exclusionary locking for reads and writes") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-14ceph: take the inode lock before acquiring cap refsJeff Layton1-7/+18
Most of the time, we (or the vfs layer) takes the inode_lock and then acquires caps, but ceph_read_iter does the opposite, and that can lead to a deadlock. When there are multiple clients treading over the same data, we can end up in a situation where a reader takes caps and then tries to acquire the inode_lock. Another task holds the inode_lock and issues a request to the MDS which needs to revoke the caps, but that can't happen until the inode_lock is unwedged. Fix this by having ceph_read_iter take the inode_lock earlier, before attempting to acquire caps. Fixes: 321fe13c9398 ("ceph: add buffered/direct exclusionary locking for reads and writes") Link: https://tracker.ceph.com/issues/36348 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-05ceph: don't allow copy_file_range when stripe_count != 1Luis Henriques1-2/+10
copy_file_range tries to use the OSD 'copy-from' operation, which simply performs a full object copy. Unfortunately, the implementation of this system call assumes that stripe_count is always set to 1 and doesn't take into account that the data may be striped across an object set. If the file layout has stripe_count different from 1, then the destination file data will be corrupted. For example: Consider a 8 MiB file with 4 MiB object size, stripe_count of 2 and stripe_size of 2 MiB; the first half of the file will be filled with 'A's and the second half will be filled with 'B's: 0 4M 8M Obj1 Obj2 +------+------+ +----+ +----+ file: | AAAA | BBBB | | AA | | AA | +------+------+ |----| |----| | BB | | BB | +----+ +----+ If we copy_file_range this file into a new file (which needs to have the same file layout!), then it will start by copying the object starting at file offset 0 (Obj1). And then it will copy the object starting at file offset 4M -- which is Obj1 again. Unfortunately, the solution for this is to not allow remote object copies to be performed when the file layout stripe_count is not 1 and simply fallback to the default (VFS) copy_file_range implementation. Cc: stable@vger.kernel.org Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-05ceph: don't try to handle hashed dentries in non-O_CREAT atomic_openJeff Layton1-0/+3
If ceph_atomic_open is handed a !d_in_lookup dentry, then that means that it already passed d_revalidate so we *know* that it's negative (or at least was very recently). Just return -ENOENT in that case. This also addresses a subtle bug in dentry handling. Non-O_CREAT opens call atomic_open with the parent's i_rwsem shared, but calling d_splice_alias on a hashed dentry requires the exclusive lock. If ceph_atomic_open receives a hashed, negative dentry on a non-O_CREAT open, and another client were to race in and create the file before we issue our OPEN, ceph_fill_trace could end up calling d_splice_alias on the dentry with the new inode with insufficient locks. Cc: stable@vger.kernel.org Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-10-23ceph: fix compat_ioctl for ceph_dir_operationsArnd Bergmann1-1/+1
The ceph_ioctl function is used both for files and directories, but only the files support doing that in 32-bit compat mode. On the s390 architecture, there is also a problem with invalid 31-bit pointers that need to be passed through compat_ptr(). Use the new compat_ptr_ioctl() to address both issues. Note: When backporting this patch to stable kernels, "compat_ioctl: add compat_ptr_ioctl()" is needed as well. Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-09-16ceph: allow object copies across different filesystems in the same clusterLuis Henriques1-4/+13
OSDs are able to perform object copies across different pools. Thus, there's no need to prevent copy_file_range from doing remote copies if the source and destination superblocks are different. Only return -EXDEV if they have different fsid (the cluster ID). Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-09-16ceph: add buffered/direct exclusionary locking for reads and writesJeff Layton1-13/+22
xfstest generic/451 intermittently fails. The test does O_DIRECT writes to a file, and then reads back the result using buffered I/O, while running a separate set of tasks that are also doing buffered reads. The client will invalidate the cache prior to a direct write, but it's easy for one of the other readers' replies to race in and reinstantiate the invalidated range with stale data. To fix this, we must to serialize direct I/O writes and buffered reads. We could just sprinkle in some shared locks on the i_rwsem for reads, and increase the exclusive footprint on the write side, but that would cause O_DIRECT writes to end up serialized vs. other direct requests. Instead, borrow the scheme used by nfs.ko. Buffered writes take the i_rwsem exclusively, but buffered reads take a shared lock, allowing them to run in parallel. O_DIRECT requests also take a shared lock, but we need for them to not run in parallel with buffered reads. A flag on the ceph_inode_info is used to indicate whether it's in direct or buffered I/O mode. When a conflicting request is submitted, it will block until the inode can be flipped to the necessary mode. Link: https://tracker.ceph.com/issues/40985 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-09-16ceph: auto reconnect after blacklistedYan, Zheng1-1/+7
Make client use osd reply and session message to infer if itself is blacklisted. Client reconnect to cluster using new entity addr if it is blacklisted. Auto reconnect is limited to once every 30 minutes. Auto reconnect is disabled by default. It can be enabled/disabled by recover_session=<no|clean> mount option. In 'clean' mode, client drops any dirty data/metadata, invalidates page caches and invalidates all writable file handles. After reconnect, file locks become stale because MDS loses track of them. If an inode contains any stale file locks, read/write on the indoe are not allowed until applications release all stale file locks. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-09-16ceph: invalidate all write mode filp after reconnectYan, Zheng1-0/+1
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-09-16ceph: pass filp to ceph_get_caps()Yan, Zheng1-18/+17
Also change several other functions' arguments, no logical changes. This is preparetion for later patch that checks filp error. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-09-16ceph: track and report error of async metadata operationYan, Zheng1-2/+4
Use errseq_t to track and report errors of async metadata operations, similar to how kernel handles errors during writeback. If any dirty caps or any unsafe request gets dropped during session eviction, record -EIO in corresponding inode's i_meta_err. The error will be reported by subsequent fsync, Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-09-16ceph: allow copy_file_range when src and dst inode are sameJeff Layton1-2/+0
There is no reason to prevent this. The OSD should be able to handle this as long as the objects are different, and the existing code falls back when the offset into the object is different. Signed-off-by: Jeff Layton <jlayton@kernel.org> Acked-by: Luis Henriques <lhenriques@suse.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-18Merge tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds1-13/+21
Pull ceph updates from Ilya Dryomov: "Lots of exciting things this time! - support for rbd object-map and fast-diff features (myself). This will speed up reads, discards and things like snap diffs on sparse images. - ceph.snap.btime vxattr to expose snapshot creation time (David Disseldorp). This will be used to integrate with "Restore Previous Versions" feature added in Windows 7 for folks who reexport ceph through SMB. - security xattrs for ceph (Zheng Yan). Only selinux is supported for now due to the limitations of ->dentry_init_security(). - support for MSG_ADDR2, FS_BTIME and FS_CHANGE_ATTR features (Jeff Layton). This is actually a single feature bit which was missing because of the filesystem pieces. With this in, the kernel client will finally be reported as "luminous" by "ceph features" -- it is still being reported as "jewel" even though all required Luminous features were implemented in 4.13. - stop NULL-terminating ceph vxattrs (Jeff Layton). The convention with xattrs is to not terminate and this was causing inconsistencies with ceph-fuse. - change filesystem time granularity from 1 us to 1 ns, again fixing an inconsistency with ceph-fuse (Luis Henriques). On top of this there are some additional dentry name handling and cap flushing fixes from Zheng. Finally, Jeff is formally taking over for Zheng as the filesystem maintainer" * tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client: (71 commits) ceph: fix end offset in truncate_inode_pages_range call ceph: use generic_delete_inode() for ->drop_inode ceph: use ceph_evict_inode to cleanup inode's resource ceph: initialize superblock s_time_gran to 1 MAINTAINERS: take over for Zheng as CephFS kernel client maintainer rbd: setallochint only if object doesn't exist rbd: support for object-map and fast-diff rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe() libceph: export osd_req_op_data() macro libceph: change ceph_osdc_call() to take page vector for response libceph: bump CEPH_MSG_MAX_DATA_LEN (again) rbd: new exclusive lock wait/wake code rbd: quiescing lock should wait for image requests rbd: lock should be quiesced on reacquire rbd: introduce copyup state machine rbd: rename rbd_obj_setup_*() to rbd_obj_init_*() rbd: move OSD request allocation into object request state machines rbd: factor out __rbd_osd_setup_discard_ops() rbd: factor out rbd_osd_setup_copyup() rbd: introduce obj_req->osd_reqs list ...
2019-07-08ceph: fix end offset in truncate_inode_pages_range callLuis Henriques1-1/+1
Commit e450f4d1a5d6 ("ceph: pass inclusive lend parameter to filemap_write_and_wait_range()") fixed the end offset parameter used to call filemap_write_and_wait_range and invalidate_inode_pages2_range. Unfortunately it missed truncate_inode_pages_range, introducing a regression that is easily detected by xfstest generic/130. The problem is that when doing direct IO it is possible that an extra page is truncated from the page cache when the end offset is page aligned. This can cause data loss if that page hasn't been sync'ed to the OSDs. While there, change code to use PAGE_ALIGN macro instead. Cc: stable@vger.kernel.org Fixes: e450f4d1a5d6 ("ceph: pass inclusive lend parameter to filemap_write_and_wait_range()") Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08libceph: rename r_unsafe_item to r_private_itemIlya Dryomov1-3/+3
This list item remained from when we had safe and unsafe replies (commit vs ack). It has since become a private list item for use by clients. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: increment change_attribute on local changesJeff Layton1-0/+5
We don't set SB_I_VERSION on ceph since we need to manage it ourselves, so we must increment it whenever we update the file times. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: add selinux supportYan, Zheng1-0/+3
When creating new file/directory, use security_dentry_init_security() to prepare selinux context for the new inode, then send openc/mkdir request to MDS, together with selinux xattr. security_dentry_init_security() only supports single security module and only selinux has dentry_init_security hook. So only selinux is supported for now. We can add support for other security modules once kernel has a generic version of dentry_init_security() Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: rename struct ceph_acls_info to ceph_acl_sec_ctxYan, Zheng1-9/+9
Also rename ceph_release_acls_info() to ceph_release_acl_sec_ctx(). And move their definitions to different files. This is preparation for security label support. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-06-09vfs: allow copy_file_range to copy across devicesAmir Goldstein1-1/+3
We want to enable cross-filesystem copy_file_range functionality where possible, so push the "same superblock only" checks down to the individual filesystem callouts so they can make their own decisions about cross-superblock copy offload and fallack to generic_copy_file_range() for cross-superblock copy. [Amir] We do not call ->remap_file_range() in case the files are not on the same sb and do not call ->copy_file_range() in case the files do not belong to the same filesystem driver. This changes behavior of the copy_file_range(2) syscall, which will now allow cross filesystem in-kernel copy. CIFS already supports cross-superblock copy, between two shares to the same server. This functionality will now be available via the copy_file_range(2) syscall. Cc: Steve French <stfrench@microsoft.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-09vfs: no fallback for ->copy_file_rangeDave Chinner1-3/+18
Now that we have generic_copy_file_range(), remove it as a fallback case when offloads fail. This puts the responsibility for executing fallbacks on the filesystems that implement ->copy_file_range and allows us to add operational validity checks to generic_copy_file_range(). Rework vfs_copy_file_range() to call a new do_copy_file_range() helper to execute the copying callout, and move calls to generic_file_copy_range() into filesystem methods where they currently return failures. [Amir] overlayfs is not responsible of executing the fallback. It is the responsibility of the underlying filesystem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-05ceph: single workqueue for inode related worksYan, Zheng1-1/+1
We have three workqueue for inode works. Later patch will introduce one more work for inode. It's not good to introcuce more workqueue and add more 'struct work_struct' to 'struct ceph_inode_info'. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-05-17Merge tag 'ceph-for-5.2-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds1-1/+1
Pull ceph updates from Ilya Dryomov: "On the filesystem side we have: - a fix to enforce quotas set above the mount point (Luis Henriques) - support for exporting snapshots through NFS (Zheng Yan) - proper statx implementation (Jeff Layton). statx flags are mapped to MDS caps, with AT_STATX_{DONT,FORCE}_SYNC taken into account. - some follow-up dentry name handling fixes, in particular elimination of our hand-rolled helper and the switch to __getname() as suggested by Al (Jeff Layton) - a set of MDS client cleanups in preparation for async MDS requests in the future (Jeff Layton) - a fix to sync the filesystem before remounting (Jeff Layton) On the rbd side, work is on-going on object-map and fast-diff image features" * tag 'ceph-for-5.2-rc1' of git://github.com/ceph/ceph-client: (29 commits) ceph: flush dirty inodes before proceeding with remount ceph: fix unaligned access in ceph_send_cap_releases libceph: make ceph_pr_addr take an struct ceph_entity_addr pointer libceph: fix unaligned accesses in ceph_entity_addr handling rbd: don't assert on writes to snapshots rbd: client_mutex is never nested ceph: print inode number in __caps_issued_mask debugging messages ceph: just call get_session in __ceph_lookup_mds_session ceph: simplify arguments and return semantics of try_get_cap_refs ceph: fix comment over ceph_drop_caps_for_unlink ceph: move wait for mds request into helper function ceph: have ceph_mdsc_do_request call ceph_mdsc_submit_request ceph: after an MDS request, do callback and completions ceph: use pathlen values returned by set_request_path_attr ceph: use __getname/__putname in ceph_mdsc_build_path ceph: use ceph_mdsc_build_path instead of clone_dentry_name ceph: fix potential use-after-free in ceph_mdsc_build_path ceph: dump granular cap info in "caps" debugfs file ceph: make iterate_session_caps a public symbol ceph: fix NULL pointer deref when debugging is enabled ...
2019-05-07ceph: fix NULL pointer deref when debugging is enabledJeff Layton1-1/+1
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-04-09fs: mark expected switch fall-throughsGustavo A. R. Silva1-0/+1
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warnings: fs/affs/affs.h:124:38: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/configfs/dir.c:1692:11: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/configfs/dir.c:1694:7: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/ceph/file.c:249:3: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/ext4/hash.c:233:15: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/ext4/hash.c:246:15: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/ext2/inode.c:1237:7: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/ext2/inode.c:1244:7: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/ext4/indirect.c:1182:6: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/ext4/indirect.c:1188:6: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/ext4/indirect.c:1432:6: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/ext4/indirect.c:1440:6: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/f2fs/node.c:618:8: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/f2fs/node.c:620:8: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/btrfs/ref-verify.c:522:15: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/gfs2/bmap.c:711:7: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/gfs2/bmap.c:722:7: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/jffs2/fs.c:339:6: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/nfsd/nfs4proc.c:429:12: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/ufs/util.h:62:6: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/ufs/util.h:43:6: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/fcntl.c:770:7: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/seq_file.c:319:10: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/libfs.c:148:11: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/libfs.c:150:7: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/signalfd.c:178:7: warning: this statement may fall through [-Wimplicit-fallthrough=] fs/locks.c:1473:16: warning: this statement may fall through [-Wimplicit-fallthrough=] Warning level 3 was used: -Wimplicit-fallthrough=3 This patch is part of the ongoing efforts to enabling -Wimplicit-fallthrough. Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-03-05ceph: pass inclusive lend parameter to filemap_write_and_wait_range()zhengbin1-5/+8
The 'lend' parameter of filemap_write_and_wait_range is required to be inclusive, so follow the rule. Same for invalidate_inode_pages2_range. Signed-off-by: zhengbin <zhengbin13@huawei.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-11-08ceph: add destination file data sync before doing any remote copyLuis Henriques1-2/+9
If we try to copy into a file that was just written, any data that is remote copied will be overwritten by our buffered writes once they are flushed.  When this happens, the call to invalidate_inode_pages2_range will also return a -EBUSY error. This patch fixes this by also sync'ing the destination file before starting any copy. Fixes: 503f82a9932d ("ceph: support copy_file_range file operation") Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-11-02Merge branch 'work.afs' of ↵Linus Torvalds1-5/+4
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull AFS updates from Al Viro: "AFS series, with some iov_iter bits included" * 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) missing bits of "iov_iter: Separate type from direction and use accessor functions" afs: Probe multiple fileservers simultaneously afs: Fix callback handling afs: Eliminate the address pointer from the address list cursor afs: Allow dumping of server cursor on operation failure afs: Implement YFS support in the fs client afs: Expand data structure fields to support YFS afs: Get the target vnode in afs_rmdir() and get a callback on it afs: Calc callback expiry in op reply delivery afs: Fix FS.FetchStatus delivery from updating wrong vnode afs: Implement the YFS cache manager service afs: Remove callback details from afs_callback_break struct afs: Commit the status on a new file/dir/symlink afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS afs: Don't invoke the server to read data beyond EOF afs: Add a couple of tracepoints to log I/O errors afs: Handle EIO from delivery function afs: Fix TTL on VL server and address lists afs: Implement VL server rotation afs: Improve FS server rotation error handling ...
2018-10-24iov_iter: Separate type from direction and use accessor functionsDavid Howells1-3/+2
In the iov_iter struct, separate the iterator type from the iterator direction and use accessor functions to access them in most places. Convert a bunch of places to use switch-statements to access them rather then chains of bitwise-AND statements. This makes it easier to add further iterator types. Also, this can be more efficient as to implement a switch of small contiguous integers, the compiler can use ~50% fewer compare instructions than it has to use bitwise-and instructions. Further, cease passing the iterator type into the iterator setup function. The iterator function can set that itself. Only the direction is required. Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24iov_iter: Use accessor functionDavid Howells1-1/+1
Use accessor functions to access an iterator's type and direction. This allows for the possibility of using some other method of determining the type of iterator than if-chains with bitwise-AND conditions. Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-22ceph: new mount option to disable usage of copy-from opLuis Henriques1-0/+3
Add a new mount option 'nocopyfrom' that will prevent the usage of the RADOS 'copy-from' operation in cephfs. This could be useful, for example, for an administrator to temporarily mitigate any possible bugs in the 'copy-from' implementation. Currently, only copy_file_range uses this RADOS operation. Setting this mount option will result in this syscall reverting to the default VFS implementation, i.e. to perform the copies locally instead of doing remote object copies. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22ceph: support copy_file_range file operationLuis Henriques1-1/+293
This commit implements support for the copy_file_range syscall in cephfs. It is implemented using the RADOS 'copy-from' operation, which allows to do a remote object copy, without the need to download/upload data from/to the OSDs. Some manual copy may however be required if the source/destination file offsets aren't object aligned or if the copy length is smaller than the object size. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22libceph, rbd, ceph: move ceph_osdc_alloc_messages() callsIlya Dryomov1-5/+5
The current requirement is that ceph_osdc_alloc_messages() should be called after oid and oloc are known. In preparation for preallocating message data items, move ceph_osdc_alloc_messages() further down, so that it is called when OSD op codes are known. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22ceph: num_ops is off by one in ceph_aio_retry_work()Ilya Dryomov1-1/+1
Two OSD op slots are allocated, but only one is ever used. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22ceph: only allow punch hole mode in fallocateLuis Henriques1-36/+9
Current implementation of cephfs fallocate isn't correct as it doesn't really reserve the space in the cluster, which means that a subsequent call to a write may actually fail due to lack of space. In fact, it is currently possible to fallocate an amount space that is larger than the free space in the cluster. It has behaved this way since the initial commit ad7a60de882a ("ceph: punch hole support"). Since there's no easy solution to fix this at the moment, this patch simply removes support for all fallocate operations but FALLOC_FL_PUNCH_HOLE (which implies FALLOC_FL_KEEP_SIZE). Link: https://tracker.ceph.com/issues/36317 Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22ceph: refactor ceph_sync_read()Yan, Zheng1-113/+106
Avoid allocating memory for the entire user request: striped_read() does a synchronous OSD request per object, so it doesn't need more than object size worth of pages at a time. [ Preserve the comment, changelog. ] Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-21Merge tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds1-12/+22
Pull ceph updates from Ilya Dryomov: "The main things are support for cephx v2 authentication protocol and basic support for rbd images within namespaces (myself). Also included are y2038 conversion patches from Arnd, a pile of miscellaneous fixes from Chengguang and Zheng's feature bit infrastructure for the filesystem" * tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client: (40 commits) ceph: don't drop message if it contains more data than expected ceph: support cephfs' own feature bits crush: fix using plain integer as NULL warning libceph: remove unnecessary non NULL check for request_key ceph: refactor error handling code in ceph_reserve_caps() ceph: refactor ceph_unreserve_caps() ceph: change to void return type for __do_request() ceph: compare fsc->max_file_size and inode->i_size for max file size limit ceph: add additional size check in ceph_setattr() ceph: add additional offset check in ceph_write_iter() ceph: add additional range check in ceph_fallocate() ceph: add new field max_file_size in ceph_fs_client libceph: weaken sizeof check in ceph_x_verify_authorizer_reply() libceph: check authorizer reply/challenge length before reading libceph: implement CEPHX_V2 calculation mode libceph: add authorizer challenge libceph: factor out encrypt_authorizer() libceph: factor out __ceph_x_decrypt() libceph: factor out __prepare_write_connect() libceph: store ceph_auth_handshake pointer in ceph_connection ...
2018-08-13ceph: compare fsc->max_file_size and inode->i_size for max file size limitChengguang Xu1-1/+2
In ceph_llseek(), we compare fsc->max_file_size and inode->i_size to choose max file size limit. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02ceph: add additional offset check in ceph_write_iter()Chengguang Xu1-4/+11
If the offset is larger or equal to both real file size and max file size, then return -EFBIG. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02ceph: add additional range check in ceph_fallocate()Chengguang Xu1-3/+5
If the range is larger than both real file size and limit of max file size, then return -EFBIG. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02libceph: use timespec64 for r_mtimeArnd Bergmann1-4/+4
The request mtime field is used all over ceph, and is currently represented as a 'timespec' structure in Linux. This changes it to timespec64 to allow times beyond 2038, modifying all users at the same time. [ Remove now redundant ts variable in writepage_nounlock(). ] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-07-12get rid of 'opened' argument of ->atomic_open() - part 3Al Viro1-2/+1
now it can be done... Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12getting rid of 'opened' argument of ->atomic_open() - part 1Al Viro1-1/+1
'opened' argument of finish_open() is unused. Kill it. Signed-off-by Al Viro <viro@zeniv.linux.org.uk>
2018-07-12introduce FMODE_CREATED and switch to itAl Viro1-1/+1
Parallel to FILE_CREATED, goes into ->f_mode instead of *opened. NFS is a bit of a wart here - it doesn't have file at the point where FILE_CREATED used to be set, so we need to propagate it there (for now). IMA is another one (here and everywhere)... Note that this needs do_dentry_open() to leave old bits in ->f_mode alone - we want it to preserve FMODE_CREATED if it had been already set (no other bit can be there). Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-15Merge tag 'vfs-timespec64' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground Pull inode timestamps conversion to timespec64 from Arnd Bergmann: "This is a late set of changes from Deepa Dinamani doing an automated treewide conversion of the inode and iattr structures from 'timespec' to 'timespec64', to push the conversion from the VFS layer into the individual file systems. As Deepa writes: 'The series aims to switch vfs timestamps to use struct timespec64. Currently vfs uses struct timespec, which is not y2038 safe. The series involves the following: 1. Add vfs helper functions for supporting struct timepec64 timestamps. 2. Cast prints of vfs timestamps to avoid warnings after the switch. 3. Simplify code using vfs timestamps so that the actual replacement becomes easy. 4. Convert vfs timestamps to use struct timespec64 using a script. This is a flag day patch. Next steps: 1. Convert APIs that can handle timespec64, instead of converting timestamps at the boundaries. 2. Update internal data structures to avoid timestamp conversions' Thomas Gleixner adds: 'I think there is no point to drag that out for the next merge window. The whole thing needs to be done in one go for the core changes which means that you're going to play that catchup game forever. Let's get over with it towards the end of the merge window'" * tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground: pstore: Remove bogus format string definition vfs: change inode times to use struct timespec64 pstore: Convert internal records to timespec64 udf: Simplify calls to udf_disk_stamp_to_time fs: nfs: get rid of memcpys for inode times ceph: make inode time prints to be long long lustre: Use long long type to print inode time fs: add timespec64_truncate()
2018-06-06vfs: change inode times to use struct timespec64Deepa Dinamani1-3/+3
struct timespec is not y2038 safe. Transition vfs to use y2038 safe struct timespec64 instead. The change was made with the help of the following cocinelle script. This catches about 80% of the changes. All the header file and logic changes are included in the first 5 rules. The rest are trivial substitutions. I avoid changing any of the function signatures or any other filesystem specific data structures to keep the patch simple for review. The script can be a little shorter by combining different cases. But, this version was sufficient for my usecase. virtual patch @ depends on patch @ identifier now; @@ - struct timespec + struct timespec64 current_time ( ... ) { - struct timespec now = current_kernel_time(); + struct timespec64 now = current_kernel_time64(); ... - return timespec_trunc( + return timespec64_trunc( ... ); } @ depends on patch @ identifier xtime; @@ struct \( iattr \| inode \| kstat \) { ... - struct timespec xtime; + struct timespec64 xtime; ... } @ depends on patch @ identifier t; @@ struct inode_operations { ... int (*update_time) (..., - struct timespec t, + struct timespec64 t, ...); ... } @ depends on patch @ identifier t; identifier fn_update_time =~ "update_time$"; @@ fn_update_time (..., - struct timespec *t, + struct timespec64 *t, ...) { ... } @ depends on patch @ identifier t; @@ lease_get_mtime( ... , - struct timespec *t + struct timespec64 *t ) { ... } @te depends on patch forall@ identifier ts; local idexpression struct inode *inode_node; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; identifier fn_update_time =~ "update_time$"; identifier fn; expression e, E3; local idexpression struct inode *node1; local idexpression struct inode *node2; local idexpression struct iattr *attr1; local idexpression struct iattr *attr2; local idexpression struct iattr attr; identifier i_xtime1 =~ "^i_[acm]time$"; identifier i_xtime2 =~ "^i_[acm]time$"; identifier ia_xtime1 =~ "^ia_[acm]time$"; identifier ia_xtime2 =~ "^ia_[acm]time$"; @@ ( ( - struct timespec ts; + struct timespec64 ts; | - struct timespec ts = current_time(inode_node); + struct timespec64 ts = current_time(inode_node); ) <+... when != ts ( - timespec_equal(&inode_node->i_xtime, &ts) + timespec64_equal(&inode_node->i_xtime, &ts) | - timespec_equal(&ts, &inode_node->i_xtime) + timespec64_equal(&ts, &inode_node->i_xtime) | - timespec_compare(&inode_node->i_xtime, &ts) + timespec64_compare(&inode_node->i_xtime, &ts) | - timespec_compare(&ts, &inode_node->i_xtime) + timespec64_compare(&ts, &inode_node->i_xtime) | ts = current_time(e) | fn_update_time(..., &ts,...) | inode_node->i_xtime = ts | node1->i_xtime = ts | ts = inode_node->i_xtime | <+... attr1->ia_xtime ...+> = ts | ts = attr1->ia_xtime | ts.tv_sec | ts.tv_nsec | btrfs_set_stack_timespec_sec(..., ts.tv_sec) | btrfs_set_stack_timespec_nsec(..., ts.tv_nsec) | - ts = timespec64_to_timespec( + ts = ... -) | - ts = ktime_to_timespec( + ts = ktime_to_timespec64( ...) | - ts = E3 + ts = timespec_to_timespec64(E3) | - ktime_get_real_ts(&ts) + ktime_get_real_ts64(&ts) | fn(..., - ts + timespec64_to_timespec(ts) ,...) ) ...+> ( <... when != ts - return ts; + return timespec64_to_timespec(ts); ...> ) | - timespec_equal(&node1->i_xtime1, &node2->i_xtime2) + timespec64_equal(&node1->i_xtime2, &node2->i_xtime2) | - timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2) + timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2) | - timespec_compare(&node1->i_xtime1, &node2->i_xtime2) + timespec64_compare(&node1->i_xtime1, &node2->i_xtime2) | node1->i_xtime1 = - timespec_trunc(attr1->ia_xtime1, + timespec64_trunc(attr1->ia_xtime1, ...) | - attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2, + attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2, ...) | - ktime_get_real_ts(&attr1->ia_xtime1) + ktime_get_real_ts64(&attr1->ia_xtime1) | - ktime_get_real_ts(&attr.ia_xtime1) + ktime_get_real_ts64(&attr.ia_xtime1) ) @ depends on patch @ struct inode *node; struct iattr *attr; identifier fn; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; expression e; @@ ( - fn(node->i_xtime); + fn(timespec64_to_timespec(node->i_xtime)); | fn(..., - node->i_xtime); + timespec64_to_timespec(node->i_xtime)); | - e = fn(attr->ia_xtime); + e = fn(timespec64_to_timespec(attr->ia_xtime)); ) @ depends on patch forall @ struct inode *node; struct iattr *attr; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; identifier fn; @@ { + struct timespec ts; <+... ( + ts = timespec64_to_timespec(node->i_xtime); fn (..., - &node->i_xtime, + &ts, ...); | + ts = timespec64_to_timespec(attr->ia_xtime); fn (..., - &attr->ia_xtime, + &ts, ...); ) ...+> } @ depends on patch forall @ struct inode *node; struct iattr *attr; struct kstat *stat; identifier ia_xtime =~ "^ia_[acm]time$"; identifier i_xtime =~ "^i_[acm]time$"; identifier xtime =~ "^[acm]time$"; identifier fn, ret; @@ { + struct timespec ts; <+... ( + ts = timespec64_to_timespec(node->i_xtime); ret = fn (..., - &node->i_xtime, + &ts, ...); | + ts = timespec64_to_timespec(node->i_xtime); ret = fn (..., - &node->i_xtime); + &ts); | + ts = timespec64_to_timespec(attr->ia_xtime); ret = fn (..., - &attr->ia_xtime, + &ts, ...); | + ts = timespec64_to_timespec(attr->ia_xtime); ret = fn (..., - &attr->ia_xtime); + &ts); | + ts = timespec64_to_timespec(stat->xtime); ret = fn (..., - &stat->xtime); + &ts); ) ...+> } @ depends on patch @ struct inode *node; struct inode *node2; identifier i_xtime1 =~ "^i_[acm]time$"; identifier i_xtime2 =~ "^i_[acm]time$"; identifier i_xtime3 =~ "^i_[acm]time$"; struct iattr *attrp; struct iattr *attrp2; struct iattr attr ; identifier ia_xtime1 =~ "^ia_[acm]time$"; identifier ia_xtime2 =~ "^ia_[acm]time$"; struct kstat *stat; struct kstat stat1; struct timespec64 ts; identifier xtime =~ "^[acmb]time$"; expression e; @@ ( ( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ; | node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \); | node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \); | node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \); | stat->xtime = node2->i_xtime1; | stat1.xtime = node2->i_xtime1; | ( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ; | ( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2; | - e = node->i_xtime1; + e = timespec64_to_timespec( node->i_xtime1 ); | - e = attrp->ia_xtime1; + e = timespec64_to_timespec( attrp->ia_xtime1 ); | node->i_xtime1 = current_time(...); | node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = - e; + timespec_to_timespec64(e); | node->i_xtime1 = node->i_xtime3 = - e; + timespec_to_timespec64(e); | - node->i_xtime1 = e; + node->i_xtime1 = timespec_to_timespec64(e); ) Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: <anton@tuxera.com> Cc: <balbi@kernel.org> Cc: <bfields@fieldses.org> Cc: <darrick.wong@oracle.com> Cc: <dhowells@redhat.com> Cc: <dsterba@suse.com> Cc: <dwmw2@infradead.org> Cc: <hch@lst.de> Cc: <hirofumi@mail.parknet.co.jp> Cc: <hubcap@omnibond.com> Cc: <jack@suse.com> Cc: <jaegeuk@kernel.org> Cc: <jaharkes@cs.cmu.edu> Cc: <jslaby@suse.com> Cc: <keescook@chromium.org> Cc: <mark@fasheh.com> Cc: <miklos@szeredi.hu> Cc: <nico@linaro.org> Cc: <reiserfs-devel@vger.kernel.org> Cc: <richard@nod.at> Cc: <sage@redhat.com> Cc: <sfrench@samba.org> Cc: <swhiteho@redhat.com> Cc: <tj@kernel.org> Cc: <trond.myklebust@primarydata.com> Cc: <tytso@mit.edu> Cc: <viro@zeniv.linux.org.uk>