summaryrefslogtreecommitdiff
path: root/drivers/block/rbd.c
AgeCommit message (Collapse)AuthorFilesLines
2019-09-16rbd: pull rbd_img_request_create() dout out into the callersIlya Dryomov1-2/+6
Make it more informative: log op_type, offset and length for block layer requests and initiating obj_req for child requests. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-09-16rbd: fix response length parameter for encoded stringsDongsheng Yang1-4/+6
rbd_dev_image_id() allocates space for length but passes a smaller value to rbd_obj_method_sync(). rbd_dev_v2_object_prefix() doesn't allocate space for length. Fix both to be consistent. Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-08-28rbd: restore zeroing past the overlap when reading from parentIlya Dryomov1-0/+11
The parent image is read only up to the overlap point, the rest of the buffer should be zeroed. This snuck in because as it turns out the overlap test case has not been triggering this code path for a while now. Fixes: a9b67e69949d ("rbd: replace obj_req->tried_parent with obj_req->read_state") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2019-07-18Merge tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds1-588/+1600
Pull ceph updates from Ilya Dryomov: "Lots of exciting things this time! - support for rbd object-map and fast-diff features (myself). This will speed up reads, discards and things like snap diffs on sparse images. - ceph.snap.btime vxattr to expose snapshot creation time (David Disseldorp). This will be used to integrate with "Restore Previous Versions" feature added in Windows 7 for folks who reexport ceph through SMB. - security xattrs for ceph (Zheng Yan). Only selinux is supported for now due to the limitations of ->dentry_init_security(). - support for MSG_ADDR2, FS_BTIME and FS_CHANGE_ATTR features (Jeff Layton). This is actually a single feature bit which was missing because of the filesystem pieces. With this in, the kernel client will finally be reported as "luminous" by "ceph features" -- it is still being reported as "jewel" even though all required Luminous features were implemented in 4.13. - stop NULL-terminating ceph vxattrs (Jeff Layton). The convention with xattrs is to not terminate and this was causing inconsistencies with ceph-fuse. - change filesystem time granularity from 1 us to 1 ns, again fixing an inconsistency with ceph-fuse (Luis Henriques). On top of this there are some additional dentry name handling and cap flushing fixes from Zheng. Finally, Jeff is formally taking over for Zheng as the filesystem maintainer" * tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client: (71 commits) ceph: fix end offset in truncate_inode_pages_range call ceph: use generic_delete_inode() for ->drop_inode ceph: use ceph_evict_inode to cleanup inode's resource ceph: initialize superblock s_time_gran to 1 MAINTAINERS: take over for Zheng as CephFS kernel client maintainer rbd: setallochint only if object doesn't exist rbd: support for object-map and fast-diff rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe() libceph: export osd_req_op_data() macro libceph: change ceph_osdc_call() to take page vector for response libceph: bump CEPH_MSG_MAX_DATA_LEN (again) rbd: new exclusive lock wait/wake code rbd: quiescing lock should wait for image requests rbd: lock should be quiesced on reacquire rbd: introduce copyup state machine rbd: rename rbd_obj_setup_*() to rbd_obj_init_*() rbd: move OSD request allocation into object request state machines rbd: factor out __rbd_osd_setup_discard_ops() rbd: factor out rbd_osd_setup_copyup() rbd: introduce obj_req->osd_reqs list ...
2019-07-08rbd: setallochint only if object doesn't existIlya Dryomov1-5/+14
setallochint is really only useful on object creation. Continue hinting unconditionally if object map cannot be used. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08rbd: support for object-map and fast-diffIlya Dryomov1-3/+717
Speed up reads, discards and zeroouts through RBD_OBJ_FLAG_MAY_EXIST and RBD_OBJ_FLAG_NOOP_FOR_NONEXISTENT based on object map. Invalid object maps are not trusted, but still updated. Note that we never iterate, resize or invalidate object maps. If object-map feature is enabled but object map fails to load, we just fail the requester (either "rbd map" or I/O, by way of post-acquire action). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe()Ilya Dryomov1-8/+6
Snapshot object map will be loaded in rbd_dev_image_probe(), so we need to know snapshot's size (as opposed to HEAD's size) sooner. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08libceph: change ceph_osdc_call() to take page vector for responseIlya Dryomov1-4/+4
This will be used for loading object map. rbd_obj_read_sync() isn't suitable because object map must be accessed through class methods. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2019-07-08rbd: new exclusive lock wait/wake codeIlya Dryomov1-143/+186
rbd_wait_state_locked() is built around rbd_dev->lock_waitq and blocks rbd worker threads while waiting for the lock, potentially impacting other rbd devices. There is no good way to pass an error code into image request state machines when acquisition fails, hence the use of RBD_DEV_FLAG_BLACKLISTED for everything and various other issues. Introduce rbd_dev->acquiring_list and move acquisition into image request state machine. Use rbd_img_schedule() for kicking and passing error codes. No blocking occurs while waiting for the lock, but rbd_dev->lock_rwsem is still held across lock, unlock and set_cookie calls. Always acquire the lock on "rbd map" to avoid associating the latency of acquiring the lock with the first I/O request. A slight regression is that lock_timeout is now respected only if lock acquisition is triggered by "rbd map" and not by I/O. This is somewhat compensated by the fact that we no longer block if the peer refuses to release lock -- I/O is failed with EROFS right away. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08rbd: quiescing lock should wait for image requestsIlya Dryomov1-14/+90
Syncing OSD requests doesn't really work. A single image request may be comprised of multiple object requests, each of which can go through a series of OSD requests (original, copyups, etc). On top of that, the OSD cliest may be shared with other rbd devices. What we want is to ensure that all in-flight image requests complete. Introduce rbd_dev->running_list and block in RBD_LOCK_STATE_RELEASING until that happens. New OSD requests may be started during this time. Note that __rbd_img_handle_request() acquires rbd_dev->lock_rwsem only if need_exclusive_lock() returns true. This avoids a deadlock similar to the one outlined in the previous commit between unlock and I/O that doesn't require lock, such as a read with object-map feature disabled. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08rbd: lock should be quiesced on reacquireIlya Dryomov1-14/+21
Quiesce exclusive lock at the top of rbd_reacquire_lock() instead of only when ceph_cls_set_cookie() fails. This avoids a deadlock on rbd_dev->lock_rwsem. If rbd_dev->lock_rwsem is needed for I/O completion, set_cookie can hang ceph-msgr worker thread if set_cookie reply ends up behind an I/O reply, because, like lock and unlock requests, set_cookie is sent and waited upon with rbd_dev->lock_rwsem held for write. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08rbd: introduce copyup state machineIlya Dryomov1-64/+123
Both write and copyup paths will get more complex with object map. Factor copyup code out into a separate state machine. While at it, take advantage of obj_req->osd_reqs list and issue empty and current snapc OSD requests together, one after another. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08rbd: rename rbd_obj_setup_*() to rbd_obj_init_*()Ilya Dryomov1-13/+13
These functions don't allocate and set up OSD requests anymore. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08rbd: move OSD request allocation into object request state machinesIlya Dryomov1-118/+96
Following submission, move initial OSD request allocation into object request state machines. Everything that has to do with OSD requests is now handled inside the state machine, all __rbd_img_fill_request() has left is initialization. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08rbd: factor out __rbd_osd_setup_discard_ops()Ilya Dryomov1-16/+27
With obj_req->xferred removed, obj_req->ex.oe_off and obj_req->ex.oe_len can be updated if required for alignment. Previously the new offset and length weren't stored anywhere beyond rbd_obj_setup_discard(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08rbd: factor out rbd_osd_setup_copyup()Ilya Dryomov1-12/+17
Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08rbd: introduce obj_req->osd_reqs listIlya Dryomov1-91/+100
Since the dawn of time it had been assumed that a single object request spawns a single OSD request. This is already impacting copyup: instead of sending empty and current snapc copyups together, we wait for empty snapc OSD request to complete in order to reassign obj_req->osd_req with current snapc OSD request. Looking further, updating potentially hundreds of snapshot object maps serially is a non-starter. Replace obj_req->osd_req pointer with obj_req->osd_reqs list. Use osd_req->r_private_item for linkage. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08rbd: introduce image request state machineIlya Dryomov1-57/+137
Make it possible to schedule image requests on a workqueue. This fixes parent chain recursion added in the previous commit and lays the ground for exclusive lock wait/wake improvements. The "wait for pending subrequests and report first nonzero result" code is generalized to be used by object request state machine. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08rbd: move OSD request submission into object request state machinesIlya Dryomov1-11/+49
Start eliminating asymmetry where the initial OSD request is allocated and submitted from outside the state machine, making error handling and restarts harder than they could be. This commit deals with submission, a commit that deals with allocation will follow. Note that this commit adds parent chain recursion on the submission side: rbd_img_request_submit rbd_obj_handle_request __rbd_obj_handle_request rbd_obj_handle_read rbd_obj_handle_write_guard rbd_obj_read_from_parent rbd_img_request_submit This will be fixed in the next commit. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08rbd: get rid of RBD_OBJ_WRITE_{FLAT,GUARD}Ilya Dryomov1-52/+60
In preparation for moving OSD request allocation and submission into object request state machines, get rid of RBD_OBJ_WRITE_{FLAT,GUARD}. We would need to start in a new state, whether the request is guarded or not. Unify them into RBD_OBJ_WRITE_OBJECT and pass guard info through obj_req->flags. While at it, make our ENOENT handling a little more precise: only hide ENOENT when it is actually expected, that is on delete. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08rbd: replace obj_req->tried_parent with obj_req->read_stateIlya Dryomov1-36/+46
Make rbd_obj_handle_read() look like a state machine and get rid of the necessity to patch result in rbd_obj_handle_request(), completing the removal of obj_req->xferred and img_req->xferred. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08rbd: get rid of obj_req->xferred, obj_req->result and img_req->xferredIlya Dryomov1-91/+58
obj_req->xferred and img_req->xferred don't bring any value. The former is used for short reads and has to be set to obj_req->ex.oe_len after that and elsewhere. The latter is just an aggregate. Use result for short reads (>=0 - number of bytes read, <0 - error) and pass it around explicitly. No need to store it in obj_req. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-05-07rbd: don't assert on writes to snapshotsIlya Dryomov1-2/+6
The check added in commit 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions") was lifted in commit a32e236eb93e ("Partially revert "block: fail op_is_write() requests to read-only partitions""). Basic things like user triggered writes and discards are still caught, but internal kernel users can submit anything. In particular, ext4 will attempt to write to the superblock if it detects errors in the filesystem, even if the filesystem is mounted read-only on a read-only partition. The assert is overkill regardless. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-05-07rbd: client_mutex is never nestedIlya Dryomov1-1/+1
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-05-07rbd: convert all rbd_assert(0) to BUG()Arnd Bergmann1-6/+6
rbd_assert(0) has caused different issues depending on the compiler version in the past, so it seems better to avoid it completely. Replace the remaining instances. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-05-07rbd: avoid clang -Wuninitialized warningArnd Bergmann1-1/+1
clang fails to see that rbd_assert(0) ends in an unreachable code path and warns about a subsequent use of an uninitialized variable when CONFIG_PROFILE_ANNOTATED_BRANCHES is set: drivers/block/rbd.c:2402:4: error: variable 'ret' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] rbd_assert(0); ^~~~~~~~~~~~~ drivers/block/rbd.c:563:7: note: expanded from macro 'rbd_assert' if (unlikely(!(expr))) { \ ^~~~~~~~~~~~~~~~~ include/linux/compiler.h:48:23: note: expanded from macro 'unlikely' # define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x))) ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/block/rbd.c:2410:6: note: uninitialized use occurs here if (ret) { ^~~ drivers/block/rbd.c:2402:4: note: remove the 'if' if its condition is always true rbd_assert(0); ^ drivers/block/rbd.c:563:3: note: expanded from macro 'rbd_assert' if (unlikely(!(expr))) { \ ^ drivers/block/rbd.c:2376:9: note: initialize the variable 'ret' to silence this warning int ret; ^ = 0 1 error generated. This seems to be a bug in clang, but is easy to work around by using an unconditional BUG(). Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-03-20rbd: drop wait_for_latest_osdmap()Ilya Dryomov1-18/+2
Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2019-03-18rbd: set io_min, io_opt and discard_granularity to alloc_sizeIlya Dryomov1-4/+4
Now that we have alloc_size that controls our discard behavior, it doesn't make sense to have these set to object (set) size. alloc_size defaults to 64k, but because discard_granularity is likely 4M, only ranges that are equal to or bigger than 4M can be considered during fstrim. A smaller io_min is also more likely to be met, resulting in fewer deferred writes on bluestore OSDs. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2019-03-13Merge tag 'ceph-for-5.1-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds1-103/+297
Pull ceph updates from Ilya Dryomov: "The highlights are: - rbd will now ignore discards that aren't aligned and big enough to actually free up some space (myself). This is controlled by the new alloc_size map option and can be disabled if needed. - support for rbd deep-flatten feature (myself). Deep-flatten allows "rbd flatten" to fully disconnect the clone image and its snapshots from the parent and make the parent snapshot removable. - a new round of cap handling improvements (Zheng Yan). The kernel client should now be much more prompt about releasing its caps and it is possible to put a limit on the number of caps held. - support for getting ceph.dir.pin extended attribute (Zheng Yan)" * tag 'ceph-for-5.1-rc1' of git://github.com/ceph/ceph-client: (26 commits) Documentation: modern versions of ceph are not backed by btrfs rbd: advertise support for RBD_FEATURE_DEEP_FLATTEN rbd: whole-object write and zeroout should copyup when snapshots exist rbd: copyup with an empty snapshot context (aka deep-copyup) rbd: introduce rbd_obj_issue_copyup_ops() rbd: stop copying num_osd_ops in rbd_obj_issue_copyup() rbd: factor out __rbd_osd_req_create() rbd: clear ->xferred on error from rbd_obj_issue_copyup() rbd: remove experimental designation from kernel layering ceph: add mount option to limit caps count ceph: periodically trim stale dentries ceph: delete stale dentry when last reference is dropped ceph: remove dentry_lru file from debugfs ceph: touch existing cap when handling reply ceph: pass inclusive lend parameter to filemap_write_and_wait_range() rbd: round off and ignore discards that are too small rbd: handle DISCARD and WRITE_ZEROES separately rbd: get rid of obj_req->obj_request_count libceph: use struct_size() for kmalloc() in crush_decode() ceph: send cap releases more aggressively ...
2019-03-09Merge tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull block layer updates from Jens Axboe: "Not a huge amount of changes in this round, the biggest one is that we finally have Mings multi-page bvec support merged. Apart from that, this pull request contains: - Small series that avoids quiescing the queue for sysfs changes that match what we currently have (Aleksei) - Series of bcache fixes (via Coly) - Series of lightnvm fixes (via Mathias) - NVMe pull request from Christoph. Nothing major, just SPDX/license cleanups, RR mp policy (Hannes), and little fixes (Bart, Chaitanya). - BFQ series (Paolo) - Save blk-mq cpu -> hw queue mapping, removing a pointer indirection for the fast path (Jianchao) - fops->iopoll() added for async IO polling, this is a feature that the upcoming io_uring interface will use (Christoph, me) - Partition scan loop fixes (Dongli) - mtip32xx conversion from managed resource API (Christoph) - cdrom registration race fix (Guenter) - MD pull from Song, two minor fixes. - Various documentation fixes (Marcos) - Multi-page bvec feature. This brings a lot of nice improvements with it, like more efficient splitting, larger IOs can be supported without growing the bvec table size, and so on. (Ming) - Various little fixes to core and drivers" * tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits) block: fix updating bio's front segment size block: Replace function name in string with __func__ nbd: propagate genlmsg_reply return code floppy: remove set but not used variable 'q' null_blk: fix checking for REQ_FUA block: fix NULL pointer dereference in register_disk fs: fix guard_bio_eod to check for real EOD errors blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map block: optimize bvec iteration in bvec_iter_advance block: introduce mp_bvec_for_each_page() for iterating over page block: optimize blk_bio_segment_split for single-page bvec block: optimize __blk_segment_map_sg() for single-page bvec block: introduce bvec_nth_page() iomap: wire up the iopoll method block: add bio_set_polled() helper block: wire up block device iopoll method fs: add an iopoll method to struct file_operations loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part() loop: do not print warn message if partition scan is successful block: bounce: make sure that bvec table is updated ...
2019-03-05rbd: advertise support for RBD_FEATURE_DEEP_FLATTENIlya Dryomov1-0/+2
All copyups perform deep-copyup regardless of whether deep-flatten feature is enabled. The feature bit is used to ensure that image is written to only by new-enough clients that always perform deep-copyup. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-03-05rbd: whole-object write and zeroout should copyup when snapshots existIlya Dryomov1-5/+7
Otherwise, once the parent snapshot is removed, the clone's snapshot wouldn't reflect the state of the clone prior to whole-object write or zeroout because a deep-copyup was never done ("rbd flatten" wouldn't do it because the modified object would exist in HEAD). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-03-05rbd: copyup with an empty snapshot context (aka deep-copyup)Ilya Dryomov1-10/+79
This is the core of deep-flatten feature: sending a copyup request (i.e. a guarded write of the data read from the parent) with an empty snapshot context (snaps = [], seq = 0) causes the OSD to reflect the write in all existing snapshots. This allows "rbd flatten" to fully disconnect the clone image and its snapshots from the parent and make the parent snapshot removable. The actual modification request is sent only after deep-copyup request is completed. Waiting for deep-copyup reply is unnecessary, this will be improved in the future. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-03-05rbd: introduce rbd_obj_issue_copyup_ops()Ilya Dryomov1-33/+43
In preparation for deep-flatten feature, split rbd_obj_issue_copyup() into two functions and add a new write state to make the state machine slightly more clear. Make the copyup op optional and start using that for when the overlap goes to 0. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-03-05rbd: stop copying num_osd_ops in rbd_obj_issue_copyup()Ilya Dryomov1-31/+59
In preparation for deep-flatten feature, stop copying num_osd_ops from the original request in rbd_obj_issue_copyup(). Split the calculation into count_{write,zeroout}_ops() respectively and determine whether the assert_exists guard is needed with the new rbd_obj_copyup_enabled(). As a nice side effect, we no longer guard in the writefull case as the copyup'ed object is always fully overwritten. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-03-05rbd: factor out __rbd_osd_req_create()Ilya Dryomov1-7/+12
Allow passing a custom snapshot context: NULL for read and an empty snapshot context for deep-copyup. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-03-05rbd: clear ->xferred on error from rbd_obj_issue_copyup()Ilya Dryomov1-0/+1
Otherwise the assert in rbd_obj_end_request() is triggered. Fixes: 3da691bf4366 ("rbd: new request handling code") Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-03-05rbd: remove experimental designation from kernel layeringIlya Dryomov1-8/+0
Support for kernel layering hasn't been considered experimental for a few years now. All the issues that I'm aware of were shaken out in 2014 and early 2015. Moreover, most of that code was rewritten with the addition of support for fancy striping. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2019-03-05rbd: round off and ignore discards that are too smallIlya Dryomov1-6/+55
If, after rounding off, the discard request is smaller than alloc_size, drop it on the floor in __rbd_img_fill_request(). Default alloc_size to 64k. This should cover both HDD and SSD based bluestore OSDs and somewhat improve things for filestore. For OSDs on filestore with filestore_punch_hole = false, alloc_size is best set to object size in order to allow deletes and truncates and disallow zero op. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2019-03-05rbd: handle DISCARD and WRITE_ZEROES separatelyIlya Dryomov1-10/+51
With discard_zeroes_data gone in commit 48920ff2a5a9 ("block: remove the discard_zeroes_data flag"), continuing to provide this guarantee is pointless: applications can't query it and discards can only be used for deallocating. Add OBJ_OP_ZEROOUT and move the existing logic under it. As the first step to divorcing OBJ_OP_DISCARD, stop worrying about copyups but keep special casing whole-object layered discards. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2019-03-05rbd: get rid of obj_req->obj_request_countIlya Dryomov1-5/+0
It is effectively unused. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2019-02-15block: kill BLK_MQ_F_SG_MERGEMing Lei1-1/+1
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-22block: rbd: convert to use BUS_ATTR_WO and ROGreg Kroah-Hartman1-26/+19
We are trying to get rid of BUS_ATTR() and the usage of that in rbd.c can be trivially converted to use BUS_ATTR_WO and RO, so use those macros instead. Cc: Sage Weil <sage@redhat.com> Cc: Alex Elder <elder@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Acked-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-10rbd: don't return 0 on unmap if RBD_DEV_FLAG_REMOVING is setIlya Dryomov1-5/+4
There is a window between when RBD_DEV_FLAG_REMOVING is set and when the device is removed from rbd_dev_list. During this window, we set "already" and return 0. Returning 0 from write(2) can confuse userspace tools because 0 indicates that nothing was written. In particular, "rbd unmap" will retry the write multiple times a second: 10:28:05.463299 write(4, "0", 1) = 0 10:28:05.463509 write(4, "0", 1) = 0 10:28:05.463720 write(4, "0", 1) = 0 10:28:05.463942 write(4, "0", 1) = 0 10:28:05.464155 write(4, "0", 1) = 0 Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Tested-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2018-10-22libceph, rbd, ceph: move ceph_osdc_alloc_messages() callsIlya Dryomov1-7/+12
The current requirement is that ceph_osdc_alloc_messages() should be called after oid and oloc are known. In preparation for preallocating message data items, move ceph_osdc_alloc_messages() further down, so that it is called when OSD op codes are known. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22libceph: osd_req_op_cls_init() doesn't need to take opcodeIlya Dryomov1-2/+1
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22rbd: add __init/__exit annotationsChengguang Xu1-3/+3
Add __init/__exit annotation to init/cleanup helpers which are only called once in the module. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-09-06rbd: support cloning across namespacesIlya Dryomov1-14/+97
If parent_get class method is not supported by the OSDs, fall back to the legacy class method and assume that the parent is in the default (i.e. "") namespace. The "use the child's image namespace" workaround is no longer needed because creating images within namespaces will require parent_get aware OSDs. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2018-09-06rbd: factor out get_parent_info()Ilya Dryomov1-48/+86
In preparation for the new parent_get and parent_overlap_get class methods, factor out the fetching and decoding of parent data. As a side effect, we now decode all four fields in the "no parent" case. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2018-08-21Merge tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds1-37/+88
Pull ceph updates from Ilya Dryomov: "The main things are support for cephx v2 authentication protocol and basic support for rbd images within namespaces (myself). Also included are y2038 conversion patches from Arnd, a pile of miscellaneous fixes from Chengguang and Zheng's feature bit infrastructure for the filesystem" * tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client: (40 commits) ceph: don't drop message if it contains more data than expected ceph: support cephfs' own feature bits crush: fix using plain integer as NULL warning libceph: remove unnecessary non NULL check for request_key ceph: refactor error handling code in ceph_reserve_caps() ceph: refactor ceph_unreserve_caps() ceph: change to void return type for __do_request() ceph: compare fsc->max_file_size and inode->i_size for max file size limit ceph: add additional size check in ceph_setattr() ceph: add additional offset check in ceph_write_iter() ceph: add additional range check in ceph_fallocate() ceph: add new field max_file_size in ceph_fs_client libceph: weaken sizeof check in ceph_x_verify_authorizer_reply() libceph: check authorizer reply/challenge length before reading libceph: implement CEPHX_V2 calculation mode libceph: add authorizer challenge libceph: factor out encrypt_authorizer() libceph: factor out __ceph_x_decrypt() libceph: factor out __prepare_write_connect() libceph: store ceph_auth_handshake pointer in ceph_connection ...