summaryrefslogtreecommitdiff
path: root/fs/ceph
AgeCommit message (Collapse)AuthorFilesLines
2023-02-15ceph: flush cap releases when the session is flushedXiubo Li1-0/+6
commit e7d84c6a1296d059389f7342d9b4b7defb518d3a upstream. MDS expects the completed cap release prior to responding to the session flush for cache drop. Cc: stable@vger.kernel.org Link: http://tracker.ceph.com/issues/38009 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Venky Shankar <vshankar@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-14ceph: switch to vfs_inode_has_locks() to fix file lock bugXiubo Li3-6/+1
[ Upstream commit 461ab10ef7e6ea9b41a0571a7fc6a72af9549a3c ] For the POSIX locks they are using the same owner, which is the thread id. And multiple POSIX locks could be merged into single one, so when checking whether the 'file' has locks may fail. For a file where some openers use locking and others don't is a really odd usage pattern though. Locks are like stoplights -- they only work if everyone pays attention to them. Just switch ceph_get_caps() to check whether any locks are set on the inode. If there are POSIX/OFD/FLOCK locks on the file at the time, we should set CHECK_FILELOCK, regardless of what fd was used to set the lock. Fixes: ff5d913dfc71 ("ceph: return -EIO if read/write against filp that lost file locks") Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02ceph: fix NULL pointer dereference for req->r_sessionXiubo Li1-36/+12
[ Upstream commit 5bd76b8de5b74fa941a6eafee87728a0fe072267 ] The request's r_session maybe changed when it was forwarded or resent. Both the forwarding and resending cases the requests will be protected by the mdsc->mutex. Cc: stable@vger.kernel.org Link: https://bugzilla.redhat.com/show_bug.cgi?id=2137955 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02ceph: Use kcalloc for allocating multiple elementsKenneth Lee1-1/+1
[ Upstream commit aa1d627207cace003163dee24d1c06fa4e910c6b ] Prefer using kcalloc(a, b) over kzalloc(a * b) as this improves semantics since kcalloc is intended for allocating an array of memory. Signed-off-by: Kenneth Lee <klee33@uw.edu> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Stable-dep-of: 5bd76b8de5b7 ("ceph: fix NULL pointer dereference for req->r_session") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02ceph: fix possible NULL pointer dereference for req->r_sessionXiubo Li1-0/+4
[ Upstream commit 7acae6183cf37c48b8da48bbbdb78820fb3913f3 ] The request will be inserted into the ci->i_unsafe_dirops before assigning the req->r_session, so it's possible that we will hit NULL pointer dereference bug here. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/55327 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Stable-dep-of: 5bd76b8de5b7 ("ceph: fix NULL pointer dereference for req->r_session") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02ceph: put the requests/sessions when it fails to alloc memoryXiubo Li1-18/+37
[ Upstream commit 89d43d0551a848e70e63d9ba11534aaeabc82443 ] When failing to allocate the sessions memory we should make sure the req1 and req2 and the sessions get put. And also in case the max_sessions decreased so when kreallocate the new memory some sessions maybe missed being put. And if the max_sessions is 0 krealloc will return ZERO_SIZE_PTR, which will lead to a distinct access fault. URL: https://tracker.ceph.com/issues/53819 Fixes: e1a4541ec0b9 ("ceph: flush the mdlog before waiting on unsafe reqs") Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Venky Shankar <vshankar@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Stable-dep-of: 5bd76b8de5b7 ("ceph: fix NULL pointer dereference for req->r_session") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02ceph: fix off by one bugs in unsafe_request_wait()Dan Carpenter1-2/+2
[ Upstream commit 708c87168b6121abc74b2a57d0c498baaf70cbea ] The "> max" tests should be ">= max" to prevent an out of bounds access on the next lines. Fixes: e1a4541ec0b9 ("ceph: flush the mdlog before waiting on unsafe reqs") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Stable-dep-of: 5bd76b8de5b7 ("ceph: fix NULL pointer dereference for req->r_session") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02ceph: flush the mdlog before waiting on unsafe reqsXiubo Li1-0/+76
[ Upstream commit e1a4541ec0b951685a49d1f72d183681e6433a45 ] For the client requests who will have unsafe and safe replies from MDS daemons, in the MDS side the MDS daemons won't flush the mdlog (journal log) immediatelly, because they think it's unnecessary. That's true for most cases but not all, likes the fsync request. The fsync will wait until all the unsafe replied requests to be safely replied. Normally if there have multiple threads or clients are running, the whole mdlog in MDS daemons could be flushed in time if any request will trigger the mdlog submit thread. So usually we won't experience the normal operations will stuck for a long time. But in case there has only one client with only thread is running, the stuck phenomenon maybe obvious and the worst case it must wait at most 5 seconds to wait the mdlog to be flushed by the MDS's tick thread periodically. This patch will trigger to flush the mdlog in the relevant and auth MDSes to which the in-flight requests are sent just before waiting the unsafe requests to finish. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Stable-dep-of: 5bd76b8de5b7 ("ceph: fix NULL pointer dereference for req->r_session") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02ceph: flush mdlog before umountingXiubo Li3-0/+27
[ Upstream commit d095559ce4100f0c02aea229705230deac329c97 ] Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Stable-dep-of: 5bd76b8de5b7 ("ceph: fix NULL pointer dereference for req->r_session") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02ceph: make iterate_sessions a global symbolXiubo Li3-42/+36
[ Upstream commit 59b312f36230ea91ebb6ce1b11f2781604495d30 ] Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Stable-dep-of: 5bd76b8de5b7 ("ceph: fix NULL pointer dereference for req->r_session") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02ceph: make ceph_create_session_msg a global symbolXiubo Li2-7/+10
[ Upstream commit fba97e8025015b63b1bdb73cd868c8ea832a1620 ] Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Stable-dep-of: 5bd76b8de5b7 ("ceph: fix NULL pointer dereference for req->r_session") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02ceph: avoid putting the realm twice when decoding snaps failsXiubo Li1-1/+2
[ Upstream commit 51884d153f7ec85e18d607b2467820a90e0f4359 ] When decoding the snaps fails it maybe leaving the 'first_realm' and 'realm' pointing to the same snaprealm memory. And then it'll put it twice and could cause random use-after-free, BUG_ON, etc issues. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/57686 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02ceph: do not update snapshot context when there is no new snapshotXiubo Li1-9/+19
[ Upstream commit 2e586641c950e7f3e7e008404bd783a466b9b590 ] We will only track the uppest parent snapshot realm from which we need to rebuild the snapshot contexts _downward_ in hierarchy. For all the others having no new snapshot we will do nothing. This fix will avoid calling ceph_queue_cap_snap() on some inodes inappropriately. For example, with the code in mainline, suppose there are 2 directory hierarchies (with 6 directories total), like this: /dir_X1/dir_X2/dir_X3/ /dir_Y1/dir_Y2/dir_Y3/ Firstly, make a snapshot under /dir_X1/dir_X2/.snap/snap_X2, then make a root snapshot under /.snap/root_snap. Every time we make snapshots under /dir_Y1/..., the kclient will always try to rebuild the snap context for snap_X2 realm and finally will always try to queue cap snaps for dir_Y2 and dir_Y3, which makes no sense. That's because the snap_X2's seq is 2 and root_snap's seq is 3. So when creating a new snapshot under /dir_Y1/... the new seq will be 4, and the mds will send the kclient a snapshot backtrace in _downward_ order: seqs 4, 3. When ceph_update_snap_trace() is called, it will always rebuild the from the last realm, that's the root_snap. So later when rebuilding the snap context, the current logic will always cause it to rebuild the snap_X2 realm and then try to queue cap snaps for all the inodes related in that realm, even though it's not necessary. This is accompanied by a lot of these sorts of dout messages: "ceph: queue_cap_snap 00000000a42b796b nothing dirty|writing" Fix the logic to avoid this situation. Also, the 'invalidate' word is not precise here. In actuality, it will cause a rebuild of the existing snapshot contexts or just build non-existent ones. Rename it to 'rebuild_snapcs'. URL: https://tracker.ceph.com/issues/44100 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Stable-dep-of: 51884d153f7e ("ceph: avoid putting the realm twice when decoding snaps fails") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-15ceph: don't truncate file in atomic_openHu Weiwen1-3/+7
commit 7cb9994754f8a36ae9e5ec4597c5c4c2d6c03832 upstream. Clear O_TRUNC from the flags sent in the MDS create request. `atomic_open' is called before permission check. We should not do any modification to the file here. The caller will do the truncation afterward. Fixes: 124e68e74099 ("ceph: file operations") Signed-off-by: Hu Weiwen <sehuww@mail.scut.edu.cn> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> [Xiubo: fixed a trivial conflict for 5.10 backport] Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-25ceph: don't leak snap_rwsem in handle_cap_grantJeff Layton1-14/+13
commit 58dd4385577ed7969b80cdc9e2a31575aba6c712 upstream. When handle_cap_grant is called on an IMPORT op, then the snap_rwsem is held and the function is expected to release it before returning. It currently fails to do that in all cases which could lead to a deadlock. Fixes: 6f05b30ea063 ("ceph: reset i_requested_max_size if file write is not wanted") Link: https://tracker.ceph.com/issues/55857 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luís Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-25ceph: use correct index when encoding client supported featuresLuís Henriques2-8/+5
commit fea013e020e6ecc7be75bea0d61697b7e916b44d upstream. Feature bits have to be encoded into the correct locations. This hasn't been an issue so far because the only hole in the feature bits was in bit 10 (CEPHFS_FEATURE_RECLAIM_CLIENT), which is located in the 2nd byte. When adding more bits that go beyond the this 2nd byte, the bug will show up. [xiubli: remove incorrect comment for CEPHFS_FEATURES_CLIENT_SUPPORTED] Fixes: 9ba1e224538a ("ceph: allocate the correct amount of extra bytes for the session features") Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-14ceph: allow ceph.dir.rctime xattr to be updatableVenky Shankar1-1/+9
[ Upstream commit d7a2dc523085f8b8c60548ceedc696934aefeb0e ] `rctime' has been a pain point in cephfs due to its buggy nature - inconsistent values reported and those sorts. Fixing rctime is non-trivial needing an overall redesign of the entire nested statistics infrastructure. As a workaround, PR http://github.com/ceph/ceph/pull/37938 allows this extended attribute to be manually set. This allows users to "fixup" inconsistent rctime values. While this sounds messy, its probably the wisest approach allowing users/scripts to workaround buggy rctime values. The above PR enables Ceph MDS to allow manually setting rctime extended attribute with the corresponding user-land changes. We may as well allow the same to be done via kclient for parity. Signed-off-by: Venky Shankar <vshankar@redhat.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-05-18ceph: fix setting of xattrs on async created inodesJeff Layton1-3/+13
commit 620239d9a32e9fe27c9204ec11e40058671aeeb6 upstream. Currently when we create a file, we spin up an xattr buffer to send along with the create request. If we end up doing an async create however, then we currently pass down a zero-length xattr buffer. Fix the code to send down the xattr buffer in req->r_pagelist. If the xattrs span more than a page, however give up and don't try to do an async create. Cc: stable@vger.kernel.org URL: https://bugzilla.redhat.com/show_bug.cgi?id=2063929 Fixes: 9a8d03ca2e2c ("ceph: attempt to do async create when possible") Reported-by: John Fortin <fortinj66@gmail.com> Reported-by: Sri Ramanujam <sri@ramanujam.io> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-13ceph: fix memory leak in ceph_readdir when note_last_dentry returns errorXiubo Li1-1/+10
[ Upstream commit f639d9867eea647005dc824e0e24f39ffc50d4e4 ] Reset the last_readdir at the same time, and add a comment explaining why we don't free last_readdir when dir_emit returns false. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-01ceph: set pool_ns in new inode layout for async createsJeff Layton1-0/+7
commit 4584a768f22b7669cdebabc911543621ac661341 upstream. Dan reported that he was unable to write to files that had been asynchronously created when the client's OSD caps are restricted to a particular namespace. The issue is that the layout for the new inode is only partially being filled. Ensure that we populate the pool_ns_data and pool_ns_len in the iinfo before calling ceph_fill_inode. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/54013 Fixes: 9a8d03ca2e2c ("ceph: attempt to do async create when possible") Reported-by: Dan van der Ster <dan@vanderster.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01ceph: properly put ceph_string reference after async create attemptJeff Layton1-0/+2
commit 932a9b5870d38b87ba0a9923c804b1af7d3605b9 upstream. The reference acquired by try_prep_async_create is currently leaked. Ensure we put it. Cc: stable@vger.kernel.org Fixes: 9a8d03ca2e2c ("ceph: attempt to do async create when possible") Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-29ceph: fix up non-directory creation in SGID directoriesChristian Brauner1-3/+15
commit fd84bfdddd169c219c3a637889a8b87f70a072c2 upstream. Ceph always inherits the SGID bit if it is set on the parent inode, while the generic inode_init_owner does not do this in a few cases where it can create a possible security problem (cf. [1]). Update ceph to strip the SGID bit just as inode_init_owner would. This bug was detected by the mapped mount testsuite in [3]. The testsuite tests all core VFS functionality and semantics with and without mapped mounts. That is to say it functions as a generic VFS testsuite in addition to a mapped mount testsuite. While working on mapped mount support for ceph, SIGD inheritance was the only failing test for ceph after the port. The same bug was detected by the mapped mount testsuite in XFS in January 2021 (cf. [2]). [1]: commit 0fa3ecd87848 ("Fix up non-directory creation in SGID directories") [2]: commit 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [3]: https://git.kernel.org/fs/xfs/xfstests-dev.git Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-22ceph: initialize pathlen variable in reconnect_caps_cbXiubo Li1-2/+1
[ Upstream commit ee2a095d3b24f300a5e11944d208801e928f108c ] The smatch static checker warned about an uninitialized symbol usage in this function, in the case where ceph_mdsc_build_path returns an error. It turns out that that case is harmless, but it just looks sketchy. Initialize the variable at declaration time, and remove the unneeded setting of it later. Fixes: a33f6432b3a6 ("ceph: encode inodes' parent/d_name in cap reconnect message") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-12-22ceph: fix duplicate increment of opened_inodes metricHu Weiwen1-8/+8
[ Upstream commit 973e5245637accc4002843f6b888495a6a7762bc ] opened_inodes is incremented twice when the same inode is opened twice with O_RDONLY and O_WRONLY respectively. To reproduce, run this python script, then check the metrics: import os for _ in range(10000): fd_r = os.open('a', os.O_RDONLY) fd_w = os.open('a', os.O_WRONLY) os.close(fd_r) os.close(fd_w) Fixes: 1dd8d4708136 ("ceph: metrics for opened files, pinned caps and opened inodes") Signed-off-by: Hu Weiwen <sehuww@mail.scut.edu.cn> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-12-01ceph: properly handle statfs on multifs setupsJeff Layton1-5/+6
[ Upstream commit 8cfc0c7ed34f7929ce7e5d7c6eecf4d01ba89a84 ] ceph_statfs currently stuffs the cluster fsid into the f_fsid field. This was fine when we only had a single filesystem per cluster, but now that we have multiples we need to use something that will vary between them. Change ceph_statfs to xor each 32-bit chunk of the fsid (aka cluster id) into the lower bits of the statfs->f_fsid. Change the lower bits to hold the fscid (filesystem ID within the cluster). That should give us a value that is guaranteed to be unique between filesystems within a cluster, and should minimize the chance of collisions between mounts of different clusters. URL: https://tracker.ceph.com/issues/52812 Reported-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-27ceph: fix handling of "meta" errorsJeff Layton5-27/+8
commit 1bd85aa65d0e7b5e4d09240f492f37c569fdd431 upstream. Currently, we check the wb_err too early for directories, before all of the unsafe child requests have been waited on. In order to fix that we need to check the mapping->wb_err later nearer to the end of ceph_fsync. We also have an overly-complex method for tracking errors after blocklisting. The errors recorded in cleanup_session_requests go to a completely separate field in the inode, but we end up reporting them the same way we would for any other error (in fsync). There's no real benefit to tracking these errors in two different places, since the only reporting mechanism for them is in fsync, and we'd need to advance them both every time. Given that, we can just remove i_meta_err, and convert the places that used it to instead just use mapping->wb_err instead. That also fixes the original problem by ensuring that we do a check_and_advance of the wb_err at the end of the fsync op. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/52864 Reported-by: Patrick Donnelly <pdonnell@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-27ceph: skip existing superblocks that are blocklisted or shut down when mountingJeff Layton1-3/+14
commit 98d0a6fb7303a6f4a120b8b8ed05b86ff5db53e8 upstream. Currently when mounting, we may end up finding an existing superblock that corresponds to a blocklisted MDS client. This means that the new mount ends up being unusable. If we've found an existing superblock with a client that is already blocklisted, and the client is not configured to recover on its own, fail the match. Ditto if the superblock has been forcibly unmounted. While we're in here, also rename "other" to the more conventional "fsc". Cc: stable@vger.kernel.org URL: https://bugzilla.redhat.com/show_bug.cgi?id=1901499 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-26ceph: lockdep annotations for try_nonblocking_invalidateJeff Layton1-0/+2
[ Upstream commit 3eaf5aa1cfa8c97c72f5824e2e9263d6cc977b03 ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-26ceph: remove the capsnaps when removing capsXiubo Li3-18/+87
[ Upstream commit a6d37ccdd240e80f26aaea0e62cda310e0e184d7 ] capsnaps will take inode references via ihold when queueing to flush. When force unmounting, the client will just close the sessions and may never get a flush reply, causing a leak and inode ref leak. Fix this by removing the capsnaps for an inode when removing the caps. URL: https://tracker.ceph.com/issues/52295 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-26ceph: request Fw caps before updating the mtime in ceph_write_iterJeff Layton1-15/+17
[ Upstream commit b11ed50346683a749632ea664959b28d524d7395 ] The current code will update the mtime and then try to get caps to handle the write. If we end up having to request caps from the MDS, then the mtime in the cap grant will clobber the updated mtime and it'll be lost. This is most noticable when two clients are alternately writing to the same file. Fw caps are continually being granted and revoked, and the mtime ends up stuck because the updated mtimes are always being overwritten with the old one. Fix this by changing the order of operations in ceph_write_iter to get the caps before updating the times. Also, make sure we check the pool full conditions before even getting any caps or uninlining. URL: https://tracker.ceph.com/issues/46574 Reported-by: Jozef Kováč <kovac@firma.zoznam.sk> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-26ceph: cancel delayed work instead of flushing on mdsc teardownJeff Layton2-3/+2
[ Upstream commit b4002173b7989588b6feaefc42edaf011b596782 ] The first thing metric_delayed_work does is check mdsc->stopping, and then return immediately if it's set. That's good since we would have already torn down the metric structures at this point, otherwise, but there is no locking around mdsc->stopping. It's possible that the ceph_metric_destroy call could race with the delayed_work, in which case we could end up with the delayed_work accessing destroyed percpu variables. At this point in the mdsc teardown, the "stopping" flag has already been set, so there's no benefit to flushing the work. Move the work cancellation in ceph_metric_destroy ahead of the percpu variable destruction, and eliminate the flush_delayed_work call in ceph_mdsc_destroy. Fixes: 18f473b384a6 ("ceph: periodically send perf metrics to MDSes") Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-26ceph: allow ceph_put_mds_session to take NULL or ERR_PTRJeff Layton4-10/+8
[ Upstream commit 7e65624d32b6e0429b1d3559e5585657f34f74a1 ] ...to simplify some error paths. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-18ceph: fix dereference of null pointer cfColin Ian King1-0/+3
commit 05a444d3f90a3c3e6362e88a1bf13e1a60f8cace upstream. Currently in the case where kmem_cache_alloc fails the null pointer cf is dereferenced when assigning cf->is_capsnap = false. Fix this by adding a null pointer check and return path. Cc: stable@vger.kernel.org Addresses-Coverity: ("Dereference null return") Fixes: b2f9fa1f3bd8 ("ceph: correctly handle releasing an embedded cap flush") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-08ceph: fix possible null-pointer dereference in ceph_mdsmap_decode()Tuo Li1-3/+5
[ Upstream commit a9e6ffbc5b7324b6639ee89028908b1e91ceed51 ] kcalloc() is called to allocate memory for m->m_info, and if it fails, ceph_mdsmap_destroy() behind the label out_err will be called: ceph_mdsmap_destroy(m); In ceph_mdsmap_destroy(), m->m_info is dereferenced through: kfree(m->m_info[i].export_targets); To fix this possible null-pointer dereference, check m->m_info before the for loop to free m->m_info[i].export_targets. [ jlayton: fix up whitespace damage only kfree(m->m_info) if it's non-NULL ] Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Signed-off-by: Tuo Li <islituo@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-03ceph: correctly handle releasing an embedded cap flushXiubo Li4-12/+22
commit b2f9fa1f3bd8846f50b355fc2168236975c4d264 upstream. The ceph_cap_flush structures are usually dynamically allocated, but the ceph_cap_snap has an embedded one. When force umounting, the client will try to remove all the session caps. During this, it will free them, but that should not be done with the ones embedded in a capsnap. Fix this by adding a new boolean that indicates that the cap flush is embedded in a capsnap, and skip freeing it if that's set. At the same time, switch to using list_del_init() when detaching the i_list and g_list heads. It's possible for a forced umount to remove these objects but then handle_cap_flushsnap_ack() races in and does the list_del_init() again, corrupting memory. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/52283 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-18ceph: take snap_empty_lock atomically with snaprealm refcount changeJeff Layton1-17/+17
commit 8434ffe71c874b9c4e184b88d25de98c2bf5fe3f upstream. There is a race in ceph_put_snap_realm. The change to the nref and the spinlock acquisition are not done atomically, so you could decrement nref, and before you take the spinlock, the nref is incremented again. At that point, you end up putting it on the empty list when it shouldn't be there. Eventually __cleanup_empty_realms runs and frees it when it's still in-use. Fix this by protecting the 1->0 transition with atomic_dec_and_lock, and just drop the spinlock if we can get the rwsem. Because these objects can also undergo a 0->1 refcount transition, we must protect that change as well with the spinlock. Increment locklessly unless the value is at 0, in which case we take the spinlock, increment and then take it off the empty list if it did the 0->1 transition. With these changes, I'm removing the dout() messages from these functions, as well as in __put_snap_realm. They've always been racy, and it's better to not print values that may be misleading. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/46419 Reported-by: Mark Nelson <mnelson@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-18ceph: clean up locking annotation for ceph_get_snap_realm and ↵Jeff Layton1-4/+4
__lookup_snap_realm commit df2c0cb7f8e8c83e495260ad86df8c5da947f2a7 upstream. They both say that the snap_rwsem must be held for write, but I don't see any real reason for it, and it's not currently always called that way. The lookup is just walking the rbtree, so holding it for read should be fine there. The "get" is bumping the refcount and (possibly) removing it from the empty list. I see no need to hold the snap_rwsem for write for that. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-18ceph: add some lockdep assertions around snaprealm handlingJeff Layton1-0/+16
commit a6862e6708c15995bc10614b2ef34ca35b4b9078 upstream. Turn some comments into lockdep asserts. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-18ceph: reduce contention in ceph_check_delayed_caps()Luis Henriques3-11/+33
commit bf2ba432213fade50dd39f2e348085b758c0726e upstream. Function ceph_check_delayed_caps() is called from the mdsc->delayed_work workqueue and it can be kept looping for quite some time if caps keep being added back to the mdsc->cap_delay_list. This may result in the watchdog tainting the kernel with the softlockup flag. This patch breaks this loop if the caps have been recently (i.e. during the loop execution). Any new caps added to the list will be handled in the next run. Also, allow schedule_delayed() callers to explicitly set the delay value instead of defaulting to 5s, so we can ensure that it runs soon afterward if it looks like there is more work. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/46284 Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-28ceph: don't WARN if we're still opening a session to an MDSLuis Henriques1-1/+1
[ Upstream commit cdb330f4b41ab55feb35487729e883c9e08b8a54 ] If MDSs aren't available while mounting a filesystem, the session state will transition from SESSION_OPENING to SESSION_CLOSING. And in that scenario check_session_state() will be called from delayed_work() and trigger this WARN. Avoid this by only WARNing after a session has already been established (i.e., the s_ttl will be different from 0). Fixes: 62575e270f66 ("ceph: check session state after bumping session->s_seq") Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20ceph: remove bogus checks and WARN_ONs from ceph_set_page_dirtyJeff Layton1-9/+1
[ Upstream commit 22d41cdcd3cfd467a4af074165357fcbea1c37f5 ] The checks for page->mapping are odd, as set_page_dirty is an address_space operation, and I don't see where it would be called on a non-pagecache page. The warning about the page lock also seems bogus. The comment over set_page_dirty() says that it can be called without the page lock in some rare cases. I don't think we want to warn if that's the case. Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-30netfs: fix test for whether we can skip read when writing beyond EOFJeff Layton1-13/+41
commit 827a746f405d25f79560c7868474aec5aee174e1 upstream. It's not sufficient to skip reading when the pos is beyond the EOF. There may be data at the head of the page that we need to fill in before the write. Add a new helper function that corrects and clarifies the logic of when we can skip reads, and have it only zero out the part of the page that won't have data copied in for the write. Finally, don't set the page Uptodate after zeroing. It's not up to date since the write data won't have been copied in yet. [DH made the following changes: - Prefixed the new function with "netfs_". - Don't call zero_user_segments() for a full-page write. - Altered the beyond-last-page check to avoid a DIV instruction and got rid of then-redundant zero-length file check. ] [ Note: this fix is commit 827a746f405d in mainline kernels. The original bug was in ceph, but got lifted into the fs/netfs library for v5.13. This backport should apply to stable kernels v5.10 though v5.12. ] Fixes: e1b1240c1ff5f ("netfs: Add write_begin helper") Reported-by: Andrew W Elble <aweits@rit.edu> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> cc: ceph-devel@vger.kernel.org Link: https://lore.kernel.org/r/20210613233345.113565-1-jlayton@kernel.org/ Link: https://lore.kernel.org/r/162367683365.460125.4467036947364047314.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/162391826758.1173366.11794946719301590013.stgit@warthog.procyon.org.uk/ # v2 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30ceph: must hold snap_rwsem when filling inode for async createJeff Layton2-0/+5
commit 27171ae6a0fdc75571e5bf3d0961631a1e4fb765 upstream. ...and add a lockdep assertion for it to ceph_fill_inode(). Cc: stable@vger.kernel.org # v5.7+ Fixes: 9a8d03ca2e2c3 ("ceph: attempt to do async create when possible") Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-22ceph: don't allow access to MDS-private inodesJeff Layton4-0/+42
[ Upstream commit d4f6b31d721779d91b5e2f8072478af73b196c34 ] The MDS reserves a set of inodes for its own usage, and these should never be accessible to clients. Add a new helper to vet a proposed inode number against that range, and complain loudly and refuse to create or look it up if it's in it. Also, ensure that the MDS doesn't try to delegate inodes that are in that range or lower. Print a warning if it does, and don't save the range in the xarray. URL: https://tracker.ceph.com/issues/49922 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-22ceph: don't clobber i_snap_caps on non-I_NEW inodeJeff Layton1-4/+5
[ Upstream commit d3c51ae1b8cce5bdaf91a1ce32b33cf5626075dc ] We want the snapdir to mirror the non-snapped directory's attributes for most things, but i_snap_caps represents the caps granted on the snapshot directory by the MDS itself. A misbehaving MDS could issue different caps for the snapdir and we lose them here. Only reset i_snap_caps when the inode is I_NEW. Also, move the setting of i_op and i_fop inside the if block since they should never change anyway. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-22ceph: fix fscache invalidationJeff Layton2-0/+2
[ Upstream commit 10a7052c7868bc7bc72d947f5aac6f768928db87 ] Ensure that we invalidate the fscache whenever we invalidate the pagecache. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-19ceph: fix inode leak on getattr error in __fh_to_dentryJeff Layton1-1/+3
[ Upstream commit 1775c7ddacfcea29051c67409087578f8f4d751b ] Fixes: 878dabb64117 ("ceph: don't return -ESTALE if there's still an open file") Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04ceph: fix flush_snap logic after putting capsJeff Layton1-4/+6
[ Upstream commit 64f36da5625f7f9853b86750eaa89d499d16a2e9 ] A primary reason for skipping ceph_check_caps after putting the references was to avoid the locking in ceph_check_caps during a reconnect. __ceph_put_cap_refs can still call ceph_flush_snaps in that case though, and that takes many of the same inconvenient locks. Fix the logic in __ceph_put_cap_refs to skip flushing snaps when the skip_checking_caps flag is set. Fixes: e64f44a88465 ("ceph: skip checking caps when session reconnecting and releasing reqs") Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-02-26ceph: downgrade warning from mdsmap decode to debugLuis Henriques1-2/+2
commit ccd1acdf1c49b835504b235461fd24e2ed826764 upstream. While the MDS cluster is unstable and changing state the client may get mdsmap updates that will trigger warnings: [144692.478400] ceph: mdsmap_decode got incorrect state(up:standby-replay) [144697.489552] ceph: mdsmap_decode got incorrect state(up:standby-replay) [144697.489580] ceph: mdsmap_decode got incorrect state(up:standby-replay) This patch downgrades these warnings to debug, as they may flood the logs if the cluster is unstable for a while. Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-06ceph: fix inode refcount leak when ceph_fill_inode on non-I_NEW inode failsJeff Layton1-0/+2
[ Upstream commit 68cbb8056a4c24c6a38ad2b79e0a9764b235e8fa ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>