summaryrefslogtreecommitdiff
path: root/fs/fuse
AgeCommit message (Collapse)AuthorFilesLines
2019-12-07pipe: Fix iteration end check in fuse_dev_splice_write()David Howells1-1/+1
Fix the iteration end check in fuse_dev_splice_write(). The iterator position can only be compared with == or != since wrappage may be involved. Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-05Merge tag 'fuse-update-5.5' of ↵Linus Torvalds6-115/+134
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse update from Miklos Szeredi: - Fix a regression introduced in the last release - Fix a number of issues with validating data coming from userspace - Some cleanups in virtiofs * tag 'fuse-update-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix Kconfig indentation fuse: fix leak of fuse_io_priv virtiofs: Use completions while waiting for queue to be drained virtiofs: Do not send forget request "struct list_head" element virtiofs: Use a common function to send forget virtiofs: Fix old-style declaration fuse: verify nlink fuse: verify write return fuse: verify attributes
2019-12-02Merge tag 'compat-ioctl-5.5' of ↵Linus Torvalds1-1/+1
git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann: "As part of the cleanup of some remaining y2038 issues, I came to fs/compat_ioctl.c, which still has a couple of commands that need support for time64_t. In completely unrelated work, I spent time on cleaning up parts of this file in the past, moving things out into drivers instead. After Al Viro reviewed an earlier version of this series and did a lot more of that cleanup, I decided to try to completely eliminate the rest of it and move it all into drivers. This series incorporates some of Al's work and many patches of my own, but in the end stops short of actually removing the last part, which is the scsi ioctl handlers. I have patches for those as well, but they need more testing or possibly a rewrite" * tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits) scsi: sd: enable compat ioctls for sed-opal pktcdvd: add compat_ioctl handler compat_ioctl: move SG_GET_REQUEST_TABLE handling compat_ioctl: ppp: move simple commands into ppp_generic.c compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic compat_ioctl: unify copy-in of ppp filters tty: handle compat PPP ioctls compat_ioctl: move SIOCOUTQ out of compat_ioctl.c compat_ioctl: handle SIOCOUTQNSD af_unix: add compat_ioctl support compat_ioctl: reimplement SG_IO handling compat_ioctl: move WDIOC handling into wdt drivers fs: compat_ioctl: move FITRIM emulation into file systems gfs2: add compat_ioctl support compat_ioctl: remove unused convert_in_user macro compat_ioctl: remove last RAID handling code compat_ioctl: remove /dev/raw ioctl translation compat_ioctl: remove PCI ioctl translation compat_ioctl: remove joystick ioctl translation ...
2019-12-01Merge tag 'notifications-pipe-prep-20191115' of ↵Linus Torvalds1-13/+18
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull pipe rework from David Howells: "This is my set of preparatory patches for building a general notification queue on top of pipes. It makes a number of significant changes: - It removes the nr_exclusive argument from __wake_up_sync_key() as this is always 1. This prepares for the next step: - Adds wake_up_interruptible_sync_poll_locked() so that poll can be woken up from a function that's holding the poll waitqueue spinlock. - Change the pipe buffer ring to be managed in terms of unbounded head and tail indices rather than bounded index and length. This means that reading the pipe only needs to modify one index, not two. - A selection of helper functions are provided to query the state of the pipe buffer, plus a couple to apply updates to the pipe indices. - The pipe ring is allowed to have kernel-reserved slots. This allows many notification messages to be spliced in by the kernel without allowing userspace to pin too many pages if it writes to the same pipe. - Advance the head and tail indices inside the pipe waitqueue lock and use wake_up_interruptible_sync_poll_locked() to poke poll without having to take the lock twice. - Rearrange pipe_write() to preallocate the buffer it is going to write into and then drop the spinlock. This allows kernel notifications to then be added the ring whilst it is filling the buffer it allocated. The read side is stalled because the pipe mutex is still held. - Don't wake up readers on a pipe if there was already data in it when we added more. - Don't wake up writers on a pipe if the ring wasn't full before we removed a buffer" * tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: pipe: Remove sync on wake_ups pipe: Increase the writer-wakeup threshold to reduce context-switch count pipe: Check for ring full inside of the spinlock in pipe_write() pipe: Remove redundant wakeup from pipe_write() pipe: Rearrange sequence in pipe_write() to preallocate slot pipe: Conditionalise wakeup in pipe_read() pipe: Advance tail pointer inside of wait spinlock in pipe_read() pipe: Allow pipes to have kernel-reserved slots pipe: Use head and tail pointers for the ring, not cursor and length Add wake_up_interruptible_sync_poll_locked() Remove the nr_exclusive argument from __wake_up_sync_key() pipe: Reduce #inclusion of pipe_fs_i.h
2019-11-27fuse: fix Kconfig indentationKrzysztof Kozlowski1-2/+2
Adjust indentation from spaces to tab (+optional two spaces) as in coding style with command like: $ sed -e 's/^ /\t/' -i */Kconfig Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-27fuse: fix leak of fuse_io_privMiklos Szeredi1-1/+3
exit_aio() is sometimes stuck in wait_for_completion() after aio is issued with direct IO and the task receives a signal. The reason is failure to call ->ki_complete() due to a leaked reference to fuse_io_priv. This happens in fuse_async_req_send() if fuse_simple_background() returns an error (e.g. -EINTR). In this case the error value is propagated via io->err, so return success to not confuse callers. This issue is tracked as a virtio-fs issue: https://gitlab.com/virtio-fs/qemu/issues/14 Reported-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Fixes: 45ac96ed7c36 ("fuse: convert direct_io to simple api") Cc: <stable@vger.kernel.org> # v5.4 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-22virtiofs: Use completions while waiting for queue to be drainedVivek Goyal1-10/+29
While we wait for queue to finish draining, use completions instead of usleep_range(). This is better way of waiting for event. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-22virtiofs: Do not send forget request "struct list_head" elementVivek Goyal1-5/+12
We are sending whole of virtio_fs_forget struct to the other end over virtqueue. Other end does not need to see elements like "struct list". That's internal detail of guest kernel. Fix it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-22virtiofs: Use a common function to send forgetVivek Goyal1-87/+63
Currently we are duplicating logic to send forgets at two places. Consolidate the code by calling one helper function. This also uses virtqueue_add_outbuf() instead of virtqueue_add_sgs(). Former is simpler to call. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-15pipe: Allow pipes to have kernel-reserved slotsDavid Howells1-4/+4
Split pipe->ring_size into two numbers: (1) pipe->ring_size - indicates the hard size of the pipe ring. (2) pipe->max_usage - indicates the maximum number of pipe ring slots that userspace orchestrated events can fill. This allows for a pipe that is both writable by the general kernel notification facility and by userspace, allowing plenty of ring space for notifications to be added whilst preventing userspace from being able to pin too much unswappable kernel space. Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-12virtiofs: Fix old-style declarationYueHaibing1-3/+3
There expect the 'static' keyword to come first in a declaration, and we get warnings like this with "make W=1": fs/fuse/virtio_fs.c:687:1: warning: 'static' is not at beginning of declaration [-Wold-style-declaration] fs/fuse/virtio_fs.c:692:1: warning: 'static' is not at beginning of declaration [-Wold-style-declaration] fs/fuse/virtio_fs.c:1029:1: warning: 'static' is not at beginning of declaration [-Wold-style-declaration] Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-12fuse: verify nlinkMiklos Szeredi1-1/+2
When adding a new hard link, make sure that i_nlink doesn't overflow. Fixes: ac45d61357e8 ("fuse: fix nlink after unlink") Cc: <stable@vger.kernel.org> # v3.4 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-12fuse: verify write returnMiklos Szeredi1-0/+2
Make sure filesystem is not returning a bogus number of bytes written. Fixes: ea9b9907b82a ("fuse: implement perform_write") Cc: <stable@vger.kernel.org> # v2.6.26 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-12fuse: verify attributesMiklos Szeredi3-7/+19
If a filesystem returns negative inode sizes, future reads on the file were causing the cpu to spin on truncate_pagecache. Create a helper to validate the attributes. This now does two things: - check the file mode - check if the file size fits in i_size without overflowing Reported-by: Arijit Banerjee <arijit@rubrik.com> Fixes: d8a5ba45457e ("[PATCH] FUSE - core") Cc: <stable@vger.kernel.org> # v2.6.14 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-31pipe: Use head and tail pointers for the ring, not cursor and lengthDavid Howells1-13/+18
Convert pipes to use head and tail pointers for the buffer ring rather than pointer and length as the latter requires two atomic ops to update (or a combined op) whereas the former only requires one. (1) The head pointer is the point at which production occurs and points to the slot in which the next buffer will be placed. This is equivalent to pipe->curbuf + pipe->nrbufs. The head pointer belongs to the write-side. (2) The tail pointer is the point at which consumption occurs. It points to the next slot to be consumed. This is equivalent to pipe->curbuf. The tail pointer belongs to the read-side. (3) head and tail are allowed to run to UINT_MAX and wrap naturally. They are only masked off when the array is being accessed, e.g.: pipe->bufs[head & mask] This means that it is not necessary to have a dead slot in the ring as head == tail isn't ambiguous. (4) The ring is empty if "head == tail". A helper, pipe_empty(), is provided for this. (5) The occupancy of the ring is "head - tail". A helper, pipe_occupancy(), is provided for this. (6) The number of free slots in the ring is "pipe->ring_size - occupancy". A helper, pipe_space_for_user() is provided to indicate how many slots userspace may use. (7) The ring is full if "head - tail >= pipe->ring_size". A helper, pipe_full(), is provided for this. Signed-off-by: David Howells <dhowells@redhat.com>
2019-10-23compat_ioctl: move more drivers to compat_ptr_ioctlArnd Bergmann1-1/+1
The .ioctl and .compat_ioctl file operations have the same prototype so they can both point to the same function, which works great almost all the time when all the commands are compatible. One exception is the s390 architecture, where a compat pointer is only 31 bit wide, and converting it into a 64-bit pointer requires calling compat_ptr(). Most drivers here will never run in s390, but since we now have a generic helper for it, it's easy enough to use it consistently. I double-checked all these drivers to ensure that all ioctl arguments are used as pointers or are ignored, but are not interpreted as integer values. Acked-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: David Sterba <dsterba@suse.com> Acked-by: Darren Hart (VMware) <dvhart@infradead.org> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23fuse: redundant get_fuse_inode() calls in fuse_writepages_fill()Vasily Averin1-3/+1
Currently fuse_writepages_fill() calls get_fuse_inode() few times with the same argument. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-23fuse: truncate pending writes on O_TRUNCMiklos Szeredi1-3/+7
Make sure cached writes are not reordered around open(..., O_TRUNC), with the obvious wrong results. Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on") Cc: <stable@vger.kernel.org> # v3.15+ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-23fuse: flush dirty data/metadata before non-truncate setattrMiklos Szeredi1-0/+13
If writeback cache is enabled, then writes might get reordered with chmod/chown/utimes. The problem with this is that performing the write in the fuse daemon might itself change some of these attributes. In such case the following sequence of operations will result in file ending up with the wrong mode, for example: int fd = open ("suid", O_WRONLY|O_CREAT|O_EXCL); write (fd, "1", 1); fchown (fd, 0, 0); fchmod (fd, 04755); close (fd); This patch fixes this by flushing pending writes before performing chown/chmod/utimes. Reported-by: Giuseppe Scrivano <gscrivan@redhat.com> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com> Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on") Cc: <stable@vger.kernel.org> # v3.15+ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-23virtiofs: Remove set but not used variable 'fc'zhengbin1-2/+0
Fixes gcc '-Wunused-but-set-variable' warning: fs/fuse/virtio_fs.c: In function virtio_fs_wake_pending_and_unlock: fs/fuse/virtio_fs.c:983:20: warning: variable fc set but not used [-Wunused-but-set-variable] It is not used since commit 7ee1e2e631db ("virtiofs: No need to check fpq->connected state") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-21virtiofs: Retry request submission from worker contextVivek Goyal1-9/+52
If regular request queue gets full, currently we sleep for a bit and retrying submission in submitter's context. This assumes submitter is not holding any spin lock. But this assumption is not true for background requests. For background requests, we are called with fc->bg_lock held. This can lead to deadlock where one thread is trying submission with fc->bg_lock held while request completion thread has called fuse_request_end() which tries to acquire fc->bg_lock and gets blocked. As request completion thread gets blocked, it does not make further progress and that means queue does not get empty and submitter can't submit more requests. To solve this issue, retry submission with the help of a worker, instead of retrying in submitter's context. We already do this for hiprio/forget requests. Reported-by: Chirantan Ekbote <chirantan@chromium.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-21virtiofs: Count pending forgets as in_flight forgetsVivek Goyal1-24/+20
If virtqueue is full, we put forget requests on a list and these forgets are dispatched later using a worker. As of now we don't count these forgets in fsvq->in_flight variable. This means when queue is being drained, we have to have special logic to first drain these pending requests and then wait for fsvq->in_flight to go to zero. By counting pending forgets in fsvq->in_flight, we can get rid of special logic and just wait for in_flight to go to zero. Worker thread will kick and drain all the forgets anyway, leading in_flight to zero. I also need similar logic for normal request queue in next patch where I am about to defer request submission in the worker context if queue is full. This simplifies the code a bit. Also add two helper functions to inc/dec in_flight. Decrement in_flight helper will later used to call completion when in_flight reaches zero. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-21virtiofs: Set FR_SENT flag only after request has been sentVivek Goyal1-13/+10
FR_SENT flag should be set when request has been sent successfully sent over virtqueue. This is used by interrupt logic to figure out if interrupt request should be sent or not. Also add it to fqp->processing list after sending it successfully. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-21virtiofs: No need to check fpq->connected stateVivek Goyal1-7/+0
In virtiofs we keep per queue connected state in virtio_fs_vq->connected and use that to end request if queue is not connected. And virtiofs does not even touch fpq->connected state. We probably need to merge these two at some point of time. For now, simplify the code a bit and do not worry about checking state of fpq->connected. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-21virtiofs: Do not end request in submission contextVivek Goyal1-4/+33
Submission context can hold some locks which end request code tries to hold again and deadlock can occur. For example, fc->bg_lock. If a background request is being submitted, it might hold fc->bg_lock and if we could not submit request (because device went away) and tried to end request, then deadlock happens. During testing, I also got a warning from deadlock detection code. So put requests on a list and end requests from a worker thread. I got following warning from deadlock detector. [ 603.137138] WARNING: possible recursive locking detected [ 603.137142] -------------------------------------------- [ 603.137144] blogbench/2036 is trying to acquire lock: [ 603.137149] 00000000f0f51107 (&(&fc->bg_lock)->rlock){+.+.}, at: fuse_request_end+0xdf/0x1c0 [fuse] [ 603.140701] [ 603.140701] but task is already holding lock: [ 603.140703] 00000000f0f51107 (&(&fc->bg_lock)->rlock){+.+.}, at: fuse_simple_background+0x92/0x1d0 [fuse] [ 603.140713] [ 603.140713] other info that might help us debug this: [ 603.140714] Possible unsafe locking scenario: [ 603.140714] [ 603.140715] CPU0 [ 603.140716] ---- [ 603.140716] lock(&(&fc->bg_lock)->rlock); [ 603.140718] lock(&(&fc->bg_lock)->rlock); [ 603.140719] [ 603.140719] *** DEADLOCK *** Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-21fuse: don't advise readdirplus for negative lookupMiklos Szeredi1-1/+2
If the FUSE_READDIRPLUS_AUTO feature is enabled, then lookups on a directory before/during readdir are used as an indication that READDIRPLUS should be used instead of READDIR. However if the lookup turns out to be negative, then selecting READDIRPLUS makes no sense. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-21fuse: don't dereference req->args on finished requestMiklos Szeredi1-1/+3
Move the check for async request after check for the request being already finished and done with. Reported-by: syzbot+ae0bb7aae3de6b4594e2@syzkaller.appspotmail.com Fixes: d49937749fef ("fuse: stop copying args to fuse_req") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-15virtio-fs: don't show mount optionsMiklos Szeredi3-0/+9
Virtio-fs does not accept any mount options, so it's confusing and wrong to show any in /proc/mounts. Reported-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-14virtio-fs: Change module name to virtiofs.koVivek Goyal1-1/+2
We have been calling it virtio_fs and even file name is virtio_fs.c. Module name is virtio_fs.ko but when registering file system user is supposed to specify filesystem type as "virtiofs". Masayoshi Mizuma reported that he specified filesytem type as "virtio_fs" and got this warning on console. ------------[ cut here ]------------ request_module fs-virtio_fs succeeded, but still no fs? WARNING: CPU: 1 PID: 1234 at fs/filesystems.c:274 get_fs_type+0x12c/0x138 Modules linked in: ... virtio_fs fuse virtio_net net_failover ... CPU: 1 PID: 1234 Comm: mount Not tainted 5.4.0-rc1 #1 So looks like kernel could find the module virtio_fs.ko but could not find filesystem type after that. It probably is better to rename module name to virtiofs.ko so that above warning goes away in case user ends up specifying wrong fs name. Reported-by: Masayoshi Mizuma <msys.mizuma@gmail.com> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-28Merge tag 'virtio-fs-5.4' of ↵Linus Torvalds5-0/+1220
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse virtio-fs support from Miklos Szeredi: "Virtio-fs allows exporting directory trees on the host and mounting them in guest(s). This isn't actually a new filesystem, but a glue layer between the fuse filesystem and a virtio based back-end. It's similar in functionality to the existing virtio-9p solution, but significantly faster in benchmarks and has better POSIX compliance. Further permformance improvements can be achieved by sharing the page cache between host and guest, allowing for faster I/O and reduced memory use. Kata Containers have been including the out-of-tree virtio-fs (with the shared page cache patches as well) since version 1.7 as an experimental feature. They have been active in development and plan to switch from virtio-9p to virtio-fs as their default solution. There has been interest from other sources as well. The userspace infrastructure is slated to be merged into qemu once the kernel part hits mainline. This was developed by Vivek Goyal, Dave Gilbert and Stefan Hajnoczi" * tag 'virtio-fs-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: virtio-fs: add virtiofs filesystem virtio-fs: add Documentation/filesystems/virtiofs.rst fuse: reserve values for mapping protocol
2019-09-24fuse: Make fuse_args_to_req staticYueHaibing1-1/+1
Fix sparse warning: fs/fuse/dev.c:468:6: warning: symbol 'fuse_args_to_req' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Fixes: 68583165f962 ("fuse: add pages to fuse_args") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24fuse: fix memleak in cuse_channel_openzhengbin1-0/+1
If cuse_send_init fails, need to fuse_conn_put cc->fc. cuse_channel_open->fuse_conn_init->refcount_set(&fc->count, 1) ->fuse_dev_alloc->fuse_conn_get ->fuse_dev_free->fuse_conn_put Fixes: cc080e9e9be1 ("fuse: introduce per-instance fuse_dev structure") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24fuse: fix beyond-end-of-page access in fuse_parse_cache()Tejun Heo1-1/+3
With DEBUG_PAGEALLOC on, the following triggers. BUG: unable to handle page fault for address: ffff88859367c000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 3001067 P4D 3001067 PUD 406d3a8067 PMD 406d30c067 PTE 800ffffa6c983060 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC CPU: 38 PID: 3110657 Comm: python2.7 RIP: 0010:fuse_readdir+0x88f/0xe7a [fuse] Code: 49 8b 4d 08 49 39 4e 60 0f 84 44 04 00 00 48 8b 43 08 43 8d 1c 3c 4d 01 7e 68 49 89 dc 48 03 5c 24 38 49 89 46 60 8b 44 24 30 <8b> 4b 10 44 29 e0 48 89 ca 48 83 c1 1f 48 83 e1 f8 83 f8 17 49 89 RSP: 0018:ffffc90035edbde0 EFLAGS: 00010286 RAX: 0000000000001000 RBX: ffff88859367bff0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff88859367bfed RDI: 0000000000920907 RBP: ffffc90035edbe90 R08: 000000000000014b R09: 0000000000000004 R10: ffff88859367b000 R11: 0000000000000000 R12: 0000000000000ff0 R13: ffffc90035edbee0 R14: ffff889fb8546180 R15: 0000000000000020 FS: 00007f80b5f4a740(0000) GS:ffff889fffa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88859367c000 CR3: 0000001c170c2001 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: iterate_dir+0x122/0x180 __x64_sys_getdents+0xa6/0x140 do_syscall_64+0x42/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 It's in fuse_parse_cache(). %rbx (ffff88859367bff0) is fuse_dirent pointer - addr + offset. FUSE_DIRENT_SIZE() is trying to dereference namelen off of it but that derefs into the next page which is disabled by pagealloc debug causing a PF. This is caused by dirent->namelen being accessed before ensuring that there's enough bytes in the page for the dirent. Fix it by pushing down reclen calculation. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 5d7bc7e8680c ("fuse: allow using readdir cache") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24fuse: unexport fuse_put_requestArnd Bergmann1-1/+0
This function has been made static, which now causes a compile-time warning: WARNING: "fuse_put_request" [vmlinux] is a static EXPORT_SYMBOL_GPL Remove the unneeded export. Fixes: 66abc3599c3c ("fuse: unexport request ops") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24fuse: kmemcg account fs dataKhazhismel Kumykov3-4/+6
account per-file, dentry, and inode data blockdev/superblock and temporary per-request data was left alone, as this usually isn't accounted Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24fuse: on 64-bit store time in d_fsdata directlyKhazhismel Kumykov1-6/+30
Implements the optimization noted in commit f75fdf22b0a8 ("fuse: don't use ->d_time"), as the additional memory can be significant. (In particular, on SLAB configurations this 8-byte alloc becomes 32 bytes). Per-dentry, this can consume significant memory. Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24fuse: fix missing unlock_page in fuse_writepage()Vasily Averin1-0/+1
unlock_page() was missing in case of an already in-flight write against the same page. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Fixes: ff17be086477 ("fuse: writepage: skip already in flight") Cc: <stable@vger.kernel.org> # v3.13 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-18virtio-fs: add virtiofs filesystemStefan Hajnoczi5-0/+1220
Add a basic file system module for virtio-fs. This does not yet contain shared data support between host and guest or metadata coherency speedups. However it is already significantly faster than virtio-9p. Design Overview =============== With the goal of designing something with better performance and local file system semantics, a bunch of ideas were proposed. - Use fuse protocol (instead of 9p) for communication between guest and host. Guest kernel will be fuse client and a fuse server will run on host to serve the requests. - For data access inside guest, mmap portion of file in QEMU address space and guest accesses this memory using dax. That way guest page cache is bypassed and there is only one copy of data (on host). This will also enable mmap(MAP_SHARED) between guests. - For metadata coherency, there is a shared memory region which contains version number associated with metadata and any guest changing metadata updates version number and other guests refresh metadata on next access. This is yet to be implemented. How virtio-fs differs from existing approaches ============================================== The unique idea behind virtio-fs is to take advantage of the co-location of the virtual machine and hypervisor to avoid communication (vmexits). DAX allows file contents to be accessed without communication with the hypervisor. The shared memory region for metadata avoids communication in the common case where metadata is unchanged. By replacing expensive communication with cheaper shared memory accesses, we expect to achieve better performance than approaches based on network file system protocols. In addition, this also makes it easier to achieve local file system semantics (coherency). These techniques are not applicable to network file system protocols since the communications channel is bypassed by taking advantage of shared memory on a local machine. This is why we decided to build virtio-fs rather than focus on 9P or NFS. Caching Modes ============= Like virtio-9p, different caching modes are supported which determine the coherency level as well. The “cache=FOO” and “writeback” options control the level of coherence between the guest and host filesystems. - cache=none metadata, data and pathname lookup are not cached in guest. They are always fetched from host and any changes are immediately pushed to host. - cache=always metadata, data and pathname lookup are cached in guest and never expire. - cache=auto metadata and pathname lookup cache expires after a configured amount of time (default is 1 second). Data is cached while the file is open (close to open consistency). - writeback/no_writeback These options control the writeback strategy. If writeback is disabled, then normal writes will immediately be synchronized with the host fs. If writeback is enabled, then writes may be cached in the guest until the file is closed or an fsync(2) performed. This option has no effect on mmap-ed writes or writes going through the DAX mechanism. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: allow skipping control interface and forced unmountVivek Goyal2-1/+14
virtio-fs does not support aborting requests which are being processed. That is requests which have been sent to fuse daemon on host. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: dissociate DESTROY from fuseblkMiklos Szeredi2-4/+17
Allow virtio-fs to also send DESTROY request. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: delete dentry if timeout is zeroMiklos Szeredi2-3/+28
Don't hold onto dentry in lru list if need to re-lookup it anyway at next access. Only do this if explicitly enabled, otherwise it could result in performance regression. More advanced version of this patch would periodically flush out dentries from the lru which have gone stale. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: separate fuse device allocation and installation in fuse_connVivek Goyal4-7/+26
As of now fuse_dev_alloc() both allocates a fuse device and installs it in fuse_conn list. fuse_dev_alloc() can fail if fuse_device allocation fails. virtio-fs needs to initialize multiple fuse devices (one per virtio queue). It initializes one fuse device as part of call to fuse_fill_super_common() and rest of the devices are allocated and installed after that. But, we can't afford to fail after calling fuse_fill_super_common() as we don't have a way to undo all the actions done by fuse_fill_super_common(). So to avoid failures after the call to fuse_fill_super_common(), pre-allocate all fuse devices early and install them into fuse connection later. This patch provides two separate helpers for fuse device allocation and fuse device installation in fuse_conn. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: add fuse_iqueue_ops callbacksStefan Hajnoczi4-22/+81
The /dev/fuse device uses fiq->waitq and fasync to signal that requests are available. These mechanisms do not apply to virtio-fs. This patch introduces callbacks so alternative behavior can be used. Note that queue_interrupt() changes along these lines: spin_lock(&fiq->waitq.lock); wake_up_locked(&fiq->waitq); + kill_fasync(&fiq->fasync, SIGIO, POLL_IN); spin_unlock(&fiq->waitq.lock); - kill_fasync(&fiq->fasync, SIGIO, POLL_IN); Since queue_request() and queue_forget() also call kill_fasync() inside the spinlock this should be safe. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: extract fuse_fill_super_common()Stefan Hajnoczi2-59/+80
fuse_fill_super() includes code to process the fd= option and link the struct fuse_dev to the fd's struct file. In virtio-fs there is no file descriptor because /dev/fuse is not used. This patch extracts fuse_fill_super_common() so that both classic fuse and virtio-fs can share the code to initialize a mount. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: export fuse_dequeue_forget() functionVivek Goyal2-6/+11
File systems like virtio-fs need to do not have to play directly with forget list data structures. There is a helper function use that instead. Rename dequeue_forget() to fuse_dequeue_forget() and export it so that stacked filesystems can use it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: export fuse_get_unique()Stefan Hajnoczi2-1/+7
virtio-fs will need unique IDs for FORGET requests from outside fs/fuse/dev.c. Make the symbol visible. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: export fuse_send_init_request()Vivek Goyal2-1/+3
This will be used by virtio-fs to send init request to fuse server after initialization of virt queues. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: export fuse_len_args()Stefan Hajnoczi2-4/+10
virtio-fs will need to query the length of fuse_arg lists. Make the symbol visible. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: export fuse_end_request()Stefan Hajnoczi2-9/+15
virtio-fs will need to complete requests from outside fs/fuse/dev.c. Make the symbol visible. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: fix request limitMiklos Szeredi1-2/+5
The size of struct fuse_req was reduced from 392B to 144B on a non-debug config, thus the sanitize_global_limit() helper was setting a larger default limit. This doesn't really reflect reduction in the memory used by requests, since the fields removed from fuse_req were added to fuse_args derived structs; e.g. sizeof(struct fuse_writepages_args) is 248B, thus resulting in slightly more memory being used for writepage requests overalll (due to using 256B slabs). Make the calculatation ignore the size of fuse_req and use the old 392B value. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>