path: root/fs/eventpoll.c
Age | Commit message | Author | Files | Lines
2020-10-26 | take the common part of ep_eventpoll_poll() and ep_item_poll() into helper | Al Viro | 1 | -30/+27

The only reason why ep_item_poll() can't simply call ep_eventpoll_poll() (or, better yet, call vfs_poll() in all cases) is that we need to tell lockdep how deep into the hierarchy of ->mtx we are. So let's add a variant of ep_eventpoll_poll() that takes the depth explicitly, and turn ep_eventpoll_poll() into a wrapper for it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
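A minimal sketch of the shape such a helper could take (names and bodies are illustrative, not the committed diff; ep_read_events() stands in for the actual ready-list scan):

    static __poll_t __ep_eventpoll_poll(struct file *file, poll_table *wait,
                                        int depth)
    {
            struct eventpoll *ep = file->private_data;
            __poll_t res;

            poll_wait(file, &ep->poll_wait, wait);
            mutex_lock_nested(&ep->mtx, depth);     /* tell lockdep the depth */
            res = ep_read_events(ep, wait, depth);  /* hypothetical scan helper */
            mutex_unlock(&ep->mtx);
            return res;
    }

    static __poll_t ep_eventpoll_poll(struct file *file, poll_table *wait)
    {
            return __ep_eventpoll_poll(file, wait, 0);
    }
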
2020-10-26 | ep_insert(): we only need tep->mtx around the insertion itself | Al Viro | 1 | -18/+10

We do need ep->mtx (and we are holding it all along), but that's the lock on the epoll we are inserting into; locking of the epoll being inserted is not needed for most of that work - as a matter of fact, we only need it to provide barriers for the fastpath check (for now).

Move taking and releasing it into ep_insert(). The caller (do_epoll_ctl()) doesn't need to bother with that at all. Moreover, that way we kill the kludge in ep_item_poll() - now it's always called with tep unlocked.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | ep_insert(): don't open-code ep_remove() on failure exits | Al Viro | 1 | -37/+14

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | lift locking/unlocking ep->mtx out of ep_{start,done}_scan() | Al Viro | 1 | -31/+26

Get rid of the depth/ep_locked arguments there and document the kludge in ep_item_poll() that has led to ep_locked's existence in the first place.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | ep_send_events_proc(): fold into the caller | Al Viro | 1 | -40/+20

... and get rid of struct ep_send_events_data - not needed anymore. The weird way of passing the arguments in (and the real return value out - the nominal return value of ep_send_events_proc() is ignored) was due to the signature forced on ep_scan_ready_list() callbacks.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | lift the calls of ep_send_events_proc() into the callers | Al Viro | 1 | -28/+5

... and kill ep_scan_ready_list()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | lift the calls of ep_read_events_proc() into the callers | Al Viro | 1 | -10/+14

Expand the calls of ep_scan_ready_list() that get ep_read_events_proc(). As a side benefit we can pass depth to ep_read_events_proc() by value and not by address - the latter used to be forced by the signature expected from an ep_scan_ready_list() callback.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | ep_scan_ready_list(): prepare to split up | Al Viro | 1 | -27/+36

Take the stuff done before and after the callback into separate helpers.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | ep_loop_check_proc(): saner calling conventions | Al Viro | 1 | -22/+16

1) the 'cookie' argument is unused; kill it.
2) the 'priv' one is always an epoll struct file, and we only care about its associated struct eventpoll; pass that instead.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | get rid of ep_push_nested() | Al Viro | 1 | -25/+4

The only remaining user is loop checking. But there we only need to check that we have not walked into the epoll we are inserting into - we are adding an edge to an acyclic graph, so any loop being created will have to pass through the source of that edge.

So we don't need that array of cookies - we have only one eventpoll to watch out for. RIP ep_push_nested(), along with the cookies array.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
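In sketch form, the loop check then only needs to compare every epoll reached by the walk against the single edge source (illustrative code with an invented signature, not the committed diff):

    /* illustrative: DFS from the epoll being added; a cycle can only
     * arise if the walk reaches inserting_into, the source of the new edge */
    static int loop_check(struct eventpoll *ep,
                          struct eventpoll *inserting_into, int depth)
    {
            struct rb_node *n;

            if (ep == inserting_into)
                    return -1;              /* would close a cycle */
            if (depth > EP_MAX_NESTS)
                    return -1;              /* reverse path too deep */
            for (n = rb_first_cached(&ep->rbr); n; n = rb_next(n)) {
                    struct epitem *epi = rb_entry(n, struct epitem, rbn);

                    if (is_file_epoll(epi->ffd.file) &&
                        loop_check(epi->ffd.file->private_data,
                                   inserting_into, depth + 1))
                            return -1;
            }
            return 0;
    }
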
2020-10-26 | ep_loop_check_proc(): lift pushing the cookie into callers | Al Viro | 1 | -6/+12

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | clean reverse_path_check_proc() a bit | Al Viro | 1 | -17/+9

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | reverse_path_check_proc(): don't bother with cookies | Al Viro | 1 | -2/+1

We know there are no loops by the time we call it; the only thing we care about is too-deep reverse paths.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | reverse_path_check_proc(): sane arguments | Al Viro | 1 | -7/+5

No need to force its calling conventions to match the callback for the late unlamented ep_call_nested()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | untangling ep_call_nested(): and there was much rejoicing | Al Viro | 1 | -32/+11

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | untangling ep_call_nested(): move push/pop of cookie into the callbacks | Al Viro | 1 | -9/+9

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | untangling ep_call_nested(): take pushing cookie into a helper | Al Viro | 1 | -9/+17

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | untangling ep_call_nested(): it's all serialized on epmutex. | Al Viro | 1 | -69/+11

IOW,
* no locking is needed to protect the list
* the list is actually a stack
* no need to check ->ctx
* it can bloody well be a static 5-element array - nobody is going to be accessing it in parallel

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | untangling ep_call_nested(): get rid of useless arguments | Al Viro | 1 | -19/+12

ctx is always equal to current, ncalls - to &poll_loop_ncalls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | epoll: get rid of epitem->nwait | Al Viro | 1 | -26/+20

We use it only to indicate allocation failures within the queueing callback back to ep_insert(). Might as well use epq.epi for that reporting...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-26 | epoll: switch epitem->pwqlist to single-linked list | Al Viro | 1 | -26/+25

We only traverse it once to destroy all associated eppoll_entry at epitem destruction time. The order of traversal is irrelevant there.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
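A sketch of what a singly-linked pwqlist teardown looks like (field names approximate; a bare 'next' pointer replaces the struct list_head):

    /* illustrative: walk the chain once, freeing each entry */
    static void ep_unregister_pollwait(struct epitem *epi)
    {
            struct eppoll_entry *pwq, *next;

            for (pwq = epi->pwqlist; pwq; pwq = next) {
                    next = pwq->next;
                    remove_wait_queue(pwq->whead, &pwq->wait);
                    kmem_cache_free(pwq_cache, pwq);
            }
            epi->pwqlist = NULL;
    }
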
2020-09-25 | ep_create_wakeup_source(): dentry name can change under you... | Al Viro | 1 | -3/+4

or get freed, for that matter, if it's a long (separately stored) name.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
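The usual cure is to take a stable snapshot of the name rather than dereferencing d_name directly; a sketch along those lines (not the exact committed hunk):

    struct name_snapshot n;
    struct wakeup_source *ws;

    /* copy the dentry name under the proper locks, then use the copy;
     * the snapshot keeps a long name alive until it is released */
    take_dentry_name_snapshot(&n, epi->ffd.file->f_path.dentry);
    ws = wakeup_source_register(NULL, n.name.name);
    release_dentry_name_snapshot(&n);
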
2020-09-11 | epoll: EPOLL_CTL_ADD: close the race in decision to take fast path | Al Viro | 1 | -0/+1

Checking for the lack of epitems referring to the epoll we want to insert into is not enough; we might have an insertion of that epoll into another one that has already collected the set of files to recheck for excessive reverse paths, but hasn't gotten to creating/inserting the epitem for it.

However, any such insertion in progress can be detected - it will update the generation count in our epoll when it's done looking through it for files to check. That gets done under ->mtx of our epoll and that allows us to detect that safely.

We are *not* holding epmutex here, so the generation count is not stable. However, since both the update of ep->gen by loop check and (later) insertion into ->f_ep_link are done with ep->mtx held, we are fine - the sequence is

    grab epmutex
    bump loop_check_gen
    ...
    grab tep->mtx           // 1
    tep->gen = loop_check_gen
    ...
    drop tep->mtx           // 2
    ...
    grab tep->mtx           // 3
    ...
    insert into ->f_ep_link
    ...
    drop tep->mtx           // 4
    bump loop_check_gen
    drop epmutex

and if the fastpath check in another thread happens for that eventpoll, it can come
* before (1) - in that case fastpath is just fine
* after (4) - we'll see non-empty ->f_ep_link, slow path taken
* between (2) and (3) - loop_check_gen is stable, with ->mtx providing barriers, and we end up taking the slow path.

Note that the ->f_ep_link emptiness check is slightly racy - we are protected against insertions into that list, but removals can happen right under us. Not a problem - in the worst case we'll end up taking a slow path for no good reason.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-10 | epoll: replace ->visited/visited_list with generation count | Al Viro | 1 | -19/+8

removes the need to clear it, along with the races.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
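The generation-count idiom replaces per-node 'visited' flags that have to be cleared after every walk; a sketch of the pattern (not the exact diff), assuming loop_check_gen is bumped once per loop-check pass under epmutex:

    static u64 loop_check_gen;      /* bumped once per check, under epmutex */

    static int ep_loop_check_proc(struct eventpoll *ep, int depth)
    {
            if (ep->gen == loop_check_gen)
                    return 0;               /* already seen in this pass */
            ep->gen = loop_check_gen;       /* mark visited; never needs clearing */
            /* ... walk the epitems watched by this epoll ... */
            return 0;
    }
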
2020-09-10 | epoll: do not insert into poll queues until all sanity checks are done | Al Viro | 1 | -19/+18

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-02 | fix regression in "epoll: Keep a reference on files added to the check list" | Al Viro | 1 | -3/+3

epoll_loop_check_proc() can run into a file already committed to destruction; we can't grab a reference on those and don't need to add them to the set for reverse path check anyway.

Tested-by: Marc Zyngier <maz@kernel.org>
Fixes: a9ed4a6560b8 ("epoll: Keep a reference on files added to the check list")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-23 | do_epoll_ctl(): clean the failure exits up a bit | Al Viro | 1 | -13/+6

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-23 | epoll: Keep a reference on files added to the check list | Marc Zyngier | 1 | -2/+9

When adding a new fd to an epoll, and that new fd is an epoll fd itself, we recursively scan the fds attached to it to detect cycles, and add non-epoll files to a "check list" that gets subsequently parsed.

However, this check list isn't completely safe when deletions can happen concurrently. To sidestep the issue, make sure that a struct file placed on the check list sees its f_count increased, ensuring that a concurrent deletion won't result in the file disappearing from under our feet.

Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
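Pinning a file for the lifetime of a temporary list is the standard get_file()/fput() pairing; roughly (illustrative, not the exact hunk):

    /* take a reference before parking the file on the check list ... */
    get_file(epi->ffd.file);
    list_add(&epi->ffd.file->f_tfile_llink, &tfile_check_list);

    /* ... and drop it when the check list is torn down */
    list_del_init(&file->f_tfile_llink);
    fput(file);
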
2020-05-14 | epoll: call final ep_events_available() check under the lock | Roman Penyaev | 1 | -20/+28

There is a possible race when ep_scan_ready_list() leaves ->rdllist and ->ovflist empty for a short period of time although some events are pending. It is quite likely that ep_events_available() observes empty lists and goes to sleep.

Since commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") we are conservative in wakeups (there is only one place for wakeup and this is ep_poll_callback()), thus ep_events_available() must always observe the correct state of the two lists. The easiest and correct way is to do the final check under the lock. This does not impact the performance, since the lock is taken anyway for adding a wait entry to the wait queue.

The discussion of the problem can be found here:

  https://lore.kernel.org/linux-fsdevel/a2f22c3c-c25a-4bda-8339-a7bdaf17849e@akamai.com/

In this patch the barrierless __set_current_state() is used. This is safe since waitqueue_active() is called under the same lock on the wakeup side.

The short-circuit for fatal signals (i.e. the fatal_signal_pending() check) is moved to the line just before the actual events-harvesting routine. This is fully compliant with what is said in the comment of the patch where the actual fatal_signal_pending() check was added: c257a340ede0 ("fs, epoll: short circuit fetching events if thread has been killed").

Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
Reported-by: Jason Baron <jbaron@akamai.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jason Baron <jbaron@akamai.com>
Cc: Khazhismel Kumykov <khazhy@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200505145609.1865152-1-rpenyaev@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
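The shape of the fix in ep_poll(): do the last emptiness check while already holding the lock needed to queue the waiter (a sketch, not the exact hunk; ep->lock is an rwlock at this point in the code's history):

    write_lock_irq(&ep->lock);
    /* recheck under the same lock that ep_poll_callback() takes, so a
     * concurrent wakeup cannot slip in between the check and the sleep */
    eavail = ep_events_available(ep);
    if (!eavail) {
            __set_current_state(TASK_INTERRUPTIBLE);  /* barrierless is fine here */
            __add_wait_queue_exclusive(&ep->wq, &wait);
    }
    write_unlock_irq(&ep->lock);
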
2020-05-08 | epoll: atomically remove wait entry on wake up | Roman Penyaev | 1 | -19/+24

This patch does two things:

- fixes a lost wakeup introduced by commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
- improves performance for events delivery

The description of the problem is the following: if N (>1) threads are waiting on ep->wq for new events and M (>1) events come, it is quite likely that >1 wakeups hit the same wait queue entry, because there is quite a big window between __add_wait_queue_exclusive() and the following __remove_wait_queue() calls in the ep_poll() function. This can lead to lost wakeups, because the thread which was woken up may not handle all the events in ->rdllist. (In better words the problem is described here: https://lkml.org/lkml/2019/10/7/905)

The idea of the current patch is to use init_wait() instead of init_waitqueue_entry(). Internally init_wait() sets autoremove_wake_function as a callback, which removes the wait entry atomically (under the wq locks) from the list; thus the next coming wakeup hits the next wait entry in the wait queue, preventing lost wakeups. The problem is very well reproduced by the epoll60 test case [1].

Wait entry removal on wakeup also has performance benefits, because there is no need to take ep->lock and remove the wait entry from the queue after a successful wakeup. Here is the timing output of the epoll60 test case:

  With explicit wakeup from ep_scan_ready_list() (the state of the code prior to 339ddb53d373):

    real    0m6.970s
    user    0m49.786s
    sys     0m0.113s

  After this patch:

    real    0m5.220s
    user    0m36.879s
    sys     0m0.019s

The other testcase is stress-epoll [2], where one thread consumes all the events and other threads produce many events:

  With explicit wakeup from ep_scan_ready_list() (the state of the code prior to 339ddb53d373):

    threads  events/ms  run-time ms
          8       5427         1474
         16       6163         2596
         32       6824         4689
         64       7060         9064
        128       6991        18309

  After this patch:

    threads  events/ms  run-time ms
          8       5598         1429
         16       7073         2262
         32       7502         4265
         64       7640         8376
        128       7634        16767

("events/ms" represents event bandwidth, thus higher is better; "run-time ms" represents overall time spent doing the benchmark, thus lower is better)

[1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
[2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jason Baron <jbaron@akamai.com>
Cc: Khazhismel Kumykov <khazhy@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Heiher <r@hev.cc>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
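The core of the change is which initializer ep_poll() uses for its on-stack wait entry (a sketch of the two alternatives):

    wait_queue_entry_t wait;

    /* init_wait() installs autoremove_wake_function, which deletes the
     * entry from the queue under the waitqueue lock at wakeup time, so a
     * second wakeup proceeds to the next exclusive waiter instead of
     * hitting an already-woken entry */
    init_wait(&wait);

    /* versus the old way, whose default_wake_function leaves the entry
     * queued until the sleeper removes it itself: */
    init_waitqueue_entry(&wait, current);
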
2020-05-08 | eventpoll: fix missing wakeup for ovflist in ep_poll_callback | Khazhismel Kumykov | 1 | -9/+9

In the event that we add to ovflist, before commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") we would be woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback. With that wakeup removed, if we add to ovflist here, we may never wake up. Rather than adding back the ep_scan_ready_list wakeup - which was resulting in unnecessary wakeups - trigger a wake-up in ep_poll_callback.

We noticed that one of our workloads was missing wakeups starting with 339ddb53d373 and upon manual inspection, this wakeup seemed missing to me. With this patch added, we no longer see missing wakeups. I haven't yet tried to make a small reproducer, but the existing kselftests in filesystem/epoll passed for me with this patch.

[khazhy@google.com: use if/elif instead of goto + cleanup suggested by Roman]
Link: http://lkml.kernel.org/r/20200424190039.192373-1-khazhy@google.com
Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Roman Penyaev <rpenyaev@suse.de>
Cc: Heiher <r@hev.cc>
Cc: Jason Baron <jbaron@akamai.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200424025057.118641-1-khazhy@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 | fs/epoll: make nesting accounting safe for -rt kernel | Jason Baron | 1 | -21/+43

Davidlohr Bueso pointed out that when CONFIG_DEBUG_LOCK_ALLOC is set, ep_poll_safewake() can take several non-raw spinlocks after disabling interrupts. Since a spinlock can block in the -rt kernel, we can't take a spinlock after disabling interrupts. So let's re-work how we determine the nesting level such that it plays nicely with the -rt kernel.

Let's introduce a 'nests' field in struct eventpoll that records the current nesting level during ep_poll_callback(). Then, if we nest again we can find the previous struct eventpoll that we were called from and increase our count by 1. The 'nests' field is protected by ep->poll_wait.lock.

I've also moved the visited field to reduce the size of struct eventpoll from 184 bytes to 176 bytes on x86_64 for !CONFIG_DEBUG_LOCK_ALLOC, which is typical for a production config.

Reported-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Roman Penyaev <rpenyaev@suse.de>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/1582739816-13167-1-git-send-email-jbaron@akamai.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
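A simplified sketch of the bookkeeping described above (illustrative; the committed code is more careful about how the parent eventpoll is located):

    /* 'nests' lives in struct eventpoll, protected by poll_wait.lock */
    static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi)
    {
            unsigned long flags;
            u8 nests = 0;

            /* if invoked from another epoll's callback, continue one
             * level deeper than the nesting recorded in that epoll */
            if (epi && is_file_epoll(epi->ffd.file))
                    nests = ((struct eventpoll *)
                             epi->ffd.file->private_data)->nests;

            spin_lock_irqsave_nested(&ep->poll_wait.lock, flags, nests);
            ep->nests = nests + 1;
            wake_up_locked_poll(&ep->poll_wait, EPOLLIN);
            ep->nests = 0;
            spin_unlock_irqrestore(&ep->poll_wait.lock, flags);
    }
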
2020-03-22 | epoll: fix possible lost wakeup on epoll_ctl() path | Roman Penyaev | 1 | -4/+4

This fixes a possible lost wakeup introduced by commit a218cc491420. Originally modifications to ep->wq were serialized by ep->wq.lock, but in commit a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention") a new rw lock was introduced in order to relax the fd event path, i.e. callers of the ep_poll_callback() function.

After the change ep_modify and ep_insert (both are called on the epoll_ctl() path) were switched to ep->lock, but ep_poll (epoll_wait) was still using ep->wq.lock on wqueue list modification.

The bug doesn't lead to any wqueue list corruptions, because the wake up path and the list modifications were serialized by ep->wq.lock internally, but the actual waitqueue_active() check prior to the wake_up() call can be reordered with modifications of the ep ready list, thus a wake up can be lost.

And yes, it can be healed by an explicit smp_mb():

    list_add_tail(&epi->rdllink, &ep->rdllist);
    smp_mb();
    if (waitqueue_active(&ep->wq))
            wake_up(&ep->wq);

But let's make it simple; thus the current patch replaces ep->wq.lock with ep->lock for wqueue modifications, so the wake up path always observes activeness of the wqueue correctly.

Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
Reported-by: Max Neunhoeffer <max@arangodb.com>
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Max Neunhoeffer <max@arangodb.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Christopher Kohlhoff <chris.kohlhoff@clearpool.io>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jes Sorensen <jes.sorensen@gmail.com>
Cc: <stable@vger.kernel.org> [5.1+]
Link: http://lkml.kernel.org/r/20200214170211.561524-1-rpenyaev@suse.de
References: https://bugzilla.kernel.org/show_bug.cgi?id=205933
Bisected-by: Max Neunhoeffer <max@arangodb.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-30 | eventpoll: support non-blocking do_epoll_ctl() calls | Jens Axboe | 1 | -13/+33

Also make it available outside of epoll, along with the helper that decides if we need to copy the passed-in epoll_event.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-30 | eventpoll: abstract out epoll_ctl() handler | Jens Axboe | 1 | -20/+25

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-05 | fs/epoll: remove unnecessary wakeups of nested epoll | Heiher | 1 | -16/+0

Take the case where we have:

    t0
     |  (ew)
    e0
     |  (et)
    e1
     |  (lt)
    s0

    t0: thread 0
    e0: epoll fd 0
    e1: epoll fd 1
    s0: socket fd 0
    ew: epoll_wait
    et: edge-trigger
    lt: level-trigger

We remove unnecessary wakeups to prevent the nested epoll that works in edge-triggered mode from waking up continuously.

Test code:

    #include <unistd.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>

    int main(int argc, char *argv[])
    {
            int sfd[2];
            int efd[2];
            struct epoll_event e;

            if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
                    goto out;

            efd[0] = epoll_create(1);
            if (efd[0] < 0)
                    goto out;

            efd[1] = epoll_create(1);
            if (efd[1] < 0)
                    goto out;

            e.events = EPOLLIN;
            if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0)
                    goto out;

            e.events = EPOLLIN | EPOLLET;
            if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
                    goto out;

            if (write(sfd[1], "w", 1) != 1)
                    goto out;

            if (epoll_wait(efd[0], &e, 1, 0) != 1)
                    goto out;

            if (epoll_wait(efd[0], &e, 1, 0) != 0)
                    goto out;

            close(efd[0]);
            close(efd[1]);
            close(sfd[0]);
            close(sfd[1]);

            return 0;
    out:
            return -1;
    }

More tests: https://github.com/heiher/epoll-wakeup

Link: http://lkml.kernel.org/r/20191009060516.3577-1-r@hev.cc
Signed-off-by: hev <r@hev.cc>
Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-05 | epoll: simplify ep_poll_safewake() for CONFIG_DEBUG_LOCK_ALLOC | Jason Baron | 1 | -23/+13

Currently, ep_poll_safewake() in the CONFIG_DEBUG_LOCK_ALLOC case uses ep_call_nested() in order to pass the correct subclass argument to spin_lock_irqsave_nested(). However, ep_call_nested() adds unnecessary checks for epoll depth and loops that are already verified when doing EPOLL_CTL_ADD. This mirrors a conversion that was done for !CONFIG_DEBUG_LOCK_ALLOC in: commit 37b5e5212a44 ("epoll: remove ep_call_nested() from ep_eventpoll_poll()")

Link: http://lkml.kernel.org/r/1567628549-11501-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron <jbaron@akamai.com>
Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-21 | PM / wakeup: Show wakeup sources stats in sysfs | Tri Vo | 1 | -2/+2

Add an ID and a device pointer to 'struct wakeup_source'. Use them to expose wakeup sources statistics in sysfs under /sys/class/wakeup/wakeup<ID>/*.

Co-developed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Co-developed-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Tri Vo <trong@android.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-07-19 | proc/sysctl: add shared variables for range check | Matteo Croce | 1 | -2/+2

In the sysctl code the proc_dointvec_minmax() function is often used to validate the user-supplied value against an allowed range. This function uses the extra1 and extra2 members from struct ctl_table as minimum and maximum allowed values.

On sysctl handler declaration, in every source file there are some read-only variables containing just an integer whose address is assigned to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range boundaries, leading to duplication of variables like zero=0, one=1, int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' | wc -l
    248

Add a const int array containing the most commonly used values, some macros to refer more easily to the correct array member, and use them instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                                         old     new   delta
    sysctl_vals                                    -      12     +12
    __kstrtab_sysctl_vals                          -      12     +12
    max                                           14      10      -4
    int_max                                       16       -     -16
    one                                           68       -     -68
    zero                                         128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
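Usage then looks roughly like this in any ctl_table (a sketch; the table and knob are invented for illustration, while SYSCTL_ZERO/SYSCTL_ONE are the macros this series introduces to index the shared sysctl_vals array):

    static int some_knob;                       /* illustrative variable */

    static struct ctl_table sample_table[] = {
            {
                    .procname     = "some_knob",
                    .data         = &some_knob,
                    .maxlen       = sizeof(int),
                    .mode         = 0644,
                    .proc_handler = proc_dointvec_minmax,
                    .extra1       = SYSCTL_ZERO,   /* shared, no local zero=0 */
                    .extra2       = SYSCTL_ONE,    /* shared, no local one=1 */
            },
            { }
    };
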
2019-07-17 | signal: simplify set_user_sigmask/restore_user_sigmask | Oleg Nesterov | 1 | -8/+4

task->saved_sigmask and ->restore_sigmask are only used in the ret-from-syscall paths. This means that set_user_sigmask() can save ->blocked in ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked was modified.

This way the callers do not need 2 sigset_t's passed to set/restore and restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns into the trivial helper which just calls restore_saved_sigmask().

Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29 | signal: remove the wrong signal_pending() check in restore_user_sigmask() | Oleg Nesterov | 1 | -2/+2

This is the minimal fix for stable, I'll send cleanups later.

Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced the visible change which breaks user-space: a signal temporarily unblocked by set_user_sigmask() can be delivered even if the caller returns success or timeout.

Change restore_user_sigmask() to accept the additional "interrupted" argument which should be used instead of the signal_pending() check, and update the callers.

Eric said:

 : For clarity. I don't think this is required by posix, or fundamentally
 : to remove the races in select. It is what linux has always done and we
 : have applications who care so I agree this fix is needed.
 :
 : Further in any case where the semantic change that this patch rolls
 : back (aka where allowing a signal to be delivered and the select like
 : call to complete) would be advantage we can do as well if not better by
 : using signalfd.
 :
 : Michael is there any chance we can get this guarantee of the linux
 : implementation of pselect and friends clearly documented. The guarantee
 : that if the system call completes successfully we are guaranteed that
 : no signal that is unblocked by using sigmask will be delivered?

Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Eric Wong <e@80x24.org>
Tested-by: Eric Wong <e@80x24.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-30 | treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 | Thomas Gleixner | 1 | -6/+1

Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-08 | epoll: use rwlock in order to reduce ep_poll_callback() contention | Roman Penyaev | 1 | -36/+122

The goal of this patch is to reduce contention of ep_poll_callback(), which can be called concurrently from different CPUs in case of high event rates and many fds per epoll. The problem can be very well reproduced by generating events (write to pipe or eventfd) from many threads, while the consumer thread does polling. In other words this patch increases the bandwidth of events which can be delivered from sources to the poller by adding poll items in a lockless way to the list.

The main change is the replacement of the spinlock with a rwlock, which is taken on read in ep_poll_callback(), and then by adding poll items to the tail of the list using the xchg atomic instruction. The write lock is taken everywhere else in order to stop list modifications and guarantee that list updates are fully completed (I assume that the write side of a rwlock does not starve; it seems the qrwlock implementation has these guarantees).

The following are some microbenchmark results based on the test [1] which starts threads which generate N events each. The test ends when all events are successfully fetched by the poller thread:

    spinlock
    ========
    threads  events/ms  run-time ms
          8       6402        12495
         16       7045        22709
         32       7395        43268

    rwlock + xchg
    =============
    threads  events/ms  run-time ms
          8      10038         7969
         16      12178        13138
         32      13223        24199

According to the results the bandwidth of delivered events is significantly increased, thus execution time is reduced.

This patch was tested with different sorts of microbenchmarks and artificial delays (e.g. "udelay(get_random_int() & 0xff)") introduced in the kernel on paths where items are added to lists.

[1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c

Link: http://lkml.kernel.org/r/20190103150104.17128-5-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
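The lockless tail add the message describes can be sketched as follows (simplified; safe only under the stated assumption that concurrent adders hold the rwlock for read while removals hold it for write):

    /* illustrative lockless add-to-tail; 'new' must not already be queued */
    static inline bool list_add_tail_lockless(struct list_head *new,
                                              struct list_head *head)
    {
            struct list_head *prev;

            /* claim the entry: new->next points to itself when unqueued */
            if (cmpxchg(&new->next, new, head) != new)
                    return false;           /* lost the race; already queued */

            /* swap ourselves in as the tail, then stitch the links */
            prev = xchg(&head->prev, new);
            prev->next = new;
            new->prev = prev;
            return true;
    }
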
2019-03-08 | epoll: unify awaking of wakeup source on ep_poll_callback() path | Roman Penyaev | 1 | -8/+1

The original comment "Activate ep->ws since epi->ws may get deactivated at any time" indeed sounds loud, but it is incorrect, because the path where we check epi->ws is a path where insert to ovflist happens, i.e. ep_scan_ready_list() has taken ep->mtx and waits for this callback to finish, thus ep_modify() (which unregisters the wakeup source) waits for ep_scan_ready_list().

Here in this patch I simply call ep_pm_stay_awake_rcu(), which is a bit extra for this path (indirectly protected by the main ep->mtx, so even rcu is not needed), but I do not want to create another naked __ep_pm_stay_awake() variant only for this particular case, so the rcu variant is just better for all the cases.

Link: http://lkml.kernel.org/r/20190103150104.17128-4-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-08 | epoll: make sure all elements in ready list are in FIFO order | Roman Penyaev | 1 | -1/+5

Patch series "use rwlock in order to reduce ep_poll_callback() contention", v3.

The last patch targets the contention problem in ep_poll_callback(), which can be very well reproduced by generating events (write to pipe or eventfd) from many threads, while the consumer thread does polling.

The following are some microbenchmark results based on the test [1] which starts threads which generate N events each. The test ends when all events are successfully fetched by the poller thread:

    spinlock
    ========
    threads  events/ms  run-time ms
          8       6402        12495
         16       7045        22709
         32       7395        43268

    rwlock + xchg
    =============
    threads  events/ms  run-time ms
          8      10038         7969
         16      12178        13138
         32      13223        24199

According to the results the bandwidth of delivered events is significantly increased, thus execution time is reduced.

This patch (of 4):

All coming events are stored in FIFO order, and this should also apply to ->ovflist, which originally is a stack, i.e. LIFO. Thus to keep correct FIFO order, ->ovflist should be reversed by adding elements to the head of the ready list rather than to the tail.

[1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c

Link: http://lkml.kernel.org/r/20190103150104.17128-2-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
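Since ->ovflist collects entries newest-first, splicing it back by pushing each entry onto the HEAD of ->rdllist reverses it again and restores FIFO order; in sketch form (loosely following the splice loop in ep_scan_ready_list, details elided):

    /* ovflist is LIFO (newest first); list_add() to the head of rdllist
     * reverses it once more, so the ready list comes out oldest-first */
    for (nepi = ep->ovflist; (epi = nepi) != NULL;
         nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
            if (!ep_is_linked(&epi->rdllink))
                    list_add(&epi->rdllink, &ep->rdllist);
    }
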
2019-01-05 | Merge branch 'akpm' (patches from Andrew) | Linus Torvalds | 1 | -100/+118

Merge more updates from Andrew Morton:

- procfs updates
- various misc bits
- lib/ updates
- epoll updates
- autofs
- fatfs
- a few more MM bits

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (58 commits)
    mm/page_io.c: fix polled swap page in
    checkpatch: add Co-developed-by to signature tags
    docs: fix Co-Developed-by docs
    drivers/base/platform.c: kmemleak ignore a known leak
    fs: don't open code lru_to_page()
    fs/: remove caller signal_pending branch predictions
    mm/: remove caller signal_pending branch predictions
    arch/arc/mm/fault.c: remove caller signal_pending_branch predictions
    kernel/sched/: remove caller signal_pending branch predictions
    kernel/locking/mutex.c: remove caller signal_pending branch predictions
    mm: select HAVE_MOVE_PMD on x86 for faster mremap
    mm: speed up mremap by 20x on large regions
    mm: treewide: remove unused address argument from pte_alloc functions
    initramfs: cleanup incomplete rootfs
    scripts/gdb: fix lx-version string output
    kernel/kcov.c: mark write_comp_data() as notrace
    kernel/sysctl: add panic_print into sysctl
    panic: add options to print system info when panic happens
    bfs: extra sanity checking and static inode bitmap
    exec: separate MM_ANONPAGES and RLIMIT_STACK accounting
    ...
2019-01-05 | fs/epoll: deal with wait_queue only once | Davidlohr Bueso | 1 | -11/+18

There is no reason why we rearm the waitqueue upon every fetch_events retry (for when events are found yet send_events() fails). If nothing else, this saves four lock operations per retry, and furthermore reduces the scope of the lock even further.

[akpm@linux-foundation.org: restore code to original position, fix and reflow comment]
Link: http://lkml.kernel.org/r/20181114182532.27981-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
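A sketch of the resulting loop structure in ep_poll() (simplified; the waiter stays armed across send_events() retries instead of being re-added each pass):

    bool waiter = false;
    wait_queue_entry_t wait;

    fetch_events:
            /* ... availability checks ... */
            if (!waiter) {
                    /* arm the waitqueue once; re-arming on every retry
                     * costs two lock/unlock pairs each time around */
                    waiter = true;
                    init_waitqueue_entry(&wait, current);
                    spin_lock_irq(&ep->wq.lock);
                    __add_wait_queue_exclusive(&ep->wq, &wait);
                    spin_unlock_irq(&ep->wq.lock);
            }
            /* ... sleep, harvest events, maybe goto fetch_events ... */
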
2019-01-05 | fs/epoll: rename check_events label to send_events | Davidlohr Bueso | 1 | -3/+3

It is currently called check_events because it, well, did exactly that. However, since the lockless ep_events_available() call, the label no longer checks, but just sends the events. Rename as such.

Link: http://lkml.kernel.org/r/20181114182532.27981-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-05 | fs/epoll: avoid barrier after an epoll_wait(2) timeout | Davidlohr Bueso | 1 | -2/+6

Upon timeout, we can just exit out of the loop, without the cost of changing the task's state with an smp_store_mb call. Just exit out of the loop and be done - setting the task state afterwards will be, of course, redundant.

[dave@stgolabs.net: forgotten fixlets]
Link: http://lkml.kernel.org/r/20181109155258.jxcr4t2pnz6zqct3@linux-r8p5
Link: http://lkml.kernel.org/r/20181108051006.18751-7-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-05 | fs/epoll: reduce the scope of wq lock in epoll_wait() | Davidlohr Bueso | 1 | -54/+60

This patch aims at reducing ep wq.lock hold times in epoll_wait(2). For the blocking case, there is no need to constantly take and drop the spinlock, which is only needed to manipulate the waitqueue.

The call to ep_events_available() is now lockless, and only exposed to benign races. Here, if false positive (returns available events and does not see another thread deleting an epi from the list) we call into send_events and then the list's state is correctly seen. Otoh, if a false negative and we don't see a list_add_tail(), for example, from irq callback, then it is rechecked again before blocking, which will see the correct state.

In order for more accuracy to see concurrent list_del_init(), use the list_empty_careful() variant - of course, this won't be safe against insertions from wakeup.

For the overflow list we obviously need to prevent load/store tearing as we don't want to see partial values while the ready list is disabled.

[dave@stgolabs.net: forgotten fixlets]
Link: http://lkml.kernel.org/r/20181109155258.jxcr4t2pnz6zqct3@linux-r8p5
Link: http://lkml.kernel.org/r/20181108051006.18751-6-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Suggested-by: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
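The lockless availability check this describes ends up shaped roughly like (sketch):

    /* called locklessly from ep_poll(); tolerates the benign races
     * described above, and READ_ONCE() prevents load tearing on ovflist */
    static inline int ep_events_available(struct eventpoll *ep)
    {
            return !list_empty_careful(&ep->rdllist) ||
                   READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
    }
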