summaryrefslogtreecommitdiff
path: root/drivers
AgeCommit message (Collapse)AuthorFilesLines
2017-10-31bcache: explicitly destroy mutex while exitingLiang Chen1-2/+4
mutex_destroy does nothing most of time, but it's better to call it to make the code future proof and it also has some meaning for like mutex debug. As Coly pointed out in a previous review, bcache_exit() may not be able to handle all the references properly if userspace registers cache and backing devices right before bch_debug_init runs and bch_debug_init failes later. So not exposing userspace interface until everything is ready to avoid that issue. Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-31bcache: fix wrong cache_misses statisticstang.junhui1-1/+5
Currently, Cache missed IOs are identified by s->cache_miss, but actually, there are many situations that missed IOs are not assigned a value for s->cache_miss in cached_dev_cache_miss(), for example, a bypassed IO (s->iop.bypass = 1), or the cache_bio allocate failed. In these situations, it will go to out_put or out_submit, and s->cache_miss is null, which leads bch_mark_cache_accounting() to treat this IO as a hit IO. [ML: applied by 3-way merge] Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-31bcache: update bucket_in_use in real timeTang Junhui4-8/+22
bucket_in_use is updated in gc thread which triggered by invalidating or writing sectors_to_gc dirty data, It's a long interval. Therefore, when we use it to compare with the threshold, it is often not timely, which leads to inaccurate judgment and often results in bucket depletion. We have send a patch before, by the means of updating bucket_in_use periodically In gc thread, which Coly thought that would lead high latency, In this patch, we add avail_nbuckets to record the count of available buckets, and we calculate bucket_in_use when alloc or free bucket in real time. [edited by ML: eliminated some whitespace errors] Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-31bcache: convert cached_dev.count from atomic_t to refcount_tElena Reshetova3-7/+8
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable cached_dev.count is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-31bcache: only permit to recovery read error when cache device is cleanColy Li1-1/+9
When bcache does read I/Os, for example in writeback or writethrough mode, if a read request on cache device is failed, bcache will try to recovery the request by reading from cached device. If the data on cached device is not synced with cache device, then requester will get a stale data. For critical storage system like database, providing stale data from recovery may result an application level data corruption, which is unacceptible. With this patch, for a failed read request in writeback or writethrough mode, recovery a recoverable read request only happens when cache device is clean. That is to say, all data on cached device is up to update. For other cache modes in bcache, read request will never hit cached_dev_read_error(), they don't need this patch. Please note, because cache mode can be switched arbitrarily in run time, a writethrough mode might be switched from a writeback mode. Therefore checking dc->has_data in writethrough mode still makes sense. Changelog: V4: Fix parens error pointed by Michael Lyle. v3: By response from Kent Oversteet, he thinks recovering stale data is a bug to fix, and option to permit it is unnecessary. So this version the sysfs file is removed. v2: rename sysfs entry from allow_stale_data_on_failure to allow_stale_data_on_failure, and fix the confusing commit log. v1: initial patch posted. [small change to patch comment spelling by mlyle] Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Michael Lyle <mlyle@lyle.org> Reported-by: Arne Wolf <awolf@lenovo.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Nix <nix@esperi.org.uk> Cc: Kai Krakow <hurikhan77@gmail.com> Cc: Eric Wheeler <bcache@lists.ewheeler.net> Cc: Junhui Tang <tang.junhui@zte.com.cn> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-24lightnvm: pblk: remove leftover testing functionJavier González1-5/+0
A previous patch inadvertently left an unused test function in the header, kill it. Fixes: 8bd400204bd5 ("lightnvm: pblk: cleanup unused and static functions") Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-17nullb: fix error return code in null_init()Wei Yongjun1-1/+3
Fix to return error code -ENOMEM from the null_alloc_dev() error handling case instead of 0, as done elsewhere in this function. Fixes: 2984c8684f96 ("nullb: factor disk parameters") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16bcache: writeback rate clamping: make 32 bit safeMichael Lyle1-2/+2
Sorry this got through to linux-block, was detected by the kbuilds test robot. NSEC_PER_SEC is a long constant; 2.5 * 10^9 doesn't fit in a signed long constant. Fixes: e41166c5c44e ("bcache: writeback rate shouldn't artifically clamp") Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16bcache: safeguard a dangerous addressing in closure_queueLiang Chen1-0/+6
The use of the union reduces the size of closure struct by taking advantage of the current size of its members. The offset of func in work_struct equals the size of the first three members, so that work.work_func will just reference the forth member - fn. This is smart but dangerous. It can be broken if work_struct or the other structs get changed, and can be a bit difficult to debug. Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16bcache: rearrange writeback main thread ratelimitMichael Lyle1-1/+4
The time spent searching for things to write back "counts" for the actual rate achieved, so don't flush the accumulated rate with each chunk. This will maintain better fidelity to user-commanded rates, but it may slightly increase the burstiness of writeback. The writeback lock needs improvement to help mitigate this. Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16bcache: writeback rate shouldn't artifically clampMichael Lyle3-6/+7
The previous code artificially limited writeback rate to 1000000 blocks/second (NSEC_PER_MSEC), which is a rate that can be met on fast hardware. The rate limiting code works fine (though with decreased precision) up to 3 orders of magnitude faster, so use NSEC_PER_SEC. Additionally, ensure that uint32_t is used as a type for rate throughout the rate management so that type checking/clamp_t can work properly. bch_next_delay should be rewritten for increased precision and better handling of high rates and long sleep periods, but this is adequate for now. Signed-off-by: Michael Lyle <mlyle@lyle.org> Reported-by: Coly Li <colyli@suse.de> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16bcache: smooth writeback rate controlMichael Lyle2-3/+9
This works in conjunction with the new PI controller. Currently, in real-world workloads, the rate controller attempts to write back 1 sector per second. In practice, these minimum-rate writebacks are between 4k and 60k in test scenarios, since bcache aggregates and attempts to do contiguous writes and because filesystems on top of bcachefs typically write 4k or more. Previously, bcache used to guarantee to write at least once per second. This means that the actual writeback rate would exceed the configured amount by a factor of 8-120 or more. This patch adjusts to be willing to sleep up to 2.5 seconds, and to target writing 4k/second. On the smallest writes, it will sleep 1 second like before, but many times it will sleep longer and load the backing device less. This keeps the loading on the cache and backing device related to writeback more consistent when writing back at low rates. Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16bcache: implement PI controller for writeback rateMichael Lyle3-52/+66
bcache uses a control system to attempt to keep the amount of dirty data in cache at a user-configured level, while not responding excessively to transients and variations in write rate. Previously, the system was a PD controller; but the output from it was integrated, turning the Proportional term into an Integral term, and turning the Derivative term into a crude Proportional term. Performance of the controller has been uneven in production, and it has tended to respond slowly, oscillate, and overshoot. This patch set replaces the current control system with an explicit PI controller and tuning that should be correct for most hardware. By default, it attempts to write at a rate that would retire 1/40th of the current excess blocks per second. An integral term in turn works to remove steady state errors. IMO, this yields benefits in simplicity (removing weighted average filtering, etc) and system performance. Another small change is a tunable parameter is introduced to allow the user to specify a minimum rate at which dirty blocks are retired. There is a slight difference from earlier versions of the patch in integral handling to prevent excessive negative integral windup. Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16bcache: don't write back data if reading it failedMichael Lyle1-6/+14
If an IO operation fails, and we didn't successfully read data from the cache, don't writeback invalid/partial data to the backing disk. Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16bcache: remove unused parameterYijing Wang1-6/+6
Parameter bio is no longer used, clean it. Signed-off-by: Yijing Wang <wangyijing@huawei.com> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16bcache: update bio->bi_opf bypass/writeback REQ_ flag hintsEric Wheeler2-1/+11
Flag for bypass if the IO is for read-ahead or background, unless the read-ahead request is for metadata (eg, from gfs2). Bypass if: bio->bi_opf & (REQ_RAHEAD|REQ_BACKGROUND) && !(bio->bi_opf & REQ_META)) Writeback if: op_is_sync(bio->bi_opf) || bio->bi_opf & (REQ_META|REQ_PRIO) Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16bcache: Remove redundant set_capacityYijing Wang1-3/+0
set_capacity() has been called in bcache_device_init(), remove the redundant one. Signed-off-by: Yijing Wang <wangyijing@huawei.com> Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16bcache: rewrite multiple partitions supportColy Li1-12/+25
Current partition support of bcache is confusing and buggy. It tries to trace non-continuous device minor numbers by an ida bit string, and mistakenly mixed bcache device index with minor numbers. This design generates several negative results, - Index of bcache device name is not consecutive under /dev/. If there are 3 bcache devices, they name will be, /dev/bcache0, /dev/bcache16, /dev/bcache32 Only bcache code indexes bcache device name is such an interesting way. - First minor number of each bcache device is traced by ida bit string. One bcache device will occupy 16 bits, this is not a good idea. Indeed only one bit is enough. - Because minor number and bcache device index are mixed, a device index is allocated by ida_simple_get(), but an first minor number is sent into ida_simple_remove() to release the device. It confused original author too. Root cause of the above errors is, bcache code should not handle device minor numbers at all! A standard process to support multiple partitions in Linux kernel is, - Device driver provides major device number, and indexes multiple device instances. - Device driver does not allocat nor trace device minor number, only provides a first minor number of a given device instance, and sets how many minor numbers (paritions) the device instance may have. All rested stuffs are handled by block layer code, most of the details can be found from block/{genhd, partition-generic}.c files. This patch re-writes multiple partitions support for bcache. It makes whole things to be more clear, and uses ida bit string in a more efficeint way. - Ida bit string only traces bcache device index, not minor number. For a bcache device with 128 partitions, only one bit in ida bit string is enough. - Device minor number and device index are separated in concept. Device index is used for /dev node naming, and ida bit string trace. Minor number is calculated from device index and only used to initialize first_minor of a bcache device. - It does not follow any standard for 16 partitions on a bcache device. This patch sets 128 partitions on single bcache device at max, this is the limitation from GPT (GUID Partition Table) and supported by fdisk. Considering a typical device minor number is 20 bits width, each bcache device may have 128 partitions (7 bits), there can be 8192 bcache devices existing on system. For most common deployment for a single server in now days, it should be enough. [minor spelling fixes in commit message by Michael Lyle] Signed-off-by: Coly Li <colyli@suse.de> Cc: Eric Wheeler <bcache@lists.ewheeler.net> Cc: Junhui Tang <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16bcache: fix a comments typo in bch_alloc_sectors()Coly Li1-1/+1
Code comments in alloc.c:bch_alloc_sectors() mentions a function name find_data_bucket(), the correct function name should be pick_data_bucket() indeed. bch_alloc_sectors() is a quite important function in bcache allocation code, fixing the typo may help other people to have less confusion. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16bcache: check ca->alloc_thread initialized before wake up itColy Li1-1/+2
In bcache code, sysfs entries are created before all resources get allocated, e.g. allocation thread of a cache set. There is posibility for NULL pointer deference if a resource is accessed but which is not initialized yet. Indeed Jorg Bornschein catches one on cache set allocation thread and gets a kernel oops. The reason for this bug is, when bch_bucket_alloc() is called during cache set registration and attaching, ca->alloc_thread is not properly allocated and initialized yet, call wake_up_process() on ca->alloc_thread triggers NULL pointer deference failure. A simple and fast fix is, before waking up ca->alloc_thread, checking whether it is allocated, and only wake up ca->alloc_thread when it is not NULL. Signed-off-by: Coly Li <colyli@suse.de> Reported-by: Jorg Bornschein <jb@capsec.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-16bcache: Avoid nested function definitionPeter Foley1-4/+6
Fixes below error with clang: ../drivers/md/bcache/sysfs.c:759:3: error: function definition is not allowed here { return *((uint16_t *) r) - *((uint16_t *) l); } ^ ../drivers/md/bcache/sysfs.c:789:32: error: use of undeclared identifier 'cmp' sort(p, n, sizeof(uint16_t), cmp, NULL); ^ 2 errors generated. v2: rename function to __bch_cache_cmp Signed-off-by: Peter Foley <pefoley2@pefoley.com> Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13Revert "lightnvm: prevent bd removal if busy"Jens Axboe1-14/+0
Christoph correctly points out that this issue is no different for other block devices, and poking at cross layer internals is not the right way to solve it. This reverts commit bb6aa6f08268bbce4e0185b18cab9e04505d6695. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: implement generic path for sync I/OJavier González6-123/+140
Implement a generic path for sending sync I/O on LightNVM. This allows to reuse the standard synchronous path trough blk_execute_rq(), instead of implementing a wait_for_completion on the target side (e.g., pblk). Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: fail fast on passthrough commandsJavier González1-2/+0
Make LightNVM passhtrough commands fail fast. User space will then take care of re-submitting. Fixes: 84d4add793c6 ('lightnvm: add ioctls for vector I/Os') Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: avoid being reported as hung on rated GCJavier González1-1/+7
The amount of GC I/O on the write buffer is managed by the rate-limiter, which is calculated as a function of the number of available free blocks. When reaching the stable point, we risk having scheduled more I/Os for GC than are allowed on the write buffer. This would result on the GC semaphore balancing the outstanding read GC I/Os to be reported as "hung", though the behavior is normal. Solve this by allowing to schedule when we detect that the read GC path is not moving forward. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: cleanup unused and static functionsJavier González4-111/+86
Cleanup up unused and static functions across the whole codebase. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: remove spinlock when freeing line metadataHans Holmberg1-2/+0
Lockdep complains about being in atomic context while freeing line metadata - and rightly so as we take a spinlock and end up calling vfree that might sleep(in pblk_mfree). There is no need for holding the line manager free_lock while freeing line metadata as the pipeline as stopped, so remove the lock. Fixes: 588726d3ec68 ("lightnvm: pblk: fail gracefully on irrec. error") Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: correct valid lba count calculationHans Holmberg1-3/+4
During garbage collect, lbas being written can end up being invalidated. Make sure that this is reflected in the valid lba count. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: gc all lines in the pipeline before exitHans Holmberg4-9/+32
Finish garbage collect of the lines that are in the gc pipeline before exiting. Ensure that all lines already in in the pipeline goes through, from read to write. Do this by keeping track of how many lines are in the pipeline and waiting for that number to reach zero before exiting the gc reader task. Since we're adding a new gc line counter, change the name of inflight_gc to read_inflight_gc to make the distinction clear. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: add l2p crc debug printoutsHans Holmberg1-5/+34
Print the CRC of the logical-to-physical mapping during exit and after recovering the L2P table to facilitate detection of meta data corruption/recovery issues. The CRC printed after recovery should match the CRC printed during the previous exit - if it doesn't this indicates that either the meta data written to the disk is corrupt or recovery failed. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: shut down gc gracefully during exitHans Holmberg1-8/+12
Shut down the GC workqueues and tasks in the right order. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: consider bad sectors in emeta during recoveryHans Holmberg1-14/+30
When recovering lines we need to consider that bad blocks in a line affect the emeta area size. Previously it was assumed that the emeta area would grow by the number of sectors per page * number of bad blocks in the line. This assumption is not correct - the number of "extra" pages that are consumed could be both smaller (depending on emeta size) and bigger (depending on the placement of the bad blocks). Fix this by calculating the emeta start by iterating backwards through the line, skipping ppas that map to bad blocks. Also fix the data types used for ppa indices/counts in pblk_recov_l2p_from_emeta - we should use u64. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: start gc if needed during initHans Holmberg4-1/+12
Start GC if needed, directly after init, as we might need to garbage collect in order to make room for user writes. Create a helper function that allows to kick GC without exposing the internals of the GC/rate-limiter interaction. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: free full lines during recoveryHans Holmberg3-12/+24
When rebuilding the L2P table, any full lines (lines without any valid sectors) will be identified. If these lines are not freed, we risk not being able to allocate the first data line. This patch refactors the part of GC that frees empty lines into a separate function and adds a call to this after the L2P table has been rebuilt. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: recover partially written lines correctlyHans Holmberg1-4/+4
When recovering partially written lines, the valid sector count must be decreased by the number of padded sectors in the line. Update line recovery to take all ADDR_EMPTY(padded) sectors into account. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: prevent gc kicks when gc is not operationalHans Holmberg2-4/+6
GC can be kicked after it has been shut down when closing the last line during exit, resulting in accesses to freed structures. Make sure that GC is not triggered while it is not operational. Also make sure that GC won't be re-activated during exit when running on another processor by using timer_del_sync. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: fix releases of kmem cache in error pathRakesh Pandit1-5/+12
If pblk_core_init fails lets destroy all global caches. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: reduce arguments in __pblk_rb_update_l2pRakesh Pandit1-6/+5
We already pass the structure pointer so no need to pass the member. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: remove stale extern and unused exported symbolsRakesh Pandit1-65/+64
Not all exported symbols are being used outside core and there were some stale entries in lightnvm.h Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: remove unused argument from nvm_set_tgt_bb_tblRakesh Pandit1-17/+12
vblk isn't being used anyway and if we ever have a usecase we can introduce this again. This makes the logic easier and removes unnecessary checks. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: remove useless lineRakesh Pandit1-1/+0
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: fix changing GC group list for a lineRakesh Pandit1-2/+2
pblk_line_gc_list seems to had a bug since the introduction of pblk in getting GC list for a line. In b20ba1bc7 while redesigning the GC algorithm, the naming for the GC thresholds was altered, but the values for high_thrs and mid_thrs were not. The result is that when moving to the GC lists, the mid threshold is never evaluated. Fixes: a4bd217b4("lightnvm: physical block device (pblk) target") Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: ensure right bad block calculationJavier González1-1/+1
Make sure that the variable controlling block threshold for allocating extra metadata sectors in case of a line with bad blocks does not get a negative value. Otherwise, the line will be marked as corrupted and wasted. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: enable 1 LUN configurationJavier González4-19/+29
Metadata I/Os are scheduled to minimize their impact on user data I/Os. When there are enough LUNs instantiated (i.e., enough bandwidth), it is easy to interleave metadata and data one after the other so that metadata I/Os are the ones being blocked and not vice-versa. We do this by calculating the distance between the I/Os in terms of the LUNs that are not in used, and selecting a free LUN that satisfies a the simple heuristic that metadata is scheduled behind. The per-LUN semaphores guarantee consistency. This works fine on >1 LUN configuration. However, when a single LUN is instantiated, this design leads to a deadlock, where metadata waits to be scheduled on a free LUN. This patch implements the 1 LUN case by simply scheduling the metadada I/O after the data I/O. In the process, we refactor the way a line is replaced to ensure that metadata writes are submitted after data writes in order to guarantee block sequentiality. Note that, since there is only one LUN, both I/Os will block each other by design. However, such configuration only pursues tight read latencies, not write bandwidth. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: remove I/O dependency on write pathJavier González1-80/+65
pblk schedules user I/O, metadata I/O and erases on the write path in order to minimize collisions at the media level. Until now, there has been a dependency between user and metadata I/Os that could lead to a deadlock as both take the per-LUN semaphore to schedule submission. This path removes this dependency and guarantees forward progress at a per I/O granurality. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: remove redundant check on read pathJavier González1-2/+2
A partial read I/O in pblk is an I/O where some sectors reside in the write buffer in main memory and some are persisted on the device. Such an I/O must at least contain 2 lbas, therefore checking for the case where a single lba is mapped is not necessary. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: guarantee line integrity on readsJavier González4-25/+118
When a line is recycled during garbage collection, reads can still be issued to the line. If the line is freed in the middle of this process, data corruption might occur. This patch guarantees that lines are not freed in the middle of reads that target them (lines). Specifically, we use the existing line reference to decide when a line is eligible for being freed after the recycle process. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: check lba sanity on read pathJavier González2-4/+51
As part of pblk's recovery scheme, we store the lba mapped to each physical sector on the device's out-of-bound (OOB) area. On the read path, we can use this information to validate that the data being delivered to the upper layers corresponds to the lba being requested. The cost of this check is an extra copy on the DMA region on the device and an extra comparison in the host, given that (i) the OOB area is being read together with the data in the media, and (ii) the DMA region allocated for the ppa list can be reused for the metadata stored on the OOB area. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: use rqd->end_io for completionJavier González2-10/+2
For consistency with the rest of pblk, use rqd->end_io to point to the function taking care of ending the request on the completion path. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: refactor rqd alloc/freeJavier González5-23/+32
Refactor the rqd allocation and free functions so that all I/O types can use these helper functions. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>